Confusion Matrix for Leave-One-Out Cross Validation in sklearn

Posted: 2019-04-13 14:48:57

【Question】:

I know how to draw a confusion matrix when I use a train/test split in sklearn, but I don't know how to create one when using leave-one-out cross validation, as shown in this example:

# Evaluate using Leave One Out Cross Validation
import pandas
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# num_folds / num_instances from the original k-fold example are not needed for LOOCV
loocv = model_selection.LeaveOneOut()
model = LogisticRegression(solver='liblinear')  # explicit solver avoids convergence warnings
results = model_selection.cross_val_score(model, X, Y, cv=loocv)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))

How should I create the confusion matrix for LOOCV so that I can visualize the per-class accuracy?

【Comments】:

【Answer 1】:

Borrowing your method from here, you can work around the problem by creating a custom scorer that receives the metadata (y_true and y_pred) during each iteration. That metadata can be used to compute the F1 score, precision, recall, accuracy, and the confusion matrix!


We need one more trick here: GridSearchCV accepts a custom scorer, so let's use it!


Here is an example that you can adapt further to your exact requirements:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, StratifiedKFold


# Your method from the link you provided
def cm_analysis(y_true, y_pred, labels, ymap=None, figsize=(10,10)):
    if ymap is not None:
        y_pred = [ymap[yi] for yi in y_pred]
        y_true = [ymap[yi] for yi in y_true]
        labels = [ymap[yi] for yi in labels]
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    cm_sum = np.sum(cm, axis=1, keepdims=True)
    cm_perc = cm / cm_sum.astype(float) * 100
    annot = np.empty_like(cm).astype(str)
    nrows, ncols = cm.shape
    for i in range(nrows):
        for j in range(ncols):
            c = cm[i, j]
            p = cm_perc[i, j]
            if i == j:
                s = int(cm_sum[i])  # row total as a plain int for string formatting
                annot[i, j] = '%.1f%%\n%d/%d' % (p, c, s)
            elif c == 0:
                annot[i, j] = ''
            else:
                annot[i, j] = '%.1f%%\n%d' % (p, c)
    cm = pd.DataFrame(cm, index=labels, columns=labels)
    cm.index.name = 'Actual'
    cm.columns.name = 'Predicted'
    fig, ax = plt.subplots(figsize=figsize)
    sns.heatmap(cm, annot=annot, fmt='', ax=ax)
    #plt.savefig(filename)
    plt.show()


# Custom Scorer
def my_scorer(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    # you can either save  y_true, y_pred and accuracy into a file
    # for later use with the info in clf.cv_results_
    # or plot the confusion matrix right here!
    # for labels, you can create a class attribute to make it more dynamic
    # i.e. changes automatically with every new dataset!
    cm_analysis(y_true, y_pred, labels=[0,1], ymap=None, figsize=(10, 10))
    # N.B. as long as you have y_true and y_pred from every round here, you can
    # compute any metric you want from them, such as F1 score, precision, recall,
    # accuracy, and the confusion matrix!
    return acc
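As a hedged illustration of the "save y_true and y_pred for later use" comment above (the accumulator lists and the scorer name below are my own, not part of the original answer), you could collect the fold-level predictions into module-level lists and build one combined confusion matrix after the search finishes:

all_y_true, all_y_pred = [], []   # hypothetical accumulators, filled once per CV fold

def my_accumulating_scorer(y_true, y_pred):
    # keep this fold's metadata so a single combined matrix can be built later
    all_y_true.extend(y_true)
    all_y_pred.extend(y_pred)
    return accuracy_score(y_true, y_pred)

# wrap it with make_scorer exactly like my_scorer is wrapped further down, e.g.:
# custom_scorer = make_scorer(my_accumulating_scorer)
# ... and after clf.fit(X, Y):
# cm_analysis(all_y_true, all_y_pred, labels=[0, 1], ymap=None, figsize=(10, 10))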


url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
df = pd.read_csv(url, names=names)
array = df.values
X = np.array(array[:,0:8])
Y = np.array(array[:,8]).astype(int)

# I'll make it two just for submitting the result here!
num_folds = 2
skf = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=0)  # shuffle=True is required when setting random_state

# this is just a trick: the grid holds only the default
# parameter value, so no real hyper-parameter search happens
param_grid = {'C': [1.0]}
model = LogisticRegression(solver='liblinear')
# create custom scorer
custom_scorer = make_scorer(my_scorer)
# pass it to the GridSearchCV
clf = GridSearchCV(model, param_grid, scoring=custom_scorer, cv=skf, return_train_score=True)
# Fit and Go
clf.fit(X,Y)

# cv_results_ is a dict with all CV results during the iterations!
# you may want to combine its contents with the per-iteration metrics, etc.
print(clf.cv_results_)

Result:

{'mean_score_time': array([0.09023476]),
 'split0_train_score': array([0.79166667]),
 'mean_train_score': array([0.77864583]),
 'params': [{'C': 1.0}],
 'std_test_score': array([0.01953125]),
 'mean_fit_time': array([0.00235796]),
 'param_C': masked_array(data=[1.0], mask=[False], fill_value='?', dtype=object),
 'rank_test_score': array([1], dtype=int32),
 'split1_test_score': array([0.7734375]),
 'std_fit_time': array([0.00032902]),
 'mean_test_score': array([0.75390625]),
 'std_score_time': array([0.00237632]),
 'split1_train_score': array([0.765625]),
 'split0_test_score': array([0.734375]),
 'std_train_score': array([0.01302083])}

Split 0 (confusion matrix plot)

Split 1 (confusion matrix plot)


EDIT

If you strictly require LOOCV, you can apply it to the code above: simply replace StratifiedKFold with LeaveOneOut. But keep in mind that LeaveOneOut iterates once per sample (768 times for this dataset), so it is computationally very expensive. It does, however, give you the detailed metadata (y_true and y_pred, and hence confusion matrices) during the iterations, as sketched below.
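A minimal sketch of that swap, reusing the names from the code above (note the caveat in the comment):

from sklearn.model_selection import LeaveOneOut

loocv = LeaveOneOut()
# caveat: each LOOCV test fold holds exactly one sample, so a per-iteration
# confusion matrix is trivial; accumulating y_true/y_pred across folds (as in
# the accumulator sketch above) is usually more informative
clf = GridSearchCV(model, param_grid, scoring=custom_scorer, cv=loocv, return_train_score=True)
clf.fit(X, Y)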

Nevertheless, if you are looking for the confusion matrix of the whole (i.e. final) process, then you would still use GridSearchCV, but as follows:

......
loocv = LeaveOneOut()
clf = GridSearchCV(model, param_grid, scoring='accuracy', cv=loocv)
clf.fit(X,Y)

y_pred = clf.best_estimator_.predict(X)
cm_analysis(Y, y_pred, labels=[0, 1], ymap=None, figsize=(10,10))

Result (confusion matrix plot)
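As a further hedged alternative (not part of the original answer): clf.best_estimator_.predict(X) re-predicts on data the refit model has already seen, so if you want a confusion matrix built from out-of-fold LOOCV predictions you could use cross_val_predict instead:

from sklearn.model_selection import LeaveOneOut, cross_val_predict

# each sample is predicted by a model trained on all the other samples
y_pred_oof = cross_val_predict(model, X, Y, cv=LeaveOneOut())
cm_analysis(Y, y_pred_oof, labels=[0, 1], ymap=None, figsize=(10, 10))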

【Discussion】:

Well, I already mentioned that I am looking for the confusion matrix for LOOCV. Could you modify your answer so that it answers what was asked?

I don't see the connection. Why don't you actually create the confusion matrix for what was asked? I'm not familiar with what you are referring to, and there isn't enough explanation in the answer to show the connection.
