gridsearch.predict_proba results in list rather than array

Posted 2023-03-12


【Title】gridsearch.predict_proba results in list rather than array 【Posted】2021-10-09 01:08:24 【Question】:

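I am following an example and trying to use grid search with a random forest classifier to produce a roc_auc_score. However, y_prob = model.predict_proba(X_test) gives me a list (of two arrays) rather than a single array, so I am wondering what went wrong here.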

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import label_binarize
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.metrics import roc_auc_score

X = np.random.rand(50,10)
y = np.random.permutation([1] * 25 + [0] * 25)

y= label_binarize(y, classes=[0, 1])
y= np.hstack((1-y, y))

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=7)  
index_split = sss.split(X, y)
train_index = []
test_index = []
for train_ind, test_ind in index_split:
    train_index.extend(train_ind)
    test_index.extend(test_ind)

data_train = X[train_index]
out_train = y[train_index]
data_test = X[test_index]
out_test = y[test_index]

rf = RandomForestClassifier()
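grids = {
    'n_estimators': [10, 50, 100, 200],
    # 'auto' (in the original post) is dropped: it was equivalent to 'sqrt'
    # for classifiers and has since been removed from scikit-learn
    'max_features': ['sqrt', 'log2'],
    'criterion': ['gini', 'entropy'],
}
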
rf_grids_searched = GridSearchCV(rf, 
                                grids, 
                                scoring = "roc_auc",
                                n_jobs = -1,
                                refit=True,
                                cv = 5,
                                verbose=10)

rf_grids_searched.fit(data_train, out_train)
rf_best = rf_grids_searched.best_estimator_

y_prob=rf_best.predict_proba(data_test)
print(roc_auc_score(out_test, y_prob))

My result:

[array([[0.5, 0.5],
    [0.5, 0.5],
    [0.7, 0.3],
    [0.3, 0.7],
    [0.7, 0.3],
    [0.5, 0.5],
    [0.1, 0.9],
    [0.6, 0.4],
    [0.6, 0.4],
    [0.4, 0.6]]), array([[0.5, 0.5],
    [0.5, 0.5],
    [0.3, 0.7],
    [0.7, 0.3],
    [0.3, 0.7],
    [0.5, 0.5],
    [0.9, 0.1],
    [0.4, 0.6],
    [0.4, 0.6],
    [0.6, 0.4]])]

Expected result, with probabilities for classes [0, 1]:

    array([[0.5, 0.5],
    [0.5, 0.5],
    [0.7, 0.3],
    [0.3, 0.7],
    [0.7, 0.3],
    [0.5, 0.5],
    [0.1, 0.9],
    [0.6, 0.4],
    [0.6, 0.4],
    [0.4, 0.6]])

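I also tried not binarizing y first and then training the grid search, which gives the y_prob array below. Afterwards I binarized y_test to match the dimensions of y_prob and computed the score. Is this order correct? Code: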

  out_test1= label_binarize(out_test, classes=[0, 1])
  out_test1= np.hstack((1-out_test1, out_test1))
  print(roc_auc_score(out_test1, y_prob))   

   array([[0.6, 0.4],
   [0.5, 0.5],
   [0.6, 0.4],
   [0.5, 0.5],
   [0.7, 0.3],
   [0.3, 0.7],
   [0.8, 0.2],
   [0.4, 0.6],
   [0.8, 0.2],
   [0.4, 0.6]])


【Answer 1】:

The grid search's predict_proba method simply dispatches to the best estimator's predict_proba. From the docstring of RandomForestClassifier.predict_proba (emphasis added):

Returns

p : ndarray of shape (n_samples, n_classes), or a list of n_outputs such arrays if n_outputs > 1. ...

Because you specified two outputs (the two columns in y), you get the predicted probabilities of both classes for each of the two targets.
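The behaviour described above can be reproduced with a minimal sketch. The data, the seed, n_estimators=10 and the tiny parameter grid are assumptions chosen only to keep the example fast; they are not taken from the question.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)             # assumed seed, for repeatability
X = rng.rand(50, 10)
y = rng.permutation([1] * 25 + [0] * 25)   # 1-D labels: a single output
y_two_cols = np.column_stack((1 - y, y))   # two columns: treated as two outputs

# Single output: predict_proba returns one (n_samples, n_classes) array.
rf_single = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
proba_single = rf_single.predict_proba(X)

# Two outputs: predict_proba returns a list with one array per column of y.
rf_multi = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y_two_cols)
proba_multi = rf_multi.predict_proba(X)

print(type(proba_single), proba_single.shape)
print(type(proba_multi), len(proba_multi), proba_multi[0].shape)

# GridSearchCV.predict_proba forwards to the refitted best estimator.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [5, 10]},             # illustrative grid only
    cv=3,
).fit(X, y)
same = np.array_equal(
    search.predict_proba(X), search.best_estimator_.predict_proba(X)
)
print(same)
```

With the two-column y, the class-1 probabilities of the original target sit in the second array of the list, i.e. proba_multi[1][:, 1].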

【Comments】:

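Thanks. I also tried not binarizing y first and then training the grid search to get the y_prob array below. Afterwards I binarized y_test to match the dimensions of y_prob and computed the score. But is roc_auc_score correct when computed this way? The result is attached to the post.

In general, don't binarize, and pull the second column out of predict_proba for the AUC score.

print(roc_auc_score(out_test, y_prob[:, 1])) = 0.48? Does that mean the predicted probability for class 1 is 0.48? The data is hypothetical, but does the predicted probability used for the ROC look right?

It looks fine, yes. A random classifier has an AUC of 0.5, but that is asymptotic; for your small example, ending up slightly above or below is not unreasonable.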
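The suggestion in the comments can be sketched as follows. This is an illustration under assumptions, not the asker's exact setup: train_test_split with stratify stands in for the StratifiedShuffleSplit loop, no grid search is run, and the seed and n_estimators=10 are arbitrary. It also computes the asker's two-column variant for comparison.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(7)             # assumed seed
X = rng.rand(50, 10)
y = rng.permutation([1] * 25 + [0] * 25)   # keep y 1-D, no binarizing

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=7
)

rf = RandomForestClassifier(n_estimators=10, random_state=7).fit(X_train, y_train)
y_prob = rf.predict_proba(X_test)          # shape (10, 2); columns follow rf.classes_

# Recommended: score the positive class using the second column.
auc_column = roc_auc_score(y_test, y_prob[:, 1])

# Asker's variant: binarize y_test into two columns and pass the full y_prob.
y_test_2col = np.column_stack((1 - y_test, y_test))
auc_2col = roc_auc_score(y_test_2col, y_prob)

print(auc_column, auc_2col)
```

For a binary problem the two numbers should agree: the two-column call macro-averages one AUC per column, and the first column is the second with labels and scores both flipped, which leaves the AUC unchanged. Either way the value is an area under the ROC curve, not a predicted probability for class 1.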
