Getting the maximum accuracy for a binary probabilistic classifier in scikit-learn
【Posted】: 2015-10-07 22:10:45
【Question】: Is there a built-in function in scikit-learn that returns the maximum accuracy of a binary probabilistic classifier?
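For example, to get the maximum F1 score:
import sklearn.metrics

# AUCPR
precision, recall, thresholds = sklearn.metrics.precision_recall_curve(y_true, y_score)
auprc = sklearn.metrics.auc(recall, precision)
max_f1 = 0
max_f1_threshold = None
# precision and recall have one more entry than thresholds; zip stops at the shortest
for r, p, t in zip(recall, precision, thresholds):
    if p + r == 0: continue
    if (2*p*r)/(p + r) > max_f1:
        max_f1 = (2*p*r)/(p + r)
        max_f1_threshold = t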
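I can compute the maximum accuracy in a similar way:
import numpy as np
import sklearn.metrics

accuracies = []
thresholds = np.arange(0,1,0.1)
for threshold in thresholds:
    # Predict the positive class when the score is strictly above the threshold
    y_pred = np.greater(y_score, threshold).astype(int)
    accuracy = sklearn.metrics.accuracy_score(y_true, y_pred)
    accuracies.append(accuracy)
accuracies = np.array(accuracies)
max_accuracy = accuracies.max()
max_accuracy_threshold = thresholds[accuracies.argmax()]
But I would like to know whether there is a built-in function for this.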
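【Comments on the question】:
Hi Franck, did you find a built-in function for it? I am searching for one right now.
@GeorgeSolymosi I did not find a built-in function for it.
Thanks for the info. Note that the line accuracy = np.array(accuracy) should be changed to accuracy = np.array(accuracies) or similar :)
@GeorgeSolymosi Thanks, good catch!
yw, and by the way, nice, clear, transparent code, Franck!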
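【Answer 1】:
I started improving the solution by replacing thresholds = np.arange(0,1,0.1) with a smarter dichotomous (bisection) search for the maximum.
Then I realized, after 2 hours of work, that getting all the accuracies is far cheaper than finding only the maximum! (Yes, it is totally counter-intuitive.)
I wrote a lot of comments below to explain my code. Feel free to delete them all to make the code more readable.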
import numpy as np

# Definition: we predict True if y_score > threshold
def ROC_curve_data(y_true, y_score):
    y_true = np.asarray(y_true, dtype=np.bool_)
    y_score = np.asarray(y_score, dtype=np.float64)  # np.float_ was removed in NumPy 2.0
    assert y_score.size == y_true.size

    order = np.argsort(y_score)  # Sort the samples by increasing score
    y_true = y_true[order]
    # The thresholds to consider are just the score values, plus 0 (accept everything).
    # This assumes every score is strictly positive, e.g. a probability.
    thresholds = np.insert(y_score[order], 0, 0)
    TP = [sum(y_true)]   # True positives: threshold 0 accepts everything, so TP[0] = number of positives in y_true
    FP = [sum(~y_true)]  # False positives: threshold 0 accepts everything, so FP[0] = number of negatives in y_true
    TN = [0]             # True negatives: threshold 0 rejects nothing
    FN = [0]             # False negatives: threshold 0 rejects nothing

    for i in range(1, thresholds.size):
        # At this step the sample y_true[i-1] (score thresholds[i]) stops being predicted True and becomes False.
        # If y_true was True, that step was a mistake.
        TP.append(TP[-1] - int(y_true[i-1]))
        FN.append(FN[-1] + int(y_true[i-1]))
        # If y_true was False, that step was correct.
        FP.append(FP[-1] - int(~y_true[i-1]))
        TN.append(TN[-1] + int(~y_true[i-1]))

    TP = np.asarray(TP, dtype=np.int_)
    FP = np.asarray(FP, dtype=np.int_)
    TN = np.asarray(TN, dtype=np.int_)
    FN = np.asarray(FN, dtype=np.int_)

    # Accuracy, sensitivity, specificity, etc. are derived from these counts by the caller
    return thresholds, TP, FP, TN, FN
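The whole process is a single loop, and the algorithm is trivial.
In fact, this stupidly simple function is 10 times faster than the solution I proposed before (computing the accuracies for thresholds = np.arange(0,1,0.1)) and 30 times faster than my previous smart-ass dichotomous algorithm...
You can then easily compute ANY KPI you want, for example: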
def max_accuracy(thresholds, TP, FP, TN, FN):
    accuracy = (TP + TN) / (TP + FP + TN + FN)
    return max(accuracy)

def max_min_sensitivity_specificity(thresholds, TP, FP, TN, FN):
    sensitivity = TP / (TP + FN)
    specificity = TN / (FP + TN)
    return max(np.minimum(sensitivity, specificity))
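As one more KPI in the same style, here is a minimal sketch of a maximum-F1 helper built from the same counts. The name max_f1 is hypothetical (it is not part of the answer), and it relies on the identity F1 = 2*TP / (2*TP + FP + FN), which is equivalent to the precision/recall form used in the question.

```python
import numpy as np

def max_f1(thresholds, TP, FP, TN, FN):
    """Hypothetical helper: best F1 score and the threshold that reaches it."""
    TP, FP, FN = (np.asarray(a, dtype=float) for a in (TP, FP, FN))
    denom = 2 * TP + FP + FN
    # F1 is defined as 0 where there are no positives and no positive predictions
    f1 = np.divide(2 * TP, denom, out=np.zeros_like(denom), where=denom > 0)
    best = int(np.argmax(f1))
    return f1[best], np.asarray(thresholds)[best]

# Small hand-made counts, for illustration only
example_thresholds = [0.0, 0.1, 0.35, 0.4, 0.8]
best_f1, best_threshold = max_f1(
    example_thresholds,
    TP=[2, 2, 1, 1, 0], FP=[2, 1, 1, 0, 0],
    TN=[0, 1, 1, 2, 2], FN=[0, 0, 1, 1, 2],
)
```

It takes the same five arguments as the helpers above, so max_f1(*data) works on the tuple returned by ROC_curve_data.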
If you want to test it:
y_score = np.random.uniform(size=100)
y_true = [np.random.binomial(1, p) for p in y_score]
data = ROC_curve_data(y_true, y_score)

# Jupyter-only magic (I personally use Jupyter); remove this line otherwise
%matplotlib inline
import matplotlib.pyplot as plt
plt.step(data[0], data[1])  # TP
plt.step(data[0], data[2])  # FP
plt.step(data[0], data[3])  # TN
plt.step(data[0], data[4])  # FN
plt.show()
print("Max accuracy is", max_accuracy(*data))
print("Max of Min(Sensitivity, Specificity) is", max_min_sensitivity_specificity(*data))
Enjoy ;)
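The single loop in ROC_curve_data can also be written with cumulative sums. The sketch below is one possible vectorized variant, not the answer's own code: the name confusion_counts is hypothetical, it uses -inf instead of 0 as the "accept everything" threshold so that scores need not be positive, and it keeps only one entry per distinct score. That last choice matters when scores are tied, because the sample-by-sample loop then passes through intermediate counts that no rule of the form y_score > threshold can actually produce.

```python
import numpy as np

def confusion_counts(y_true, y_score):
    """Counts for the rule 'predict True if y_score > threshold' (non-empty input assumed)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    order = np.argsort(y_score, kind="stable")
    y_true, y_score = y_true[order], y_score[order]

    # Index of the last sample in each run of equal scores
    last = np.r_[np.flatnonzero(np.diff(y_score)), y_score.size - 1]

    # With threshold = y_score[k], samples 0..k are predicted False
    FN = np.r_[0, np.cumsum(y_true)[last]]
    TN = np.r_[0, np.cumsum(~y_true)[last]]
    TP = y_true.sum() - FN
    FP = (~y_true).sum() - TN
    thresholds = np.r_[-np.inf, y_score[last]]  # -inf accepts everything
    return thresholds, TP, FP, TN, FN

# Tiny illustrative data set
thresholds, TP, FP, TN, FN = confusion_counts([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
accuracy = (TP + TN) / (TP + FP + TN + FN)
```

The returned tuple has the same layout as the one from ROC_curve_data, so the KPI helpers above can be applied to it unchanged.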
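【Comments】:
The downside of this is that, especially for imbalanced datasets, most of the variation in the scores may fall within the first or the last bin. A better approach computes the threshold, tp, fp, fn, tn for each unique (tp, fp, fn, tn). This can be done efficiently in a single pass (scikit does this internally when computing AUCROC).
【Answer 2】:
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_curve

# roc_curve returns only the thresholds at which the ROC curve changes
fpr, tpr, thresholds = roc_curve(y_true, probs)
accuracy_scores = []
for thresh in thresholds:
    accuracy_scores.append(accuracy_score(y_true, [m > thresh for m in probs]))
accuracies = np.array(accuracy_scores)
max_accuracy = accuracies.max()
max_accuracy_threshold = thresholds[accuracies.argmax()]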
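The loop above calls accuracy_score once per threshold. Since TP = tpr * P and TN = (1 - fpr) * N, where P and N are the numbers of positive and negative samples, the accuracies can also be derived directly from the fpr and tpr arrays. The following is a rough sketch under that assumption; accuracy_from_roc is a hypothetical helper, not a scikit-learn function, and the ROC points are hand-made stand-ins for real roc_curve output. Note that roc_curve evaluates the rule score >= threshold, whereas the loop above uses a strict >, so the two can disagree at exactly tied values.

```python
import numpy as np

def accuracy_from_roc(fpr, tpr, n_pos, n_neg):
    """Hypothetical helper: accuracy at every ROC point.

    n_pos and n_neg are the numbers of positive and negative samples in y_true.
    """
    fpr = np.asarray(fpr, dtype=float)
    tpr = np.asarray(tpr, dtype=float)
    # TP = tpr * n_pos and TN = (1 - fpr) * n_neg
    return (tpr * n_pos + (1.0 - fpr) * n_neg) / (n_pos + n_neg)

# Hand-made ROC points standing in for roc_curve output (illustrative values only)
fpr = [0.0, 0.0, 0.5, 1.0]
tpr = [0.0, 0.5, 1.0, 1.0]
accuracies = accuracy_from_roc(fpr, tpr, n_pos=2, n_neg=2)
best = int(np.argmax(accuracies))  # position of the best entry in the thresholds array
```

With real data, n_pos could be taken as np.sum(y_true) and n_neg as len(y_true) - n_pos, and thresholds[best] would then be the threshold that maximizes accuracy.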
【Comments】: