ROC curve and cut-off point. Python

Posted 2023-02-16

Title: Roc curve and cut off point. Python

Posted: 2015-04-27 10:47:30

Question:

I ran a logistic regression model and made predictions of the logit values. I used this to get the points on the ROC curve:

 from sklearn import metrics
 fpr, tpr, thresholds = metrics.roc_curve(Y_test,p)

I know that metrics.roc_auc_score gives the area under the ROC curve. Can anyone tell me which command will find the optimal cut-off point (threshold value)?

Comments:

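The answer to your question is simply np.argmax(tpr - fpr). If you want the threshold, it is just thresholds[np.argmax(tpr - fpr)]. Everything else is verbose.

Can anyone explain the difference between thresholds[np.argmax(tpr - fpr)] and the top-voted answer's function, threshold = Find_Optimal_Cutoff(data['true'], data['pred'])? The thresholds are close, but they differ when I do the actual calculation.

I think that to find the optimal point you are looking for the point where sensitivity and specificity balance, i.e. tpr and 1-fpr. If you have a specific reason not to want the smallest difference between sensitivity and specificity, I can understand that. For me, though, the optimal threshold is thresholds[np.argmin(abs(tpr-(1-fpr)))]

If you consider the optimal threshold to be the point on the curve closest to the top-left corner of the ROC plot, you can use thresholds[np.argmin((1 - tpr) ** 2 + fpr ** 2)]. However, @cgnorthcutt's solution maximizes Youden's J statistic, which seems to be the more widely accepted approach. What is truly "optimal" in your case depends on the relative cost of false positives and false negatives.

Answer 1: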

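You can do this using the epi package in R, but I could not find a similar package or example in Python.

The optimal cut-off point is where the true positive rate is high and the false positive rate is low. Based on this logic, I have pulled together an example below to find the optimal threshold.

Python code: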

import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np
from sklearn.metrics import roc_curve, auc

# read the data in
df = pd.read_csv("http://www.ats.ucla.edu/stat/data/binary.csv")

# rename the 'rank' column because there is also a DataFrame method called 'rank'
df.columns = ["admit", "gre", "gpa", "prestige"]
# dummify rank
# dtype=float keeps the dummies numeric (recent pandas versions default to bool)
dummy_ranks = pd.get_dummies(df['prestige'], prefix='prestige', dtype=float)
# create a clean data frame for the regression
cols_to_keep = ['admit', 'gre', 'gpa']
# label-based column slice, so use .loc (.iloc only accepts integer positions)
data = df[cols_to_keep].join(dummy_ranks.loc[:, 'prestige_2':])

# manually add the intercept
data['intercept'] = 1.0

train_cols = data.columns[1:]
# fit the model
result = sm.Logit(data['admit'], data[train_cols]).fit()
print(result.summary())

# Add prediction to dataframe
data['pred'] = result.predict(data[train_cols])

fpr, tpr, thresholds = roc_curve(data['admit'], data['pred'])
roc_auc = auc(fpr, tpr)
print("Area under the ROC curve : %f" % roc_auc)

####################################
# The optimal cut off would be where tpr is high and fpr is low
# tpr - (1-fpr) is zero or near to zero is the optimal cut off point
####################################
i = np.arange(len(tpr)) # index for df
roc = pd.DataFrame({
    'fpr': pd.Series(fpr, index=i),
    'tpr': pd.Series(tpr, index=i),
    '1-fpr': pd.Series(1 - fpr, index=i),
    'tf': pd.Series(tpr - (1 - fpr), index=i),
    'thresholds': pd.Series(thresholds, index=i),
})
# show the row where |tpr - (1-fpr)| is smallest
print(roc.iloc[(roc.tf - 0).abs().argsort()[:1]])

# Plot tpr vs 1-fpr
fig, ax = pl.subplots()
pl.plot(roc['tpr'])
pl.plot(roc['1-fpr'], color = 'red')
pl.xlabel('1-False Positive Rate')
pl.ylabel('True Positive Rate')
pl.title('Receiver operating characteristic')
ax.set_xticklabels([])

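The optimal cut-off point is 0.317628, so anything above this value can be labeled 1, and anything else 0. You can see from the output/chart that where TPR crosses 1-FPR, the TPR is 63% and the FPR is 36%, and that TPR-(1-FPR) is closest to zero at that point in the current example.

Output: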

        1-fpr       fpr        tf     thresholds       tpr
  171  0.637363  0.362637  0.000433    0.317628     0.637795
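The crossing point of TPR and 1-FPR used above is only one notion of "optimal"; the comments on the question also mention Youden's J and the distance to the top-left corner. As a rough sketch, the three criteria can be compared side by side. The labels and scores below are made up for illustration (they are not the admissions data), and the variable names `t_balance`, `t_youden` and `t_corner` are assumptions of this sketch.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Made-up labels and scores, for illustration only
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.35, 0.6, 0.7, 0.8, 0.9])

# Keep every candidate threshold so that no point of the curve is dropped
fpr, tpr, thresholds = roc_curve(y_true, y_score, drop_intermediate=False)

# 1) Balance point: sensitivity (tpr) closest to specificity (1 - fpr)
t_balance = thresholds[np.argmin(np.abs(tpr - (1 - fpr)))]

# 2) Youden's J statistic: maximize tpr - fpr
t_youden = thresholds[np.argmax(tpr - fpr)]

# 3) Point closest to the top-left corner (fpr=0, tpr=1) of the ROC plot
t_corner = thresholds[np.argmin((1 - tpr) ** 2 + fpr ** 2)]

print("balance:", t_balance, "youden:", t_youden, "corner:", t_corner)
```

On this toy data the balance criterion picks 0.5 while the other two pick 0.6, which is consistent with the observation that the criteria give close but not identical thresholds.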

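Hope this helps.

Edit

To simplify things and make the code reusable, I have written a function that finds the optimal probability cut-off point.

Python code: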

def Find_Optimal_Cutoff(target, predicted):
    """ Find the optimal probability cutoff point for a classification model related to event rate
    Parameters
    ----------
    target : Matrix with dependent or target data, where rows are observations

    predicted : Matrix with predicted data, where rows are observations

    Returns
    -------     
    list type, with optimal cutoff value
        
    """
    fpr, tpr, threshold = roc_curve(target, predicted)
    i = np.arange(len(tpr)) 
    roc = pd.DataFrame({'tf': pd.Series(tpr - (1 - fpr), index=i), 'threshold': pd.Series(threshold, index=i)})
    roc_t = roc.iloc[(roc.tf-0).abs().argsort()[:1]]

    return list(roc_t['threshold']) 


# Add prediction probability to dataframe
data['pred_proba'] = result.predict(data[train_cols])

# Find optimal probability threshold
threshold = Find_Optimal_Cutoff(data['admit'], data['pred_proba'])
print(threshold)
# [0.31762762459360921]

# Add the predicted class to the dataframe by applying the threshold
# (Find_Optimal_Cutoff returns a one-element list, so take its first item)
data['pred'] = data['pred_proba'].map(lambda x: 1 if x > threshold[0] else 0)

# Print confusion Matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(data['admit'], data['pred'])
# array([[175,  98],
#        [ 46,  81]])
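The thresholding step above can also be written as a vectorized comparison instead of a per-row lambda. This is only a sketch: the `proba` array is made up, and `cutoff` is simply the value reported earlier in this answer.

```python
import numpy as np

# Made-up predicted probabilities, for illustration only
proba = np.array([0.10, 0.32, 0.55, 0.317628, 0.90])
cutoff = 0.317628  # the cut-off value reported in this answer

# Strictly greater than the cut-off -> class 1, otherwise class 0
labels = (proba > cutoff).astype(int)
print(labels)  # [0 1 1 0 1]
```

The same expression works on a pandas column, for example `(data['pred_proba'] > cutoff).astype(int)`, assuming `cutoff` is a scalar rather than a list.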

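Comments:

Is there a simple way to apply this threshold to the predictions? Or do you just use an apply-type function on data['preds']?

@skmathur, I have turned it into a reusable, simplified function. Hope this helps.

There is a problem with the Youden index formula in the Find_Optimal_Cutoff function. roc_curve returns fpr, which is the false positive rate (1 - specificity). You are subtracting (1-fpr). You need to change tpr-(1-fpr) to tpr-fpr.

The Epi package in R chooses the cut-off that maximizes (specificity + sensitivity). So it should be tpr + (1-fpr) rather than the tpr - (1-fpr) given in the code.

@JohnBonfardeci Is it just me? I feel that the OP's solution produces the wrong result. Shouldn't it be pd.Series(tpr-fpr, index=thresholds, name='tf').idxmax()?

Answer 2:

A vanilla Python implementation of Youden's J-score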

def cutoff_youdens_j(fpr,tpr,thresholds):
    j_scores = tpr-fpr
    j_ordered = sorted(zip(j_scores,thresholds))
    return j_ordered[-1][1]
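A minimal usage sketch for the function above, repeated here so the snippet runs on its own. The labels and scores are made up for illustration, and `best` and `best_argmax` are names assumed for this sketch.

```python
import numpy as np
from sklearn.metrics import roc_curve

def cutoff_youdens_j(fpr, tpr, thresholds):
    # Pair each J score with its threshold and return the threshold of the largest J
    j_scores = tpr - fpr
    j_ordered = sorted(zip(j_scores, thresholds))
    return j_ordered[-1][1]

# Made-up labels and scores, for illustration only
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.35, 0.6, 0.7, 0.8, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
best = cutoff_youdens_j(fpr, tpr, thresholds)

# The NumPy one-liner should select the same threshold here
best_argmax = thresholds[np.argmax(tpr - fpr)]
print(best, best_argmax)
```

On this toy data both approaches select the threshold 0.6, where tpr is 0.8 and fpr is 0.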

Comments:

Answer 3:

Given the tpr, fpr and thresholds from your question, the optimal threshold is:

optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]
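Note that the thresholds returned by roc_curve are in the same units as the scores passed in, so they are not necessarily probabilities. The sketch below assumes the scores are raw logits (as the question suggests); the data is made up, and `optimal_probability` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Made-up labels and raw logit scores, for illustration only
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
logits = np.array([-2.2, -1.8, -1.4, -0.6, -0.4, -0.9, -0.3, 0.8, 1.4, 2.2])

fpr, tpr, thresholds = roc_curve(y_true, logits)
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]  # expressed in logit units, may be negative

# Map the logit threshold to a probability with the logistic (sigmoid) function
optimal_probability = 1.0 / (1.0 + np.exp(-optimal_threshold))
print(optimal_threshold, optimal_probability)
```

Here the selected threshold is -0.3, which is a perfectly valid cut-off on the logit scale; it corresponds to a probability of roughly 0.43.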

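Comments:

What if I get a negative optimal_threshold? My output predictions are in the range [0,1]...

Using optimal_idx = np.argmax(tpr - fpr) and optimal_threshold = thresholds[optimal_idx] as suggested does not work for me. The thresholds array contains negative values, but I expect values between 0 and 1.

@rafaelcaballero "Does not work"? Everything you describe sounds like it is working correctly. It does not have to be between 0 and 1. It is just a score.

Then I misunderstood the question. I thought the threshold moves between 0 and 1, and the goal is to find the value in this range that maximizes tpr-fpr.

@cgnorthcutt Your code is correct. But TPR = TP/(actual positives) and FPR = FP/(actual negatives). TPR + FPR != 1.

Answer 4:

Although I am late to the party, you can also use the geometric mean to determine the optimal threshold, as described here: threshold tuning for imbalance classification

It can be computed like this: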

from numpy import sqrt, argmax

# calculate the g-mean for each threshold
gmeans = sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))
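A self-contained sketch of the G-mean selection above. The `fpr`, `tpr` and `thresholds` arrays are hand-written toy values standing in for the output of roc_curve, and `ix_no_sqrt` is a name assumed for this sketch. Because the square root is monotonic, the selected index is the same with or without it; the square root only matters if you want to report the G-mean value itself.

```python
import numpy as np

# Hand-written toy ROC points, standing in for the output of roc_curve
fpr = np.array([0.0, 0.0, 0.4, 0.4, 1.0])
tpr = np.array([0.0, 0.8, 0.8, 1.0, 1.0])
thresholds = np.array([np.inf, 0.6, 0.4, 0.35, 0.1])

# G-mean: geometric mean of sensitivity (tpr) and specificity (1 - fpr)
gmeans = np.sqrt(tpr * (1 - fpr))
ix = np.argmax(gmeans)

# Same selection without the square root, since sqrt preserves ordering
ix_no_sqrt = np.argmax(tpr * (1 - fpr))

print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))
```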

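Comments:

The sqrt seems unnecessary here. The argmax is the same without it.

What does it optimize for us? Wouldn't it contradict Youden's index for this example: tpr=0.5 & fpr=0.5 and tpr=0.55 & fpr=0.45?

Answer 5:

Another possible solution.

I will create some random data.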

import numpy as np
import pandas as pd
import scipy.stats as sps
from sklearn import linear_model
from sklearn.metrics import roc_curve, RocCurveDisplay, auc
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns

# define data distributions
N0 = 300
N1 = 250

dist0 = sps.gamma(a=8, scale=1/10)
x0 = np.linspace(dist0.ppf(0), dist0.ppf(1-1e-5), 100)
y0 = dist0.pdf(x0)

dist1 = sps.gamma(a=15, scale=1/10)
x1 = np.linspace(dist1.ppf(0), dist1.ppf(1-1e-5), 100)
y1 = dist1.pdf(x1)

with plt.style.context("bmh"):
    plt.plot(x0, y0, label="NEG")
    plt.plot(x1, y1, label="POS")
    plt.legend()
    plt.title("Gamma distributions")

# create a random dataset
rvs0 = dist0.rvs(N0, random_state=0)
rvs1 = dist1.rvs(N1, random_state=1)

with plt.style.context("bmh"):
    plt.hist(rvs0, alpha=.5, label="NEG")
    plt.hist(rvs1, alpha=.5, label="POS")
    plt.legend()
    plt.title("Random dataset")

Initialize a dataframe with the observations (the x feature and the y target)

df = pd.DataFrame({
    "y": np.concatenate(( np.repeat(0, N0) , np.repeat(1, N1) )),
    "x": np.concatenate(( rvs0             , rvs1             )),
})

and display it with a box plot

# plot the data
with plt.style.context("bmh"):
    g = sns.catplot(
        kind="box",
        data=df,
        x="y", y="x"
    )
    ax = g.axes.flat[0]
    sns.stripplot(
        data=df,
        x="y", y="x",
        ax=ax, color='k',
        alpha=.25
    )
    plt.show()

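Now we can split the dataframe into train and test sets, fit a logistic regression, compute the ROC curve, the AUC and Youden's index, find the cut-off and plot everything. All using pandas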

# split dataset into train-test
X_train, X_test, y_train, y_test = train_test_split(
    df[["x"]], df.y.values, test_size=0.5, random_state=1)
# init and fit Logistic Regression on train set
clf = linear_model.LogisticRegression()
clf.fit(X_train, y_train)
# predict probabilities on x test set
y_proba = clf.predict_proba(X_test)
# compute FPR and TPR from y test set and predicted probabilities
fpr, tpr, thresholds = roc_curve(
    y_test, y_proba[:,1], drop_intermediate=False)
# compute ROC AUC
roc_auc = auc(fpr, tpr)
# init a dataframe for results
df_test = pd.DataFrame({
    "x": X_test.x.values.flatten(),
    "y": y_test,
    "proba": y_proba[:,1]
})
# sort it by predicted probabilities
# because thresholds[1:] = y_proba[::-1]
df_test.sort_values(by="proba", inplace=True)
# add reversed TPR and FPR
df_test["tpr"] = tpr[1:][::-1]
df_test["fpr"] = fpr[1:][::-1]
# optional: add thresholds to check
#df_test["thresholds"] = thresholds[1:][::-1]
# add Youden's j index
df_test["youden_j"] = df_test.tpr - df_test.fpr
# define the cut_off and display it
cut_off = df_test.sort_values(
    by="youden_j", ascending=False, ignore_index=True).iloc[0]
print("CUT-OFF:")
print(cut_off)

# plot everything
with plt.style.context("bmh"):
    fig, ax = plt.subplots(1, 3, figsize=(15, 5))
    
    RocCurveDisplay(
        fpr=df_test.fpr, tpr=df_test.tpr,
        roc_auc=roc_auc).plot(ax=ax[0])
    ax[0].set_title("ROC curve")
    ax[0].axline(xy1=(0,0), slope=1, color="r", ls=":")
    ax[0].plot(cut_off.fpr, cut_off.tpr, 'ko', ms=10)
    
    df_test.plot(
        x="youden_j", y="proba", ax=ax[1], 
        ylabel="Predicted Probabilities", xlabel="Youden j",
        title="Youden's index", legend=False
    )
    ax[1].axvline(cut_off.youden_j, color="k", ls="--")
    ax[1].axhline(cut_off.proba, color="k", ls="--")
    
    df_test.plot(
        x="x", y="proba", ax=ax[2], 
        ylabel="Predicted Probabilities", xlabel="X Feature",
        title="Cut-Off", legend=False
    )
    ax[2].axvline(cut_off.x, color="k", ls="--")
    ax[2].axhline(cut_off.proba, color="k", ls="--")

    plt.show()

We get

CUT-OFF:
x           1.065712
y           1.000000
proba       0.378543
tpr         0.852713
fpr         0.143836
youden_j    0.708878

Finally, we can check the result

# check results
TP = df_test[(df_test.x>=cut_off.x)&(df_test.y==1)].index.size
FP = df_test[(df_test.x>=cut_off.x)&(df_test.y==0)].index.size
TN = df_test[(df_test.x< cut_off.x)&(df_test.y==0)].index.size
FN = df_test[(df_test.x< cut_off.x)&(df_test.y==1)].index.size

print("True Positive Rate: ", TP / (TP + FN))
print("False Positive Rate:", 1 - TN / (TN + FP))

True Positive Rate:  0.8527131782945736
False Positive Rate: 0.14383561643835618
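The same kind of check can be sketched with sklearn's confusion_matrix instead of manual filtering. The labels, probabilities and `cutoff` value below are made up for illustration and are not the random dataset generated above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels and predicted probabilities, for illustration only
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
proba = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.35, 0.6, 0.7, 0.8, 0.9])
cutoff = 0.6  # hypothetical cut-off chosen for this toy data

# roc_curve treats scores >= threshold as positive, so use >= here as well
y_pred = (proba >= cutoff).astype(int)

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("True Positive Rate: ", tp / (tp + fn))
print("False Positive Rate:", fp / (fp + tn))
```

With this toy data the rates are 0.8 and 0.0, which match the ROC point at that threshold.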

Comments:
