How to explain high AUC-ROC with mediocre precision and recall in unbalanced data?

Posted: 2016-06-11 20:45:15

Question:

I am trying to understand some machine learning results. The task is to predict/label "Irish" vs. "non-Irish". Output from Python 2.7:

1= ir
0= non-ir
Class count:
0    4090942
1     940852
Name: ethnicity_scan, dtype: int64
Accuracy: 0.874921350119
Classification report:
             precision    recall  f1-score   support

          0       0.89      0.96      0.93   2045610
          1       0.74      0.51      0.60    470287

avg / total       0.87      0.87      0.87   2515897

Confusion matrix:
[[1961422   84188]
 [ 230497  239790]]
AUC-ir= 0.901238104773

As you can see, precision and recall are mediocre, but the AUC-ROC is much higher (~0.90). I am trying to figure out why, and I suspect it is due to the class imbalance (about 1:5). Based on the confusion matrix, with Irish as the target (+), I computed TPR=0.51 and FPR=0.04. If I instead treat non-Irish as (+), then TPR=0.96 and FPR=0.49. So how can the AUC be 0.9 when the TPR is only 0.5 at FPR=0.04?
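As a sanity check, the TPR/FPR figures quoted above can be recomputed directly from the posted confusion matrix (plain Python 3; the values match the report):

```python
# Confusion matrix from the question, with Irish (1) as the positive class:
# [[TN  FP]    [[1961422   84188]
#  [FN  TP]] =  [ 230497  239790]]
tn, fp = 1961422, 84188
fn, tp = 230497, 239790

tpr = tp / (tp + fn)                   # recall for the positive class
fpr = fp / (fp + tn)                   # negatives wrongly labeled positive
accuracy = (tp + tn) / (tp + tn + fp + fn)

print('TPR = %.3f' % tpr)              # ~0.510
print('FPR = %.3f' % fpr)              # ~0.041
print('accuracy = %.3f' % accuracy)    # ~0.875
```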

Code:

try:
    for i in mass[k]:
        df = df_temp # reset df before each loop
        #$$
        #$$ 
        if 1==1:
        ###if i == singleEthnic:
            count+=1
            ethnicity_tar = str(i) # fr, en, ir, sc, others, ab, rus, ch, it, jp
            # fn, metis, inuit; algonquian, iroquoian, athapaskan, wakashan, siouan, salish, tsimshian, kootenay
            ############################################
            ############################################

            def ethnicity_target(row):
                try:
                    if row[ethnicity_var] == ethnicity_tar:
                        return 1
                    else:
                        return 0
                except: return None
            df['ethnicity_scan'] = df.apply(ethnicity_target, axis=1)
            print '1=', ethnicity_tar
            print '0=', 'non-'+ethnicity_tar

            # Random sampling a smaller dataframe for debugging
            rows = df.sample(n=subsample_size, random_state=seed) # Seed gives fixed randomness
            df = DataFrame(rows)
            print 'Class count:'
            print df['ethnicity_scan'].value_counts()

            # Assign X and y variables
            X = df.raw_name.values
            X2 = df.name.values
            X3 = df.gender.values
            X4 = df.location.values
            y = df.ethnicity_scan.values

            # Feature extraction functions
            def feature_full_name(nameString):
                try:
                    full_name = nameString
                    if len(full_name) > 1: # not accept name with only 1 character
                        return full_name
                    else: return '?'
                except: return '?'

            def feature_full_last_name(nameString):
                try:
                    last_name = nameString.rsplit(None, 1)[-1]
                    if len(last_name) > 1: # not accept name with only 1 character
                        return last_name
                    else: return '?'
                except: return '?'

            def feature_full_first_name(nameString):
                try:
                    first_name = nameString.rsplit(' ', 1)[0]
                    if len(first_name) > 1: # not accept name with only 1 character
                        return first_name
                    else: return '?'
                except: return '?'

            # Transform X into a list of one-entry feature dicts per name
            my_dict = [{'last-name': feature_full_last_name(i)} for i in X]
            my_dict5 = [{'first-name': feature_full_first_name(i)} for i in X]

            all_dict = []
            for i in range(0, len(my_dict)):
                temp_dict = dict(
                    my_dict[i].items() + my_dict5[i].items()
                    )
                all_dict.append(temp_dict)

            newX = dv.fit_transform(all_dict)

            # Separate the training and testing data sets
            X_train, X_test, y_train, y_test = cross_validation.train_test_split(newX, y, test_size=testTrainSplit)

            # Fitting X and y into model, using training data
            classifierUsed2.fit(X_train, y_train)

            # Making predictions using trained data
            y_train_predictions = classifierUsed2.predict(X_train)
            y_test_predictions = classifierUsed2.predict(X_test)

With the resampling code inserted:

try:
    for i in mass[k]:
        df = df_temp # reset df before each loop
        #$$
        #$$ 
        if 1==1:
        ###if i == singleEthnic:
            count+=1
            ethnicity_tar = str(i) # fr, en, ir, sc, others, ab, rus, ch, it, jp
            # fn, metis, inuit; algonquian, iroquoian, athapaskan, wakashan, siouan, salish, tsimshian, kootenay
            ############################################
            ############################################

            def ethnicity_target(row):
                try:
                    if row[ethnicity_var] == ethnicity_tar:
                        return 1
                    else:
                        return 0
                except: return None
            df['ethnicity_scan'] = df.apply(ethnicity_target, axis=1)
            print '1=', ethnicity_tar
            print '0=', 'non-'+ethnicity_tar

            # Resampled
            df_resampled = df.append(df[df.ethnicity_scan==0].sample(len(df)*5, replace=True))

            # Random sampling a smaller dataframe for debugging
            rows = df_resampled.sample(n=subsample_size, random_state=seed) # Seed gives fixed randomness
            df = DataFrame(rows)
            print 'Class count:'
            print df['ethnicity_scan'].value_counts()

            # Assign X and y variables
            X = df.raw_name.values
            X2 = df.name.values
            X3 = df.gender.values
            X4 = df.location.values
            y = df.ethnicity_scan.values

            # Feature extraction functions
            def feature_full_name(nameString):
                try:
                    full_name = nameString
                    if len(full_name) > 1: # not accept name with only 1 character
                        return full_name
                    else: return '?'
                except: return '?'

            def feature_full_last_name(nameString):
                try:
                    last_name = nameString.rsplit(None, 1)[-1]
                    if len(last_name) > 1: # not accept name with only 1 character
                        return last_name
                    else: return '?'
                except: return '?'

            def feature_full_first_name(nameString):
                try:
                    first_name = nameString.rsplit(' ', 1)[0]
                    if len(first_name) > 1: # not accept name with only 1 character
                        return first_name
                    else: return '?'
                except: return '?'

            # Transform X into a list of one-entry feature dicts per name
            my_dict = [{'last-name': feature_full_last_name(i)} for i in X]
            my_dict5 = [{'first-name': feature_full_first_name(i)} for i in X]

            all_dict = []
            for i in range(0, len(my_dict)):
                temp_dict = dict(
                    my_dict[i].items() + my_dict5[i].items()
                    )
                all_dict.append(temp_dict)

            newX = dv.fit_transform(all_dict)

            # Separate the training and testing data sets
            X_train, X_test, y_train, y_test = cross_validation.train_test_split(newX, y, test_size=testTrainSplit)

            # Fitting X and y into model, using training data
            classifierUsed2.fit(X_train, y_train)

            # Making predictions using trained data
            y_train_predictions = classifierUsed2.predict(X_train)
            y_test_predictions = classifierUsed2.predict(X_test)

Comments:

Possible duplicate of Good ROC curve but poor precision-recall curve

Answer 1:

Your model outputs a probability P (between 0 and 1) for every row in the test set it scores. The summary statistics (precision, recall, etc.) are computed for a single value of P used as the prediction threshold, probably P=0.5, unless you changed this in your code. The ROC, however, contains more information: the idea is that you may not want to use this default as your prediction threshold, so the ROC curve is built by computing the ratio of true positives to false positives at every prediction threshold between 0 and 1.
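The threshold-sweep idea can be made concrete with a minimal pure-Python sketch (toy scores, not the asker's data): collect (FPR, TPR) points at every distinct score threshold, then take the trapezoidal area under them, and compare that with precision/recall at the single default threshold of 0.5.

```python
# Toy predicted probabilities: positives tend to score high, negatives low,
# but the classes overlap, as with any realistic classifier.
pos_scores = [0.9, 0.8, 0.7, 0.55, 0.4]   # true label 1
neg_scores = [0.6, 0.45, 0.3, 0.2, 0.1]   # true label 0

def roc_points(pos, neg):
    """(FPR, TPR) at every threshold: each distinct score, descending."""
    thresholds = sorted(set(pos + neg), reverse=True)
    pts = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in pos) / len(pos)
        fpr = sum(s >= t for s in neg) / len(neg)
        pts.append((fpr, tpr))
    return pts

def auc(points):
    """Trapezoidal area under the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

pts = roc_points(pos_scores, neg_scores)
print('AUC =', auc(pts))                  # 0.88: summarizes ALL thresholds

# By contrast, precision/recall describe only the single threshold 0.5:
tp = sum(s >= 0.5 for s in pos_scores)
fp = sum(s >= 0.5 for s in neg_scores)
fn = len(pos_scores) - tp
print('precision@0.5 =', tp / (tp + fp), ' recall@0.5 =', tp / (tp + fn))
```

This is also why a high AUC can coexist with mediocre precision/recall: the AUC aggregates ranking quality over every threshold, while the classification report freezes one operating point.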

You are also correct that if you undersampled the non-Irish in your data, then the AUC and precision will be overstated. If your dataset is only 5000 rows, running the model on a larger training set will not be a problem; just rebalance your dataset (by bootstrap sampling to increase your non-Irish) until it accurately reflects your sample population.
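The suggested rebalancing could be sketched as follows. This is a hypothetical illustration using the stdlib's `random.choices` on made-up row IDs; with pandas one would instead use `df.sample(n, replace=True)`, as the question's own resampling snippet does. The population ratio is taken from the class counts in the question.

```python
import random

random.seed(0)

# Hypothetical undersampled dataset: 1:1 in the sample, but the real
# population (from the question's counts) is ~1 Irish : 4.35 non-Irish.
irish = ['ir%d' % i for i in range(1000)]
non_irish = ['non%d' % i for i in range(1000)]

target_ratio = 4090942 / 940852            # population non-Irish : Irish
n_needed = int(len(irish) * target_ratio)  # non-Irish rows we should have

# Bootstrap: resample the non-Irish rows with replacement until the
# class ratio in the working set matches the population.
non_irish_boot = random.choices(non_irish, k=n_needed)

print(len(non_irish_boot) / len(irish))    # ~4.35
```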

Discussion:

I didn't change P, so it should be 0.5. For reporting purposes, can I report the existing precision, recall and ROC as they are (with the default P=0.5)?

No, that is definitely not OK: you would be overstating your model's effectiveness. Don't do that!

Please help me understand where you are coming from. You seem to be implying that the possible "overstating" of effectiveness is due to the imbalanced data. But I am using performance metrics that are supposed to be sensitive to it (i.e. F1 score, precision and recall). So why would even F1 score, precision and recall overstate performance? (Note: I have heard of oversampling/undersampling techniques for imbalanced data, but they have their own flaws, such as losing information or fitting duplicated noise too closely.)

Or are you saying it only inflates the ROC metric? If so, would producing an AUC-ROC plot help?

If you report that your accuracy is 89%, you are saying "my model predicts correctly 89% of the time". But because you undersampled the non-Irish, you have overstated the model's performance; if you re-ran the model on a new test set that was not undersampled, the accuracy could be much worse, maybe only 30%. Think about it: if I walk into a room and kick out a large crowd of non-Irish people, I am suddenly much better at telling whether people are Irish, even if I am just guessing at random!
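The point in that last comment can be checked with simple arithmetic: hold the model's per-class rates fixed (TPR≈0.51, TNR≈0.96, from the question's confusion matrix) and vary only the class mix of the test set. Accuracy is a prior-weighted average of the two rates, so it moves with the share of the easy majority class even though the model itself has not changed:

```python
def accuracy(tpr, tnr, pos_prior):
    """Accuracy as a class-prior-weighted average of per-class rates."""
    return pos_prior * tpr + (1 - pos_prior) * tnr

tpr, tnr = 0.51, 0.96   # rates implied by the question's confusion matrix

# Class mix actually used in the question: ~19% Irish
print(accuracy(tpr, tnr, 940852 / 5031794))   # ~0.876, matches the report
# Balanced 1:1 test set: same model, noticeably lower accuracy
print(accuracy(tpr, tnr, 0.5))                # 0.735
# Mostly-Irish room ('kick out the non-Irish'): lower still for this model
print(accuracy(tpr, tnr, 0.95))               # ~0.53
```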
