How to change threshold for precision and recall in python scikit-learn?

Posted: 2016-06-12 01:11:49

Question:

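I have heard people say that you can adjust the threshold to tune the trade-off between precision and recall, but I can't find an actual example of how to do that.

My code: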

for i in mass[k]:
    df = df_temp # reset df before each loop
    #$$
    #$$ 
    if 1==1:
    ###if i == singleEthnic:
        count+=1
        ethnicity_tar = str(i) # fr, en, ir, sc, others, ab, rus, ch, it, jp
        # fn, metis, inuit; algonquian, iroquoian, athapaskan, wakashan, siouan, salish, tsimshian, kootenay
        ############################################
        ############################################

        def ethnicity_target(row):
            try:
                if row[ethnicity_var] == ethnicity_tar:
                    return 1
                else:
                    return 0
            except: return None
        df['ethnicity_scan'] = df.apply(ethnicity_target, axis=1)
        print '1=', ethnicity_tar
        print '0=', 'non-'+ethnicity_tar

        # Random sampling a smaller dataframe for debugging
        rows = df.sample(n=subsample_size, random_state=seed) # Seed gives fixed randomness
        df = DataFrame(rows)
        print 'Class count:'
        print df['ethnicity_scan'].value_counts()

        # Assign X and y variables
        X = df.raw_name.values
        X2 = df.name.values
        X3 = df.gender.values
        X4 = df.location.values
        y = df.ethnicity_scan.values

        # Feature extraction functions
        def feature_full_name(nameString):
            try:
                full_name = nameString
                if len(full_name) > 1: # not accept name with only 1 character
                    return full_name
                else: return '?'
            except: return '?'

        def feature_full_last_name(nameString):
            try:
                last_name = nameString.rsplit(None, 1)[-1]
                if len(last_name) > 1: # not accept name with only 1 character
                    return last_name
                else: return '?'
            except: return '?'

        def feature_full_first_name(nameString):
            try:
                first_name = nameString.rsplit(' ', 1)[0]
                if len(first_name) > 1: # not accept name with only 1 character
                    return first_name
                else: return '?'
            except: return '?'

        # Transform format of X variables, and spit out a numpy array for all features
        # One feature dict per sample (the braces are required for a list of dicts)
        my_dict = [{'last-name': feature_full_last_name(i)} for i in X]
        my_dict5 = [{'first-name': feature_full_first_name(i)} for i in X]

        all_dict = []
        for i in range(0, len(my_dict)):
            temp_dict = dict(
                my_dict[i].items() + my_dict5[i].items()
                )
            all_dict.append(temp_dict)

        newX = dv.fit_transform(all_dict)

        # Separate the training and testing data sets
        X_train, X_test, y_train, y_test = cross_validation.train_test_split(newX, y, test_size=testTrainSplit)

        # Fitting X and y into model, using training data
        classifierUsed2.fit(X_train, y_train)

        # Making predictions using trained data
        y_train_predictions = classifierUsed2.predict(X_train)
        y_test_predictions = classifierUsed2.predict(X_test)

I tried replacing the line "y_test_predictions = classifierUsed2.predict(X_test)" with "y_test_predictions = classifierUsed2.predict(X_test) > 0.8" and with "y_test_predictions = classifierUsed2.predict(X_test) > 0.01", but nothing much changed.

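Comments:

Thanks DoughnutZombie, can you tell me how to highlight text in grey?

To mark inline code, put backticks (`) at the start and end. See also ***.com/editing-help, e.g. "Comment formatting" at the very bottom. On your question: which classifier are you using? Does it have predict_proba as well as predict? predict usually outputs only 1s and 0s, whereas predict_proba outputs a float that you can threshold.

I used logistic regression and SVM.

Answer 1: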

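classifierUsed2.predict(X_test) only outputs the predicted class for each sample (most likely 0s and 1s), so comparing its result against 0.8 or 0.01 gives the same answer either way. What you want is classifierUsed2.predict_proba(X_test), which outputs a 2D array holding the probability of every class for every sample. To apply a threshold, you can do something like this: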

y_test_probabilities = classifierUsed2.predict_proba(X_test)
# y_test_probabilities has shape = [n_samples, n_classes]

y_test_predictions_high_precision = y_test_probabilities[:,1] > 0.8
y_test_predictions_high_recall = y_test_probabilities[:,1] > 0.1

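y_test_predictions_high_precision will contain only the samples that are fairly certain to belong to class 1, while y_test_predictions_high_recall will predict class 1 more often (and achieve higher recall) but will also contain many false positives.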
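To make the trade-off concrete, here is a small worked sketch that needs no libraries. The labels and probabilities are made up for illustration (they stand in for y_test and y_test_probabilities[:,1]), and precision_recall is a hypothetical helper, not a scikit-learn function. It is written for Python 3, unlike the Python 2 code in the question.

```python
# Made-up ground truth and class-1 probabilities, standing in for
# y_test and predict_proba(X_test)[:, 1]. Not real model output.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
p_class1 = [0.95, 0.85, 0.60, 0.15, 0.70, 0.30, 0.20, 0.12, 0.05, 0.02]


def precision_recall(y_true, scores, threshold):
    """Hypothetical helper: precision and recall for class 1 at a threshold."""
    preds = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    # Convention assumed here: report 0.0 when a ratio is undefined.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


for threshold in (0.8, 0.5, 0.1):
    precision, recall = precision_recall(y_true, p_class1, threshold)
    print("threshold=%.1f precision=%.2f recall=%.2f" % (threshold, precision, recall))
# threshold=0.8 precision=1.00 recall=0.50
# threshold=0.5 precision=0.75 recall=0.75
# threshold=0.1 precision=0.50 recall=1.00
```

On this toy data, raising the threshold keeps only the confident predictions (precision goes up, recall goes down), and lowering it does the opposite.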

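predict_proba is available for both kinds of classifier you use, logistic regression and SVM. One caveat for SVM: in scikit-learn, SVC only provides predict_proba when it is constructed with probability=True, and LinearSVC does not provide it at all; in that case you can threshold the output of decision_function instead.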
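The following is a minimal end-to-end sketch of the same idea with both classifier types. Several things here are assumptions rather than part of the question: the data is synthetic (from make_classification), the variable names are illustrative, the code targets Python 3 with a recent scikit-learn (train_test_split imported from sklearn.model_selection rather than the older cross_validation module), and the SVM is an SVC built with probability=True.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced binary data (an assumption; the question's data is not available).
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

classifiers = [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    # probability=True is needed for SVC to expose predict_proba.
    ("SVM", SVC(probability=True, random_state=0)),
]

results = {}
for name, clf in classifiers:
    clf.fit(X_train, y_train)
    # Column 1 holds the estimated probability of class 1.
    p_class1 = clf.predict_proba(X_test)[:, 1]
    for threshold in (0.1, 0.5, 0.8):
        preds = (p_class1 > threshold).astype(int)
        precision = precision_score(y_test, preds, zero_division=0)
        recall = recall_score(y_test, preds, zero_division=0)
        results[(name, threshold)] = (precision, recall)
        print("%s threshold=%.1f precision=%.2f recall=%.2f"
              % (name, threshold, precision, recall))
```

Recall can only stay the same or rise as the threshold is lowered, because every sample predicted as class 1 at a high threshold is still predicted as class 1 at a lower one. Precision usually moves the other way, but that is not guaranteed on every dataset.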
