How to change threshold for precision and recall in python scikit-learn?
Posted: 2016-06-12 01:11:49
Question: I have heard that you can adjust the threshold to tune the trade-off between precision and recall, but I cannot find a practical example of how to do it.
My code:
for i in mass[k]:
    df = df_temp # reset df before each loop
    #$$
    #$$
    if 1==1:
    ###if i == singleEthnic:
        count+=1
        ethnicity_tar = str(i) # fr, en, ir, sc, others, ab, rus, ch, it, jp
        # fn, metis, inuit; algonquian, iroquoian, athapaskan, wakashan, siouan, salish, tsimshian, kootenay
        ############################################
        ############################################

        def ethnicity_target(row):
            try:
                if row[ethnicity_var] == ethnicity_tar:
                    return 1
                else:
                    return 0
            except: return None
        df['ethnicity_scan'] = df.apply(ethnicity_target, axis=1)
        print '1=', ethnicity_tar
        print '0=', 'non-'+ethnicity_tar

        # Random sampling a smaller dataframe for debugging
        rows = df.sample(n=subsample_size, random_state=seed) # Seed gives fixed randomness
        df = DataFrame(rows)
        print 'Class count:'
        print df['ethnicity_scan'].value_counts()

        # Assign X and y variables
        X = df.raw_name.values
        X2 = df.name.values
        X3 = df.gender.values
        X4 = df.location.values
        y = df.ethnicity_scan.values

        # Feature extraction functions
        def feature_full_name(nameString):
            try:
                full_name = nameString
                if len(full_name) > 1: # not accept name with only 1 character
                    return full_name
                else: return '?'
            except: return '?'

        def feature_full_last_name(nameString):
            try:
                last_name = nameString.rsplit(None, 1)[-1]
                if len(last_name) > 1: # not accept name with only 1 character
                    return last_name
                else: return '?'
            except: return '?'

        def feature_full_first_name(nameString):
            try:
                first_name = nameString.rsplit(' ', 1)[0]
                if len(first_name) > 1: # not accept name with only 1 character
                    return first_name
                else: return '?'
            except: return '?'

        # Transform format of X variables, and spit out a numpy array for all features
        # (each element is a one-key dict, as required by the dict vectorizer dv)
        my_dict = [{'last-name': feature_full_last_name(i)} for i in X]
        my_dict5 = [{'first-name': feature_full_first_name(i)} for i in X]

        all_dict = []
        for i in range(0, len(my_dict)):
            temp_dict = dict(
                my_dict[i].items() + my_dict5[i].items()
            )
            all_dict.append(temp_dict)

        newX = dv.fit_transform(all_dict)

        # Separate the training and testing data sets
        X_train, X_test, y_train, y_test = cross_validation.train_test_split(newX, y, test_size=testTrainSplit)

        # Fitting X and y into model, using training data
        classifierUsed2.fit(X_train, y_train)

        # Making predictions using trained data
        y_train_predictions = classifierUsed2.predict(X_train)
        y_test_predictions = classifierUsed2.predict(X_test)
I tried replacing the line "y_test_predictions = classifierUsed2.predict(X_test)" with "y_test_predictions = classifierUsed2.predict(X_test) > 0.8" and with "y_test_predictions = classifierUsed2.predict(X_test) > 0.01", but nothing much changed.
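A small sketch of why that replacement has no visible effect: predict already returns hard class labels, so comparing them with 0.8 or with 0.01 selects exactly the same samples. The synthetic data from make_classification and the LogisticRegression model here are assumptions standing in for the name features and classifierUsed2 from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data (an assumption; it stands in for the name features).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X, y)

# predict returns hard class labels (0 or 1 here), not scores.
labels = clf.predict(X)

# Any cutoff strictly between 0 and 1 produces the same boolean mask.
mask_high = labels > 0.8
mask_low = labels > 0.01
same_mask = np.array_equal(mask_high, mask_low)
print(same_mask)
```

Because the labels carry no information about how confident the model is, a threshold can only matter when it is applied to a continuous score such as a class probability.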
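Comments:
Thanks DoughnutZombie, can you tell me how to highlight text in gray?
To mark inline code, use backticks ` at the beginning and end. See also ***.com/editing-help, for example "Comment formatting" at the very bottom. To your question: which classifier are you using? Does the classifier have predict_proba in addition to predict? predict usually outputs only 1 and 0, whereas predict_proba outputs a float that you can threshold.
I used logistic regression and SVM.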
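Answer 1:
classifierUsed2.predict(X_test) only outputs the predicted class for each sample (most likely 0 and 1). What you want is classifierUsed2.predict_proba(X_test), which outputs a 2D array containing, for each sample, the probability of each class. To apply a threshold, you can do something like this: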
y_test_probabilities = classifierUsed2.predict_proba(X_test)
# y_test_probabilities has shape = [n_samples, n_classes]
y_test_predictions_high_precision = y_test_probabilities[:,1] > 0.8
y_test_predictions_high_recall = y_test_probabilities[:,1] > 0.1
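y_test_predictions_high_precision will contain only samples that are fairly certain to belong to class 1, while y_test_predictions_high_recall will predict class 1 more often (and achieve higher recall), but will also contain many false positives.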
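To make the trade-off concrete, the following sketch measures precision and recall at several thresholds. The synthetic dataset, the LogisticRegression model, and the extra 0.5 threshold are assumptions for illustration; in the question's setting, X_test, y_test, and classifierUsed2 would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, slightly noisy binary problem (an assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=10, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Column 1 holds the probability of clf.classes_[1], which is class 1 here.
proba_pos = clf.predict_proba(X_test)[:, 1]

results = {}
for threshold in (0.1, 0.5, 0.8):
    # Predict class 1 only when its probability exceeds the threshold.
    y_pred = (proba_pos > threshold).astype(int)
    results[threshold] = (
        precision_score(y_test, y_pred),
        recall_score(y_test, y_pred),
    )
    print(threshold, results[threshold])
```

Raising the threshold can only shrink the set of samples predicted as class 1, so recall never increases as the threshold goes up. Precision usually rises with the threshold, but that is not guaranteed on every dataset.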
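predict_proba is supported by both classifiers you are using, logistic regression and SVM (for scikit-learn's SVC, the model has to be constructed with probability=True).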
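For the SVM case, a minimal sketch under the assumption that the classifier is scikit-learn's SVC: predict_proba is only available when the model is built with probability=True, and decision_function offers an alternative score to threshold. The synthetic data and the cutoff of 1.0 on the decision score are illustrative assumptions, not values from the question.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic binary data (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# probability=True makes predict_proba available; the extra calibration
# step makes fitting slower than a plain SVC.
svm = SVC(probability=True, random_state=0).fit(X, y)

proba_pos = svm.predict_proba(X)[:, 1]
high_precision = proba_pos > 0.8   # fewer, more confident positives
high_recall = proba_pos > 0.1      # more positives, more false alarms

# Alternative: threshold the signed distance to the separating boundary.
scores = svm.decision_function(X)
default_like = scores > 0          # the usual decision boundary
stricter = scores > 1.0            # hypothetical, more conservative cutoff

print(high_precision.sum(), high_recall.sum(), default_like.sum(), stricter.sum())
```

Thresholding decision_function avoids the calibration step entirely, at the cost of working with scores that are not probabilities, so a suitable cutoff has to be chosen empirically.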