How to explain high AUC-ROC with mediocre precision and recall in unbalanced data?
Posted: 2016-06-11 20:45:15

I am trying to make sense of some machine-learning results. The task is to predict/label "Irish" vs. "non-Irish". Output from Python 2.7:
1= ir
0= non-ir
Class count:
0 4090942
1 940852
Name: ethnicity_scan, dtype: int64
Accuracy: 0.874921350119
Classification report:
precision recall f1-score support
0 0.89 0.96 0.93 2045610
1 0.74 0.51 0.60 470287
avg / total 0.87 0.87 0.87 2515897
Confusion matrix:
[[1961422 84188]
[ 230497 239790]]
AUC-ir= 0.901238104773
As you can see, the precision and recall are mediocre, yet the AUC-ROC is much higher (~0.90). I am trying to figure out why, and I suspect it is due to the class imbalance (about 1:5). Based on the confusion matrix, with Irish as the target (+), I computed TPR = 0.51 and FPR = 0.04. If I instead treat non-Irish as (+), then TPR = 0.96 and FPR = 0.49. So how can the AUC be 0.9 when the TPR is only 0.51 at FPR = 0.04?
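To see how a threshold-free AUC near 0.9 can coexist with mediocre recall at the default 0.5 cutoff, here is a small sketch on synthetic 1:5-imbalanced data. Everything here (the dataset generator, the logistic-regression model, all parameters) is a stand-in, not the question's actual pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, precision_score, recall_score

# Synthetic data with roughly a 1:5 minority:majority ratio
X, y = make_classification(n_samples=20000, weights=[0.83, 0.17],
                           random_state=0)
clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X)[:, 1]          # continuous scores
pred = (proba >= 0.5).astype(int)           # what .predict() uses by default

auc = roc_auc_score(y, proba)               # threshold-free ranking metric
prec = precision_score(y, pred)             # at the single threshold 0.5
rec = recall_score(y, pred)                 # at the single threshold 0.5
print('AUC =', auc)
print('precision =', prec, 'recall =', rec)
```

On imbalanced data it is common for the AUC to come out well above the recall measured at the one fixed threshold, because the AUC rewards ranking positives above negatives at every threshold, not just at 0.5.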
Code:
try:
    for i in mass[k]:
        df = df_temp  # reset df before each loop
        #$$
        #$$
        if 1 == 1:
        ###if i == singleEthnic:
            count += 1
            ethnicity_tar = str(i)  # fr, en, ir, sc, others, ab, rus, ch, it, jp
            # fn, metis, inuit; algonquian, iroquoian, athapaskan, wakashan, siouan, salish, tsimshian, kootenay
            ############################################
            ############################################
            def ethnicity_target(row):
                try:
                    if row[ethnicity_var] == ethnicity_tar:
                        return 1
                    else:
                        return 0
                except:
                    return None
            df['ethnicity_scan'] = df.apply(ethnicity_target, axis=1)
            print '1=', ethnicity_tar
            print '0=', 'non-' + ethnicity_tar

            # Random sampling a smaller dataframe for debugging
            rows = df.sample(n=subsample_size, random_state=seed)  # Seed gives fixed randomness
            df = DataFrame(rows)
            print 'Class count:'
            print df['ethnicity_scan'].value_counts()

            # Assign X and y variables
            X = df.raw_name.values
            X2 = df.name.values
            X3 = df.gender.values
            X4 = df.location.values
            y = df.ethnicity_scan.values

            # Feature extraction functions
            def feature_full_name(nameString):
                try:
                    full_name = nameString
                    if len(full_name) > 1:  # do not accept a name with only 1 character
                        return full_name
                    else:
                        return '?'
                except:
                    return '?'

            def feature_full_last_name(nameString):
                try:
                    last_name = nameString.rsplit(None, 1)[-1]
                    if len(last_name) > 1:  # do not accept a name with only 1 character
                        return last_name
                    else:
                        return '?'
                except:
                    return '?'

            def feature_full_first_name(nameString):
                try:
                    first_name = nameString.rsplit(' ', 1)[0]
                    if len(first_name) > 1:  # do not accept a name with only 1 character
                        return first_name
                    else:
                        return '?'
                except:
                    return '?'

            # Transform the X variables into a numpy array of all features
            my_dict = [{'last-name': feature_full_last_name(i)} for i in X]
            my_dict5 = [{'first-name': feature_full_first_name(i)} for i in X]
            all_dict = []
            for i in range(0, len(my_dict)):
                temp_dict = dict(
                    my_dict[i].items() + my_dict5[i].items()
                )
                all_dict.append(temp_dict)
            newX = dv.fit_transform(all_dict)

            # Separate the training and testing data sets
            X_train, X_test, y_train, y_test = cross_validation.train_test_split(newX, y, test_size=testTrainSplit)

            # Fit the model using the training data
            classifierUsed2.fit(X_train, y_train)

            # Make predictions using the trained model
            y_train_predictions = classifierUsed2.predict(X_train)
            y_test_predictions = classifierUsed2.predict(X_test)
Code with the resampling step inserted:
try:
    for i in mass[k]:
        df = df_temp  # reset df before each loop
        #$$
        #$$
        if 1 == 1:
        ###if i == singleEthnic:
            count += 1
            ethnicity_tar = str(i)  # fr, en, ir, sc, others, ab, rus, ch, it, jp
            # fn, metis, inuit; algonquian, iroquoian, athapaskan, wakashan, siouan, salish, tsimshian, kootenay
            ############################################
            ############################################
            def ethnicity_target(row):
                try:
                    if row[ethnicity_var] == ethnicity_tar:
                        return 1
                    else:
                        return 0
                except:
                    return None
            df['ethnicity_scan'] = df.apply(ethnicity_target, axis=1)
            print '1=', ethnicity_tar
            print '0=', 'non-' + ethnicity_tar

            # Resampled
            df_resampled = df.append(df[df.ethnicity_scan == 0].sample(len(df) * 5, replace=True))

            # Random sampling a smaller dataframe for debugging
            rows = df_resampled.sample(n=subsample_size, random_state=seed)  # Seed gives fixed randomness
            df = DataFrame(rows)
            print 'Class count:'
            print df['ethnicity_scan'].value_counts()

            # Assign X and y variables
            X = df.raw_name.values
            X2 = df.name.values
            X3 = df.gender.values
            X4 = df.location.values
            y = df.ethnicity_scan.values

            # Feature extraction functions
            def feature_full_name(nameString):
                try:
                    full_name = nameString
                    if len(full_name) > 1:  # do not accept a name with only 1 character
                        return full_name
                    else:
                        return '?'
                except:
                    return '?'

            def feature_full_last_name(nameString):
                try:
                    last_name = nameString.rsplit(None, 1)[-1]
                    if len(last_name) > 1:  # do not accept a name with only 1 character
                        return last_name
                    else:
                        return '?'
                except:
                    return '?'

            def feature_full_first_name(nameString):
                try:
                    first_name = nameString.rsplit(' ', 1)[0]
                    if len(first_name) > 1:  # do not accept a name with only 1 character
                        return first_name
                    else:
                        return '?'
                except:
                    return '?'

            # Transform the X variables into a numpy array of all features
            my_dict = [{'last-name': feature_full_last_name(i)} for i in X]
            my_dict5 = [{'first-name': feature_full_first_name(i)} for i in X]
            all_dict = []
            for i in range(0, len(my_dict)):
                temp_dict = dict(
                    my_dict[i].items() + my_dict5[i].items()
                )
                all_dict.append(temp_dict)
            newX = dv.fit_transform(all_dict)

            # Separate the training and testing data sets
            X_train, X_test, y_train, y_test = cross_validation.train_test_split(newX, y, test_size=testTrainSplit)

            # Fit the model using the training data
            classifierUsed2.fit(X_train, y_train)

            # Make predictions using the trained model
            y_train_predictions = classifierUsed2.predict(X_train)
            y_test_predictions = classifierUsed2.predict(X_test)
Question comments:
Possible duplicate of Good ROC curve but poor precision-recall curve

Answer 1:

Your model outputs a probability P (between 0 and 1) for every row of the test set that it scores. The summary statistics (precision, recall, etc.) are computed for a single value of P used as the prediction threshold, probably P = 0.5, unless you have changed it in your code. The ROC, however, contains more information: the idea is that you may not want to use this default as your prediction threshold, so the ROC curve is drawn by computing the ratio of true positives to false positives at every prediction threshold between 0 and 1.
You are right that if you undersampled the non-Irish in your data, the AUC and precision will be overestimated. If your dataset has only 5,000 rows, then running the model on a larger training set will not be a problem; just rebalance your dataset (via bootstrap sampling to increase your non-Irish) until it accurately reflects your sample population.
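A minimal illustration of the answer's point, with made-up scores: precision and recall summarize the single operating point at threshold 0.5, while the ROC sweeps every threshold the scores admit.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy labels and scores (hypothetical, chosen so one negative outranks
# two of the positives)
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1])
y_score = np.array([0.10, 0.20, 0.30, 0.35, 0.80, 0.40, 0.60, 0.90])

# The full sweep: one (FPR, TPR) point per distinct threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(list(zip(thresholds, fpr, tpr)))
print('AUC =', auc)

# The single operating point that precision/recall describe (threshold 0.5)
pred = (y_score >= 0.5).astype(int)
tpr_05 = ((pred == 1) & (y_true == 1)).sum() / float((y_true == 1).sum())
fpr_05 = ((pred == 1) & (y_true == 0)).sum() / float((y_true == 0).sum())
print('at threshold 0.5: TPR =', tpr_05, 'FPR =', fpr_05)
```

The AUC here equals the fraction of (positive, negative) pairs where the positive is scored higher (13 of 15), so it can look strong even though the fixed 0.5 cutoff catches only 2 of the 3 positives.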
Comments:
I didn't change P, so it should be 0.5. For reporting purposes, can I report the existing precision, recall, and ROC as they are (with the default P = 0.5)?

No, that definitely won't do; you would be overstating the effectiveness of your model. Don't do it!

Please help me understand where you're coming from. You seem to be implying that the possible "overstating" of validity is due to the class imbalance. But I'm using performance metrics that are supposed to be sensitive to it (i.e. F1 score, precision, and recall). So why would reporting even the F1 score, precision, and recall overstate performance? (Note: I've heard of the over/undersampling techniques for imbalanced data, but they have flaws of their own, such as losing information or fitting too closely to duplicated noise.)

Or are you saying it only inflates the ROC metric? If so, does producing an AUC-ROC plot help?

If you report your accuracy as 89%, you are saying "my model predicts correctly 89% of the time." But because you undersampled the non-Irish, you have overstated the model's performance; if you reran the model on a new test set that was not undersampled, the precision would be much worse, maybe only 30%. Think of it this way: if I walk into a room and kick out a large crowd of non-Irish people, I suddenly become much better at judging whether people are Irish, even if I am just guessing at random!
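The inflation described in this thread can be checked arithmetically from the question's own confusion matrix: undersampling the majority class removes false positives (and true negatives) in proportion, so precision rises even though the classifier itself, i.e. its TPR and FPR, is unchanged. The 5x undersampling factor below is hypothetical, chosen only to match the ~1:5 class ratio:

```python
# Counts taken from the question's confusion matrix (Irish = positive)
tp, fn = 239790, 230497
tn, fp = 1961422, 84188

def precision(tp, fp):
    return float(tp) / (tp + fp)

# Precision against the full negative population vs. after cutting
# the negatives (both TN and FP) down by a factor of 5
p_full = precision(tp, fp)
p_under = precision(tp, fp / 5.0)
print('precision, all negatives kept :', round(p_full, 3))
print('precision, negatives cut 5x   :', round(p_under, 3))

# FPR is a ratio within the negative class, so it does not move
fpr_full = fp / float(fp + tn)
fpr_under = (fp / 5.0) / ((fp + tn) / 5.0)
print('FPR before and after:', round(fpr_full, 4), round(fpr_under, 4))
```

This is why threshold metrics measured on a rebalanced sample cannot be reported as-is for the original population, while rank-based metrics like AUC move much less.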