Scikit-learn script giving vastly different results than the tutorial, and gives an error when I change the dataframes
【Posted】2017-01-04 09:36:10
【Question】I am following a tutorial that contains this section:
>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.linear_model.logistic import LogisticRegression
>>> from sklearn.cross_validation import train_test_split, cross_val_score
>>> df = pd.read_csv('data/sms.csv')
>>> X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['message'], df['label'])
>>> vectorizer = TfidfVectorizer()
>>> X_train = vectorizer.fit_transform(X_train_raw)
>>> X_test = vectorizer.transform(X_test_raw)
>>> classifier = LogisticRegression()
>>> classifier.fit(X_train, y_train)
>>> precisions = cross_val_score(classifier, X_train, y_train, cv=5, scoring='precision')
>>> print 'Precision', np.mean(precisions), precisions
>>> recalls = cross_val_score(classifier, X_train, y_train, cv=5, scoring='recall')
>>> print 'Recalls', np.mean(recalls), recalls
Then I copied it with a few modifications:
ddir = (sys.argv[1])
df = pd.read_csv(ddir + '/SMSSpamCollection', sep='\t', quoting=csv.QUOTE_NONE, names=["label", "message"])
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['label'], df['message'])
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
precisions = cross_val_score(classifier, X_train, y_train, cv=5, scoring='precision')
recalls = cross_val_score(classifier, X_train, y_train, cv=5, scoring='recall')
print 'Precision', np.mean(precisions), precisions
print 'Recalls', np.mean(recalls), recalls
However, even though the code differs only slightly, the results in the book are much better than mine:
Book: Precision 0.992137651822 [ 0.98717949 0.98666667 1. 0.98684211 1. ]
Recall 0.677114261885 [ 0.7 0.67272727 0.6 0.68807339 0.72477064]
Mine: Precision 0.108435683974 [ 2.33542342e-06 1.22271611e-03 1.68918919e-02 1.97530864e-01 3.26530612e-01]
Recalls 0.235220281632 [ 0.00152053 0.03370787 0.125 0.44444444 0.57142857]
Going back over the script to see where it went wrong, I thought line 18,
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['label'], df['message'])
was the culprit, so I changed (df['label'], df['message']) to (df['message'], df['label']). But that gives me this error:
Traceback (most recent call last):
File "Chapter4[B-FLGTLG]C[Y-BCPM][G-PAR--[00].py", line 30, in <module>
precisions = cross_val_score(classifier, X_train, y_train, cv=5, scoring='precision')
File "/usr/local/lib/python2.7/dist-packages/sklearn/cross_validation.py", line 1433, in cross_val_score
for train, test in cv)
File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/parallel.py", line 800, in __call__
while self.dispatch_one_batch(iterator):
File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/parallel.py", line 658, in dispatch_one_batch
self._dispatch(tasks)
File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/parallel.py", line 566, in _dispatch
job = ImmediateComputeBatch(batch)
File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/parallel.py", line 180, in __init__
self.results = batch()
File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/parallel.py", line 72, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/usr/local/lib/python2.7/dist-packages/sklearn/cross_validation.py", line 1550, in _fit_and_score
test_score = _score(estimator, X_test, y_test, scorer)
File "/usr/local/lib/python2.7/dist-packages/sklearn/cross_validation.py", line 1606, in _score
score = scorer(estimator, X_test, y_test)
File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/scorer.py", line 90, in __call__
**self._kwargs)
File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py", line 1203, in precision_score
sample_weight=sample_weight)
File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py", line 984, in precision_recall_fscore_support
(pos_label, present_labels))
ValueError: pos_label=1 is not a valid label: array(['ham', 'spam'],
dtype='|S4')
What could the problem be? The data is here: http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection, in case anyone wants to try it.
【Comments】:
Is the data the tutorial used to get its results the same? — Yes. From the book: "We will use the SMS Spam Classification Data Set from the UCI Machine Learning Repository. The dataset can be downloaded from http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection. First, let's explore the data set and calculate some basic summary statistics using pandas."
Why do you include the sep="\t" argument (and the other arguments in the read_csv call)? Have you checked that the data imported correctly? If the tutorial uses the same data but without "\t", the data might be comma-separated rather than tab-separated.
Actually, that line had problems, so I replaced the one from the book with the one from here: radimrehurek.com/data_science_python. The lines in SMSSpamCollection are whitespace-separated, but the content is the same as sms.csv, which I don't have.
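As a quick sanity check on the loading step (a sketch of my own, not from the original thread; it assumes the SMSSpamCollection file from the UCI link sits in the working directory), the question's read_csv call can be inspected directly:
import csv
import pandas as pd
# Load the UCI file exactly as the question does (tab-separated, no header row)
df = pd.read_csv('SMSSpamCollection', sep='\t', quoting=csv.QUOTE_NONE, names=["label", "message"])
# Inspect the parsed frame
print(df.shape)                     # one row per SMS, two columns
print(df['label'].value_counts())   # counts of 'ham' and 'spam'
print(df.head())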
【Answer 1】:
The error at the end of the stack trace is the key to understanding what is going on here.
ValueError: pos_label=1 is not a valid label: array(['ham', 'spam'], dtype='|S4')
You are trying to score the model with precision and recall. Recall that these metrics are formulated in terms of true positives, false positives, and false negatives. But how is sklearn supposed to know what counts as positive and what counts as negative? Is it "ham" or "spam"? We need a way to tell sklearn that we consider "spam" the positive label and "ham" the negative label. According to the sklearn documentation, the precision and recall scorers expect a positive label of 1 by default, hence the pos_label=1 part of the error message.
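A minimal illustration of that point (my own sketch, not part of the original answer; it assumes an sklearn version comparable to the one in the traceback):
from sklearn.metrics import precision_score
y_true = ['ham', 'spam', 'spam', 'ham']
y_pred = ['ham', 'spam', 'ham', 'ham']
# With string labels, the default pos_label=1 raises the same ValueError seen in the traceback:
# precision_score(y_true, y_pred)
# Telling the scorer which class is positive fixes it:
print(precision_score(y_true, y_pred, pos_label='spam'))   # 1.0 here: the one message predicted spam really is spam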
There are at least 3 ways to fix this.
1. Encode the "ham" and "spam" values as 0 and 1 directly at the source, to suit the precision/recall scorers:
# Map dataframe to encode values and put values into a numpy array
encoded_labels = df['label'].map(lambda x: 1 if x == 'spam' else 0).values # ham will be 0 and spam will be 1
# Continue as normal
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['message'], encoded_labels)
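A quick way to confirm the mapping came out as intended (a hypothetical check, not part of the original answer):
# The first few raw labels and their encodings should line up: ham -> 0, spam -> 1
print(df['label'].head().values)
print(encoded_labels[:5])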
2. Use sklearn's built-in function (label_binarize) to convert the categorical data into encoded integers, again to suit the precision/recall scorers. This converts your categorical labels into integers:
# Encode labels
from sklearn.preprocessing import label_binarize
encoded_column_vector = label_binarize(df['label'], classes=['ham','spam']) # ham will be 0 and spam will be 1
encoded_labels = np.ravel(encoded_column_vector) # Reshape array
# Continue as normal
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['message'], encoded_labels)
3. Create scorer objects with a custom pos_label argument:
As the documentation states, the precision and recall scorers have a pos_label parameter of 1 by default, but it can be changed to tell the scorer which string represents the positive label. You can use make_scorer to construct scorer objects with different parameters.
# Start out as you did originally with string labels
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['message'], df['label'])
# Fit classifier as normal ...
# Get precision and recall
from sklearn.metrics import precision_score, recall_score, make_scorer
# Precision
precision_scorer = make_scorer(precision_score, pos_label='spam')
precisions = cross_val_score(classifier, X_train, y_train, cv=5, scoring=precision_scorer)
print 'Precision', np.mean(precisions), precisions
# Recall
recall_scorer = make_scorer(recall_score, pos_label='spam')
recalls = cross_val_score(classifier, X_train, y_train, cv=5, scoring=recall_scorer)
print 'Recalls', np.mean(recalls), recalls
With any of these changes applied to your code, I get mean precision and recall scores of about 0.990 and 0.704, in line with the numbers from the book.
Of the 3 options, I recommend option 3 the most, because it is the hardest to get wrong.
【Comments】:
Thanks so much, this has really been bugging me. Can I ask how, though? The data doesn't have any headers: pastebin.com/rm3gyaGE
The data gets headers when you import the csv as a pandas dataframe and specify the name of each column, as in names=["label", "message"]. The first column gets the header "label" and the second column gets the header "message".
As for why it worked when you switched the label and message arguments in train_test_split, I am not sure; that should not work at all. It is basically asking the model to predict: given an email I know is "spam", how likely is it that the message is "some really long message"? That doesn't make much sense, right? Something very wrong was happening there. After the change, you had the arguments set up correctly; however, the target labels were in a format the scorers could not understand. The precision and recall scorers need to know what the positive label is.