How to compute the classification report for sentiment analysis with scikit-learn
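[Title]: How to compute the classification report for sentiment analysis with scikit-learn [Published]: 2018-09-26 12:23:42 [Question]: How can I get a classification report with precision, recall, accuracy, and support for a 3-class classification problem whose classes are "positive", "negative", and "neutral"? Here is the code: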
vec_clf = Pipeline([('vectorizer', vec), ('pac', svm_clf)])
print vec_clf.fit(X_train.values.astype('U'),y_train.values.astype('U'))
y_pred = vec_clf.predict(X_test.values.astype('U'))
print "SVM Accuracy-",metrics.accuracy_score(y_test, y_pred)
print "confuson metrics :\n", metrics.confusion_matrix(y_test, y_pred, labels=["positive","negative","neutral"])
print(metrics.classification_report(y_test, y_pred))
It prints the accuracy and the confusion matrix, then fails with this error:
SVM Accuracy- 0.850318471338
confuson metrics :
[[206 9 67]
[ 4 373 122]
[ 9 21 756]]
Traceback (most recent call last):
File "<ipython-input-62-e6ab3066790e>", line 1, in <module>
runfile('C:/Users/HP/abc16.py', wdir='C:/Users/HP')
File "C:\ProgramData\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 880, in runfile
execfile(filename, namespace)
File "C:\ProgramData\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 87, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Users/HP/abc16.py", line 133, in <module>
print(metrics.classification_report(y_test, y_pred))
File "C:\ProgramData\Anaconda2\lib\site-packages\sklearn\metrics\classification.py", line 1391, in classification_report
labels = unique_labels(y_true, y_pred)
File "C:\ProgramData\Anaconda2\lib\site-packages\sklearn\utils\multiclass.py", line 104, in unique_labels
raise ValueError("Mix of label input types (string and number)")
ValueError: Mix of label input types (string and number)
Please point out where I am going wrong.
Edit 1: this is what y_true and y_pred look like
print "y_true :" ,y_test
print "y_pred :",y_pred
y_true : 5985 neutral
899 positive
2403 neutral
3963 neutral
3457 neutral
5345 neutral
3779 neutral
299 neutral
5712 neutral
5511 neutral
234 neutral
1684 negative
3701 negative
2886 neutral
.
.
.
2623 positive
3549 neutral
4574 neutral
4972 positive
Name: sentiment, Length: 1570, dtype: object
y_pred : [u'neutral' u'positive' u'neutral' ..., u'neutral' u'neutral' u'negative']
Edit 2: output of type(y_true) and type(y_pred)
type(y_true): <class 'pandas.core.series.Series'>
type(y_pred): <type 'numpy.ndarray'>
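[Comments]:
Please share a sample of your y_pred and y_true.
Could you also give the output of type(y_pred) and type(y_true)? (Not in the comments; edit the question.)
Cannot reproduce your error (see the answer below). Please see How to create a Minimal, Complete, and Verifiable example.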
[Answer 1]:
I cannot reproduce your error:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# toy data, similar to yours:
data = {'id': [5985, 899, 2403, 1684], 'sentiment': ['neutral', 'positive', 'neutral', 'negative']}
y_true = pd.Series(data['sentiment'], index=data['id'], name='sentiment')
y_true
# 5985 neutral
# 899 positive
# 2403 neutral
# 1684 negative
# Name: sentiment, dtype: object
type(y_true)
# pandas.core.series.Series
y_pred = np.array(['neutral', 'positive', 'negative', 'neutral'])
# all metrics working fine:
accuracy_score(y_true, y_pred)
# 0.5
confusion_matrix(y_true, y_pred)
# array([[0, 1, 0],
# [1, 1, 0],
# [0, 0, 1]], dtype=int64)
print(classification_report(y_true, y_pred))
# result:
#              precision    recall  f1-score   support
#
#    negative       0.00      0.00      0.00         1
#     neutral       0.50      0.50      0.50         2
#    positive       1.00      1.00      1.00         1
#
#       total       0.50      0.50      0.50         4
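One detail in the posted output may be worth checking. The nine confusion-matrix entries sum to 1567, while y_test has 1570 rows, and the reported accuracy equals 1335/1570, where 1335 = 206 + 373 + 756 is the matrix diagonal. So three rows carry a label outside "positive", "negative", and "neutral" in either y_test or y_pred. Since y_pred contains only unicode strings and y_test is an uncast object column, one possible explanation (an assumption, not something the question confirms) is that y_test holds a few non-string values such as NaN. The sketch below uses hypothetical toy data and Python 3 syntax (the question uses Python 2, where the string check would be against basestring). It shows one way to look for such values and to score only the rows that have a string label:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report

# Hypothetical toy data: the NaN stands in for whatever non-string value
# might be hiding in the real y_test (an assumption, not from the question).
y_test = pd.Series(
    ['neutral', 'positive', np.nan, 'negative', 'neutral'],
    index=[5985, 899, 2403, 1684, 3701],
    name='sentiment',
)
y_pred = np.array(['neutral', 'positive', 'neutral', 'neutral', 'negative'])

# Count the Python types stored in the object column.
type_counts = y_test.map(lambda v: type(v).__name__).value_counts()
print(type_counts)

# Boolean mask marking the rows whose label is a string.
is_str = y_test.map(lambda v: isinstance(v, str)).to_numpy()

# Show the offending rows, if any.
print(y_test[~is_str])

# Score only the rows with a string label; y_pred is filtered with the
# same positional mask so both arrays stay aligned.
y_true_clean = y_test[is_str].astype(str)
y_pred_clean = y_pred[is_str]
report = classification_report(y_true_clean, y_pred_clean)
print(report)
```

If the type count on the real y_test shows anything besides str, cleaning or dropping those rows before the train/test split is probably the simpler fix, because the classifier would otherwise also be trained on them.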