How to compute the classification report for sentiment analysis with scikit-learn



【Title】: How to compute the classification report for sentiment analysis with scikit-learn 【Posted】: 2018-09-26 12:23:42 【Question】:

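How can I get a classification report that measures precision, recall, accuracy and support for a 3-class classification, where the classes are "positive", "negative" and "neutral"? Below is the code: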

vec_clf = Pipeline([('vectorizer', vec), ('pac', svm_clf)])
print vec_clf.fit(X_train.values.astype('U'),y_train.values.astype('U'))

y_pred = vec_clf.predict(X_test.values.astype('U'))
print "SVM Accuracy-",metrics.accuracy_score(y_test, y_pred)

print "confuson metrics :\n", metrics.confusion_matrix(y_test, y_pred, labels=["positive","negative","neutral"])
print(metrics.classification_report(y_test, y_pred))


SVM Accuracy- 0.850318471338
confuson metrics :
[[206   9  67]
 [  4 373 122]
 [  9  21 756]]
Traceback (most recent call last):

  File "<ipython-input-62-e6ab3066790e>", line 1, in <module>
    runfile('C:/Users/HP/abc16.py', wdir='C:/Users/HP')

  File "C:\ProgramData\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 880, in runfile
    execfile(filename, namespace)

  File "C:\ProgramData\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 87, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)

  File "C:/Users/HP/abc16.py", line 133, in <module>
    print(metrics.classification_report(y_test, y_pred))

  File "C:\ProgramData\Anaconda2\lib\site-packages\sklearn\metrics\classification.py", line 1391, in classification_report
    labels = unique_labels(y_true, y_pred)

  File "C:\ProgramData\Anaconda2\lib\site-packages\sklearn\utils\multiclass.py", line 104, in unique_labels
    raise ValueError("Mix of label input types (string and number)")

ValueError: Mix of label input types (string and number)


Edit 1: this is what y_true and y_pred look like

        print "y_true :" ,y_test
        print "y_pred :",y_pred
        y_true : 5985     neutral
        899     positive
        2403     neutral
        3963     neutral
        3457     neutral
        5345     neutral
        3779     neutral
        299      neutral
        5712     neutral
        5511     neutral
        234      neutral
        1684    negative
        3701    negative
        2886     neutral
        2623    positive
        3549     neutral
        4574     neutral
        4972    positive
        Name: sentiment, Length: 1570, dtype: object
        y_pred : [u'neutral' u'positive' u'neutral' ..., u'neutral' u'neutral' u'negative']

Edit 2: output of type(y_true) and type(y_pred)

type(y_true):  <class 'pandas.core.series.Series'>
type(y_pred):  <type 'numpy.ndarray'>
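
One observation from the numbers printed above, offered as a hint rather than a confirmed diagnosis: the confusion matrix entries sum to 1567, while the y_test printout reports Length: 1570, and 1335 / 1570 reproduces the reported accuracy. That suggests three test labels fall outside "positive", "negative" and "neutral". A plausible cause (an assumption, since the full y_test is not shown) is missing values, which pandas stores as float NaN in an object column; note that y_test is passed to the metrics without the astype('U') cast applied to y_train. The sketch below redoes that arithmetic and shows a hypothetical helper, non_string_labels, for locating non-text labels. It is written for Python 3, and toy_y_test is made-up data.

```python
import math

# Confusion matrix printed in the question
# (rows: true label, columns: predicted label;
#  label order: positive, negative, neutral).
cm = [[206, 9, 67],
      [4, 373, 122],
      [9, 21, 756]]

n_in_matrix = sum(sum(row) for row in cm)    # samples covered by the matrix
n_correct = sum(cm[i][i] for i in range(3))  # diagonal = correct predictions
n_test = 1570                                # "Length: 1570" in the y_test printout

print(n_in_matrix, n_correct, n_test - n_in_matrix)  # 1567 1335 3
print(round(n_correct / n_test, 12))                 # 0.850318471338


def non_string_labels(labels):
    """Return (position, value) pairs for labels that are not text.

    Hypothetical helper for illustration; works on any iterable of
    labels, including a pandas Series (positions are 0-based offsets,
    not index values).
    """
    return [(i, v) for i, v in enumerate(labels) if not isinstance(v, str)]


# Made-up stand-in for y_test with one missing value (NaN is a float).
toy_y_test = ["neutral", "positive", float("nan"), "negative"]
bad = non_string_labels(toy_y_test)
print(bad)
```

If non_string_labels(y_test) returns anything on the real data, dropping those rows from both y_test and y_pred before calling classification_report is one option; casting with astype('U') would instead turn a missing value into the literal text 'nan', which then shows up as a fourth class.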


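Please share a sample of your y_pred and y_true.
Can you also give the output of type(y_pred) & type(y_true) (not in the comments; edit the question)?
Cannot reproduce your error (see the answer below) - please see How to create a Minimal, Complete, and Verifiable example.

【Answer 1】: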


import pandas as pd
import numpy as np

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# toy data, similar to yours:
data = {'id': [5985, 899, 2403, 1684], 'sentiment': ['neutral', 'positive', 'neutral', 'negative']}
y_true = pd.Series(data['sentiment'], index=data['id'], name='sentiment')
# 5985     neutral
# 899     positive
# 2403     neutral
# 1684    negative
# Name: sentiment, dtype: object
# pandas.core.series.Series
y_pred = np.array(['neutral', 'positive', 'negative', 'neutral'])

# all metrics working fine:

accuracy_score(y_true, y_pred)
# 0.5

confusion_matrix(y_true, y_pred)
# array([[0, 1, 0],
#        [1, 1, 0],
#        [0, 0, 1]], dtype=int64)

print(classification_report(y_true, y_pred))
# result:
#              precision    recall  f1-score   support
#
#    negative       0.00      0.00      0.00         1
#     neutral       0.50      0.50      0.50         2
#    positive       1.00      1.00      1.00         1
#       total       0.50      0.50      0.50         4
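
To make the columns of the report above concrete, here is a minimal pure-Python sketch that recomputes precision, recall, F1 and support per class for the same toy data. It is not scikit-learn's implementation; per_class_report is a hypothetical helper, and treating a zero denominator as 0.0 is a choice made here for simplicity.

```python
from collections import Counter

# Same toy labels as in the answer above.
y_true = ["neutral", "positive", "neutral", "negative"]
y_pred = ["neutral", "positive", "negative", "neutral"]


def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for every label."""
    true_pos = Counter()           # correct predictions per label
    predicted = Counter(y_pred)    # how often each label was predicted
    support = Counter(y_true)      # how often each label truly occurs
    for t, p in zip(y_true, y_pred):
        if t == p:
            true_pos[t] += 1

    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        # precision = TP / predicted count, recall = TP / true count
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / support[label] if support[label] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = (precision, recall, f1, support[label])
    return report


report = per_class_report(y_true, y_pred)
for label, (p, r, f1, n) in report.items():
    print("%-10s %9.2f %9.2f %9.2f %9d" % (label, p, r, f1, n))

# Support-weighted average precision, matching the last row of the report.
total = sum(n for _, _, _, n in report.values())
weighted_precision = sum(p * n for p, _, _, n in report.values()) / total
print(weighted_precision)
```

The per-class values agree with the report: "negative" is never predicted correctly, "neutral" is right for one of its two true samples and one of its two predictions, and "positive" is predicted exactly once and correctly.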


