How to compute the classification report for sentiment analysis with scikit-learn

Posted: 2018-09-26 12:23:42

Question:

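How can I get a classification report showing precision, recall, accuracy, and support for a 3-class classification problem whose classes are "positive", "negative", and "neutral"? Here is the code: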

vec_clf = Pipeline([('vectorizer', vec), ('pac', svm_clf)])
print vec_clf.fit(X_train.values.astype('U'),y_train.values.astype('U'))

y_pred = vec_clf.predict(X_test.values.astype('U'))
print "SVM Accuracy-",metrics.accuracy_score(y_test, y_pred)

print "confuson metrics :\n", metrics.confusion_matrix(y_test, y_pred, labels=["positive","negative","neutral"])
print(metrics.classification_report(y_test, y_pred))

It prints the accuracy and the confusion matrix, then fails with this error:

SVM Accuracy- 0.850318471338
confuson metrics :
[[206   9  67]
 [  4 373 122]
 [  9  21 756]]
Traceback (most recent call last):

  File "<ipython-input-62-e6ab3066790e>", line 1, in <module>
    runfile('C:/Users/HP/abc16.py', wdir='C:/Users/HP')

  File "C:\ProgramData\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 880, in runfile
    execfile(filename, namespace)

  File "C:\ProgramData\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 87, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)

  File "C:/Users/HP/abc16.py", line 133, in <module>
    print(metrics.classification_report(y_test, y_pred))

  File "C:\ProgramData\Anaconda2\lib\site-packages\sklearn\metrics\classification.py", line 1391, in classification_report
    labels = unique_labels(y_true, y_pred)

  File "C:\ProgramData\Anaconda2\lib\site-packages\sklearn\utils\multiclass.py", line 104, in unique_labels
    raise ValueError("Mix of label input types (string and number)")

ValueError: Mix of label input types (string and number)

Please guide me on where I am going wrong.

Edit 1: This is what y_true and y_pred look like:

        print "y_true :" ,y_test
        print "y_pred :",y_pred
        y_true : 5985     neutral
        899     positive
        2403     neutral
        3963     neutral
        3457     neutral
        5345     neutral
        3779     neutral
        299      neutral
        5712     neutral
        5511     neutral
        234      neutral
        1684    negative
        3701    negative
        2886     neutral
        .
        .
        .
        2623    positive
        3549     neutral
        4574     neutral
        4972    positive
        Name: sentiment, Length: 1570, dtype: object
        y_pred : [u'neutral' u'positive' u'neutral' ..., u'neutral' u'neutral' u'negative']

Edit 2: Output of type(y_true) and type(y_pred):

type(y_true):  <class 'pandas.core.series.Series'>
type(y_pred):  <type 'numpy.ndarray'>
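
One thing the container types above do not show is the type of each individual label. The printed confusion matrix, restricted to the three named labels, sums to 1567, while y_test has 1570 rows, so three rows appear to carry a label outside "positive", "negative", and "neutral". That might be a missing value or a number, which would fit the "Mix of label input types" message, but this is an inference and not something the question confirms. A minimal Python 3 sketch of how the element types could be checked; `y_test_values` is a hypothetical stand-in, not the asker's data, and `element_type_counts` is a helper invented for illustration:

```python
from collections import Counter

# Hypothetical stand-in for y_test.values: mostly strings, plus a float NaN
# such as pandas stores for a missing cell. This is NOT the asker's real data.
y_test_values = ["neutral", "positive", float("nan"), "negative", "neutral"]


def element_type_counts(values):
    """Count how many labels of each Python type the sequence contains."""
    return Counter(type(v).__name__ for v in values)


type_counts = element_type_counts(y_test_values)
print(type_counts)

# Confusion matrix exactly as printed in the question.
matrix = [[206, 9, 67], [4, 373, 122], [9, 21, 756]]
counted = sum(sum(row) for row in matrix)
missing = 1570 - counted  # 1570 is the Series length shown in Edit 1
print(counted, missing)

# One possible workaround (an assumption, not a confirmed fix):
# turn every label into text before scoring.
as_text = [str(v) for v in y_test_values]
```

With a real Series, the same idea would be `y_test.map(type).value_counts()`. Casting the test labels the same way the training labels were cast (`.astype('U')`) may avoid the mixed-type error, but a missing value would then show up as its own text label such as 'nan', so dropping or fixing those rows first is probably cleaner.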

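Comments:

Please share a sample of your y_pred and y_true.

Could you also give the output of type(y_pred) and type(y_true)? (Not in the comments; edit the question.)

Cannot reproduce your error (see the answer below). Please see How to create a Minimal, Complete, and Verifiable example.

Answer 1: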

I cannot reproduce your error:

import pandas as pd
import numpy as np

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# toy data, similar to yours:
data = {'id': [5985, 899, 2403, 1684], 'sentiment': ['neutral', 'positive', 'neutral', 'negative']}
y_true = pd.Series(data['sentiment'], index=data['id'], name='sentiment')
y_true
# 5985     neutral
# 899     positive
# 2403     neutral
# 1684    negative
# Name: sentiment, dtype: object
type(y_true)
# pandas.core.series.Series
y_pred = np.array(['neutral', 'positive', 'negative', 'neutral'])

# all metrics working fine:

accuracy_score(y_true, y_pred)
# 0.5

confusion_matrix(y_true, y_pred)
# array([[0, 1, 0],
#        [1, 1, 0],
#        [0, 0, 1]], dtype=int64)

print(classification_report(y_true, y_pred))
# result:
#              precision    recall  f1-score   support
#
#    negative       0.00      0.00      0.00         1
#    neutral        0.50      0.50      0.50         2
#    positive       1.00      1.00      1.00         1
#       total       0.50      0.50      0.50         4
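
For reference, the per-class numbers that a classification report contains can also be derived by hand from the confusion matrix printed in the question. The following is a minimal Python 3 sketch, assuming scikit-learn's convention that rows are true labels and columns are predicted labels, in the order passed via `labels=`. The helper `per_class_metrics` is invented for illustration. Because the matrix covers 1567 of the 1570 test rows, the values may differ slightly from what `classification_report` would print on the full data.

```python
# Label order and matrix exactly as given in the question.
labels = ["positive", "negative", "neutral"]
matrix = [[206, 9, 67], [4, 373, 122], [9, 21, 756]]


def per_class_metrics(matrix, labels):
    """Derive precision, recall, F1 and support for each class.

    Assumes rows are true labels and columns are predicted labels.
    """
    report = {}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        support = sum(matrix[i])                   # row sum: samples truly in this class
        predicted = sum(row[i] for row in matrix)  # column sum: samples predicted as this class
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": support,
        }
    return report


report = per_class_metrics(matrix, labels)
for label in labels:
    m = report[label]
    print("%-9s %.2f %.2f %.2f %d" % (
        label, m["precision"], m["recall"], m["f1"], m["support"]))
```

For example, "neutral" has 756 correct predictions out of 945 predicted as neutral (precision 756/945) and out of 786 true neutral samples (recall 756/786).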
