How to add a confusion matrix and 10-fold cross-validation to sentiment analysis
Posted: 2019-09-28 00:39:45
Question: I want to evaluate my model with k-fold cross-validation (k = 10) and a confusion matrix, but I'm not sure how to add them. Dataset: https://github.com/fadholifh/dats/blob/master/cpas.txt
Using Python 3.7
import sklearn.metrics
import sen
import csv
import os
import re
import nltk
import scipy
import numpy as np
import joblib  # sklearn.externals.joblib was removed from recent scikit-learn releases
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm
from sklearn.pipeline import Pipeline
from sklearn import model_selection
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory

factorys = StemmerFactory()
stemmer = factorys.create_stemmer()

if __name__ == "__main__":
    main()  # the definition of main() was not included in the question
The expected result is a confusion matrix and, for k-fold, the F1 score, precision, and recall as percentages for each fold.
Comments:
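I can't open the data. Could you share it so I can test the code? Thank you.
Answer 1:
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import KFold
from sklearn.svm import LinearSVC

# Requires the NLTK 'stopwords' and 'punkt' resources (see nltk.download).
df = pd.read_csv("cpas.txt", header=None, delimiter="\t")
X = df[1].values  # text column
y = df[0].values  # label column

stop_words = stopwords.words('english')
stemmer = PorterStemmer()

def clean_text(text, stop_words, stemmer):
    # Tokenize, drop stop words and pure numbers, then stem what is left.
    return " ".join([stemmer.stem(word) for word in word_tokenize(text)
                     if word not in stop_words and not word.isnumeric()])

X = np.array([clean_text(text, stop_words, stemmer) for text in X])

# 3 folds here; pass 10 as the first argument for the 10-fold setup in the question.
kfold = KFold(3, shuffle=True, random_state=33)
i = 1
for train_idx, test_idx in kfold.split(X):
    X_train = X[train_idx]
    y_train = y[train_idx]
    X_test = X[test_idx]
    y_test = y[test_idx]

    # Fit the vectorizer on the training fold only, so nothing leaks from the test fold.
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(X_train)
    X_test = vectorizer.transform(X_test)

    model = LinearSVC()
    model.fit(X_train, y_train)

    print("Fold : {}".format(i))
    i += 1
    # Per-class precision, recall, and F1 for this fold.
    print(classification_report(y_test, model.predict(X_test)))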
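The loop above prints a classification report per fold but no confusion matrix. A minimal sketch of how both could be produced with k = 10 follows. It makes several assumptions that are not in the original post: the toy corpus and its "pos"/"neg" labels are made up to stand in for cpas.txt, StratifiedKFold is swapped in for KFold so that every test fold contains both classes, and the scores are macro-averaged.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for cpas.txt: 20 positive, 20 negative texts.
pos = ["good great nice", "love it great", "nice and good", "great service love"]
neg = ["bad awful poor", "hate it awful", "poor and bad", "awful service hate"]
texts = np.array(pos * 5 + neg * 5)
labels = np.array(["pos"] * 20 + ["neg"] * 20)
class_names = ["neg", "pos"]  # fixes the row/column order of every confusion matrix

# Assumption: stratified folds instead of plain KFold, so each fold has both classes.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=33)
total_cm = np.zeros((2, 2), dtype=int)
fold_scores = []

for fold, (train_idx, test_idx) in enumerate(skf.split(texts, labels), start=1):
    # Fit TF-IDF on the training fold only, then reuse it on the test fold.
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(texts[train_idx])
    X_test = vectorizer.transform(texts[test_idx])

    model = LinearSVC()
    model.fit(X_train, labels[train_idx])
    y_pred = model.predict(X_test)

    # Rows are true classes, columns are predicted classes.
    cm = confusion_matrix(labels[test_idx], y_pred, labels=class_names)
    total_cm += cm

    precision, recall, f1, _ = precision_recall_fscore_support(
        labels[test_idx], y_pred, average="macro", zero_division=0)
    fold_scores.append((precision, recall, f1))

    print("Fold {}: precision={:.1%} recall={:.1%} F1={:.1%}".format(
        fold, precision, recall, f1))
    print(cm)

# Average the per-fold scores and show the confusion matrix summed over all folds.
mean_precision, mean_recall, mean_f1 = np.mean(fold_scores, axis=0)
print("Mean: precision={:.1%} recall={:.1%} F1={:.1%}".format(
    mean_precision, mean_recall, mean_f1))
print("Aggregated confusion matrix:")
print(total_cm)
```

Because every sample lands in exactly one test fold, the aggregated matrix counts each sample once. To try this on the real data, replace `texts` and `labels` with the cleaned `X` and `y` arrays from the answer.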
A common reason to use cross-validation is to tune parameters when you have little data. You can do that with a grid search combined with CV.
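from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

df = pd.read_csv("cpas.txt", header=None, delimiter="\t")
X = df[1].values
labels = df[0].values
text = np.array([clean_text(doc, stop_words, stemmer) for doc in X])

# Shuffle texts and labels with the same permutation so they stay aligned.
idx = np.arange(len(text))
np.random.shuffle(idx)
text = text[idx]
labels = labels[idx]

pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('svm', LinearSVC())])

# Keys follow the "<step name>__<parameter>" convention of Pipeline.
params = {
    'vectorizer__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'vectorizer__lowercase': [True, False],
    'vectorizer__norm': ['l1', 'l2']
}

model = GridSearchCV(pipeline, params, cv=3, verbose=1)
# Fit on the shuffled labels, not the unshuffled y from the first snippet.
model.fit(text, labels)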
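Once the grid search has been fitted, the selected parameters can be inspected and a confusion matrix built from out-of-fold predictions. The sketch below is one way this might look, under assumptions that are not in the original answer: the toy corpus is made up, the grid is reduced to two parameters, the search uses 10 stratified folds with macro F1 as the score, and `cross_val_predict` supplies the out-of-fold predictions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for cpas.txt: 20 positive, 20 negative texts.
pos = ["good great nice", "love it great", "nice and good", "great service love"]
neg = ["bad awful poor", "hate it awful", "poor and bad", "awful service hate"]
texts = np.array(pos * 5 + neg * 5)
labels = np.array(["pos"] * 20 + ["neg"] * 20)

pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("svm", LinearSVC()),
])

# Reduced grid (an assumption, to keep the example fast).
params = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "vectorizer__norm": ["l1", "l2"],
}

# Assumption: 10 stratified folds and macro F1 as the selection score.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=33)
search = GridSearchCV(pipeline, params, cv=cv, scoring="f1_macro")
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print("Best mean macro F1: {:.1%}".format(search.best_score_))

# Each sample is predicted by a model that did not see it during fitting.
y_pred = cross_val_predict(search.best_estimator_, texts, labels, cv=cv)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(labels, y_pred, labels=["neg", "pos"])
print(cm)
print(classification_report(labels, y_pred))
```

Scores obtained on the same folds that were used to pick the parameters tend to be optimistic, so a separate held-out set or nested cross-validation is worth considering when a final estimate is needed.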