Text Classification with Logistic Regression
Posted by bitcarmanlee
1. Data format
sentence,label
游戏太坑,暴率太低,太克金,平民不能玩,negative
让人失望,negative
能解决一下服务器问题?网络正常老掉线,换手机也一样。。。,negative
期待,positive
一星也不想给,这特么简直龟速,炫舞老年版?,negative
衣服不好看游戏内容无特色,界面乱糟糟的,negative
喜欢喜欢,positive
从有了这个手游就一直玩,很喜欢呀,希望更多漂漂衣服,positive
因违反评价条例规定被折叠,negative
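The file is a two-column CSV with a header row, and the label is the last field on each line. Below is a minimal sketch of reading this layout with only the standard library. It assumes the label always follows the final ASCII comma (the sample sentences use full-width commas, so they do not collide with the delimiter). `parse_rows` and `SAMPLE` are hypothetical names introduced for illustration and are not part of the original code.

```python
from collections import Counter

# A few rows copied from the sample above, including the header line.
SAMPLE = """sentence,label
让人失望,negative
期待,positive
喜欢喜欢,positive
衣服不好看游戏内容无特色,界面乱糟糟的,negative
"""

def parse_rows(text):
    """Split each data line on its last ASCII comma into (sentence, label)."""
    rows = []
    for line in text.splitlines()[1:]:  # skip the header row
        if not line.strip():
            continue
        sentence, label = line.rsplit(",", 1)
        rows.append((sentence, label.strip()))
    return rows

rows = parse_rows(SAMPLE)
label_counts = Counter(label for _, label in rows)
print(rows[0])
print(label_counts)
```

Checking the label counts before training is a cheap way to spot a heavily imbalanced data set.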
2. Data preprocessing
import jieba
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
def get_stop_words():
    # Load the stop words (one per line) into a set for fast membership tests.
    filename = "your stop words file path"
    with open(filename, encoding='utf-8') as f:
        return {line.strip() for line in f}
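def processing_sentence(x, stop_words):
    # Segment the sentence with jieba, drop stop words and blanks,
    # then re-join with spaces so that TfidfVectorizer can tokenize it.
    cut_word = jieba.cut(str(x).strip())
    words = [word for word in cut_word if word not in stop_words and word != ' ']
    return ' '.join(words)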
def data_processing():
    train_file = "your train file path"
    df = pd.read_csv(train_file)
    # Hold out 10% of the rows as the test set.
    x_train, x_test, y_train, y_test = train_test_split(df['sentence'], df['label'], test_size=0.1)
    stop_words = get_stop_words()
    x_train = x_train.apply(lambda x: processing_sentence(x, stop_words))
    x_test = x_test.apply(lambda x: processing_sentence(x, stop_words))
    # Fit the vocabulary and idf weights on the training split only,
    # then reuse them to transform the test split.
    tf = TfidfVectorizer()
    x_train = tf.fit_transform(x_train)
    x_test = tf.transform(x_test)
    # Convert the sparse matrices to dense arrays.
    x_train_weight = x_train.toarray()
    x_test_weight = x_test.toarray()
    return x_train_weight, x_test_weight, y_train, y_test
Overall, the text is first segmented into words and then converted into tf-idf features.
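To make the tf-idf step concrete, here is a simplified pure-Python sketch of the weighting that TfidfVectorizer applies with its default settings: raw term counts, a smoothed idf of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The `tfidf` function and the three pre-segmented sentences are assumptions made for illustration. Unlike TfidfVectorizer, whose default token pattern discards single-character tokens, this sketch keeps every whitespace-separated token.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocab, vectors) with L2-normalized tf-idf rows.

    docs are whitespace-joined token strings, i.e. the form that
    processing_sentence returns.
    """
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1 for tok in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        vectors.append([v / norm for v in raw])
    return vocab, vectors

# Hypothetical, already segmented sentences.
docs = ["游戏 喜欢", "游戏 失望", "喜欢 喜欢 衣服"]
vocab, vectors = tfidf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, [round(v, 3) for v in vec])
```

A token that occurs in fewer documents gets a larger idf, so in the second sentence "失望" outweighs the more common "游戏".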
3. Building the LR model
def model_train():
    x_train_weight, x_test_weight, y_train, y_test = data_processing()
    # L2-regularized logistic regression; C is the inverse regularization strength.
    lr = LogisticRegression(C=1.0, penalty='l2', tol=0.01)
    lr.fit(x_train_weight, y_train)
    train_score = lr.score(x_train_weight, y_train)
    print("Training accuracy: ", train_score)
    # Evaluate on the held-out test set.
    y_predict = lr.predict(x_test_weight)
    confusion_mat = metrics.confusion_matrix(y_test, y_predict)
    print('Test accuracy:', metrics.accuracy_score(y_test, y_predict))
    print("confusion_matrix is: ", confusion_mat)
    print('Classification report:', metrics.classification_report(y_test, y_predict))
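To show what the `C` and `penalty='l2'` arguments control, here is a from-scratch sketch that minimizes an L2-penalized log-loss, 0.5 * ||w||^2 + C * sum(log-loss), by plain batch gradient descent. This is not the solver scikit-learn uses. `fit_logistic`, `predict`, the step size, the epoch count, and the toy two-feature vectors are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, C=1.0, step=0.1, epochs=2000):
    """Batch gradient descent on 0.5 * ||w||^2 + C * sum(log-loss).

    y holds 0/1 labels; the intercept b is not penalized.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of the penalty term 0.5 * ||w||^2
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += C * err * xj
            grad_b += C * err
        w = [wj - step * gj for wj, gj in zip(w, grad_w)]
        b -= step * grad_b
    return w, b

def predict(X, w, b):
    # Predict class 1 when the estimated probability is at least 0.5.
    return [1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0
            for xi in X]

# Toy features: [weight of a "positive" word, weight of a "negative" word].
X = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]
y = [1, 1, 0, 0]

w, b = fit_logistic(X, y, C=1.0)
w_strong, b_strong = fit_logistic(X, y, C=0.01)  # smaller C, stronger penalty
print("C=1.0 :", w, b, predict(X, w, b))
print("C=0.01:", w_strong, b_strong)
```

Lowering `C` shrinks the weights toward zero, which is the lever to try when training accuracy is far above test accuracy.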
The code prints the following training and evaluation results:
Training accuracy:  0.8926945588554086
Test accuracy: 0.746588693957115
confusion_matrix is:  [[177  64]
 [ 66 206]]
Classification report:               precision    recall  f1-score   support

    negative       0.73      0.73      0.73       241
    positive       0.76      0.76      0.76       272

    accuracy                           0.75       513
   macro avg       0.75      0.75      0.75       513
weighted avg       0.75      0.75      0.75       513
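The figures in the report can be derived directly from the confusion matrix. The short sketch below recomputes them by hand, relying on scikit-learn's convention that rows are true labels and columns are predicted labels, with the classes in sorted order (negative, positive). The variable names are illustrative.

```python
# Confusion matrix from the run above: rows are true labels, columns are
# predicted labels, in the order (negative, positive).
cm = [[177, 64],
      [66, 206]]
labels = ["negative", "positive"]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total
print(f"accuracy: {accuracy:.4f}")

report = {}
for i, name in enumerate(labels):
    true_pos = cm[i][i]
    predicted = sum(row[i] for row in cm)  # column sum: predicted as this class
    actual = sum(cm[i])                    # row sum: the "support"
    precision = true_pos / predicted
    recall = true_pos / actual
    f1 = 2 * precision * recall / (precision + recall)
    report[name] = (round(precision, 2), round(recall, 2), round(f1, 2), actual)
    print(name, report[name])
```

The two classes are misclassified at similar rates (64 of 241 negatives and 66 of 272 positives), and the gap between training accuracy (about 0.89) and test accuracy (about 0.75) suggests the model may be overfitting the training data to some degree.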