Text Classification with Logistic Regression

Posted by bitcarmanlee


1. Text format

sentence,label
游戏太坑,暴率太低,太克金,平民不能玩,negative
让人失望,negative
能解决一下服务器问题?网络正常老掉线,换手机也一样。。。,negative
期待,positive
一星也不想给,这特么简直龟速,炫舞老年版?,negative
衣服不好看游戏内容无特色,界面乱糟糟的,negative
喜欢喜欢,positive
从有了这个手游就一直玩,很喜欢呀,希望更多漂漂衣服,positive
因违反评价条例规定被折叠,negative

2. Data preprocessing

import jieba
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics


def get_stop_words():
    # Placeholder path: a UTF-8 file with one stop word per line.
    filename = "your stop words file path"
    with open(filename, encoding='utf-8') as f:
        # A set keeps the membership test in processing_sentence fast.
        return {line.strip() for line in f}


def processing_sentence(x, stop_words):
    # Segment the sentence with jieba, then drop stop words and space tokens.
    cut_word = jieba.cut(str(x).strip())
    words = [word for word in cut_word if word not in stop_words and word != ' ']
    # Join with spaces so that TfidfVectorizer can tokenize the result.
    return ' '.join(words)


def data_processing():
    # Placeholder path: a CSV file with the header "sentence,label".
    train_file = "your train file path"
    df = pd.read_csv(train_file)
    # Hold out 10% of the rows as the test set.
    x_train, x_test, y_train, y_test = train_test_split(df['sentence'], df['label'], test_size=0.1)
    stop_words = get_stop_words()
    x_train = x_train.apply(lambda x: processing_sentence(x, stop_words))
    x_test = x_test.apply(lambda x: processing_sentence(x, stop_words))

    # Fit the vocabulary and IDF weights on the training split only,
    # then reuse them to transform the test split.
    tf = TfidfVectorizer()
    x_train = tf.fit_transform(x_train)
    x_test = tf.transform(x_test)
    # Convert the sparse matrices to dense arrays.
    x_train_weight = x_train.toarray()
    x_test_weight = x_test.toarray()

    return x_train_weight, x_test_weight, y_train, y_test

Overall, the text is first segmented into words and then converted into TF-IDF features.
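
To make the TF-IDF step concrete, here is a minimal pure-Python sketch of the weighting that TfidfVectorizer applies with its default settings: a smoothed IDF, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. The tfidf function and the toy documents are assumptions made for illustration; the documents are already segmented and joined with spaces, as processing_sentence would produce. Unlike the real vectorizer, the sketch splits on whitespace only. TfidfVectorizer's default token pattern additionally discards single-character tokens, which can matter for Chinese text.

```python
import math
from collections import Counter


def tfidf(docs):
    # docs: list of strings whose tokens are separated by spaces.
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF, matching scikit-learn's default (smooth_idf=True).
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        # L2-normalize the row; guard against an empty document.
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


# Toy documents (illustrative, not from the training data).
docs = ["游戏 失望", "游戏 喜欢 喜欢", "期待"]
vocab, rows = tfidf(docs)
for doc, row in zip(docs, rows):
    print(doc, [round(v, 3) for v in row])
```

In the first toy document, the token that occurs in only one document receives a higher weight than the token shared by two documents, which is the effect the IDF factor is meant to have.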

3. Building the LR model

def model_train():
    x_train_weight, x_test_weight, y_train, y_test = data_processing()
    # L2-regularized logistic regression; C is the inverse regularization strength.
    lr = LogisticRegression(C=1.0, penalty='l2', tol=0.01)
    lr.fit(x_train_weight, y_train)

    # Accuracy on the data the model was fitted on.
    train_score = lr.score(x_train_weight, y_train)
    print("Training accuracy: ", train_score)

    # Evaluate on the held-out test split.
    y_predict = lr.predict(x_test_weight)

    confusion_mat = metrics.confusion_matrix(y_test, y_predict)
    print('Test accuracy:', metrics.accuracy_score(y_test, y_predict))
    print("confusion_matrix is: ", confusion_mat)
    print('Classification report:', metrics.classification_report(y_test, y_predict))

Running the code prints the following results:

Training accuracy:  0.8926945588554086
Test accuracy: 0.746588693957115
confusion_matrix is:  [[177  64]
 [ 66 206]]
Classification report:               precision    recall  f1-score   support

    negative       0.73      0.73      0.73       241
    positive       0.76      0.76      0.76       272

    accuracy                           0.75       513
   macro avg       0.75      0.75      0.75       513
weighted avg       0.75      0.75      0.75       513
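
The figures in the classification report can be re-derived from the confusion matrix. The sketch below does so in plain Python, assuming scikit-learn's convention that rows are true labels and columns are predicted labels, in the order negative, positive. The variable names are illustrative.

```python
# Confusion matrix from the output above: rows = true, columns = predicted.
confusion = [[177, 64],
             [66, 206]]
labels = ["negative", "positive"]

total = sum(sum(row) for row in confusion)
# Accuracy is the share of samples on the diagonal.
accuracy = sum(confusion[i][i] for i in range(len(labels))) / total

report = {}
for i, label in enumerate(labels):
    true_pos = confusion[i][i]
    predicted = sum(confusion[r][i] for r in range(len(labels)))  # column sum
    actual = sum(confusion[i])  # row sum, reported as "support"
    precision = true_pos / predicted
    recall = true_pos / actual
    f1 = 2 * precision * recall / (precision + recall)
    report[label] = (precision, recall, f1, actual)

print(f"accuracy: {accuracy:.4f}")
for label, (p, r, f1, support) in report.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={support}")
```

Rounded to two decimals, these values agree with the precision, recall, f1-score and support columns printed above.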
