Machine Learning/NLP text classification: training a model from corpus of text files - scikit learn

Posted: 2019-01-12 03:23:26

Question:

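I'm very new to machine learning and was wondering if someone could take me through this code and explain why it isn't working. It's my own variation of the scikit-learn tutorial at http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html, which is basically what I am trying to do. I need to train a model with a labelled training set so that when I use my test set, it can predict the labels of the test set. It would also be really helpful if someone could show me how to save and load the model. Many thanks. Here's what I have so far: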

import codecs
import os

import numpy as np
import pandas as pd

from Text_Pre_Processing import Pre_Processing

filenames = os.listdir(
    "...scikit-machine-learning/training_set")
files = []
array_data = []
array_label = []
for file in filenames:
    with codecs.open("...scikit-machine-learning/training_set/" + file, "r",
                     encoding='utf-8', errors='ignore') as file_data:
        open_file = file_data.read()
        open_file = Pre_Processing.lower_case(open_file)
        open_file = Pre_Processing.remove_punctuation(open_file)
        open_file = Pre_Processing.clean_text(open_file)
        files.append(open_file)
# ----------------------------------------------------
# PUTTING LABELS INTO LIST
for file in files:
    if 'inheritance' in file:
        array_data.append(file)
        array_label.append('Inheritance (object-oriented programming)')
    elif 'pagerank' in file:
        array_data.append(file)
        array_label.append('PageRank')
    elif 'vector space model' in file:
        array_data.append(file)
        array_label.append('Vector Space Model')
    elif 'bayes' in file:
        array_data.append(file)
        array_label.append('Bayes' + "'" + ' Theorem')
    else:
        array_data.append(file)
        array_label.append('Dynamic programming')
#----------------------------------------------------------

csv_array = []
for i in range(0, len(array_data)):
    csv_array.append([array_data[i], array_label[i]])

# format of array [[string, label], [string, label], [string, label]]
import csv

with open('data.csv', 'w') as target:
    writer = csv.writer(target)
    writer.writerows(zip(test_array))

data = pd.read_csv('data.csv')
numpy_array = data.as_matrix()

X = numpy_array[:, 0]
Y = numpy_array[:, 1]

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4, random_state=42)

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.naive_bayes import MultinomialNB

from sklearn.pipeline import Pipeline

text_clf = Pipeline(['vect', CountVectorizer(stop_words='english'), 'tfidf', TfidfTransformer(),
                     'clf', MultinomialNB()])

text_clf = text_clf.fit(X_train, Y_train)

predicted = text_clf.predict(X_test)
np.mean(predicted == Y_test)

I've seen people online use csv files to input their data, so I tried that as well. I may not need it, so apologies if it's incorrect.

Error shown:

C:.../scikit-machine-learning/train.py:63: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  numpy_array = data.as_matrix()
Traceback (most recent call last):
  File "C:/...scikit-machine-learning/train.py", line 66, in <module>
    Y = numpy_array[:,1]
IndexError: index 1 is out of bounds for axis 1 with size 1

Many thanks for your help. Let me know if you need further explanation.

Example of two entries in the csv:

"['dynamic programming is an algorithmic technique used to solve certain optimization problems where the object is to find the best solution from a number of possibilities it uses a so called bottomup approach meaning that the problem is solved as a set of subproblems which in turn are made up of subsubproblemssubproblems are then selected and used to solve the overall problem these subproblems are only solved once and the solutions are saved so that they will not need to be recalculated again whilst calculated individually they may also overlap when any subproblem is met again it can be found and reused to solve another problem since it searches all possibilities it is also very accurate this method is far more efficient than recalculating and therefore considerably reduces computation it is widely used in computer science and can be applied for example to compress data in high density bar codes dynamic programming is most effective and therefore most often used on objects that are ordered from left to right and whose order cannot be rearranged this means it works well on character chains for example ', 'Dynamic programming']"

"['inheritance is one of the basic concepts of object oriented programming its objective is to add more detail to preexisting classes whilst still allowing the methods and variables of these classes to be reused the easiest way to look at inheritance is as an is a kind of relationship for example a guitar is a kind of string instrument electric acoustic and steel stringed are all types of guitar the further down an inheritance tree you get the more specific the classes become an example here would be books books generally fall into two categories fiction and nonfiction each of these can then be subdivided into more groups fiction for example can be split into fantasy horror romance and many more nonfiction splits the same way into other topics such as history geography cooking etc history of course can be subdivided into time periods like the romans the elizabethans the world wars and so on', 'Inheritance (object-oriented programming)']"

Comments:

Please give us a sample of the data in data.csv. Thanks.

@âńōŋŷXmoůŜ Added :) Thank you

You can use Pickle to save and load the model.

Answer 1:

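You need to remove the characters [' and '] from your csv, because read_csv sees each row as a single string (one column) instead of a two-column dataframe. There is also a typo on the line text_clf = Pipeline, so I fixed that as well. Good luck!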

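import numpy as np
import pandas as pd

# Each row of data.csv holds a single cell: the string form of a [text, label] list.
data = pd.read_csv('data.csv', header=None)
numpy_array = data.values  # .values replaces the deprecated .as_matrix()

strarr = numpy_array[:, 0]
# Split each cell on the comma, then strip the list brackets and single quotes.
X = [s.split(",")[0].replace("[", '').replace("'", '') for s in strarr]
Y = [s.split(",")[1].replace("]", '').replace("'", '').strip() for s in strarr]  # strip() drops the space after the comma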

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4, random_state=42)

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.naive_bayes import MultinomialNB

from sklearn.pipeline import Pipeline

text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])

text_clf = text_clf.fit(X_train, Y_train)

predicted = text_clf.predict(X_test)
np.mean(predicted == Y_test)
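
The code above strips the brackets after the file has been read. Another option is to avoid writing them in the first place. The sketch below assumes that test_array in the question was meant to be csv_array, a list of [text, label] pairs; the three rows and the to_csv_text helper are made up for illustration. It suggests that zip() over a single list wraps every pair in a 1-tuple, which is why the csv ends up with one column, while passing the pairs directly gives two.

```python
import csv
import io

import pandas as pd

# Hypothetical stand-in for the question's list of [text, label] pairs.
csv_array = [
    ["dynamic programming is an algorithmic technique", "Dynamic programming"],
    ["inheritance is one of the basic concepts", "Inheritance (object-oriented programming)"],
    ["bayes theorem relates conditional probabilities", "Bayes' Theorem"],
]


def to_csv_text(rows):
    """Write rows with csv.writer into memory and return the CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


# zip() over one list yields 1-tuples, so every row is written as a single cell
# containing the string form of the whole [text, label] list.
one_column = pd.read_csv(io.StringIO(to_csv_text(zip(csv_array))), header=None)

# Passing the pairs directly writes one cell per element: text, then label.
two_columns = pd.read_csv(io.StringIO(to_csv_text(csv_array)), header=None)

X = two_columns[0].tolist()
Y = two_columns[1].tolist()
print(one_column.shape, two_columns.shape)
```

When writing to a real file instead of an in-memory buffer, the csv module documentation recommends opening it with newline='' so that no blank rows appear on Windows. Since the texts and labels are already in Python lists, skipping the csv round trip altogether and passing array_data and array_label straight to train_test_split may be simpler still.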

Comments:

This works perfectly! Thank you so much. Note that when you want to use your own text in the predict method, you need to put square brackets around it: str = "hello world" text_clf.predict([str])
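
The question also asked how to save and load the model, and a comment above suggests Pickle. Here is a minimal sketch of that idea combined with the square-bracket tip. The toy training sentences, the shortened labels, and the file name text_clf.pkl are all assumptions for illustration; in practice the real X_train and Y_train would be used.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy training data; substitute the real X_train and Y_train.
X_train = [
    "dynamic programming solves overlapping subproblems once",
    "dynamic programming stores solutions to subproblems",
    "inheritance lets a class reuse methods of a parent class",
    "inheritance builds a tree of more specific classes",
]
Y_train = ["Dynamic programming", "Dynamic programming", "Inheritance", "Inheritance"]

text_clf = Pipeline([
    ("vect", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
text_clf.fit(X_train, Y_train)

# Save the whole fitted pipeline: vocabulary, tf-idf weights and classifier.
with open("text_clf.pkl", "wb") as model_file:
    pickle.dump(text_clf, model_file)

# Load it back, for example in a separate script.
with open("text_clf.pkl", "rb") as model_file:
    loaded_clf = pickle.load(model_file)

# predict() expects an iterable of documents, so a single string goes in a list.
text = "a subclass gains the methods of its parent through inheritance"
prediction = loaded_clf.predict([text])
print(prediction[0])
```

Pickling the pipeline rather than only the classifier keeps the fitted vocabulary with the model, so raw text can be passed to predict after loading. Pickle files should only be loaded from sources you trust, and ideally with the same scikit-learn version that created them.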
