How to load text data correctly in scikit-learn?

【Posted】: 2016-03-17 19:38:53 【Question】:

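I am following this example to create a multinomial naive Bayes classifier for text data in scikit-learn. However, the confusion matrix and the classifier's F-1 score come out wrong. I think the errors are related to the input data format I am using. I have one csv file per training example. Each csv file contains a single row of features such as "blah, blahblah, andsoon". Each file is labeled as positive or negative. How can I read these files correctly?

Here is my code: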

import numpy
import csv
import os  # needed for os.path.isfile below
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
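# Note: sklearn.cross_validation was removed in scikit-learn 0.20; newer
# versions provide KFold in sklearn.model_selection with a different signature.
from sklearn.cross_validation import KFold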
from sklearn.metrics import confusion_matrix, f1_score

NEWLINE = '\n'

NEGATIVE = 'negative'
POSITIVE = 'positive'

SOURCES = [
    ('negative\\', NEGATIVE),
    ('positive\\', POSITIVE)
]

SKIP_FILES = 'cmds'


def build_data_frame(policies, path, classification):
    rows = []
    index = []

    for policy in policies:

        current_csv = path + policy + '.csv'

        # check if file exists
        if (os.path.isfile(current_csv)):

            with open(current_csv, 'r') as csvfile:

                reader = csv.reader(csvfile, delimiter=',', quotechar='"')

                # get each row in policy
                for row in reader:
                    # remove all commas from inside the text lists
                    clean_row = ' '.join(row)
                    rows.append({'text': clean_row, 'class': classification})
                    index.append(current_csv)

    data_frame = DataFrame(rows, index=index)
    return data_frame


def policy_analyzer_main(policies, write_pol_path):
    data = DataFrame({'text': [], 'class': []})
    for path, classification in SOURCES:
        data = data.append(build_data_frame(policies, write_pol_path + path, classification))
    classify(data)

pipeline = Pipeline([
    ('count_vectorizer',   CountVectorizer()),
    ('classifier',         MultinomialNB())
])

def classify(data):

    k_fold = KFold(n=len(data), n_folds=10)
    scores = []
    confusion = numpy.array([[0, 0], [0, 0]])
    for train_indices, test_indices in k_fold:
        train_text = data.iloc[train_indices]['text'].values
        train_y = data.iloc[train_indices]['class'].values.astype(str)

        test_text = data.iloc[test_indices]['text'].values
        test_y = data.iloc[test_indices]['class'].values.astype(str)

        pipeline.fit(train_text, train_y)
        predictions = pipeline.predict(test_text)

        confusion += confusion_matrix(test_y, predictions)
        score = f1_score(test_y, predictions, pos_label=POSITIVE)
        scores.append(score)

    print('Total emails classified:', len(data))
    print('Score:', sum(scores)/len(scores))
    print('Confusion matrix:')
    print(confusion)

Here is an example of the warning message I get:

UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
 ('Total emails classified:', 75)
 ('Score:', 0.025000000000000001)
Confusion matrix:
[[39 35]
 [46 24]]

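【Comments】:

【Answer 1】:

Look at your predictions in each iteration of the train/test split. The warning means that your algorithm labeled every test sample as negative even though some samples in the test set are positive (possibly only one of them is positive, but the warning is raised regardless).

Also look at how your dataset is split, because some test splits may contain only one positive sample, which your classifier then misclassifies.

For example, the warning is raised in this case (to make clear what is happening in your code):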

from sklearn.metrics import f1_score

# here we have only 4 labels of 4 samples
f1_score([0,0,1,0],[0,0,0,0])
/usr/local/lib/python3.4/dist-packages/sklearn/metrics/classification.py:1074: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
'precision', 'predicted', average, warn_for)

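【Comments】:

Thank you. It was indeed a problem with the dataset. I added random shuffling and now it works: k_fold = KFold(n=len(data), n_folds=10, shuffle=True)

@You_got_it, you can additionally look at StratifiedKFold, which takes the labels into account when generating the splits.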
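
The StratifiedKFold suggestion can be sketched as follows. This is an illustration under assumptions, not the asker's final code: the labels are synthetic, the feature array is a placeholder, and the call uses the sklearn.model_selection API of newer scikit-learn versions instead of the older KFold(n=..., n_folds=...) signature shown in the comment.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels, ordered by class as in a dataset built source by source.
labels = np.array(['negative'] * 40 + ['positive'] * 35)
# Placeholder features: only the labels influence how the folds are stratified.
features = np.zeros((len(labels), 1))

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Record which classes appear in each test fold.
fold_classes = []
for train_idx, test_idx in skf.split(features, labels):
    fold_classes.append(sorted(set(labels[test_idx].tolist())))

print(fold_classes)
```

Because stratification preserves the class proportions in every fold, each test fold here contains both negative and positive samples.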
