Use scikit-learn to classify into multiple categories

【Posted】: 2012-05-18 14:06:01

【Question】:

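I'm trying to use one of scikit-learn's supervised learning methods to classify pieces of text into one or more categories. The predict function of every algorithm I have tried returns only one match.

For example, I have a piece of text:

"Theaters in New York compared to those in London"

I have trained the algorithm to pick a place for every text snippet I feed it.

In the example above I would want it to return New York and London, but it only returns New York.

Is it possible to use scikit-learn to return multiple results? Or even to return the label with the next-highest probability?

Thanks for your help.

--- Update

I tried using OneVsRestClassifier, but I still get only one option back per piece of text. Below is the sample code I am using: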

y_train = ('New York','London')


train_set = ("new york nyc big apple", "london uk great britain")
vocab = {'new york' :0,'nyc':1,'big apple':2,'london' : 3, 'uk': 4, 'great britain' : 5}
count = CountVectorizer(analyzer=WordNGramAnalyzer(min_n=1, max_n=2),vocabulary=vocab)
test_set = ('nice day in nyc','london town','hello welcome to the big apple. enjoy it here and london too')

X_vectorized = count.transform(train_set).todense()
smatrix2  = count.transform(test_set).todense()


base_clf = MultinomialNB(alpha=1)

clf = OneVsRestClassifier(base_clf).fit(X_vectorized, y_train)
Y_pred = clf.predict(smatrix2)
print Y_pred

Result: ['New York' 'London' 'London']

【Comments】:

【Answer 1】:

A few multi-class label-encoding examples follow.

Example 1:

import numpy as np
from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()

arr2d = np.array([1, 2, 3,4,5,6,7,8,9,10,11,12,13,14,1])
transformed_label = encoder.fit_transform(arr2d)
print(transformed_label)

The output is

[[1 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 1 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 1]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0]]

Example 2:

import numpy as np
from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()

arr2d = np.array(['Leopard','Lion','Tiger', 'Lion'])
transformed_label = encoder.fit_transform(arr2d)
print(transformed_label)

The output is

[[1 0 0]
 [0 1 0]
 [0 0 1]
 [0 1 0]]
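
As a small sketch of how to read the matrix in Example 2 (not part of the original answer): the columns follow the sorted unique labels stored in `classes_`, and `inverse_transform` maps the one-hot rows back to labels. Note that `LabelBinarizer` encodes exactly one label per sample, so on its own it covers the multi-class case rather than the multi-label case the question asks about.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

encoder = LabelBinarizer()
labels = np.array(['Leopard', 'Lion', 'Tiger', 'Lion'])
encoded = encoder.fit_transform(labels)

# classes_ holds the sorted unique labels; column i of `encoded`
# corresponds to classes_[i].
print(encoder.classes_)

# inverse_transform turns each one-hot row back into its original label.
decoded = encoder.inverse_transform(encoded)
print(decoded)
```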

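【Comments】:

【Answer 2】:

EDIT: Updated for Python 3 and scikit-learn 0.18.1, using MultiLabelBinarizer as suggested.

I have been struggling with this as well, and made a slight enhancement to mwv's excellent answer that may be useful. It takes text labels rather than binary labels as input and encodes them with MultiLabelBinarizer.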

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "the big apple is great",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "people abbreviate new york city as nyc",
                    "the capital of great britain is london",
                    "london is in the uk",
                    "london is in england",
                    "london is in great britain",
                    "it rains a lot in london",
                    "london hosts the british museum",
                    "new york is great and so is london",
                    "i like london better than new york"])
y_train_text = [["new york"],["new york"],["new york"],["new york"],["new york"],
                ["new york"],["london"],["london"],["london"],["london"],
                ["london"],["london"],["new york","london"],["new york","london"]]

X_test = np.array(['nice day in nyc',
                   'welcome to london',
                   'london is rainy',
                   'it is raining in britian',
                   'it is raining in britian and the big apple',
                   'it is raining in britian and nyc',
                   'hello welcome to new york. enjoy it here and london too'])
target_names = ['New York', 'London']

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_train_text)

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
all_labels = mlb.inverse_transform(predicted)

for item, labels in zip(X_test, all_labels):
    print('{0} => {1}'.format(item, ', '.join(labels)))

This gives me the following output:

nice day in nyc => new york
welcome to london => london
london is rainy => london
it is raining in britian => london
it is raining in britian and the big apple => new york
it is raining in britian and nyc => london, new york
hello welcome to new york. enjoy it here and london too => london, new york
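
The question also asks about getting the label with the next-highest score. A minimal sketch of one way to do that, assuming the same kind of pipeline as above: `decision_function` returns one margin per label, which can be sorted to rank the labels. These are LinearSVC margins, not probabilities, and the shortened training set here is only for illustration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Shortened, illustrative subset of the training data used above.
train_texts = ["new york is also called the big apple",
               "nyc is nice",
               "london is in the uk",
               "it rains a lot in london",
               "new york is great and so is london",
               "i like london better than new york"]
train_labels = [["new york"], ["new york"], ["london"], ["london"],
                ["new york", "london"], ["new york", "london"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(train_labels)

model = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
model.fit(train_texts, Y)

test_texts = ['nice day in nyc', 'welcome to london']
# One column of margins per label, in the order of mlb.classes_.
scores = model.decision_function(test_texts)

for text, row in zip(test_texts, scores):
    # Sort label indices from highest to lowest margin.
    ranked = [mlb.classes_[i] for i in np.argsort(row)[::-1]]
    print('{0} => {1}'.format(text, ', '.join(ranked)))
```

If calibrated probabilities are needed, a base estimator that exposes `predict_proba` would be the more natural choice; the ranking logic stays the same.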

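【Comments】:

LabelBinarizer is outdated. Use lb = preprocessing.MultiLabelBinarizer() instead.

It does not return Britain because the only output labels are New York and London.

According to scikit-learn, One-Vs-All is supported by all linear models except sklearn.svm.SVC, and multilabel is also supported by decision trees, random forests and nearest neighbors, so I wouldn't use LinearSVC() for this type of task (that is, multilabel classification, which I assume is what you want).

FYI, the One-Vs-All that @mindstorm mentions corresponds to the scikit-learn class "OneVsRestClassifier" (note "Rest" instead of "All"). This scikit-learn help page clarifies it.

As @mindstorm mentions, on this page the documentation does say: "One-Vs-All: all linear models except sklearn.svm.SVC". However, another multilabel example from the scikit-learn documentation shows a multilabel example with the line classif = OneVsRestClassifier(SVC(kernel='linear')). I'm perplexed.

【Answer 3】:

Change this line to make it work in newer versions (the change is in scikit-learn rather than in Python itself):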

# lb = preprocessing.LabelBinarizer()
lb = preprocessing.MultiLabelBinarizer()
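
A small sketch of what the replacement class does, using the label names from the question (the three sample rows are made up for illustration): each sample's list of labels becomes one row of a binary indicator matrix, with one column per label.

```python
from sklearn import preprocessing

# Each sample carries a list of labels; the last one has two.
y_train_text = [['New York'], ['London'], ['New York', 'London']]

lb = preprocessing.MultiLabelBinarizer()
Y = lb.fit_transform(y_train_text)

# Columns follow the sorted label names stored in classes_.
print(lb.classes_)
print(Y)

# inverse_transform returns one tuple of label names per row.
print(lb.inverse_transform(Y))
```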

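【Comments】:

【Answer 4】:

I ran into this as well. For me the problem was that my y_train was a sequence of strings rather than a sequence of sequences of strings. Apparently, OneVsRestClassifier decides whether to do multi-class or multi-label classification based on the format of the input labels. So change:

y_train = ('New York','London')

to

y_train = (['New York'],['London'])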

Apparently this behavior will go away in the future, since it breaks when all the labels are the same: https://github.com/scikit-learn/scikit-learn/pull/1987
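
To check which mode scikit-learn will infer from a given label format, one option is the `type_of_target` helper. This is a sketch that assumes a recent scikit-learn release, where multi-label targets are passed as a binary indicator matrix; 'Paris' is a hypothetical third label added only so the flat list is not treated as binary.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# A flat sequence of strings: exactly one label per sample.
flat_labels = ['New York', 'London', 'Paris']

# A binary indicator matrix: one column per label, several 1s allowed per row.
indicator = np.array([[1, 0],
                      [0, 1],
                      [1, 1]])

flat_kind = type_of_target(flat_labels)
indicator_kind = type_of_target(indicator)
print(flat_kind)
print(indicator_kind)
```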

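【Comments】:

【Answer 5】:

What you want is called multi-label classification. scikit-learn can do that. See here: http://scikit-learn.org/dev/modules/multiclass.html

I'm not sure what is going wrong in your example; my version of sklearn apparently doesn't have WordNGramAnalyzer. Perhaps it is a matter of using more training examples or trying a different classifier? Note, though, that the multi-label classifier expects the target to be a list of tuples/lists of labels.

The following works for me: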

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "the big apple is great",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "people abbreviate new york city as nyc",
                    "the capital of great britain is london",
                    "london is in the uk",
                    "london is in england",
                    "london is in great britain",
                    "it rains a lot in london",
                    "london hosts the british museum",
                    "new york is great and so is london",
                    "i like london better than new york"])
y_train = [[0],[0],[0],[0],[0],[0],[1],[1],[1],[1],[1],[1],[0,1],[0,1]]
X_test = np.array(['nice day in nyc',
                   'welcome to london',
                   'hello welcome to new york. enjoy it here and london too'])   
target_names = ['New York', 'London']

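classifier = Pipeline([
    # min_n/max_n belong to the scikit-learn release this answer was written for;
    # the comments below report that later releases need ngram_range=(1,2).
    ('vectorizer', CountVectorizer(min_n=1,max_n=2)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, y_train)
predicted = classifier.predict(X_test)
for item, labels in zip(X_test, predicted):
    # Python 2 print statement, as originally posted.
    print '%s => %s' % (item, ', '.join(target_names[x] for x in labels))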

For me, this produces the output:

nice day in nyc => New York
welcome to london => London
hello welcome to new york. enjoy it here and london too => New York, London

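Hope this helps.

【Comments】:

I tried removing the last two training examples, the ones that combine the city names, and I get: hello welcome to new york. enjoy it here and london too => New York. It no longer returns two labels. For me it only returns two labels if I train on the combination of the two cities. Am I missing something? Thanks again for all your help.

This is just a toy dataset; I wouldn't draw too many conclusions from it. Have you tried this procedure on real data?

@CodeMonkeyB: you really should accept this answer; it is correct from a programming point of view. Whether it works in practice depends on your data, not on the code.

Is anyone else having problems with min_n and max_n? I needed to change them to ngram_range=(1,2) to make it work.

It gives this error: ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead.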
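
A sketch of how the answer's code might look with the two changes the comments point to: `ngram_range=(1, 2)` instead of `min_n`/`max_n`, and a binary indicator matrix instead of a sequence of sequences. It assumes a recent scikit-learn on Python 3 and uses a shortened training set, so the predictions are not guaranteed to match the output reported in the answer.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Shortened, illustrative subset of the answer's training data.
X_train = ["new york is also called the big apple",
           "nyc is nice",
           "london is in the uk",
           "it rains a lot in london",
           "new york is great and so is london",
           "i like london better than new york"]
y_train = [[0], [0], [1], [1], [0, 1], [0, 1]]
X_test = ['nice day in nyc',
          'welcome to london',
          'hello welcome to new york. enjoy it here and london too']
target_names = ['New York', 'London']

# Convert the legacy sequence-of-sequences labels into a binary indicator
# matrix; fixing classes=[0, 1] keeps column i aligned with target_names[i].
mlb = MultiLabelBinarizer(classes=[0, 1])
Y = mlb.fit_transform(y_train)

classifier = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, Y)

# predict now returns one 0/1 row per sample rather than lists of indices.
predicted = classifier.predict(X_test)
for item, row in zip(X_test, predicted):
    names = [target_names[i] for i, flag in enumerate(row) if flag]
    print('%s => %s' % (item, ', '.join(names)))
```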
