Why does a frequent word get misclassified?

Posted 2023-03-12

【Title】: Why does a frequent word get misclassified? 【Posted】: 2019-12-21 19:53:07 【Question】:

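I am practicing NLP and use the function below to check the most common words for each category, then look at how some sentences get classified. The results are surprisingly wrong. (Would you suggest another way to do this useful step of finding the most frequent words per category?)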

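# The function
# (written for older scikit-learn; newer releases use get_feature_names_out()
# and feature_log_prob_ in place of get_feature_names() and coef_)
import numpy as np

def show_top10(classifier, vectorizer, categories):
    feature_names = np.asarray(vectorizer.get_feature_names())
    for i, category in enumerate(categories):
        # indices of the 10 largest coefficients for this class
        top10 = np.argsort(classifier.coef_[i])[-10:]
        print("%s: %s" % (category, " ".join(feature_names[top10])))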

#Using the function on the data
show_top10(clf, vectorizer, newsgroups_train.target_names)

#The results seem to be logical
#the most frequent words by category are these:
rec.autos: think know engine don new good just like cars car
rec.motorcycles: riding helmet don know ride bikes dod like just bike
sci.space: don earth think orbit launch moon just like nasa space

#Now, testing these sentences, we see that they are classified wrong and not based 
#on the above most frequent words

texts = ["wheelie", 
    "stars are shining",
    "galaxy"]
text_features = vectorizer.transform(texts)
predictions = clf.predict(text_features)
for text, predicted in zip(texts, predictions):
    print('"{}"'.format(text))
    print("  - Predicted as: '{}'".format(newsgroups_train.target_names[predicted]))
    print("")

The results are:

"wheelie"
  - Predicted as: 'rec.motorcycles'

"stars are shining"
  - Predicted as: 'sci.space'

"galaxy"
  - Predicted as: 'rec.motorcycles'

The word "galaxy" is mentioned many times in the space texts. Why is it not classified correctly?

The classification code is below in case it is needed.

from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics


cats = ['sci.space','rec.autos','rec.motorcycles']
newsgroups_train = fetch_20newsgroups(subset='train',
                           remove=('headers', 'footers', 'quotes'), categories = cats)
newsgroups_test = fetch_20newsgroups(subset='test',
                           remove=('headers', 'footers', 'quotes'), categories = cats)

vectorizer = TfidfVectorizer(max_features = 1000,max_df = 0.5,
                            min_df = 5, stop_words='english')


vectors = vectorizer.fit_transform(newsgroups_train.data)

vectors_test = vectorizer.transform(newsgroups_test.data)

clf = MultinomialNB(alpha=.01)
clf.fit(vectors, newsgroups_train.target)
pred = clf.predict(vectors_test)

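Maybe it is because the accuracy score is 0.77, which leads to some misclassifications. How would you suggest making the model perform better? I actually wanted to use an SVM, but its results were worse, and it gave the word "00" as the most frequent one in every category.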

【Comments】:

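This is very similar to your previous question. Are you going to ask a separate question for every term? Have you looked at how often "galaxy" appears in rec.motorcycles posts? (A quick Google search suggests it is common in the email address of a frequent contributor, but I have not investigated properly.)

Hahaha. Don't let the fact that the code is the same fool you. The other question was about a wrong variable. I don't know whether the email you mention matters, because I have already removed headers, footers and so on. I am learning from a tutorial, and I was hoping for some help understanding which practices could be improved here to raise the accuracy. @tripleee, can you suggest some ways to improve the accuracy of the SVM?

If the word is not a good discriminating indicator, hacking the algorithm will only make the overall results worse. Look at your data first.

Removing those is part of the data cleaning process and is suggested in the scikit-learn documentation. I checked the data before posting, but strangely the word appears only in the space category. I even tried "Andromeda galaxy", and it was again classified as "motorcycles".

【Answer 1】: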

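The frequent word is misclassified because you did not train on frequent words; you trained on vectorized documents. The problem you describe is a text classification problem, that is, assigning a label to a piece of text (here, a news article).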

Your training approach builds one tf-idf vector per document, and you have specified max_features = 1000.

During training the vectorizer is naturally fed a lot of text, which yields comparatively dense vectors (many non-zero entries).

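At test time you are trying to extract 1000 features from two or three words! That produces a very sparse vector. Even if it contains frequent words whose weights may be high, those words contribute too little of the total weight mass across the 1000 features, so the classifier does not have enough high-weight features to base a prediction on.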
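One way to check this on a concrete input is to look at how many non-zero features a short text actually produces. The sketch below uses a small made-up corpus rather than the newsgroups data, and `describe` is a hypothetical helper. It relies on two behaviours of scikit-learn: `transform` ignores words that are not in `vocabulary_`, and for an all-zero vector `MultinomialNB` has only the class priors to go on. Whether "galaxy" survived the `max_features = 1000` cut in the question is an assumption that would have to be checked in the same way.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus; "motorcycles" is deliberately the largest class.
docs = ["car engine", "car dealer", "bike helmet", "bike ride", "bike road", "space orbit"]
labels = ["autos", "autos", "motorcycles", "motorcycles", "motorcycles", "space"]

vectorizer = TfidfVectorizer()
clf = MultinomialNB(alpha=0.01).fit(vectorizer.fit_transform(docs), labels)

def describe(text):
    """Report vocabulary coverage and the prediction for one short text."""
    features = vectorizer.transform([text])
    # Words the vectorizer actually knows; everything else is silently dropped.
    known = [w for w in text.split() if w in vectorizer.vocabulary_]
    # features.nnz is the number of non-zero entries in the sparse vector.
    return known, features.nnz, clf.predict(features)[0]

print(describe("orbit"))
print(describe("galaxy"))
```

In this toy setup "galaxy" is not in the vocabulary, so its vector is entirely zero and the prediction falls back to the class with the largest prior. Running the same check against the fitted `vectorizer` from the question would show whether that is also what happens there.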

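I think this explains why the texts are misclassified. If you want to experiment, I suggest reducing the number of features and, instead of feeding the whole text during training, feeding only the top n frequent words.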
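A minimal sketch of how such an experiment could be set up, assuming a hypothetical helper `accuracy_for` and a toy corpus in place of the newsgroups split. It only varies `max_features` (it does not implement the top-n-words training idea) and says nothing about which setting is better on the real data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

def accuracy_for(max_features, train_texts, train_labels, test_texts, test_labels):
    """Train tf-idf + MultinomialNB with a vocabulary cap and score it."""
    vectorizer = TfidfVectorizer(max_features=max_features)
    clf = MultinomialNB(alpha=0.01)
    clf.fit(vectorizer.fit_transform(train_texts), train_labels)
    predictions = clf.predict(vectorizer.transform(test_texts))
    return accuracy_score(test_labels, predictions)

# Hypothetical toy data; substitute newsgroups_train / newsgroups_test here.
train_texts = ["car engine wheels", "bike helmet ride", "space orbit nasa"]
train_labels = ["autos", "motorcycles", "space"]
test_texts = ["engine wheels", "helmet ride", "orbit nasa"]
test_labels = ["autos", "motorcycles", "space"]

# max_features=None keeps the full vocabulary.
for cap in (2, 5, None):
    score = accuracy_for(cap, train_texts, train_labels, test_texts, test_labels)
    print(cap, score)
```

Comparing the scores across caps on the actual train/test split is what would confirm or refute the explanation above.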

【Comments】:

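I tried reducing that number, but the accuracy got worse. Have you tried checking whether your theory actually improves the results? You can test it easily, because the example I provided is reproducible.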
