Adding words to scikit-learn's CountVectorizer's stop list

Posted 2023-03-12

【Title】: Adding words to scikit-learn's CountVectorizer's stop list 【Posted】: 2014-08-14 16:58:22 【Question】:

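Scikit-learn's CountVectorizer class lets you pass the string 'english' to the stop_words argument. I want to add some words of my own to this predefined list. Can anyone tell me how to do this?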

【Comments on the question】:

Do you mean you want the default 'english' stop_words plus some additional ones of your own?

This post was a lifesaver.

【Answer 1】:

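According to the source code of sklearn.feature_extraction.text, the full list of ENGLISH_STOP_WORDS (actually a frozenset, from stop_words) is exposed through __all__. So if you want to use that list plus some more items, you can do something like: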

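from sklearn.feature_extraction import text

# Hypothetical extra words, for illustration; any sequence of strings works
my_additional_stop_words = ["foo", "bar"]
stop_words = text.ENGLISH_STOP_WORDS.union(my_additional_stop_words)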

(where my_additional_stop_words is any sequence of strings) and use the result as the stop_words argument. This input to CountVectorizer.__init__ is parsed by _check_stop_list, which passes the new frozenset straight through.
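For completeness, here is a minimal end-to-end sketch of the approach above. The extra stop words and the two sample documents are made up for illustration. One assumption worth flagging: some newer scikit-learn releases validate stop_words and accept only 'english', a list, or None, so the sketch wraps the combined set in list(), which should also work on older versions.

```python
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical extra stop words and documents, for illustration only.
my_additional_stop_words = ["scikit", "learn"]
docs = [
    "the scikit learn vectorizer counts words",
    "stop words are removed from the vocabulary",
]

# Built-in frozenset plus the extra words. Converted to a list because
# some newer scikit-learn releases reject a frozenset for stop_words.
stop_words = list(text.ENGLISH_STOP_WORDS.union(my_additional_stop_words))

vectorizer = CountVectorizer(stop_words=stop_words)
counts = vectorizer.fit_transform(docs)

# Sorted vocabulary that survived stop-word filtering.
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
```

Neither the built-in words (such as "the") nor the custom ones ("scikit", "learn") should appear in the printed vocabulary.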

【Comments】:

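Interestingly, there are only 318 stop words in that set. Perhaps these pre-supplied stop words are meant to be extended by whoever uses them.

Works very well with CountVectorizer(stop_words = text.ENGLISH_STOP_WORDS.union(array_example))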
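To see how large the built-in set is in your installed version, and which of your own words actually extend it, a quick check along these lines may help. The contents of array_example are hypothetical; "the" is included on purpose because it is already in the built-in set.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Hypothetical extra words; "the" is already a built-in stop word.
array_example = ["the", "foo", "bar"]

combined = ENGLISH_STOP_WORDS.union(array_example)

# Words from array_example that were not already in the built-in set.
newly_added = sorted(set(array_example) - ENGLISH_STOP_WORDS)

print(len(ENGLISH_STOP_WORDS))  # size of the built-in set
print(len(combined))            # size after adding the extra words
print(newly_added)
```

Because union returns a set, duplicates such as "the" are absorbed silently, so the combined size grows only by the number of genuinely new words.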
