Can I use CountVectorizer in scikit-learn to count frequency of documents that were not used to extract the tokens?

Posted 2023-03-12

[Title]: Can I use CountVectorizer in scikit-learn to count frequency of documents that were not used to extract the tokens? [Posted]: 2014-05-20 04:34:53 [Question]:

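I have been working with the CountVectorizer class in scikit-learn.

I understand that, when used in the manner shown below, the final output is an array containing counts of the features, or tokens.

These tokens are extracted from a set of keywords, i.e.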

tags = [
  "python, tools",
  "linux, tools, ubuntu",
  "distributed systems, linux, networking, tools",
]

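The next step is:

from sklearn.feature_extraction.text import CountVectorizer
# tokenize is not shown in the question; a comma splitter is assumed here
def tokenize(text):
    return [tok.strip() for tok in text.split(",")]
vec = CountVectorizer(tokenizer=tokenize)
data = vec.fit_transform(tags).toarray()
print(data)

which gives us: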

[[0 0 0 1 1 0]
 [0 1 0 0 1 1]
 [1 1 1 0 1 0]]

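This is fine, but my situation is a little different.

I want to extract the features in the same way as above, but I don't want the rows in data to be the same documents that the features were extracted from.

In other words, how can I get counts for another set of documents, say,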

list_of_new_documents = [
  ["python, chicken"],
  ["linux, cow, ubuntu"],
  ["machine learning, bird, fish, pig"]
]

and get:

[[0 0 0 1 0 0]
 [0 1 0 0 0 1]
 [0 0 0 0 0 0]]

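I have read the documentation for the CountVectorizer class and came across the vocabulary argument, which is a mapping of terms to feature indices. However, I can't seem to get this argument to help me.

Any advice is appreciated. PS: all credit for the example I used above goes to Matthias Friedrich's Blog.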

[Answer 1]:

You're right that vocabulary is what you want. It works like this:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> cv = CountVectorizer(vocabulary=['hot', 'cold', 'old'])
>>> cv.fit_transform(['pease porridge hot', 'pease porridge cold', 'pease porridge in the pot', 'nine days old']).toarray()
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 0],
       [0, 0, 1]], dtype=int64)

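So you pass it the features you want, either as a list (as above) or as a dict whose keys are the terms and whose values are the column indices.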
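As a small illustration of the list-versus-dict point, the sketch below builds the same vectorizer both ways and compares the results. The variable names (`docs`, `cv_list`, `cv_dict`) are made up for this example; the documents and vocabulary are the ones from the answer.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "pease porridge hot",
    "pease porridge cold",
    "pease porridge in the pot",
    "nine days old",
]

# Vocabulary as a list: the column order follows the list order.
cv_list = CountVectorizer(vocabulary=["hot", "cold", "old"])

# Vocabulary as a dict: the values are the column indices.
cv_dict = CountVectorizer(vocabulary={"hot": 0, "cold": 1, "old": 2})

counts_list = cv_list.fit_transform(docs).toarray()
counts_dict = cv_dict.fit_transform(docs).toarray()

print(counts_list)
# Both forms should describe the same columns.
print((counts_list == counts_dict).all())
```

With a dict, the indices decide which column each term lands in, so the dict form is mainly useful when a specific column order has to be preserved.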

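If you have used a CountVectorizer on one set of documents and then want to use the set of features from those documents on a new set, take the vocabulary_ attribute of the original CountVectorizer and pass it to the new one. So in your example, you could do

newVec = CountVectorizer(vocabulary=vec.vocabulary_)

to create a new vectorizer using the vocabulary from your first one.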
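Applied to the data in the question, this might look like the sketch below. Two things here are assumptions rather than part of the original posts: the `tokenize` function (the question never shows it, so a comma splitter is assumed) and the new documents being written as a flat list of strings instead of nested lists. Note also that a new vectorizer does not inherit the first one's settings, so the tokenizer is passed again.

```python
from sklearn.feature_extraction.text import CountVectorizer


def tokenize(text):
    # Assumed tokenizer (not shown in the question): split on commas.
    return [tok.strip() for tok in text.split(",")]


tags = [
    "python, tools",
    "linux, tools, ubuntu",
    "distributed systems, linux, networking, tools",
]

# Flat list of strings (assumption: the question nests each string in a list).
new_documents = [
    "python, chicken",
    "linux, cow, ubuntu",
    "machine learning, bird, fish, pig",
]

# Learn the vocabulary from the original tags.
vec = CountVectorizer(tokenizer=tokenize)
vec.fit(tags)

# Reuse the learned vocabulary; the tokenizer has to be supplied again.
new_vec = CountVectorizer(tokenizer=tokenize, vocabulary=vec.vocabulary_)
new_counts = new_vec.fit_transform(new_documents).toarray()

print(new_counts)
```

Because the vocabulary is fixed, terms such as "chicken" or "cow" that never appeared in the tags are simply ignored.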

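[Comments]:

Thanks, this looks great! For the first solution: should the vocabulary always be a dictionary rather than a list? Correct me if I'm wrong, but the counts (0, 1, 2) don't seem to matter. The second approach you outline looks like it may be a bit cleaner.

@MattO'Brien: You're right, it can be a list; I misread the documentation. I have edited my answer. In the second approach, though, it is a dict, because that is what the vocabulary_ attribute of a fitted vectorizer is.

BrenBarn, your answer saved me a lot of time. Seriously. Thanks for being on this site.

Maybe I'm missing something, but instead of initializing a new CountVectorizer with the original vocabulary, couldn't you just call .transform() on the new set of documents with the original vectorizer?

I have a large list of strings (n > 10000), where each string contains 100K to 110K words. How can I make CountVectorizer process this kind of data quickly? Does it use all cores?

[Answer 2]: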

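You should call fit_transform, or just fit, on your original vocabulary source so that the vectorizer learns a vocabulary.

Then you can use this fitted vectorizer on any new data source via the transform() method.

You can obtain the vocabulary produced by the fit (i.e. the mapping of words to token IDs) via vectorizer.vocabulary_ (assuming you named your CountVectorizer vectorizer).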
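To make the fit/transform split concrete, here is a pure-Python sketch of the idea. It is not scikit-learn's implementation: `tokenize`, `fit`, and `transform` are hypothetical helpers written only for this illustration, and the comma-splitting tokenizer is an assumption.

```python
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer for this sketch: lowercase, split on commas.
    return [tok.strip().lower() for tok in text.split(",")]


def fit(documents):
    # Learn a vocabulary: each distinct token gets a column index,
    # assigned in sorted order.
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(documents, vocabulary):
    # Count only tokens already in the vocabulary; unseen tokens are ignored.
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        row = [0] * len(vocabulary)
        for tok, idx in vocabulary.items():
            row[idx] = counts[tok]
        rows.append(row)
    return rows


vocabulary = fit([
    "python, tools",
    "linux, tools, ubuntu",
    "distributed systems, linux, networking, tools",
])
new_rows = transform([
    "python, chicken",
    "linux, cow, ubuntu",
    "machine learning, bird, fish, pig",
], vocabulary)

print(vocabulary)
print(new_rows)
```

The point of the sketch is that the column layout is decided once, at fit time; transform only fills in counts for columns that already exist.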

[Answer 3]:

>>> tags = [
  "python, tools",
  "linux, tools, ubuntu",
  "distributed systems, linux, networking, tools",
]

>>> list_of_new_documents = [
  ["python, chicken"],
  ["linux, cow, ubuntu"],
  ["machine learning, bird, fish, pig"]
]

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vect = CountVectorizer()
>>> tags = vect.fit_transform(tags)

# vocabulary learned by CountVectorizer (vect)
>>> print(vect.vocabulary_)
{'python': 3, 'tools': 5, 'linux': 1, 'ubuntu': 6, 'distributed': 0, 'systems': 4, 'networking': 2}

# counts for tags
>>> tags.toarray()
array([[0, 0, 0, 1, 0, 1, 0],
       [0, 1, 0, 0, 0, 1, 1],
       [1, 1, 1, 0, 1, 1, 0]], dtype=int64)

# to use `transform`, `list_of_new_documents` should be a list of strings 
# `itertools.chain` flattens shallow lists more efficiently than list comprehensions

>>> from itertools import chain
>>> new_docs = list(chain.from_iterable(list_of_new_documents))
>>> new_docs = vect.transform(new_docs)

# finally, counts for new_docs!
>>> new_docs.toarray()
array([[0, 0, 0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0]])

To verify that CountVectorizer is using the vocabulary learned from tags on new_docs: print vect.vocabulary_ again, or compare the output of new_docs.toarray() with that of tags.toarray().
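That check can also be written as assertions. The sketch below assumes the default tokenizer, as in this answer, and uses illustrative variable names (`tag_counts`, `vocab_before`, `new_counts`) that do not appear in the original session.

```python
from sklearn.feature_extraction.text import CountVectorizer

tags = [
    "python, tools",
    "linux, tools, ubuntu",
    "distributed systems, linux, networking, tools",
]
new_docs = [
    "python, chicken",
    "linux, cow, ubuntu",
    "machine learning, bird, fish, pig",
]

vect = CountVectorizer()
tag_counts = vect.fit_transform(tags)

# Snapshot the vocabulary learned from the tags.
vocab_before = dict(vect.vocabulary_)

new_counts = vect.transform(new_docs)

# transform() should leave the learned vocabulary untouched ...
assert vect.vocabulary_ == vocab_before
# ... and both matrices should have one column per vocabulary term.
assert tag_counts.shape[1] == new_counts.shape[1] == len(vocab_before)

# Terms listed in column order.
print(sorted(vocab_before, key=vocab_before.get))
```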
