CountVectorizer but for group of text

Posted 2023-03-12

Asked: 2022-01-21 02:18:29

Question:

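With the code below, CountVectorizer splits "Air-dried meat" into 3 separate vocabulary entries. What I want is to keep "Air-dried meat" as a single entry. How can I do that?

The code I ran: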

from sklearn.feature_extraction.text import CountVectorizer
food_names = ['Air-dried meat', 'Almonds', 'Amaranth']
count_vect = CountVectorizer(binary=True)
bow_rep = count_vect.fit(food_names)
#Look at the vocabulary mapping
print("Our vocabulary: ", count_vect.vocabulary_)

Current output:

Our vocabulary:  {'air': 0, 'dried': 3, 'meat': 4, 'almonds': 1, 'amaranth': 2}

Desired output:

Our vocabulary:  {'air-dried meat': 3, 'almonds': 1, 'amaranth': 2}

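Comments:

If you use 'Air_dried_meat' then it will be treated as a single word. But that may not be useful for other code.

Look at the options of CountVectorizer, i.e. token_pattern. If you use CountVectorizer(binary=True, token_pattern='.+') then every element of the list is treated as a single word.

@furas If you don't mind, I have a follow-up question: after fitting the data as you showed, I tried to transform a sentence with temp = count_vect.transform(["Almonds of Germany"]) print("Almonds of Germany", temp.toarray()), and the result is [[0, 0, 0]]. The expected result is [0, 1, 0] because the sentence includes the word "almonds". How can I do that?

If you use token_pattern='.+' then it seems the same pattern is also used to split "Almonds of Germany", so Almonds of Germany is treated as one word. You can manually split the text into a list ["Almonds", "of", "Germany"], but then you get a separate result for every word: [0 1 0] [0 0 0] [0 0 0]. You may have to use tokenizer=shlex.split together with " " inside the string, as in '"Air-dried meat"'.

I found another method: you can convert food_names with lower() and use it directly as the vocabulary, CountVectorizer(binary=True, vocabulary=food_names), but then fit() will not add new elements later. It does split Almonds of Germany into words in transform(), but transform() also treats Air-dried meat as three words.

Answer 1: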

You can change this behavior with the options of CountVectorizer, i.e. token_pattern or tokenizer.

If you use token_pattern='.+'

CountVectorizer(binary=True, token_pattern='.+')

then every element of the list is treated as a single word.

from sklearn.feature_extraction.text import CountVectorizer

food_names = ['Air-dried meat', 'Almonds', 'Amaranth']

count_vect = CountVectorizer(binary=True, token_pattern='.+')
bow_rep = count_vect.fit(food_names)

print("Our vocabulary:", count_vect.vocabulary_)

Result:

Our vocabulary: {'air-dried meat': 0, 'almonds': 1, 'amaranth': 2}
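
To see why the pattern matters, here is a minimal sketch that mimics the tokenizing step with the standard re module only. It assumes what the scikit-learn documentation describes: text is lowercased first and tokens are all matches of token_pattern, whose default is r"(?u)\b\w\w+\b". The tokenize() helper is hypothetical and is not part of scikit-learn.

```python
import re

# Default token_pattern documented for CountVectorizer: words of 2+ characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Pattern from the answer: one match covering the whole string.
WHOLE_STRING_PATTERN = r".+"


def tokenize(text, pattern):
    """Hypothetical stand-in for the vectorizer's lowercase + findall step."""
    return re.findall(pattern, text.lower())


print(tokenize("Air-dried meat", DEFAULT_PATTERN))           # ['air', 'dried', 'meat']
print(tokenize("Air-dried meat", WHOLE_STRING_PATTERN))      # ['air-dried meat']
print(tokenize("Almonds of Germany", WHOLE_STRING_PATTERN))  # ['almonds of germany']
```

The last line shows the side effect raised in the comments: with '.+' a whole sentence becomes one token, so it can never match the entry 'almonds'.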

If you use tokenizer=shlex.split

CountVectorizer(binary=True, tokenizer=shlex.split)

then you can use " " to group words inside a string

from sklearn.feature_extraction.text import CountVectorizer
import shlex

food_names = ['"Air-dried meat" other words', 'Almonds', 'Amaranth']

count_vect = CountVectorizer(binary=True, tokenizer=shlex.split)
bow_rep = count_vect.fit(food_names)

print("Our vocabulary:", count_vect.vocabulary_)

Result:

Our vocabulary: {'air-dried meat': 0, 'other': 3, 'words': 4, 'almonds': 1, 'amaranth': 2}
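
A short sketch of what shlex.split does on its own, which may help when deciding whether it suits your data. The input "farmer's cheese" is a made-up example, not taken from the question; it is there to show that shell-style parsing rejects unbalanced quotes, something to watch for in free text.

```python
import shlex

# Double quotes keep a phrase together; everything else splits on whitespace.
quoted = shlex.split('"air-dried meat" other words')
unquoted = shlex.split('air-dried meat')
print(quoted)    # ['air-dried meat', 'other', 'words']
print(unquoted)  # ['air-dried', 'meat']

# Assumption for illustration: free text may contain apostrophes.
# shlex treats the apostrophe as an opening quote and raises ValueError.
try:
    shlex.split("farmer's cheese")
    unbalanced_ok = True
except ValueError as exc:
    unbalanced_ok = False
    print("shlex could not parse the text:", exc)
```

Note that without the quotes the phrase becomes two tokens ('air-dried' and 'meat'), because shlex splits only on whitespace and keeps the hyphen.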

BTW: a similar question on the DataScience portal

how to avoid tokenizing w/ sklearn feature extraction

EDIT:

You can also convert food_names with lower() and use it as the vocabulary

vocabulary = [x.lower() for x in food_names]

count_vect = CountVectorizer(binary=True, vocabulary=vocabulary)

This also keeps each name as a single element of the vocabulary

from sklearn.feature_extraction.text import CountVectorizer

food_names = ["Air-dried meat", "Almonds", "Amaranth"]
vocabulary = [x.lower() for x in food_names]

count_vect = CountVectorizer(binary=True, vocabulary=vocabulary)

bow_rep = count_vect.fit(food_names)
print("Our vocabulary:", count_vect.vocabulary_)
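
As a rough check of how a fixed vocabulary behaves in transform(), the sketch below reuses the same data and adds two transform() calls. The variable names almonds_row and meat_row are introduced here for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

food_names = ["Air-dried meat", "Almonds", "Amaranth"]
vocabulary = [x.lower() for x in food_names]

# A fixed vocabulary keeps the list order as the column order.
count_vect = CountVectorizer(binary=True, vocabulary=vocabulary)
count_vect.fit(food_names)
print(count_vect.vocabulary_)  # {'air-dried meat': 0, 'almonds': 1, 'amaranth': 2}

# The text is still tokenized with the default pattern, so single words match...
almonds_row = count_vect.transform(["Almonds of Germany"]).toarray()
print(almonds_row)  # [[0 1 0]]

# ...but the phrase is split into 'air', 'dried', 'meat' and matches nothing.
meat_row = count_vect.transform(["Air-dried meat"]).toarray()
print(meat_row)  # [[0 0 0]]
```

So the vocabulary contains the phrase, but with the default tokenizer no input text can ever produce it.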

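The problem appears when you want to use these methods with transform(). token_pattern='.+' never splits the text being transformed, and a fixed vocabulary splits it with the default pattern, so Air-dried meat never matches. Only tokenizer=shlex.split splits the transformed text and can still keep a phrase together, but the phrase has to be wrapped in " " in the text to be recognized as Air-dried meat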

from sklearn.feature_extraction.text import CountVectorizer
import shlex

food_names = ['"Air-dried meat" Almonds Amaranth']

count_vect = CountVectorizer(binary=True, tokenizer=shlex.split)
bow_rep = count_vect.fit(food_names)
print("Our vocabulary:", count_vect.vocabulary_)

text = 'Almonds of Germany'
temp = count_vect.transform([text])
print(text, temp.toarray())

text = '"Air-dried meat"'
temp = count_vect.transform([text])
print(text, temp.toarray())

text = 'Air-dried meat'
temp = count_vect.transform([text])
print(text, temp.toarray())
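
If adding quotes to every input is not practical, one possible alternative (not from the answer above) is to build a token pattern that tries the known multi-word names first and falls back to ordinary words. The sketch below tests the idea with the standard re module only. build_token_pattern() and tokenize() are hypothetical helpers, and it assumes every phrase starts and ends with a word character so that \b works at both ends.

```python
import re


def build_token_pattern(phrases):
    """Hypothetical helper: known multi-word phrases first, then plain words."""
    # Longest phrases first, so a longer name wins over a shorter one it contains.
    multi = sorted((p.lower() for p in phrases if " " in p), key=len, reverse=True)
    parts = [r"\b" + re.escape(p) + r"\b" for p in multi]
    # Fallback: words of 2+ characters, like CountVectorizer's default pattern.
    parts.append(r"\b\w\w+\b")
    # No capturing groups, so findall() returns whole matches.
    return "|".join(parts)


food_names = ["Air-dried meat", "Almonds", "Amaranth"]
pattern = build_token_pattern(food_names)


def tokenize(text):
    # Lowercase first, as CountVectorizer does by default.
    return re.findall(pattern, text.lower())


print(tokenize("Air-dried meat"))               # ['air-dried meat']
print(tokenize("Almonds of Germany"))           # ['almonds', 'of', 'germany']
print(tokenize("Air-dried meat with almonds"))  # ['air-dried meat', 'with', 'almonds']
```

Since the pattern has no capturing groups, it should be usable as CountVectorizer(binary=True, token_pattern=pattern), but that combination is an assumption here and was not part of the original answer.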
