NLP: Apply CountVectorizer to a column containing a list of features

Posted 2023-03-12

Asked: 2020-08-18 09:54:20

Question:

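I want to apply CountVectorizer to a column that contains lists of words and phrases. In other words, the corpus for each row is a list rather than a single string. The problem is that CountVectorizer, like every related function I have come across, expects a string as input. Joining the list into one string and tokenizing it makes no sense, because some of the phrases consist of multiple words. Any ideas?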

Example:

ID      corpus
1       ["Harry Potter","Batman"]
2       ["Batman", "Superman", "Lord of the Rings"]

Desired result:

ID   Harry Potter    Batman    Superman    Lord of the Rings
1    1               1         0           0
2    0               1         1           1

Answer 1:

Since your data is already tokenized (each row is a list of features), you probably don't need CountVectorizer at all.
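One way to act on this with stock scikit-learn is MultiLabelBinarizer, which takes an iterable of label lists directly. The sketch below is an alternative to the custom class discussed in this answer, not a copy of it. The DataFrame and the column names `ID` and `corpus` are assumed from the question's example. Note that it records presence (0/1) rather than counts, which matches the desired result as long as a row never repeats a feature.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Assumed input: the two example rows from the question.
df = pd.DataFrame({
    "ID": [1, 2],
    "corpus": [
        ["Harry Potter", "Batman"],
        ["Batman", "Superman", "Lord of the Rings"],
    ],
})

# Each list element is treated as one label, so multi-word phrases stay intact.
mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(df["corpus"])

# One column per distinct feature (sorted alphabetically), one row per ID.
result = pd.DataFrame(encoded, columns=mlb.classes_, index=df["ID"])
print(result)
```

The columns come out in sorted order rather than in order of first appearance; reorder them with `result[[...]]` if the layout matters.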

I wrote a MultiLabelCounter() class (linked from the original answer) that solves this problem.

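# MultiLabelCounter is the answerer's own class from the linked post, not part of scikit-learn;
# it must be defined or imported before this snippet will run.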
x = [["Harry Potter","Batman"], ["Batman", "Superman", "Lord of the Rings"]]

mlc = MultiLabelCounter()
mlc.fit_transform(x)
# [[1, 1, 0, 0], [1, 0, 1, 1]]

mlc.classes_
# ['Batman', 'Harry Potter', 'Lord of the Rings', 'Superman']
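The MultiLabelCounter source is not reproduced here, so the snippet above cannot be run as shown. As a rough illustration of the result it reports, here is a hypothetical pure-Python stand-in. The class name `LabelCounter` and its internals are assumptions made for this sketch; it is not the answerer's implementation, and it assumes the original counts how often each label occurs per row.

```python
from collections import Counter


class LabelCounter:
    """Hypothetical stand-in for illustration; not the linked MultiLabelCounter."""

    def fit(self, rows):
        # Collect every distinct label and fix a sorted column order.
        self.classes_ = sorted({label for row in rows for label in row})
        return self

    def transform(self, rows):
        matrix = []
        for row in rows:
            counts = Counter(row)  # labels absent from the row count as 0
            matrix.append([counts[label] for label in self.classes_])
        return matrix

    def fit_transform(self, rows):
        return self.fit(rows).transform(rows)


x = [["Harry Potter", "Batman"], ["Batman", "Superman", "Lord of the Rings"]]

lc = LabelCounter()
matrix = lc.fit_transform(x)
print(matrix)       # [[1, 1, 0, 0], [1, 0, 1, 1]]
print(lc.classes_)  # ['Batman', 'Harry Potter', 'Lord of the Rings', 'Superman']
```

Unlike a binary encoding, this sketch returns 2 for a label that appears twice in the same row. Labels first seen at transform time are silently ignored.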
