How to merge two CountVectorizers when there are duplicates?

Posted 2023-03-12

【Title】: How to merge two CountVectorizers when there are duplicates? 【Posted】: 2021-09-17 19:38:14 【Question】:

Consider this simple example:

import pandas as pd

data = pd.DataFrame({'text1': ['hello world', 'hello universe'],
                     'text2': ['good morning', 'hello two three']})
    
data
Out[489]: 
            text1            text2
0     hello world     good morning
1  hello universe  hello two three

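As you can see, text1 and text2 share exactly one identical word: hello. I am trying to create n-grams separately for text1 and text2, and I want to concatenate the results into a single CountVectorizer object.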

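The idea is that I want to create n-grams separately for the two variables and use them as features in an ML algorithm. However, I do not want the extra n-grams that would be created by concatenating the strings together, such as world good in hello world good morning. This is why I keep the n-gram creation separate.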

The problem is that, done this way, the resulting (sparse) matrix contains a duplicated hello column.

See here:

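from sklearn.feature_extraction.text import CountVectorizer

vector = CountVectorizer(ngram_range=(1, 2))

v1 = vector.fit_transform(data.text1.values)
# get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2
print(vector.get_feature_names_out().tolist())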

['hello', 'hello universe', 'hello world', 'universe', 'world']

v2 = vector.fit_transform(data.text2.values)
print(vector.get_feature_names_out().tolist())

['good', 'good morning', 'hello', 'hello two', 'morning', 'three', 'two', 'two three']

Concatenating v1 and v2 now gives 13 columns:

from scipy.sparse import hstack
print(hstack((v1, v2)).toarray())

[[1 0 1 0 1 1 1 0 0 1 0 0 0]
 [1 1 0 1 0 0 0 1 1 0 1 1 1]]

The correct number of text features should be 12:

hello, world, hello world, good, morning, good morning, hello universe, universe, two, three, hello two, two three

What can I do here to get the correct unique n-grams as features? Thanks!

【Answer 1】:

I think the best way to solve this is to create a custom Transformer that uses CountVectorizer.

I would do it like this:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

class MultiRowsCountVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.vectorizer = CountVectorizer(ngram_range=(1, 2))

    def fit(self, X, y=None):
        # flatten all text columns into one 1-D array so that each field
        # is fitted as an independent document
        X_ = np.reshape(X.values, (-1,))
        self.vectorizer.fit(X_)
        return self

    def transform(self, X, y=None):
        # join all the text columns of each row into one string
        X_ = X.apply(' '.join, axis=1)
        return self.vectorizer.transform(X_)

    def get_feature_names(self):
        # get_feature_names_out() replaces get_feature_names(),
        # which was removed in scikit-learn 1.2
        return self.vectorizer.get_feature_names_out().tolist()


transformer = MultiRowsCountVectorizer()
X_ = transformer.fit_transform(data)
print(transformer.get_feature_names())
print(X_.toarray())

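The fit() method fits the CountVectorizer by treating the columns independently, while transform() treats the columns of a row as one line of text.

np.reshape(X.values, (-1,)) converts a matrix of shape (N, n_columns) into a one-dimensional array of size (N*n_columns,). This ensures that every text field is processed independently during fit(). Afterwards, the transformation is applied to all the text features of a sample by joining them together.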
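As a minimal illustration of those two steps on the question's data (the variable names flat and joined are illustrative, not from the answer), the reshape and the row-wise join produce the following inputs:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'text1': ['hello world', 'hello universe'],
                     'text2': ['good morning', 'hello two three']})

# (2, 2) object array -> flat array of 4 independent strings (row-major order)
flat = np.reshape(data.values, (-1,))
print(flat.shape)
print(flat.tolist())

# One string per row, with the columns joined by a space
joined = data.apply(' '.join, axis=1)
print(joined.tolist())
```

The flat array is what fit() sees, so no n-gram can span two fields while the vocabulary is built. The joined strings are what transform() sees, so each row yields a single count vector.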

This custom Transformer returns the desired 12 features:

['good', 'good morning', 'hello', 'hello two', 'hello universe', 'hello world', 'morning', 'three', 'two', 'two three', 'universe', 'world']

and the following feature matrix:

[[1 1 1 0 0 1 1 0 0 0 0 1]
 [0 0 2 1 1 0 0 1 1 1 1 0]]

Note: this custom transformer assumes that X is a pd.DataFrame with n text columns.

Edit: the text fields need to be joined with a space during transform().

【Comments】:

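Very interesting, thanks! Would you mind explaining a bit more what is going on here? What exactly is X_ = np.reshape(X.values, (-1,)) doing? I assume the DataFrame that is passed in should contain only the two text columns?

I edited my answer to describe this operation. This transformer works with any number of text columns. fit_transform() is inherited from the class TransformerMixin, which simply defines fit_transform() as a call to fit() followed by transform(). See the source code here.

Implementing a custom transformer is a nice solution! But maybe I am missing something: aren't the values of the resulting array wrong?

My bad, the text fields do indeed need to be joined with a space during the transformation. Thanks for your comment.

【Answer 2】: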

Disclaimer: this answer may not be very sophisticated, but if I understood your question correctly, it should do the job.

# create an additional column by chaining the two text columns with a fake word
data['text3'] = data['text1'] + ' xxxxxxxxxx ' + data['text2']
print(data)
#             text1            text2                                      text3
# 0     hello world     good morning        hello world xxxxxxxxxx good morning
# 1  hello universe  hello two three  hello universe xxxxxxxxxx hello two three

# instantiate CountVectorizer and fit it
vector = CountVectorizer(ngram_range=(1, 2))
v3 = vector.fit_transform(data.text3.values)

# have a look at the resulting column names
# get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2
all_colnames = vector.get_feature_names_out().tolist()
print(all_colnames)
# ['good', 'good morning', 'hello', 'hello two', 'hello universe', 'hello world', 'morning', 'three', 'two', 'two three', 'universe', 'universe xxxxxxxxxx', 'world', 'world xxxxxxxxxx', 'xxxxxxxxxx', 'xxxxxxxxxx good', 'xxxxxxxxxx hello']

# select only column names of interest
correct_colnames = [e for e in all_colnames if 'xxxxxxxxxx' not in e]
print(correct_colnames)
# ['good', 'good morning', 'hello', 'hello two', 'hello universe', 'hello world', 'morning', 'three', 'two', 'two three', 'universe', 'world']

print(len(all_colnames))
# 17
print(len(correct_colnames))
# 12   # the desired length

# select only the array columns where the fake word is absent
arr = v3.toarray()[:, ['xxxxxxxxxx' not in e for e in all_colnames]]
print(arr.shape)
print(arr)
# (2, 12)
# [[1 1 1 0 0 1 1 0 0 0 0 1]
#  [0 0 2 1 1 0 0 1 1 1 1 0]]

# if you need a pandas.DataFrame as result
new_df = pd.DataFrame(arr, columns=correct_colnames)
print(new_df)
#    good  good morning  hello  hello two  hello universe  hello world  morning  three  two  two three  universe  world
# 0     1             1      1          0               0            1        1      0    0          0         0      1
# 1     0             0      2          1               1            0        0      1    1          1         1      0

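The rationale behind it: we insert a fake word, such as 'xxxxxxxxxx', that is almost impossible to encounter in the text strings. The algorithm treats it as a real word and therefore creates 1-grams and 2-grams with it.

However, we can eliminate those n-grams afterwards, and identical words (such as 'hello' in this example) are not counted separately for the two text columns. Indeed, you can see in the resulting DataFrame that the word 'hello' is counted twice in the second row and has no duplicated column.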

【Comments】:

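Thanks! This is a nice workaround, but maybe a bit overkill? There is a simple solution in R... I can't believe this is the first time someone has thought of this?

Can you link the simple solution in R? Maybe it can be reproduced in Python somehow. By the way, another solution could be to implement a custom scikit-learn transformer by defining a custom class, but one should be careful when defining the .fit and .transform methods to avoid data leakage... Mine is the simplest (although hacky, as you said) solution that came to mind.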
