Custom Transformer and FeatureUnion for word2vec

Posted: 2018-04-26 12:36:11

Question:

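I am trying to classify a set of text documents using multiple sets of features. I am using sklearn's FeatureUnion to combine the different features into a single model. One of the features consists of word embeddings from gensim's word2vec.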

import numpy as np
from gensim.models.word2vec import Word2Vec
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest

categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
data = fetch_20newsgroups(subset='train', categories=categories)#dummy dataset

w2v_model = Word2Vec(data.data, size=100, window=5, min_count=5, workers=2)
word2vec = {w: vec for w, vec in zip(w2v_model.wv.index2word, w2v_model.wv.syn0)}  # dictionary of word embeddings
feat_select = SelectKBest(score_func=chi2, k=10)  # other features
TSVD = TruncatedSVD(n_components=50, algorithm="randomized", n_iter=5)  # other features

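To include transformers/estimators that are not yet available in sklearn, I am trying to wrap my word2vec results in a custom transformer class that returns the mean of the word vectors.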

class w2vTransformer(TransformerMixin):
    """
    Wrapper class for running word2vec into pipelines and FeatureUnions
    """
    def __init__(self, word2vec, **kwargs):
        self.word2vec = word2vec
        self.kwargs = kwargs
        self.dim = len(word2vec.values())

    def fit(self, x, y=None):
        return self

    def transform(self, X):
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])
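As a small illustration of the averaging this wrapper is meant to perform, here is a minimal sketch on a made-up embedding dictionary. The names toy_word2vec and mean_embedding are hypothetical and not part of the original post. The zero-vector fallback needs the vector length, so this sketch reads it from one of the vectors rather than from the number of dictionary entries.

```python
import numpy as np

# Hypothetical toy embeddings (3 dimensions), standing in for the real word2vec dictionary.
toy_word2vec = {
    "space": np.array([1.0, 0.0, 2.0]),
    "orbit": np.array([3.0, 2.0, 0.0]),
}

# Vector length, taken from one embedding rather than from the vocabulary size.
dim = len(next(iter(toy_word2vec.values())))


def mean_embedding(tokens, embeddings, dim):
    """Average the vectors of known tokens; fall back to zeros if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vectors or [np.zeros(dim)], axis=0)


# Each document is already a list of tokens, which is what the averaging expects.
docs = [["space", "orbit", "unknown"], ["nothing", "here"]]
features = np.array([mean_embedding(doc, toy_word2vec, dim) for doc in docs])
print(features)
```

The first document averages its two known vectors, and the second, which has no known tokens, falls back to a zero vector of the same length.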

However, when it comes time to fit the model, I get an error.

combined_features = FeatureUnion([("w2v_class", w2vTransformer(word2vec)),
                                  ("feat", feat_select),
                                  ("TSVD", TSVD)])  # join features into combined_features
# combined_features = FeatureUnion([("feat", feat_select), ("TSVD", TSVD)])  # runs when word embeddings are not included
text_clf_svm = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('feature_selection', combined_features),
                         ('clf-svm', SGDClassifier(loss="modified_huber")),
                         ])

text_clf_svm_1 = text_clf_svm.fit(data.data, data.target)  # fits data

Traceback (most recent call last):

  File "<ipython-input-8-a085b7d40f8f>", line 1, in <module>
    text_clf_svm_1 = text_clf_svm.fit(data.data,data.target) # fits data

  File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 248, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)

  File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 213, in _fit
    **fit_params_steps[name])

  File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\externals\joblib\memory.py", line 362, in __call__
    return self.func(*args, **kwargs)

  File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 581, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)

  File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 739, in fit_transform
    for name, trans, weight in self._iter())

  File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 779, in __call__
    while self.dispatch_one_batch(iterator):

  File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 625, in dispatch_one_batch
    self._dispatch(tasks)

  File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 588, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)

  File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 111, in apply_async
    result = ImmediateResult(func)

  File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 332, in __init__
    self.results = batch()

  File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]

  File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 131, in <listcomp>
    return [func(*args, **kwargs) for func, args, kwargs in self.items]

  File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 581, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)

  File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\base.py", line 520, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)

  File "<ipython-input-6-cbc52cd420cd>", line 16, in transform
    for words in X

  File "<ipython-input-6-cbc52cd420cd>", line 16, in <listcomp>
    for words in X

  File "<ipython-input-6-cbc52cd420cd>", line 14, in <listcomp>
    np.mean([self.word2vec[w] for w in words if w in self.word2vec]

TypeError: unhashable type: 'csr_matrix'

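I understand that the error occurs because the variable "words" is a csr_matrix, whereas it needs to be an iterable of words, such as a list. My question is: how do I modify the transformer class or the data so that I can use the word embeddings as features to feed into FeatureUnion? This is my first SO post, so please be gentle.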
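The failure described above can be reproduced in isolation. The sketch below is an illustration, not code from the post: it builds a small csr_matrix by hand as a stand-in for what CountVectorizer and TfidfTransformer pass downstream, and uses a hypothetical toy_word2vec dictionary. Iterating over the matrix yields sparse rows rather than token lists, and the dictionary membership test then fails because those rows cannot be hashed.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Stand-in for the vectorizer output: 2 documents, 3 terms.
X = csr_matrix(np.array([[0.0, 1.0, 2.0], [3.0, 0.0, 0.0]]))

# Hypothetical embedding dictionary keyed by word strings.
toy_word2vec = {"space": np.zeros(3)}

row_types = []
error_message = None
for words in X:  # each "words" is a 1 x 3 sparse row, not a list of tokens
    row_types.append(type(words).__name__)
    try:
        # The membership test has to hash w, and w is itself a sparse matrix.
        [w for w in words if w in toy_word2vec]
    except TypeError as exc:
        error_message = str(exc)

print(row_types)
print(error_message)
```

So the transformer itself may be fine for tokenized text. The problem is that by the time it runs, the pipeline has already replaced the text with a term matrix.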

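Comments:

The above code runs without any error on my system. Are you sure this is the complete and identical code that gives you the error? Also try upgrading all the libraries you are using.

I did fail to include several package dependencies in the code; the code has been updated. I just updated my packages and still get the same error.

Still no error on my system.

I can see the line text_clf_svm.fit(training_set.Abstract,training_set.AbKeep) in the error trace, but the code you have given above is for text_clf_svm_1 = text_clf_svm.fit(data.data,data.target).

I am on Ubuntu 14, using Python 2.7.6 with NumPy 1.13.3, SciPy 0.19.1 and Scikit-Learn 0.19.0, if that helps.

Answer 1: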

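You can avoid the error by using the new scikit-learn API that Gensim provides directly, instead of your custom transformer! https://radimrehurek.com/gensim/sklearn_api/w2vmodel.html

Also, this depends on your Gensim version, but in my case I was able to resolve the same error by using the wv attribute of the word2vec object instead of indexing on the object itself.

In the transform method of the w2vTransformer class: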

self.word2vec.wv[w]

instead of

self.word2vec[w]

Hope this helps!
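Separately from the two suggestions above, another possible way around the error is to change where the embedding step sits in the pipeline, so that it receives raw text instead of the tf-idf matrix. The following is only a sketch under stated assumptions: MeanEmbeddingTransformer, the toy documents, and the two-dimensional toy_word2vec dictionary are all hypothetical stand-ins, whitespace splitting replaces a real tokenizer, and the TruncatedSVD branch is left out because the toy data is too small for it.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import FeatureUnion, Pipeline


class MeanEmbeddingTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical variant of w2vTransformer that accepts raw strings."""

    def __init__(self, word2vec):
        self.word2vec = word2vec

    def fit(self, X, y=None):
        # Vector length comes from one embedding, not from the vocabulary size.
        self.dim_ = len(next(iter(self.word2vec.values())))
        return self

    def transform(self, X):
        # Naive whitespace tokenization (an assumption); ideally this matches
        # the tokenization used when the embeddings were trained.
        return np.array([
            np.mean([self.word2vec[w] for w in doc.lower().split() if w in self.word2vec]
                    or [np.zeros(self.dim_)], axis=0)
            for doc in X
        ])


# Toy stand-ins for the newsgroup documents and the trained embeddings.
docs = ["the rocket reached orbit", "god and faith", "orbit of the moon", "faith and belief"]
labels = [0, 1, 0, 1]
toy_word2vec = {"orbit": np.array([1.0, 0.0]), "faith": np.array([0.0, 1.0])}

# Bag-of-words branch: vectorize, weight, then select features.
bow_branch = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("feat", SelectKBest(score_func=chi2, k=3)),
])

# Both branches receive the raw documents, so the embedding branch never sees a csr_matrix.
combined_features = FeatureUnion([
    ("w2v", MeanEmbeddingTransformer(toy_word2vec)),
    ("bow", bow_branch),
])

clf = Pipeline([
    ("features", combined_features),
    ("clf", SGDClassifier(loss="modified_huber", random_state=0)),
])
clf.fit(docs, labels)

features = clf.named_steps["features"].transform(docs)
predictions = clf.predict(docs)
print(features.shape)
```

The design choice here is to move CountVectorizer and TfidfTransformer inside one branch of the FeatureUnion, so each branch does its own preprocessing starting from the same raw strings.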
