CountVectorizer: Vocabulary wasn't fitted

【Title】: CountVectorizer: Vocabulary wasn't fitted 【Posted】: 2015-12-16 22:09:06 【Question】:

I instantiated a sklearn.feature_extraction.text.CountVectorizer object and passed it a vocabulary through the vocabulary argument, but I get the error message sklearn.utils.validation.NotFittedError: CountVectorizer - Vocabulary wasn't fitted. Why?

Example:

import sklearn.feature_extraction
import numpy as np
import pickle

# Save the vocabulary
ngram_size = 1
dictionary_filepath = 'my_unigram_dictionary'
vectorizer = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size,ngram_size), min_df=1)

corpus = ['This is the first document.',
        'This is the second second document.',
        'And the third one.',
        'Is this the first document? This is right.',]

vect = vectorizer.fit(corpus)
print('vect.get_feature_names(): {0}'.format(vect.get_feature_names()))
pickle.dump(vect.vocabulary_, open(dictionary_filepath, 'wb'))  # pickle files are opened in binary mode

# Load the vocabulary
vocabulary_to_load = pickle.load(open(dictionary_filepath, 'rb'))
loaded_vectorizer = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size,ngram_size), min_df=1, vocabulary=vocabulary_to_load)
print('loaded_vectorizer.get_feature_names(): {0}'.format(loaded_vectorizer.get_feature_names()))

Output:

vect.get_feature_names(): [u'and', u'document', u'first', u'is', u'one', u'right', u'second', u'the', u'third', u'this']
Traceback (most recent call last):
  File "C:\Users\Francky\Documents\GitHub\adobe\dstc4\test\CountVectorizerSaveDic.py", line 22, in <module>
    print('loaded_vectorizer.get_feature_names(): {0}'.format(loaded_vectorizer.get_feature_names()))
  File "C:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 890, in get_feature_names
    self._check_vocabulary()
  File "C:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 271, in _check_vocabulary
    check_is_fitted(self, 'vocabulary_', msg=msg),
  File "C:\Anaconda\lib\site-packages\sklearn\utils\validation.py", line 627, in check_is_fitted
    raise NotFittedError(msg % {'name': type(estimator).__name__})
sklearn.utils.validation.NotFittedError: CountVectorizer - Vocabulary wasn't fitted.

【Comments】:

【Answer 1】:

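Even though you passed vocabulary=vocabulary_to_load as an argument to sklearn.feature_extraction.text.CountVectorizer(), you still need to call loaded_vectorizer._validate_vocabulary() before you can call loaded_vectorizer.get_feature_names(). The constructor only stores the vocabulary argument; the fitted attribute vocabulary_ that get_feature_names() checks for is not created until the vocabulary has been validated.

So in your example, do the following when creating the CountVectorizer object with your vocabulary: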

vocabulary_to_load = pickle.load(open(dictionary_filepath, 'rb'))
loaded_vectorizer = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size,
                                        ngram_size), min_df=1, vocabulary=vocabulary_to_load)
loaded_vectorizer._validate_vocabulary()  # private method: sets vocabulary_ from the vocabulary argument
print('loaded_vectorizer.get_feature_names(): {0}'.
  format(loaded_vectorizer.get_feature_names()))
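
Since _validate_vocabulary() is a private method, here is a rough sketch of the same round trip that stays on the public API. It assumes that transform() validates a vocabulary supplied through the constructor before counting, which is my understanding of the scikit-learn releases I am aware of, but please check it against your installed version. The feature_names helper and the temporary directory are illustrative additions, not part of the original code; the helper is only there because newer releases expose get_feature_names_out() in place of get_feature_names().

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document? This is right.']

# Fit once and keep only the learned term -> column index mapping.
fitted = CountVectorizer(ngram_range=(1, 1), min_df=1).fit(corpus)

with tempfile.TemporaryDirectory() as tmp_dir:
    dictionary_filepath = Path(tmp_dir) / 'my_unigram_dictionary'
    # Pickle files are opened in binary mode.
    with open(dictionary_filepath, 'wb') as f:
        pickle.dump(fitted.vocabulary_, f)
    with open(dictionary_filepath, 'rb') as f:
        vocabulary_to_load = pickle.load(f)

# Rebuild a vectorizer around the saved vocabulary; fit() is never called on it.
loaded = CountVectorizer(ngram_range=(1, 1), vocabulary=vocabulary_to_load)

# Assumption: transform() validates the supplied vocabulary and sets vocabulary_.
counts = loaded.transform(['This is the second second document.'])


def feature_names(vectorizer):
    """Hypothetical helper: return feature names on old and new releases."""
    if hasattr(vectorizer, 'get_feature_names_out'):
        return list(vectorizer.get_feature_names_out())
    return list(vectorizer.get_feature_names())


names = feature_names(loaded)
second_count = counts[0, loaded.vocabulary_['second']]
print(names)
print(second_count)
```

If all you need is the document-term matrix, calling transform() directly on the loaded vectorizer is enough, and the feature names can be read afterwards.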

【Comments】:
