Sklearn: vectorizing a document by sentences for classification

Posted 2023-03-12

【Title】: Sklearn vectoring a document by sentences for classification 【Asked】: 2018-12-06 19:06:07 【Question】:

temp = []
for i in chunks:
    vectorizer2 = CountVectorizer()
    vectorizer2.fit_transform(i).todense()
    temp.append(vectorizer2)
    print(vectorizer2.vocabulary_)

x = [LinearSVC_classifier.classify(y) for y in temp ]

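I have a document that I am trying to get into the correct format for my classifier. I have broken the document up into individual lists, so the data looks like this: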

chunks = [[ 'sentence1'] , ['sentence2'], ['sentences']]

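The code I wrote gets me partway there, but then I get this error: ValueError: empty vocabulary; perhaps the documents only contain stop words. Before it fails, though, it also prints vocabulary entries like this: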

u'and': 4, u'www': 53, u'is': 25, u'some': 44, u'commitment': 10

If I run each sentence individually by hand, each one runs with 0 errors and the classifier works fine. I would like my final result to look like this:

['sentence1', 'no'] , ['sentence2', 'yes']

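Or really any format in which I can see how each sentence was classified. I'm just not sure where the error is, whether it can be fixed, or whether I need a new approach. Any help would be much appreciated.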

ValueError                                Traceback (most recent call last)
<ipython-input-608-c2fb95ef6621> in <module>()
  4 for i in chunks:
  5     print (i)
----> 6     vectorizer2.fit_transform(i).todense()
  7     temp.append(vectorizer2)
  8     print(vectorizer2.vocabulary_)

C:\Program Files\Anaconda2\lib\site-packages\sklearn\feature_extraction\text.pyc in fit_transform(self, raw_documents, y)
867 
868         vocabulary, X = self._count_vocab(raw_documents,
--> 869                                           self.fixed_vocabulary_)
870 
871         if self.binary:

C:\Program Files\Anaconda2\lib\site-packages\sklearn\feature_extraction\text.pyc in _count_vocab(self, raw_documents, fixed_vocab)
809             vocabulary = dict(vocabulary)
810             if not vocabulary:
--> 811                 raise ValueError("empty vocabulary; perhaps the documents only"
812                                  " contain stop words")
813 

ValueError: empty vocabulary; perhaps the documents only contain stop words
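
One plausible reading of this traceback, offered as an assumption rather than a diagnosis: because `fit_transform` is called on one chunk at a time, a single chunk with no usable tokens is enough to raise the error. The default `token_pattern` of `CountVectorizer` is `(?u)\b\w\w+\b`, which only keeps tokens of two or more word characters. The sketch below uses only the standard `re` module to check which chunks would produce no tokens under that pattern. The helper name `tokens_for` and the sample chunks are hypothetical.

```python
import re

# Default CountVectorizer token pattern: tokens of 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokens_for(sentence):
    """Approximate the default analysis: lowercase, then regex tokenize."""
    return TOKEN_PATTERN.findall(sentence.lower())


# Hypothetical chunks in the same nested shape as in the question.
chunks = [["Some commitment is here"], ["a I 1"], [""]]

for chunk in chunks:
    for sentence in chunk:
        found = tokens_for(sentence)
        # A chunk with no tokens is the kind that would yield an empty vocabulary.
        status = "ok" if found else "no tokens: would give an empty vocabulary"
        print(repr(sentence), found, status)
```

Running a check like this over the real `chunks` should point at any empty or near-empty sentence that the per-chunk `fit_transform` call cannot handle.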

【Answer 1】:

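Put the initialization outside the loop like this. Otherwise the vectorizer is re-initialized again and again for every sentence, which is not correct.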

temp = []
vectorizer2 = CountVectorizer()   #<--- This needs to be initialized only once
for i in chunks:

    vectorizer2.fit_transform(i).todense()
    temp.append(vectorizer2)
    print(vectorizer2.vocabulary_)

x = [LinearSVC_classifier.classify(y) for y in temp ]
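
Note that `fit_transform` still relearns the vocabulary on every call, even when the constructor sits outside the loop. A minimal sketch of the fit-once idea follows, under some loud assumptions: the training sentences and labels are invented for illustration, and a plain scikit-learn `LinearSVC` stands in for the question's `LinearSVC_classifier`, whose training code is not shown. The vectorizer is fitted once on the training text, and new sentences go through `transform` only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Hypothetical labelled training data; the question does not show its own.
train_sentences = [
    "we have a strong commitment to quality",
    "visit www example com for details",
    "our commitment is some of the best",
    "click the www link now",
]
train_labels = ["yes", "no", "yes", "no"]

# Fit the vectorizer once so the vocabulary (and column layout) is fixed.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_sentences)

classifier = LinearSVC()
classifier.fit(X_train, train_labels)

# Same nested shape as in the question: one single-sentence list per chunk.
chunks = [["there is some commitment here"], ["see www for more"], ["a I"]]

# Flatten to plain strings, then use transform (not fit_transform).
sentences = [sentence for chunk in chunks for sentence in chunk]
X_new = vectorizer.transform(sentences)

# Pair every sentence with its predicted label.
predictions = classifier.predict(X_new)
results = [[sentence, label] for sentence, label in zip(sentences, predictions)]
for row in results:
    print(row)
```

With a fixed vocabulary, a sentence that contains no known tokens becomes an all-zero row instead of raising the empty vocabulary error, and every row has the same number of columns as the training matrix.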

【Comments】:

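Unfortunately that didn't fix it. I'm still getting the stop words error. I don't understand why there is no problem when the sentences are run individually, but there is now.

Can you do me a favor and add a print i statement before `vectorizer2.fit_transform(i).todense()`? I want to see which line it fails on.

I added print i above `vectorizer2.fit_transform(i).todense()`, and I really can't see anything different. I've updated the stack trace with the new code for you. If I print temp at the end, I get this for each sentence... [CountVectorizer(analyzer=u'word', binary=False, decode_error='ignore', dtype=, encoding=u'utf-8', input=u'content', lowercase=True, max_df=1.0, max_features=None, min_df=1, ngram_range=(1, 1), preprocessor=None, stop_words=None, strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, vocabulary=None)]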
