How to tell in advance if CountVectorizer will throw ValueError: empty vocabulary?

Posted 2023-03-12


【Asked】: 2019-05-31 22:22:13 【Question】:

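Is it possible to know in advance whether CountVectorizer will throw

ValueError: empty vocabulary?

Basically, I have a corpus of documents, and I would like to filter out those that will not make it through CountVectorizer (I am using stop_words='english').

Thanks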


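【Answer 1】:

You can identify those documents with build_analyzer(), which returns the same preprocessing, tokenizing and stop-word filtering callable that CountVectorizer applies to each document. A document for which it returns an empty list contributes nothing to the vocabulary. Try this: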

from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    'this is to',
    'she has'
]
analyzer = CountVectorizer(stop_words='english').build_analyzer()
# A document passes only if the analyzer returns at least one token.
filter_condition = [bool(analyzer(doc)) for doc in corpus]
print(filter_condition)
# [True, True, False, True, False, False]

P.S.: At first I was too confused to notice that every word in the third document is a stop word.
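To go from the boolean mask to an actual filtered corpus, something along these lines should work. This is only a minimal sketch: the helper name split_by_analyzer and the shortened corpus are made up for illustration and are not part of scikit-learn. It also shows that, as far as I can tell, the ValueError is raised only when the whole input to fit() yields no tokens; a single all-stop-word document among usable ones just produces an all-zero row.

```python
from sklearn.feature_extraction.text import CountVectorizer


def split_by_analyzer(docs, vectorizer):
    """Hypothetical helper: split docs into (kept, dropped).

    A document is kept when the vectorizer's analyzer returns at least
    one token for it, i.e. something survives stop-word filtering.
    """
    analyzer = vectorizer.build_analyzer()
    kept, dropped = [], []
    for doc in docs:
        (kept if analyzer(doc) else dropped).append(doc)
    return kept, dropped


vectorizer = CountVectorizer(stop_words='english')
corpus = ['This is the first document.', 'this is to', 'she has']
kept, dropped = split_by_analyzer(corpus, vectorizer)

# Fit only when at least one document survived the filter.
if kept:
    X = vectorizer.fit_transform(kept)
    vocab = sorted(vectorizer.vocabulary_)

# Fitting on the dropped documents alone triggers the error from the question.
try:
    CountVectorizer(stop_words='english').fit(dropped)
    raised = False
except ValueError:
    raised = True
```

Passing the vectorizer into the helper, rather than rebuilding one inside it, keeps the check consistent with whatever tokenizer, n-gram or stop-word settings are used for the real fit. Note that this check does not cover min_df or max_df pruning, which is applied after counting and is not part of the analyzer.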
