Using counts and tfidf as features with scikit-learn

Posted 2023-02-23

Question (asked 2015-01-31 08:54:43):

I am trying to use both counts and tfidf as features for a multinomial NB model. Here is my code:

text = ["this is spam", "this isn't spam"]
labels = [0,1]
count_vectorizer = CountVectorizer(stop_words="english", min_df=3)

tf_transformer = TfidfTransformer(use_idf=True)
combined_features = FeatureUnion([("counts", self.count_vectorizer), ("tfidf", tf_transformer)]).fit(self.text)

classifier = MultinomialNB()
classifier.fit(combined_features, labels)

But I get an error involving FeatureUnion and tfidf:

TypeError: no supported conversion for types: (dtype('S18413'),)

Any idea why this happens? Is it not possible to use both counts and tfidf as features at the same time?

Answer 1:

The error does not come from FeatureUnion; it comes from TfidfTransformer.

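You should use TfidfVectorizer instead of TfidfTransformer. The transformer expects a numeric count matrix (for example, the output of CountVectorizer) as input, not raw text, which is why the TypeError is raised.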
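To make the difference concrete, here is a minimal sketch showing that TfidfTransformer works once it is given counts, and that CountVectorizer followed by TfidfTransformer gives the same matrix as TfidfVectorizer with default settings. The three-document corpus `docs` and the variable names are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, only for illustration.
docs = [
    "free money now",
    "call me now",
    "free call today",
]

# Two-step route: raw text -> term counts -> tf-idf weights.
counts = CountVectorizer().fit_transform(docs)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# One-step route: raw text -> tf-idf weights.
tfidf_one_step = TfidfVectorizer().fit_transform(docs)

# With default parameters both routes should produce the same matrix.
same = np.allclose(tfidf_two_step.toarray(), tfidf_one_step.toarray())
print(same)
```

Inside a FeatureUnion every branch receives the same raw text, so a branch that starts from text needs TfidfVectorizer rather than a bare TfidfTransformer.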

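Also, your test sentences are too small for a meaningful tfidf test (with min_df=3, no term in a two-sentence corpus can reach the required document frequency), so try a larger corpus. Here is an example: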

from nltk.corpus import brown  # requires the corpus: nltk.download('brown')

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.naive_bayes import MultinomialNB

# Get a larger text sample from NLTK: the first 100 Brown sentences.
text = [" ".join(i) for i in brown.sents()[:100]]
# The labels are arbitrary; they only serve to make the example run.
labels = ['yes']*50 + ['no']*50
count_vectorizer = CountVectorizer(stop_words="english", min_df=3)
# TfidfVectorizer accepts raw text, unlike TfidfTransformer.
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
# fit_transform returns the combined feature matrix, not the fitted union.
combined_features = FeatureUnion([("counts", count_vectorizer), ("tfidf", tfidf_vectorizer)]).fit_transform(text)
classifier = MultinomialNB()
classifier.fit(combined_features, labels)
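The example above discards the fitted FeatureUnion, so new text cannot be transformed the same way at prediction time. One possible way around that, sketched here under assumptions, is to wrap the union and the classifier in a Pipeline. The tiny spam/ham corpus, the labels, and the step names ("features", "clf") are invented for illustration and are not part of the original question.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

# Hypothetical training data, only for illustration.
train_text = [
    "win free money now",
    "free prize claim now",
    "meeting moved to monday",
    "lunch on monday at noon",
]
train_labels = ["spam", "spam", "ham", "ham"]

# Both branches of the union receive the raw text; their outputs are
# stacked side by side before reaching the classifier.
model = Pipeline([
    ("features", FeatureUnion([
        ("counts", CountVectorizer()),
        ("tfidf", TfidfVectorizer()),
    ])),
    ("clf", MultinomialNB()),
])
model.fit(train_text, train_labels)

# The fitted union can be reused to inspect the combined feature matrix.
features = model.named_steps["features"].transform(train_text)
print(features.shape)

# New text goes through the same fitted vectorizers before classification.
predictions = model.predict(["claim your free money", "monday meeting at noon"])
print(list(predictions))
```

Keeping the vectorizers inside the pipeline means the vocabulary learned during fit is applied unchanged to unseen text.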
