How to classify new documents with tf-idf?

Posted 2023-02-23

【Asked】: 2017-02-27 23:18:30

【Question】:

If I use TfidfVectorizer from sklearn to generate feature vectors:

features = TfidfVectorizer(min_df=0.2, ngram_range=(1,3)).fit_transform(myDocuments)

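how would I then generate feature vectors to classify a new document? After all, you cannot compute tf-idf for a single document.

Would it be a correct approach to extract the feature names with: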

feature_names = TfidfVectorizer.get_feature_names()

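and then count the term frequencies in the new document according to feature_names?

But then I would not get the weights that carry the information about word importance.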

【Answer 1】:

I would rather use gensim with Latent Semantic Indexing as a wrapper over the original corpus: bow->tfidf->lsi

from gensim import models

# corpus: bag-of-words vectors; dictionary: the gensim Dictionary used to build them
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=300)
corpus_lsi = lsi[corpus_tfidf]  # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi

Then, if you need to continue training:

# new_corpus: additional bag-of-words documents built with the same dictionary
new_tfidf = models.TfidfModel(new_corpus)
new_corpus_tfidf = new_tfidf[new_corpus]
lsi.add_documents(new_corpus_tfidf)  # now LSI has been trained on corpus_tfidf + new_corpus_tfidf
lsi_vec = lsi[tfidf_vec]  # convert some new document (given as a tf-idf vector) into the LSI space

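Here, corpus is a bag-of-words corpus.

As you can see in their tutorials: LSI training is unique in that we can continue "training" at any point, simply by providing more training documents. This is done by incremental updates to the underlying model, in a process called online training. Because of this feature, the input document stream may even be infinite: just keep feeding LSI new documents as they arrive, while using the computed transformation model as read-only in the meantime!

If you like scikit-learn, gensim is also compatible with numpy.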

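【Answer 2】:

You need to keep the TfidfVectorizer instance around: it remembers the vocabulary and the term statistics (idf weights) it was fitted on. This may become clearer if, instead of fit_transform, you call fit and transform separately: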

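from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(min_df=0.2, ngram_range=(1,3))
vec.fit(myDocuments)  # learns the vocabulary and the idf weights
features = vec.transform(myDocuments)
new_features = vec.transform(myNewDocuments)  # reuses the fitted vocabulary and idf weights

【Comments】:

The last line of this snippet was originally posted with a typo (fec instead of vec); the correct call is: new_features = vec.transform(myNewDocuments)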
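To make the fit-once, transform-many idea concrete, here is a minimal sketch of the whole classification flow. The toy documents, the labels and the choice of LogisticRegression are assumptions for illustration; the thread does not name a classifier. Default vectorizer settings are used instead of min_df=0.2 and ngram_range=(1,3) to keep the toy corpus small. The pickle round trip is one possible way to keep the fitted vectorizer if it has to outlive the training process.

```python
import pickle

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training data (illustration only).
train_docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell on the market today",
    "the market rallied as stocks rose",
]
train_labels = ["animals", "animals", "finance", "finance"]
new_docs = ["my cat chased the dogs", "stocks and the market"]

# Fit once: the vectorizer stores the vocabulary and the idf weights.
vec = TfidfVectorizer()
features = vec.fit_transform(train_docs)
clf = LogisticRegression().fit(features, train_labels)

# Transform new documents with the SAME fitted instance, so columns and
# idf weights match what the classifier saw during training.
new_features = vec.transform(new_docs)
predictions = clf.predict(new_features)
print(predictions)

# Optional: persist the fitted vectorizer and restore it later.
restored = pickle.loads(pickle.dumps(vec))
same = np.allclose(restored.transform(new_docs).toarray(),
                   new_features.toarray())
print(same)
```

Words in a new document that were not seen during fitting (such as "chased" here) are dropped by transform, and the idf weights come from the training corpus, so a single new document still receives importance-weighted features.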
