Vectorizing documents with sklearn

Posted 2020-12-07 cxq1126

"""
Demo: vectorizing documents
"""
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'Jobs was the chairman of Apple Inc., and he was very famous',
    'I like to use apple computer',
    'And I also like to eat apple',
]

# Vectorize the documents without stop-word filtering
vectorizer = CountVectorizer()
print(vectorizer.fit_transform(corpus).todense())  # convert the sparse result to a full feature matrix
print(vectorizer.vocabulary_)
print(" ")

# Vectorize the documents after stop-word filtering
import nltk
nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')
print(stopwords)

print(" ")
# Note: stop_words='english' selects scikit-learn's built-in English list,
# not the NLTK list printed above
vectorizer = CountVectorizer(stop_words='english')
print("after stopwords removal:\n", vectorizer.fit_transform(corpus).todense())
print("after stopwords removal:\n", vectorizer.vocabulary_)

print(" ")
# Vectorize the documents in n-gram mode
vectorizer = CountVectorizer(ngram_range=(1, 2))  # n from 1 to 2: both unigrams and bigrams
print("N-gram mode:\n", vectorizer.fit_transform(corpus).todense())  # convert to a full feature matrix
print(" ")
print("N-gram mode:\n", vectorizer.vocabulary_)

Document vectorization without stop-word filtering:

[image]
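
The dense matrix is easier to read when each column is paired with its term. The following is a minimal sketch of one way to do that, using only the `vocabulary_` mapping (term to column index); the names `terms` and `doc_term_counts` are illustrative and do not appear in the original listing.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'Jobs was the chairman of Apple Inc., and he was very famous',
    'I like to use apple computer',
    'And I also like to eat apple',
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus).toarray()  # shape: (documents, terms)

# vocabulary_ maps term -> column index; sorting by index gives the column order
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# For each document, keep only the terms that actually occur in it
doc_term_counts = [
    {term: int(n) for term, n in zip(terms, row) if n > 0}
    for row in counts
]

for doc, term_counts in zip(corpus, doc_term_counts):
    print(doc)
    print("   ", term_counts)
```

With the default settings the text is lowercased and tokens shorter than two characters are dropped, which is why "I" never shows up as a column.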

All of the stop words:

[image]
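
The list printed here comes from NLTK, while `stop_words='english'` uses scikit-learn's own built-in list. If a specific list should drive the filtering, `stop_words` also accepts an explicit list of words. A minimal sketch follows, assuming a small hand-written list (`custom_stop_words` is hypothetical and chosen only for this corpus; the NLTK list could be passed the same way).

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'Jobs was the chairman of Apple Inc., and he was very famous',
    'I like to use apple computer',
    'And I also like to eat apple',
]

# Hypothetical stop-word list for illustration; not taken from NLTK or scikit-learn
custom_stop_words = ['was', 'the', 'of', 'and', 'he', 'very', 'to', 'also']

custom_vectorizer = CountVectorizer(stop_words=custom_stop_words)
custom_matrix = custom_vectorizer.fit_transform(corpus).toarray()

# Terms that survive the filtering, in alphabetical (column) order
remaining_terms = sorted(custom_vectorizer.vocabulary_)
print(remaining_terms)
print(custom_matrix)
```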

Document vectorization after stop-word filtering:

[image]
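
To see exactly which terms the filter removed, the two vocabularies can be compared. This is a rough sketch under the assumption that the built-in `'english'` list is used, as in the listing; `removed_terms` and `kept_terms` are illustrative names.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

corpus = [
    'Jobs was the chairman of Apple Inc., and he was very famous',
    'I like to use apple computer',
    'And I also like to eat apple',
]

plain = CountVectorizer().fit(corpus)
filtered = CountVectorizer(stop_words='english').fit(corpus)

# Terms present before filtering but absent afterwards
removed_terms = set(plain.vocabulary_) - set(filtered.vocabulary_)
kept_terms = set(filtered.vocabulary_)

print("removed:", sorted(removed_terms))
print("kept:   ", sorted(kept_terms))

# Every removed term should come from scikit-learn's built-in list
print(removed_terms <= set(ENGLISH_STOP_WORDS))
```

The built-in list is fairly broad, so it is worth checking the removed terms against the corpus before relying on it.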

Document vectorization in n-gram mode:

[image]
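
To see what `ngram_range=(1, 2)` adds for a single document, the vectorizer's analyzer can be called directly. This is a small sketch, not part of the original listing; `build_analyzer()` returns the function that `CountVectorizer` applies to each document before counting.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'Jobs was the chairman of Apple Inc., and he was very famous',
    'I like to use apple computer',
    'And I also like to eat apple',
]

ngram_vectorizer = CountVectorizer(ngram_range=(1, 2))

# The analyzer lowercases, tokenizes, and emits unigrams followed by bigrams
analyze = ngram_vectorizer.build_analyzer()
features = analyze(corpus[1])
print(features)

# Fit on the whole corpus and split the vocabulary by n-gram length
ngram_vectorizer.fit(corpus)
unigrams = [t for t in ngram_vectorizer.vocabulary_ if ' ' not in t]
bigrams = [t for t in ngram_vectorizer.vocabulary_ if ' ' in t]
print(len(unigrams), "unigrams,", len(bigrams), "bigrams")
```

Bigrams are built from adjacent tokens within one document, so the vocabulary grows quickly as the upper bound of `ngram_range` increases.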
