Python: Extracting Features from Text


'''
Feature extraction from text

The sklearn.feature_extraction.text submodule gathers utilities to build feature vectors from text documents:

- feature_extraction.text.CountVectorizer([…]): convert a collection of text documents to a matrix of token counts.
- feature_extraction.text.HashingVectorizer([…]): convert a collection of text documents to a matrix of token occurrences.
- feature_extraction.text.TfidfTransformer([…]): transform a count matrix to a normalized tf or tf-idf representation.
- feature_extraction.text.TfidfVectorizer([…]): convert a collection of raw documents to a matrix of TF-IDF features.
'''
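
Of the four utilities listed above, HashingVectorizer is the only one the code below never exercises. The following is a minimal sketch of how it might be used. The `sample_docs` list and the very small `n_features=2**4` are illustrative assumptions chosen to keep the output readable, not recommended settings.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Illustrative documents (an assumption for this sketch).
sample_docs = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# HashingVectorizer is stateless: tokens are hashed straight to column
# indices, so there is nothing to fit and no vocabulary is stored.
hasher = HashingVectorizer(n_features=2**4)  # tiny on purpose; the default is 2**20
X_hashed = hasher.transform(sample_docs)

print(X_hashed.shape)      # (4, 16): one row per document, one column per hash bucket
print(X_hashed.toarray())  # normalized, signed occurrence values
```

The trade-off is that column indices cannot be mapped back to tokens, and different tokens may collide in the same column when `n_features` is small.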

## **CountVectorizer**
# Convert a collection of text documents to a matrix of token counts  
# http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = [
  'This is the first document.',
  'This is the second second document.',
  'And the third one.',
  'Is this the first document?',
]
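X = vectorizer.fit_transform(corpus)  # sparse document-term matrix of token counts
# get_feature_names_out() needs scikit-learn >= 1.0; get_feature_names() was removed in 1.2
print(vectorizer.get_feature_names_out())  # vocabulary, one entry per column of X
print(X.toarray())  # dense view: one row per document, one column per token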
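
Two follow-up questions tend to come up once the count matrix exists: which column belongs to which token, and what happens to words the vectorizer never saw during fitting. The sketch below illustrates both, and also shows `ngram_range` as one way to count word pairs. The new sentence and the variable names `new_counts` and `X_bigrams` are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# vocabulary_ maps each token to its column index in X.
print(vectorizer.vocabulary_.get('document'))  # 1

# Tokens that were not seen during fit are silently ignored by transform.
new_counts = vectorizer.transform(['Something completely new.']).toarray()
print(new_counts)  # a single row of zeros

# ngram_range=(1, 2) counts single words and adjacent word pairs,
# which keeps some local word-order information.
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
X_bigrams = bigram_vectorizer.fit_transform(corpus)
print(X_bigrams.shape[1] > X.shape[1])  # True: bigrams add extra columns
```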

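## TfidfTransformer
# Transform a count matrix to a normalized tf or tf-idf representation
from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer(smooth_idf=False)  # smooth_idf=False: idf = ln(n / df) + 1
tfidf = transformer.fit_transform(X)  # X is the count matrix produced by CountVectorizer above
print(tfidf.toarray())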
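
To make the effect of `smooth_idf=False` concrete, the weights can be recomputed by hand. This sketch assumes the other TfidfTransformer defaults (`use_idf=True`, `norm='l2'`, `sublinear_tf=False`), under which idf(t) = ln(n / df(t)) + 1 and each row is scaled to unit Euclidean length. The names `manual` and `reference` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

count_matrix = CountVectorizer().fit_transform(corpus)
counts = count_matrix.toarray()

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)      # number of documents containing each token
idf = np.log(n_docs / df) + 1.0    # unsmoothed idf, as used with smooth_idf=False
raw = counts * idf                 # term frequency times idf
manual = raw / np.linalg.norm(raw, axis=1, keepdims=True)  # l2-normalize each row

reference = TfidfTransformer(smooth_idf=False).fit_transform(count_matrix).toarray()
print(np.allclose(manual, reference))  # True
```

Note that a token appearing in every document (here, "the") still gets idf = 1 rather than 0, so it is down-weighted but not discarded.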

## TfidfVectorizer
# Combines all the options of CountVectorizer and TfidfTransformer in a single model
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(corpus)  # raw documents in, TF-IDF matrix out
print(X_tfidf.toarray())
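
The claim that TfidfVectorizer combines the other two classes can be checked directly. The sketch below, using default parameters throughout, compares the one-step result with a CountVectorizer followed by a TfidfTransformer. The names `one_step` and `two_step` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# One step: raw text straight to TF-IDF features.
one_step = TfidfVectorizer().fit_transform(corpus)

# Two steps: token counts first, then tf-idf weighting of those counts.
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

print(np.allclose(one_step.toarray(), two_step.toarray()))  # True
```

The two-step form is mainly useful when the raw counts are needed for something else as well; otherwise the single class is less code to maintain.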
