Email Tokenization: Removing Stopwords

Posted by fanfanfan


Foreword: This post, put together by the editors at 小常识网 (cha138.com), mainly introduces tokenizing email text and removing stopwords; hopefully it offers some useful reference.

!pip install nltk


# Sample email text (hard-coded here rather than read from a file)
text = "Be assured that individual statistics are not disclosed and this is for internal use only. I am pleased to inform you that you have been accepted to join the workshop scheduled for 22-24 Nov, 2008."
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# Preprocessing
def preprocessing(text):
    # Tokenize sentence by sentence, then word by word
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    # Lowercase first, so capitalized stopwords like "Be" and "I" are also caught
    tokens = [token.lower() for token in tokens]
    # Remove English stopwords
    stops = stopwords.words('english')
    tokens = [token for token in tokens if token not in stops]
    # Drop very short tokens (punctuation and the like)
    tokens = [token for token in tokens if len(token) >= 3]
    # Lemmatize each remaining token
    lmtzr = WordNetLemmatizer()
    tokens = [lmtzr.lemmatize(token) for token in tokens]
    # Join with spaces; an empty-string join would run the words together
    preprocessed_text = ' '.join(tokens)
    return preprocessed_text

preprocessing(text)
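In a notebook, the call above displays the cleaned string directly. As a rough sketch of what to expect on the sample email (the exact lemmas depend on your NLTK data version; for instance, the WordNet lemmatizer maps "statistics" to "statistic" under its default noun handling):

print(preprocessing(text))
# Approximate output:
# assured individual statistic disclosed internal use pleased inform accepted join workshop scheduled 22-24 nov 2008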


# Split the dataset
from sklearn.model_selection import train_test_split
# Generate 100 samples: 100 two-dimensional feature vectors with 100 matching labels
x = [["feature ", "one "]] * 50 + [["feature ", "two "]] * 50
y = [1] * 50 + [2] * 50
# Randomly hold out 30% of the data as the test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
print("train:", len(x_train), "test:", len(x_test))
# Inspect the held-out test set
for i in range(len(x_test)):
    print("".join(x_test[i]), y_test[i])


 
