Natural Language Processing: Stemmers

Posted by 不哭的女孩

This post introduces two of the ready-made stemmers that ship with NLTK: Porter and Lancaster.

1. Porter

>>> import nltk
>>> porter=nltk.PorterStemmer()
>>> raw='''Listen, strange women lying in ponds distributing swords is no basis
... for a system of government. Supreme executive power derives from a mandate from
... the masses, not from some farcical aquatic'''
>>> tokens=nltk.word_tokenize(raw)
>>> [porter.stem(t) for t in tokens]
['listen', ',', 'strang', 'women', 'lie', 'in', 'pond', 'distribut', 'sword', 'is', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem', 'execut', 'power', 'deriv', 'from',
'a', 'mandat', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcic', 'aquat']

2. Lancaster

>>> lancaster=nltk.LancasterStemmer()
>>> [lancaster.stem(t) for t in tokens]
['list', ',', 'strange', 'wom', 'lying', 'in', 'pond', 'distribut', 'sword', 'is', 'no', 'bas', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem', 'execut', 'pow', 'der', 'from', 'a', 'mand', 'from',
'the', 'mass', ',', 'not', 'from', 'som', 'farc', 'aqu']

3. Lemmatizer: removes an affix only if the resulting word is in its dictionary. A commonly used one is WordNetLemmatizer.

>>> wnl=nltk.WordNetLemmatizer()
>>> [wnl.lemmatize(t) for t in tokens]
['Listen', ',', 'strange', 'woman', 'lying', 'in', 'pond', 'distributing', 'sword', 'is', 'no', 'basis', 'for', 'a', 'system', 'of', 'government', '.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a', 'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical', 'aquatic']

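The output above shows that, of the two stemmers, Porter does better on this text: it maps "lying" to "lie", which Lancaster leaves unchanged, and it keeps "women" and "power" intact where Lancaster cuts them down to "wom" and "pow".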
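To make the difference between the two stemmers easier to spot, here is a minimal sketch that lists only the tokens on which Porter and Lancaster disagree. The helper name `compare_stemmers` is hypothetical, and the regex tokenizer is a stand-in for `nltk.word_tokenize` so that the script runs without downloading tokenizer data. It keeps alphabetic tokens only, so punctuation is dropped.

```python
import re

from nltk.stem import LancasterStemmer, PorterStemmer


def compare_stemmers(text):
    """Return (token, porter_stem, lancaster_stem) for tokens where the stemmers disagree.

    Hypothetical helper for illustration; not part of NLTK.
    """
    porter = PorterStemmer()
    lancaster = LancasterStemmer()
    # Assumption: a plain regex tokenizer (runs of letters) instead of
    # nltk.word_tokenize, to avoid needing extra NLTK data files.
    tokens = re.findall(r"[A-Za-z]+", text)
    rows = []
    for token in tokens:
        porter_stem = porter.stem(token)
        lancaster_stem = lancaster.stem(token)
        if porter_stem != lancaster_stem:
            rows.append((token, porter_stem, lancaster_stem))
    return rows


if __name__ == "__main__":
    raw = "strange women lying in ponds distributing swords"
    for token, porter_stem, lancaster_stem in compare_stemmers(raw):
        # Columns: original token, Porter stem, Lancaster stem
        print(f"{token:<14}{porter_stem:<12}{lancaster_stem}")
```

Running the same helper over a larger sample of your own text is a quick way to judge which stemmer suits a given task before committing to one.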
