Lemmatisation & Stemming


Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on the part of speech. However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications.

1. Stemmer: extracts the stem or root form of a word (which does not necessarily carry the word's full meaning). Porter Stemmer, based on the Porter stemming algorithm:

  >>> from nltk.stem.porter import PorterStemmer
  >>> porter_stemmer = PorterStemmer()
  >>> porter_stemmer.stem('maximum')
  u'maximum'
  >>> porter_stemmer.stem('presumably')
  u'presum'
  >>> porter_stemmer.stem('multiply')
  u'multipli'
  >>> porter_stemmer.stem('provision')
  u'provis'
  >>> porter_stemmer.stem('owed')
  u'owe'
  

Lancaster Stemmer, based on the Lancaster stemming algorithm:

  >>> from nltk.stem.lancaster import LancasterStemmer
  >>> lancaster_stemmer = LancasterStemmer()
  >>> lancaster_stemmer.stem('maximum')
  'maxim'
  >>> lancaster_stemmer.stem('presumably')
  'presum'
  >>> lancaster_stemmer.stem('multiply')
  'multiply'
  >>> lancaster_stemmer.stem('provision')
  'provid'
  >>> lancaster_stemmer.stem('owed')
  'ow'

Snowball Stemmer, based on the Snowball stemming algorithm:

  >>> from nltk.stem import SnowballStemmer
  >>> snowball_stemmer = SnowballStemmer("english")
  >>> snowball_stemmer.stem('maximum')
  u'maximum'
  >>> snowball_stemmer.stem('presumably')
  u'presum'
  >>> snowball_stemmer.stem('multiply')
  u'multipli'
  >>> snowball_stemmer.stem('provision')
  u'provis'
  >>> snowball_stemmer.stem('owed')
  u'owe'
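The three stemmers above differ mainly in how aggressively they truncate. As a quick sketch using the same example words, they can be compared side by side:

```python
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem import SnowballStemmer

examples = ['maximum', 'presumably', 'multiply', 'provision', 'owed']
stemmers = [('Porter', PorterStemmer()),
            ('Lancaster', LancasterStemmer()),
            ('Snowball', SnowballStemmer('english'))]

for name, stemmer in stemmers:
    # stem() maps a single token to its stem; no context or POS is used
    print(name.ljust(9), [stemmer.stem(w) for w in examples])
```

Note that Lancaster is the most aggressive ('maximum' → 'maxim', 'owed' → 'ow'), while Snowball agrees with Porter on all five of these words.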

2. Lemmatization: reduces any inflected form of a word to its base (dictionary) form; it works best when the part of speech is supplied.

  >>> from nltk.stem.wordnet import WordNetLemmatizer
  >>> lmtzr = WordNetLemmatizer()
  >>> lmtzr.lemmatize('cars')
  'car'
  >>> lmtzr.lemmatize('feet')
  'foot'
  >>> lmtzr.lemmatize('people')
  'people'
  >>> lmtzr.lemmatize('fantasized', pos="v")  # POS tag
  'fantasize'
  

3. MaxMatch: a greedy longest-match segmentation algorithm, commonly used for word segmentation in Chinese NLP.

  from nltk.stem import WordNetLemmatizer
  from nltk.corpus import words

  wordlist = set(words.words())
  wordnet_lemmatizer = WordNetLemmatizer()

  def max_match(text):
      pos2 = len(text)
      result = ''
      while len(text) > 0:
          word = wordnet_lemmatizer.lemmatize(text[0:pos2])
          # accept a single character even when it is not in the wordlist,
          # so an unknown word cannot send pos2 below 1 and loop forever
          if word in wordlist or pos2 == 1:
              result = result + text[0:pos2] + ' '
              text = text[pos2:]
              pos2 = len(text)
          else:
              pos2 = pos2 - 1
      return result[0:-1]

  >>> string = 'theyarebirds'
  >>> print(max_match(string))
  they are birds
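MaxMatch is greedy: the longest dictionary prefix always wins, even when a shorter split would be correct. A self-contained sketch with a toy dictionary (the word lists here are invented to show the failure mode) illustrates this:

```python
def max_match_toy(text, dictionary):
    # Greedy longest-prefix segmentation over a fixed dictionary;
    # unknown single characters are emitted as-is so the loop terminates.
    segments = []
    while text:
        for end in range(len(text), 0, -1):
            if text[:end] in dictionary or end == 1:
                segments.append(text[:end])
                text = text[end:]
                break
    return ' '.join(segments)

print(max_match_toy('theyarebirds', {'they', 'are', 'birds'}))
# they are birds

toy_dict = {'the', 'theta', 'table', 'bled', 'down', 'own', 'there'}
print(max_match_toy('thetabledownthere', toy_dict))
# theta bled own there  -- not the intended 'the table down there'
```

Because 'theta' is a longer dictionary word than 'the', the greedy match commits to the wrong segmentation; practical systems mitigate this with backward matching or statistical segmentation models.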

References:
https://marcobonzanini.com/2015/01/26/stemming-lemmatisation-and-pos-tagging-with-python-and-nltk/
http://blog.csdn.net/baimafujinji/article/details/51069522
