Python nltk English Detection

Posted by 龟窝

Preface: This article was compiled by the editors of 小常识网 (cha138.com). It introduces English detection with Python's nltk; we hope you find it a useful reference.

http://blog.alejandronolla.com/2013/05/15/detecting-text-language-with-python-and-nltk/

 

>>> from nltk import wordpunct_tokenize
>>> wordpunct_tokenize("That's thirty minutes away. I'll be there in ten.")
['That', "'", 's', 'thirty', 'minutes', 'away', '.', 'I', "'", 'll', 'be', 'there', 'in', 'ten', '.']
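wordpunct_tokenize is a regex-based tokenizer, so its split of words from punctuation can be reproduced with the standard library alone. A minimal sketch, assuming the tokenizer's documented pattern of "runs of word characters or runs of non-word, non-space characters":

```python
import re

def wordpunct(text):
    # Runs of word characters, or runs of punctuation (non-word, non-space).
    # This mirrors the behavior of nltk's WordPunctTokenizer.
    return re.findall(r"\w+|[^\w\s]+", text)

print(wordpunct("That's thirty minutes away. I'll be there in ten."))
```

Note how the apostrophe in "That's" becomes its own token, separating "That" from "s" — this is why the scoring step below lowercases and compares whole tokens rather than raw substrings.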

 

>>> from nltk.corpus import stopwords
>>> stopwords.fileids()
['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'portuguese', 'russian', 'spanish', 'swedish', 'turkish']
>>>
>>> stopwords.words('english')[0:10]
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your']

 

>>> languages_ratios = {}
>>>
>>> tokens = wordpunct_tokenize(text)
>>> words = [word.lower() for word in tokens]
>>> for language in stopwords.fileids():
...     stopwords_set = set(stopwords.words(language))
...     words_set = set(words)
...     common_elements = words_set.intersection(stopwords_set)
...     languages_ratios[language] = len(common_elements)  # language "score"
...
>>>
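Once every language has a score, the most likely language is simply the key with the highest count, e.g. `max(languages_ratios, key=languages_ratios.get)`. A self-contained sketch of the whole idea — using a tiny hand-rolled stopword table and a regex tokenizer so it runs without the NLTK corpus downloads; the real code would use `stopwords.words(language)` and `wordpunct_tokenize` as shown above:

```python
import re

# Illustrative stopword lists only; nltk.corpus.stopwords provides the real ones.
STOPWORDS = {
    "english": {"i", "me", "my", "we", "you", "the", "be", "there", "in", "a"},
    "spanish": {"yo", "mi", "nosotros", "el", "la", "en", "de", "que", "un", "no"},
    "french":  {"je", "nous", "vous", "le", "la", "les", "de", "que", "un", "et"},
}

def detect_language(text):
    """Score each language by stopword overlap and return the best match."""
    words = {word.lower() for word in re.findall(r"\w+", text)}
    languages_ratios = {
        language: len(words & stopset)  # language "score"
        for language, stopset in STOPWORDS.items()
    }
    return max(languages_ratios, key=languages_ratios.get)

print(detect_language("That's thirty minutes away. I'll be there in ten."))
```

The score is a raw intersection count, so longer texts naturally produce more reliable results; for very short strings, several languages may tie and `max` will pick one arbitrarily.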

The above is the main content on Python nltk English Detection. If it did not solve your problem, see the following articles:

How to remove stopwords with nltk or Python

Resource u'tokenizers/punkt/english.pickle' not found

Simplifying multi-line Python code

The nltk package returns TypeError: 'LazyCorpusLoader' object is not callable

12. Naive Bayes - spam classification

Loading nltk_data offline with Python 3 on Linux, without nltk.download()