NLTK Stopword List

Posted 2023-03-12


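Question: NLTK Stopword List (asked 2014-05-10 21:20:50)

I have the code below, in which I am trying to apply a stopword list to a list of words. However, the results still contain words such as "a" and "the", which I thought this process would remove. Any ideas on what is going wrong would be great.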

import nltk
from nltk.corpus import stopwords

word_list = open("xxx.y.txt", "r")
filtered_words = [w for w in word_list if not w in stopwords.words('english')]
print filtered_words

Comments:

Possible duplicate of "Stopword removal with NLTK".

Answer 1:

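A few things to note.

If you are going to check membership against a collection over and over, I would use a set instead of a list.

stopwords.words('english') returns a list of lowercase stopwords. Your source most likely contains capital letters, so those words do not match.

You are not reading the file correctly: you are iterating over the file object, which yields whole lines, rather than over a list of words split on whitespace.

Putting it all together: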
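To make the third point concrete, here is a minimal sketch of what iterating over a file object actually produces. The sample text and the tiny `stops` set are made up for illustration, and `io.StringIO` stands in for the opened file so the snippet runs without a file on disk.

```python
import io

# io.StringIO stands in for open("xxx.y.txt") so no real file is needed.
fake_file = io.StringIO("The cat sat on a mat\nA dog barked\n")

# Iterating over a file object yields whole lines, newline included.
lines = list(fake_file)

stops = {"a", "the", "on"}  # tiny hypothetical stopword set

# No whole line ever equals a stopword, so nothing gets filtered out.
unfiltered = [w for w in lines if w not in stops]

# Splitting each line on whitespace gives the individual words instead.
words = [w for line in lines for w in line.split()]

# Lowercase before comparing so "The" and "A" match the lowercase stopwords.
kept = [w for w in words if w.lower() not in stops]

print(unfiltered == lines)
print(kept)
```

The first comprehension mirrors the code in the question: it compares entire lines against the stopwords, which is why "a" and "the" survive.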

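import nltk
from nltk.corpus import stopwords

# Build the set once so every membership test is a fast hash lookup.
stops = set(stopwords.words('english'))

# The with-block closes the file when the loop finishes.
with open("xxx.y.txt", "r") as word_list:
    for line in word_list:
        # Split each line on whitespace to get individual words.
        for w in line.split():
            # Lowercase before comparing: the stopword list is all lowercase.
            if w.lower() not in stops:
                print(w)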
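If you want a list back, as in the question's `filtered_words`, rather than printed output, the same logic fits in a comprehension. This is only a sketch: `filter_stopwords` is a hypothetical helper name, and it takes the stopwords as an argument so the example runs without the NLTK corpus installed.

```python
def filter_stopwords(lines, stops):
    """Return the words from an iterable of lines that are not stopwords."""
    # Normalize to a lowercase set once, so lookups are fast and case-insensitive.
    stop_set = {s.lower() for s in stops}
    return [w for line in lines for w in line.split() if w.lower() not in stop_set]


# With NLTK it might be called like this (assumes the stopwords corpus
# has already been downloaded):
#   from nltk.corpus import stopwords
#   with open("xxx.y.txt") as f:
#       filtered_words = filter_stopwords(f, stopwords.words('english'))

# Small made-up input so the sketch is runnable on its own.
filtered_words = filter_stopwords(
    ["The quick fox", "jumped over a log"],
    ["the", "a", "over"],
)
print(filtered_words)
```

Because the function accepts any iterable of lines, an open file object can be passed directly.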

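Comments:

Note that you are still not filtering punctuation; for example, you would need to strip characters such as ';"[]/?.,!.

Great, it must have been that the file was being read incorrectly. Thanks.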
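On the punctuation point, one simple approach (an assumption on my part, not something the answer prescribes) is to strip leading and trailing punctuation from each token with `str.strip` and `string.punctuation` before the stopword check. `clean_tokens` is a hypothetical helper name, and the stopword set here is a made-up stand-in for the NLTK list.

```python
import string


def clean_tokens(text, stops):
    """Split text on whitespace, trim surrounding punctuation, drop stopwords."""
    tokens = []
    for raw in text.split():
        # Remove punctuation from both ends only, then lowercase the token.
        w = raw.strip(string.punctuation).lower()
        # Skip tokens that were pure punctuation, and skip stopwords.
        if w and w not in stops:
            tokens.append(w)
    return tokens


# Made-up sample text and stopword set for illustration.
tokens = clean_tokens('The cat; a "dog" [barked]?!', {"the", "a"})
print(tokens)
```

Note that this only trims the ends of each token, so punctuation inside a word (an apostrophe, a hyphen) is left alone, and the returned tokens are lowercased.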
