How to remove stop words using nltk or python

Posted 2023-03-12


【Title】How to remove stop words using nltk or python 【Posted】2011-07-26 01:45:27 【Question】:

So I have a dataset from which I would like to remove stop words using

stopwords.words('english')

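I'm struggling with how to use this in my code to simply take these words out. I already have a list of the words from this dataset; the part I'm struggling with is comparing against this list and removing the stop words. Any help is appreciated.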

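【Comments】:

Where did you get the stop words from? Is this from NLTK?

@MattO'Brien from nltk.corpus import stopwords, for future Googlers.

You also need to run nltk.download("stopwords") to make the stop word dictionary available.

See also ***.com/questions/19130512/stopword-removal-with-nltk

Note that words like "not" are also considered stop words in nltk. If you are doing sentiment analysis, spam filtering, and so on, a negation can change the entire meaning of a sentence, and if you remove it in the processing stage you may not get accurate results.

【Answer 1】: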

I assume you have a list of words (word_list) from which you want to remove stop words. You could do something like this:

from nltk.corpus import stopwords

filtered_word_list = word_list[:]  # make a copy of word_list
for word in word_list:  # iterate over word_list
    if word in stopwords.words('english'):
        filtered_word_list.remove(word)  # remove word from filtered_word_list if it is a stop word

【Comments】:

This will be a lot slower than Daren Thomas's list comprehension...

【Answer 2】:

from nltk.corpus import stopwords
# ...
filtered_words = [word for word in word_list if word not in stopwords.words('english')]
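
A minimal sketch of the same comprehension with the stop list built once as a set, so that each membership test is a hash lookup rather than a scan over a list. The function name `remove_stopwords`, the `keep` parameter, and the tiny sample stop list are illustrative assumptions, not part of NLTK; with NLTK installed and its data downloaded, `stopwords.words('english')` could be passed in as `stop_words` instead.

```python
def remove_stopwords(words, stop_words, keep=()):
    """Return the words that are not stop words.

    stop_words: any iterable of stop words (assumed to be lowercase).
    keep: hypothetical escape hatch for words to retain, e.g. negations.
    """
    stop_set = set(stop_words) - set(keep)  # built once per call
    # Compare on the lowercased form, but keep the original spelling.
    return [w for w in words if w.lower() not in stop_set]


# Small stand-in for stopwords.words('english'); not the real NLTK list.
sample_stops = ["this", "is", "not", "a", "the"]
words = ["This", "is", "not", "a", "good", "movie"]

print(remove_stopwords(words, sample_stops))
print(remove_stopwords(words, sample_stops, keep=["not"]))
```

Keeping negations such as "not" out of the stop set may matter for tasks like sentiment analysis, where dropping them can flip the meaning of a sentence.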

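【Comments】:

Thanks for both answers. They both work, although there seems to be a flaw in my code that prevents the stop list from working correctly. Should this be a new question post? I'm still not sure how things work here!

To improve performance, consider using stops = set(stopwords.words("english")).

>>> import nltk
>>> nltk.download()

stopwords.words('english') is all lowercase, so make sure to use only lowercase words in your list, e.g. [w.lower() for w in word_list]

【Answer 3】: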

You can also do a set difference, for example:

list(set(nltk.regexp_tokenize(sentence, pattern, gaps=True)) - set(nltk.corpus.stopwords.words('english')))
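
One consequence of the set difference is that repeated words and word order are lost. The sketch below contrasts it with a list comprehension on a hand-written token list; the tokens and the three-word stop set are made-up stand-ins, not NLTK output.

```python
from collections import Counter

# Stand-in stop set and tokens; both are illustrative, not NLTK data.
stop_set = {"the", "in", "and"}
tokens = ["blue", "car", "and", "blue", "window", "in", "the", "window"]

# Set difference: duplicates collapse and the order is arbitrary.
unique_words = set(tokens) - stop_set

# List comprehension: order and repeated words survive,
# so frequency counts are still possible afterwards.
kept_words = [t for t in tokens if t not in stop_set]
counts = Counter(kept_words)

print(sorted(unique_words))
print(kept_words)
print(counts["blue"])
```

If only the vocabulary matters, the set difference is fine; if counts or order matter, the comprehension is the safer choice.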

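【Comments】:

Note: this converts the sentence to a SET, which removes all duplicate words, so you will not be able to use frequency counts on the result.

Converting to a set may remove useful information from the sentence by collapsing multiple occurrences of an important word.

【Answer 4】: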

print("enter the string from which you want to remove list of stop words")
userstring = input().split(" ")
stop_list = ["a", "an", "the", "in"]
another_list = []
for x in userstring:
    if x not in stop_list:      # compare against the stop list
        another_list.append(x)  # keep the word; it is also possible to use .remove
for x in another_list:
    print(x, end=' ')

# 2) if you prefer to use .remove
print("enter the string from which you want to remove list of stop words")
userstring = input().split(" ")
stop_list = ["a", "an", "the", "in"]
for x in userstring[:]:         # iterate over a copy so that removal is safe
    if x in stop_list:
        userstring.remove(x)
for x in userstring:
    print(x, end=' ')
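
A short sketch of why the `.remove` variant needs care: removing items from a list while iterating over that same list shifts the remaining items left, so the element right after each removed one is skipped. The variable names `buggy` and `safe` are illustrative.

```python
stop_list = ["a", "an", "the", "in"]

# Removing from the list being iterated over: some stop words are skipped.
buggy = ["the", "the", "cat", "in", "a", "hat"]
for w in buggy:
    if w in stop_list:
        buggy.remove(w)

# Iterating over a copy (safe[:]) while removing from the original.
safe = ["the", "the", "cat", "in", "a", "hat"]
for w in safe[:]:
    if w in stop_list:
        safe.remove(w)

print(buggy)  # stop words remain
print(safe)   # all stop words removed
```

Building a new list, as in the first variant, avoids the problem entirely and is usually simpler to reason about.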

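【Comments】:

It would be better to use stopwords.words("english") than to specify every word you need to remove.

【Answer 5】:

You can use this function; note that you need to lowercase all the words.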

from nltk.corpus import stopwords

def remove_stopwords(word_list):
    stop_words = set(stopwords.words("english"))  # build the set once
    processed_word_list = []
    for word in word_list:
        word = word.lower()  # in case they aren't all lowercased
        if word not in stop_words:
            processed_word_list.append(word)
    return processed_word_list

【Comments】:

【Answer 6】:

Using filter:

from nltk.corpus import stopwords
# ...  
filtered_words = list(filter(lambda word: word not in stopwords.words('english'), word_list))
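
A sketch of the same idea with the stop words held in a set, shown both with `filter` and with `itertools.filterfalse`, which drops the items for which the predicate is true. The five-word stop set is a made-up stand-in for `set(stopwords.words('english'))`.

```python
from itertools import filterfalse

# Stand-in for set(stopwords.words('english')); illustrative only.
stop_set = frozenset(["i", "am", "the", "of", "a"])
word_list = ["i", "am", "the", "king", "of", "stop", "words"]

# filter keeps items for which the predicate is true.
with_filter = list(filter(lambda w: w not in stop_set, word_list))

# filterfalse keeps items for which the predicate is false,
# so the set's own membership test can be used directly.
with_filterfalse = list(filterfalse(stop_set.__contains__, word_list))

print(with_filter)
print(with_filterfalse)
```

Both forms produce the same list; a list comprehension is arguably more readable, so this is mostly a matter of taste.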

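【Comments】:

If word_list is large, this code is very slow. It is better to convert the stop word list to a set before using it: .. in set(stopwords.words('english')).

【Answer 7】:

To exclude all types of stop words, including the nltk stop words, you can do something like this: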

from stop_words import get_stop_words
from nltk.corpus import stopwords

stop_words = list(get_stop_words('en'))         #About 900 stopwords
nltk_words = list(stopwords.words('english')) #About 150 stopwords
stop_words.extend(nltk_words)

output = [w for w in word_list if w not in stop_words]
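
A sketch of merging two stop word sources with a set union instead of `extend`, which removes words that appear in both lists and gives fast lookups. The two short lists are hypothetical placeholders for the output of `get_stop_words('en')` and `stopwords.words('english')`.

```python
# Hypothetical placeholders for the two real stop word lists.
list_a = ["a", "the", "and", "of"]
list_b = ["the", "of", "in", "on"]

# extend/concatenation keeps duplicates; a set union does not.
concatenated = list_a + list_b
combined = set(list_a) | set(list_b)

word_list = ["the", "cat", "sat", "on", "a", "mat"]
output = [w for w in word_list if w not in combined]

print(len(concatenated), len(combined))
print(output)
```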

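【Comments】:

I get len(get_stop_words('en')) == 174 vs len(stopwords.words('english')) == 179

Iterating over a list is not efficient.

【Answer 8】:

Use the textcleaner library to remove stop words from your data.

Follow this link: https://yugantm.github.io/textcleaner/documentation.html#remove_stpwrds

Follow these steps to use this library.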

pip install textcleaner

After installation:

import textcleaner as tc
data = tc.document(<file_name>) 
#you can also pass list of sentences to the document class constructor.
data.remove_stpwrds() #inplace is set to False by default

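Use the code above to remove the stop words.

【Comments】:

【Answer 9】:

There is a very simple, lightweight Python package for this purpose: stop-words.

First install the package with: pip install stop-words

Then you can remove your words in one line using a list comprehension: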

from stop_words import get_stop_words

filtered_words = [word for word in dataset if word not in get_stop_words('english')]

This package is very lightweight to download (unlike nltk), works with both Python 2 and Python 3, and has stop words for many other languages, such as:

    Arabic
    Bulgarian
    Catalan
    Czech
    Danish
    Dutch
    English
    Finnish
    French
    German
    Hungarian
    Indonesian
    Italian
    Norwegian
    Polish
    Portuguese
    Romanian
    Russian
    Spanish
    Swedish
    Turkish
    Ukrainian

【Comments】:

【Answer 10】:

Here is my take on this, in case you want to get the result as a string right away (instead of a list of filtered words):

STOPWORDS = set(stopwords.words('english'))
text =  ' '.join([word for word in text.split() if word not in STOPWORDS]) # delete stopwords from text
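
Splitting on whitespace leaves capitalized words, punctuation, and elided forms such as the French l' attached to their tokens, so they never match the stop list. A hedged sketch of one workaround using the standard `re` module follows; the function name `strip_stopwords` and the mixed toy stop set are assumptions for illustration, and note that this approach lowercases the text and drops punctuation.

```python
import re

# Toy stop set mixing English and French entries; illustrative only.
STOPWORDS = {"the", "is", "a", "l", "est"}


def strip_stopwords(text, stop_set):
    """Lowercase, split on non-word characters, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(t for t in tokens if t not in stop_set)


print(strip_stopwords("The cat is here, the end.", STOPWORDS))
print(strip_stopwords("L'homme est grand", STOPWORDS))
```

A real tokenizer (for example NLTK's `word_tokenize`) handles more edge cases; the regular expression here is only the simplest thing that separates words from apostrophes and punctuation.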

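【Comments】:

Don't use this approach for French: elided forms such as l' will not be caught.

【Answer 11】:

If your data is stored as a Pandas DataFrame, you can use remove_stopwords from texthero, which uses the NLTK stop word list by default.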

import pandas as pd
import texthero as hero
df['text_without_stopwords'] = hero.remove_stopwords(df['text'])

【Comments】:

【Answer 12】:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)

# Filter with a list comprehension...
filtered_sentence = [w for w in word_tokens if w not in stop_words]

# ...or, equivalently, with an explicit loop
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)

【Comments】:

【Answer 13】:

Let me give you an example. First, I extract the text data from the dataframe (twitter_df) for further processing, as follows:

     from nltk.tokenize import word_tokenize
     tweetText = twitter_df['text']

Then, to tokenize, I use the following:

     from nltk.tokenize import word_tokenize
     tweetText = tweetText.apply(word_tokenize)

Then, to remove the stop words:

     import nltk
     from nltk.corpus import stopwords
     nltk.download('stopwords')

     stop_words = set(stopwords.words('english'))
     tweetText = tweetText.apply(lambda x:[word for word in x if word not in stop_words])
     tweetText.head()

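I think this will help you.

【Comments】:

【Answer 14】:

Although the question is a bit old, here is a new library worth mentioning that can do extra tasks.

In some cases you don't want to just remove the stop words. Instead, you may want to find the stop words in your text data and store them in a list, so that you can locate the noise in the data and make it more interactive.

The library is called 'textfeatures'. You can use it as follows: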

! pip install textfeatures
import textfeatures as tf
import pandas as pd

For example, suppose you have the following set of strings:

texts = [
    "blue car and blue window",
    "black crow in the window",
    "i see my reflection in the window"]

df = pd.DataFrame(texts) # Convert to a dataframe
df.columns = ['text'] # give a name to the column
df

Now, call the stopwords() function and pass the parameters you want:

tf.stopwords(df,"text","stopwords") # extract stop words
df[["text","stopwords"]].head() # show the text and stopwords columns

The result will be:

    text                                 stopwords
0   blue car and blue window             [and]
1   black crow in the window             [in, the]
2   i see my reflection in the window    [i, my, in, the]

As you can see, the last column contains the stop words found in that document (record).
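
For comparison, a plain-Python sketch of the same idea: collecting the stop words found in each text rather than removing them. This does not use textfeatures; the function name `find_stopwords` and the five-word stop set are assumptions chosen to match the example rows.

```python
# Toy stop set covering the example rows; not a real stop word list.
STOP_SET = {"and", "in", "the", "i", "my"}


def find_stopwords(text, stop_set):
    """Return the stop words in text, in order of appearance."""
    return [w for w in text.split() if w.lower() in stop_set]


texts = [
    "blue car and blue window",
    "black crow in the window",
    "i see my reflection in the window",
]

found = [find_stopwords(t, STOP_SET) for t in texts]
for text, stops in zip(texts, found):
    print(text, stops)
```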

【Comments】:
