Python: displaying frequent words in a table and skipping certain words

Posted 2023-02-16

【Title】Python: displaying frequent words in a table and skipping certain words 【Posted】2017-07-12 17:01:58 【Question】

I am running a frequency analysis on a text file to show its 100 most common words. This is the code I am using at the moment:

from collections import Counter
import re
words = re.findall(r'\w+', open('tweets.txt').read().lower())
print(Counter(words).most_common(100))

The code above works, and the output is:

[('the', 1998), ('t', 1829), ('https', 1620), ('co', 1604), ('to', 1247), ('and', 1053), ('in', 957), ('a', 899), ('of', 821), ('i', 789), ('is', 784), ('you', 753), ('will', 654), ('for', 601), ('on', 574), ('thank', 470), ('be', 455), ('great', 447), ('hillary', 440), ('we', 390), ('that', 373), ('s', 363), ('it', 346), ('with', 345), ('at', 333), ('me', 327), ('are', 311), ('amp', 290), ('clinton', 288), ('trump', 287), ('have', 286), ('our', 264), ('realdonaldtrump', 256), ('my', 244), ('all', 237), ('crooked', 236), ('so', 233), ('by', 226), ('this', 222), ('was', 217), ('people', 216), ('has', 210), ('not', 210), ('just', 210), ('america', 204), ('she', 190), ('they', 188), ('trump2016', 180), ('very', 180), ('make', 180), ('from', 175), ('rt', 170), ('out', 169), ('he', 168), ('her', 164), ('makeamericagreatagain', 164), ('join', 161), ('as', 158), ('new', 157), ('who', 155), ('again', 154), ('about', 145), ('no', 142), ('get', 138), ('more', 137), ('now', 136), ('today', 136), ('president', 135), ('can', 134), ('time', 123), ('media', 123), ('vote', 117), ('but', 117), ('am', 116), ('bad', 116), ('going', 115), ('maga', 112), ('u', 112), ('many', 110), ('if', 110), ('country', 108), ('big', 108), ('what', 107), ('your', 105), ('cnn', 105), ('never', 104), ('one', 101), ('up', 101), ('back', 99), ('jobs', 98), ('tonight', 97), ('do', 97), ('been', 97), ('would', 94), ('obama', 93), ('tomorrow', 88), ('said', 88), ('like', 88), ('should', 87), ('when', 86)]

However, I would like to display it as a table with the headers "Word" and "Count". I tried the prettytable package and came up with this:

from collections import Counter
import re
import prettytable

words = re.findall(r'\w+', open('tweets.txt').read().lower())

for label, data in ('Word', words):
    pt = prettytable(field_names=[label, 'Count'])
    c = Counter(data)
    [pt.add_row(kv) for kv in c.most_common() [:100] ]
    pt.align [label], pt.align['Count'] = '1', 'r'
    print pt

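It gives me ValueError: too many values to unpack. My question is: what is wrong with my code, and is there a way to display the data using prettytable? Also, how should I modify my code?

Bonus question: is there a way to leave out certain words when counting frequencies, e.g. skip words such as and, if, of, etc.?

Thanks.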

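【Comments】

Which line is the error on? Please update the question. And what is ('Word', words)?

The error is on this line: "for label, data in ('Word', words):". Sorry, I'm new to Python; 'Word' is the header label, and words holds the words themselves (e.g. they, make, get, etc.)

【Answer 1】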

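I'm not sure how you expected the for loop you wrote to work. The error occurs because you are iterating over the two-element tuple ('Word', words). On the first iteration, the statement for label, data in ('Word', words) takes the string 'Word' and tries to unpack it: 'W' goes to label, 'o' goes to data, and 'r' and 'd' are left over, hence "too many values to unpack". Perhaps you meant to zip the items together? But then why would you make a new table for every word?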
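To make the failure concrete, here is a minimal sketch that reproduces the error without any file or table. The three-word list is made-up sample data, and the exact wording of the message differs slightly between Python versions.

```python
# Made-up stand-in for the list produced by re.findall in the question.
words = ['make', 'america', 'great']

# Iterating over the 2-tuple yields the string 'Word' first, then the list.
# Each item is unpacked into (label, data), so 'Word' would have to split
# into exactly two values, which a 4-character string cannot do.
try:
    for label, data in ('Word', words):
        pass
except ValueError as exc:
    error_message = str(exc)

# Wrapping the pair in a list gives the loop one item that unpacks cleanly.
pairs = [(label, data) for label, data in [('Word', words)]]
```

With the pair wrapped in a list the loop body runs once, with label bound to 'Word' and data bound to the whole word list, which is probably what the original loop intended.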

Here is a rewritten version:

from collections import Counter
import re, prettytable

words = re.findall(r'\w+', open('tweets.txt').read().lower())
c = Counter(words)
pt = prettytable.PrettyTable(['Words', 'Counts'])
pt.align['Words'] = 'l'
pt.align['Counts'] = 'r'
for row in c.most_common(100):
    pt.add_row(row)
print(pt)

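To skip elements in the most-common count, simply drop them from the counter before calling most_common. An easy way is to define a list of unwanted words and filter them out with a dict comprehension:

bad_words = ['the', 'if', 'of']
c = Counter({k: v for k, v in c.items() if k not in bad_words})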

Alternatively, you can filter the list of words before creating the counter:

words = filter(lambda x: x not in bad_words, words)

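I prefer operating on the counter, since that takes less work because the data is already aggregated. Here is the combined code for reference: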

from collections import Counter
import re, prettytable

bad_words = ['the', 'if', 'of']
words = re.findall(r'\w+', open('tweets.txt').read().lower())

c = Counter(words)
c = Counter({k: v for k, v in c.items() if k not in bad_words})

pt = prettytable.PrettyTable(['Words', 'Counts'])
pt.align['Words'] = 'l'
pt.align['Counts'] = 'r'
for row in c.most_common(100):
    pt.add_row(row)

print(pt)
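The two filtering approaches described above should give the same counts. The sketch below checks that on a short in-memory string instead of tweets.txt; sample_text and stop_words are illustrative names and data, not part of the original code, and a set is used only because membership tests on it are cheap.

```python
from collections import Counter
import re

# Hypothetical stand-in for the contents of tweets.txt.
sample_text = "the cat and the hat and the bat of the cat"
stop_words = {'the', 'and', 'of'}

words = re.findall(r'\w+', sample_text.lower())

# Approach 1: count everything, then drop the stop words from the counter.
filtered_after = Counter(
    {w: n for w, n in Counter(words).items() if w not in stop_words}
)

# Approach 2: drop the stop words first, then count what is left.
filtered_before = Counter(w for w in words if w not in stop_words)

# Either counter can be passed on to most_common() and the table loop.
top = filtered_after.most_common(2)
```

Filtering the counter touches each distinct word once, while filtering the list touches every occurrence, which is the reason given above for preferring the first approach.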

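【Comments】

Your code has an error. File "test4.py", line 7, in <module> pt.set_field_names(["Words", "Counts"]) File "C:\Python27\lib\site-packages\prettytable.py", line 217, in __getattr__ raise AttributeError(name) AttributeError: set_field_names

@Vin23. I fixed it.

@Vin23. The documentation for that library is a bit out of date, and my first version was based on it. This answer has only one advantage over loic's: it tabulates the 100 most common words after the skipped words are removed, rather than before.

【Answer 2】

Is this what you are trying to do?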

from prettytable import PrettyTable

x = PrettyTable(["Words", "Counts"])

L = [('the', 1998), ('t', 1829), ('https', 1620), ('co', 1604), ('to', 1247), ('and', 1053), ('in', 957), ('a', 899), ('of', 821), ('i', 789), ('is', 784), ('you', 753), ('will', 654), ('for', 601), ('on', 574), ('thank', 470), ('be', 455), ('great', 447), ('hillary', 440), ('we', 390), ('that', 373), ('s', 363), ('it', 346), ('with', 345), ('at', 333), ('me', 327), ('are', 311), ('amp', 290), ('clinton', 288), ('trump', 287), ('have', 286), ('our', 264), ('realdonaldtrump', 256), ('my', 244), ('all', 237), ('crooked', 236), ('so', 233), ('by', 226), ('this', 222), ('was', 217), ('people', 216), ('has', 210), ('not', 210), ('just', 210), ('america', 204), ('she', 190), ('they', 188), ('trump2016', 180), ('very', 180), ('make', 180), ('from', 175), ('rt', 170), ('out', 169), ('he', 168), ('her', 164), ('makeamericagreatagain', 164), ('join', 161), ('as', 158), ('new', 157), ('who', 155), ('again', 154), ('about', 145), ('no', 142), ('get', 138), ('more', 137), ('now', 136), ('today', 136), ('president', 135), ('can', 134), ('time', 123), ('media', 123), ('vote', 117), ('but', 117), ('am', 116), ('bad', 116), ('going', 115), ('maga', 112), ('u', 112), ('many', 110), ('if', 110), ('country', 108), ('big', 108), ('what', 107), ('your', 105), ('cnn', 105), ('never', 104), ('one', 101), ('up', 101), ('back', 99), ('jobs', 98), ('tonight', 97), ('do', 97), ('been', 97), ('would', 94), ('obama', 93), ('tomorrow', 88), ('said', 88), ('like', 88), ('should', 87), ('when', 86)]


for e in L:
    x.add_row([e[0],e[1]])

print(x)

Here is the result:

+-----------------------+--------+
|         Words         | Counts |
+-----------------------+--------+
|          the          |  1998  |
|           t           |  1829  |
|         https         |  1620  |
|           co          |  1604  |
|           to          |  1247  |
|          and          |  1053  |
|           in          |  957   |
|           a           |  899   |
|           of          |  821   |
|           i           |  789   |
|           is          |  784   |
|          you          |  753   |
|          will         |  654   |
|          for          |  601   |
|           on          |  574   |
|         thank         |  470   |
|           be          |  455   |
|         great         |  447   |
|        hillary        |  440   |
|           we          |  390   |
|          that         |  373   |
|           s           |  363   |
|           it          |  346   |
|          with         |  345   |
|           at          |  333   |
|           me          |  327   |
|          are          |  311   |
|          amp          |  290   |
|        clinton        |  288   |
|         trump         |  287   |
|          have         |  286   |
|          our          |  264   |
|    realdonaldtrump    |  256   |
|           my          |  244   |
|          all          |  237   |
|        crooked        |  236   |
|           so          |  233   |
|           by          |  226   |
|          this         |  222   |
|          was          |  217   |
|         people        |  216   |
|          has          |  210   |
|          not          |  210   |
|          just         |  210   |
|        america        |  204   |
|          she          |  190   |
|          they         |  188   |
|       trump2016       |  180   |
|          very         |  180   |
|          make         |  180   |
|          from         |  175   |
|           rt          |  170   |
|          out          |  169   |
|           he          |  168   |
|          her          |  164   |
| makeamericagreatagain |  164   |
|          join         |  161   |
|           as          |  158   |
|          new          |  157   |
|          who          |  155   |
|         again         |  154   |
|         about         |  145   |
|           no          |  142   |
|          get          |  138   |
|          more         |  137   |
|          now          |  136   |
|         today         |  136   |
|       president       |  135   |
|          can          |  134   |
|          time         |  123   |
|         media         |  123   |
|          vote         |  117   |
|          but          |  117   |
|           am          |  116   |
|          bad          |  116   |
|         going         |  115   |
|          maga         |  112   |
|           u           |  112   |
|          many         |  110   |
|           if          |  110   |
|        country        |  108   |
|          big          |  108   |
|          what         |  107   |
|          your         |  105   |
|          cnn          |  105   |
|         never         |  104   |
|          one          |  101   |
|           up          |  101   |
|          back         |   99   |
|          jobs         |   98   |
|        tonight        |   97   |
|           do          |   97   |
|          been         |   97   |
|         would         |   94   |
|         obama         |   93   |
|        tomorrow       |   88   |
|          said         |   88   |
|          like         |   88   |
|         should        |   87   |
|          when         |   86   |
+-----------------------+--------+

Edit 1: if you want to leave out some words, you can do this:

for e in L:
    if e[0] != "and" and e[0] != "if" and e[0] != "of":
        x.add_row([e[0], e[1]])

Edit 2: to sum up:

from collections import Counter
import re

words = re.findall(r'\w+', open('tweets.txt').read().lower())
counts = Counter(words).most_common(100)

from prettytable import PrettyTable

x = PrettyTable(["Words", "Counts"])

skip_list = ['and','if','or'] # see joe's comment

for e in counts:
    if e[0] not in skip_list:
        x.add_row([e[0],e[1]])

print(x)
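One caveat with this version: because the skip list is applied after most_common(100), every skipped word still uses up one of the 100 slots, so the table can end up with fewer than 100 rows. A minimal sketch of the difference, using made-up data and a top-2 cutoff instead of 100:

```python
from collections import Counter

# Made-up data: 'a' is the most frequent word and is also on the skip list.
words = ['a', 'a', 'a', 'b', 'b', 'c', 'd']
skip = {'a'}
n = 2

counts = Counter(words)

# Filter after taking the top n: 'a' takes a slot and is then dropped,
# leaving fewer than n rows.
after = [(w, c) for w, c in counts.most_common(n) if w not in skip]

# Filter before taking the top n: all n slots go to words that are kept.
kept = Counter({w: c for w, c in counts.items() if w not in skip})
before = kept.most_common(n)
```

If a full 100-row table matters, filtering the counter first and calling most_common(100) afterwards avoids the shortfall.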

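【Comments】

Yes, like this. But is it possible to do it without the long list of different words?

You mean picking each piece of data from the text file and putting it straight into the table? Can you give me a link to the text file? I'd like to see how the data is laid out in the file.

You could define skip_list = ['and', 'if', 'or'] and use if e[0] not in skip_list:

Of course, why didn't I think of that... If you want to leave out specific words, Joe's answer is better.

Sorry, I have to admit I really don't see how to help you without using a list; this is the first time I've used regular expressions and collections.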
