Python- displaying frequent words in a table and skipping certain words
【Posted】: 2017-07-12 17:01:58
【Question】: I am doing a frequency analysis on a text file to show its 100 most frequent words. At the moment I am using this code:
from collections import Counter
import re
words = re.findall(r'\w+', open('tweets.txt').read().lower())
print Counter(words).most_common (100)
The code above works, and the output is:
[('the', 1998), ('t', 1829), ('https', 1620), ('co', 1604), ('to', 1247), ('and', 1053), ('in', 957), ('a', 899), ('of', 821), ('i', 789), ('is', 784), ('you', 753), ('will', 654), ('for', 601), ('on', 574), ('thank', 470), ('be', 455), ('great', 447), ('hillary', 440), ('we', 390), ('that', 373), ('s', 363), ('it', 346), ('with', 345), ('at', 333), ('me', 327), ('are', 311), ('amp', 290), ('clinton', 288), ('trump', 287), ('have', 286), ('our', 264), ('realdonaldtrump', 256), ('my', 244), ('all', 237), ('crooked', 236), ('so', 233), ('by', 226), ('this', 222), ('was', 217), ('people', 216), ('has', 210), ('not', 210), ('just', 210), ('america', 204), ('she', 190), ('they', 188), ('trump2016', 180), ('very', 180), ('make', 180), ('from', 175), ('rt', 170), ('out', 169), ('he', 168), ('her', 164), ('makeamericagreatagain', 164), ('join', 161), ('as', 158), ('new', 157), ('who', 155), ('again', 154), ('about', 145), ('no', 142), ('get', 138), ('more', 137), ('now', 136), ('today', 136), ('president', 135), ('can', 134), ('time', 123), ('media', 123), ('vote', 117), ('but', 117), ('am', 116), ('bad', 116), ('going', 115), ('maga', 112), ('u', 112), ('many', 110), ('if', 110), ('country', 108), ('big', 108), ('what', 107), ('your', 105), ('cnn', 105), ('never', 104), ('one', 101), ('up', 101), ('back', 99), ('jobs', 98), ('tonight', 97), ('do', 97), ('been', 97), ('would', 94), ('obama', 93), ('tomorrow', 88), ('said', 88), ('like', 88), ('should', 87), ('when', 86)]
However, I would like to display it as a table with the headers "Word" and "Count". I tried the prettytable package and came up with this:
from collections import Counter
import re
import prettytable
words = re.findall(r'\w+', open('tweets.txt').read().lower())
for label, data in ('Word', words):
    pt = prettytable(field_names=[label, 'Count'])
    c = Counter(data)
    [pt.add_row(kv) for kv in c.most_common() [:100] ]
    pt.align [label], pt.align['Count'] = '1', 'r'
    print pt
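It gives me ValueError: too many values to unpack. My question is: what is wrong with my code, and is there a way to display the data using prettytable? Also, how should I change my code?
Bonus question: is there a way to omit certain words when counting the frequencies? For example, skipping words such as and, if, of, etc.
Thanks.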
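【Comments on the question】:
Which line is the error on? Please update the question. And what is ('Word', words)?
The error is on this line: "for label, data in ('Word', words):"
Sorry, I'm new to Python. 'Word' is the header label and 'words' is the words themselves (e.g. they, make, get, etc.)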
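【Answer 1】:
I'm not sure how the for loop you wrote was meant to work. You get the error because you are iterating over the two-element tuple ('Word', words). The statement for label, data in ('Word', words) takes the first element, the string 'Word', and tries to unpack it: 'W' is assigned to label, 'o' to data, and 'r' and 'd' are left over in the first iteration, which is what "too many values to unpack" means. Perhaps you meant to zip the items together? But then why would you make a new table for every word?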
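A small sketch of why the loop fails, assuming a short made-up list in place of the real words. Iterating over a tuple yields its elements one at a time, and each element is then unpacked into label and data; the helper name try_unpack is hypothetical and only exists to capture the error message:

```python
# Stand-in for the real word list; these values are illustrative only.
words = ['the', 'great', 'the']

def try_unpack():
    """Run the failing loop and return the error message it raises."""
    try:
        # First iteration: the element is 'Word', a 4-character string,
        # which cannot be unpacked into just two names.
        for label, data in ('Word', words):
            pass
    except ValueError as exc:
        return str(exc)
    return None

message = try_unpack()
print(message)

# What was probably intended: unpack a single (label, data) pair, no loop.
label, data = ('Word', words)
print(label, len(data))
```

Unpacking the pair once, without a loop, gives label the header text and data the whole word list.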
Here is a rewritten version:
from collections import Counter
import re, prettytable
words = re.findall(r'\w+', open('tweets.txt').read().lower())
c = Counter(words)
pt = prettytable.PrettyTable(['Words', 'Counts'])
pt.align['Words'] = 'l'
pt.align['Counts'] = 'r'
for row in c.most_common(100):
    pt.add_row(row)
print(pt)
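To skip elements in the most-common counts, you can simply drop them from the counter before calling most_common. A simple way is to define a list of unwanted words and then filter them out with a dict comprehension: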
bad_words = ['the', 'if', 'of']
c = Counter({k: v for k, v in c.items() if k not in bad_words})
Alternatively, you can filter the word list before creating the counter:
words = filter(lambda x: x not in bad_words, words)
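I prefer operating on the counter because it takes less work: the data has already been aggregated, so each distinct word is checked only once. Here is the combined code for reference: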
from collections import Counter
import re, prettytable
# Words to leave out of the table
bad_words = ['the', 'if', 'of']
words = re.findall(r'\w+', open('tweets.txt').read().lower())
c = Counter(words)
# Drop the unwanted words from the aggregated counts
c = Counter({k: v for k, v in c.items() if k not in bad_words})
pt = prettytable.PrettyTable(['Words', 'Counts'])
pt.align['Words'] = 'l'
pt.align['Counts'] = 'r'
for row in c.most_common(100):
    pt.add_row(row)
print(pt)
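The counting and skipping steps can also be wrapped in a small helper. This is only a sketch under a few assumptions: the function name top_words and its parameters are hypothetical, it takes the text as a string rather than reading tweets.txt, and it filters the words while counting them (a variation of the second approach) using a set so that each membership test is a hash lookup. It uses only the standard library and returns the rows rather than building the table.

```python
from collections import Counter
import re

def top_words(text, skip_words=(), n=100):
    """Return the n most common words in text, ignoring skip_words."""
    skip = set(skip_words)
    words = re.findall(r'\w+', text.lower())
    # Count only the words that are not in the skip set.
    counts = Counter(w for w in words if w not in skip)
    return counts.most_common(n)

# Made-up sample text, for illustration only.
sample = "The jobs of the people and the jobs of America"
rows = top_words(sample, skip_words=['the', 'if', 'of'], n=2)
print(rows)
```

Each returned row is a (word, count) tuple, so it can be passed to add_row in the same way as the rows from most_common.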
【Comments】:
Your code has an error. File "test4.py", line 7, in
【Answer 2】:
Is this what you are trying to do?
from prettytable import PrettyTable
x = PrettyTable(["Words", "Counts"])
L = [('the', 1998), ('t', 1829), ('https', 1620), ('co', 1604), ('to', 1247), ('and', 1053), ('in', 957), ('a', 899), ('of', 821), ('i', 789), ('is', 784), ('you', 753), ('will', 654), ('for', 601), ('on', 574), ('thank', 470), ('be', 455), ('great', 447), ('hillary', 440), ('we', 390), ('that', 373), ('s', 363), ('it', 346), ('with', 345), ('at', 333), ('me', 327), ('are', 311), ('amp', 290), ('clinton', 288), ('trump', 287), ('have', 286), ('our', 264), ('realdonaldtrump', 256), ('my', 244), ('all', 237), ('crooked', 236), ('so', 233), ('by', 226), ('this', 222), ('was', 217), ('people', 216), ('has', 210), ('not', 210), ('just', 210), ('america', 204), ('she', 190), ('they', 188), ('trump2016', 180), ('very', 180), ('make', 180), ('from', 175), ('rt', 170), ('out', 169), ('he', 168), ('her', 164), ('makeamericagreatagain', 164), ('join', 161), ('as', 158), ('new', 157), ('who', 155), ('again', 154), ('about', 145), ('no', 142), ('get', 138), ('more', 137), ('now', 136), ('today', 136), ('president', 135), ('can', 134), ('time', 123), ('media', 123), ('vote', 117), ('but', 117), ('am', 116), ('bad', 116), ('going', 115), ('maga', 112), ('u', 112), ('many', 110), ('if', 110), ('country', 108), ('big', 108), ('what', 107), ('your', 105), ('cnn', 105), ('never', 104), ('one', 101), ('up', 101), ('back', 99), ('jobs', 98), ('tonight', 97), ('do', 97), ('been', 97), ('would', 94), ('obama', 93), ('tomorrow', 88), ('said', 88), ('like', 88), ('should', 87), ('when', 86)]
for e in L:
    x.add_row([e[0],e[1]])
print(x)
The result looks like this:
+-----------------------+--------+
| Words | Counts |
+-----------------------+--------+
| the | 1998 |
| t | 1829 |
| https | 1620 |
| co | 1604 |
| to | 1247 |
| and | 1053 |
| in | 957 |
| a | 899 |
| of | 821 |
| i | 789 |
| is | 784 |
| you | 753 |
| will | 654 |
| for | 601 |
| on | 574 |
| thank | 470 |
| be | 455 |
| great | 447 |
| hillary | 440 |
| we | 390 |
| that | 373 |
| s | 363 |
| it | 346 |
| with | 345 |
| at | 333 |
| me | 327 |
| are | 311 |
| amp | 290 |
| clinton | 288 |
| trump | 287 |
| have | 286 |
| our | 264 |
| realdonaldtrump | 256 |
| my | 244 |
| all | 237 |
| crooked | 236 |
| so | 233 |
| by | 226 |
| this | 222 |
| was | 217 |
| people | 216 |
| has | 210 |
| not | 210 |
| just | 210 |
| america | 204 |
| she | 190 |
| they | 188 |
| trump2016 | 180 |
| very | 180 |
| make | 180 |
| from | 175 |
| rt | 170 |
| out | 169 |
| he | 168 |
| her | 164 |
| makeamericagreatagain | 164 |
| join | 161 |
| as | 158 |
| new | 157 |
| who | 155 |
| again | 154 |
| about | 145 |
| no | 142 |
| get | 138 |
| more | 137 |
| now | 136 |
| today | 136 |
| president | 135 |
| can | 134 |
| time | 123 |
| media | 123 |
| vote | 117 |
| but | 117 |
| am | 116 |
| bad | 116 |
| going | 115 |
| maga | 112 |
| u | 112 |
| many | 110 |
| if | 110 |
| country | 108 |
| big | 108 |
| what | 107 |
| your | 105 |
| cnn | 105 |
| never | 104 |
| one | 101 |
| up | 101 |
| back | 99 |
| jobs | 98 |
| tonight | 97 |
| do | 97 |
| been | 97 |
| would | 94 |
| obama | 93 |
| tomorrow | 88 |
| said | 88 |
| like | 88 |
| should | 87 |
| when | 86 |
+-----------------------+--------+
Edit 1: If you want to omit certain words, you can do this:
for e in L:
    # Add the row only if the word is none of the skipped words
    if e[0] not in ("and", "if", "of"):
        x.add_row([e[0],e[1]])
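When several words are skipped, the way the comparisons are combined matters. A chain of != tests joined with or is true for every word, so it filters nothing, whereas a membership test (or != tests joined with and) behaves as intended. A minimal sketch with made-up sample words; the function names are hypothetical:

```python
skip = ("and", "if", "of")

def keep_with_or(word):
    # Always True: no word equals "and", "if" and "of" at the same time,
    # so at least one of the three inequalities always holds.
    return word != "and" or word != "if" or word != "of"

def keep_with_not_in(word):
    # True only when the word is none of the skipped words.
    return word not in skip

sample = ["and", "great", "of"]
print([w for w in sample if keep_with_or(w)])
print([w for w in sample if keep_with_not_in(w)])
```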
Edit 2: To sum up:
from collections import Counter
import re
from prettytable import PrettyTable
words = re.findall(r'\w+', open('tweets.txt').read().lower())
counts = Counter(words).most_common(100)
x = PrettyTable(["Words", "Counts"])
skip_list = ['and','if','or'] # see joe's comment
for e in counts:
    if e[0] not in skip_list:
        x.add_row([e[0],e[1]])
print(x)
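One thing to watch in this summary: the skip list is applied after most_common(100), so every skipped word that made the top 100 leaves the table one row short. A sketch of the difference, assuming a tiny made-up word list and a top-3 cut in place of the top 100:

```python
from collections import Counter

# Made-up data: 'and' is frequent enough to take one of the top-3 slots.
words = ['and'] * 3 + ['great'] * 2 + ['vote'] + ['jobs'] * 4
skip_list = ['and', 'if', 'or']

counts = Counter(words)

# Filter after truncating: 'and' uses up a slot and is then dropped.
after = [(w, n) for w, n in counts.most_common(3) if w not in skip_list]

# Filter before truncating: all three slots go to words that are kept.
kept = Counter({w: n for w, n in counts.items() if w not in skip_list})
before = kept.most_common(3)

print(after)
print(before)
```

Filtering the counter first and truncating afterwards fills every row of the table, as long as enough distinct words remain.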
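【Comments】:
Yes, like this. But is it possible to do it without the long list of individual words?
Do you mean you want to pick every item from the text file and put it straight into the table? Could you give me a link to the text file? I'd like to see how the data is arranged in the file.
You can define skip_list = ['and', 'if', 'or'] and use if e[0] not in skip_list:
Of course, why didn't I think of that... If you want to omit specific words, Joe's answer is better.
Sorry, I have to admit I really don't see how to help you without using a list; this is the first time I've used regular expressions and collections.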