Python 中文文件统计词频 + 中文词云
Posted 清晨的第一抹阳光
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了Python 中文文件统计词频 + 中文词云相关的知识,希望对你有一定的参考价值。
1. 词频统计:
1 import jieba 2 txt = open("threekingdoms3.txt", "r", encoding=\'utf-8\').read() 3 words = jieba.lcut(txt) 4 counts = {} 5 for word in words: 6 if len(word) == 1: 7 continue 8 else: 9 counts[word] = counts.get(word,0) + 1 10 items = list(counts.items()) 11 items.sort(key=lambda x:x[1], reverse=True) 12 for i in range(15): 13 word, count = items[i] 14 print ("{0:<10}{1:>5}".format(word, count))
结果是:
曹操 946
孔明 737
将军 622
玄德 585
却说 534
关公 509
荆州 413
二人 410
丞相 405
玄德曰 390
不可 387
孔明曰 374
张飞 358
如此 320
不能 318
进一步改进, 我想只知道人物出场统计,代码如下:
1 import jieba 2 txt = open("threekingdoms3.txt", "r", encoding=\'utf-8\').read() 3 names = {\'曹操\',\'孔明\',\'刘备\',\'关羽\',\'张飞\',\'吕布\',\'赵云\',\'孙权\',\'周瑜\',\'袁绍\',\'黄忠\',\'魏延\'} 4 words = jieba.lcut(txt) 5 counts = {} 6 for word in words: 7 if len(word) == 1: 8 continue 9 elif word == "诸葛亮" or word == "孔明曰": 10 rword = "孔明" 11 elif word == "关公" or word == "云长": 12 rword = "关羽" 13 elif word == "玄德" or word == "玄德曰": 14 rword = "刘备" 15 elif word == "孟德" or word == "丞相": 16 rword = "曹操" 17 else: 18 rword = word 19 counts[rword] = counts.get(rword,0) + 1 20 # for word in excludes: 21 # del counts[word] 22 items = list(counts.items()) 23 items.sort(key=lambda x:x[1], reverse=True) 24 for i in range(40): 25 word, count = items[i] 26 if word in names: 27 print ("{0:<10}{1:>5}".format(word, count))
运行结果为:
曹操 1358
孔明 1265
刘备 1251
关羽 783
张飞 358
吕布 300
赵云 278
孙权 257
周瑜 217
袁绍 191
进一步的做词云图:
1 import jieba 2 import os 3 import wordcloud 4 5 def getText(file): 6 with open(file, \'r\', encoding= \'UTF-8\') as txt: 7 txt = txt.read() 8 jieba.lcut(txt) 9 return txt 10 11 12 directoryname = os.getcwd() 13 filename = input() 14 txt = getText(filename + \'.txt\') 15 wordclouds = wordcloud.WordCloud(width=1000, height= 800, margin=2).generate(txt) 16 wordclouds.to_file(\'{}.png\'.format(filename)) 17 18 os.system(\'{}.png\'.format(filename))
名称是可以进一步优化的,参见第二部分代码。
中文wordcloud库默认会出现乱码,解决方法参考 https://blog.csdn.net/Dick633/article/details/80261233
参考:https://blog.csdn.net/weixin_44521703/article/details/93058003
以上是关于Python 中文文件统计词频 + 中文词云的主要内容,如果未能解决你的问题,请参考以下文章