Chinese Word Frequency Statistics

Posted 2020-10-09 066谢平坚


Chinese Word Segmentation

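Download a full-length Chinese novel and convert it to UTF-8 encoding.
Use the jieba library to count Chinese word frequencies and print the top 20 words with their counts.
Exclude meaningless words and merge different forms of the same word.
Give a brief interpretation of the word-frequency results.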
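
For the first step, a minimal sketch of the UTF-8 conversion, assuming the downloaded file is in a GB-family encoding such as GB18030 (common for Chinese text files on Windows, but an assumption here; the original does not say). The function name `convert_to_utf8` and its parameters are illustrative, not from the original post.

```python
def convert_to_utf8(src_path, dst_path, src_encoding="gb18030"):
    """Re-encode a text file as UTF-8.

    src_encoding is an assumption: change it if the downloaded
    file uses a different encoding.
    """
    # Decode the source file using its original encoding
    with open(src_path, "r", encoding=src_encoding) as src:
        text = src.read()
    # Write the same text back out as UTF-8
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)
```

If the source encoding is wrong, `open(...).read()` will usually raise `UnicodeDecodeError`, which is a useful early signal that a different `src_encoding` is needed.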
```
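import jieba

# Read the text to analyze (the file must be UTF-8 encoded)
with open('D:\\xiaoshuo.txt', 'r', encoding='utf-8') as book:
    text = book.read()

# Strip punctuation and whitespace before segmenting
for ch in '，。！、 \n“”；':
    text = text.replace(ch, '')

# Segment the text and keep each distinct word once
words = set(jieba.cut(text))

# Count dictionary: skip single-character words.
# Note: text.count() counts substring occurrences in the cleaned text.
counts = {}
for w in words:
    if len(w) > 1:
        counts[w] = text.count(w)

# Sort by count in descending order and print the top 20
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for item in items[:20]:
    print(item)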
```
Python 3.6.2 (v3.6.2:5fd33b5, Jul 8 2017, 04:57:36) [MSC v.1900 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>>
============================= RESTART: D:/daa.py =============================
Building prefix dict from the default dictionary ...
Dumping model to file cache C:\Users\asus\AppData\Local\Temp\jieba.cache
Loading model cost 1.306 seconds.
Prefix dict has been built succesfully.
('父亲', 10)
('背影', 4)
('丧事', 3)
('北京', 3)
('散文', 3)
('茶房', 3)
('那年', 2)
('父母', 2)
('踌躇', 2)
('朱自清', 2)
('要紧', 2)
('终于', 2)
('日子', 2)
('一会', 2)
('一半', 2)
('子女', 2)
('描写', 2)
('回家', 2)
('不必', 2)
('为了', 2)
>>>
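
The script above does not yet cover the third step, excluding meaningless words and merging equivalent ones. Below is a minimal sketch of one way to do it. It counts the tokens jieba produces with `collections.Counter` instead of matching substrings, so its counts may differ from the output shown above. The stopword set and the merge table are hypothetical examples: the stopwords are picked from the output above, and the alias 爸爸 → 父亲 is an assumption for illustration. The names `top_words` and `top_words_in_file` are also illustrative.

```python
from collections import Counter

# Hypothetical examples; tune both for the text being analyzed.
STOPWORDS = {"一会", "一半", "不必", "为了", "终于"}
MERGE = {"爸爸": "父亲"}  # alias -> canonical word


def top_words(tokens, n=20, stopwords=STOPWORDS, merge=MERGE):
    """Count tokens, merging aliases and skipping stopwords and single characters."""
    counts = Counter()
    for tok in tokens:
        tok = merge.get(tok, tok)  # map an alias to its canonical form
        if len(tok) > 1 and tok not in stopwords:
            counts[tok] += 1
    return counts.most_common(n)


def top_words_in_file(path, n=20):
    """Segment a UTF-8 text file with jieba and return its top-n words."""
    import jieba  # imported lazily so top_words() works without jieba installed
    with open(path, "r", encoding="utf-8") as f:
        return top_words(jieba.cut(f.read()), n)
```

Calling `top_words_in_file('D:\\xiaoshuo.txt')` would return a list of `(word, count)` pairs in the same shape the original script prints. Keeping the stopwords and merge table as plain data makes them easy to adjust after inspecting a first run.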

