Chinese Word Frequency Statistics

Posted 2020-10-09 137陈楚洪

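import jieba
from collections import Counter

# Read the whole input file as UTF-8 text
with open('C:/1.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Replace newlines, punctuation, and the digits 1-3 with spaces
for ch in '\n,.\\()。，123"？':
    text = text.replace(ch, ' ')

# Segment the text into a list of words with jieba
words = list(jieba.cut(text))

# Stop words and leftover symbols to exclude from the count
exc = {' ', '和', '你', '使', '都', '所', '又', '一个', '啊', '也是', '的', '了', '（', '…', '阿', '廖沙', '也', '是', '对', '就', '“', '”', '地', '他', '她'}

# Distinct words that remain after removing the excluded ones
keys = set(words) - exc
print(keys)

# Count every remaining word in a single pass
# (avoids shadowing the built-in dict and calling list.count once per word)
counts = Counter(w for w in words if w in keys)

# Sort by frequency, highest first
st = counts.most_common()
print(st)

# Print the 20 most frequent words (fewer if the text has fewer distinct words)
for item in st[:20]:
    print(item)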
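
The counting and ranking steps can also be wrapped in a reusable function and tried on an already segmented word list, without the input file. The following is a minimal sketch, not the author's code: the function name `top_words`, its parameters, and the sample word list are assumptions made for illustration. It uses only `collections.Counter` from the standard library.

```python
from collections import Counter


def top_words(words, stop_words, n=20):
    """Return up to n (word, count) pairs, most frequent first.

    Hypothetical helper for illustration; `words` is an already
    segmented list, e.g. the result of list(jieba.cut(text)).
    """
    # Skip stop words and whitespace-only tokens while counting
    counts = Counter(w for w in words if w.strip() and w not in stop_words)
    # most_common(n) returns fewer than n pairs when there are fewer distinct words
    return counts.most_common(n)


# Made-up sample input standing in for jieba's output
sample = ['我', '喜欢', '苹果', '和', '香蕉', '，', '我', '喜欢', '的', '苹果']
result = top_words(sample, stop_words={'和', '的', '，'})
print(result)
```

Passing the stop words as an argument keeps the exclusion list separate from the counting logic, so it could be loaded from a file or extended per text.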
