Chinese Word Frequency Statistics

Posted by 066谢平坚

Chinese word segmentation

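  1. Download a full-length Chinese novel and convert it to UTF-8 encoding.
  2. Use the jieba library to count Chinese word frequencies, and output the top 20 words with their counts.
  3. Exclude meaningless words and merge variants of the same word.
  4. Give a brief interpretation of the word-frequency results.

The script below reads the text, segments it with jieba, and prints the 20 most frequent words of two or more characters:

    import jieba

    # Read the text to analyze (the file must already be UTF-8 encoded)
    with open('D:\\xiaoshuo.txt', 'r', encoding='utf-8') as book:
        text = book.read()

    # Strip punctuation and whitespace (full-width and ASCII forms)
    for ch in ',,。!!、;;::“” \n':
        text = text.replace(ch, '')

    # Segment the text and keep each distinct word once
    words = set(jieba.cut(text))

    # Counting dictionary: word -> occurrences, skipping single characters.
    # Note: text.count() counts substring matches in the whole text,
    # so a word is also counted when it appears inside a longer word.
    dic = {}
    for w in words:
        if len(w) > 1:
            dic[w] = text.count(w)
    items = list(dic.items())

    # Sort by count, descending, and print the top 20 (or fewer)
    items.sort(key=lambda x: x[1], reverse=True)
    for item in items[:20]:
        print(item)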
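Step 1 asks for the novel to be converted to UTF-8, and the script above opens the file as UTF-8, so a file saved in another encoding has to be re-encoded first. The following is a minimal sketch of that step, not part of the original script. The helper name `convert_to_utf8` and the default source encoding `gb18030` are assumptions; many Chinese text downloads use a GBK-family encoding, but the actual encoding of a given file has to be checked.

```python
from pathlib import Path


def convert_to_utf8(src, dst, src_encoding="gb18030"):
    """Re-encode a text file as UTF-8 and return the number of characters written.

    Assumption: the source file uses a GBK-family encoding. Pass a different
    src_encoding if the downloaded file uses something else.
    """
    text = Path(src).read_text(encoding=src_encoding)
    Path(dst).write_text(text, encoding="utf-8")
    return len(text)
```

A call such as `convert_to_utf8('D:\\raw.txt', 'D:\\xiaoshuo.txt')` would produce the file the script reads; the input path here is hypothetical. If the source encoding is guessed wrong, `read_text` will usually raise `UnicodeDecodeError`, which is a useful signal to try another encoding.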

Running the word-count script (D:/daa.py) in IDLE prints:

    Python 3.6.2 (v3.6.2:5fd33b5, Jul 8 2017, 04:57:36) [MSC v.1900 64 bit (AMD64)] on win32
    Type "copyright", "credits" or "license()" for more information.
    >>>
    ============================= RESTART: D:/daa.py =============================
    Building prefix dict from the default dictionary ...
    Dumping model to file cache C:\Users\asus\AppData\Local\Temp\jieba.cache
    Loading model cost 1.306 seconds.
    Prefix dict has been built succesfully.
    ('父亲', 10)
    ('背影', 4)
    ('丧事', 3)
    ('北京', 3)
    ('散文', 3)
    ('茶房', 3)
    ('那年', 2)
    ('父母', 2)
    ('踌躇', 2)
    ('朱自清', 2)
    ('要紧', 2)
    ('终于', 2)
    ('日子', 2)
    ('一会', 2)
    ('一半', 2)
    ('子女', 2)
    ('描写', 2)
    ('回家', 2)
    ('不必', 2)
    ('为了', 2)
    >>>
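The list above still contains function words such as 为了 and 不必, which is what step 3 (excluding meaningless words and merging variants of the same word) is meant to address. Below is a rough sketch of one way to do that, not the author's method. It counts tokens with `collections.Counter` instead of substring matches, so a word is only counted where the segmenter actually produced it. The names `STOPWORDS`, `MERGE`, and `top_words`, and the entries in the two tables, are hypothetical examples chosen for illustration.

```python
from collections import Counter

# Hypothetical examples; a real analysis would build these for the text at hand.
STOPWORDS = {"为了", "不必", "一会", "终于"}
MERGE = {"爸爸": "父亲"}  # variant -> canonical form


def top_words(tokens, n=20, stopwords=STOPWORDS, merge=MERGE):
    """Count tokens, merging variants and skipping stopwords and single characters."""
    counts = Counter()
    for tok in tokens:
        tok = merge.get(tok, tok)  # map a variant onto its canonical word
        if len(tok) > 1 and tok not in stopwords:
            counts[tok] += 1
    return counts.most_common(n)


# Stand-in for the token stream; in the script this would be jieba.lcut(text).
sample = ["父亲", "的", "背影", "爸爸", "为了", "父亲", "背影", "北京"]
print(top_words(sample, n=3))  # [('父亲', 3), ('背影', 2), ('北京', 1)]
```

Because the function takes any iterable of tokens, it can be fed directly from `jieba.cut(text)`, and the punctuation-stripping loop becomes optional as long as punctuation marks are added to the stopword set or filtered by the length check.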

