Comprehensive Exercise: Word Frequency Statistics

Posted 2020-10-28 笑看风云动

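Exercise requirements

Download the lyrics of an English song or an English article.

Save the lyrics to a file, then read them back.

Replace all separators such as , . ? ! ' : with spaces.

Convert all uppercase letters to lowercase.

Generate the word list.

Generate the word frequency counts.

Sort.

Exclude grammatical words: pronouns, articles, conjunctions.

Output the 20 most frequent words.

Save the text to analyze as a UTF-8 encoded file, and obtain the content for the frequency analysis by reading that file.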

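# Read the text to analyze from a UTF-8 encoded file
io = open("test.txt", 'r', encoding='utf-8')
news = io.read()
io.close()

# Separators to replace with spaces, and grammatical words to exclude
str1 = ",.?!'’:"
strList = {'is', 'the', 'to', 'it', 'and', 'oh', 'in'}

# Lowercase the text, replace each separator with a space, then split into words
news = news.lower()
for item in str1:
    news = news.replace(item, " ")
news2 = news.split()

# Count each distinct word once, skipping the excluded words
wordDict = {}
wordSet = set(news2) - strList
for w in wordSet:
    wordDict[w] = news2.count(w)

# Sort by frequency in descending order and print the top 20
wordList = list(wordDict.items())
wordList.sort(key=lambda x: x[1], reverse=True)
newWordList = wordList[:20]
for i in newWordList:
    print(i)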
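
The script above calls `news2.count(w)` inside a loop, which rescans the whole word list on every iteration. As an alternative sketch (not the approach used in the original exercise), the same steps can be done in a single pass with `collections.Counter` from the standard library. The function name `top_words`, the `STOP_WORDS` set, and the sample lyric are illustrative assumptions.

```python
import re
from collections import Counter

# Illustrative stop-word set; extend it with more pronouns, articles, and conjunctions.
STOP_WORDS = {"is", "the", "to", "it", "and", "oh", "in", "a", "an", "i", "you"}


def top_words(text, n=20, stop_words=STOP_WORDS):
    """Return the n most frequent words in text, excluding stop_words."""
    # Lowercase first, then treat every run of non-letters as a separator.
    words = re.split(r"[^a-z]+", text.lower())
    # Drop empty strings left by leading/trailing separators, and drop stop words.
    counts = Counter(w for w in words if w and w not in stop_words)
    return counts.most_common(n)


sample = "Let it be, let it be. Whisper words of wisdom: let it be!"
for word, count in top_words(sample, n=3):
    print(word, count)
```

Because every non-letter is treated as a separator, contractions such as "don't" are split into "don" and "t", which matches the exercise's rule of replacing apostrophes with spaces. To analyze a file instead of the sample string, read it with `open("test.txt", encoding="utf-8")` and pass the contents to `top_words`.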

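2. Chinese word frequency statistics

Download a long Chinese article.

Read the text to analyze from a file.

news = open('gzccnews.txt', 'r', encoding='utf-8').read()

Install jieba and use it for Chinese word segmentation.

pip install jieba

import jieba

jieba.lcut(news)

Generate the word frequency counts.

Sort.

Exclude grammatical words: pronouns, articles, conjunctions.

Output the 20 most frequent words (or save the result to a file).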

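#!/usr/bin/python
# -*- coding: UTF-8 -*-
import jieba

# Read the text to analyze from a UTF-8 encoded file
io = open("test2.txt", 'r', encoding='UTF-8')
strList = io.read()
io.close()

# Segment the text into a list of words with jieba
wordList = jieba.lcut(strList)

# Count each word. Tokens of length 1 are skipped as a simple stand-in
# for a stop-word list: they are mostly punctuation and function words.
wordDict = {}
for item in wordList:
    if len(item.strip()) > 1:
        wordDict[item] = wordDict.get(item, 0) + 1

# Sort by frequency in descending order and print the top 20
countList = list(wordDict.items())
countList.sort(key=lambda x: x[1], reverse=True)
newWordList = countList[:20]
for i in newWordList:
    print(i)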
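
Counting the words and saving the result do not depend on jieba itself, only on the list of tokens it returns. The sketch below assumes the tokens come from `jieba.lcut`; the helper names `count_tokens` and `save_counts`, the small stop-word set, and the hand-written token list are hypothetical choices for illustration, not part of the original exercise.

```python
from collections import Counter

# Hypothetical stop-word set of common Chinese function words; extend as needed.
STOP_WORDS = {"的", "了", "和", "是", "在", "我", "你", "他", "她", "它"}


def count_tokens(tokens, stop_words=STOP_WORDS):
    """Count tokens, skipping stop words and tokens with no letter or digit."""
    counts = Counter()
    for tok in tokens:
        tok = tok.strip()
        # str.isalnum() is True for Chinese characters and False for punctuation.
        if tok and tok not in stop_words and any(ch.isalnum() for ch in tok):
            counts[tok] += 1
    return counts


def save_counts(counts, path, n=20):
    """Write the n most frequent tokens to path, one 'word<TAB>count' per line."""
    with open(path, "w", encoding="utf-8") as f:
        for word, freq in counts.most_common(n):
            f.write(f"{word}\t{freq}\n")


# Hand-written tokens standing in for the output of jieba.lcut(text).
tokens = ["我", "喜欢", "学习", "，", "学习", "使", "我", "快乐", "。"]
counts = count_tokens(tokens)
print(counts.most_common(20))
```

With jieba installed, `count_tokens(jieba.lcut(text))` followed by `save_counts(counts, "result.txt")` would cover the last step of the exercise; the output file name here is only an example.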
