1. English word frequency count
Download the lyrics of an English song or an English article.
article = '''An empty street An empty house A hole inside my heart I'm all alone The rooms are getting smaller I wonder how I wonder why I wonder where they are The days we had The songs we sang together Oh yeah And oh my love I'm holding on forever Reaching for a love that seems so far So i say a little prayer And hope my dreams will take me there Where the skies are blue to see you once again, my love Over seas and coast to coast To find a place i love the most Where the fields are green to see you once again, my love I try to read I go to work I'm laughing with my friends But i can't stop to keep myself from thinking Oh no I wonder how I wonder why I wonder where they are The days we had The songs we sang together Oh yeah And oh my love I'm holding on forever Reaching for a love that seems so far Mark: To hold you in my arms To promise you my love To tell you from the heart You're all i'm thinking of I'm reaching for a love that seems so far So i say a little prayer And hope my dreams will take me there Where the skies are blue to see you once again, my love Over seas and coast to coast To find a place i love the most Where the fields are green to see you once again,my love say a little prayer dreams will take me there Where the skies are blue to see you once again '''
Replace all separator characters such as , . ? ! ' : with spaces.
sep = ''':.,?!'''
for i in sep:
    article = article.replace(i, ' ')
Convert all uppercase letters to lowercase.
article = article.lower()
Generate the word list.
article_list = article.split()
print(article_list)
Generate the word frequency counts.
# Method ①: iterate over the set of distinct words
# article_dict = {}
# article_set = set(article_list) - exclude   # remove duplicates and the excluded words
# for w in article_set:
#     article_dict[w] = article_list.count(w)
# # iterate over the dictionary
# for w in article_dict:
#     print(w, article_dict[w])

# Method ②: iterate over the word list
article_dict = {}
for w in article_list:
    article_dict[w] = article_dict.get(w, 0) + 1
# drop the unwanted words (exclude is the stop-word set defined in the
# "exclude grammar-type words" step below; define it before running this loop)
for w in exclude:
    article_dict.pop(w, None)   # pop() avoids a KeyError when a word is absent
for w in article_dict:
    print(w, article_dict[w])
Sort by frequency.
dictList = list(article_dict.items())
dictList.sort(key=lambda x: x[1], reverse=True)
Exclude grammar-type words: pronouns, articles, conjunctions.
exclude = {'the', 'to', 'is', 'and'}
for w in exclude:
    article_dict.pop(w, None)   # pop() avoids a KeyError if a word does not occur in the text
Output the top 20 most frequent words.
for i in range(20):
    print(dictList[i])
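As a side note, the counting, exclusion and top-20 steps can also be done with collections.Counter from the standard library. This is only an alternative sketch, not the method used above, and it assumes article_list and exclude are already defined:

from collections import Counter

# count every word in the list in one call
counter = Counter(article_list)

# drop the excluded grammar-type words; pop() with a default never raises
for w in exclude:
    counter.pop(w, None)

# most_common(20) returns the 20 highest-count (word, count) pairs
for word, count in counter.most_common(20):
    print(word, count)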
Save the text to be analyzed as a UTF-8 encoded file, and obtain the content for the frequency analysis by reading that file.
file = open('test.txt', 'r', encoding='utf-8')
article = file.read()
file.close()
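For completeness, the lyrics string can first be written out to that UTF-8 file. This is a minimal sketch; test.txt is simply the filename the read code above already uses:

# write the lyrics string to a UTF-8 encoded file so it can be read back for analysis
with open('test.txt', 'w', encoding='utf-8') as f:
    f.write(article)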
2. Chinese word frequency count
Download a long Chinese article.
Read the text to be analyzed from the file.
news = open('gzccnews.txt', 'r', encoding='utf-8').read()   # read the whole file as one string for jieba
Install and use jieba for Chinese word segmentation.
pip install jieba
import jieba
words = jieba.lcut(news)   # lcut() already returns a list of tokens
Generate the word frequency counts.
Sort by frequency.
Exclude grammar-type words: pronouns, articles, conjunctions.
Output the top 20 most frequent words (or save the result to a file); a sketch covering these four steps follows below.
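The original post gives no code for these four steps. The following is a minimal sketch that reuses the pattern from the English section, assuming words is the token list produced by jieba.lcut() above; the stopwords set and the top20_zh.txt output filename are illustrative assumptions only.

# count the frequency of each token produced by jieba.lcut()
word_dict = {}
for w in words:
    word_dict[w] = word_dict.get(w, 0) + 1

# exclude grammar-type words and punctuation left over after segmentation
# (this stop-word set is only an example; extend it for real text)
stopwords = {'的', '了', '和', '是', '在', '，', '。', ' ', '\n'}
for w in stopwords:
    word_dict.pop(w, None)

# sort by frequency, highest first
word_list = list(word_dict.items())
word_list.sort(key=lambda x: x[1], reverse=True)

# output the top 20, and also save them to a file
for w, n in word_list[:20]:
    print(w, n)
with open('top20_zh.txt', 'w', encoding='utf-8') as f:
    for w, n in word_list[:20]:
        f.write('%s %d\n' % (w, n))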
Publish the code and screenshots of the running results on the blog.