Composite Data Types and English Word Frequency Statistics

Posted 2021-02-13 chenshijiong


1. How to add, delete, modify, look up, and traverse elements in lists, tuples, dictionaries, and sets.

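List

# List (named items so that the built-in list type is not shadowed)
items = ['a', 'b', 'hello', 1]

# Add: append() adds an element at the end; insert() places one at the given index
items.append(2)
items.insert(0, '0')
print(items)  # ['0', 'a', 'b', 'hello', 1, 2]

# Delete: remove the element at the given index
del items[2]
print(items)  # ['0', 'a', 'hello', 1, 2]

# Modify: replace the element at the given index
items[2] = 'hi'
print(items)  # ['0', 'a', 'hi', 1, 2]

# Look up: print the element at an index, then the whole list
print(items[0])  # 0
print(items)     # ['0', 'a', 'hi', 1, 2]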
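The heading also asks about traversal, which the snippet above does not show. The following is a minimal sketch of traversing a list and of a few other common operations; the variable name `letters` and its contents are illustrative sample data, not taken from the original post.

```python
# Hypothetical sample data for illustration
letters = ['a', 'b', 'hello', 1]

# Traverse the values only
seen = []
for value in letters:
    seen.append(value)

# Traverse index and value together with enumerate()
pairs = [(index, value) for index, value in enumerate(letters)]

# Look up: membership test and position of the first match
has_hello = 'hello' in letters
hello_at = letters.index('hello')

# Delete by value, or delete the last element and get it back
letters.remove('b')
last = letters.pop()
print(letters)
```

`remove()` raises `ValueError` when the value is absent, so a membership test with `in` is a reasonable guard before calling it.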

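Tuple

Add: not supported (tuples are immutable)

Delete: not supported

Modify: not supported

Look up:

tup = ('hi', 'a', 'b')
print(tup[0])  # hi
print(tup)     # ('hi', 'a', 'b')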
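To make the "not supported" entries concrete, here is a small sketch of what happens when code tries to change a tuple, together with traversal and unpacking. The names `changed`, `longer`, and `upper` are introduced only for this illustration.

```python
tup = ('hi', 'a', 'b')

# Item assignment is rejected because tuples are immutable
try:
    tup[0] = 'hello'
    changed = True
except TypeError:
    changed = False

# "Adding" builds a new tuple; the original is left untouched
longer = tup + ('c',)

# Traversal and unpacking work the same way as for lists
upper = [s.upper() for s in tup]
first, second, third = tup
print(changed, longer, upper)
```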

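Dictionary

# Dictionary (named d so that the built-in dict type is not shadowed)
d = {'a': 1, 'b': 2, 'c': 3}

# Add: assigning to a new key inserts it
d['d'] = 4

# Delete: remove a key together with its value
del d['a']

# Modify: assigning to an existing key overwrites its value
# (assigning to 'a' here would re-add the key that was just deleted)
d['b'] = 10086

# Look up: by key, then print the whole dictionary
print(d['c'])  # 3
print(d)       # {'b': 10086, 'c': 3, 'd': 4}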
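Dictionary traversal and safe lookup are not covered above, so the following sketch shows one way to do both. The dictionary `scores` and its contents are hypothetical sample data.

```python
# Hypothetical sample data for illustration
scores = {'a': 1, 'b': 2, 'c': 3}

# Safe lookup: get() returns None (or a default) instead of raising KeyError
missing = scores.get('z')
fallback = scores.get('z', 0)

# Delete a key and get its value back
removed = scores.pop('a')

# Traverse keys, key/value pairs, and values
keys = list(scores)
lines = [f'{key}={value}' for key, value in scores.items()]
total = sum(scores.values())
print(keys, lines, total)
```

Indexing with `scores['z']` would raise `KeyError`, which is why `get()` is often preferred when a key may be absent.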

Set

jihe = {'set1', 'set2', 'set3'}

# Add
jihe.add('set4')

# Delete (remove() raises KeyError if the element is absent)
jihe.remove('set2')
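The set section stops at add and delete. As a sketch of the remaining operations: a set has no index, so "look up" is a membership test, and "modify" amounts to removing one element and adding another. The element names below are illustrative.

```python
jihe = {'set1', 'set2', 'set3'}

# Look up: membership test (sets have no index)
has_set1 = 'set1' in jihe

# discard() ignores a missing element, while remove() raises KeyError
jihe.discard('missing')
try:
    jihe.remove('missing')
    raised = False
except KeyError:
    raised = True

# "Modify": remove the old element, then add the new one
jihe.remove('set3')
jihe.add('set30')

# Traverse: iteration order is arbitrary, so sort when a stable order is needed
ordered = sorted(jihe)

# Set algebra: intersection and difference
other = {'set1', 'set9'}
common = jihe & other
only_here = jihe - other
print(ordered, common, only_here)
```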

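2. Summarize the similarities and differences between lists, tuples, dictionaries, and sets, considering the following aspects:

- Brackets
- Ordered or unordered
- Mutable or immutable
- Duplicates allowed or not
- How elements are stored and looked up

Similarities and differences:

Brackets: lists use [ ], tuples use ( ), and dictionaries and sets both use { }. An empty { } is a dictionary; an empty set is written set().

Ordered or unordered: lists and tuples are ordered sequences that can be indexed by position. Sets are unordered. Dictionaries cannot be indexed by position, but since Python 3.7 they preserve insertion order.

Mutable or immutable: lists, dictionaries, and sets are mutable; tuples are immutable.

Duplicates: lists and tuples may contain duplicate elements. Set elements and dictionary keys are unique (dictionary values may repeat).

Storage and lookup: list and tuple elements are looked up by index, and dictionary values are looked up by key. Sets and dictionary keys are stored in hash tables, so a set element is found with a membership test (in) rather than by position.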
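The properties summarized above can be checked directly. This is a minimal sketch using a made-up word list; all variable names are illustrative.

```python
# Hypothetical sample data containing duplicates
words = ['to', 'be', 'or', 'not', 'to', 'be']

as_list = list(words)              # keeps duplicates and order
as_tuple = tuple(words)            # keeps duplicates and order, immutable
as_set = set(words)                # duplicates collapse, no defined order
as_dict = dict.fromkeys(words, 0)  # keys are unique, insertion order is kept

# Keys must be hashable: a tuple works as a key, a list does not
lookup = {('a', 1): 'tuple key works'}
try:
    lookup[['a', 1]] = 'list key'
    list_key_ok = True
except TypeError:
    list_key_ok = False

# Empty braces create a dict, not a set
empty_braces = type({})
empty_set = type(set())
print(len(as_list), len(as_set), list(as_dict), list_key_ok)
```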

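Word frequency statistics

1. Download a full-length novel and save it as a UTF-8 encoded text file (file)

2. Read the file into a string (str)

3. Preprocess the text

4. Split the text and extract the words (list)

5. Count the words in a dictionary (set, dict)

6. Sort by frequency (list.sort(key=lambda), tuple)

7. Exclude function words with no semantic content, such as pronouns, articles, and conjunctions
- Use a custom stop word list
- Or use stops.txt

8. Output the top 20

9. Visualization: word cloud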
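Steps 3 to 8 can be sketched on a short in-memory string before running them on a whole novel. The function names `tokenize`, `count_words`, and `top_n`, the sample sentence, and the tiny stop word set are all assumptions made for this illustration.

```python
import string

def tokenize(text):
    """Lowercase the text, turn ASCII punctuation into spaces, split into words."""
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    # Note: apostrophes also become spaces, so "don't" splits into "don" and "t"
    return text.lower().translate(table).split()

def count_words(words, stops):
    """Count every word that is not in the stop word collection."""
    counts = {}
    for w in words:
        if w not in stops:
            counts[w] = counts.get(w, 0) + 1
    return counts

def top_n(counts, n=20):
    """Return the n most frequent (word, count) tuples, highest count first."""
    pairs = list(counts.items())
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs[:n]

# Hypothetical sample text and stop words
sample = "The cat saw the dog. The dog saw a bird, and the bird saw the cat!"
stops = {'the', 'a', 'and'}

top = top_n(count_words(tokenize(sample), stops), n=3)
print(top)
```

Counting in a single pass with `counts.get(w, 0) + 1` avoids rescanning the word list once per distinct word.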

Save the sorted word list word to a CSV file

import pandas as pd
pd.DataFrame(data=word).to_csv('big.csv', encoding='utf-8')
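If pandas is not available, the standard library csv module can write the same kind of file. This is a sketch under assumptions: the `word` list holds made-up (word, count) tuples, the header names are invented, and the file goes to a temporary directory rather than the original location.

```python
import csv
import os
import tempfile

# Hypothetical sorted (word, count) pairs
word = [('saw', 3), ('cat', 2), ('dog', 2)]

# Write to a temporary directory so the example leaves no stray files behind
path = os.path.join(tempfile.mkdtemp(), 'big.csv')
with open(path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['word', 'count'])  # header row (assumed column names)
    writer.writerows(word)

# Read the file back to check what was written
with open(path, newline='', encoding='utf-8') as f:
    rows = list(csv.reader(f))
print(rows)
```

Values read back by `csv.reader` are always strings, so counts need `int()` before any arithmetic.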

Generate the word cloud with an online tool:
https://wordart.com/create

import pandas as pd

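# Load the stop words: function words such as pronouns, articles, and conjunctions
with open(r'C:\Users\Shinelon\Desktop\stops.txt', 'r', encoding='utf8') as f:
    stops = f.read()
# Turn newlines into spaces, strip quotes and commas, then split on whitespace
stops1 = stops.replace('\n', ' ').replace("'", '').replace('"', '').replace(',', '').lower().split()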

# Load the text of the novel
with open(r'C:\Users\Shinelon\Desktop\xiaoshuo.txt', 'r', encoding='utf8') as xiaoshuo:
    text = xiaoshuo.read()
# Lowercase, replace punctuation with spaces, then split into words
text1 = text.lower().replace('.', ' ').replace('?', ' ').replace(',', ' ').replace('"', '').split()

# Remove the stop words
text2 = set(text1) - set(stops1)

# Store each word and its number of occurrences as a key/value pair
te = {}
for w in text2:
    te[w] = text1.count(w)

# Sort by number of occurrences and keep the top 20 words
tesort = list(te.items())
tesort.sort(key=lambda x: x[1], reverse=True)
tesort2 = tesort[0:20]

for i in tesort2:
    print(i)

# Save as a CSV file
pd.DataFrame(data=tesort).to_csv(r'C:\Users\Shinelon\Desktop\1234.csv', encoding='utf-8')

[Figure: top 20 output]

[Figure: word cloud]
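The counting loop in the script calls `text1.count(w)` once per distinct word, which rescans the whole word list each time and can be slow on a full novel. One possible alternative, sketched below, is `collections.Counter`, which counts in a single pass. The short `text1` and `stops1` values here are made-up stand-ins for the lists built in the script.

```python
from collections import Counter

# Hypothetical stand-ins for the word list and stop word list built in the script
text1 = 'it was the best of times it was the worst of times'.split()
stops1 = ['it', 'was', 'the', 'of']

# A set makes each stop word check a constant-time lookup
stop_set = set(stops1)

# Count every non-stop word in one pass over the text
counts = Counter(w for w in text1 if w not in stop_set)

# most_common(n) returns (word, count) tuples sorted by count, highest first
top20 = counts.most_common(20)
for pair in top20:
    print(pair)

# Same result as the loop-based approach, computed for comparison
te = {w: text1.count(w) for w in set(text1) - stop_set}
```

`dict(counts)` and `te` hold the same word counts, so the rest of the script (sorting and saving to CSV) can stay unchanged.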
