Counting Word Frequencies in 七里香 (Qi Li Xiang)
Posted zmxpython
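import jieba

# Read the lyrics; qi.txt must be in the current directory
with open("qi.txt", "r", encoding="utf-8") as f:
    hua = f.read()

word = jieba.cut(hua)  # segment the text into words
count = {}             # word -> number of occurrences
for ci in word:
    if ci in count:
        count[ci] += 1
    else:
        count[ci] = 1

# Sort by number of occurrences, highest first
sorted_counts = sorted(count.items(), key=lambda x: x[1], reverse=True)

# Print the result
for ci, cishu in sorted_counts:
    print(ci, cishu)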
"""
Copy the text below into a file named qi.txt in the same directory as this script and save it.
窗外的麻雀在电线杆上多嘴
你说这一句很有夏天的感觉
手中的铅笔在纸上来来回回
我用几行字形容你是我的谁
秋刀鱼的滋味猫跟你都想了解
初恋的香味就这样被我们寻回
那温暖的阳光像刚摘的鲜艳草莓
你说你舍不得吃掉这一种感觉
雨下整夜我的爱溢出就像雨水
院子落叶跟我的思念厚厚一叠
几句是非也无法将我的热情冷却
你出现在我诗的每一页
雨下整夜我的爱溢出就像雨水
窗台蝴蝶像诗里纷飞的美丽章节
我接着写把永远爱你写进诗的结尾
你是我唯一想要的了解
雨下整夜我的爱溢出就像雨水
院子落叶跟我的思念厚厚一叠
几句是非也无法将我的热情冷却
你出现在我诗的每一页
那饱满的稻穗幸福了这个季节
而你的脸颊像田里熟透的番茄
你突然对我说七里香的名字很美
我此刻却只想亲吻你倔强的嘴
雨下整夜我的爱溢出就像雨水
院子落叶跟我的思念厚厚一叠
几句是非也无法将我的热情冷却
你出现在我诗的每一页
整夜 我的爱溢出就像雨水
窗台蝴蝶像诗里纷飞的美丽章节
我接着写把永远爱你写进诗的结尾
你是我唯一想要的了解
"""
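The manual dictionary loop in the script above can also be written with `collections.Counter` from the standard library. The sketch below is only an illustration, not the author's code: the helper name `count_tokens` and the sample token list are made up, and the hand-segmented tokens stand in for the output of `jieba.cut` so that the block runs without jieba installed.

```python
from collections import Counter

def count_tokens(tokens):
    """Return (token, count) pairs sorted from most to least frequent."""
    counter = Counter(tokens)
    # most_common() sorts by count, highest first
    return counter.most_common()

# Hypothetical, hand-segmented sample standing in for jieba.cut(hua)
sample_tokens = ["雨水", "我", "的", "爱", "我", "的", "诗", "的"]
for ci, cishu in count_tokens(sample_tokens):
    print(ci, cishu)
```

In the real script, `count_tokens(jieba.cut(hua))` would replace both the counting loop and the `sorted` call.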
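Word frequency counting in Python
- Task: given an article, which words appear in it, and which appear most often?
English text word frequency
English text: analyzing word frequency in Hamlet
Counting English word frequencies takes two steps:
- Remove noise from the text and normalize it
- Represent the word frequencies with a dictionary
Code: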
#CalHamletV1.py
def getText():
    # Assumes hamlet.txt is in the current directory
    with open("hamlet.txt", "r") as f:
        txt = f.read()
    txt = txt.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, " ")  # replace special characters with spaces
    return txt

hamletTxt = getText()
words = hamletTxt.split()
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
Output:
the 1138
and 965
to 754
of 669
you 550
i 542
a 542
my 514
hamlet 462
in 436
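The noise removal and normalization step can also be done in one pass with `str.translate` instead of one `replace` call per character. This is a minimal sketch under assumptions: the function name `normalize` and the sample sentence are hypothetical, and the punctuation list mirrors the one in `getText` (it deliberately leaves the apostrophe alone, as the original does).

```python
def normalize(text, punctuation='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'):
    """Lowercase text and replace each listed punctuation character with a space."""
    # Map every punctuation character to a single space
    table = str.maketrans(punctuation, " " * len(punctuation))
    return text.lower().translate(table)

# Hypothetical sample sentence
sample = "To be, or not to be: that is the question."
print(normalize(sample).split())
```

The result of `normalize(...)` can be passed to `split()` and counted exactly as in CalHamletV1.py.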
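Chinese text word frequency
Chinese text: analyzing the characters in Romance of the Three Kingdoms (《三国演义》)
Counting Chinese word frequencies takes two steps:
- Segment the Chinese text into words
- Represent the word frequencies with a dictionary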
#CalThreeKingdomsV1.py
import jieba

with open("threekingdoms.txt", "r", encoding="utf-8") as f:
    txt = f.read()
words = jieba.lcut(txt)  # segment the text into a list of words
counts = {}
for word in words:
    if len(word) == 1:
        continue  # skip single-character tokens
    else:
        counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
Output:
曹操 953
孔明 836
将军 772
却说 656
玄德 585
关公 510
丞相 491
二人 469
不可 440
荆州 425
玄德曰 390
孔明曰 390
不能 384
如此 378
张飞 358
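The output clearly contains entries that are irrelevant (words that are not character names) or duplicated (different names for the same character).
Improved version
Counting Chinese word frequencies now takes three steps:
- Segment the Chinese text into words
- Represent the word frequencies with a dictionary
- Extend the program to solve the problem
We put the irrelevant entries in the excludes set so they are removed, and merge the different names for the same character into one.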
#CalThreeKingdomsV2.py
import jieba

excludes = {"将军","却说","荆州","二人","不可","不能","如此"}
with open("threekingdoms.txt", "r", encoding="utf-8") as f:
    txt = f.read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    # Skip single-character tokens, then merge names for the same character
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in excludes:
    counts.pop(word, None)  # no KeyError if the word never appeared
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
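The chain of `elif` branches grows with every new alias. One possible alternative, sketched below, keeps the aliases in a dictionary and filters excluded words while counting. The names `ALIASES`, `EXCLUDES`, `count_names`, and the sample token list are hypothetical; the alias pairs and excluded words are copied from CalThreeKingdomsV2.py, and the hand-written tokens stand in for `jieba.lcut(txt)` so the block runs without jieba.

```python
# Alias -> canonical name, taken from the elif chain in CalThreeKingdomsV2.py
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}
EXCLUDES = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}

def count_names(tokens, aliases=ALIASES, excludes=EXCLUDES):
    """Count tokens after merging aliases and dropping excluded words."""
    counts = {}
    for token in tokens:
        if len(token) == 1:
            continue  # skip single-character tokens
        name = aliases.get(token, token)  # fall back to the token itself
        if name in excludes:
            continue
        counts[name] = counts.get(name, 0) + 1
    return sorted(counts.items(), key=lambda x: x[1], reverse=True)

# Hypothetical tokens standing in for jieba.lcut(txt)
sample_tokens = ["玄德", "曰", "孔明", "将军", "玄德曰", "诸葛亮", "刘备"]
print(count_names(sample_tokens))
```

Adding a new alias or excluded word then only requires editing the data, not the loop.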
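Word frequency in postgraduate entrance exam English
Applying word frequency counting to the English papers of the postgraduate entrance exam (考研英语) lets us find the key words that appear most often.
Text link: https://pan.baidu.com/s/1Q6uVy-wWBpQ0VHvNI_DQxA password: fw3r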
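# CalHamletV1.py
def getText():
    # Assumes 86_17_1_2.txt is in the current directory
    with open("86_17_1_2.txt", "r") as f:
        txt = f.read()
    txt = txt.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, " ")  # replace special characters with spaces
    return txt

pyTxt = getText()      # text with the listed punctuation replaced by spaces
words = pyTxt.split()  # list of words
counts = {}            # dictionary of word -> count
# Words to skip when counting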
excludes = {"the", "a", "of", "to", "and", "in", "b", "c", "d", "is", "was", "are", "have", "were", "had", "that", "for", "it", "on", "be", "as", "with", "by", "not", "their", "they", "from", "more", "but", "or", "you", "at", "has", "we", "an", "this", "can", "which", "will", "your", "one", "he", "his", "all", "people", "should", "than", "points", "there", "i", "what", "about", "new", "if", "”", "its", "been", "part", "so", "who", "would", "answer", "some", "our", "may", "most", "do", "when", "1", "text", "section", "2", "many", "time", "into", "10", "no", "other", "up", "following", "【答案】", "only", "out", "each", "much", "them", "such", "world", "these", "sheet", "life", "how", "because", "3", "even", "work", "directions", "use", "could", "now", "first", "make", "years", "way", "20", "those", "over", "also", "best", "two", "well", "15", "us", "write", "4", "5", "being", "social", "read", "like", "according", "just", "take", "paragraph", "any", "english", "good", "after", "own", "year", "must", "american", "less", "her", "between", "then", "children", "before", "very", "human", "long", "while", "often", "my", "too", "40", "four", "research", "author", "questions", "still", "last", "business", "education", "need", "information", "public", "says", "passage", "reading", "through", "women", "she", "health", "example", "help", "get", "different", "him", "mark", "might", "off", "job", "30", "writing", "choose", "words", "economic", "become", "science", "society", "without", "made", "high", "students", "few", "better", "since", "6", "rather", "however", "great", "where", "culture", "come", "both", "three", "same", "government", "old", "find", "number", "means", "study", "put", "8", "change", "does", "today", "think", "future", "school", "yet", "man", "things", "far", "line", "7", "13", "50", "used", "states", "down", "12", "14", "16", "end", "11", "making", "9", "another", "young", "system", "important", "letter", "17", "chinese", "every", "see", "s", "test", "word", 
"century", "language", "little", "give", "said", "25", "state", "problems", "sentence", "food", "translation", "given", "child", "18", "longer", "question", "back", "don’t", "19", "against", "always", "answers", "know", "having", "among", "instead", "comprehension", "large", "35", "want", "likely", "keep", "family", "go", "why", "41", "home", "law", "place", "look", "day", "men", "22", "26", "45", "it’s", "others", "companies", "countries", "once", "money", "24", "though", "27", "29", "31", "say", "national", "ii", "23", "based", "found", "28", "32", "past", "living", "university", "scientific", "–", "36", "38", "working", "around", "data", "right", "21", "jobs", "33", "34", "possible", "feel", "process", "effect", "growth", "probably", "seems", "fact", "below", "37", "39", "history", "technology", "never", "sentences", "47", "true", "scientists", "power", "thought", "during", "48", "early", "parents", "something", "market", "times", "46", "certain", "whether", "000", "did", "enough", "problem", "least", "federal", "age", "idea", "learn", "common", "political", "pay", "view", "going", "attention", "happiness", "moral", "show", "live", "until", "52", "49", "ago", "percent", "stress", "43", "44", "42", "meaning", "51", "e", "iii", "u", "60", "anything", "53", "55", "cultural", "nothing", "short", "100", "water", "car", "56", "58", "【解析】", "54", "59", "57", "v", "。","63", "64", "65", "61", "62", "66", "70", "75", "f", "【考点分析】", "67", "here", "68", "71", "72", "69", "73", "74", "选项a", "ourselves", "teachers", "helps", "参考范文", "gdp", "yourself", "gone", "150"}
for word in words:
    if word not in excludes:
        counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for word, count in items[:10]:
    print("{0:<10}{1:>5}".format(word, count))
x = len(counts)  # number of distinct words left after filtering
print(x)

# Print further words 100 at a time, quoted and comma-separated so they
# can be pasted into excludes. Slicing avoids an IndexError at the end.
r = 0
answer = input("Enter 1 to continue: ")
while answer.strip() == "1":
    r += 100
    for word, count in items[r:r + 100]:
        print('"{}"'.format(word), end=", ")
    answer = input("Enter 1 to continue: ")
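The paging loop at the end of the script steps through the sorted list 100 entries at a time. A small generator can do the same slicing without manual index arithmetic. The sketch below is hypothetical throughout: the name `batches`, its `size` and `start` parameters, and the sample data are all made up for illustration.

```python
def batches(items, size=100, start=0):
    """Yield consecutive slices of items, each holding at most size entries."""
    for begin in range(start, len(items), size):
        yield items[begin:begin + size]

# Hypothetical stand-in for the sorted (word, count) list
sample_items = [("word{}".format(i), 1) for i in range(250)]
for batch in batches(sample_items):
    # Same quoted, comma-separated format as the script prints
    print(", ".join('"{}"'.format(word) for word, _ in batch))
```

Because the last slice is simply shorter, the loop stops cleanly when the list runs out.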