Counting word occurrences in any English plain-text file
Posted by wanlifeipeng
Version 1: inefficient
path = 'test.txt'
with open(path, encoding='utf-8', newline='') as f:
    word = []  # characters of the word currently being built
    words_dict = {}
    for letter in f.read():
        if letter.isalnum():
            word.append(letter)
        elif letter.isspace():  # whitespace: space, \t, \n
            if word:
                word = ''.join(word).lower()  # normalize to lowercase
                if word not in words_dict:
                    words_dict[word] = 1
                else:
                    words_dict[word] += 1
                word = []
        # any other character (punctuation) is skipped

    # handle the last word, which may not be followed by whitespace
    if word:
        word = ''.join(word).lower()  # normalize to lowercase
        if word not in words_dict:
            words_dict[word] = 1
        else:
            words_dict[word] += 1
        word = []

for k, v in words_dict.items():
    print(k, v)
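The character scan in Version 1 can also be written as a function over a string, which makes it easier to try out. The following is only a minimal sketch: the name `count_words_by_char` is hypothetical, and unlike the code above it treats every non-alphanumeric character (not just whitespace) as a word boundary, so `a,b` counts as two words rather than being merged into `ab`.

```python
def count_words_by_char(text):
    """Count words by scanning characters one at a time.

    Assumption: any non-alphanumeric character ends the current word.
    """
    counts = {}
    current = []  # characters of the word being built

    def flush():
        # Record the word collected so far, if any, then reset the buffer.
        if current:
            word = "".join(current).lower()
            counts[word] = counts.get(word, 0) + 1
            current.clear()

    for ch in text:
        if ch.isalnum():
            current.append(ch)
        else:
            flush()
    flush()  # the text may end without a trailing separator
    return counts


print(count_words_by_char("Hello, world! hello"))  # {'hello': 2, 'world': 1}
```

Using `dict.get(word, 0) + 1` removes the explicit "is the key already present" branch, and the small `flush` helper avoids repeating the same logic for the last word.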
Version 2:
Drawback: the whole file is read into memory at once, so performance suffers on large files.
import re

path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
    data = f.read()

word_reg = re.compile(r'\w+')
# word_reg = re.compile(r'\w+\b')
word_list = word_reg.findall(data)
word_list = [word.lower() for word in word_list]  # normalize to lowercase
word_set = set(word_list)  # count each distinct word only once

# words_dict = {}
# for word in word_set:
#     words_dict[word] = word_list.count(word)
# concise form
words_dict = {word: word_list.count(word) for word in word_set}

for k, v in words_dict.items():
    print(k, v)
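Besides the memory cost, Version 2 calls `word_list.count(word)` once per distinct word, and each call rescans the whole list. A possible alternative, sketched here with the standard library's `collections.Counter`, counts everything in a single pass. The function name `count_words` is hypothetical, and note that `\w` also matches digits and underscores.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"\w+")


def count_words(text):
    """Return a Counter mapping each lowercased word to its frequency."""
    # Counter consumes the generator once, so the text is scanned a single time.
    return Counter(word.lower() for word in WORD_RE.findall(text))


counts = count_words("The cat and the hat")
print(counts.most_common(1))  # [('the', 2)]
```

`Counter` is a `dict` subclass, so the `for k, v in counts.items()` loop used above works unchanged.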
Version 3:
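import re

path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
    word_list = []
    word_reg = re.compile(r'\w+')
    for line in f:  # read line by line instead of loading the whole file
        # line_words = word_reg.findall(line)
        # split() is simpler than the regex above, but it leaves
        # punctuation attached to words and does not lowercase them
        line_words = line.split()
        word_list.extend(line_words)

word_set = set(word_list)  # count each distinct word only once
words_dict = {word: word_list.count(word) for word in word_set}

for k, v in words_dict.items():
    print(k, v)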
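Version 3 reads the file line by line, but it still collects every word into `word_list` before counting, so memory use grows with the file. One way to keep only the running totals is to update a `Counter` per line. This is a sketch under assumptions, not the author's method: the names `count_words_in_lines` and `count_words_in_file` are hypothetical, and it uses the regex with lowercasing rather than `split()`.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"\w+")


def count_words_in_lines(lines):
    """Count lowercased words from any iterable of text lines."""
    counts = Counter()
    for line in lines:
        # Only the per-word totals are kept; each line is discarded after use.
        counts.update(word.lower() for word in WORD_RE.findall(line))
    return counts


def count_words_in_file(path, encoding="utf-8"):
    """Stream a text file line by line and count its words."""
    with open(path, "r", encoding=encoding) as f:
        return count_words_in_lines(f)
```

Calling `count_words_in_file('test.txt')` and iterating over `.items()` gives the same kind of output as the versions above, while memory stays proportional to the number of distinct words.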