How to count only the words in a dictionary, while returning a count of the dictionary key name
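【Title】: How to count only the words in a dictionary, while returning a count of the dictionary key name 【Posted】: 2020-03-22 01:43:29 【Question】: I want to process the text in an Excel file. First, I have to concatenate all rows into one large block of text. Then I scan the text for words that appear in a dictionary. If a word is found, it should be counted under its dictionary key name. Finally, I want to return the counts as a relational table of [word, count]. I can count the words, but I cannot get the dictionary part to work. My questions are: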
- Am I going about this the right way?
- Is it possible, and if so, how?
Code adapted from the internet:
import collections
import re
import matplotlib.pyplot as plt
import pandas as pd
#% matplotlib inline
#file = open('PrideAndPrejudice.txt', 'r')
#file = file.read()
''' Convert excel column/ rows into a string of words'''
#text_all = pd.read_excel('C:\Python_Projects\Rake\data_file.xlsx')
#df=pd.DataFrame(text_all)
#case_words= df['case_text']
#print(case_words)
#case_concat= case_words.str.cat(sep=' ')
#print (case_concat)
text_all = ("Billy was glad to see jack. Jack was estatic to play with Billy. Jack and Billy were lonely without eachother. Jack is tall and Billy is clever.")
''' done'''
import collections
import pandas as pd
import matplotlib.pyplot as plt
#% matplotlib inline
# Read input file, note the encoding is specified here
# It may be different in your text file
# Startwords
startwords = {'happy': 'glad', 'sad': 'lonely', 'big': 'tall', 'smart': 'clever'}
#startwords = startwords.union(set(['happy','sad','big','smart']))
# Instantiate a dictionary, and for every word in the file,
# Add to the dictionary if it doesn't exist. If it does, increase the count.
wordcount = {}
# To eliminate duplicates, remember to split by punctuation, and use case delimiters.
for word in text_all.lower().split():
    word = word.replace(".","")
    word = word.replace(",","")
    word = word.replace(":","")
    word = word.replace("\"","")
    word = word.replace("!","")
    word = word.replace("“","")
    word = word.replace("‘","")
    word = word.replace("*","")
    if word in startwords:
        if word in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1
# Print most common word
n_print = int(input("How many most common words to print: "))
print("\nOK. The most common words are as follows\n".format(n_print))
word_counter = collections.Counter(wordcount)
for word, count in word_counter.most_common(n_print):
print(word, ": ", count)
# Close the file
#file.close()
# Create a data frame of the most common words
# Draw a bar chart
lst = word_counter.most_common(n_print)
df = pd.DataFrame(lst, columns = ['Word', 'Count'])
df.plot.bar(x='Word',y='Count')
Error: Empty 'DataFrame': no numeric data to plot
Expected output:
happy 1
sad 1
big 1
smart 1
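【Comments】:
Hey man, thanks for sharing your code, but it would be better if you showed the raw code and the output you get after running it. Usually that is enough for others to help write a custom solution to your problem. Read How to Ask.
I am getting an error because I don't know how to solve this. I don't think I can get it to read the dictionary the way I want. I'll work on the code some more and see if I can clarify my question.
Sorry, Rob, I said raw code when I meant to say raw data. If you can provide a sample of your data, I'll do my best.
Do we need matplotlib and pandas to count words? Does it work correctly on a small sample that you can include in the question? Also, startwords has an error in it.
This is code I found online. It was close to what I need, so I thought I would adapt it. But adapting it doesn't seem to be working for me.. LOL. I need pandas because the real data is in a relational database. That part I get. It is counting the words from the dictionary that I'm having trouble with. Matplotlib lets me plot the data in a bar chart (Pareto).
【Answer 1】:
Here is an approach that works with the latest version of pandas (0.25.3 at the time of writing):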
import pandas as pd

# Setup
df = pd.DataFrame({'case_text': ["Billy was glad to see jack. Jack was estatic to play with Billy. Jack and Billy were lonely without eachother. Jack is tall and Billy is clever."]})
startwords = {"happy": ["glad", "estatic"],
              "sad": ["depressed", "lonely"],
              "big": ["tall", "fat"],
              "smart": ["clever", "bright"]}

# First you need to rearrange your startwords dict so that each word maps to its key
startwords_map = {w: k for k, v in startwords.items() for w in v}

(df['case_text'].str.lower()                  # casts to lower case
 .str.replace(r'[.,*!?:]', '', regex=True)    # removes punctuation and special characters
 .str.split()                                 # splits the text on whitespace
 .explode()                                   # expands into a single pandas.Series of words
 .map(startwords_map)                         # maps the words to the startwords keys (NaN if absent)
 .value_counts()                              # counts word occurrences (NaN is dropped)
 .to_dict())                                  # outputs to dict
[out]
{'happy': 2, 'big': 1, 'smart': 1, 'sad': 1}
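If pandas is not available, the same idea can be sketched with the standard library alone. This is only an illustration and not part of the answer above: the helper name `count_keys` and the tokenizing regex are assumptions introduced here, while the sample text and `startwords` are copied from the setup.

```python
import re
from collections import Counter

# Same setup as above: each key maps to the words that should count toward it.
startwords = {
    "happy": ["glad", "estatic"],
    "sad": ["depressed", "lonely"],
    "big": ["tall", "fat"],
    "smart": ["clever", "bright"],
}

# Invert the dict so that each word points back to its key.
word_to_key = {w: k for k, words in startwords.items() for w in words}


def count_keys(text, mapping):
    """Count how often words from `mapping` occur, grouped by their key."""
    # Lower-case the text and keep runs of letters/apostrophes as tokens,
    # which drops punctuation without a chain of replace() calls.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(mapping[t] for t in tokens if t in mapping)


text = ("Billy was glad to see jack. Jack was estatic to play with Billy. "
        "Jack and Billy were lonely without eachother. "
        "Jack is tall and Billy is clever.")

counts = count_keys(text, word_to_key)
rows = counts.most_common()  # list of (key, count) pairs, highest count first
print(rows)
```

The `rows` list is already in [word, count] form, so it can be passed to `pd.DataFrame(rows, columns=['Word', 'Count'])` when a table or bar chart is needed.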
【Comments】:
【Answer 2】:
    if word in startwords:
        if word in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1
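There seems to be a problem with this part. It checks whether word is in startwords and then checks wordcount. If word is already in wordcount, your logic should increment the count, but the code resets it to 1 instead, and it tries to increment a key that does not exist yet when the word is new. So I believe you have to swap the two branches: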
    if word in wordcount:
        # already in the dict, increment the count
        wordcount[word] += 1
    else:
        # first time seen, set to 1
        wordcount[word] = 1
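Combining the corrected branch with a reverse lookup gives a loop that counts only dictionary words and reports them under their key names. This is a minimal sketch, assuming the one-to-one startwords dict from the question; `key_for_word` and `table` are hypothetical names introduced here, and `str.strip(string.punctuation)` stands in for the chain of replace() calls.

```python
import string

# The one-to-one dict from the question: key name -> word to look for.
startwords = {'happy': 'glad', 'sad': 'lonely', 'big': 'tall', 'smart': 'clever'}

# Reverse it so a word found in the text can be looked up directly.
key_for_word = {word: key for key, word in startwords.items()}

text_all = ("Billy was glad to see jack. Jack was estatic to play with Billy. "
            "Jack and Billy were lonely without eachother. "
            "Jack is tall and Billy is clever.")

wordcount = {}
for word in text_all.lower().split():
    # Remove leading and trailing punctuation such as "." or ",".
    word = word.strip(string.punctuation)
    key = key_for_word.get(word)
    if key is None:
        continue  # not a dictionary word, skip it
    if key in wordcount:
        wordcount[key] += 1  # seen before, increment
    else:
        wordcount[key] = 1   # first time, set to 1

# Relational [word, count] form, sorted by key name.
table = sorted(wordcount.items())
print(table)  # [('big', 1), ('happy', 1), ('sad', 1), ('smart', 1)]
```

Because the lookup goes through `key_for_word`, the test `if word in startwords` from the question is no longer needed: that test compared text words against the key names ('happy', 'sad', ...), which never occur in the sample text.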
【Comments】:
Thanks, that does fix the overall word count problem. Now I need to get it to count only the words in the dictionary and return the counts by key.