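How to define a single function to sum and select data based on multiple conditions?
【Posted】: 2019-09-12 16:45:18
【Question】: I wrote a few lines of code to compute 3 types of scores (3 columns) based on 2 different conditions. My code is pasted below. The output is 3 separate columns based on the document scores for each of 5 companies. These lines work as they are, but I would like to define a function that does the same thing for each corresponding row. That way, if I add more data to the main dataset, the function would automatically pick up and sum the corresponding values for any new data.
Any help is appreciated. Thanks!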
# Article scores
amazon_article_score = keywords.loc[(keywords['Company'] == 'amazon') &
                                    (keywords['DocumentType'] == 'article'), 'Polarity'].sum()
keydata.loc[keydata.index[0], 'ArticleScore'] = amazon_article_score
boeing_article_score = keywords.loc[(keywords['Company'] == 'boeing') &
                                    (keywords['DocumentType'] == 'article'), 'Polarity'].sum()
keydata.loc[keydata.index[1], 'ArticleScore'] = boeing_article_score
target_article_score = keywords.loc[(keywords['Company'] == 'target') &
                                    (keywords['DocumentType'] == 'article'), 'Polarity'].sum()
keydata.loc[keydata.index[2], 'ArticleScore'] = target_article_score
tesla_article_score = keywords.loc[(keywords['Company'] == 'tesla') &
                                   (keywords['DocumentType'] == 'article'), 'Polarity'].sum()
keydata.loc[keydata.index[3], 'ArticleScore'] = tesla_article_score
walmart_article_score = keywords.loc[(keywords['Company'] == 'walmart') &
                                     (keywords['DocumentType'] == 'article'), 'Polarity'].sum()
keydata.loc[keydata.index[4], 'ArticleScore'] = walmart_article_score

# Blog scores
amazon_blog_score = keywords.loc[(keywords['Company'] == 'amazon') &
                                 (keywords['DocumentType'] == 'blog'), 'Polarity'].sum()
keydata.loc[keydata.index[0], 'BlogScore'] = amazon_blog_score
boeing_blog_score = keywords.loc[(keywords['Company'] == 'boeing') &
                                 (keywords['DocumentType'] == 'blog'), 'Polarity'].sum()
keydata.loc[keydata.index[1], 'BlogScore'] = boeing_blog_score
target_blog_score = keywords.loc[(keywords['Company'] == 'target') &
                                 (keywords['DocumentType'] == 'blog'), 'Polarity'].sum()
keydata.loc[keydata.index[2], 'BlogScore'] = target_blog_score
tesla_blog_score = keywords.loc[(keywords['Company'] == 'tesla') &
                                (keywords['DocumentType'] == 'blog'), 'Polarity'].sum()
keydata.loc[keydata.index[3], 'BlogScore'] = tesla_blog_score
walmart_blog_score = keywords.loc[(keywords['Company'] == 'walmart') &
                                  (keywords['DocumentType'] == 'blog'), 'Polarity'].sum()
keydata.loc[keydata.index[4], 'BlogScore'] = walmart_blog_score

# News scores
amazon_news_score = keywords.loc[(keywords['Company'] == 'amazon') &
                                 (keywords['DocumentType'] == 'news'), 'Polarity'].sum()
keydata.loc[keydata.index[0], 'NewsScore'] = amazon_news_score
boeing_news_score = keywords.loc[(keywords['Company'] == 'boeing') &
                                 (keywords['DocumentType'] == 'news'), 'Polarity'].sum()
keydata.loc[keydata.index[1], 'NewsScore'] = boeing_news_score
target_news_score = keywords.loc[(keywords['Company'] == 'target') &
                                 (keywords['DocumentType'] == 'news'), 'Polarity'].sum()
keydata.loc[keydata.index[2], 'NewsScore'] = target_news_score
tesla_news_score = keywords.loc[(keywords['Company'] == 'tesla') &
                                (keywords['DocumentType'] == 'news'), 'Polarity'].sum()
keydata.loc[keydata.index[3], 'NewsScore'] = tesla_news_score
walmart_news_score = keywords.loc[(keywords['Company'] == 'walmart') &
                                  (keywords['DocumentType'] == 'news'), 'Polarity'].sum()
keydata.loc[keydata.index[4], 'NewsScore'] = walmart_news_score
【Question comments】:
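【Answer 1】: Keep the company names from your main dataset in a list, then build a function around each calculation; wherever a company name is hard-coded, pull it from that list (or another iterable) instead.
Storing each value in its own variable, as you do here, will become hard to manage, so output the results to a dictionary with keys such as blog_score, article_score, and so on.
Then create a function that adds a new company, which essentially generates the values and puts them in the dictionary. If you expect the values to change over time, you should also have a function that updates them.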
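scores = {}

def add_company(company_name):
    # Sum the Polarity of every 'article' row that belongs to this company
    article_score = keywords.loc[(keywords['Company'] == company_name) &
                                 (keywords['DocumentType'] == 'article'), 'Polarity'].sum()
    # Note: this always writes to the first row of keydata
    keydata.loc[keydata.index[0], 'ArticleScore'] = article_score
    # Keep the score in the dictionary, keyed by company name
    scores[company_name] = {'article_score': article_score}
    return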
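The example above leaves out a lot, such as checking whether the company already exists and then either adding it to the dictionary or returning an error, but you should get the basic idea.
Edit:
Here is a complete version. You will need to replace my hand-written companies list with something that generates it dynamically.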
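As a rough illustration of the checks mentioned above, here is a minimal sketch that refuses to add a company twice and offers a separate update function. The sample `keywords` data, the helper `compute_scores`, the function `update_company`, and the choice of raising `ValueError`/`KeyError` are all assumptions made for this example, not part of the original answer.

```python
import pandas as pd

# Made-up sample data, only for illustration (not the asker's dataset).
keywords = pd.DataFrame({
    'Company': ['amazon', 'amazon', 'boeing', 'boeing'],
    'DocumentType': ['article', 'blog', 'article', 'news'],
    'Polarity': [0.5, -0.25, 0.75, 0.25],
})

DOC_TYPES = ('article', 'blog', 'news')
scores = {}


def compute_scores(company_name):
    """Hypothetical helper: sum Polarity per document type for one company."""
    rows = keywords[keywords['Company'] == company_name]
    return {
        f'{doc_type}_score': float(
            rows.loc[rows['DocumentType'] == doc_type, 'Polarity'].sum()
        )
        for doc_type in DOC_TYPES
    }


def add_company(company_name):
    """Add a company to scores, or raise if it is already there."""
    if company_name in scores:
        raise ValueError(f'{company_name!r} is already in scores')
    scores[company_name] = compute_scores(company_name)


def update_company(company_name):
    """Recompute the scores of a company that was added earlier."""
    if company_name not in scores:
        raise KeyError(f'{company_name!r} has not been added yet')
    scores[company_name] = compute_scores(company_name)


add_company('amazon')
add_company('boeing')
```

Whether a duplicate should raise, be skipped silently, or trigger an update is a design choice; raising simply makes the mistake visible early.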
# (company name, row position in keydata)
companies = [('amazon', 0), ('boeing', 1), ('target', 2), ('tesla', 3), ('walmart', 4)]

def add_company(company_tuple):
    # (DocumentType value in keywords, target column in keydata)
    score_types = [('article', 'ArticleScore'),
                   ('blog', 'BlogScore'),
                   ('news', 'NewsScore')]
    # calculate scores
    for doc_type, column in score_types:
        score = keywords.loc[(keywords['Company'] == company_tuple[0]) &
                             (keywords['DocumentType'] == doc_type), 'Polarity'].sum()
        # write the score into this company's own row
        keydata.loc[keydata.index[company_tuple[1]], column] = score
    return

for company in companies:
    add_company(company)
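One possible way to generate the company list dynamically is to iterate over keydata itself. The sketch below assumes keydata has a `Company` column with one row per company (the asker's later comment suggests it does); the sample data and the `SCORE_COLUMNS` mapping are made up for illustration.

```python
import pandas as pd

# Made-up sample data, only for illustration (not the asker's dataset).
keywords = pd.DataFrame({
    'Company': ['amazon', 'amazon', 'amazon', 'boeing', 'boeing'],
    'DocumentType': ['article', 'article', 'blog', 'article', 'news'],
    'Polarity': [0.5, 0.25, -0.25, 0.75, 0.25],
})
# Assumption: keydata holds one row per company in a 'Company' column.
keydata = pd.DataFrame({'Company': ['amazon', 'boeing', 'target']})

# DocumentType value in keywords -> target column in keydata
SCORE_COLUMNS = {'article': 'ArticleScore', 'blog': 'BlogScore', 'news': 'NewsScore'}


def add_company(company_name, row_label):
    """Write the three score sums for one company into its own row."""
    for doc_type, column in SCORE_COLUMNS.items():
        mask = ((keywords['Company'] == company_name) &
                (keywords['DocumentType'] == doc_type))
        keydata.loc[row_label, column] = keywords.loc[mask, 'Polarity'].sum()


# Series.items() yields (index label, value) pairs, so every company
# is paired with the label of its own row instead of a hand-written position.
for row_label, company_name in keydata['Company'].items():
    add_company(company_name, row_label)
```

With this loop, a company added to keydata later is picked up on the next run without editing any list by hand. A company with no matching rows in keywords gets a sum of 0.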
【Comments】:
First of all, thanks for your help! I tried what I understood from your answer, but I only get output for the first row. I think that's wrong, because I need a loop inside the function to go through all the rows. How can I do that with your function?
Correct, this only adds one row. If you want to add a set of rows, keep the company names in an iterable such as a list, then write for company in companies: and call add_company(company) for each one. You will probably want to add the other score calculations to the same function, the way I added article_score.
That's what I did, and now every row gets the same result. I'm doing:
companies = keydata['Company'].tolist()
scores = {}
def add_company(company_name):
    article_score = keywords.loc[(keywords['Company'] == company_name) & (keywords['DocumentType'] == 'article'), 'Polarity'].sum()
    keydata.loc[keydata.index, 'ArticleScore'] = article_score
    blog_score = keywords.loc[(keywords['Company'] == company_name) & (keywords['DocumentType'] == 'blog'), 'Polarity'].sum()
    keydata.loc[keydata.index, 'BlogScore'] = blog_score
    news_score = keywords.loc[(keywords['Company'] == company_name) & (keywords['DocumentType'] == 'news'), 'Polarity'].sum()
    keydata.loc[keydata.index, 'NewsScore'] = news_score
    scores[company_name] = {'article_score': article_score, 'blog_score': blog_score, 'news_score': news_score}
    return
for company in companies:
    add_company(company)
If every row gets the same result, go back and make sure the function that calculates the scores is correct. I don't have the data you're pulling from or anything else, so what I gave in my answer is generic to the question of how to factor a single function out over an iterable. You should check that the output of companies = keydata['Company'].tolist() is the expected list of companies as lowercase strings.
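A note on the "same result in every row" symptom from the comments: it is consistent with the assignment `keydata.loc[keydata.index, ...] = score`, because passing the whole index selects every row, so each call overwrites all rows and only the last company's values survive. The sketch below first reproduces that effect and then shows a different technique from the one in the answer, namely a single groupby/unstack pass instead of a per-company function. The sample data is made up, and the sketch assumes keydata has a `Company` column.

```python
import pandas as pd

# Made-up sample data, only for illustration (not the asker's dataset).
keywords = pd.DataFrame({
    'Company': ['amazon', 'amazon', 'amazon', 'boeing', 'boeing'],
    'DocumentType': ['article', 'article', 'blog', 'article', 'news'],
    'Polarity': [0.5, 0.25, -0.25, 0.75, 0.25],
})
keydata = pd.DataFrame({'Company': ['amazon', 'boeing', 'target']})

# Passing the whole index as the row selector writes to every row at once.
broken = keydata.copy()
broken.loc[broken.index, 'ArticleScore'] = 0.75  # all three rows become 0.75

# Alternative: sum Polarity per (Company, DocumentType) in one pass,
# then pivot the document types into columns.
totals = (keywords.groupby(['Company', 'DocumentType'])['Polarity']
          .sum()
          .unstack(fill_value=0))

# Align to keydata's companies and to the three expected document types;
# companies or types with no rows in keywords get 0.
totals = totals.reindex(index=keydata['Company'],
                        columns=['article', 'blog', 'news'],
                        fill_value=0)

SCORE_COLUMNS = {'article': 'ArticleScore', 'blog': 'BlogScore', 'news': 'NewsScore'}
for doc_type, column in SCORE_COLUMNS.items():
    # to_numpy() drops the Company index so values are assigned by position
    keydata[column] = totals[doc_type].to_numpy()
```

Because the sums are recomputed from keywords as a whole, new rows added to the main dataset are included the next time this runs.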