Merge data frame with other and calculate groupby percentage based on the specific condition
【Posted】: 2021-09-10 13:06:03
【Question】: I have two data frames, as shown below.
df1:
Sports    Expected_%
Cricket   70
Football  20
Tennis    10
df2:
Region  Sports    Count  Percentage
North   Cricket   800    75
North   Football  50     5
North   Tennis    150    20
South   Cricket   1300   65
South   Football  550    27.5
South   Tennis    150    7.5
Expected output:
Region  Sports    Count  Percentage  Expected_%  Expected_count
North   Cricket   800    75          70          700
North   Football  50     5           20          200
North   Tennis    150    20          10          100
South   Cricket   1300   65          70          1400
South   Football  550    27.5        20          400
South   Tennis    150    7.5         10          200
Explanation:
Expected_% for Cricket = 70
Total Count for North = 1000
Expected_Count for North = 1000*70/100 = 700
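To make the question reproducible, the two sample frames can be rebuilt as below. This is only a sketch: the column dtypes and the variable names `region_totals` and `north_cricket` are assumptions, since the question shows the frames as printed tables rather than as code. It also reproduces the arithmetic from the explanation for the North/Cricket row.

```python
import pandas as pd

# Sample data copied from the tables in the question.
df1 = pd.DataFrame({
    "Sports": ["Cricket", "Football", "Tennis"],
    "Expected_%": [70, 20, 10],
})
df2 = pd.DataFrame({
    "Region": ["North", "North", "North", "South", "South", "South"],
    "Sports": ["Cricket", "Football", "Tennis"] * 2,
    "Count": [800, 50, 150, 1300, 550, 150],
    "Percentage": [75, 5, 20, 65, 27.5, 7.5],
})

# Total Count per region: the base that each Expected_% is applied to.
region_totals = df2.groupby("Region")["Count"].sum()

# Expected count for Cricket in North, following the explanation above.
north_cricket = region_totals["North"] * 70 / 100

print(region_totals)
print(north_cricket)
```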
【Comments】:
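【Answer 1】: Use DataFrame.merge with a left join to add the new column, then use GroupBy.transform with sum to build a Series of per-region totals, multiply it by the new column, and divide by 100: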
df = df2.merge(df1, on='Sports', how='left')
summed = df.groupby('Region')['Count'].transform('sum')
df['Expected_count'] = summed.mul(df['Expected_%']).div(100)
print(df)
  Region    Sports  Count  Percentage  Expected_%  Expected_count
0  North   Cricket    800        75.0          70           700.0
1  North  Football     50         5.0          20           200.0
2  North    Tennis    150        20.0          10           100.0
3  South   Cricket   1300        65.0          70          1400.0
4  South  Football    550        27.5          20           400.0
5  South    Tennis    150         7.5          10           200.0
Or use Series.map to create the new column:
df2['Expected_%']= df2['Sports'].map(df1.set_index('Sports')['Expected_%'])
summed = df2.groupby('Region')['Count'].transform('sum')
df2['Expected_count'] = summed.mul(df2['Expected_%']).div(100)
print(df2)
  Region    Sports  Count  Percentage  Expected_%  Expected_count
0  North   Cricket    800        75.0          70           700.0
1  North  Football     50         5.0          20           200.0
2  North    Tennis    150        20.0          10           100.0
3  South   Cricket   1300        65.0          70          1400.0
4  South  Football    550        27.5          20           400.0
5  South    Tennis    150         7.5          10           200.0
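As a self-contained sketch of the merge-based approach, the steps can be wrapped in a small helper. The function name `add_expected_count` and its parameter names are hypothetical, not part of the answer. The `validate="many_to_one"` argument is an addition here: it asks pandas to check that each sport appears only once in the lookup frame, which the answer takes for granted.

```python
import pandas as pd


def add_expected_count(counts, targets):
    """Hypothetical helper: return a copy of `counts` with Expected_% and Expected_count."""
    # Left join keeps every row of `counts`; validate guards against
    # duplicate sports in `targets`, which would silently duplicate rows.
    merged = counts.merge(targets, on="Sports", how="left", validate="many_to_one")
    # Broadcast each region's total Count back onto its rows.
    region_total = merged.groupby("Region")["Count"].transform("sum")
    merged["Expected_count"] = region_total * merged["Expected_%"] / 100
    return merged


# Sample data copied from the question.
df1 = pd.DataFrame({
    "Sports": ["Cricket", "Football", "Tennis"],
    "Expected_%": [70, 20, 10],
})
df2 = pd.DataFrame({
    "Region": ["North", "North", "North", "South", "South", "South"],
    "Sports": ["Cricket", "Football", "Tennis"] * 2,
    "Count": [800, 50, 150, 1300, 550, 150],
    "Percentage": [75, 5, 20, 65, 27.5, 7.5],
})

result = add_expected_count(df2, df1)
print(result)
```

Because merge returns a new frame, df2 itself is left unchanged. A sport present in df2 but missing from df1 would get NaN in both new columns, and a sport listed twice in df1 would raise pandas.errors.MergeError instead of duplicating rows.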
【Comments】:
【Answer 2】: Another way:
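# Build a Sports -> Expected_% lookup (assumes df1 has exactly these two columns)
map_dict = dict(df1.values)
# .values assigns by position, so this assumes df2 rows are already ordered by Region
df2['Expected_count'] = df2.groupby('Region').apply(lambda x: (x['Count'].sum() * x['Sports'].map(map_dict))).div(100).values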
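【Comments】:
Performance... avoid this, because it is slow.
@jezrael Ohh!!.. I haven't checked the performance of this!!
Yes, it depends on the data. I would guess 10 times slower, but maybe more.
A simple general rule: native pandas functions are fast, custom functions are not. If a somewhat complex function is called for each group, performance drops because of groupby.apply, because of the complex function, and because it is called N times (the number of groups).
@jezrael Makes sense!!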