Merge data frame with other and calculate groupby percentage based on the specific condition

Posted: 2021-09-10 13:06:03

【Question】:

I have two data frames, as shown below.

df1:

Sports     Expected_%
Cricket    70
Football   20
Tennis     10

df2:

Region    Sports     Count    Percentage     
North     Cricket    800      75                              
North     Football   50       5            
North     Tennis     150      20           
South     Cricket    1300     65           
South     Football   550      27.5         
South     Tennis     150      7.5  

    

Expected output:

Region    Sports     Count    Percentage   Expected_%     Expected_count    
North     Cricket    800      75           70             700
North     Football   50       5            20             200
North     Tennis     150      20           10             100
South     Cricket    1300     65           70             1400
South     Football   550      27.5         20             400
South     Tennis     150      7.5          10             200

Explanation:

Expected_% for Cricket = 70

Total Count for North = 800 + 50 + 150 = 1000

Expected_count for Cricket in North = 1000*70/100 = 700
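
To make the example reproducible, here is a minimal sketch that builds the two frames from the tables in the question and repeats the arithmetic above for a single cell. The variable names `north_total`, `cricket_pct` and `expected_cricket_north` are illustrative and do not appear in the original post.

```python
import pandas as pd

# Sample data copied from the tables in the question.
df1 = pd.DataFrame({
    'Sports': ['Cricket', 'Football', 'Tennis'],
    'Expected_%': [70, 20, 10],
})
df2 = pd.DataFrame({
    'Region': ['North'] * 3 + ['South'] * 3,
    'Sports': ['Cricket', 'Football', 'Tennis'] * 2,
    'Count': [800, 50, 150, 1300, 550, 150],
    'Percentage': [75, 5, 20, 65, 27.5, 7.5],
})

# Total count for North: 800 + 50 + 150 = 1000.
north_total = df2.loc[df2['Region'] == 'North', 'Count'].sum()

# Expected share for Cricket, looked up from df1: 70.
cricket_pct = df1.loc[df1['Sports'] == 'Cricket', 'Expected_%'].iloc[0]

# Expected count for Cricket in North: 1000 * 70 / 100 = 700.
expected_cricket_north = north_total * cricket_pct / 100
print(expected_cricket_north)  # 700.0
```

The answers below do the same calculation for every row at once instead of one cell at a time.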

【Answer 1】:

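Use DataFrame.merge with a left join to add the new column, then use GroupBy.transform with 'sum' to build a new Series of per-region totals, multiply it by the new column, and divide by 100:

# Left join keeps every row of df2 and adds Expected_% from df1
df = df2.merge(df1, on='Sports', how='left')
# Per-region total of Count, repeated on every row of that region
summed = df.groupby('Region')['Count'].transform('sum')
df['Expected_count'] = summed.mul(df['Expected_%']).div(100)
print(df)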
  Region    Sports  Count  Percentage  Expected_%  Expected_count
0  North   Cricket    800        75.0          70           700.0
1  North  Football     50         5.0          20           200.0
2  North    Tennis    150        20.0          10           100.0
3  South   Cricket   1300        65.0          70          1400.0
4  South  Football    550        27.5          20           400.0
5  South    Tennis    150         7.5          10           200.0
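
The key step is `transform('sum')` rather than a plain `sum()`. As a small sketch of the difference (using the question's `df2` rebuilt inline; `per_region` is an illustrative name), a plain aggregation collapses each region to one row, while `transform` returns a Series with the same index as the input, so it can be multiplied row by row with `Expected_%`:

```python
import pandas as pd

df2 = pd.DataFrame({
    'Region': ['North'] * 3 + ['South'] * 3,
    'Sports': ['Cricket', 'Football', 'Tennis'] * 2,
    'Count': [800, 50, 150, 1300, 550, 150],
})

# Plain aggregation: one value per region, indexed by Region.
per_region = df2.groupby('Region')['Count'].sum()
print(per_region.to_dict())  # {'North': 1000, 'South': 2000}

# transform: same length and index as df2, each row carrying its region's total.
summed = df2.groupby('Region')['Count'].transform('sum')
print(summed.tolist())  # [1000, 1000, 1000, 2000, 2000, 2000]
```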

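Or use Series.map to create the new column:

# Look up Expected_% for each sport from a Series indexed by Sports
df2['Expected_%'] = df2['Sports'].map(df1.set_index('Sports')['Expected_%'])
# Per-region total of Count, repeated on every row of that region
summed = df2.groupby('Region')['Count'].transform('sum')
df2['Expected_count'] = summed.mul(df2['Expected_%']).div(100)
print(df2)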
  Region    Sports  Count  Percentage  Expected_%  Expected_count
0  North   Cricket    800        75.0          70           700.0
1  North  Football     50         5.0          20           200.0
2  North    Tennis    150        20.0          10           100.0
3  South   Cricket   1300        65.0          70          1400.0
4  South  Football    550        27.5          20           400.0
5  South    Tennis    150         7.5          10           200.0
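
One thing worth checking with either approach is what happens when a sport in df2 has no entry in df1. Both the left merge and Series.map leave a missing value in that case, and the missing value carries through to Expected_count. A small sketch, where the 'Hockey' row is a hypothetical addition that is not in the question's data:

```python
import pandas as pd

df1 = pd.DataFrame({
    'Sports': ['Cricket', 'Football', 'Tennis'],
    'Expected_%': [70, 20, 10],
})
# Hypothetical data: 'Hockey' has no row in df1.
df2 = pd.DataFrame({
    'Region': ['North', 'North'],
    'Sports': ['Cricket', 'Hockey'],
    'Count': [800, 200],
})

lookup = df1.set_index('Sports')['Expected_%']
df2['Expected_%'] = df2['Sports'].map(lookup)
print(df2['Expected_%'].isna().tolist())  # [False, True]

summed = df2.groupby('Region')['Count'].transform('sum')
df2['Expected_count'] = summed.mul(df2['Expected_%']).div(100)
print(df2['Expected_count'].isna().tolist())  # [False, True]
```

Whether such rows should be dropped, filled with 0, or treated as an error depends on the data, so the sketch only makes the gap visible.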

【Answer 2】:

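Another way:

# Build a {sport: expected %} lookup from the two columns of df1
map_dict = dict(df1.values)
# Assigning through .values is positional, so this assumes the result comes back in the same row order as df2
df2['Expected_count'] = df2.groupby('Region').apply(lambda x: (x['Count'].sum() * x['Sports'].map(map_dict))).div(100).values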

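【Comments】:

Performance... avoid this, because it is slow.

@jezrael Ohh!!.. I hadn't checked the performance of this!!

Yes, it depends on the data; I think about 10 times slower, but maybe more.

A simple general rule: native pandas functions are fast, custom functions are not. If a somewhat complicated function is called for each group, performance drops because of groupby.apply, because of the complicated function, and because it is called N times (once per group).

@jezrael Makes sense!!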
