pandas groupby: aggregating with a custom function over multiple columns

Posted 2023-03-11

【Title】pandas groupby aggregate customised function with multiple columns 【Posted】2019-06-08 13:26:15 【Question】:

I am trying to use a custom function with groupby in pandas. I found that apply lets me do it in the following way:

(Example: computing a new average from two groups)

import pandas as pd

def newAvg(x):
    x['cm'] = x['count']*x['mean']
    sCount = x['count'].sum()
    sMean = x['cm'].sum()
    return sMean/sCount

data = [['A', 4, 2.5], ['A', 3, 6], ['B', 4, 9.5], ['B', 3, 13]]
df = pd.DataFrame(data, columns=['pool', 'count', 'mean'])

df_gb = df.groupby(['pool']).apply(newAvg)

Is it possible to integrate this into the agg function? Something along these lines:

# pseudocode, not valid pandas
df.groupby(['pool']).agg({'count': sum, ['count', 'mean']: apply(newAvg)})

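【Answer 1】:

agg processes each column separately, so one possible solution is to first create the cm column with assign, then aggregate with sum, and finally divide one column by the other:

df_gb = df.assign(cm=df['count']*df['mean']).groupby('pool')[['cm','count']].sum()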
print (df_gb)
        cm  count
pool             
A     28.0      7
B     77.0      7

out = df_gb.pop('cm') / df_gb.pop('count')
print (out)
pool
A     4.0
B    11.0
dtype: float64
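
The same assign, sum, divide steps can also be written as one chain that keeps the summed count next to the new average, which is closer to what the question asks for. This is only a sketch: it assumes a pandas version that supports named aggregation (keyword arguments of the form name=(column, function) in agg), and the names summary and newAvg for the result and the output column are chosen here for illustration.

```python
import pandas as pd

data = [['A', 4, 2.5], ['A', 3, 6], ['B', 4, 9.5], ['B', 3, 13]]
df = pd.DataFrame(data, columns=['pool', 'count', 'mean'])

summary = (
    df.assign(cm=df['count'] * df['mean'])            # per-row weighted value
      .groupby('pool')
      .agg(count=('count', 'sum'), cm=('cm', 'sum'))  # one sum per column
      .assign(newAvg=lambda d: d['cm'] / d['count'])  # divide the two sums
      .drop(columns='cm')                             # drop the helper column
)
print(summary)
```

Each agg entry still works on a single column; the multi-column part happens before the groupby (assign) and after it (the division).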

【Answer 2】:

Passing a dictionary to agg performs a separate calculation on each Series. For your problem I would suggest pd.concat:

g = df.groupby('pool')
res = pd.concat([g['count'].sum(), g.apply(newAvg).rename('newAvg')], axis=1)

print(res)

#       count  newAvg
# pool               
# A         7     4.0
# B         7    11.0

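This is not the most efficient solution, because your function newAvg performs calculations that could have been done on the whole DataFrame up front, but it does support any pre-defined calculation.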
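
To illustrate the efficiency remark, here is a rough sketch of the same result built without calling a Python function once per group: the product count*mean is computed once on the whole DataFrame and only the sums are grouped. The variable names weighted_sum, total_count and res_vec are made up for this example.

```python
import pandas as pd

data = [['A', 4, 2.5], ['A', 3, 6], ['B', 4, 9.5], ['B', 3, 13]]
df = pd.DataFrame(data, columns=['pool', 'count', 'mean'])

# Multiply once over all rows, then group the resulting Series by pool.
weighted_sum = (df['count'] * df['mean']).groupby(df['pool']).sum()
total_count = df.groupby('pool')['count'].sum()

# Both pieces share the 'pool' index, so the division aligns by group.
res_vec = pd.concat(
    [total_count, (weighted_sum / total_count).rename('newAvg')],
    axis=1,
)
print(res_vec)
```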

【Answer 3】:

IIUC

df.groupby(['pool']).apply(lambda x : pd.Series({'count':sum(x['count']),'newavg':newAvg(x)}))
Out[58]: 
      count  newavg
pool               
A       7.0     4.0
B       7.0    11.0
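
Note that count comes back as 7.0 above: each group's result is a single Series, so the integer sum is stored alongside the float average and upcast to float. A possible variant is sketched below. It assumes you want count as an integer again, selects the two columns explicitly so the function only sees count and mean, and uses a hypothetical helper named summarise that does not modify the group it receives.

```python
import pandas as pd

data = [['A', 4, 2.5], ['A', 3, 6], ['B', 4, 9.5], ['B', 3, 13]]
df = pd.DataFrame(data, columns=['pool', 'count', 'mean'])

def summarise(group):
    # hypothetical helper: returns one row of results per group
    total = group['count'].sum()
    weighted = (group['count'] * group['mean']).sum()
    return pd.Series({'count': total, 'newavg': weighted / total})

res = (
    df.groupby('pool')[['count', 'mean']]
      .apply(summarise)
      .astype({'count': 'int64'})   # undo the float upcast for count
)
print(res)
```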

【Comments】:

I like this one a lot. Thanks to everyone, though.

@Christian Happy coding

【Answer 4】:

Use assign together with eval:

df.assign(cm=df['count']*df['mean'])\
  .groupby('pool', as_index=False)[['cm','count']].sum()\
  .eval('AggCol = cm / count')

Output:

  pool    cm  count  AggCol
0    A  28.0      7     4.0
1    B  77.0      7    11.0

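【Answer 5】:

If you want a weighted average, you can compute it with agg and NumPy's np.average function, then read off the Series for the 'mean' column:

import numpy as np

# This relies on agg handing each group to the function as a DataFrame;
# if your pandas version passes the columns one at a time instead, use .apply(...) here.
df_gb = df.groupby(['pool']).agg(lambda x: np.average(x['mean'], weights=x['count']))['mean']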

You can also do this with your newAvg function, although it produces a warning:

df_gb2 = df.groupby(['pool']).agg(newAvg)['mean']

If you prefer to keep using newAvg, you can redefine it so that it no longer writes a new column into a copy of the group:

def newAvg(x):
    cm = x['count']*x['mean']
    sCount = x['count'].sum()
    sMean = cm.sum()
    return sMean/sCount

With this modification you get the expected output:

df_gb2 = df.groupby(['pool']).agg(newAvg)['mean']
print(df_gb2)

# pool
# A     4.0
# B    11.0
# Name: mean, dtype: float64
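
As a hedged alternative to the agg calls in this answer, the weighted average can be sketched with apply on an explicit column selection, so that the function always receives each group as a DataFrame containing only count and mean. The helper name weighted_mean and its parameters are introduced here for illustration and are not part of pandas or NumPy.

```python
import numpy as np
import pandas as pd

data = [['A', 4, 2.5], ['A', 3, 6], ['B', 4, 9.5], ['B', 3, 13]]
df = pd.DataFrame(data, columns=['pool', 'count', 'mean'])

def weighted_mean(group, value_col='mean', weight_col='count'):
    # hypothetical helper: np.average does the weighting, nothing is mutated
    return np.average(group[value_col], weights=group[weight_col])

wavg = df.groupby('pool')[['count', 'mean']].apply(weighted_mean)
print(wavg)
```

Because the helper only reads from the group, it avoids the write-to-a-copy problem that the redefined newAvg above was meant to fix.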
