Python pandas groupby aggregate on multiple columns, then pivot

Posted 2023-02-16

【Title】: Python pandas groupby aggregate on multiple columns, then pivot
【Posted】: 2017-08-27 15:09:15
【Question】:

In Python, I have a pandas DataFrame similar to the following:

Item | shop1 | shop2 | shop3 | Category
------------------------------------
Shoes| 45    | 50    | 53    | Clothes
TV   | 200   | 300   | 250   | Technology
Book | 20    | 17    | 21    | Books
phone| 300   | 350   | 400   | Technology

Here shop1, shop2, and shop3 are the cost of each item in different shops. Now, after some data cleaning, I need to return a DataFrame like this:

Category (index)| size| sum| mean | std
----------------------------------------

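where size is the number of items in each Category, and sum, mean, and std are those same functions applied across the 3 shops. How can I do these operations with the split-apply-combine pattern (groupby, aggregate, apply, ...)?

Can anyone help me? This is driving me crazy... Thanks!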
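
For anyone who wants to reproduce the problem, here is a minimal sketch that builds the sample table shown in the question. The variable name `df` is the one the answers use; the column values are copied from the table, nothing else is assumed.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    'Item': ['Shoes', 'TV', 'Book', 'phone'],
    'shop1': [45, 200, 20, 300],
    'shop2': [50, 300, 17, 350],
    'shop3': [53, 250, 21, 400],
    'Category': ['Clothes', 'Technology', 'Books', 'Technology'],
})

print(df)
```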

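【Answer 1】:

Edited for pandas 0.22+, given the deprecation of passing a renaming dictionary to a groupby aggregation.

We build a very similar dictionary: its keys specify the functions to apply, and the dictionary itself is then used to rename the columns.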

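# Keys are the aggregation functions; values are the output column names.
rnm_cols = dict(size='Size', sum='Sum', mean='Mean', std='Std')
# stack() turns the shop columns into rows, so each Category groups all of its prices.
df.set_index(['Category', 'Item']).stack().groupby('Category') \
  .agg(rnm_cols.keys()).rename(columns=rnm_cols)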

            Size   Sum        Mean        Std
Category                                     
Books          3    58   19.333333   2.081666
Clothes        3   148   49.333333   4.041452
Technology     6  1800  300.000000  70.710678
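
To make the stacking step easier to follow, here is a self-contained sketch of the same pipeline split into named steps. The sample `df` is rebuilt from the question's table, and the intermediate name `prices` is only illustrative.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    'Item': ['Shoes', 'TV', 'Book', 'phone'],
    'shop1': [45, 200, 20, 300],
    'shop2': [50, 300, 17, 350],
    'shop3': [53, 250, 21, 400],
    'Category': ['Clothes', 'Technology', 'Books', 'Technology'],
})

# After set_index only the shop columns remain; stack() turns them into rows,
# giving a Series indexed by (Category, Item, shop) with one price per row.
prices = df.set_index(['Category', 'Item']).stack()

# Keys name the aggregation functions, values name the output columns.
rnm_cols = dict(size='Size', sum='Sum', mean='Mean', std='Std')

result = (
    prices.groupby(level='Category')
    .agg(list(rnm_cols))          # pass the function names as a list
    .rename(columns=rnm_cols)     # then rename size -> Size, and so on
)
print(result)
```

Because every price is its own row after stacking, Technology has a size of 6 (two items times three shops), which matches the output above.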

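Option 1: use agg (the original dict-renaming form; this is the usage that newer pandas versions deprecate, as noted at the top of this answer)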

agg_funcs = dict(Size='size', Sum='sum', Mean='mean', Std='std')
df.set_index(['Category', 'Item']).stack().groupby(level=0).agg(agg_funcs)

                  Std   Sum        Mean  Size
Category                                     
Books        2.081666    58   19.333333     3
Clothes      4.041452   148   49.333333     3
Technology  70.710678  1800  300.000000     6
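
As an alternative to the renaming dictionary in Option 1, newer pandas versions offer named aggregation, where keyword arguments to `agg` give the output column names directly. The sketch below assumes pandas 0.25 or later (to my knowledge, the release that introduced this form) and rebuilds the question's sample data.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    'Item': ['Shoes', 'TV', 'Book', 'phone'],
    'shop1': [45, 200, 20, 300],
    'shop2': [50, 300, 17, 350],
    'shop3': [53, 250, 21, 400],
    'Category': ['Clothes', 'Technology', 'Books', 'Technology'],
})

prices = df.set_index(['Category', 'Item']).stack()

# Named aggregation on a grouped Series: keyword = output column,
# value = aggregation function. Columns come out in keyword order.
named = prices.groupby(level=0).agg(Size='size', Sum='sum', Mean='mean', Std='std')
print(named)
```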

Option 2: get more for less effort, use describe

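# In current pandas, describe() on a grouped Series already returns one row per Category;
# older versions needed a trailing .unstack() to get this layout.
df.set_index(['Category', 'Item']).stack().groupby(level=0).describe()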

            count        mean        std    min    25%    50%    75%    max
Category                                                                   
Books         3.0   19.333333   2.081666   17.0   18.5   20.0   20.5   21.0
Clothes       3.0   49.333333   4.041452   45.0   47.5   50.0   51.5   53.0
Technology    6.0  300.000000  70.710678  200.0  262.5  300.0  337.5  400.0
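
Note that `describe` has no `sum` column, so it does not fully cover the table asked for in the question. One possible way to close that gap is to keep only the needed statistics and add the sum separately. This is a sketch on the question's sample data; the names `grouped` and `summary` are illustrative.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    'Item': ['Shoes', 'TV', 'Book', 'phone'],
    'shop1': [45, 200, 20, 300],
    'shop2': [50, 300, 17, 350],
    'shop3': [53, 250, 21, 400],
    'Category': ['Clothes', 'Technology', 'Books', 'Technology'],
})

grouped = df.set_index(['Category', 'Item']).stack().groupby(level=0)

# Keep three of the describe() statistics and append the per-group sum,
# which aligns on the Category index.
summary = grouped.describe()[['count', 'mean', 'std']].assign(sum=grouped.sum())
print(summary)
```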

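【Comments】:

Thanks for your answer @piRSquared. This does not work if we want to apply multiple functions to the same column with a dict. Is there any way to handle that?

@CanCeylan This uses groupby and aggregate on a pandas Series. It behaves differently for a DataFrame.

【Answer 2】: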

df.groupby('Category').agg({'Item':'size','shop1':['sum','mean','std'],'shop2':['sum','mean','std'],'shop3':['sum','mean','std']})

Or, if you want it across all shops:

df1 = df.set_index(['Item','Category']).stack().reset_index().rename(columns={'level_2':'Shops',0:'costs'})
df1.groupby('Category').agg({'Item':'size','costs':['sum','mean','std']})
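
Passing lists of functions in the dictionary produces two-level column labels such as ('costs', 'sum'). If flat column names are preferred, one option is to join the two levels afterwards. The sketch below runs the second snippet on the question's sample data; the underscore-joined names are my own choice, not something from the answer.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    'Item': ['Shoes', 'TV', 'Book', 'phone'],
    'shop1': [45, 200, 20, 300],
    'shop2': [50, 300, 17, 350],
    'shop3': [53, 250, 21, 400],
    'Category': ['Clothes', 'Technology', 'Books', 'Technology'],
})

# Long format: one row per (Item, Category, shop) with its cost.
df1 = (
    df.set_index(['Item', 'Category'])
    .stack()
    .reset_index()
    .rename(columns={'level_2': 'Shops', 0: 'costs'})
)

out = df1.groupby('Category').agg({'Item': 'size', 'costs': ['sum', 'mean', 'std']})

# Flatten the two-level column labels, e.g. ('costs', 'sum') -> 'costs_sum'.
out.columns = ['_'.join(col) for col in out.columns]
print(out)
```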

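【Answer 3】:

If I understand correctly, you want to compute the aggregate metrics across all shops, not for each shop separately. To do that, you can first stack your DataFrame and then group by Category:

stacked = df.set_index(['Item', 'Category']).stack().reset_index()
stacked.columns = ['Item', 'Category', 'Shop', 'Price']
stacked.groupby('Category').agg({'Price':['count','sum','mean','std']})

This results in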

           Price                             
           count   sum        mean        std
Category                                     
Books          3    58   19.333333   2.081666
Clothes        3   148   49.333333   4.041452
Technology     6  1800  300.000000  70.710678
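
The same reshaping can also be done with `melt` instead of `stack`. This is a swapped-in technique, not what the answer above uses, but it names the new columns in one step, and selecting the Price column before aggregating avoids the extra "Price" header level. A sketch on the question's sample data:

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    'Item': ['Shoes', 'TV', 'Book', 'phone'],
    'shop1': [45, 200, 20, 300],
    'shop2': [50, 300, 17, 350],
    'shop3': [53, 250, 21, 400],
    'Category': ['Clothes', 'Technology', 'Books', 'Technology'],
})

# melt() keeps Item and Category as identifiers and unpivots the shop columns
# into a 'Shop' label column and a 'Price' value column.
long_df = df.melt(id_vars=['Item', 'Category'], var_name='Shop', value_name='Price')

# Aggregating a single selected column yields flat column names.
stats = long_df.groupby('Category')['Price'].agg(['count', 'sum', 'mean', 'std'])
print(stats)
```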
