Aggregation in Pandas

Posted 2023-02-23


【Title】: Aggregation in Pandas 【Posted】: 2019-05-15 20:05:41 【Question】:

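1. How can I perform aggregation with Pandas?

2. No DataFrame after aggregation! What happened?

3. How can I aggregate mainly string columns (to `list`s, `tuple`s, `strings with separator`)?

4. How can I aggregate counts?

5. How can I create a new column filled by aggregated values?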

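I have seen these recurring questions asking about various aspects of the pandas aggregation functionality. Most of the information on aggregation and its use cases is currently scattered across dozens of badly worded, unsearchable posts. The aim here is to collect some of the more important points for posterity.

This Q&A is meant to be the next instalment in a series of helpful user guides:

How to pivot a dataframe
Pandas concat
How do I operate on a DataFrame with a Series for every column?
Pandas Merging 101

Please note that this post is not meant to replace the documentation about aggregation and about groupby, so please read those as well!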

【Comments】:

Please try not to close canonical posts (you can't address just one issue in a canonical Q&A post).

【Answer 1】:

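Question 1

How can I perform aggregation with Pandas?

This expands on the aggregation documentation.

Aggregating functions are the ones that reduce the dimension of the returned objects. This means the output Series/DataFrame has the same number of rows as the original, or fewer.

Some common aggregating functions are tabulated below:

Function      Description
mean()        Compute mean of groups
sum()         Compute sum of group values
size()        Compute group sizes
count()       Compute count of non-missing values in group
std()         Standard deviation of groups
var()         Compute variance of groups
sem()         Standard error of the mean of groups
describe()    Generate descriptive statistics
first()       Compute first of group values
last()        Compute last of group values
nth()         Take nth value, or a subset if n is a list
min()         Compute min of group values
max()         Compute max of group values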
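As a minimal sketch of what "reducing" means here, the snippet below applies three of the listed functions to a small made-up frame (the `toy` data and its column names are assumptions for illustration, not the frame used in the rest of this answer):

```python
import pandas as pd

# Hypothetical toy data, not the frame used in the rest of this answer.
toy = pd.DataFrame({'key': ['x', 'x', 'y'],
                    'val': [1.0, None, 3.0]})

grouped = toy.groupby('key')['val']

# Each call collapses every group to a single value,
# so the result has one row per group (2 rows from 3).
summary = pd.DataFrame({'mean': grouped.mean(),
                        'size': grouped.size(),     # rows per group, NaN included
                        'count': grouped.count()})  # non-missing values only
print(summary)
```

Note that size() and count() differ for group x because one of its values is missing.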

import numpy as np
import pandas as pd

np.random.seed(123)

df = pd.DataFrame({'A' : ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                   'B' : ['one', 'two', 'three','two', 'two', 'one'],
                   'C' : np.random.randint(5, size=6),
                   'D' : np.random.randint(5, size=6),
                   'E' : np.random.randint(5, size=6)})
print (df)
     A      B  C  D  E
0  foo    one  2  3  0
1  foo    two  4  1  0
2  bar  three  2  1  1
3  foo    two  1  0  3
4  bar    two  3  1  4
5  foo    one  2  1  0

Aggregation of a selected column with Cython-implemented functions:

df1 = df.groupby(['A', 'B'], as_index=False)['C'].sum()
print (df1)
     A      B  C
0  bar  three  2
1  bar    two  3
2  foo    one  4
3  foo    two  5

The aggregate function is applied to all columns that are not specified in the groupby function, which here are the A, B columns:

df2 = df.groupby(['A', 'B'], as_index=False).sum()
print (df2)
     A      B  C  D  E
0  bar  three  2  1  1
1  bar    two  3  1  4
2  foo    one  4  4  0
3  foo    two  5  1  3

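You can also specify only some columns for aggregation, in a list after the groupby function (note the double brackets; selecting with a bare tuple such as ['C','D'] no longer works in recent pandas versions):

df3 = df.groupby(['A', 'B'], as_index=False)[['C','D']].sum()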
print (df3)
     A      B  C  D
0  bar  three  2  1
1  bar    two  3  1
2  foo    one  4  4
3  foo    two  5  1

The same results using the function DataFrameGroupBy.agg:

df1 = df.groupby(['A', 'B'], as_index=False)['C'].agg('sum')
print (df1)
     A      B  C
0  bar  three  2
1  bar    two  3
2  foo    one  4
3  foo    two  5

df2 = df.groupby(['A', 'B'], as_index=False).agg('sum')
print (df2)
     A      B  C  D  E
0  bar  three  2  1  1
1  bar    two  3  1  4
2  foo    one  4  4  0
3  foo    two  5  1  3

To apply multiple functions to one column, use a list of tuples, each holding the name of the new column and the aggregating function:

df4 = (df.groupby(['A', 'B'])['C']
         .agg([('average','mean'),('total','sum')])
         .reset_index())
print (df4)
     A      B  average  total
0  bar  three      2.0      2
1  bar    two      3.0      3
2  foo    one      2.0      4
3  foo    two      2.5      5

To apply multiple functions to all columns, pass the same list of tuples without selecting a column:

df5 = (df.groupby(['A', 'B'])
         .agg([('average','mean'),('total','sum')]))

print (df5)
                C             D             E
          average total average total average total
A   B
bar three     2.0     2     1.0     1     1.0     1
    two       3.0     3     1.0     1     4.0     4
foo one       2.0     4     2.0     4     0.0     0
    two       2.5     5     0.5     1     1.5     3

You then get a MultiIndex in the columns:

print (df5.columns)
MultiIndex(levels=[['C', 'D', 'E'], ['average', 'total']],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])

To convert it to regular columns, flatten the MultiIndex using map with join:

df5.columns = df5.columns.map('_'.join)
df5 = df5.reset_index()
print (df5)
     A      B  C_average  C_total  D_average  D_total  E_average  E_total
0  bar  three        2.0        2        1.0        1        1.0        1
1  bar    two        3.0        3        1.0        1        4.0        4
2  foo    one        2.0        4        2.0        4        0.0        0
3  foo    two        2.5        5        0.5        1        1.5        3
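A hedged alternative sketch for the same flattening step: Index.to_flat_index() returns the MultiIndex as an Index of tuples, which can then be formatted however you like. The `toy` frame and the f-string naming scheme below are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({'A': ['foo', 'foo', 'bar'],
                    'C': [1, 2, 3],
                    'D': [4, 5, 6]})

out = toy.groupby('A').agg(['mean', 'sum'])

# to_flat_index() turns the MultiIndex into an Index of tuples,
# e.g. ('C', 'mean'); the f-string then builds one label per tuple.
out.columns = [f'{col}_{func}' for col, func in out.columns.to_flat_index()]
out = out.reset_index()
print(out.columns.tolist())
```

Compared with map('_'.join), this form also works when a level contains non-string labels, since the f-string converts each part to text.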

Another solution is to pass a list of aggregating functions, then flatten the MultiIndex and use str.replace to get the other column names:

df5 = df.groupby(['A', 'B']).agg(['mean','sum'])

df5.columns = (df5.columns.map('_'.join)
                  .str.replace('sum','total')
                  .str.replace('mean','average'))
df5 = df5.reset_index()
print (df5)
     A      B  C_average  C_total  D_average  D_total  E_average  E_total
0  bar  three        2.0        2        1.0        1        1.0        1
1  bar    two        3.0        3        1.0        1        4.0        4
2  foo    one        2.0        4        2.0        4        0.0        0
3  foo    two        2.5        5        0.5        1        1.5        3

If you want to specify the aggregating function for each column separately, pass a dictionary:

df6 = (df.groupby(['A', 'B'], as_index=False)
         .agg({'C':'sum','D':'mean'})
         .rename(columns={'C':'C_total', 'D':'D_average'}))
print (df6)
     A      B  C_total  D_average
0  bar  three        2        1.0
1  bar    two        3        1.0
2  foo    one        4        2.0
3  foo    two        5        0.5

You can pass a custom function too:

def func(x):
    # first value of the group plus its last value
    return x.iat[0] + x.iat[-1]

df7 = (df.groupby(['A', 'B'], as_index=False)
         .agg({'C':'sum','D': func})
         .rename(columns={'C':'C_total', 'D':'D_sum_first_and_last'}))
print (df7)
     A      B  C_total  D_sum_first_and_last
0  bar  three        2                     2
1  bar    two        3                     2
2  foo    one        4                     4
3  foo    two        5                     1
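For pandas 0.25 or later, the dictionary plus rename steps can probably be folded into one call with named aggregation. The sketch below is one way to do it; the `toy` frame and the helper name `first_plus_last` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({'A': ['foo', 'foo', 'bar'],
                    'C': [1, 2, 3],
                    'D': [10, 20, 30]})

def first_plus_last(x):
    # Same idea as func above: first value of the group plus the last one.
    return x.iat[0] + x.iat[-1]

# keyword = (source column, aggregating function); the keyword becomes
# the output column name, so no separate rename is needed
out = (toy.groupby('A', as_index=False)
          .agg(C_total=('C', 'sum'),
               D_sum_first_and_last=('D', first_plus_last)))
print(out)
```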

Question 2

No DataFrame after aggregation! What happened?

Aggregation by two or more columns:

df1 = df.groupby(['A', 'B'])['C'].sum()
print (df1)
A    B
bar  three    2
     two      3
foo  one      4
     two      5
Name: C, dtype: int32

First check the Index and the type of the Pandas object:

print (df1.index)
MultiIndex(levels=[['bar', 'foo'], ['one', 'three', 'two']],
           labels=[[0, 0, 1, 1], [1, 2, 0, 2]],
           names=['A', 'B'])

print (type(df1))
<class 'pandas.core.series.Series'>

There are two solutions for turning the MultiIndex of the Series into columns:

Add the parameter as_index=False:

df1 = df.groupby(['A', 'B'], as_index=False)['C'].sum()
print (df1)
     A      B  C
0  bar  three  2
1  bar    two  3
2  foo    one  4
3  foo    two  5

Use Series.reset_index:

df1 = df.groupby(['A', 'B'])['C'].sum().reset_index()
print (df1)
     A      B  C
0  bar  three  2
1  bar    two  3
2  foo    one  4
3  foo    two  5

If you group by one column:

df2 = df.groupby('A')['C'].sum()
print (df2)
A
bar    5
foo    9
Name: C, dtype: int32

... you get a Series with an Index:

print (df2.index)
Index(['bar', 'foo'], dtype='object', name='A')

print (type(df2))
<class 'pandas.core.series.Series'>

And the solution is the same as for the MultiIndex Series:

df2 = df.groupby('A', as_index=False)['C'].sum()
print (df2)
     A  C
0  bar  5
1  foo  9

df2 = df.groupby('A')['C'].sum().reset_index()
print (df2)
     A  C
0  bar  5
1  foo  9
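To make the difference between the three forms concrete, here is a small sketch that checks the type of each result (the `toy` frame and the variable names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({'A': ['foo', 'bar', 'foo'],
                    'C': [1, 2, 3]})

s = toy.groupby('A')['C'].sum()                         # Series, keys in the index
via_flag = toy.groupby('A', as_index=False)['C'].sum()  # DataFrame, keys as a column
via_reset = s.reset_index()                             # DataFrame, keys as a column

print(type(s).__name__, type(via_flag).__name__, type(via_reset).__name__)
print(via_flag)
print(via_reset)
```

Both DataFrame variants should hold the same rows; which one to use is mostly a matter of taste.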

Question 3

How can I aggregate mainly string columns (to `list`s, `tuple`s, `strings with separator`)?

df = pd.DataFrame({'A' : ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
                   'B' : ['one', 'two', 'three','two', 'two', 'one', 'three'],
                   'C' : ['three', 'one', 'two', 'two', 'three','two', 'one'],
                   'D' : [1,2,3,2,3,1,2]})
print (df)
   A      B      C  D
0  a    one  three  1
1  c    two    one  2
2  b  three    two  3
3  b    two    two  2
4  a    two  three  3
5  c    one    two  1
6  b  three    one  2

Instead of an aggregating function, you can pass list, tuple or set to convert the column:

df1 = df.groupby('A')['B'].agg(list).reset_index()
print (df1)
   A                    B
0  a           [one, two]
1  b  [three, two, three]
2  c           [two, one]

An alternative is to use GroupBy.apply:

df1 = df.groupby('A')['B'].apply(list).reset_index()
print (df1)
   A                    B
0  a           [one, two]
1  b  [three, two, three]
2  c           [two, one]
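The text above mentions tuple and set but only demonstrates list. A minimal sketch of the other two follows, on a made-up `toy` frame; sorting the set into a list is an added assumption to get a stable, readable result, since a set has no defined order.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({'A': ['a', 'b', 'b', 'a'],
                    'B': ['one', 'three', 'three', 'two']})

# tuple keeps every value, in the original row order
as_tuple = toy.groupby('A')['B'].agg(tuple).reset_index()

# set drops duplicates; sorted() turns it into an ordered list
as_unique = toy.groupby('A')['B'].agg(lambda x: sorted(set(x))).reset_index()

print(as_tuple)
print(as_unique)
```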

To convert to a string with a separator, use .join, but only if it is a string column:

df2 = df.groupby('A')['B'].agg(','.join).reset_index()
print (df2)
   A                B
0  a          one,two
1  b  three,two,three
2  c          two,one

If it is a numeric column, use a lambda function with astype to convert the values to strings:

df3 = (df.groupby('A')['D']
         .agg(lambda x: ','.join(x.astype(str)))
         .reset_index())
print (df3)
   A      D
0  a    1,3
1  b  3,2,2
2  c    2,1

Another solution is to convert to strings before the groupby:

df3 = (df.assign(D = df['D'].astype(str))
         .groupby('A')['D']
         .agg(','.join).reset_index())
print (df3)
   A      D
0  a    1,3
1  b  3,2,2
2  c    2,1

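To convert all columns, don't pass a list of columns after groupby. Column D is missing from the result because of the automatic exclusion of 'nuisance' columns, which means all numeric columns are excluded. (This is the behaviour of older pandas versions; from pandas 2.0 on, nuisance columns are no longer dropped silently and this call raises a TypeError instead.)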

df4 = df.groupby('A').agg(','.join).reset_index()
print (df4)
   A                B            C
0  a          one,two  three,three
1  b  three,two,three  two,two,one
2  c          two,one      one,two

So it is necessary to convert all columns to strings first, and then you get all columns:

df5 = (df.groupby('A')
         .agg(lambda x: ','.join(x.astype(str)))
         .reset_index())
print (df5)
   A                B            C      D
0  a          one,two  three,three    1,3
1  b  three,two,three  two,two,one  3,2,2
2  c          two,one      one,two    2,1

Question 4

How can I aggregate counts?

df = pd.DataFrame({'A' : ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
                   'B' : ['one', 'two', 'three','two', 'two', 'one', 'three'],
                   'C' : ['three', np.nan, np.nan, 'two', 'three','two', 'one'],
                   'D' : [np.nan,2,3,2,3,np.nan,2]})
print (df)
   A      B      C    D
0  a    one  three  NaN
1  c    two    NaN  2.0
2  b  three    NaN  3.0
3  b    two    two  2.0
4  a    two  three  3.0
5  c    one    two  NaN
6  b  three    one  2.0

The function GroupBy.size returns the size of each group:

df1 = df.groupby('A').size().reset_index(name='COUNT')
print (df1)
   A  COUNT
0  a      2
1  b      3
2  c      2

The function GroupBy.count excludes missing values:

df2 = df.groupby('A')['C'].count().reset_index(name='COUNT')
print (df2)
   A  COUNT
0  a      2
1  b      2
2  c      1

Use this function to count non-missing values in multiple columns:

df3 = df.groupby('A').count().add_suffix('_COUNT').reset_index()
print (df3)
   A  B_COUNT  C_COUNT  D_COUNT
0  a        2        2        1
1  b        3        2        3
2  c        2        1        1
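Since size includes missing values and count does not, asking for both in one agg call gives the number of missing values per group by subtraction. This is a sketch on a made-up `toy` frame; the column name `missing` is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({'A': ['a', 'a', 'b'],
                    'C': ['x', np.nan, 'y']})

both = (toy.groupby('A')['C']
           .agg(['size', 'count'])   # size: all rows, count: non-missing only
           .reset_index())
both['missing'] = both['size'] - both['count']
print(both)
```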

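A related function is Series.value_counts. It returns an object containing the counts of unique values in descending order, so that the first element is the most frequently occurring one. It excludes NaN values by default.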

df4 = (df['A'].value_counts()
              .rename_axis('A')
              .reset_index(name='COUNT'))
print (df4)
   A  COUNT
0  b      3
1  a      2
2  c      2

If you want the same output as with groupby + size, add Series.sort_index:

df5 = (df['A'].value_counts()
              .sort_index()
              .rename_axis('A')
              .reset_index(name='COUNT'))
print (df5)
   A  COUNT
0  a      2
1  b      3
2  c      2
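Because value_counts drops NaN by default, the counts may not add up to the length of the column. A short sketch of the dropna parameter, on a made-up Series `s`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy Series for illustration only.
s = pd.Series(['two', np.nan, 'two', 'one', np.nan, np.nan], name='C')

without_nan = s.value_counts()            # NaN excluded (the default)
with_nan = s.value_counts(dropna=False)   # NaN counted as its own value

print(without_nan.sum(), with_nan.sum())
print(with_nan)
```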

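Question 5

How can I create a new column filled by aggregated values?

The method GroupBy.transform returns an object that is indexed the same (same size) as the one being grouped.

See the Pandas documentation for more information.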

np.random.seed(123)

df = pd.DataFrame({'A' : ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                   'B' : ['one', 'two', 'three','two', 'two', 'one'],
                   'C' : np.random.randint(5, size=6),
                   'D' : np.random.randint(5, size=6)})
print (df)
     A      B  C  D
0  foo    one  2  3
1  foo    two  4  1
2  bar  three  2  1
3  foo    two  1  0
4  bar    two  3  1
5  foo    one  2  1


df['C1'] = df.groupby('A')['C'].transform('sum')
df['C2'] = df.groupby(['A','B'])['C'].transform('sum')


df[['C3','D3']] = df.groupby('A')[['C','D']].transform('sum')
df[['C4','D4']] = df.groupby(['A','B'])[['C','D']].transform('sum')

print (df)

     A      B  C  D  C1  C2  C3  D3  C4  D4
0  foo    one  2  3   9   4   9   5   4   4
1  foo    two  4  1   9   5   9   5   5   1
2  bar  three  2  1   5   2   5   2   2   1
3  foo    two  1  0   9   5   9   5   5   1
4  bar    two  3  1   5   3   5   2   3   1
5  foo    one  2  1   9   4   9   5   4   4
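Because the transformed result lines up row by row with the original frame, it can be combined directly with the original column. As a hedged illustration (the `toy` frame and the `C_share` column are made up), this computes each row's share of its group total:

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'foo'],
                    'C': [2, 4, 2, 2]})

group_total = toy.groupby('A')['C'].transform('sum')  # same length and index as toy
toy['C_share'] = toy['C'] / group_total
print(toy)
```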

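【Comments】:

@AbhishekDujari - I try to expand the docs with some related questions about aggregation, so it is more information like in the docs.

Thanks. Though I would suggest contributing to the project itself. A lot of students would benefit from these great examples.

That list of available aggregate functions ... where did you find it? I can't seem to find it anywhere in the official docs! Thanks!

@QACollective - you can check this

【Answer 2】: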

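If you come from an R or SQL background, the following three examples will teach you everything you need to do aggregation the way you are already familiar with:

Let us first create a Pandas dataframe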

import pandas as pd

df = pd.DataFrame({'key1' : ['a','a','a','b','a'],
                   'key2' : ['c','c','d','d','e'],
                   'value1' : [1,2,2,3,3],
                   'value2' : [9,8,7,6,5]})

df.head(5)

The table we created looks like this:

key1	key2	value1	value2
a	c	1	9
a	c	2	8
a	d	2	7
b	d	3	6
a	e	3	5

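1. Aggregating with row reduction, similar to SQL `Group By`

1.1 If Pandas version `>=0.25`

Check your Pandas version by running print(pd.__version__). If your Pandas version is 0.25 or above, then the following code will work: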

df_agg = df.groupby(['key1','key2']).agg(mean_of_value_1=('value1', 'mean'),
                                         sum_of_value_2=('value2', 'sum'),
                                         count_of_value1=('value1','size')
                                         ).reset_index()


df_agg.head(5)

The resulting data table will look like this:

key1	key2	mean_of_value_1	sum_of_value_2	count_of_value1
a	c	1.5	17	2
a	d	2.0	7	1
a	e	3.0	5	1
b	d	3.0	6	1

The SQL equivalent of this is:

SELECT
      key1
     ,key2
     ,AVG(value1) AS mean_of_value_1
     ,SUM(value2) AS sum_of_value_2
     ,COUNT(*) AS count_of_value1
FROM
    df
GROUP BY
     key1
    ,key2

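1.2 If Pandas version `<0.25`

If your Pandas version is older than 0.25, then running the above code will give you the following error:

TypeError: aggregate() missing 1 required positional argument: 'arg'

Now, to do the aggregation for both value1 and value2, you will run this code:

df_agg = df.groupby(['key1','key2'],as_index=False).agg({'value1':['mean','count'],'value2':'sum'})

# flatten the two-level column names; strip('_') removes the trailing
# underscore that the key columns ('key1', '') would otherwise get
df_agg.columns = ['_'.join(col).strip('_') for col in df_agg.columns.values]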

df_agg.head(5)

The resulting table will look like this:

key1	key2	value1_mean	value1_count	value2_sum
a	c	1.5	2	17
a	d	2.0	1	7
a	e	3.0	1	5
b	d	3.0	1	6

The columns need to be renamed separately using the following code:

df_agg.rename(columns={"value1_mean" : "mean_of_value1",
                       "value1_count" : "count_of_value1",
                       "value2_sum" : "sum_of_value2"
                       }, inplace=True)

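2. Create a column without reduction in rows (`EXCEL - SUMIF, COUNTIF`)

If you want to do a SUMIF, COUNTIF, etc., like you would in Excel, where there is no reduction in rows, then you need to do this instead.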

df['Total_of_value1_by_key1'] = df.groupby('key1')['value1'].transform('sum')

df.head(5)

The resulting data frame will look like this, with the same number of rows as the original:

key1	key2	value1	value2	Total_of_value1_by_key1
a	c	1	9	8
a	c	2	8	8
a	d	2	7	8
b	d	3	6	3
a	e	3	5	8
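The example above covers the SUMIF case. As a rough sketch of the COUNTIF and AVERAGEIF analogues, the same transform call can take 'count' or 'mean' instead of 'sum' (the new column names below are made up for illustration):

```python
import pandas as pd

# Same key1/value1 values as the frame above, rebuilt so the sketch runs on its own.
toy = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                    'value1': [1, 2, 2, 3, 3]})

# COUNTIF-style: number of non-missing value1 entries sharing this row's key1
toy['Count_by_key1'] = toy.groupby('key1')['value1'].transform('count')

# AVERAGEIF-style: mean of value1 within this row's key1
toy['Mean_of_value1_by_key1'] = toy.groupby('key1')['value1'].transform('mean')
print(toy)
```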

3. Creating a RANK column `ROW_NUMBER() OVER (PARTITION BY ORDER BY)`

Finally, there might be cases where you want to create a rank column, which is the SQL equivalent of ROW_NUMBER() OVER (PARTITION BY key1 ORDER BY value1 DESC, value2 ASC).

Here is how you do that.

df['RN'] = df.sort_values(['value1','value2'], ascending=[False,True]) \
             .groupby(['key1']) \
             .cumcount() + 1

df.head(5)

Note: we make the code multi-line by adding \ at the end of each line.

The resulting data frame will look like this:

key1	key2	value1	value2	RN
a	c	1	9	4
a	c	2	8	3
a	d	2	7	2
b	d	3	6	1
a	e	3	5	1
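It may not be obvious why assigning the result of a sorted frame back to the unsorted one works. The sketch below rebuilds the same data as `toy` and uses parentheses instead of backslashes; it shows that the counter keeps the original row labels, so the assignment is matched on index rather than on position.

```python
import pandas as pd

# Same values as the frame above, rebuilt so the sketch runs on its own.
toy = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                    'value1': [1, 2, 2, 3, 3],
                    'value2': [9, 8, 7, 6, 5]})

rn = (toy.sort_values(['value1', 'value2'], ascending=[False, True])
         .groupby('key1')
         .cumcount() + 1)

print(rn.index.tolist())   # original row labels, in sorted order
toy['RN'] = rn             # aligned on index, not on position
print(toy['RN'].tolist())
```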

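In all the examples above, the final data table will have a flat table structure and won't have the pivot structure that you might get with other syntaxes.

Other aggregating operators:

mean() Compute mean of groups

sum() Compute sum of group values

size() Compute group sizes

count() Compute count of non-missing values in group

std() Standard deviation of groups

var() Compute variance of groups

sem() Standard error of the mean of groups

describe() Generate descriptive statistics

first() Compute first of group values

last() Compute last of group values

nth() Take nth value, or a subset if n is a list

min() Compute min of group values

max() Compute max of group values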

【Comments】:

Does this hold when df has some NaN values?
