How do I create a new column from the output of pandas groupby().sum()?

Posted 2023-03-11

【Asked】: 2021-07-16 10:58:38 【Question】:

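I am trying to create a new column from a groupby calculation. In the code below, I get the correct calculated values for each date (see group below), but when I try to create a new column (df['Data4']) from it, I get NaN. What I want is a new column in the dataframe that holds the sum of Data3 over all rows with the same date, applied to every row for that date. For example, 2015-05-08 appears in 2 rows (the total is 50+5 = 55), and in this new column I would like both rows to have 55.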

import pandas as pd
import numpy as np
from pandas import DataFrame

df = pd.DataFrame({
    'Date' : ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05', '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'], 
    'Sym'  : ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'], 
    'Data2': [11, 8, 10, 15, 110, 60, 100, 40],
    'Data3': [5, 8, 6, 1, 50, 100, 60, 120]
})

group = df['Data3'].groupby(df['Date']).sum()

df['Data4'] = group

【Comments】:

【Answer 1】:

You want to use transform. It returns a Series whose index is aligned with df, so you can add it as a new column:

In [74]:

df = pd.DataFrame({'Date': ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05', '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'], 'Sym': ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'], 'Data2': [11, 8, 10, 15, 110, 60, 100, 40], 'Data3': [5, 8, 6, 1, 50, 100, 60, 120]})

df['Data4'] = df['Data3'].groupby(df['Date']).transform('sum')
df
Out[74]:
   Data2  Data3        Date   Sym  Data4
0     11      5  2015-05-08  aapl     55
1      8      8  2015-05-07  aapl    108
2     10      6  2015-05-06  aapl     66
3     15      1  2015-05-05  aapl    121
4    110     50  2015-05-08  aaww     55
5     60    100  2015-05-07  aaww    108
6    100     60  2015-05-06  aaww     66
7     40    120  2015-05-05  aaww    121
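
The NaN in the question comes from index alignment. Here is a minimal sketch of the difference, using the question's Date and Data3 columns (the names misaligned and aligned are only illustrative). The grouped sum is indexed by the date strings, while df has the default integer index 0..7, so assigning it as a column finds no matching labels. transform returns a result indexed like df instead.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'] * 2,
    'Data3': [5, 8, 6, 1, 50, 100, 60, 120],
})

# One row per date, indexed by the date strings.
group = df['Data3'].groupby(df['Date']).sum()

# Column assignment aligns on index labels; none of the labels 0..7
# is a date string, so every value comes out as NaN.
misaligned = df.assign(Data4=group)

# transform broadcasts each group's sum back onto the original index.
aligned = df.assign(Data4=df.groupby('Date')['Data3'].transform('sum'))

print(misaligned['Data4'].isna().all())  # True
print(aligned['Data4'].tolist())         # [55, 108, 66, 121, 55, 108, 66, 121]
```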

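【Comments】:

What happens if we have a second groupby here, as in: ***.com/a/40067099/281545

@Mr_and_Mrs_D In that case you would have to reset the index and perform a left merge on the common columns to add the column back.

Alternatively, you can use df.groupby('Date')['Data3'].transform('sum') (which I find slightly easier to remember).

How do I group by two columns using this template? Thanks.

【Answer 2】:

How do I create a new column with Groupby().Sum()?

There are two ways: one straightforward, the other slightly more interesting.

Everybody's Favorite: `GroupBy.transform()` with `'sum'`

@Ed Chum's answer can be simplified a bit. Call DataFrame.groupby rather than Series.groupby. This results in simpler syntax.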

# The setup.
df[['Date', 'Data3']]

         Date  Data3
0  2015-05-08      5
1  2015-05-07      8
2  2015-05-06      6
3  2015-05-05      1
4  2015-05-08     50
5  2015-05-07    100
6  2015-05-06     60
7  2015-05-05    120

df.groupby('Date')['Data3'].transform('sum')

0     55
1    108
2     66
3    121
4     55
5    108
6     66
7    121
Name: Data3, dtype: int64

It is also a tad faster:

df2 = pd.concat([df] * 12345)

%timeit df2['Data3'].groupby(df['Date']).transform('sum')
%timeit df2.groupby('Date')['Data3'].transform('sum')

10.4 ms ± 367 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.58 ms ± 559 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
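
On the two-column question raised in the comments on the first answer, here is a small sketch. The frame sales and its values are made up for illustration. Passing a list of column names to groupby works the same way with transform, and the reset-index plus left-merge route mentioned in those comments gives the same numbers.

```python
import pandas as pd

# Hypothetical data: the (2015-05-08, aapl) pair appears twice.
sales = pd.DataFrame({
    'Date': ['2015-05-08', '2015-05-08', '2015-05-08', '2015-05-07', '2015-05-07'],
    'Sym': ['aapl', 'aapl', 'aaww', 'aapl', 'aaww'],
    'Data3': [5, 50, 8, 6, 1],
})

# Option 1: transform with a list of grouping columns.
via_transform = sales.groupby(['Date', 'Sym'])['Data3'].transform('sum')

# Option 2: aggregate, reset the index, then left-merge on the common columns.
totals = sales.groupby(['Date', 'Sym'])['Data3'].sum().rename('Data4').reset_index()
via_merge = sales.merge(totals, on=['Date', 'Sym'], how='left')

print(via_transform.tolist())       # [55, 55, 8, 6, 1]
print(via_merge['Data4'].tolist())  # [55, 55, 8, 6, 1]
```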

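Unconventional, but Worth Your Consideration: `GroupBy.sum()` + `Series.map()`

I stumbled upon an interesting idiosyncrasy in the API. From what I can tell, you can reproduce this on any major version above 0.20 (I tested it on 0.23 and 0.24). It seems you can consistently shave a few milliseconds off the time taken by transform if you instead call the GroupBy function directly and broadcast the result with map: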

df.Date.map(df.groupby('Date')['Data3'].sum())

0     55
1    108
2     66
3    121
4     55
5    108
6     66
7    121
Name: Date, dtype: int64

Compare with

df.groupby('Date')['Data3'].transform('sum')

0     55
1    108
2     66
3    121
4     55
5    108
6     66
7    121
Name: Data3, dtype: int64

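My tests show that map is a bit faster if you can afford to use a direct GroupBy function (such as mean, min, max, first, etc.). It is more or less faster for most general situations up to around 200 thousand records. After that, the performance really depends on the data.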
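
The same map pattern carries over to the other reducers. Here is a minimal sketch with mean on the question's Date and Data3 columns (via_map and via_transform are illustrative names). Note that the mapped result takes its name from the key column, so it is renamed here for clarity.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'] * 2,
    'Data3': [5, 8, 6, 1, 50, 100, 60, 120],
})

# Reduce once per group, then look each row's key up in the reduced Series.
group_means = df.groupby('Date')['Data3'].mean()
via_map = df['Date'].map(group_means).rename('Data3_mean')

# The transform equivalent, for comparison.
via_transform = df.groupby('Date')['Data3'].transform('mean')

print(via_map.tolist())  # [27.5, 54.0, 33.0, 60.5, 27.5, 54.0, 33.0, 60.5]
```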

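(Figure: benchmark plots. Left: v0.23, right: v0.24.)

It is a nice alternative to know about, and it does better when you have smaller frames with fewer groups, but I would still recommend transform as a first choice. I thought this was worth sharing anyway.

Benchmarking code, for reference: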

import numpy as np
import pandas as pd
import perfplot

perfplot.show(
    setup=lambda n: pd.DataFrame({'A': np.random.choice(n//10, n), 'B': np.ones(n)}),
    kernels=[
        lambda df: df.groupby('A')['B'].transform('sum'),
        lambda df:  df.A.map(df.groupby('A')['B'].sum()),
    ],
    labels=['GroupBy.transform', 'GroupBy.sum + map'],
    n_range=[2**k for k in range(5, 20)],
    xlabel='N',
    logy=True,
    logx=True
)

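【Comments】:

This is good to know! Would you mind including version numbers (at least in future perfplots)? The performance difference is interesting, but these are, after all, implementation details that may be ironed out in the future. Especially if developers take note of your posts.

@jpp Yes, that's fair! I have added versions. This was tested on 0.23, but I believe the difference shows up on any version above 0.20.

【Answer 3】:

In general I suggest using the more powerful apply. With it you can write your queries as single expressions, even for more complicated uses, such as defining a new column whose values are the result of operations on groups and which can also take different values within the same group!

This is more general than the simple case of defining a column with the same value for every row of a group (like sum in this question, which varies across groups but is the same within a group).

Simple case (new column with the same value within a group, different across groups):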

# I'm assuming the name of your dataframe is something long, like
# `my_data_frame`, to show the power of being able to write your
# data processing in a single expression without multiple statements and
# multiple references to your long name, which is the normal style
# that the pandas API naturally makes you adopt, but which make the
# code often verbose, sparse, and a pain to generalize or refactor

my_data_frame = pd.DataFrame({
    'Date': ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05', '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'], 
    'Sym': ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'], 
    'Data2': [11, 8, 10, 15, 110, 60, 100, 40],
    'Data3': [5, 8, 6, 1, 50, 100, 60, 120]})

(my_data_frame
    # create groups by 'Date'
    .groupby(['Date'])
    # for every small Group DataFrame `gdf` with the same 'Date', do:
    # assign a new column 'Data4' to it, with the value being
    # the sum of 'Data3' for the small dataframe `gdf`
    .apply(lambda gdf: gdf.assign(Data4=lambda gdf: gdf['Data3'].sum()))
    # after groupby operations, the variable(s) you grouped by on
    # are set as indices. In this case, 'Date' was set as an additional
    # level for the (multi)index. But it is still also present as a
    # column. Thus, we drop it from the index:
    .droplevel(0)
)

### OR

# We don't even need to define a variable for our dataframe.
# We can chain everything in one expression

(pd
    .DataFrame({
        'Date': ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05', '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'], 
        'Sym': ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'], 
        'Data2': [11, 8, 10, 15, 110, 60, 100, 40],
        'Data3': [5, 8, 6, 1, 50, 100, 60, 120]})
    .groupby(['Date'])
    .apply(lambda gdf: gdf.assign(Data4=lambda gdf: gdf['Data3'].sum()))
    .droplevel(0)
)

Output:

	Date	Sym	Data2	Data3	Data4
3	2015-05-05	aapl	15	1	121
7	2015-05-05	aaww	40	120	121
2	2015-05-06	aapl	10	6	66
6	2015-05-06	aaww	100	60	66
1	2015-05-07	aapl	8	8	108
5	2015-05-07	aaww	60	100	108
0	2015-05-08	aapl	11	5	55
4	2015-05-08	aaww	110	50	55

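(Why is the Python expression wrapped in parentheses? So that we don't need to sprinkle backslashes all over the code, and so that we can put comments inside the expression to describe every step.)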

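What is powerful about this? It harnesses the full power of the "split-apply-combine paradigm". It lets you think in terms of "splitting your dataframe into blocks" and "running arbitrary operations on those blocks" without reducing/aggregating, i.e. without reducing the number of rows. (And without writing explicit, verbose loops or resorting to expensive joins or concatenations to glue the results back together.)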
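
To make the paradigm concrete, here is a hand-written sketch of the three steps on the question's Date and Data3 columns. The loop and the names pieces and combined are only illustrative. It spells out the kind of work that groupby plus apply or transform saves you from writing yourself.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'] * 2,
    'Data3': [5, 8, 6, 1, 50, 100, 60, 120],
})

pieces = []
# Split: iterating over a GroupBy yields (key, sub-frame) pairs.
for date, gdf in df.groupby('Date'):
    # Apply: add the group's total to every row of this block.
    pieces.append(gdf.assign(Data4=gdf['Data3'].sum()))

# Combine: glue the blocks back together and restore the original row order.
combined = pd.concat(pieces).sort_index()

print(combined['Data4'].tolist())  # [55, 108, 66, 121, 55, 108, 66, 121]
```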

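Let's consider a more complex example. Your dataframe holds several time series. You have one column identifying the product, one column with timestamps, and one column with the number of items sold for that product at a given time of the year. You want to group by product and get a new column containing the cumulative total of items sold for each product. Within each "block" of rows for the same product, that column should still be a time series and should be monotonically increasing (only within the block).

How can we do this? With groupby + apply!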

(pd
     .DataFrame({
        'Date': ['2021-03-11','2021-03-12','2021-03-13','2021-03-11','2021-03-12','2021-03-13'], 
        'Product': ['shirt','shirt','shirt','shoes','shoes','shoes'], 
        'ItemsSold': [300, 400, 234, 80, 10, 120],
        })
    .groupby(['Product'])
    .apply(lambda gdf: (gdf
        # sort by date within a group
        .sort_values('Date')
        # create new column
        .assign(CumulativeItemsSold=lambda df: df['ItemsSold'].cumsum())))
    .droplevel(0)
)

Output:

	Date	Product	ItemsSold	CumulativeItemsSold
0	2021-03-11	shirt	300	300
1	2021-03-12	shirt	400	700
2	2021-03-13	shirt	234	934
3	2021-03-11	shoes	80	80
4	2021-03-12	shoes	10	90
5	2021-03-13	shoes	120	210
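
For this particular cumulative-sum case, pandas also offers a built-in GroupBy.cumsum. The sketch below is an alternative to the apply version above, not a replacement for the general pattern. It uses the same sample data, and the variable names sales and result are illustrative.

```python
import pandas as pd

sales = pd.DataFrame({
    'Date': ['2021-03-11', '2021-03-12', '2021-03-13'] * 2,
    'Product': ['shirt'] * 3 + ['shoes'] * 3,
    'ItemsSold': [300, 400, 234, 80, 10, 120],
})

result = (sales
    # sort so the running total follows date order within each product
    .sort_values(['Product', 'Date'])
    # cumsum on the grouped column keeps one output row per input row
    .assign(CumulativeItemsSold=lambda d: d.groupby('Product')['ItemsSold'].cumsum())
)

print(result['CumulativeItemsSold'].tolist())  # [300, 700, 934, 80, 90, 210]
```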

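What is another advantage of this approach? It works even when we have to group by several fields! For example, if our products had a 'Color' field and we wanted the cumulative series grouped by (Product, Color), we could write: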

(pd
     .DataFrame({
        'Date': ['2021-03-11','2021-03-12','2021-03-13','2021-03-11','2021-03-12','2021-03-13',
                 '2021-03-11','2021-03-12','2021-03-13','2021-03-11','2021-03-12','2021-03-13'], 
        'Product': ['shirt','shirt','shirt','shoes','shoes','shoes',
                    'shirt','shirt','shirt','shoes','shoes','shoes'], 
        'Color': ['yellow','yellow','yellow','yellow','yellow','yellow',
                  'blue','blue','blue','blue','blue','blue'], # new!
        'ItemsSold': [300, 400, 234, 80, 10, 120,
                      123, 84, 923, 0, 220, 94],
        })
    .groupby(['Product', 'Color']) # We group by 2 fields now
    .apply(lambda gdf: (gdf
        .sort_values('Date')
        .assign(CumulativeItemsSold=lambda df: df['ItemsSold'].cumsum())))
    .droplevel([0,1]) # We drop 2 levels now
)

Output:

	Date	Product	Color	ItemsSold	CumulativeItemsSold
6	2021-03-11	shirt	blue	123	123
7	2021-03-12	shirt	blue	84	207
8	2021-03-13	shirt	blue	923	1130
0	2021-03-11	shirt	yellow	300	300
1	2021-03-12	shirt	yellow	400	700
2	2021-03-13	shirt	yellow	234	934
9	2021-03-11	shoes	blue	0	0
10	2021-03-12	shoes	blue	220	220
11	2021-03-13	shoes	blue	94	314
3	2021-03-11	shoes	yellow	80	80
4	2021-03-12	shoes	yellow	10	90
5	2021-03-13	shoes	yellow	120	210

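(This ease of extending to grouping by several fields is the reason I like to always pass the groupby arguments as a list, even when there is a single name, like 'Product' in the previous example.)

And you can do all of this concisely in a single expression. (Sure, it would look even nicer if Python's lambdas were nicer to look at.)

Why did I go over the general case? Because this is one of the first SO questions that comes up when searching for things like "pandas new column groupby".

Other thoughts on the API for this kind of operation

Adding columns based on arbitrary computations made on groups is much like the nice idiom of defining a new column using aggregations over Windows in SparkSQL.

For example, you can think of this (it is Scala code, but the equivalent in PySpark looks practically the same):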

val byDepName = Window.partitionBy('depName)
empsalary.withColumn("avg", avg('salary) over byDepName)

as being something like this (using pandas in the way we have seen above):

empsalary = pd.DataFrame(...some dataframe...)
(empsalary
    # our `Window.partitionBy('depName)`
    .groupby(['depName'])
    # our 'withColumn("avg", avg('salary) over byDepName)
    .apply(lambda gdf: gdf.assign(avg=lambda df: df['salary'].mean()))
    .droplevel(0)
)

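(Notice how much more concise and pleasant the Spark example is. The pandas equivalent looks a bit clunky. The pandas API does not make it easy to write this kind of "fluent" operation.)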
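
For the plain per-group average, a shorter pandas spelling is also possible with transform. This is a sketch under assumptions: the empsalary rows below are made-up sample data, and the column names depName and salary follow the Spark snippet.

```python
import pandas as pd

# Hypothetical sample rows standing in for the empsalary table.
empsalary = pd.DataFrame({
    'depName': ['develop', 'develop', 'sales', 'sales', 'personnel'],
    'salary': [6000, 4200, 5000, 4800, 3900],
})

# Rough counterpart of `withColumn("avg", avg('salary) over byDepName)`:
# the per-department mean, broadcast back to every row.
with_avg = empsalary.assign(
    avg=lambda d: d.groupby('depName')['salary'].transform('mean')
)

print(with_avg['avg'].tolist())  # [5100.0, 5100.0, 4900.0, 4900.0, 3900.0]
```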

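This idiom in turn comes from SQL's Window Functions, for which the PostgreSQL documentation gives a very nice definition: (emphasis mine)

A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row - the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.

It also gives a beautiful SQL one-liner example: (ranking within groups)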

SELECT depname, empno, salary, rank() OVER (PARTITION BY depname ORDER BY salary DESC) FROM empsalary;

depname	empno	salary	rank
develop	8	6000	1
develop	10	5200	2
develop	11	5200	2
develop	9	4500	4
develop	7	4200	5
personnel	2	3900	1
personnel	5	3500	2
sales	1	5000	1
sales	4	4800	2
sales	3	4800	2
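
A pandas counterpart of this ranking query can be sketched with GroupBy.rank. The rows are copied from the SQL output above. method='min' is chosen here because it reproduces SQL rank() tie handling, where tied rows share the lowest rank and the next rank is skipped.

```python
import pandas as pd

empsalary = pd.DataFrame({
    'depname': ['develop'] * 5 + ['personnel'] * 2 + ['sales'] * 3,
    'empno': [8, 10, 11, 9, 7, 2, 5, 1, 4, 3],
    'salary': [6000, 5200, 5200, 4500, 4200, 3900, 3500, 5000, 4800, 4800],
})

ranked = empsalary.assign(
    # rank within each department, highest salary first
    rank=lambda d: (d.groupby('depname')['salary']
                     .rank(method='min', ascending=False)
                     .astype(int))
)

print(ranked['rank'].tolist())  # [1, 2, 2, 4, 5, 1, 2, 1, 2, 2]
```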

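One last thing: you may also be interested in pandas' pipe, which is similar to apply but works a bit differently and gives the internal operations a bigger scope to work on. See here for more info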
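
Here is a minimal sketch of pipe on the question's data. DataFrame.pipe hands the whole frame to a function as its first argument, so the function can group, aggregate and assign in one place. The helper add_group_sum and its parameter names are hypothetical, introduced only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'] * 2,
    'Data3': [5, 8, 6, 1, 50, 100, 60, 120],
})

def add_group_sum(frame, by, value, new_col):
    """Hypothetical helper: append the per-group sum of `value` as `new_col`."""
    return frame.assign(**{new_col: frame.groupby(by)[value].transform('sum')})

# pipe calls add_group_sum(df, by='Date', value='Data3', new_col='Data4').
result = df.pipe(add_group_sum, by='Date', value='Data3', new_col='Data4')

print(result['Data4'].tolist())  # [55, 108, 66, 121, 55, 108, 66, 121]
```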

【Comments】:

【Answer 4】:

import pandas as pd

df = pd.DataFrame({
    'Date' : ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05', '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'], 
    'Sym'  : ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'], 
    'Data2': [11, 8, 10, 15, 110, 60, 100, 40],
    'Data3': [5, 8, 6, 1, 50, 100, 60, 120]
})
# Sum Data2 and Data3 per date, with one column per symbol.
print(pd.pivot_table(data=df, index='Date', columns='Sym', aggfunc={'Data2': 'sum', 'Data3': 'sum'}))

Output:

           Data2      Data3     
Sym         aapl aaww  aapl aaww
Date                            
2015-05-05    15   40     1  120
2015-05-06    10  100     6   60
2015-05-07     8   60     8  100
2015-05-08    11  110     5   50

【Comments】:
