Aggregate grouped data conditionally over many columns doing different operations in Python/Pandas

Posted 2023-02-16

Question (posted 2021-11-10 13:39:09):

Consider the following simplified example dataframe df:

Department     CustomerID    Date          Price     MenswearDemand  HomeDemand
0    Menswear      418089    2019-04-18    199             199           0
1    Menswear      613573    2019-04-24    199             199           0
2    Menswear      161840    2019-04-25    199             199           0
3    Menswear     2134926    2019-04-29    199             199           0
4    Menswear      984801    2019-04-30     19              19           0
5        Home      398555    2019-01-27     52               0          52
6    Menswear      682906    2019-02-03     97              97           0
7    Menswear      682906    2019-02-03     97              97           0
8    Menswear      923491    2019-02-09     80              80           0
9    Menswear     1098782    2019-02-25    258             258           0
10   Menswear      721696    2019-03-25     12              12           0
11   Menswear      695706    2019-04-10    129             129           0
12  Underwear      637026    2019-01-18    349               0           0
13  Underwear      205997    2019-01-25    279               0           0
14  Underwear      787984    2019-02-01     27               0           0
15  Underwear      318256    2019-02-01    279               0           0
16  Underwear      570454    2019-02-14    262               0           0
17  Underwear     1239118    2019-02-28    279               0           0
18       Home     1680791    2019-04-04   1398               0        1398

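I want to group this data by 'CustomerID' and then:

'Date': convert to the number of days from each customer's most recent purchase until a cutoff date (the text names '2021-01-01'; the expected output and the code below use the day after the latest date in the data)

'MenswearDemand': sum

'HomeDemand': sum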

The result I should get looks like this:

            Date  MenswearDemand  HomeDemand
CustomerID                                  
161840         6             199           0
205997        96               0           0
318256        89               0           0
398555        94               0          52
418089        13             199           0
570454        76               0           0
613573         7             199           0
637026       103               0           0
682906        87             194           0
695706        21             129           0
721696        37              12           0
787984        89               0           0
923491        81              80           0
984801         1              19           0
1098782       65             258           0
1239118       62               0           0
1680791       27               0        1398
2134926        2             199           0

This is how I managed to solve it:

import datetime as dt
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
# the cutoff is the day after the most recent purchase in the data
cutoffDate = df['Date'].max() + dt.timedelta(days = 1)
newdf = df.groupby('CustomerID').agg({'Date': lambda x: (cutoffDate - x.max()).days,
                                      'MenswearDemand': lambda x: x.sum(),
                                      'HomeDemand': lambda x: x.sum()})

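However, in reality I have about 15 million rows and 30 demand columns. I really don't want to write out 'DemandColumn': lambda x: x.sum() for every one of them in my aggregate function, since they should all simply be summed. Is there a better way to do this, such as passing in an array of the subset of columns that a particular operation should be applied to?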

Answer 1:

If you know the column names in advance, you can build the dictionary before passing it to the agg function.

...
cutoffDate = df['Date'].max() + dt.timedelta(days=1)

# start with the special rule for the Date column
agg_dict = {'Date': lambda x: (cutoffDate - x.max()).days}

# then add the same sum rule for every demand column
DemandColumns = ['MenswearDemand', 'HomeDemand']
f = lambda x: x.sum()
agg_dict.update({col_name: f for col_name in DemandColumns})

newdf = df.groupby('CustomerID').agg(agg_dict)
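
As a rough, self-contained illustration of the dictionary approach above, the sketch below runs it on four rows copied from the question's sample data. The lowercase variable names are illustrative choices. The string 'sum' stands in for the lambda on the assumption that every demand column is numeric.

```python
import pandas as pd

# Four rows taken from the question's sample data.
df = pd.DataFrame({
    "CustomerID": [682906, 682906, 398555, 984801],
    "Date": ["2019-02-03", "2019-02-03", "2019-01-27", "2019-04-30"],
    "MenswearDemand": [97, 97, 0, 19],
    "HomeDemand": [0, 0, 52, 0],
})
df["Date"] = pd.to_datetime(df["Date"])

# Cutoff is the day after the latest purchase in the data (2019-05-01 here).
cutoff_date = df["Date"].max() + pd.Timedelta(days=1)

demand_columns = ["MenswearDemand", "HomeDemand"]

# One special rule for Date, then the built-in 'sum' for every demand column.
agg_dict = {"Date": lambda x: (cutoff_date - x.max()).days}
agg_dict.update({col: "sum" for col in demand_columns})

newdf = df.groupby("CustomerID").agg(agg_dict)
print(newdf)
```

With 30 demand columns only the demand_columns list grows; the aggregation call stays the same.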

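Another option (again knowing the column names, DemandColumns in the previous example) is to first compute the Date column with the agg function, and then use the filter function, passing the list of desired columns as the items argument so that only those exact columns are kept.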

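...
cutoffDate = df['Date'].max() + dt.timedelta(days=1)
groups = df.groupby('CustomerID')
# aggregate only the Date column; the lambda is not valid for non-date columns
newdf = groups.agg({'Date': lambda x: (cutoffDate - x.max()).days})
# sum the demand columns per group and put them next to the Date result
newdf = pd.concat([newdf, groups.apply(lambda x: x.filter(items=DemandColumns).agg(sum))], axis=1)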

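If the desired columns (DemandColumns) follow a given pattern, you can skip building the list and use the filter function with its regex argument instead. In this case, the regular expression '.*Demand$' returns all columns whose names end with the string Demand.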

newdf = pd.concat([newdf, groups.apply(lambda x: x.filter(regex='.*Demand$').agg(sum))], axis=1)
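
The regex selection can also be done once on the whole frame instead of inside groups.apply. The sketch below is a variation on the answer, not the answer's own code: it assumes every column ending in Demand is numeric, uses a few rows from the question's sample data, and relies on the built-in groupby max and sum rather than per-group lambdas.

```python
import pandas as pd

# A few rows from the question's sample data, including non-demand columns.
df = pd.DataFrame({
    "Department": ["Menswear", "Menswear", "Home", "Menswear"],
    "CustomerID": [682906, 682906, 398555, 984801],
    "Date": pd.to_datetime(
        ["2019-02-03", "2019-02-03", "2019-01-27", "2019-04-30"]
    ),
    "Price": [97, 97, 52, 19],
    "MenswearDemand": [97, 97, 0, 19],
    "HomeDemand": [0, 0, 52, 0],
})
cutoff_date = df["Date"].max() + pd.Timedelta(days=1)

# Collect every column whose name ends in "Demand" without listing them by hand.
demand_columns = df.filter(regex="Demand$").columns.tolist()

groups = df.groupby("CustomerID")

# Days from each customer's most recent purchase to the cutoff date.
days_since_last = (cutoff_date - groups["Date"].max()).dt.days

# Sum only the selected demand columns per customer.
demand_totals = groups[demand_columns].sum()

newdf = pd.concat([days_since_last, demand_totals], axis=1)
print(newdf)
```

Because both pieces use built-in groupby reductions, this form avoids calling a Python function once per group, which may matter with millions of rows.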

Comments:

To improve performance, use np.sum or simply 'sum' instead of lambda x: x.sum() as the agg func argument. With many groups this should be faster.

Answer 2:

Just to offer an alternative based on convtools:

from datetime import datetime, timedelta

from convtools import conversion as c
from convtools.contrib.tables import Table


# this way you can define multiple meaningful metrics
metric_to_config = {
    "sum_MenswearDemand": {
        "column": "MenswearDemand",
        "reducer": c.ReduceFuncs.Sum,
    },
    "sum_HomeDemand": {"column": "HomeDemand", "reducer": c.ReduceFuncs.Sum},
    "median_Age": {"column": "Age", "reducer": c.ReduceFuncs.Median},
}

# pass required metric names as input
required_metrics = ["sum_MenswearDemand", "sum_HomeDemand"]

# prepare aggregation config
parse_date = c.call_func(datetime.strptime, c.this(), "%Y-%m-%d").call_method(
    "date"
)
aggregate_config = {
    "CustomerID": c.item("CustomerID"),
    "Date": c.ReduceFuncs.Max(c.item("Date")).pipe(parse_date),
}
for metric in required_metrics:
    config = metric_to_config[metric]
    reducer = config["reducer"]
    column = config["column"]
    aggregate_config[metric] = reducer(c.item(column))

# this is where code generation happens
converter = (
    c.group_by(c.item("CustomerID"))
    .aggregate(aggregate_config)
    .pipe(
        # total max is calculated below and saved under "max_date" label
        # here we replace "Date" with day diffs
        c.iter_mut(
            c.Mut.set_item(
                "Date", (c.label("max_date") - c.item("Date")).attr("days")
            )
        ),
        # calculate max date from aggregation results
        label_input={
            "max_date": (
                c.call_func(max, c.iter(c.item("Date")))
                + timedelta(days=1)
            )
        },
    )
    .gen_converter()
)

# reading required columns from input csv file
rows = (
    Table.from_csv(
        "tmp/input_1.csv",
        header=True,
        dialect=Table.csv_dialect(delimiter="\t"),
    )
    .take(
        "CustomerID",
        "Date",
        *(metric_to_config[metric]["column"] for metric in required_metrics),
    )
    .into_iter_rows(dict)
)
# aggregating input rows
iterable_of_results = converter(rows)

# outputting to csv file if needed
Table.from_rows(iterable_of_results).into_csv(
    "tmp/out.csv",
    dialect=Table.csv_dialect(delimiter="\t"),
)

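The benefit of this library is that it is lightweight, has no dependencies, supports stream processing, and is sometimes no slower than pandas/polars because the generated code is simple.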
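
To make the stream-processing idea concrete, here is a minimal sketch in plain Python using only the standard library. It is not convtools and not the code convtools generates. The function name aggregate_stream is hypothetical, and the rows are assumed to arrive as dicts of strings, as they would from a CSV reader.

```python
from datetime import date, timedelta


def aggregate_stream(rows, demand_columns):
    """Group an iterable of dict rows by CustomerID in a single pass."""
    latest = {}  # CustomerID -> most recent purchase date
    totals = {}  # CustomerID -> {column: running sum}
    for row in rows:
        key = row["CustomerID"]
        purchased = date.fromisoformat(row["Date"])
        if key not in latest or purchased > latest[key]:
            latest[key] = purchased
        if key not in totals:
            totals[key] = dict.fromkeys(demand_columns, 0)
        for col in demand_columns:
            totals[key][col] += int(row[col])
    if not latest:
        return []
    # The cutoff is only known once the whole stream has been consumed.
    cutoff = max(latest.values()) + timedelta(days=1)
    return [
        {"CustomerID": key, "Date": (cutoff - latest[key]).days, **totals[key]}
        for key in sorted(latest)
    ]


# Rows are consumed one at a time, so the input can be a generator or a file reader.
rows = iter([
    {"CustomerID": "682906", "Date": "2019-02-03", "MenswearDemand": "97", "HomeDemand": "0"},
    {"CustomerID": "682906", "Date": "2019-02-03", "MenswearDemand": "97", "HomeDemand": "0"},
    {"CustomerID": "398555", "Date": "2019-01-27", "MenswearDemand": "0", "HomeDemand": "52"},
    {"CustomerID": "984801", "Date": "2019-04-30", "MenswearDemand": "19", "HomeDemand": "0"},
])
result = aggregate_stream(rows, ["MenswearDemand", "HomeDemand"])
print(result)
```

Only one small record per customer is held in memory, so the full input never needs to be loaded at once.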
