Pandas DataFrame: Rolling Set Union Aggregation over multiple Groups

Posted 2023-03-11

Asked: 2019-03-24 10:23:12

Question:

I have a DataFrame with a DatetimeIndex, a column I want to group by, and a column containing sets of integers:

import pandas as pd

# Each row holds a timestamp, a group label, and a set of integer ids.
df = pd.DataFrame([['2018-01-01', 1, {1, 2, 3}],
                   ['2018-01-02', 1, {3}],
                   ['2018-01-03', 1, {3, 4, 5}],
                   ['2018-01-04', 1, {5, 6}],
                   ['2018-01-01', 2, {7}],
                   ['2018-01-02', 2, {8}],
                   ['2018-01-03', 2, {9}],
                   ['2018-01-04', 2, {10}]],
                  columns=['timestamp', 'group', 'ids'])

df['timestamp'] = pd.to_datetime(df['timestamp'])
df.set_index('timestamp', inplace=True)

            group        ids
timestamp                   
2018-01-01      1  {1, 2, 3}
2018-01-02      1        {3}
2018-01-03      1  {3, 4, 5}
2018-01-04      1     {5, 6}
2018-01-01      2        {7}
2018-01-02      2        {8}
2018-01-03      2        {9}
2018-01-04      2       {10}

Within each group, I want to build a rolling set union over the past x days. So, assuming x=3, the result should be:

            group              ids
timestamp                   
2018-01-01      1        {1, 2, 3}
2018-01-02      1        {1, 2, 3}
2018-01-03      1  {1, 2, 3, 4, 5}
2018-01-04      1     {3, 4, 5, 6}
2018-01-01      2              {7}
2018-01-02      2           {7, 8}
2018-01-03      2        {7, 8, 9}
2018-01-04      2       {8, 9, 10}
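
To make the target semantics concrete, here is a minimal pure-Python sketch of a rolling set union over one group's sets, with no pandas involved. The function name rolling_union and its window parameter are illustrative assumptions, not part of the original question; the window counts rows, which matches "past 3 days" only when there is exactly one row per day.

```python
def rolling_union(sets, window=3):
    """Return, for each position, the union of the last `window` sets (inclusive)."""
    out = []
    for i in range(len(sets)):
        start = max(0, i - window + 1)  # clamp at the start of the sequence
        out.append(set().union(*sets[start:i + 1]))
    return out


# The 'ids' values of group 1 from the example above.
group_1 = [{1, 2, 3}, {3}, {3, 4, 5}, {5, 6}]
result = rolling_union(group_1)
print(result)
```

Applied to group 1, the last element is the union of rows two through four only, which is why 1 and 2 drop out of the final set.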

From the answers to my previous question, I know how to do this without grouping, so this is the solution I have come up with so far:

grouped = df.groupby('group')
new_df = pd.DataFrame()
for name, group in grouped:
    group['ids'] = [
        set.union(*group['ids'].to_frame().iloc(axis=1)[max(0, i-2): i+1,0])
        for i in range(len(group.index))
    ]
    new_df = new_df.append(group)

It gives the correct result, but it looks clumsy and also produces the following warning:

SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

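However, the documentation at the linked page does not seem to fit my exact case. (At least I cannot make sense of it in this situation.)

My question: how can I improve this code so that it is clean, performant, and does not raise the warning?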

Answer 1:

As mentioned in the docs, do not use pd.DataFrame.append in a loop; doing so is expensive.

Instead, collect the pieces in a list and pass it to pd.concat.
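
The list-then-concat pattern can be sketched as follows. The toy DataFrame, the value and total columns, and the pieces variable are made up purely for illustration. Note also that DataFrame.append was removed in pandas 2.0, so on recent versions this pattern is the way to go regardless of speed.

```python
import pandas as pd

# Toy data, invented for this illustration only.
df = pd.DataFrame({'group': [1, 1, 2, 2], 'value': [10, 20, 30, 40]})

pieces = []
for _, group in df.groupby('group'):
    # assign returns a new DataFrame, so the original group is not modified.
    pieces.append(group.assign(total=group['value'].sum()))

# A single concatenation at the end instead of one copy per iteration.
res = pd.concat(pieces)
print(res)
```

Each call to append inside a loop copies everything accumulated so far, so the cost grows quadratically with the number of groups, whereas a single pd.concat copies the data once.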

You can avoid the SettingWithCopyWarning by creating copies of the data inside the list, i.e. by using assign + iloc within a list comprehension, which avoids chained indexing:

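# For each group, replace 'ids' (the last column) with the union of the
# current row and up to two preceding rows.
L = [group.assign(ids=[set.union(*group.iloc[max(0, i-2): i+1, -1])
                       for i in range(len(group.index))])
     for _, group in df.groupby('group')]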

res = pd.concat(L)

print(res)

            group              ids
timestamp                         
2018-01-01      1        {1, 2, 3}
2018-01-02      1        {1, 2, 3}
2018-01-03      1  {1, 2, 3, 4, 5}
2018-01-04      1     {3, 4, 5, 6}
2018-01-01      2              {7}
2018-01-02      2           {8, 7}
2018-01-03      2        {8, 9, 7}
2018-01-04      2       {8, 9, 10}
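
The answer's window is positional (the current row plus the two rows before it), which equals "the past 3 days" only while every group has exactly one row per day. As a hedged variation, the sketch below selects the window by timestamp instead, so gaps in the dates are respected. The helper rolling_union_by_days, its days parameter, and the gap in the sample data are assumptions for illustration and are not part of the original answer.

```python
import pandas as pd


def rolling_union_by_days(group, days=3):
    """Union the sets in 'ids' over the trailing `days` calendar days, inclusive."""
    group = group.sort_index()
    span = pd.Timedelta(days=days)
    unions = []
    for ts in group.index:
        # Rows whose timestamp lies in the half-open interval (ts - span, ts].
        in_window = (group.index > ts - span) & (group.index <= ts)
        unions.append(set().union(*group.loc[in_window, 'ids']))
    # assign returns a copy, so no SettingWithCopyWarning is triggered.
    return group.assign(ids=unions)


df = pd.DataFrame(
    [['2018-01-01', 1, {1, 2, 3}],
     ['2018-01-02', 1, {3}],
     ['2018-01-04', 1, {5, 6}],  # 2018-01-03 is deliberately missing
     ['2018-01-01', 2, {7}],
     ['2018-01-02', 2, {8}]],
    columns=['timestamp', 'group', 'ids'])
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.set_index('timestamp')

res = pd.concat([rolling_union_by_days(g) for _, g in df.groupby('group')])
print(res)
```

With the missing day, the row for 2018-01-04 in group 1 unions only the entries from 2018-01-02 and 2018-01-04, whereas a positional window of three rows would also pull in 2018-01-01. This still loops over rows in Python, so it is a sketch of the semantics rather than a faster alternative.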

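Comments:

This avoids the warning, and it is also about an order of magnitude faster than my "solution". Thanks! Something inside me still wishes it would work without a double for loop, though. :/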