Multiindex duplicated when rolling() applied on a groupby pandas object

Posted 2023-03-11

【Asked】: 2019-08-26 06:50:20 【Question】:

I get a bug when running:

x.field.rolling(window=5,min_periods=1).mean() where x is a pandas.core.groupby.groupby.DataFrameGroupBy object.

I tried the solution proposed in the linked page. So I did:

x.field.apply(lambda x: x.rolling(window=5,min_periods=1).mean())

Contrary to what that page describes, I still get the same bug:

+---------+---------+-------+--------------------+
| machin  | machin  | truc  | a column of series |
+---------+---------+-------+--------------------+
| machin1 | machin1 | truc1 | 1                  |
|         |         | truc2 | 2                  |
|         |         | truc3 | 3                  |
|         |         | truc4 | 4                  |
| machin2 | machin2 | truc1 | 100                |
|         |         | truc2 | 99                 |
|         |         | truc3 | 98                 |
+---------+---------+-------+--------------------+

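As you can see, the index level "machin" is duplicated compared with the result before the rolling method is used.

For example, let's write x.field.apply(lambda x: x+1). It returns: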

+---------+-------+--------------------+
| machin  | truc  | a column of series |
+---------+-------+--------------------+
| machin1 | truc1 | 2                  |
|         | truc2 | 3                  |
|         | truc3 | 4                  |
|         | truc4 | 5                  |
| machin2 | truc1 | 101                |
|         | truc2 | 100                |
|         | truc3 | 99                 |
+---------+-------+--------------------+

So no duplication, no bug. This shows that the problem really comes from the rolling() method.

Here is some code to help you reproduce my computation:

import pandas as pd

#creation of records
rec=[{'machin':'machin1',
    'truc':['truc1','truc2','truc3','truc4'],
    'a column':[1,2,3,4]},
    {'machin':'machin2',
    'truc':['truc1','truc2','truc3'],
    'a column':[100,99,98]}]

#creation of pandas dataframe
df=pd.concat([pd.DataFrame(rec[0]),pd.DataFrame(rec[1])])

#creation of multi-index
df.set_index(['machin','truc'],inplace=True)

#creation of a groupby object
x=df.groupby(by='machin')

#rolling computation. Note that x.field and x['field'] are equivalent and give the same bug, as I checked.
x['a column'].rolling(window=5,min_periods=1).mean()

#rolling with apply and lambda, gives same bug
x['a column'].apply(lambda x:x.rolling(window=5,min_periods=1).mean())

#making apply and lambda alone gives no bug
a=x['a column'].apply(lambda x: x+1)

Other solutions I tried

I tried resetting the index of the series (see the reset_index documentation):

a.reset_index(name='machin')

This raises an exception: ValueError: cannot insert machin, already exists

Yet "machin" does appear among the names of the MultiIndex:

a.index
MultiIndex(levels=[['machin1', 'machin2'], ['machin1', 'machin2'],  ['truc1', 'truc2', 'truc3', 'truc4']],
       labels=[[0, 0, 0, 0, 1, 1, 1], [0, 0, 0, 0, 1, 1, 1], [0, 1, 2, 3, 0, 1, 2]],
       names=['machin', 'machin', 'truc'])

I also tried drop (see the drop documentation):

a.drop(index='machin')
a.drop(index=0)

These raise an exception: KeyError: 'machin' and KeyError: 0 respectively.
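Both attempts target something other than the index level itself: drop removes row labels, so 'machin' and 0 are looked up as labels rather than as a level name or position, and reset_index tries to turn both levels named 'machin' into columns. A minimal sketch of removing the extra level by position instead is shown below. The Series is a hand-built stand-in for the rolled result, and the name deduped is illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the rolled result: a Series whose MultiIndex
# carries the 'machin' level twice, as in the a.index output above.
idx = pd.MultiIndex.from_arrays(
    [
        ['machin1'] * 4 + ['machin2'] * 3,
        ['machin1'] * 4 + ['machin2'] * 3,
        ['truc1', 'truc2', 'truc3', 'truc4', 'truc1', 'truc2', 'truc3'],
    ],
    names=['machin', 'machin', 'truc'],
)
a = pd.Series([1.0, 1.5, 2.0, 2.5, 100.0, 99.5, 99.0], index=idx, name='a column')

# Drop the first level by integer position; the name 'machin' is ambiguous
# because it occurs twice, so a position is used instead of a name.
deduped = a.copy()
deduped.index = a.index.droplevel(0)

print(deduped.index.names)
```

Index.droplevel is used on the index rather than on the Series so that the sketch does not depend on Series.droplevel, which is assumed to be unavailable in a pandas version as old as the one in the question.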

My versions

Python 3.7.1 (default, Dec 14 2018, 19:28:38) in an Anaconda environment, and also in the terminal: [GCC 7.3.0] :: Anaconda, Inc. on linux

pandas 0.23.4

【Answer 1】:

Use the group_keys argument of groupby:

df.groupby('machin', group_keys=False).rolling(window=5, min_periods=1).mean()

Alternatively, you can use reset_index to drop level 0, which is the level that rolling inserts:

df.groupby('machin').rolling(window=5, min_periods=1).mean().reset_index(level=0, drop=True)

Either one outputs:

               a column
machin  truc           
machin1 truc1       1.0
        truc2       1.5
        truc3       2.0
        truc4       2.5
machin2 truc1     100.0
        truc2      99.5
        truc3      99.0
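The second option can be checked end to end with a small self-contained sketch. It builds the question's data in a single DataFrame (a shortcut assumed here in place of the question's pd.concat), and the names rolled and fixed are illustrative only.

```python
import pandas as pd

# Same data as in the question, indexed by ('machin', 'truc').
df = pd.DataFrame({
    'machin': ['machin1'] * 4 + ['machin2'] * 3,
    'truc': ['truc1', 'truc2', 'truc3', 'truc4', 'truc1', 'truc2', 'truc3'],
    'a column': [1, 2, 3, 4, 100, 99, 98],
}).set_index(['machin', 'truc'])

# Rolling mean per group; the group key is prepended to the original index.
rolled = df.groupby('machin')['a column'].rolling(window=5, min_periods=1).mean()

# Drop the prepended level by position so the index matches df.index again.
fixed = rolled.reset_index(level=0, drop=True)

print(fixed)
```

Passing level=0 as an integer matters here: the level name 'machin' occurs twice in the rolled index, so it cannot be used to identify the level to drop.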

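【Comments】:

The first option didn't work for me, but the second one works well. Thanks.

The first option didn't work for me either. I think pandas has been active in this module recently. I just upgraded to version 1.3.5, and I believe this bug has been reintroduced. I get it even with group_keys=False.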
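For versions where group_keys=False has no effect on rolling, as reported in the comments, one possible alternative is to avoid the prepended level altogether with transform, which returns a result aligned to the original index. This is a sketch under that assumption and is not taken from the answer. The DataFrame construction and the name rolled_mean are illustrative.

```python
import pandas as pd

# Same data as in the question, indexed by ('machin', 'truc').
df = pd.DataFrame({
    'machin': ['machin1'] * 4 + ['machin2'] * 3,
    'truc': ['truc1', 'truc2', 'truc3', 'truc4', 'truc1', 'truc2', 'truc3'],
    'a column': [1, 2, 3, 4, 100, 99, 98],
}).set_index(['machin', 'truc'])

# transform applies the rolling mean within each group and keeps the
# original two-level index, so no extra 'machin' level is added.
rolled_mean = df.groupby('machin')['a column'].transform(
    lambda s: s.rolling(window=5, min_periods=1).mean()
)

print(rolled_mean)
```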
