Pandas 0.18.1 groupby and resample with multilevel aggregation error


【Title】: Pandas 0.18.1 groupby and resample with multilevel aggregation error 【Posted】: 2016-12-16 02:52:32 【Question】:

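I just updated pandas from 0.17.1 to 0.18.1 and, while changing some pre-existing code, I think I found a problem with the new resample API, outlined below. According to the documentation linked below, df3_resample and df4_resample in my example should return the same DataFrame, but df4_resample raises an exception. This tripped me up for a while, so I thought I would share.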

Exception: Column(s) A already selected

http://pandas.pydata.org/pandas-docs/version/0.18.0/whatsnew.html#whatsnew-0180-breaking-resample

http://pandas.pydata.org/pandas-docs/version/0.18.1/whatsnew.html#groupby-syntax-with-window-and-resample-operations

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 4),
                  columns=list('ABCD'),
                  index=pd.date_range('2010-01-01 09:00:00', periods=10, freq='s'))
df['item'] = 'item_a'  # add column for groupby

# THIS WORKS: flat dict mapping column -> function
df1_resample = df.groupby('item').resample('2s').agg({'A': np.mean, 'B': np.max}).reset_index()
print(df1_resample)

# THIS WORKS: nested dict (column -> {new name: function}) on a plain resample
df2_resample = df.resample('2s').agg({'A': {'A_mean': np.mean, 'A_max': np.max}}).reset_index()
print(df2_resample)

# THIS WORKS: same nested dict, resampling inside groupby().apply()
df3_resample = df.groupby('item').apply(lambda x: x.resample('2s').agg({'A': {'A_mean': np.mean, 'A_max': np.max}})).reset_index()
print(df3_resample)

# THIS DOES NOT WORK: same nested dict on groupby().resample()
df4_resample = df.groupby('item').resample('2s').agg({'A': {'A_mean': np.mean, 'A_max': np.max}})
print(df4_resample)

Output:

     item             level_1         A         B
0  item_a 2010-01-01 09:00:00  0.611660  0.739640
1  item_a 2010-01-01 09:00:02  0.615876  0.880113
2  item_a 2010-01-01 09:00:04  0.218292  0.441504
3  item_a 2010-01-01 09:00:06  0.753698  0.637787
4  item_a 2010-01-01 09:00:08  0.471272  0.474738

                  index         A
                         A_mean     A_max
0 2010-01-01 09:00:00  0.611660  0.813038
1 2010-01-01 09:00:02  0.615876  0.994657
2 2010-01-01 09:00:04  0.218292  0.233478
3 2010-01-01 09:00:06  0.753698  0.848107
4 2010-01-01 09:00:08  0.471272  0.610592

     item             level_1         A
                                 A_mean     A_max
0  item_a 2010-01-01 09:00:00  0.611660  0.813038
1  item_a 2010-01-01 09:00:02  0.615876  0.994657
2  item_a 2010-01-01 09:00:04  0.218292  0.233478
3  item_a 2010-01-01 09:00:06  0.753698  0.848107
4  item_a 2010-01-01 09:00:08  0.471272  0.610592


  File "<some_file.py>", line 29, in <module>
    df4_resample = df.groupby('item').resample('2s').agg({'A': {'A_mean': np.mean, 'A_max': np.max}})

  File "C:\Anaconda2\lib\site-packages\pandas\tseries\resample.py", line 293, in aggregate
    result, how = self._aggregate(arg, *args, **kwargs)

  File "C:\Anaconda2\lib\site-packages\pandas\core\base.py", line 505, in _aggregate
    result = list(_agg(arg, _agg_1dim).values())

  File "C:\Anaconda2\lib\site-packages\pandas\core\base.py", line 496, in _agg
    result[fname] = func(fname, agg_how)

  File "C:\Anaconda2\lib\site-packages\pandas\core\base.py", line 479, in _agg_1dim
    return colg.aggregate(how, _level=(_level or 0) + 1)

  File "C:\Anaconda2\lib\site-packages\pandas\tseries\resample.py", line 293, in aggregate
    result, how = self._aggregate(arg, *args, **kwargs)

  File "C:\Anaconda2\lib\site-packages\pandas\core\base.py", line 528, in _aggregate
    result = _agg(arg, lambda fname,

  File "C:\Anaconda2\lib\site-packages\pandas\core\base.py", line 496, in _agg
    result[fname] = func(fname, agg_how)

  File "C:\Anaconda2\lib\site-packages\pandas\core\base.py", line 529, in <lambda>
    agg_how: _agg_1dim(self._selection, agg_how))

  File "C:\Anaconda2\lib\site-packages\pandas\core\base.py", line 475, in _agg_1dim
    colg = self._gotitem(name, ndim=1, subset=subset)

  File "C:\Anaconda2\lib\site-packages\pandas\core\base.py", line 680, in _gotitem
    groupby=self._groupby[key],

  File "C:\Anaconda2\lib\site-packages\pandas\core\base.py", line 326, in __getitem__
    raise Exception('Column(s) %s already selected' % self._selection)

Exception: Column(s) A already selected

【Question comments】:

【Answer 1】:

I am not sure why resample does not work here, but there is a convenient workaround that does not need a lambda. Try this:

df.groupby([
    'item', pd.Grouper(freq='2s')
]).agg({
    'A': ['mean', 'max']
}).rename(columns={
    'mean': 'A_mean', 'max': 'A_max'
}, level=1).reset_index()

Instead of using .resample('2s'), you can add pd.Grouper(freq='2s') to your groupby(). For your case it does the same thing. Here is the documentation: http://pandas.pydata.org/pandas-docs/version/0.18/generated/pandas.Grouper.html
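To check the claim that the two routes agree, here is a minimal sketch. It assumes a reasonably recent pandas rather than the 0.18.1 from the question, and it uses a seeded random frame of the same shape as the question's data. It only covers contiguous data in a single group; behaviour with empty bins is not checked here.

```python
import numpy as np
import pandas as pd

# Same shape of data as in the question: 10 rows at 1-second spacing, one group.
# The fixed seed is only there to make the sketch repeatable.
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.rand(10, 4),
                  columns=list('ABCD'),
                  index=pd.date_range('2010-01-01 09:00:00', periods=10, freq='s'))
df['item'] = 'item_a'

# Route 1: a time-based Grouper passed to groupby() alongside the 'item' key.
via_grouper = df.groupby(['item', pd.Grouper(freq='2s')])['A'].mean()

# Route 2: groupby() followed by resample().
via_resample = df.groupby('item').resample('2s')['A'].mean()

# Compare the aggregated values of the two routes.
same_values = np.allclose(via_grouper.values, via_resample.values)
print(same_values)
```

Both routes produce one row per 2-second bin for the group, so either can feed the aggregation step.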

On another note, you should avoid renaming columns with nested dictionaries (this is deprecated) and use the actual .rename() function instead.
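On newer pandas versions there is another way to get flat, renamed output columns without nested dictionaries: named aggregation. The sketch below assumes pandas 0.25 or later, so it does not apply to the 0.18.1 install from the question, and the seeded random frame is again only a stand-in for the question's data.

```python
import numpy as np
import pandas as pd

# Stand-in for the question's data: 10 rows at 1-second spacing, one group.
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.rand(10, 4),
                  columns=list('ABCD'),
                  index=pd.date_range('2010-01-01 09:00:00', periods=10, freq='s'))
df['item'] = 'item_a'

# Named aggregation: each keyword is the output column name,
# each value is a (source column, function) pair.
result = (
    df.groupby(['item', pd.Grouper(freq='2s')])
      .agg(A_mean=('A', 'mean'), A_max=('A', 'max'))
      .reset_index()
)
print(result)
```

The output columns are already single-level here, so no .rename() step is needed afterwards.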

【Comments】:
