Select/slice a multi-index DataFrame time series using a period leads to a bug?

【Posted】: 2016-12-26 16:44:37 【Question】:

I have a MultiIndex whose first level is a time series, exactly like the index of the following object:

In[168]: rng = pd.date_range('01-01-2000',periods=50,freq='M')

In[169]: long_df = pd.DataFrame(np.random.randn(50,4),index = rng, columns=['bar','baz','foo','zoo'])

In[170]: long_df = long_df.stack()

In[171]: long_df[:10]
Out[171]: 

2000-01-31  bar    2.079474
            baz   -0.569920
            foo    1.149012
            zoo   -0.228926
2000-02-29  bar    0.429502
            baz   -0.117166
            foo    0.956546
            zoo   -1.483818
2000-03-31  bar   -1.137998
            baz    1.049849

Edit

I can slice it with a period (a partial date string such as a year), and that works fine:

In[172]: long_df = long_df.sort_index()

In[173]: long_df.loc['2001']
Out[173]: 
2001-01-31  bar   -0.193987
            baz    0.769297
            foo    0.286880
            zoo   -1.431313
2001-02-28  bar   -0.840502
            baz    1.786758
            foo    0.878356
            zoo    0.433383
2001-03-31  bar    0.897548
            baz    1.901540
            foo    0.110606
            zoo    0.571267
2001-04-30  bar   -0.375377
            baz    1.423742
            foo   -0.415006
            zoo   -0.141000
(...)

However, on the multi-index DataFrame I am actually working with, the slice is not honored:

In[204]: dfmi
Out[204]: 
                      Last  Days to expiry
Date       Ticker                         
1988-12-06 HGF89   1.46894              52
           HGF90   1.17100             419
           HGG89   1.42100              80
           HGH89   1.37344             113
           HGH90   1.17450             477
           HGK89   1.28750             171
           HGK90   1.15900             539
           HGN89   1.24550             233
           HGN90   1.15900             598
           HGU89   1.21750             295
           HGU90   1.15900             659
           HGZ89   1.18500             386
1988-12-07 HGF89   1.51900              51
           HGF90   1.18900             418
           HGG89   1.46394              79
           HGH89   1.41300             112
           HGH90   1.19250             476
           HGK89   1.31750             170
           HGK90   1.17700             538
           HGN89   1.27550             232
           HGN90   1.17700             597
           HGU89   1.24250             294
           HGU90   1.17700             658
           HGZ89   1.20300             385
1988-12-08 HGF89   1.58100              50
           HGF90   1.18900             417
           HGG89   1.50894              78
           HGH89   1.43994             111
           HGH90   1.19250             475
           HGK89   1.32750             169
                   ...             ...
2016-07-05 HGK7    2.20500             325
           HGM7    2.20900             358
           HGN6    2.18150              22
           HGN7    2.21000             387
           HGQ6    2.18150              55
           HGQ7    2.21450             420
           HGU6    2.18350              85
           HGU7    2.21550             449
           HGV6    2.18700             114
           HGV7    2.21850             479
           HGX6    2.19100             146
           HGX7    2.22000             511
           HGZ6    2.19250             176
2016-07-06 HGF7    2.16700             205
           HGG7    2.17100             233
           HGH7    2.17100             266
           HGJ7    2.17550             294
           HGK7    2.17650             324
           HGM7    2.18050             357
           HGN6    2.15150              21
           HGN7    2.18150             386
           HGQ6    2.15150              54
           HGQ7    2.18600             419
           HGU6    2.15350              84
           HGU7    2.18700             448
           HGV6    2.15700             113
           HGV7    2.19000             478
           HGX6    2.16100             145
           HGX7    2.19150             510
           HGZ6    2.16300             175

[167701 rows x 2 columns]

In[204]: dfmi = dfmi.sort_index()

In[205]: dfmi.loc['2001']
Out[206]: 
                      Last  Days to expiry
Date       Ticker                         
1988-12-06 HGF89   1.46894              52
           HGF90   1.17100             419
           HGG89   1.42100              80
           HGH89   1.37344             113
           HGH90   1.17450             477
           HGK89   1.28750             171
           HGK90   1.15900             539
           HGN89   1.24550             233
           HGN90   1.15900             598
           HGU89   1.21750             295
           HGU90   1.15900             659
1988-12-07 HGF89   1.51900              51
           HGF90   1.18900             418
           HGG89   1.46394              79
           HGH89   1.41300             112
           HGH90   1.19250             476
           HGK89   1.31750             170
           HGK90   1.17700             538
           HGN89   1.27550             232
           HGN90   1.17700             597
           HGU89   1.24250             294
           HGU90   1.17700             658
1988-12-08 HGF89   1.58100              50
           HGF90   1.18900             417
           HGG89   1.50894              78
           HGH89   1.43994             111
           HGH90   1.19250             475
           HGK89   1.32750             169
           HGK90   1.17700             537
           HGN89   1.27750             231
                   ...             ...
2016-07-05 HGH7    2.19950             267
           HGJ7    2.20400             295
           HGK7    2.20500             325
           HGM7    2.20900             358
           HGN6    2.18150              22
           HGN7    2.21000             387
           HGQ6    2.18150              55
           HGQ7    2.21450             420
           HGU6    2.18350              85
           HGU7    2.21550             449
           HGV6    2.18700             114
           HGV7    2.21850             479
           HGX6    2.19100             146
           HGX7    2.22000             511
2016-07-06 HGF7    2.16700             205
           HGG7    2.17100             233
           HGH7    2.17100             266
           HGJ7    2.17550             294
           HGK7    2.17650             324
           HGM7    2.18050             357
           HGN6    2.15150              21
           HGN7    2.18150             386
           HGQ6    2.15150              54
           HGQ7    2.18600             419
           HGU6    2.15350              84
           HGU7    2.18700             448
           HGV6    2.15700             113
           HGV7    2.19000             478
           HGX6    2.16100             145
           HGX7    2.19150             510

[161017 rows x 2 columns]

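I noticed a type difference between the long_df I gave as an example (pandas.core.series.Series) and the df I am actually using (pandas.core.frame.DataFrame).

What is the correct way to do this?

Thanks a lot for any hints,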

【Answer 1】:

You need to use loc, and this requires a recent version of pandas (0.18.1):

print (long_df.loc['2001'])

2001-01-31  bar    1.684425
            baz    1.215258
            foo    0.158968
            zoo    0.689477
2001-02-28  bar   -0.123582
            baz    0.312533
            foo    0.609169
            zoo   -0.093985
2001-03-31  bar    0.372093
            baz   -0.281191
            foo   -0.400354
            zoo    0.646965
2001-04-30  bar   -0.287488
            baz   -0.928941
            foo    1.365416
            zoo    0.267282
2001-05-31  bar   -1.021086
            baz    0.317819
            foo   -0.393135
            zoo   -0.213589
2001-06-30  bar   -2.594173
...
...
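
When the first index level of a DataFrame holds dates, a boolean mask built from that level is one way to pick a single year without relying on how a given pandas version treats partial strings in loc. The sketch below is only an illustration under stated assumptions: the tickers, the random values, and the month-start dates (freq="MS") are made up and stand in for the asker's dfmi.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dfmi: a (Date, Ticker) MultiIndex with made-up tickers.
dates = pd.date_range("2000-01-01", periods=50, freq="MS")
tickers = ["HGF", "HGG", "HGH", "HGK"]
index = pd.MultiIndex.from_product([dates, tickers], names=["Date", "Ticker"])

rng = np.random.default_rng(0)
dfmi = pd.DataFrame(
    {
        "Last": rng.normal(size=len(index)),
        "Days to expiry": rng.integers(1, 700, size=len(index)),
    },
    index=index,
).sort_index()

# Build the mask from the first level only, then keep the matching rows.
year_of_row = dfmi.index.get_level_values("Date").year
df_2001 = dfmi[year_of_row == 2001]

print(df_2001.head())
```

The mask keeps both index levels and both columns, so the result has the same layout as the input.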

Edit:

Another solution is to take the first level with get_level_values and use get_loc to find the integer positions:

import pandas as pd

long_df = pd.read_csv('test/testslice.csv', parse_dates=[0], index_col=[0,1])
dfmi = long_df.stack().sort_index()

print (dfmi.index.get_level_values(0))
DatetimeIndex(['1988-12-06', '1988-12-06', '1988-12-06', '1988-12-06',
               '1988-12-06', '1988-12-06', '1988-12-06', '1988-12-06',
               '1988-12-06', '1988-12-06',
               ...
               '2016-07-06', '2016-07-06', '2016-07-06', '2016-07-06',
               '2016-07-06', '2016-07-06', '2016-07-06', '2016-07-06',
               '2016-07-06', '2016-07-06'],
              dtype='datetime64[ns]', name='Date', length=335402, freq=None)

print (dfmi.index.get_level_values(0).get_loc('2001'))
slice(121844, 133684, None)
print (dfmi.iloc[dfmi.index.get_level_values(0).get_loc('2001')])

Date        Ticker                
2001-01-02  HGF01   Last                0.8180
                    Days to expiry     27.0000
            HGF02   Last                0.8180
                    Days to expiry    392.0000
            HGG01   Last                0.8165
                    Days to expiry     55.0000
            HGG02   Last                0.8180
                    Days to expiry    420.0000
            HGH01   Last                0.8115
                    Days to expiry     85.0000
            HGH02   Last                0.8180
                    Days to expiry    448.0000
            HGJ01   Last                0.8125
                    Days to expiry    114.0000
            HGJ02   Last                0.8170
                    Days to expiry    479.0000
            HGK01   Last                0.8135
                    Days to expiry    147.0000
            HGK02   Last                0.8160
                    Days to expiry    512.0000
            HGM01   Last                0.8145
                    Days to expiry    176.0000
            HGM02   Last                0.8155
                    Days to expiry    540.0000
            HGN01   Last                0.8155
                    Days to expiry    206.0000
            HGN02   Last                0.8140
                    Days to expiry    573.0000
            HGQ01   Last                0.8160
                    Days to expiry    239.0000
                                        ...   
2001-12-31  HGK03   Last                0.6960
                    Days to expiry    513.0000
            HGM02   Last                0.6680
                    Days to expiry    177.0000
            HGM03   Last                0.6980
                    Days to expiry    542.0000
            HGN02   Last                0.6710
                    Days to expiry    210.0000
            HGN03   Last                0.7005
                    Days to expiry    575.0000
            HGQ02   Last                0.6740
                    Days to expiry    240.0000
            HGQ03   Last                0.7030
                    Days to expiry    604.0000
            HGU02   Last                0.6770
                    Days to expiry    269.0000
            HGU03   Last                0.7050
                    Days to expiry    634.0000
            HGV02   Last                0.6795
                    Days to expiry    302.0000
            HGV03   Last                0.7080
                    Days to expiry    667.0000
            HGX02   Last                0.6820
                    Days to expiry    329.0000
            HGX03   Last                0.7110
                    Days to expiry    694.0000
            HGZ02   Last                0.6850
                    Days to expiry    361.0000
            HGZ03   Last                0.7140
                    Days to expiry    728.0000
dtype: float64
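
The same get_level_values / get_loc idea can be tried without the CSV file. This is a minimal sketch on made-up data (daily dates and the tickers "AAA" and "BBB" are assumptions); it relies on the first level being sorted, which sort_index() ensures here.

```python
import numpy as np
import pandas as pd

# Hypothetical data: daily dates on level 0, two made-up tickers on level 1.
dates = pd.date_range("2000-06-01", "2002-06-30", freq="D")
index = pd.MultiIndex.from_product([dates, ["AAA", "BBB"]], names=["Date", "Ticker"])
frame = pd.DataFrame(
    {"Last": np.arange(len(index), dtype=float)}, index=index
).sort_index()

# One date per row (with repeats), then a positional lookup for the year 2001.
first_level = frame.index.get_level_values("Date")
positions = first_level.get_loc("2001")
subset = frame.iloc[positions]

subset_dates = subset.index.get_level_values("Date")
print(positions)
print(subset_dates.min(), subset_dates.max())
```

Because iloc works on positions, it does not matter whether the object is a Series or a DataFrame.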

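Edit 1 (following a comment):

Unfortunately, if you need to select by a range of years, I only have a slow solution using a list comprehension and concat: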

print (list(range(1993, 2003)))
[1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002]

dfs = [dfmi.iloc[dfmi.index.get_level_values(0).get_loc(str(x))] for x in range(1993, 2003)]
print (pd.concat(dfs))

1993-01-01 00:00:00  bar    0.080676
                     baz    0.315925
                     foo   -1.484132
                     zoo   -0.977202
1993-01-01 01:00:00  bar    0.817846
                     baz   -1.280649
                     foo    0.727975
                     zoo   -0.062142
1993-01-01 02:00:00  bar    1.278623
                     baz    0.268865
                     foo   -0.183612
                     zoo    0.194996
1993-01-01 03:00:00  bar   -0.304734
                     baz   -0.227468
                     foo   -0.134305
                     zoo    0.887374
1993-01-01 04:00:00  bar   -0.166669
                     baz   -0.132718
                     foo   -0.624932
                     zoo    1.959724
1993-01-01 05:00:00  bar   -1.379774
                     baz   -0.738452
                     foo    0.398924
                     zoo    0.005612
1993-01-01 06:00:00  bar   -0.864205
                     baz   -0.813321
                     foo    0.931858
                     zoo   -1.005977
1993-01-01 07:00:00  bar    0.667380
                     baz   -1.208457
                              ...   
2002-10-30 08:00:00  foo    0.311835
                     zoo    0.611802
2002-10-30 09:00:00  bar    2.615050
                     baz   -0.291767
                     foo   -0.508202
                     zoo    0.443429
2002-10-30 10:00:00  bar   -1.724252
                     baz   -0.126579
                     foo    1.108530
                     zoo   -0.553025
2002-10-30 11:00:00  bar    1.208705
                     baz   -1.561024
                     foo    0.722768
                     zoo    1.893419
2002-10-30 12:00:00  bar    0.239383
                     baz   -0.543053
                     foo   -0.687370
                     zoo    0.848929
2002-10-30 13:00:00  bar    0.897465
                     baz    0.631292
                     foo    0.068200
                     zoo   -1.579010
2002-10-30 14:00:00  bar   -0.996531
                     baz   -1.208318
                     foo    0.174970
                     zoo   -0.780913
2002-10-30 15:00:00  bar    0.237465
                     baz    0.380585
                     foo   -1.646285
                     zoo   -0.730744
dtype: float64
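
A single boolean mask over the years may avoid the per-year loop. This is not from the answer above; it is a sketch on made-up monthly data, with the per-year get_loc and concat approach rebuilt next to it only to compare the two results.

```python
import numpy as np
import pandas as pd

# Hypothetical series: month-start dates 1990-2005 on level 0, four labels on level 1.
dates = pd.date_range("1990-01-01", "2005-12-01", freq="MS")
index = pd.MultiIndex.from_product([dates, ["bar", "baz", "foo", "zoo"]])
series = pd.Series(np.arange(len(index), dtype=float), index=index).sort_index()

# One mask covering the same years as range(1993, 2003).
level0 = series.index.get_level_values(0)
in_range = (level0.year >= 1993) & (level0.year <= 2002)
selected = series[in_range]

# Per-year lookup plus concat, as in the answer, for comparison.
pieces = [series.iloc[level0.get_loc(str(year))] for year in range(1993, 2003)]
combined = pd.concat(pieces)

print(len(selected), len(combined))
print(np.array_equal(selected.to_numpy(), combined.to_numpy()))
```

Whether the mask is actually faster on a given dataset is something to measure rather than assume.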

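【Comments】:

Hmm. It works fine on this sample DataFrame. But when I try it on the one I need to slice, I get the following error: KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (2), lexsort depth (1)'. My version is correct (pd.__version__ = '0.18.1').

Yes, you need to sort the index first. Try long_df = long_df.stack().sort_index(), see the docs.

After sorting the index I no longer get the error message, but it does not honor my selection. I noticed a type difference between the long_df I gave as an example (pandas.core.series.Series) and the df I am using (pandas.core.frame.DataFrame).

Could that explain why the slice is not honored? I will edit my question again and give more explanation.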
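
The sorting step from the comments can be checked on a tiny example. The frame below is made up, with rows deliberately out of order; is_monotonic_increasing is used here as one way to inspect whether the MultiIndex is sorted, which is an assumption about what is convenient rather than the check pandas performs internally.

```python
import pandas as pd

# Hypothetical frame whose rows are deliberately out of order.
index = pd.MultiIndex.from_tuples(
    [
        (pd.Timestamp("2001-03-01"), "HGK"),
        (pd.Timestamp("2000-01-01"), "HGF"),
        (pd.Timestamp("2001-03-01"), "HGF"),
        (pd.Timestamp("2000-01-01"), "HGK"),
    ],
    names=["Date", "Ticker"],
)
unsorted = pd.DataFrame({"Last": [1.0, 2.0, 3.0, 4.0]}, index=index)

print(unsorted.index.is_monotonic_increasing)   # sortedness before sort_index()
sorted_df = unsorted.sort_index()               # sorts by Date, then by Ticker
print(sorted_df.index.is_monotonic_increasing)  # sortedness after sort_index()
print(sorted_df)
```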
