Select/slice a multi-index dataframe time-series using a period leads to a bug?
Posted: 2016-12-26 16:44:37
Question: I have a multi-index whose first level is a time series, exactly like the index below:
In[168]: rng = pd.date_range('01-01-2000',periods=50,freq='M')
In[169]: long_df = pd.DataFrame(np.random.randn(50,4),index = rng, columns=['bar','baz','foo','zoo'])
In[170]: long_df = long_df.stack()
In[171]: long_df[:10]
Out[171]:
2000-01-31 bar 2.079474
baz -0.569920
foo 1.149012
zoo -0.228926
2000-02-29 bar 0.429502
baz -0.117166
foo 0.956546
zoo -1.483818
2000-03-31 bar -1.137998
baz 1.049849
Edit
I can slice it using a period (a partial date string such as '2001'), and it works fine:
In[172]: long_df = long_df.sort_index()
In[173]: long_df.loc['2001']
Out[173]:
2001-01-31 bar -0.193987
baz 0.769297
foo 0.286880
zoo -1.431313
2001-02-28 bar -0.840502
baz 1.786758
foo 0.878356
zoo 0.433383
2001-03-31 bar 0.897548
baz 1.901540
foo 0.110606
zoo 0.571267
2001-04-30 bar -0.375377
baz 1.423742
foo -0.415006
zoo -0.141000
(...)
However, with the multi-index DataFrame I am actually working with, the slice is ignored:
In[204]: dfmi
Out[204]:
Last Days to expiry
Date Ticker
1988-12-06 HGF89 1.46894 52
HGF90 1.17100 419
HGG89 1.42100 80
HGH89 1.37344 113
HGH90 1.17450 477
HGK89 1.28750 171
HGK90 1.15900 539
HGN89 1.24550 233
HGN90 1.15900 598
HGU89 1.21750 295
HGU90 1.15900 659
HGZ89 1.18500 386
1988-12-07 HGF89 1.51900 51
HGF90 1.18900 418
HGG89 1.46394 79
HGH89 1.41300 112
HGH90 1.19250 476
HGK89 1.31750 170
HGK90 1.17700 538
HGN89 1.27550 232
HGN90 1.17700 597
HGU89 1.24250 294
HGU90 1.17700 658
HGZ89 1.20300 385
1988-12-08 HGF89 1.58100 50
HGF90 1.18900 417
HGG89 1.50894 78
HGH89 1.43994 111
HGH90 1.19250 475
HGK89 1.32750 169
... ...
2016-07-05 HGK7 2.20500 325
HGM7 2.20900 358
HGN6 2.18150 22
HGN7 2.21000 387
HGQ6 2.18150 55
HGQ7 2.21450 420
HGU6 2.18350 85
HGU7 2.21550 449
HGV6 2.18700 114
HGV7 2.21850 479
HGX6 2.19100 146
HGX7 2.22000 511
HGZ6 2.19250 176
2016-07-06 HGF7 2.16700 205
HGG7 2.17100 233
HGH7 2.17100 266
HGJ7 2.17550 294
HGK7 2.17650 324
HGM7 2.18050 357
HGN6 2.15150 21
HGN7 2.18150 386
HGQ6 2.15150 54
HGQ7 2.18600 419
HGU6 2.15350 84
HGU7 2.18700 448
HGV6 2.15700 113
HGV7 2.19000 478
HGX6 2.16100 145
HGX7 2.19150 510
HGZ6 2.16300 175
[167701 rows x 2 columns]
In[204]: dfmi = dfmi.sort_index()
In[205]: dfmi.loc['2001']
Out[206]:
Last Days to expiry
Date Ticker
1988-12-06 HGF89 1.46894 52
HGF90 1.17100 419
HGG89 1.42100 80
HGH89 1.37344 113
HGH90 1.17450 477
HGK89 1.28750 171
HGK90 1.15900 539
HGN89 1.24550 233
HGN90 1.15900 598
HGU89 1.21750 295
HGU90 1.15900 659
1988-12-07 HGF89 1.51900 51
HGF90 1.18900 418
HGG89 1.46394 79
HGH89 1.41300 112
HGH90 1.19250 476
HGK89 1.31750 170
HGK90 1.17700 538
HGN89 1.27550 232
HGN90 1.17700 597
HGU89 1.24250 294
HGU90 1.17700 658
1988-12-08 HGF89 1.58100 50
HGF90 1.18900 417
HGG89 1.50894 78
HGH89 1.43994 111
HGH90 1.19250 475
HGK89 1.32750 169
HGK90 1.17700 537
HGN89 1.27750 231
... ...
2016-07-05 HGH7 2.19950 267
HGJ7 2.20400 295
HGK7 2.20500 325
HGM7 2.20900 358
HGN6 2.18150 22
HGN7 2.21000 387
HGQ6 2.18150 55
HGQ7 2.21450 420
HGU6 2.18350 85
HGU7 2.21550 449
HGV6 2.18700 114
HGV7 2.21850 479
HGX6 2.19100 146
HGX7 2.22000 511
2016-07-06 HGF7 2.16700 205
HGG7 2.17100 233
HGH7 2.17100 266
HGJ7 2.17550 294
HGK7 2.17650 324
HGM7 2.18050 357
HGN6 2.15150 21
HGN7 2.18150 386
HGQ6 2.15150 54
HGQ7 2.18600 419
HGU6 2.15350 84
HGU7 2.18700 448
HGV6 2.15700 113
HGV7 2.19000 478
HGX6 2.16100 145
HGX7 2.19150 510
[161017 rows x 2 columns]
I noticed a type difference between the long_df I gave as an example (pandas.core.series.Series) and the df I am using (pandas.core.frame.DataFrame).
What is the right way to do this?
Thanks a lot for your tips,
Answer 1: You need to add loc, but it requires the latest version of pandas, 0.18.1:
print (long_df.loc['2001'])
2001-01-31 bar 1.684425
baz 1.215258
foo 0.158968
zoo 0.689477
2001-02-28 bar -0.123582
baz 0.312533
foo 0.609169
zoo -0.093985
2001-03-31 bar 0.372093
baz -0.281191
foo -0.400354
zoo 0.646965
2001-04-30 bar -0.287488
baz -0.928941
foo 1.365416
zoo 0.267282
2001-05-31 bar -1.021086
baz 0.317819
foo -0.393135
zoo -0.213589
2001-06-30 bar -2.594173
...
...
Edit:
Another solution is to take get_level_values of the first level and use get_loc to find the integer positions:
import pandas as pd
long_df = pd.read_csv('test/testslice.csv', parse_dates=[0], index_col=[0,1])
dfmi = long_df.stack().sort_index()
print (dfmi.index.get_level_values(0))
DatetimeIndex(['1988-12-06', '1988-12-06', '1988-12-06', '1988-12-06',
'1988-12-06', '1988-12-06', '1988-12-06', '1988-12-06',
'1988-12-06', '1988-12-06',
...
'2016-07-06', '2016-07-06', '2016-07-06', '2016-07-06',
'2016-07-06', '2016-07-06', '2016-07-06', '2016-07-06',
'2016-07-06', '2016-07-06'],
dtype='datetime64[ns]', name='Date', length=335402, freq=None)
print (dfmi.index.get_level_values(0).get_loc('2001'))
slice(121844, 133684, None)
print (dfmi.iloc[dfmi.index.get_level_values(0).get_loc('2001')])
Date Ticker
2001-01-02 HGF01 Last 0.8180
Days to expiry 27.0000
HGF02 Last 0.8180
Days to expiry 392.0000
HGG01 Last 0.8165
Days to expiry 55.0000
HGG02 Last 0.8180
Days to expiry 420.0000
HGH01 Last 0.8115
Days to expiry 85.0000
HGH02 Last 0.8180
Days to expiry 448.0000
HGJ01 Last 0.8125
Days to expiry 114.0000
HGJ02 Last 0.8170
Days to expiry 479.0000
HGK01 Last 0.8135
Days to expiry 147.0000
HGK02 Last 0.8160
Days to expiry 512.0000
HGM01 Last 0.8145
Days to expiry 176.0000
HGM02 Last 0.8155
Days to expiry 540.0000
HGN01 Last 0.8155
Days to expiry 206.0000
HGN02 Last 0.8140
Days to expiry 573.0000
HGQ01 Last 0.8160
Days to expiry 239.0000
...
2001-12-31 HGK03 Last 0.6960
Days to expiry 513.0000
HGM02 Last 0.6680
Days to expiry 177.0000
HGM03 Last 0.6980
Days to expiry 542.0000
HGN02 Last 0.6710
Days to expiry 210.0000
HGN03 Last 0.7005
Days to expiry 575.0000
HGQ02 Last 0.6740
Days to expiry 240.0000
HGQ03 Last 0.7030
Days to expiry 604.0000
HGU02 Last 0.6770
Days to expiry 269.0000
HGU03 Last 0.7050
Days to expiry 634.0000
HGV02 Last 0.6795
Days to expiry 302.0000
HGV03 Last 0.7080
Days to expiry 667.0000
HGX02 Last 0.6820
Days to expiry 329.0000
HGX03 Last 0.7110
Days to expiry 694.0000
HGZ02 Last 0.6850
Days to expiry 361.0000
HGZ03 Last 0.7140
Days to expiry 728.0000
dtype: float64
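The get_level_values / get_loc steps above can be tried on a small synthetic frame. This is only a minimal sketch, not the asker's data: the tickers, dates, and values are made up, the names `demo`, `level0`, and `positions` are illustrative, and it assumes the index has been sorted so that the first level is monotonic.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's data: daily dates x two made-up tickers.
dates = pd.date_range("2000-12-30", "2002-01-02", freq="D")
index = pd.MultiIndex.from_product(
    [dates, ["AAA", "BBB"]], names=["Date", "Ticker"]
)
demo = pd.DataFrame(
    {"Last": np.arange(len(index), dtype=float), "Days to expiry": 1},
    index=index,
).sort_index()

# First level as a plain DatetimeIndex (one entry per row, so values repeat).
level0 = demo.index.get_level_values(0)

# Partial-string lookup returns the integer positions of every 2001 row.
positions = level0.get_loc("2001")

# iloc works the same way on a Series and on a DataFrame.
year_2001 = demo.iloc[positions]
selected_dates = year_2001.index.get_level_values(0)
print(selected_dates.min(), selected_dates.max())
```

Because the selection goes through integer positions, it behaves the same whether the object is a Series or a DataFrame, which is the type difference raised in the question.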
Edit 1, following a comment:
Unfortunately, if you need to select by a range, I only have a slow solution using a list comprehension and concat:
print (list(range(1993, 2003)))
[1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002]
dfs = [dfmi.iloc[dfmi.index.get_level_values(0).get_loc(str(x))] for x in range(1993, 2003)]
print (pd.concat(dfs))
1993-01-01 00:00:00 bar 0.080676
baz 0.315925
foo -1.484132
zoo -0.977202
1993-01-01 01:00:00 bar 0.817846
baz -1.280649
foo 0.727975
zoo -0.062142
1993-01-01 02:00:00 bar 1.278623
baz 0.268865
foo -0.183612
zoo 0.194996
1993-01-01 03:00:00 bar -0.304734
baz -0.227468
foo -0.134305
zoo 0.887374
1993-01-01 04:00:00 bar -0.166669
baz -0.132718
foo -0.624932
zoo 1.959724
1993-01-01 05:00:00 bar -1.379774
baz -0.738452
foo 0.398924
zoo 0.005612
1993-01-01 06:00:00 bar -0.864205
baz -0.813321
foo 0.931858
zoo -1.005977
1993-01-01 07:00:00 bar 0.667380
baz -1.208457
...
2002-10-30 08:00:00 foo 0.311835
zoo 0.611802
2002-10-30 09:00:00 bar 2.615050
baz -0.291767
foo -0.508202
zoo 0.443429
2002-10-30 10:00:00 bar -1.724252
baz -0.126579
foo 1.108530
zoo -0.553025
2002-10-30 11:00:00 bar 1.208705
baz -1.561024
foo 0.722768
zoo 1.893419
2002-10-30 12:00:00 bar 0.239383
baz -0.543053
foo -0.687370
zoo 0.848929
2002-10-30 13:00:00 bar 0.897465
baz 0.631292
foo 0.068200
zoo -1.579010
2002-10-30 14:00:00 bar -0.996531
baz -1.208318
foo 0.174970
zoo -0.780913
2002-10-30 15:00:00 bar 0.237465
baz 0.380585
foo -1.646285
zoo -0.730744
dtype: float64
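As a possible alternative to the list comprehension and concat, a boolean mask over the first level can select a whole range of years in one pass. This swaps in a different technique from the one in the answer and is only a sketch on made-up data; the names `demo`, `years`, and `mask` are illustrative, and it does not rely on the index being sorted.

```python
import numpy as np
import pandas as pd

# Made-up monthly data with a (date, column) MultiIndex, like the stacked example.
dates = pd.date_range("1990-01-01", "2004-12-01", freq="MS")
index = pd.MultiIndex.from_product([dates, ["bar", "baz", "foo", "zoo"]])
demo = pd.Series(np.random.randn(len(index)), index=index)

# Year of each row, taken from the first index level.
years = demo.index.get_level_values(0).year

# Keep 1993 through 2002 inclusive, matching range(1993, 2003) in the answer.
mask = (years >= 1993) & (years <= 2002)
selected = demo[mask]
print(selected.index.get_level_values(0).year.unique().tolist())
```

The same mask can be applied to a DataFrame with `demo[mask]` or `demo.loc[mask]`.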
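Comments:
Hmm. It works fine on this sample dataframe. But when I try it on the one I need to slice, I get the following error: KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (2), lexsort depth (1)'. My version is correct (pd.__version__ = '0.18.1').
Yes, you need to sort the index first. Try long_df = long_df.stack().sort_index(); see the docs.
After sorting the index I no longer get the error, but my selection is still ignored. I noticed a type difference between the long_df I gave as an example (pandas.core.series.Series) and the df I am using (pandas.core.frame.DataFrame). Could this explain why the slice is ignored? I will edit my question again and give more explanation.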
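The KeyError quoted in the comments comes from label-slicing a MultiIndex that is not sorted. Below is a minimal sketch of the check and the fix on made-up rows with hypothetical tickers; using `is_monotonic_increasing` as the sortedness check is an assumption of this sketch, not something stated in the thread.

```python
import pandas as pd

# Made-up rows, deliberately out of order.
index = pd.MultiIndex.from_tuples(
    [
        (pd.Timestamp("2001-03-01"), "BBB"),
        (pd.Timestamp("2000-12-29"), "AAA"),
        (pd.Timestamp("2001-01-02"), "AAA"),
        (pd.Timestamp("2002-01-02"), "BBB"),
    ],
    names=["Date", "Ticker"],
)
demo = pd.DataFrame({"Last": [1.0, 2.0, 3.0, 4.0]}, index=index)

sorted_before = demo.index.is_monotonic_increasing  # unsorted input
demo = demo.sort_index()
sorted_after = demo.index.is_monotonic_increasing   # sorted on both levels
print(sorted_before, sorted_after)

# With a sorted index, an explicit date range on the first level is a label slice.
in_2001 = demo.loc[pd.Timestamp("2001-01-01"):pd.Timestamp("2001-12-31")]
print(in_2001)
```

Spelling out both bounds as Timestamps avoids any ambiguity about how a bare string such as '2001' is interpreted against a MultiIndex.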