Query hdf5 datetime column

【Title】Query hdf5 datetime column 【Posted】2018-08-16 21:08:00 【Question】:

I have an hdf5 file containing a table whose time column has dtype datetime64[ns].

I want to get all rows that are older than thresh. How can I do that? Here is what I tried:

thresh = pd.datetime.strptime('2018-03-08 14:19:41','%Y-%m-%d %H:%M:%S').timestamp()
hdf = pd.read_hdf(STORE, 'gh1', where = 'time>thresh' )

I get the following error:

Traceback (most recent call last):

  File "<ipython-input-80-fa444735d0a9>", line 1, in <module>
    runfile('/home/joao/github/control_panel/controlpanel/controlpanel/reading_test.py', wdir='/home/joao/github/control_panel/controlpanel/controlpanel')

  File "/home/joao/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)

  File "/home/joao/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "/home/joao/github/control_panel/controlpanel/controlpanel/reading_test.py", line 15, in <module>
    hdf = pd.read_hdf(STORE, 'gh1', where = 'time>thresh' )

  File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 370, in read_hdf
    return store.select(key, auto_close=auto_close, **kwargs)

  File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 717, in select
    return it.get_result()

  File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 1457, in get_result
    results = self.func(self.start, self.stop, where)

  File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 710, in func
    columns=columns, **kwargs)

  File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 4141, in read
    if not self.read_axes(where=where, **kwargs):

  File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 3340, in read_axes
    self.selection = Selection(self, where=where, **kwargs)

  File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 4706, in __init__
    self.condition, self.filter = self.terms.evaluate()

  File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/core/computation/pytables.py", line 556, in evaluate
    self.condition = self.terms.prune(ConditionBinOp)

  File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/core/computation/pytables.py", line 118, in prune
    res = pr(left.value, right.value)

  File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/core/computation/pytables.py", line 113, in pr
    encoding=self.encoding).evaluate()

  File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/core/computation/pytables.py", line 327, in evaluate
    values = [self.convert_value(v) for v in rhs]

  File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/core/computation/pytables.py", line 327, in <listcomp>
    values = [self.convert_value(v) for v in rhs]

  File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/core/computation/pytables.py", line 185, in convert_value
    v = pd.Timestamp(v)

  File "pandas/_libs/tslib.pyx", line 390, in pandas._libs.tslib.Timestamp.__new__

  File "pandas/_libs/tslib.pyx", line 1549, in pandas._libs.tslib.convert_to_tsobject

  File "pandas/_libs/tslib.pyx", line 1735, in pandas._libs.tslib.convert_str_to_tsobject

ValueError: could not convert string to Timestamp

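【Comments】:

It looks like the time column in your HDF5 file is of string dtype...

@MaxU I double-checked, and my time column is datetime64[ns]. Maybe I should change it to float and store timestamps instead.

【Answer 1】: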

Demo:

Create a sample DF (100,000 rows):

In [9]: N = 10**5

In [10]: dates = pd.date_range('1980-01-01', freq='99T', periods=N)

In [11]: df = pd.DataFrame({'date': dates, 'val': np.random.rand(N)})

In [12]: df
Out[12]:
                     date       val
0     1980-01-01 00:00:00  0.985215
1     1980-01-01 01:39:00  0.452295
2     1980-01-01 03:18:00  0.780096
3     1980-01-01 04:57:00  0.004596
4     1980-01-01 06:36:00  0.515051
...                   ...       ...
99995 1998-10-27 15:45:00  0.509954
99996 1998-10-27 17:24:00  0.046636
99997 1998-10-27 19:03:00  0.026678
99998 1998-10-27 20:42:00  0.660652
99999 1998-10-27 22:21:00  0.839426

[100000 rows x 2 columns]

Write it to an HDF5 file in table format, declaring date as a data column so that it can be queried:

In [13]: df.to_hdf('d:/temp/test.h5', 'test', format='t', data_columns=['date'])

Read from the HDF5 file conditionally, filtering on the indexed date column:

In [14]: x = pd.read_hdf('d:/temp/test.h5', 'test', where="date > '1998-10-27 15:00:00'")

In [15]: x
Out[15]:
                     date       val
99995 1998-10-27 15:45:00  0.509954
99996 1998-10-27 17:24:00  0.046636
99997 1998-10-27 19:03:00  0.026678
99998 1998-10-27 20:42:00  0.660652
99999 1998-10-27 22:21:00  0.839426
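
Applied to the question's setup, the same idea can be sketched with a variable instead of a literal string. In the question, thresh is the float returned by .timestamp(), and the traceback ends in convert_str_to_tsobject, which suggests pandas turned that number into a string it could not parse as a date. Passing a pd.Timestamp (or a date string, as in the demo) avoids that. The sample data and the temporary file path below are made up for illustration, only the key gh1 and the column name time come from the question, and the code assumes PyTables (the tables package) is installed:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's table: one row per minute.
N = 1000
df = pd.DataFrame({
    "time": pd.date_range("2018-03-08", freq="min", periods=N),
    "val": np.random.rand(N),
})

# Temporary file used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "store.h5")

# The table format plus data_columns makes the time column queryable.
df.to_hdf(path, key="gh1", format="table", data_columns=["time"])

# Use a Timestamp, not the float from datetime.timestamp().
thresh = pd.Timestamp("2018-03-08 14:19:41")

# The name thresh inside the where string is looked up in the calling scope.
newer = pd.read_hdf(path, key="gh1", where="time > thresh")
older = pd.read_hdf(path, key="gh1", where="time < thresh")
```

Note that the question asks for rows older than thresh, which corresponds to time < thresh; the original attempt used time > thresh, which selects the newer rows.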
