将 Pandas DataFrames 保存为 HDF5 存储,各种错误

Posted

技术标签:

【中文标题】将 Pandas DataFrames 保存为 HDF5 存储,各种错误【英文标题】:Saving Pandas DataFrames as a HDF5 store, various errors 【发布时间】:2017-03-28 13:12:27 【问题描述】:

只想在 HDF5 存储(.h5 文件)中存档一些 Pandas 数据帧。下面是我正在使用的代码。

# Fake data over N runs
Data_N = []
for n in range(5):
    Data_N.append(np.random.randn(5000,15,125))

# Create HDFStore object
store = pd.HDFStore('test.h5')

# For each run:
for n in range(len(Data_N)):
    Data = Data_N[n]

    # Pandas DataFrame for "flattened" fake data
    Data_subDFs = []
    nanbuff = np.nan*np.zeros((1,len(Data[0,0])))

    for i in range(len(Data)):
        Data_i = np.vstack((nanbuff,Data[i,:,:]))
        Data_subDFs.append(pd.DataFrame(data = Data_i))

    Data_DF = pd.concat(Data_subDFs)

    # Row and column labels for the DataFrame
    Data_rows = []
    for i in range(len(Data)):
        Data_rows.append(['Layer %d:' % (i+1)] + range(1,len(Data[0])+1))

    Data_DF.index = sum(Data_rows,[])
    Data_DF.columns = range(1,len(Data[0,0])+1)

    # Put Pandas DataFrame into store
    store.put('Data_DF_%d' % (n+1), Data_DF)
    #store.put('Data_DF_%d' % (n+1), Data_DF, format='table')
    #store.put('Data_DF_%d' % (n+1), Data_DF, format='table', data_columns=True)

# Save the HDF5 file
store.close()

这给出了以下输出:

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed-integer,key->axis1] [items->None]

如果我使用 put 的第二个版本,它会给出:

TypeError: Passing an incorrect value to a table column. Expected a Col (or subc
lass) instance and got: "ObjectAtom()". Please make use of the Col(), or descend
ant, constructor to properly initialize columns.

如果我使用 put 的第三个版本,它会给出:

ValueError: cannot have non-object label DataIndexableCol

谁能解释一下不同的版本,为什么我不能在没有酸洗的情况下将我认为是有效的 Pandas DataFrame 保存在 HDF5 中?

如果有帮助,我认为我不需要能够附加 DataFrame/store。我只想要使用 Pandas HDF5 界面保存 DF 的最佳性能方式。

谢谢!


编辑 1:

我将“每次运行:”之后的代码更新为此

# For each run:
for run in range(len(Data_N)):
    Data = Data_N[run]
    l = len(Data)
    m = len(Data[0])
    n = len(Data[0,0])

    # Pandas DataFrame for "flattened" fake data
    Data_subDFs = []

    for i in range(len(Data)):
        Data_i = Data[i,:,:]
        Data_subDFs.append(pd.DataFrame(data = Data_i))

    Data_DF = pd.concat(Data_subDFs)

    # Row and column labels for the DataFrame
    L1 = np.zeros((l*m,1), dtype=object) # Layer number
    L2 = np.zeros((l*m,1), dtype=object) # Row number

    for i in range(l):
        for j in range(m):
            L1[i*m + j,0] = 'Layer %d' % (i+1)
            L2[i*m + j,0] = '%d' % (j+1)

    Data_DF.index = np.hstack((L1,L2))
    Data_DF.columns = range(1,n+1)

    # Put Pandas DataFrame into store
    store.put('Data_DF_%d' % (run+1), Data_DF)
    #store.put('Data_DF_%d' % (run+1), Data_DF, format='table')
    #store.put('Data_DF_%d' % (run+1), Data_DF, format='table', data_columns=True)

但是对于每个 put 行,这会给出相同的警告或错误。


编辑 2(这工作!):

# For each run:
for run in range(len(Data_N)):
    Data = Data_N[run]
    l = len(Data)
    m = len(Data[0])
    n = len(Data[0,0])

    # Pandas DataFrame for "flattened" fake data
    Data_DF = pd.DataFrame(Data.reshape(l*m,n))

    # Layer and row labels
    layers = np.arange(1,l+1)
    rows = np.arange(1,m+1)

    # Pandas multi-index
    mindex = pd.MultiIndex.from_product([layers,rows], names=['Layer','Row'])

    # DataFrame multi-index and column labels
    Data_DF.index = mindex
    Data_DF.columns = range(1,n+1)

    # Put Pandas DataFrame into store
    store.put('Data_DF_%d' % (run+1), Data_DF)
    #store.put('Data_DF_%d' % (run+1), Data_DF, format='table')
    #store.put('Data_DF_%d' % (run+1), Data_DF, format='table', data_columns=True)

第三行仍然给出同样的错误,但由于第二行有效,我假设第三行在这种情况下只是一个无效的命令。

第二行也比第一行快很多,而且都比酸洗路线快得多。谢谢!

【问题讨论】:

【参考方案1】:

更新:

这是一个小演示:

设置:

data = np.random.randn(5,10,5)
index = pd.MultiIndex.from_product([np.arange(1, len(data)+1),
                                  np.arange(1,len(data[0])+1)], names=['Layer','No'])
df = pd.DataFrame(data.reshape(data.shape[0] * data.shape[1], data.shape[2]),
                  index=index)

数据:

In [82]: df
Out[82]:
                 0         1         2         3         4
Layer No
1     1   1.167144  0.640303  0.059197 -1.637180  0.667196
      2   2.150872 -0.825325 -0.332458 -1.307043  1.361330
      3  -0.931299 -0.931882  0.153943 -0.446289  0.651594
      4  -0.131500 -0.489745  1.264029  0.889779  1.081613
      5  -0.479022 -1.516204  0.616170  0.126860  0.125559
      6   1.114287 -0.939504  0.058869  0.321159  0.340881
      7  -0.527516 -0.362337 -0.590430 -0.609017  1.835716
      8   0.063372  0.000792  0.855485 -0.113592  0.890687
      9  -0.160041  1.978954  0.778428  1.988354  2.095665
      10  0.687911  0.115918 -0.653885  0.486365 -0.775659
2     1  -0.123350  0.674359 -0.120634 -1.350044 -0.176252
      2  -1.986077 -0.846584  0.895982  0.236790  0.240023
      3   0.878597  0.241594  0.405382  1.785109  1.228188
      4  -1.510238 -0.303274  0.247082  1.841996 -0.864595
      5  -1.424249 -0.183216 -0.044330  0.324894 -0.271179
      6  -0.345720 -0.942421  0.538227 -0.558793 -1.075346
      7   1.327952 -2.335520 -0.164645  1.489798 -0.876896
      8   1.043723  0.770489 -1.052739 -0.830190  1.005406
      9   0.789100 -0.706633 -1.014431 -1.164513 -0.266424
      10  2.061175  0.933526 -1.601836 -1.542535 -1.220943
3     1  -0.061520 -0.932599  0.103480 -0.318529 -0.311965
      2  -0.401409 -0.308739 -1.399233 -1.172032 -0.550774
      3   0.670272  1.215724  0.711328  2.332297 -1.326704
      4   0.377469  0.752313 -1.223832  0.431555 -0.901796
      5  -2.386383  0.053921 -1.175427 -0.794099 -0.469374
      6   0.951571 -2.220609  0.208136 -2.141828  0.010316
      7   1.047133  0.924568  0.282091  1.367981 -0.617389
      8   1.083008 -1.519416  0.535690  0.196885 -0.022692
      9   1.307252  1.099716  0.766976 -0.466699  1.113605
      10 -0.614214  0.702395 -0.131248  1.773092  0.241553
4     1  -1.280026  0.278248 -0.518560 -0.395394  0.434473
      2   1.498882 -1.359542  0.012312 -0.231728 -2.643232
      3  -0.539773 -0.755483 -1.002526  0.198792 -0.120656
      4   0.056788  1.289477 -0.440122 -1.454418 -0.043193
      5  -0.777678  1.734322 -1.270129  0.160094  0.355290
      6  -1.037775 -0.542944 -0.913428  0.885965 -0.155220
      7  -0.855498 -0.330268 -1.903738  0.098101 -0.670830
      8   0.786258  0.599100 -0.426781  0.425572  0.132932
      9  -0.430497 -1.414292 -0.997637  0.696176 -0.480886
      10  1.211665 -1.233842  0.137176  1.520013 -1.052884
5     1  -0.267698 -1.013917 -1.324896 -1.189835 -0.192396
      2   1.047264 -0.454829  1.051039  1.565423  0.749844
      3   0.159177  0.481088  0.711499 -1.217079  0.444402
      4   0.254420 -0.114102  0.620231  1.890822  1.269808
      5   0.673696 -0.321638 -0.887355  0.426549 -0.935591
      6  -1.836808  0.450332  1.187512 -0.215318 -1.142346
      7  -1.496568  0.633886  0.625143  0.295385  1.445084
      8  -0.473427 -0.608318 -0.602080  0.134105  0.704027
      9   2.319899  0.763272  0.861798  1.464612 -0.708869
      10 -0.199555  0.721122  0.099777 -0.466488  0.923112

In [84]: df.index.levels
Out[84]: FrozenList([[1, 2, 3, 4, 5], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])

现在你可以像下面这样切片:

In [85]: idx = pd.IndexSlice

In [86]: df.loc[idx[[2,4], 2:5], :]
Out[86]:
                 0         1         2         3         4
Layer No
2     2  -1.986077 -0.846584  0.895982  0.236790  0.240023
      3   0.878597  0.241594  0.405382  1.785109  1.228188
      4  -1.510238 -0.303274  0.247082  1.841996 -0.864595
      5  -1.424249 -0.183216 -0.044330  0.324894 -0.271179
4     2   1.498882 -1.359542  0.012312 -0.231728 -2.643232
      3  -0.539773 -0.755483 -1.002526  0.198792 -0.120656
      4   0.056788  1.289477 -0.440122 -1.454418 -0.043193
      5  -0.777678  1.734322 -1.270129  0.160094  0.355290

保存到 HDF 存储并从中选择:

In [88]: store = pd.HDFStore('d:/temp/test.h5')

In [89]: store.append('test', df, complib='blosc', complevel=5)

In [90]: store.close()

In [91]: store = pd.HDFStore('d:/temp/test.h5')

In [92]: store.select('test', where="Layer in [2,4] and No in [2,4,6]")
Out[92]:
                 0         1         2         3         4
Layer No
2     2  -1.986077 -0.846584  0.895982  0.236790  0.240023
      4  -1.510238 -0.303274  0.247082  1.841996 -0.864595
      6  -0.345720 -0.942421  0.538227 -0.558793 -1.075346
4     2   1.498882 -1.359542  0.012312 -0.231728 -2.643232
      4   0.056788  1.289477 -0.440122 -1.454418 -0.043193
      6  -1.037775 -0.542944 -0.913428  0.885965 -0.155220

改为MultiIndex documentation(有两个级别:LayerNo)。

【讨论】:

您好,感谢您的建议。我编辑了我的帖子,尝试使用多索引。但它没有用。有什么进一步的建议吗?我执行不正确吗?谢谢 谢谢!!那行得通(再次更新了我的原始帖子)。我投票给你的答案是正确的,并试图投票给它,但我没有足够的信用。再次感谢!

以上是关于将 Pandas DataFrames 保存为 HDF5 存储,各种错误的主要内容,如果未能解决你的问题,请参考以下文章

使用 matplotlib 将 Pandas DataFrames 绘制到饼图中

使用多处理时结合 Pandas DataFrames

PySpark DataFrames - 在不转换为 Pandas 的情况下进行枚举的方法?

PySpark DataFrames - 在不转换为 Pandas 的情况下进行枚举的方法?

根据两个 pandas DataFrames 之间的条件为新列分配值

使用 List Comprehension (Pandas) 从 DataFrames 列表中删除 DataFrames 列