Why do my hdf5 files seem so unnecessarily large?

Posted 2023-03-11

【Published】: 2021-03-15 01:42:17

【Question】:

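I am working with a huge dataset (hundreds of GB) that has about 40 million identifiers stored as 32-character strings, each with hundreds or thousands of rows of numeric data.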

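To save space and make reading the data from disk more efficient, it seems best not to repeat the identifiers over and over in the dataset. For example, a data table that looks like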

verylongstringidentifier1, 1.2
verylongstringidentifier1, 2.3
verylongstringidentifier1, 3.4
.
.
verylongstringidentifier2, 2.1
verylongstringidentifier2, 1.0
.
.

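could be stored more efficiently if the string identifiers were not repeated. One option is to save a separate file for each identifier, and I may go that route, but having millions of separate small files is somewhat annoying and probably inefficient from a disk I/O standpoint.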

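I am completely new to hdf5, but what I have read suggests it should work well for this situation, since the datasets can be stored with the identifiers as keys. However, when I save to an hdf5 file, the resulting file is roughly 40 times larger than what I would get by simply writing to a flat csv file. Am I missing something about how hdf5 files are stored, or am I just doing something wrong? The test code below is what I used to verify (and try to diagnose) the problem.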

# trying to figure out why hdf5 file sizes are so huge
import time
import string 
import random
import numpy as np 
import pandas as pd
from pandas import HDFStore

# generate 1000 random 32-character strings
strings = [''.join(random.choices(string.ascii_lowercase, k=32)) for _ in range(1000)] 

# for each of these random strings, create 200 rows of three random floats
# concatenate into one big dataframe
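frames = []
for s in strings:
  values = np.random.rand(200,3)
  ss = np.full((200,1),s)
  s_data = np.concatenate((ss, values), axis=1)
  frames.append(pd.DataFrame(s_data))
# concatenate once, after the loop, rather than growing the dataframe row block by row block
df = pd.concat(frames, axis=0)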

df.columns = ['string', 'v1', 'v2', 'v3']

# write to one big csv file
df.to_csv('/tmp/test.csv', index=False)

# write to compressed bzip2 file
df.to_csv('/tmp/test.csv.bz2', index=False, compression='bz2')

# write to separate csv files for each string
unique_strings = df.string.unique()
for s in unique_strings:
  s_chunk = df[df.string == s]
  fname = '/tmp/test_' + s + '.csv.bz2'
  # don't need to store the string, since it can be retrieved as the filename
  s_chunk[['v1', 'v2', 'v3']].to_csv(fname, index=False, compression='bz2')

# write to hdf5 file with strings as keys
# what I'm trying to do here is *not* save the strings in the datasets, but instead
# use the strings as the names (keys) for the datasets
# My understanding is this would enable me to retrieve the data for a given string
# with pd.read_hdf(h5data, key=<string for which I want data>)
h5data = HDFStore('/tmp/test.h5')
for s in unique_strings:
  s_chunk = df[df.string == s]
  # don't need to store the string, because we'll use it as the key
  s_chunk[['v1', 'v2', 'v3']].to_hdf(h5data, key=s, format='table', complib='bzip2')
h5data.close()

Resulting file sizes:

 18M  /tmp/test.csv
4.7M  /tmp/test.csv.bz2
 80M  /tmp/test.h5

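【Comments】:

This may be due to your chunk layout: the smaller the chunk dimensions, the more bloated your HDF5 file becomes. Try to find the best balance between the chunk size (so that it serves your use case properly) and the overhead, in terms of size, that the chunks introduce into the HDF5 file.

【Answer 1】: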

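This is probably because Pandas dumps a lot of extraneous information for each group/dataset into the HDF5 file. This is evident when I run your code and inspect the file with HDFView.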
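If HDFView is not at hand, a similar check can be scripted with h5py. The sketch below is not part of the original answer: `summarize_hdf5` is a hypothetical helper, and the small demo file it builds is only a stand-in for `/tmp/test.h5`. It walks every object in a file and counts groups, datasets, and attributes, next to the bytes actually allocated for dataset storage, so that structure-heavy layouts stand out.

```python
import os
import tempfile

import h5py
import numpy as np


def summarize_hdf5(path):
    """Count groups, datasets and attributes, and sum dataset storage bytes."""
    summary = {"groups": 0, "datasets": 0, "attributes": 0, "stored_bytes": 0}

    def visit(name, obj):
        summary["attributes"] += len(obj.attrs)
        if isinstance(obj, h5py.Dataset):
            summary["datasets"] += 1
            # Bytes allocated in the file for this dataset's raw data.
            summary["stored_bytes"] += obj.id.get_storage_size()
        elif isinstance(obj, h5py.Group):
            summary["groups"] += 1

    with h5py.File(path, "r") as f:
        f.visititems(visit)  # visits every object below the root group
    summary["file_bytes"] = os.path.getsize(path)
    return summary


# Demo on a small stand-in file (hypothetical keys and sizes).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(demo_path, "w") as f:
    for key in ("aaaa", "bbbb"):
        g = f.create_group(key)
        g.create_dataset("data", data=np.random.rand(200, 3).astype(np.float32))

info = summarize_hdf5(demo_path)
print(info)
```

Running it on the pandas-written file and on an h5py-written one, and comparing `file_bytes` with `stored_bytes`, gives a rough measure of how much of each file is structure rather than data. Each dataset's `chunks` and `compression` properties can be printed in the same callback to check the chunk layout mentioned in the comment above.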

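I prefer to use the h5py library to create and manage HDF5 files, because it is simpler and gives more control.

I tried using h5py to structure the file so that each group is named after a unique string, and within each group there is one dataset for each column of the DataFrame. I used the following in your script to write the HDF5 file: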

import h5py

with h5py.File("/tmp/test.h5", "w") as h5data:
    for s in unique_strings:
        s_chunk = df[df.string == s]
        # create group with name = string
        g = h5data.create_group(s)
        # create datasets within group for each data column
        dset_v1 = g.create_dataset("v1", data=s_chunk["v1"].values.astype(np.float32), compression="gzip")
        dset_v2 = g.create_dataset("v2", data=s_chunk["v2"].values.astype(np.float32), compression="gzip")
        dset_v3 = g.create_dataset("v3", data=s_chunk["v3"].values.astype(np.float32), compression="gzip")

Results (note that I used gzip instead of bz2):

 18M    /tmp/test.csv
5.2M    /tmp/test.csv.bz2
 11M    /tmp/test.h5
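To put the two HDF5 sizes in perspective, the figures can be turned into bytes per identifier. This is only back-of-the-envelope arithmetic on the numbers reported in this thread (1000 identifiers, 200 rows of 3 floats each, an 80M file from pandas and an 11M file from h5py); it assumes that "M" in the listings means MiB.

```python
MIB = 1024 * 1024  # assumption: the "M" in the listings means MiB

n_keys = 1000        # unique identifiers in the test script
rows, cols = 200, 3  # rows and float columns per identifier

payload_f64 = rows * cols * 8  # bytes per identifier as float64
payload_f32 = rows * cols * 4  # bytes per identifier as float32

# Reported HDF5 file sizes in MiB.
reported = {"pandas HDFStore": 80, "h5py, three datasets per group": 11}
per_key = {label: size * MIB / n_keys for label, size in reported.items()}

print(f"payload per key: {payload_f64} B (float64), {payload_f32} B (float32)")
for label, nbytes in per_key.items():
    print(f"{label}: about {nbytes:,.0f} B per key")
```

That works out to roughly 84 KB per identifier in the pandas file and 11.5 KB in the h5py file, against 2,400 bytes of float32 values. This is consistent with fixed per-group and per-dataset costs dominating when each key holds so little data.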

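A further optimization is to have only one dataset in each group, where that dataset is a 2-D array. In this case, the three create_dataset calls are replaced by a single one: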

dset = g.create_dataset("data", data=s_chunk[["v1", "v2", "v3"]].values.astype(np.float32), compression="gzip")

Results:

 18M    /tmp/test.csv
5.0M    /tmp/test.csv.bz2
6.0M    /tmp/test.h5

Using bz2 for compression would shrink it further.
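The question also asks about retrieving the data for a single identifier. With the layout above (one group per identifier, holding a 2-D dataset named "data"), a lookup might be sketched as follows. The file path, the two identifiers, and the `read_rows` helper are made up for illustration, and the demo writes its own small file rather than relying on `/tmp/test.h5`.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical stand-in file: one group per identifier, each holding
# a single gzip-compressed 2-D dataset named "data".
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
rng = np.random.default_rng(0)
written = {}
with h5py.File(path, "w") as f:
    for key in ("identifier_one", "identifier_two"):
        values = rng.random((200, 3)).astype(np.float32)
        written[key] = values
        f.create_group(key).create_dataset("data", data=values, compression="gzip")


def read_rows(h5_path, key):
    """Return the array stored for one identifier."""
    with h5py.File(h5_path, "r") as f:
        # Indexing with [()] reads the whole dataset into a NumPy array.
        return f[key]["data"][()]


rows = read_rows(path, "identifier_two")
print(rows.shape, rows.dtype)
```

Only the requested group is read and decompressed, so the cost of a lookup does not depend on how many other identifiers the file holds.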

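【Comments】:

Thanks. I worked with h5py and it is definitely an improvement. In the end, I still felt the files I was getting were much larger than they needed to be, so I went with the "save lots of gzip-compressed csv files in strategically named directories" version.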
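The comment does not say how the directories were named. One possible scheme, shown here purely as an assumption, is to shard files by the leading characters of the identifier so that no single directory ends up with millions of entries. The helpers `shard_path`, `write_rows`, and `read_rows` and the sample identifier are all hypothetical.

```python
import csv
import gzip
import tempfile
from pathlib import Path


def shard_path(root, identifier):
    """Map an identifier to root/<first 2 chars>/<next 2 chars>/<identifier>.csv.gz."""
    return Path(root) / identifier[:2] / identifier[2:4] / f"{identifier}.csv.gz"


def write_rows(root, identifier, rows):
    """Write the numeric rows for one identifier to its gzip-compressed CSV file."""
    path = shard_path(root, identifier)
    path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(path, "wt", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["v1", "v2", "v3"])
        writer.writerows(rows)
    return path


def read_rows(root, identifier):
    """Read back the rows for one identifier as lists of floats."""
    with gzip.open(shard_path(root, identifier), "rt", newline="") as fh:
        reader = csv.reader(fh)
        next(reader)  # skip the header row
        return [[float(x) for x in row] for row in reader]


root = tempfile.mkdtemp()
identifier = "abcdefghijklmnopqrstuvwxyzabcdef"  # a made-up 32-character identifier
saved = write_rows(root, identifier, [[1.2, 2.3, 3.4], [2.1, 1.0, 0.5]])
print(saved.relative_to(root))
print(read_rows(root, identifier))
```

As in the question's per-file variant, the identifier lives only in the file name and is never repeated inside the data. How many directory levels are appropriate depends on the identifier alphabet and on the file system.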
