Converting a CSV file to HDF5 using pandas

Posted 2023-03-11

【Title】: Converting CSV file to HDF5 using pandas 【Posted】: 2014-06-13 14:14:44 【Question】:

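When I use pandas to convert a csv file to an hdf5 file, the resulting file is extremely large. For example, a 170 MB test csv file (23 columns, 1.3 million rows) results in a 2 GB hdf5 file. However, if pandas is bypassed and the hdf5 file is written directly (using pytables), it is only 20 MB. In the following code (used to do the conversion in pandas), the values of the object columns in the data frame are explicitly converted to string objects (to prevent pickling):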

import pandas as pd

# Open the csv file as pandas data frame
data = pd.read_csv(csvfilepath, sep=delimiter, low_memory=False)

# Write the resulting data frame to the hdf5 file
data.to_hdf(hdf5_file_path, table_name, format='table', complevel=9,
            complib='lzo')

This is the resulting hdf5 file as inspected (using vitables):

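What seems odd to me is that the values are represented as (python?) lists per data type (values_block0: int, values_block1: float and values_block2: string) instead of as one specific column for every column in the csv file. I wonder whether this causes the large file size, and what the impact on query times will be?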

Given that about 1 TB has to be converted, I would like to know what can be done to reduce the size of the resulting hdf5 file?

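P.S. I am aware of this question, but it states that the large hdf5 file size is caused by the HDF5 format itself. That cannot be the cause in this case, given that the hdf5 file produced by bypassing pandas is much smaller.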

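P.P.S. As suggested by joris, using data.iloc instead of data.loc made no difference. I have since removed the "conversion", and that made no difference either. The info of the data frame as read, requested by Jeff: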

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1303331 entries, 0 to 1303330
Columns: 23 entries, _PlanId to ACTIVITY_Gratis
dtypes: float64(1), int64(5), object(17)

【Comments】:

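There is definitely something wrong with your conversion, since you should not get such lists in the data frame. First of all, str_cols are integer positions. In that case you should use data.iloc[:, str_cols] = instead of data.loc[..]. Does that already solve it?

Show df.info() right after you read the csv; you don't need to do any of the "conversions" you are doing. See this question: ***.com/questions/20428355/…, and the docs: pandas.pydata.org/pandas-docs/stable/io.html

Have you tried writing the file with the "fixed" format? Can you also show a sample of the data, df.head()?

【Answer 1】: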

Here's an informal comparison of times/sizes for various IO methods

Using pandas 0.13.1 on 64-bit Linux

Setup

In [1]: import numpy as np   # imports assumed by the session below

In [2]: from pandas import DataFrame

In [3]: N = 1000000

In [4]: df = DataFrame(dict([ ("int{0}".format(i),np.random.randint(0,10,size=N)) for i in range(5) ]))

In [5]: df['float'] = np.random.randn(N)

In [6]: from random import randrange

In [8]: for i in range(10):
   ...:     df["object_1_{0}".format(i)] = ['%08x'%randrange(16**8) for _ in range(N)]
   ...:     

In [9]: for i in range(7):
   ...:     df["object_2_{0}".format(i)] = ['%15x'%randrange(16**15) for _ in range(N)]
   ...:     

In [11]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 23 columns):
int0          1000000 non-null int64
int1          1000000 non-null int64
int2          1000000 non-null int64
int3          1000000 non-null int64
int4          1000000 non-null int64
float         1000000 non-null float64
object_1_0    1000000 non-null object
object_1_1    1000000 non-null object
object_1_2    1000000 non-null object
object_1_3    1000000 non-null object
object_1_4    1000000 non-null object
object_1_5    1000000 non-null object
object_1_6    1000000 non-null object
object_1_7    1000000 non-null object
object_1_8    1000000 non-null object
object_1_9    1000000 non-null object
object_2_0    1000000 non-null object
object_2_1    1000000 non-null object
object_2_2    1000000 non-null object
object_2_3    1000000 non-null object
object_2_4    1000000 non-null object
object_2_5    1000000 non-null object
object_2_6    1000000 non-null object
dtypes: float64(1), int64(5), object(17)


Save with various methods

In [12]: df.to_hdf('test_fixed.h5','data',format='fixed')

In [13]: df.to_hdf('test_table_no_dc.h5','data',format='table')

In [14]: df.to_hdf('test_table_dc.h5','data',format='table',data_columns=True)

In [15]: df.to_hdf('test_fixed_compressed.h5','data',format='fixed',complib='blosc',complevel=9)

In [16]: !ls -ltr *.h5
-rw-rw-r-- 1 jreback users 361093304 Apr 28 09:20 test_fixed.h5
-rw-rw-r-- 1 jreback users 311475690 Apr 28 09:21 test_table_no_dc.h5
-rw-rw-r-- 1 jreback users 351316525 Apr 28 09:22 test_table_dc.h5
-rw-rw-r-- 1 jreback users 317467870 Apr 28  2014 test_fixed_compressed.h5

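The size on disk is a function of the string size selected for each column; if you use NO data_columns, it is the longest size of any string. So writing with data_columns can potentially change the size here (balanced by the fact that you have more columns, so each one takes more space). You probably want to specify min_itemsize to control this; see the pandas documentation on string columns.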
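The effect of string widths can be checked with back-of-the-envelope arithmetic. The following is a minimal sketch, not pandas' own logic: it assumes the table format stores fixed-width strings, with one width (the longest string) shared by all string columns when no data_columns are used, and one width per column otherwise. The helper names `string_widths` and `estimate_row_bytes` are made up for this illustration, and HDF5 metadata, indexes and compression are ignored.

```python
import pandas as pd


def string_widths(df):
    """Longest UTF-8 encoded length, in bytes, of each string column."""
    widths = {}
    for col in df.columns:
        if pd.api.types.is_string_dtype(df[col]):
            widths[col] = max(
                (len(str(v).encode("utf-8")) for v in df[col]), default=0
            )
    return widths


def estimate_row_bytes(df, data_columns=False):
    """Rough bytes per table row: int64 index + numeric columns + strings."""
    widths = string_widths(df)
    numeric = sum(
        df[c].to_numpy().dtype.itemsize for c in df.columns if c not in widths
    )
    if data_columns:
        # Assumption: every string column keeps its own width.
        strings = sum(widths.values())
    else:
        # Assumption: all string columns share the width of the longest string.
        strings = len(widths) * max(widths.values(), default=0)
    return 8 + numeric + strings  # 8 bytes for the int64 index


# Small stand-in for the benchmark frame: 5 ints, 1 float,
# 10 strings of 8 characters and 7 strings of 15 characters.
cols = {"int{0}".format(i): [1, 2] for i in range(5)}
cols["float"] = [0.5, 1.5]
cols.update({"object_1_{0}".format(i): ["a" * 8] * 2 for i in range(10)})
cols.update({"object_2_{0}".format(i): ["b" * 15] * 2 for i in range(7)})
demo = pd.DataFrame(cols)

print(estimate_row_bytes(demo))                     # shared string width
print(estimate_row_bytes(demo, data_columns=True))  # per-column widths
```

For this layout the sketch gives 311 bytes per row with a shared string width, i.e. about 311 MB for 1,000,000 rows, which is close to the 311475690 bytes listed for test_table_no_dc.h5. With per-column widths it gives 241 bytes per row; the listed test_table_dc.h5 is larger than that, so something the sketch ignores (presumably per-column overhead such as indexes) dominates there. A dict like the one returned by `string_widths` has the shape that the `min_itemsize` argument of `to_hdf` accepts, provided the widths are upper bounds for all data that will ever be appended.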

Here's an example of the on-disk structure:

In [8]: DataFrame(dict(A = ['foo','bar','bah'], B = [1,2,3], C = [1.0,2.0,3.0], D=[4.0,5.0,6.0])).to_hdf('test.h5','data',mode='w',format='table')

In [9]: !ptdump -avd test.h5
/ (RootGroup) ''
  /._v_attrs (AttributeSet), 4 attributes:
   [CLASS := 'GROUP',
    PYTABLES_FORMAT_VERSION := '2.1',
    TITLE := '',
    VERSION := '1.0']
/data (Group) ''
  /data._v_attrs (AttributeSet), 14 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := [],
    encoding := None,
    index_cols := [(0, 'index')],
    info := {1: {'type': 'Index', 'names': [None]}, 'index': {}},
    levels := 1,
    nan_rep := 'nan',
    non_index_axes := [(1, ['A', 'B', 'C', 'D'])],
    pandas_type := 'frame_table',
    pandas_version := '0.10.1',
    table_type := 'appendable_frame',
    values_cols := ['values_block_0', 'values_block_1', 'values_block_2']]
/data/table (Table(3,)) ''
  description := 
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(2,), dflt=0.0, pos=1),
  "values_block_1": Int64Col(shape=(1,), dflt=0, pos=2),
  "values_block_2": StringCol(itemsize=3, shape=(1,), dflt='', pos=3)
  byteorder := 'little'
  chunkshape := (1872,)
  autoindex := True
  colindexes := 
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False
  /data/table._v_attrs (AttributeSet), 19 attributes:
   [CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := 0.0,
    FIELD_1_NAME := 'values_block_0',
    FIELD_2_FILL := 0,
    FIELD_2_NAME := 'values_block_1',
    FIELD_3_FILL := '',
    FIELD_3_NAME := 'values_block_2',
    NROWS := 3,
    TITLE := '',
    VERSION := '2.7',
    index_kind := 'integer',
    values_block_0_dtype := 'float64',
    values_block_0_kind := ['C', 'D'],
    values_block_1_dtype := 'int64',
    values_block_1_kind := ['B'],
    values_block_2_dtype := 'string24',
    values_block_2_kind := ['A']]
  Data dump:
[0] (0, [1.0, 4.0], [1], ['foo'])
[1] (1, [2.0, 5.0], [2], ['bar'])
[2] (2, [3.0, 6.0], [3], ['bah'])

Dtypes are grouped into blocks (if you have data_columns, then they are separate). They are just printed this way; they are stored array-like.
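The grouping can be reproduced from the data frame itself. This is only an illustrative sketch of the idea (collecting column names by dtype), not the code pandas uses internally, and `predict_blocks` is a hypothetical helper name. It uses the same four-column frame as the dump above.

```python
import pandas as pd


def predict_blocks(df):
    """Group column names by dtype, mimicking the values_block_N layout."""
    blocks = {}
    for col, dtype in df.dtypes.items():
        blocks.setdefault(str(dtype), []).append(col)
    return blocks


df = pd.DataFrame({
    "A": ["foo", "bar", "bah"],
    "B": [1, 2, 3],
    "C": [1.0, 2.0, 3.0],
    "D": [4.0, 5.0, 6.0],
})

# One entry per dtype; each entry lists the columns stored together.
for dtype, columns in predict_blocks(df).items():
    print(dtype, columns)
```

The float64 group holds C and D, the int64 group holds B, and the string column A ends up alone, which corresponds to the values_block_0_kind, values_block_1_kind and values_block_2_kind attributes in the dump. The numbering of the blocks on disk is not modelled here.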
