How to use 1D arrays from an hdf5 file and perform operations such as subtraction and addition on them?

Posted 2023-03-11


【Asked】: 2021-11-09 22:14:05 【Question】:

I have 1D arrays that look like this:

array([(b'2P1', b'aP1', 2, 37.33,  4.4 , 3.82),
   (b'3P2', b'aP2', 3, 18.74, -9.67, 4.85),
   (b'4P2', b'aP2', 4, 55.16, 74.22, 4.88)],

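As you can see, the numbers are mixed with strings. I can't access them element-wise. For example, if I want to subtract the first row from the second row using only the float columns, I can't do it. Is there a way to do this? (The original post linked to the hdf5 data file.) Below is the code that reads the hdf5 file: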

import numpy as np
import h5py

with h5py.File('xaa.h5', 'r') as hdff:
    base_items = list(hdff.items())
    print('Items in the base directory: ', base_items)
    dat1 = np.array(hdff['particles/lipids/positions/dataset_0001'])
    dat2 = np.array(hdff['particles/lipids/positions/dataset_0002'])
    print(dat1)

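【Comments】:

What is the dtype of the array? Given the (), I suspect it is a compound dtype with string, int, and float fields. In other words, it is a structured array. Have you read any of the numpy introductory material about these?

These are dtype([('col1', 'S7'), ('col2', 'S8'), ('col3', '…

Yes, I just read it. Thanks for pointing it out.

【Answer 1】: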

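The other answer is a good approach, but you don't have to use recfunctions. Once you know the dataset's dtype and shape, you can create an empty array and fill it by reading the fields of interest with field-slicing notation, as that answer shows.

Here is code that does this. (Because we know you are reading 3 floats, and float is the default dtype for np.empty(), I didn't bother getting the field dtypes from the dataset. That is easy to add if you need to slice integer or string fields.)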

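import numpy as np
import h5py

with h5py.File('xaa.h5', 'r') as hdf:
    grp = hdf['particles/lipids/positions']
    ds1 = grp['dataset_0000']
    nrows = ds1.shape[0]
    # np.empty() defaults to float64, which matches the three '<f8' fields
    arr = np.empty((nrows, 3))
    # Read one field at a time into its own column
    arr[:, 0] = ds1['col4'][:]
    arr[:, 1] = ds1['col5'][:]
    arr[:, 2] = ds1['col6'][:]

    print(arr[0:10, :])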

Output:

[[ 80.48  35.36   4.25]
 [ 37.45   3.92   3.96]
 [ 18.53  -9.69   4.68]
 [ 55.39  74.34   4.6 ]
 [ 22.11  68.71   3.85]
 [ -4.13  24.04   3.73]
 [ 40.16   6.39   4.73]
 [ -5.4   35.73   4.85]
 [ 36.67  22.45   4.08]
 [ -3.68 -10.66   4.18]]
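
With the three float columns in a plain 2-D array, the row arithmetic asked about in the question becomes ordinary NumPy arithmetic. Here is a minimal sketch, assuming an in-memory structured array in place of the HDF5 dataset. The sample rows and the names `records` and `xyz` are illustrative and are not read from `xaa.h5`:

```python
import numpy as np

# Same compound dtype as the dataset; the rows are made-up sample values.
dt = np.dtype([('col1', 'S7'), ('col2', 'S8'), ('col3', '<i8'),
               ('col4', '<f8'), ('col5', '<f8'), ('col6', '<f8')])
records = np.array([(b'1P1', b'aP1', 1, 80.0, 35.5, 4.25),
                    (b'2P1', b'aP1', 2, 37.5, 4.0, 4.0),
                    (b'3P2', b'aP2', 3, 18.5, -9.5, 4.75)], dtype=dt)

# Fill a plain float array one field at a time, as in the answer above.
xyz = np.empty((records.shape[0], 3))
for j, name in enumerate(('col4', 'col5', 'col6')):
    xyz[:, j] = records[name]

row_diff = xyz[1] - xyz[0]          # second row minus first row
step_diffs = np.diff(xyz, axis=0)   # every row minus the one before it
col_means = xyz.mean(axis=0)        # per-column mean
print(row_diff)
```

The same expressions should work on the `arr` built from the file, since it is also a plain float array of shape (nrows, 3).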


【Answer 2】:

In [188]: f = h5py.File('../Downloads/xaa.h5')
In [189]: f
Out[189]: <HDF5 file "xaa.h5" (mode r)>
...
In [194]: f['particles/lipids/positions'].keys()
Out[194]: <KeysViewHDF5 ['dataset_0000', 'dataset_0001', 'dataset_0002', 'dataset_0003', 'dataset_0004', 'dataset_0005', 'dataset_0006', 'dataset_0007', 'dataset_0008', 'dataset_0009']>
...
In [196]: f['particles/lipids/positions/dataset_0000'].dtype
Out[196]: dtype([('col1', 'S7'), ('col2', 'S8'), ('col3', '<i8'), ('col4', '<f8'), ('col5', '<f8'), ('col6', '<f8')])

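As I suspected, this is a structured array: https://numpy.org/doc/stable/user/basics.rec.html

In the lines below, `arr` is presumably the dataset loaded into a NumPy array (for example with `f['particles/lipids/positions/dataset_0000'][:]`); that step is not shown in the session.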

In [202]: arr[0]
Out[202]: (b'1P1', b'aP1', 1, 80.48, 35.36, 4.25)
In [203]: arr['col1'][:10]
Out[203]: 
array([b'1P1', b'2P1', b'3P2', b'4P2', b'5P3', b'6P3', b'7P4', b'8P4',
       b'9P5', b'10P5'], dtype='|S7')

We can view the float columns with:

In [204]: arr[['col4','col5','col6']][:10]
Out[204]: 
array([(80.48,  35.36, 4.25), (37.45,   3.92, 3.96),
       (18.53,  -9.69, 4.68), (55.39,  74.34, 4.6 ),
       (22.11,  68.71, 3.85), (-4.13,  24.04, 3.73),
       (40.16,   6.39, 4.73), (-5.4 ,  35.73, 4.85),
       (36.67,  22.45, 4.08), (-3.68, -10.66, 4.18)],
      dtype={'names':['col4','col5','col6'], 'formats':['<f8','<f8','<f8'], 'offsets':[23,31,39], 'itemsize':47})

But to treat these fields as a 2D array, we need the recfunctions utilities:

In [198]: import numpy.lib.recfunctions as rf

In [205]: rf.structured_to_unstructured( arr[['col4','col5','col6']][:10])
Out[205]: 
array([[ 80.48,  35.36,   4.25],
       [ 37.45,   3.92,   3.96],
       [ 18.53,  -9.69,   4.68],
       [ 55.39,  74.34,   4.6 ],
       [ 22.11,  68.71,   3.85],
       [ -4.13,  24.04,   3.73],
       [ 40.16,   6.39,   4.73],
       [ -5.4 ,  35.73,   4.85],
       [ 36.67,  22.45,   4.08],
       [ -3.68, -10.66,   4.18]])
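
Once converted, the result is a regular float array, so two datasets with the same row order can be subtracted or added directly. This is a hedged sketch: `float_block`, `frame_a`, and `frame_b` are hypothetical names, and the arrays are built in memory with made-up values rather than read from the file.

```python
import numpy as np
import numpy.lib.recfunctions as rf

dt = np.dtype([('col1', 'S7'), ('col2', 'S8'), ('col3', '<i8'),
               ('col4', '<f8'), ('col5', '<f8'), ('col6', '<f8')])

def float_block(records, fields=('col4', 'col5', 'col6')):
    """Return the named float fields as a plain (nrows, nfields) array."""
    return rf.structured_to_unstructured(records[list(fields)])

# Two made-up "frames" standing in for two datasets from the file.
frame_a = np.array([(b'1P1', b'aP1', 1, 80.0, 35.5, 4.25),
                    (b'2P1', b'aP1', 2, 37.5, 4.0, 4.0)], dtype=dt)
frame_b = np.array([(b'1P1', b'aP1', 1, 80.5, 35.0, 4.5),
                    (b'2P1', b'aP1', 2, 37.0, 4.5, 3.75)], dtype=dt)

# Check that both frames list the same particles in the same order.
same_order = np.array_equal(frame_a['col1'], frame_b['col1'])

delta = float_block(frame_b) - float_block(frame_a)  # element-wise difference
total = float_block(frame_a) + float_block(frame_b)  # element-wise sum
print(same_order)
print(delta)
```

Checking the label column first is a design choice for safety: element-wise subtraction only makes sense if row i refers to the same particle in both arrays.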
