How to read an HDF5 scalar attribute whose value is an array of variable-length char* (i.e. C strings)?


Asked: 2017-04-05 12:30:35

Question:

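I have successfully created a scalar attribute whose value is a variable-length array of const char*. However, I don't understand how to read this attribute back!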

Here is the code I used to create the attribute:

    #include <string>
    #include <vector>
    #include <boost/lexical_cast.hpp>
    #include <boost/test/unit_test.hpp>
    #include "H5Cpp.h"

    // m_file_name and m_dataset_name are defined elsewhere (not shown).
    void create_attribute_with_vector_of_strings_as_value()
    {
        using namespace H5;

        // Create some test strings.
        std::vector<std::string> strings;
        for (int iii = 0; iii < 10; iii++)
        {
            strings.push_back("this is " + boost::lexical_cast<std::string>(iii));
        }

        // Part 1: grab pointers to the chars
        // (the pointers stay valid only while `strings` is left unmodified)
        std::vector<const char*> chars;
        for (auto si = strings.begin(); si != strings.end(); ++si)
        {
            std::string &s = (*si);
            chars.push_back(s.c_str());
        }
        BOOST_TEST_MESSAGE("Size of char* array is:  " << chars.size());

        // Part 2: create the variable length type
        hvl_t hdf_buffer;
        hdf_buffer.p = chars.data();
        hdf_buffer.len = chars.size();

        // Part 3: create the type
        auto s_type = H5::StrType(H5::PredType::C_S1, H5T_VARIABLE);
        auto svec_type = H5::VarLenType(&s_type);

        try
        {
            // Open an existing file and dataset.
            H5File file(m_file_name.c_str(), H5F_ACC_RDWR);

            // Part 4: write the output to a scalar attribute
            DataSet dataset = file.openDataSet(m_dataset_name.c_str());

            std::string filter_names = "multi_filters";

            Attribute attribute = dataset.createAttribute( filter_names.c_str(), svec_type, H5S_SCALAR);
            attribute.write(svec_type, &hdf_buffer);
            file.close();
        }
        catch (const H5::Exception &e)
        {
            // Handler added so the snippet compiles; the original post ends at the try block.
            BOOST_TEST_MESSAGE("HDF5 error: " << e.getDetailMsg());
        }
    }

This is the dataset with the attribute, as shown by h5dump:

    HDF5 "d:\tmp\hdf5_tutorial\h5tutr_dset.h5" {
    GROUP "/" {
       DATASET "dset" {
          DATATYPE  H5T_STD_I32BE
          DATASPACE  SIMPLE { ( 4, 6 ) / ( 4, 6 ) }
          DATA {
          (0,0): 1, 7, 13, 19, 25, 31,
          (1,0): 2, 8, 14, 20, 26, 32,
          (2,0): 3, 9, 15, 21, 27, 33,
          (3,0): 4, 10, 16, 22, 28, 34
          }
          ATTRIBUTE "multi_filters" {
             DATATYPE  H5T_VLEN { H5T_STRING {
                STRSIZE H5T_VARIABLE;
                STRPAD H5T_STR_NULLTERM;
                CSET H5T_CSET_ASCII;
                CTYPE H5T_C_S1;
             }}
             DATASPACE  SCALAR
             DATA {
             (0): ("this is 0", "this is 1", "this is 2", "this is 3", "this is 4", "this is 5", "this is 6", "this is 7", "this is 8", "this is 9")
             }
          }
       }
    }
    }

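I don't understand how to read this data back. The code I have tried so far is below. It compiles, but I have hard-wired the array size to the known length, and the variable-length C strings come back empty. Does anyone have a suggestion as to where I am going wrong? Specifically: how do I query the length of the const char* array, and how do I read the actual C strings it contains?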

    void read_attribute_with_vector_of_strings_as_value()
    {
        using namespace H5;

        std::vector<std::string> strings;

        try
        {
            // Open an existing file and dataset readonly
            H5File file(m_file_name.c_str(), H5F_ACC_RDONLY);

            // Part 4: Open the dataset
            DataSet dataset = file.openDataSet(m_dataset_name.c_str());

            // Attribute name
            std::string filter_names = "multi_filters";

            Attribute attribute = dataset.openAttribute(filter_names.c_str());
            size_t sz = attribute.getInMemDataSize();
            size_t sz_1 = attribute.getStorageSize();
            auto t1 = attribute.getDataType();
            VarLenType t2 = attribute.getVarLenType();
            H5T_class_t type_class = attribute.getTypeClass();
            if (type_class == H5T_STRING)
                BOOST_TEST_MESSAGE("H5T_STRING");

            // Array size hard-wired to the known length
            int length = 10;
            std::vector<char*> tmp_vec(length);
            auto s_type = H5::StrType(H5::PredType::C_S1, H5T_VARIABLE);
            auto svec_type = H5::VarLenType(&s_type);

            hvl_t hdf_buffer;
            hdf_buffer.p = tmp_vec.data();
            hdf_buffer.len = length;
            attribute.read(svec_type, &hdf_buffer);
            //attribute.read(s_type, &hdf_buffer);
            //attribute.read(tmp_vec.data(), s_type);

            for(size_t x = 0; x < tmp_vec.size(); ++x)
            {
                fprintf(stdout, "GOT STRING [%s]\n", tmp_vec[x] );
                // push_back, because `strings` starts out empty
                strings.push_back(tmp_vec[x]);
            }

            file.close();
        }
        catch (const H5::Exception &e)
        {
            // Handler added so the snippet compiles; the original post ends at the try block.
            BOOST_TEST_MESSAGE("HDF5 error: " << e.getDetailMsg());
        }
    }

Answer 1:

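If you are not tied to a specific technology, you could consider HDFql (http://www.hdfql.com), a high-level, SQL-like language for managing HDF files. It relieves you of the low-level details of handling HDF files such as the one described. Using HDFql in C++, reading an array of variable-length char is done like this: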

    // include HDFql C++ header file (make sure it can be found by the C++ compiler)
    #include <iostream>
    #include "HDFql.hpp"

    int main(int argc, char *argv[])
    {

        // create an HDF file named "example.h5" and use (i.e. open) it
        HDFql::execute("CREATE AND USE FILE example.h5");

        // create an attribute named "multi_filters" of type varchar of one dimension (size 5)
        HDFql::execute("CREATE ATTRIBUTE multi_filters AS VARCHAR(5)");

        // insert (i.e. write) values "Red", "Green", "Blue", "Orange" and "Yellow" into attribute "multi_filters"
        HDFql::execute("INSERT INTO multi_filters VALUES(Red, Green, Blue, Orange, Yellow)");

        // select (i.e. read) attribute "multi_filters" into HDFql default cursor
        HDFql::execute("SELECT FROM multi_filters");

        // display content of HDFql default cursor
        while(HDFql::cursorNext() == HDFql::Success)
        {
            std::cout << "Color " << HDFql::cursorGetChar() << " has a size of " << HDFql::cursorGetSize() << std::endl;
        }

        return 0;
    }
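
For readers who want to stay with the plain HDF5 C++ API, the part that usually causes trouble is what the buffer looks like after the read. As far as I understand the HDF5 documentation, when a variable-length type is read the library allocates the memory itself and fills in the `hvl_t`: `len` becomes the number of elements and `p` points to them. That would mean there is no need to pre-size a `std::vector<char*>`, and the length does not have to be known in advance. With a variable-length string as the base type, `p` should then point to an array of `char*`. The sketch below shows only the copy-out step. `VlenBuffer` and `copy_strings` are hypothetical names, and `VlenBuffer` is a local stand-in that mirrors the two fields of `hvl_t` so that the example compiles without the HDF5 headers; it is not part of HDF5.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for HDF5's hvl_t (element count plus pointer to the
// elements), declared locally so this sketch builds without HDF5 installed.
struct VlenBuffer {
    std::size_t len;  // number of elements in the sequence
    void*       p;    // pointer to the first element
};

// Copies every C string referenced by a variable-length buffer whose elements
// are char* pointers. The returned vector owns its data, so the source buffer
// can be released afterwards.
std::vector<std::string> copy_strings(const VlenBuffer& buffer)
{
    // A non-empty buffer must actually point at something.
    assert(buffer.len == 0 || buffer.p != nullptr);

    const char* const* items = static_cast<const char* const*>(buffer.p);

    std::vector<std::string> strings;
    strings.reserve(buffer.len);
    for (std::size_t i = 0; i < buffer.len; ++i) {
        // Assumption of this sketch: a null entry is treated as an empty string.
        strings.emplace_back(items[i] != nullptr ? items[i] : "");
    }
    return strings;
}
```

With the real library the call sequence would presumably be: initialise `hvl_t hdf_buffer` with `len = 0` and `p = nullptr`, call `attribute.read(svec_type, &hdf_buffer)`, copy the strings out as above using `hdf_buffer.len` and `hdf_buffer.p`, then hand the buffer back to HDF5 (for example with `H5::DataSet::vlenReclaim`) so that the library-allocated memory is freed. Treat that sequence as an untested assumption and check it against the documentation of your HDF5 version.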
