How to read from (hdf5) file in async contexts?

Posted 2023-03-08


【Title】: How to read from (hdf5) file in async contexts? 【Posted】: 2019-05-02 12:27:24 【Question】:

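Lately I have been playing with some of Python 3's async features. Overall I am happy with the 3.6 syntax and, of course, with the performance gains you get. In my opinion, one of the exciting projects evolving around the ASGI standard is starlette. I have a sample application running in which I read data from an hdf5 file. h5py does not support asynchronous I/O yet. That leaves me with a question: does what I am doing here make any sense? As I understand it, this code runs synchronously after all. What is the recommended way to do I/O in async contexts?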

import asyncio
import datetime

from starlette.applications import Starlette
from starlette.responses import JSONResponse

# GridH5ResultAdmin, gridadmin_f and results_f are defined elsewhere in the application.
app = Starlette()

async def _flow(indexes):
    print('received flow indexes %s ' % indexes)
    # uses h5py under the hood; nothing here is awaited, so the read blocks the event loop
    gr = GridH5ResultAdmin(gridadmin_f, results_f)
    t = gr.nodes.timeseries(indexes=indexes)
    data = t.only('s1').data
    # data is a numpy array
    return data['s1'].tolist()

@app.route('/flow_velocity')
async def flow_results(request):
    indexes_list = [[2, 3, 4, 5], [6, 7, 8, 9], [10, 11, 12, 13]]

    tasks = []
    loop = asyncio.get_event_loop()
    t0 = datetime.datetime.now()
    for indexes in indexes_list:
        print('Start getting indexes "%s"' % indexes)
        # Launch a coroutine for each data fetch
        task = loop.create_task(_flow(indexes))
        tasks.append(task)

    # Wait on, and then gather, all data
    flow_data = await asyncio.gather(*tasks)
    dt = (datetime.datetime.now() - t0).total_seconds()
    print('elapsed time: {} [s]'.format(dt))

    return JSONResponse({'flow_velocity': flow_data})

Logging:

INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Start getting indexes "[2, 3, 4, 5]"
Start getting indexes "[6, 7, 8, 9]"
Start getting indexes "[10, 11, 12, 13]"
received flow indexes [2, 3, 4, 5] 
received flow indexes [6, 7, 8, 9] 
received flow indexes [10, 11, 12, 13]
elapsed time: 1.49779 [s]


【Answer 1】:

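Unfortunately, you cannot use asyncio with the h5py module. What you are doing here is mostly sequential: if the I/O part cannot be done asynchronously, the rest of your async code does not gain you much.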

https://github.com/h5py/h5py/issues/837

Summary of that thread:

So there are two different problems with adding async support:

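1. asyncio currently has explicitly no support for filesystem I/O, see e.g. https://github.com/python/asyncio/wiki/ThirdParty#filesystem, https://groups.google.com/forum/#!topic/python-tulip/MvpkQeetWZA, "What is the status of POSIX asynchronous I/O (AIO)?" and https://github.com/Tinche/aiofiles, which is the closest to what you want.

2. All I/O is done through HDF5 (the library), so any async support you want to add would need support in HDF5 (the library).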

This basically means that h5py is unlikely to ever support asyncio.
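The sequential behaviour seen in the question's log can be reproduced without h5py. The sketch below is only an illustration under stated assumptions: `blocking_read` is a hypothetical stand-in for a synchronous h5py read (simulated with `time.sleep`), and all function names are made up for this example. It uses `asyncio.run`, so it needs Python 3.7 or later. A coroutine that never awaits anything holds the event loop until it returns, so gathering three of them takes roughly the sum of their durations.

```python
import asyncio
import time


def blocking_read(delay):
    # Hypothetical stand-in for a synchronous h5py read: blocks the calling thread.
    time.sleep(delay)
    return delay


async def fetch_blocking(delay):
    # Nothing is awaited here, so the event loop cannot switch to another
    # task until blocking_read() has returned.
    return blocking_read(delay)


async def fetch_cooperative(delay):
    # asyncio.sleep() yields control to the event loop while waiting,
    # so the other tasks can make progress in the meantime.
    await asyncio.sleep(delay)
    return delay


async def timed_gather(coro_fn, delays):
    # Run one coroutine per delay and measure the total wall-clock time.
    start = time.perf_counter()
    results = await asyncio.gather(*(coro_fn(d) for d in delays))
    return results, time.perf_counter() - start


def compare(delays=(0.1, 0.1, 0.1)):
    _, blocking_elapsed = asyncio.run(timed_gather(fetch_blocking, delays))
    _, cooperative_elapsed = asyncio.run(timed_gather(fetch_cooperative, delays))
    return blocking_elapsed, cooperative_elapsed


if __name__ == "__main__":
    blocking_elapsed, cooperative_elapsed = compare()
    print("blocking coroutines:    %.2f s" % blocking_elapsed)
    print("cooperative coroutines: %.2f s" % cooperative_elapsed)
```

The blocking variant should take about as long as all delays added together, while the cooperative variant should take about as long as the longest single delay.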

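You can try running things in threads, but there is no guarantee that this will work well: as I mentioned, HDF5 controls the I/O, and you would need to make sure you do not run into any of its locking controls. You probably want to find out which of the file modes mentioned at http://docs.h5py.org/en/latest/high/file.html#file-drivers suits you best. Maybe you could consider other alternatives, such as multiprocessing or concurrent.futures?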
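One way to try the thread-based approach is `loop.run_in_executor` together with `concurrent.futures.ThreadPoolExecutor`. The following is a minimal sketch, not a tested h5py recipe: `read_indexes` is a hypothetical placeholder for the blocking read done in `_flow`, and the pool size of 3 is an arbitrary assumption. It needs Python 3.7 or later for `asyncio.run` and `asyncio.get_running_loop`.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def read_indexes(indexes):
    # Hypothetical placeholder for the synchronous, h5py-backed read.
    time.sleep(0.1)
    return [i * 2 for i in indexes]


async def fetch_all(indexes_list, executor):
    loop = asyncio.get_running_loop()
    # Each blocking call is handed to a worker thread, so the event loop
    # stays free to serve other requests while the reads are in flight.
    futures = [
        loop.run_in_executor(executor, read_indexes, indexes)
        for indexes in indexes_list
    ]
    # gather() preserves the order of the submitted calls.
    return await asyncio.gather(*futures)


def main():
    indexes_list = [[2, 3, 4, 5], [6, 7, 8, 9], [10, 11, 12, 13]]
    with ThreadPoolExecutor(max_workers=3) as executor:
        return asyncio.run(fetch_all(indexes_list, executor))


if __name__ == "__main__":
    print(main())
```

With real h5py reads the threads may still end up waiting on each other because of the locking mentioned above, so the event loop stays responsive but the reads themselves are not necessarily faster. Passing a `ProcessPoolExecutor` instead is the multiprocessing variant of the same idea; in that case the worker function has to be picklable and would presumably open the file itself.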

【Comments】:

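You can pass h5py a Python file-like object and then implement asyncio at the level of that file-like object (implementing read, write, truncate, etc.). I have a working example of this (it took a lot of effort), but I think I may be running into the h5 locking mechanism you mention here, because things seem to run almost sequentially. The same code using raw .read() calls on the file-like object runs very fast: 1.5 GB/s of random seeks from a local cluster S3 interface using asyncio (with 20 event loop instances). The HDF5 world has made some progress on asynchronous I/O: hdf5-vol-async.readthedocs.io/en/latest, but I know of no Python bindings.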
