Numpy: load a memory-mapped array (mmap_mode) from Google Cloud Storage

Posted 2023-02-23


【Title】: Numpy load a memory-mapped array (mmap_mode) from google cloud storage
【Asked】: 2020-04-19 09:27:51
【Question】:

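I want to load a .npy file from Google Storage (gs://project/file.npy) into my Google ML job as training data. Since the file is more than 10 GB, I want to use the mmap_mode option of numpy.load() to avoid running out of memory.

Background: I use Keras with fit_generator and a Keras Sequence to load batches of data from a .npy file stored on Google Storage.

To access Google Storage I am using BytesIO, since not every library can access Google Storage directly. This code works fine without mmap_mode = 'r':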

import numpy as np
from io import BytesIO
from tensorflow.python.lib.io import file_io

filename = 'gs://project/file'

# Read the whole .npy file from Google Storage into an in-memory buffer
x_file = BytesIO(file_io.read_file_to_string(filename + '.npy', binary_mode = True))
x = np.load(x_file)

If I enable mmap_mode, I get the following error:

TypeError: expected str, bytes or os.PathLike object, not BytesIO

I don't understand why it no longer accepts a BytesIO object.

Code including mmap_mode:

x_file = BytesIO(file_io.read_file_to_string(filename + '.npy', binary_mode = True))
x = np.load(x_file, mmap_mode = 'r')

Traceback:

File "[...]/numpy/lib/npyio.py", line 444, in load
    return format.open_memmap(file, mode=mmap_mode)
File "[...]/numpy/lib/format.py", line 829, in open_memmap
    fp = open(os_fspath(filename), 'rb')
File "[...]/numpy/compat/py3k.py", line 237, in os_fspath
    "not " + path_type.__name__)
TypeError: expected str, bytes or os.PathLike object, not BytesIO
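
The behaviour can be reproduced locally without Google Storage. The following is a minimal sketch, assuming a small in-memory array as a stand-in for the bytes returned by file_io.read_file_to_string(). The exact exception type and message may differ between NumPy versions, so the sketch catches both TypeError and ValueError.

```python
import io
import numpy as np

# Build a small .npy payload in memory; this stands in for the bytes
# that would be read from gs:// (an assumption of this sketch).
buffer = io.BytesIO()
np.save(buffer, np.arange(6, dtype=np.float32))
buffer.seek(0)

# A plain load only reads from the object, so a file-like object is accepted.
x = np.load(buffer)
print(x.shape)

# With mmap_mode, NumPy has to open a file by name,
# so the BytesIO object is rejected.
buffer.seek(0)
try:
    np.load(buffer, mmap_mode='r')
    mmap_error = None
except (TypeError, ValueError) as exc:
    mmap_error = exc
print(type(mmap_error).__name__, mmap_error)
```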

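【Comments】:

Look at the documentation (or code) of np.lib.npyio.format.open_memmap. It says: "The name of the file on disk. This may *not* be a file-like object". After handling the save/load header, this code uses np.memmap, so it is limited to what np.memmap can handle.

In memmap mode, the file is accessed "randomly". In a normal load, access is sequential: one byte after another, without any kind of backtracking or seeking.

Is it possible to split your dataset so that you can upload and process it incrementally? Also, could you check this thread and see whether the ideas there fit your needs?

If you can split the file further, that would be the best option from my point of view. If you can avoid running out of memory that way, the problem is solved immediately. To be fair, I don't know how to apply range headers to your use case, but I thought sharing the information was a good idea.

【Answer 1】: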

You can use b.getvalue() to convert the BytesIO object to bytes:

x_file = BytesIO(file_io.read_file_to_string(filename + '.npy', binary_mode = True))
x = np.load(x_file.getvalue(), mmap_mode = 'r')
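
Note that getvalue() returns the raw contents of the buffer, and np.load would treat a bytes argument as a file name rather than as data, so the snippet above is unlikely to produce a memory-mapped array as written. Following the comment that open_memmap needs "the name of the file on disk", an alternative is to copy the file to local disk first and memory-map that copy. The following is a minimal sketch under stated assumptions: the job has enough local disk space for the file, the copy from gs:// has already happened (for example with file_io.copy, if your TensorFlow version provides it; that step is not shown here), and a small temporary file stands in for the download. The names local_path and get_batch are hypothetical.

```python
import os
import tempfile
import numpy as np

# Stand-in for the downloaded file. In the real job this file would first be
# copied from gs://project/file.npy to local disk (assumption: not shown here).
local_path = os.path.join(tempfile.mkdtemp(), 'file.npy')
np.save(local_path, np.arange(20, dtype=np.float32).reshape(10, 2))

# A real path on disk is what mmap_mode expects; the data stays on disk
# and is only paged in when it is indexed.
x = np.load(local_path, mmap_mode='r')


def get_batch(data, index, batch_size):
    """Copy one batch out of the memory-mapped array into RAM."""
    start = index * batch_size
    return np.array(data[start:start + batch_size])


batch = get_batch(x, 1, 4)
print(type(x).__name__, batch.shape)
```

A helper like get_batch could be called from the __getitem__ method of a Keras Sequence, so that only one batch at a time is copied into memory.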

【Discussion】:
