Using Datasets from large numpy arrays in Tensorflow

Posted 2023-02-16

[Title]: Using Datasets from large numpy arrays in Tensorflow [Posted]: 2021-12-27 19:35:47 [Question]:

I am trying to load a dataset that is stored in two .npy files on my drive (one for the features, one for the ground truth) and use it to train a neural network.

import numpy as np
import tensorflow as tf

print("loading features...")
data = np.load("[...]/features.npy")

print("loading labels...")
labels = np.load("[...]/groundtruth.npy") / 255

dataset = tf.data.Dataset.from_tensor_slices((data, labels))

Calling the from_tensor_slices() method raises the following error: tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.

The ground truth file is larger than 2.44GB, so I run into problems when creating a dataset from it (see the warnings here and here).

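The possible solutions I found were either written for TensorFlow 1.x (here and here, whereas I am running version 2.6) or relied on numpy's memmap (here), which I unfortunately could not get to run. I also wonder whether memmap would slow down the computation?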

I appreciate your help, thanks!

[Comments]:

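I ended up splitting my dataset into two parts and reading it in that way, but your suggestion helped me understand the underlying problem and think outside the box. I'll mark it as the answer, thanks again :)

[Answer 1]: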

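You need some sort of data generator, because your data is too large to fit directly into tf.data.Dataset.from_tensor_slices. I don't have your dataset, but here is an example of how you could fetch batches of data and train your model inside a custom training loop. The data is an NPZ NumPy archive from here: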

import random

import numpy as np

def load_data(file='dsprites_ndarray_co1sh3sc6or40x32y32_64x64.npz'):
    dataset_zip = np.load(file, encoding='latin1')

    images = dataset_zip['imgs']
    latents_classes = dataset_zip['latents_classes']

    return images, latents_classes

def get_batch(indices, train_images, train_categories):
    # Column 1 of latents_classes holds the shape class, used here as the label.
    shapes_as_categories = np.array([train_categories[i][1] for i in indices])
    images = np.array([train_images[i] for i in indices])

    # Images become (N, 64, 64, 1) and labels (N, 1), both cast to float32.
    return [images.reshape((images.shape[0], 64, 64, 1)).astype('float32'),
            shapes_as_categories.reshape(shapes_as_categories.shape[0], 1).astype('float32')]

# Load your data once
train_images, train_categories = load_data()
indices = list(range(train_images.shape[0]))
random.shuffle(indices)

epochs = 2000
batch_size = 256
total_batch = train_images.shape[0] // batch_size

for epoch in range(epochs):
    for i in range(total_batch):
        batch_indices = indices[batch_size * i: batch_size * (i + 1)]
        batch = get_batch(batch_indices, train_images, train_categories)
        ...
        ...
        # Train your model with this batch.
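
The question mentions numpy's memmap as an option that could not be made to run. As a rough sketch only (not the answerer's code), the batch loop above could read from memory-mapped .npy files, so that only the rows of the current batch are loaded into RAM. The helper iter_batches, the demo file names, and the small random arrays are assumptions made for illustration; the shapes follow the 96-entry boolean features and 640-value uint8 labels described in the comments.

```python
import os
import tempfile

import numpy as np


def iter_batches(features, labels, batch_size, rng):
    """Yield shuffled (features, labels) float32 batches.

    `features` and `labels` may be np.memmap objects, in which case only
    the rows of the current batch are read from disk and converted.
    (Hypothetical helper, not part of the original answer.)
    """
    order = rng.permutation(len(features))
    for start in range(0, len(order), batch_size):
        # Sorted indices keep the disk reads roughly sequential.
        idx = np.sort(order[start:start + batch_size])
        x = np.asarray(features[idx], dtype=np.float32)
        # Scale per batch instead of dividing the whole array by 255.
        y = np.asarray(labels[idx], dtype=np.float32) / 255.0
        yield x, y


# Hypothetical stand-in files; the real features.npy / groundtruth.npy
# from the question would be opened the same way.
tmp_dir = tempfile.mkdtemp()
feat_path = os.path.join(tmp_dir, "features_demo.npy")
label_path = os.path.join(tmp_dir, "groundtruth_demo.npy")
np.save(feat_path, np.random.rand(1000, 96) > 0.5)  # boolean features
np.save(label_path, np.random.randint(0, 256, (1000, 640), dtype=np.uint8))

# mmap_mode="r" maps the file read-only instead of loading it into RAM.
features = np.load(feat_path, mmap_mode="r")
labels = np.load(label_path, mmap_mode="r")

rng = np.random.default_rng(0)
batches = list(iter_batches(features, labels, batch_size=64, rng=rng))
print(len(batches), batches[0][0].shape, batches[0][1].dtype)
```

Each yielded pair can be passed to the training step in place of get_batch. A generator like this could probably also be wrapped with tf.data.Dataset.from_generator and a matching output_signature, although that was not tried in this thread. Whether memmap slows training down depends on the disk and on how much of the file the operating system keeps cached; reads from disk are generally slower than reads from an array that is already in RAM.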

[Comments]:

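Thanks for the quick answer, it is actually training now... My RAM (32GB) is almost full though, which slows down the training, even though features and labels combined take up far less than 3GB on disk. Can you think of a reason?

How large is your batch size?

I am currently training with a batch size of 64, where each feature vector is a 1-D boolean array with 96 entries and each label vector is a 1-D array of 640 uint8 values.

You might have to lower your batch size, but it is hard to say what exactly the cause is. I just wanted to point you in the right direction.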
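
One possible contributor to the RAM usage, offered as a guess that the thread does not confirm: if the labels are still loaded with `/ 255` as in the question, NumPy's true division turns the uint8 array into float64, which takes 8 bytes per value instead of 1. Under that assumption, a uint8 file of a bit over 2.44GB would occupy roughly 19.5GB once divided. The small array below is a stand-in with the 640-value label width mentioned above.

```python
import numpy as np

# Small stand-in for the uint8 labels (640 values per sample).
labels_u8 = np.zeros((1000, 640), dtype=np.uint8)

# True division of an integer array produces float64.
as_float64 = labels_u8 / 255

# Casting first keeps the result in float32 (4 bytes per value).
as_float32 = labels_u8.astype(np.float32) / 255

print(labels_u8.dtype, labels_u8.nbytes)    # uint8 640000
print(as_float64.dtype, as_float64.nbytes)  # float64 5120000
print(as_float32.dtype, as_float32.nbytes)  # float32 2560000
```

Keeping the array as uint8 and scaling each batch after slicing it would avoid holding the enlarged copy in memory at all.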
