Keras custom data generator for large hdf5 file which does not fit into memory

Posted 2023-03-11

【Title】: Keras custom data generator for large hdf5 file which does not fit into memory 【Posted】: 2018-04-14 01:21:08 【Question】:

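I'm trying to classify the food-101 dataset with a pretrained InceptionV3 model. The dataset contains food images in 101 categories, 1000 per category. So far I have preprocessed it into a single hdf5 file (I assume this is beneficial compared with loading images on the go while training) that holds the image and label tables (the table listing itself is missing from this copy).

The data split is the standard 70% train, 20% validation, 10% test, so for example valid_img has a size of 20200*299*299*3. The labels are one-hot encoded for Keras, so valid_labels has a size of 20200*101.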

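This hdf5 file is 27.1 GB, so it does not fit into my memory. (I have 8 GB, although realistically only 4-5 GB is available while running Ubuntu. Also, my GPU is a GTX 960 with 2 GB of VRAM, and so far it looks like 1.5 GB of it is available to python when I try to start the training script.) I'm using the TensorFlow backend.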
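As a rough check of those numbers, the sketch below computes array sizes from the shapes given above. It assumes the images are stored as 8-bit integers (1 byte per value), which the post does not state but which matches the reported file size. The helper names are made up for illustration.

```python
def image_bytes(height=299, width=299, channels=3, bytes_per_value=1):
    """Size of one image in bytes (bytes_per_value=1 assumes uint8 storage)."""
    return height * width * channels * bytes_per_value


def batch_bytes(n_images, bytes_per_value=1):
    """Size of n_images images in bytes."""
    return n_images * image_bytes(bytes_per_value=bytes_per_value)


if __name__ == "__main__":
    print(batch_bytes(101_000) / 1e9)                 # whole dataset, in GB
    print(batch_bytes(706) / 1e6)                     # one 706-image chunk as uint8, in MB
    print(batch_bytes(706, bytes_per_value=4) / 1e6)  # same chunk cast to float32, in MB
```

Under that assumption the whole dataset comes to about 27.1 GB, in line with the file size, while a 706-image chunk is about 189 MB as uint8 and about 757 MB once cast to float32. Reading in chunks is therefore feasible on this machine, but loading a full split is not.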

My first idea was to use model.train_on_batch() with a double nested for loop like this:

import h5py

# Loading InceptionV3, adding my fully connected layers, compiling model...

dataset = h5py.File('/home/uzoltan/PycharmProjects/food-101/food-101_299x299.hdf5', 'r')
epochs = 50
for epoch in range(epochs):
    for i in range(100):  # 1000 images can fit in the memory easily, this could probably be range(10) too
        train_images = dataset["train_img"][i * 706:(i + 1) * 706, ...]
        train_labels = dataset["train_labels"][i * 706:(i + 1) * 706, ...]
        # Loaded but never used: train_on_batch() has no validation argument
        val_images = dataset["valid_img"][i * 202:(i + 1) * 202, ...]
        val_labels = dataset["valid_labels"][i * 202:(i + 1) * 202, ...]
        model.train_on_batch(x=train_images, y=train_labels, class_weight=None,
                             sample_weight=None)

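My problem with this approach is that train_on_batch offers no support for validation, or for shuffling the batches so that they come in a different order every epoch.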

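So I looked at model.fit_generator(), which has the nice property of providing the same functionality as fit(). On top of that, with the built-in ImageDataGenerator you can do image augmentation (rotations, horizontal flips, etc.) on the CPU while the model trains, which can make the model more robust. My problem is that, if I understand correctly, the ImageDataGenerator.flow(x, y) method needs all the samples and labels at once, but my training/validation data will not fit into my RAM.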

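This is where I think a custom data generator comes in, but after looking extensively at the examples I could find on the Keras GitHub/Issues page, I still don't get how to implement a custom generator that reads data in batches from my hdf5 file. Can someone provide me with a good example or some pointers? How can I couple a custom batch generator with image augmentation? Or is it easier to implement some kind of manual validation and batch shuffling for train_on_batch()? If so, I could use some pointers there too.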
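On the last point, one way to get a different batch order each epoch with train_on_batch() is to shuffle the order of the slices rather than the individual samples, so that every read from the hdf5 file stays contiguous. A minimal sketch, where shuffled_slices is a hypothetical helper name and the dataset names in the comments are the ones from the loop above:

```python
import random


def shuffled_slices(n_samples, batch_size, seed=None):
    """Return (start, stop) pairs that cover n_samples, in a random order."""
    slices = [(start, min(start + batch_size, n_samples))
              for start in range(0, n_samples, batch_size)]
    random.Random(seed).shuffle(slices)
    return slices


# Intended use with the objects from the question (not executed here):
#
# n_train = dataset["train_img"].shape[0]
# for epoch in range(epochs):
#     for start, stop in shuffled_slices(n_train, 706):
#         model.train_on_batch(dataset["train_img"][start:stop],
#                              dataset["train_labels"][start:stop])
```

This only changes the order in which chunks are visited; the samples inside each chunk stay together. Validation could be handled in the same spirit by calling model.test_on_batch() on slices of the validation set after each epoch and averaging the results.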

【Comments on the question】:

Why can't you simply extract all the files into separate directories and use the flow_from_directory function?

【Answer 1】:

For anyone still looking for an answer, I made the following "crude wrapper" around ImageDataGenerator's apply_transform method.

import numpy as np
from numpy.random import uniform, randint
from tensorflow.keras.preprocessing.image import ImageDataGenerator


class CustomImagesGenerator:
    def __init__(self, x, zoom_range, shear_range, rescale, horizontal_flip, batch_size):
        self.x = x  # array-like with .shape and slicing, e.g. an h5py dataset
        self.zoom_range = zoom_range
        self.shear_range = shear_range
        self.rescale = rescale
        self.horizontal_flip = horizontal_flip
        self.batch_size = batch_size
        self.__img_gen = ImageDataGenerator()
        self.__batch_index = 0

    def __len__(self):
        # steps_per_epoch, if unspecified, will use the len(generator) as a number of steps.
        # Integer division, because len() must return an int (np.floor returns a float).
        return self.x.shape[0] // self.batch_size

    def __iter__(self):
        return self

    def next(self):
        return self.__next__()

    def __next__(self):
        start = self.__batch_index*self.batch_size
        stop = start + self.batch_size
        self.__batch_index += 1
        if stop > len(self.x):
            # Single pass over the data; the batch index is not reset afterwards
            raise StopIteration
        transformed = np.array(self.x[start:stop])  # loads from hdf5
        for i in range(len(transformed)):
            # Same zoom factor on both axes keeps the aspect ratio
            zoom = uniform(self.zoom_range[0], self.zoom_range[1])
            transformations = {
                'zx': zoom,
                'zy': zoom,
                'shear': uniform(-self.shear_range, self.shear_range),
                'flip_horizontal': self.horizontal_flip and bool(randint(0,2))
            }
            transformed[i] = self.__img_gen.apply_transform(transformed[i], transformations)
        return transformed * self.rescale

It can be called like this:

import h5py
f = h5py.File("my_heavy_dataset_file.hdf5", 'r')
images = f['mydatasets/images']

my_gen = CustomImagesGenerator(
    images, 
    zoom_range=[0.8, 1], 
    shear_range=6, 
    rescale=1./255, 
    horizontal_flip=True, 
    batch_size=64
)

model.fit_generator(my_gen)
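If the generator is turned into a Sequence subclass, as one of the comments below tries, Keras calls len() on it and then asks for batches by index. Python's len() raises a TypeError of the kind quoted in that comment whenever __len__ returns a float, and np.floor returns a float. The sketch below shows the two methods a Sequence needs, with an integer length. BatchedSlices is a hypothetical name, it deliberately does not import Keras, and it leaves out the augmentation step.

```python
class BatchedSlices:
    """Index-addressable batches over any sliceable array-like (hypothetical helper).

    To plug this into Keras it would subclass keras.utils.Sequence, which
    requires exactly these two methods: __len__ and __getitem__.
    """

    def __init__(self, x, batch_size):
        self.x = x
        self.batch_size = batch_size

    def __len__(self):
        # Floor division keeps the result an int; len() rejects floats.
        return len(self.x) // self.batch_size

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        start = idx * self.batch_size
        # For an h5py dataset this slice is what actually reads from disk.
        return self.x[start:start + self.batch_size]
```

Because batches are addressed by index rather than consumed in sequence, there is no internal counter to reset between epochs.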

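【Discussion】:

I can't really try this out anymore in order to accept it, this was 1.5 years ago for me ^^ But thanks, hopefully it will help someone :)

@Sam Your code looks very promising! I tried it, but it gives me the error "'CustomImagesGenerator' object has no attribute 'shape'". You can see how I implemented your code here: colab.research.google.com/drive/…

@NeStack I would say you have to add a @property method for shape, like so. Let me know if it still doesn't work.

@Sam When I do from tensorflow.python.keras.utils.data_utils import Sequence and then change your code to class CustomImagesGenerator(Sequence):, the error message disappears. Is this legit? After that the code runs further and then fails with the error message "...File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/utils/data_utils.py", line 742, in _run sequence = list(range(len(self.sequence))) TypeError: 'numpy.float64' object cannot be interpreted as an integer". Any idea what is wrong?

@Sam Can you help me with the error message above?

【Answer 2】: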

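If I understand you correctly, you want to use the data from HDF5 (which does not fit into memory) and apply data augmentation to it at the same time.

I was in the same situation as you, and I found this code, which may be helpful with some modifications: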

https://gist.github.com/wassname/74f02bc9134897e3fe4e60784f5aaa15

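【Discussion】:

Just for future readers: the code linked above will load the entire HDF5 dataset into memory. The HDF5Matrix class does not load it, but ImageDataGenerator loads the whole thing when .flow is called.

【Answer 3】: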

Here is my solution for shuffling the data every epoch with an h5 file. indices is the list of train or val indices.

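import json

import h5py
import numpy as np


def generator(h5path, indices, batchSize=128, is_train=True, aug=None):

    db = h5py.File(h5path, "r")
    # Per-channel mean values, computed beforehand and stored in mean.json
    with open("mean.json") as f:
        mean = json.load(f)
    meanV = np.array([mean["R"], mean["G"], mean["B"]])

    while True:  # loop forever; Keras stops each epoch after steps_per_epoch batches

        np.random.shuffle(indices)  # new sample order every epoch (shuffles in place)
        for i in range(0, len(indices), batchSize):
            batch_indices = indices[i:i+batchSize]
            batch_indices.sort()  # h5py needs the indices in increasing order

            by = db["labels"][batch_indices,:]
            bx = db["images"][batch_indices,:,:,:]

            # Subtract the channel means in place; an integer dataset may need
            # bx = bx.astype("float32") first
            bx[:,:,:,0] -= meanV[0]
            bx[:,:,:,1] -= meanV[1]
            bx[:,:,:,2] -= meanV[2]

            if is_train:

                #bx = random_crop(bx, (224,224))
                if aug is not None:
                    # aug is an ImageDataGenerator; augment only this batch
                    bx,by = next(aug.flow(bx,by,batchSize))

            yield (bx,by)


# train_indices, test_indices, batchSize, aug, args, checkpoint and early_stop
# are defined elsewhere in the training script
h5path='all_224.hdf5'
model.fit_generator(generator(h5path, train_indices, batchSize=batchSize, is_train=True, aug=aug),
                steps_per_epoch = 20000//batchSize,
                validation_data= generator(h5path, test_indices, is_train=False, batchSize=batchSize),
                validation_steps = 2424//batchSize,
                epochs=args.epoch,
                max_queue_size=100,
                callbacks=[checkpoint, early_stop])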
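The index handling in this answer can be looked at on its own: the indices are shuffled once per epoch, cut into batches, and each batch is sorted because h5py only accepts index lists in increasing order. A minimal sketch using only the standard library, with shuffled_index_batches as a hypothetical helper name:

```python
import random


def shuffled_index_batches(indices, batch_size, seed=None):
    """Yield one epoch of index batches: shuffled across batches, sorted within each."""
    rng = random.Random(seed)
    order = list(indices)  # copy, so the caller's sequence is left untouched
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        # Sorting does not undo the shuffle: which samples share a batch is
        # still random, only their order inside the batch is fixed.
        yield sorted(order[start:start + batch_size])
```

Each yielded list could then be used the way batch_indices is used above, for example db["images"][batch, :, :, :]. Unlike contiguous slicing, this reads scattered rows from the file, which may be slower depending on how the dataset is laid out on disk.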

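【Discussion】:

【Answer 4】:

You want to write a function that loads images from the HDF5 file and then yields (not returns) them as a numpy array. Here is a simple example that uses OpenCV to load images directly from .png/.jpg files in a given directory: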

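import os
import random

import cv2
import numpy as np

INPUT_SHAPE = (299, 299)  # assumed (width, height) for cv2.resize; not given in the original


def generate_data(directory, batch_size):
    """Replaces Keras' native ImageDataGenerator."""
    i = 0
    file_list = os.listdir(directory)
    while True:
        image_batch = []
        for b in range(batch_size):
            if i == len(file_list):
                # End of the file list: start over in a new random order
                i = 0
                random.shuffle(file_list)
            sample = file_list[i]
            i += 1
            # os.listdir() returns bare file names, so join them with the directory
            image = cv2.resize(cv2.imread(os.path.join(directory, sample)), INPUT_SHAPE)
            image_batch.append((image.astype(float) - 128) / 128)  # scale to roughly [-1, 1]

        # Only images are yielded here; for training, yield (inputs, targets) tuples
        yield np.array(image_batch)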

Obviously, you will have to modify it to read from HDF5 instead.

Once you have written the function, usage is simple:

data_dir = os.path.expanduser('~/my_data')  # os.listdir() does not expand '~'
model.fit_generator(
    generate_data(data_dir, batch_size),
    steps_per_epoch=len(os.listdir(data_dir)) // batch_size)

Again, modify this to reflect the fact that you are reading from HDF5 rather than from a directory.
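One possible shape for that modification is sketched below. It is not the answerer's code: generate_batches is a hypothetical name, and it is written against any pair of aligned, sliceable arrays so that it runs without an HDF5 file. The dataset names in the trailing comment are the ones used in the question.

```python
def generate_batches(images, labels, batch_size):
    """Endlessly yield (images, labels) batches from two aligned, sliceable arrays."""
    n = len(images)
    if batch_size > n:
        raise ValueError("batch_size is larger than the number of samples")
    while True:
        # The final partial batch is dropped so every batch has the same size.
        for start in range(0, n - batch_size + 1, batch_size):
            stop = start + batch_size
            # With h5py datasets, each slice reads only these rows from disk.
            yield images[start:stop], labels[start:stop]


# Intended use with an HDF5 file (not executed here):
#
# import h5py
# f = h5py.File("food-101_299x299.hdf5", "r")
# gen = generate_batches(f["train_img"], f["train_labels"], 64)
# model.fit_generator(gen, steps_per_epoch=len(f["train_img"]) // 64)
```

The sketch reads batches in file order and does no shuffling or augmentation; both would have to be added on top.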

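【Discussion】:

There is a function in keras designated for this.

Yes, but the OP asked for an example of how to write a custom data generator for a use case not covered by that function. This answers that question. You are right that they would probably be better off taking the images out of HDF5 and using flow_from_directory.

No, he didn't mention flow_from_directory even once. He mentioned loading the images from h5 and then using flow.

@Jeff Alan And any pointers on how to include image augmentation in the custom generator?

@MarcinMożejko If all else fails I might try the flow_from_directory function. It is not my preferred choice for 2 reasons: I think reading the arrays directly is faster, and the food-101 data source only contains subdirectories for the categories, so I would have to write extra code to split each category's 1000 images 3 ways.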
