Using Keras with HDF5Matrix with labels only
Posted: 2018-06-08 22:45:35

I believe this is my first question on Stack Overflow, so I apologize in advance if I haven't followed all the guidelines. I recently started using Keras for deep learning, and since I already handle large datasets in HDF5 files with h5py, I looked for a way to train Keras models on very large HDF5 files. I found that the most common approach is to use HDF5Matrix from keras.utils.io_utils.
I modified one of the Keras examples (mnist_cnn) as follows:
'''Trains a simple convnet on the MNIST dataset.
Gets to 99.25% test accuracy after 12 epochs
(there is still a lot of margin for parameter tuning).
16 seconds per epoch on a GRID K520 GPU.
'''
from __future__ import print_function
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K
# My Imports
from os.path import exists
import h5py
from keras.utils.io_utils import HDF5Matrix
batch_size = 128
num_classes = 10
epochs = 12
# input image dimensions
img_rows, img_cols = 28, 28
# the data, shuffled and split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()
if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')
# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
#-----------------------------------HDF5 files creation---------------------------------------
sample_file_name = "x.hdf5"
solution_file_name = "y.hdf5"
train_name = "train"
test_name = "test"
#Create dataset
if (not exists(sample_file_name)) and (not exists(solution_file_name)):
    samples_file = h5py.File(sample_file_name, mode='a')
    solutions_file = h5py.File(solution_file_name, mode='a')
    samples_train = samples_file.create_dataset(train_name, data=x_train)
    samples_test = samples_file.create_dataset(test_name, data=x_test)
    solution_train = solutions_file.create_dataset(train_name, data=y_train)
    solution_test = solutions_file.create_dataset(test_name, data=y_test)
    samples_file.flush()
    samples_file.close()
    solutions_file.flush()
    solutions_file.close()
x_train = HDF5Matrix(sample_file_name, train_name)
x_test = HDF5Matrix(sample_file_name, test_name)
y_train = HDF5Matrix(solution_file_name, train_name)
y_test = HDF5Matrix(solution_file_name, test_name)
#---------------------------------------------------------------------------------------------
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])
# If using HDF5Matrix one needs to disable shuffle
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test),
          shuffle=False)
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
However, something about this bothers me. In segmentation problems, or in multi-class problems where the number of classes is very large, saving the solutions in categorical format is very wasteful. Moreover, it means that as soon as a new class is added, the whole dataset has to be changed accordingly. That is why I thought of using the normalizer feature of HDF5Matrix, like this:
'''Trains a simple convnet on the MNIST dataset.
Gets to 99.25% test accuracy after 12 epochs
(there is still a lot of margin for parameter tuning).
16 seconds per epoch on a GRID K520 GPU.
'''
from __future__ import print_function
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K
# My Imports
from os.path import exists
import h5py
from keras.utils.io_utils import HDF5Matrix
batch_size = 128
num_classes = 10
epochs = 12
# input image dimensions
img_rows, img_cols = 28, 28
# the data, shuffled and split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()
if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')
#-----------------------------------HDF5 files creation---------------------------------------
sample_file_name = "x.hdf5"
solution_file_name = "y.hdf5"
train_name = "train"
test_name = "test"
#Create dataset
if (not exists(sample_file_name)) and (not exists(solution_file_name)):
    samples_file = h5py.File(sample_file_name, mode='a')
    solutions_file = h5py.File(solution_file_name, mode='a')
    samples_train = samples_file.create_dataset(train_name, data=x_train)
    samples_test = samples_file.create_dataset(test_name, data=x_test)
    solution_train = solutions_file.create_dataset(train_name, data=y_train)
    solution_test = solutions_file.create_dataset(test_name, data=y_test)
    samples_file.flush()
    samples_file.close()
    solutions_file.flush()
    solutions_file.close()
x_train = HDF5Matrix(sample_file_name, train_name)
x_test = HDF5Matrix(sample_file_name, test_name)
y_train = HDF5Matrix(solution_file_name, train_name,
                     normalizer=lambda solution: keras.utils.to_categorical(solution, num_classes))
y_test = HDF5Matrix(solution_file_name, test_name,
                    normalizer=lambda solution: keras.utils.to_categorical(solution, num_classes))
#---------------------------------------------------------------------------------------------
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])
# If using HDF5Matrix one needs to disable shuffle
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test),
          shuffle=False)
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
However, this raises an error that implies the shape of the solutions must match, and that the normalizer should not be used that way:
ValueError: Error when checking target: expected dense_2 to have 2 dimensions, but got array with shape (60000, 1, 10)
So, is there a way to store the data in HDF5 (or, if that is impossible, in some other format) and use Keras with the labels saved as plain labels (rather than categorical vectors), without turning the problem into a regression?
Answer 1:

You are getting this error because of these lines in the Keras source.

Keras checks the input shapes before training. The problem is that HDF5Matrix returns the pre-normalization shape when you call .shape, so Keras believes it has a (60000,) array for y_train and a (10000,) array for y_test.

However, when a slice of the matrix is accessed, the normalizer is applied, so that e.g. y_train[5:7].shape does have the final expected shape of (2, 10).

This is mostly because the normalizer isn't really expected to change the shape, even though Keras could in principle handle this case.
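As an illustration, here is a minimal sketch (assuming the y.hdf5 file with integer labels created in the second listing above) that prints both shapes side by side:

import keras
from keras.utils.io_utils import HDF5Matrix

num_classes = 10
y_train = HDF5Matrix("y.hdf5", "train",
                     normalizer=lambda s: keras.utils.to_categorical(s, num_classes))
print(y_train.shape)       # (60000,) - raw on-disk shape, what Keras checks before training
print(y_train[5:7].shape)  # (2, 10)  - normalized shape, what slicing actually returns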
You can fix it by using fit_generator instead of fit, so that training only ever sees normalized data:
def generator(features, labels, size):
    while True:
        start, end = 0, size
        while end < len(features):
            s = slice(start, end)
            # you can actually do the normalization here if you want
            yield features[s], labels[s]
            start, end = end, end + size
model.fit_generator(
    generator(x_train, y_train, batch_size),
    steps_per_epoch=len(x_train) // batch_size,
    epochs=1,
    verbose=1,
    validation_data=generator(x_test, y_test, batch_size),
    validation_steps=len(x_test) // batch_size,
    shuffle=False)
Note that you can do any kind of normalization inside the generator function, and it will be transparent to Keras. You can also use different batch sizes for training and validation.
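For instance, with the labels stored in the HDF5 file as plain integers and y_train / y_test created as HDF5Matrix objects without a normalizer, the one-hot conversion could move into the generator, one batch at a time. A sketch along those lines (categorical_generator is a hypothetical name; to_categorical is the standard Keras utility):

import keras

def categorical_generator(features, labels, size, num_classes):
    # same slicing scheme as the generator above, but the integer
    # labels of each batch are one-hot encoded on the fly, so the
    # full categorical matrix never exists in memory or on disk
    while True:
        start, end = 0, size
        while end < len(features):
            s = slice(start, end)
            yield features[s], keras.utils.to_categorical(labels[s], num_classes)
            start, end = end, end + size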
Also, you have to change the evaluation in the same way:
score = model.evaluate_generator(
    generator(x_test, y_test, batch_size),
    steps=len(x_test) // batch_size)
By the way, I think your solution with the normalizer is a good idea.
Comments:
Thank you for answering my question so quickly. Unfortunately, I have one small problem with this solution. I am working with a very large dataset, and I plan to migrate most of my code to a segmentation problem. The y dataset will consist of images and will therefore take up a lot of memory. Wouldn't your suggested solution cause the entire y dataset to be loaded into memory (possibly causing an overflow)?

@DolevShapira Indeed, I was afraid that might be the case. I have edited my solution to use a generator, so you won't run into that problem.

Thank you very much for your help! I thought I might be able to avoid fit_generator because of the complexity it introduces just for HDF5Matrix. Apparently it was unavoidable after all.

Good luck! If you agree with it, please mark the solution as correct.