Resume Training tf.keras Tensorboard

Posted 2023-03-27

【Title】: Resume Training tf.keras Tensorboard 【Published】: 2021-08-14 18:36:51 【Question】:

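I am having trouble resuming training of my model while visualizing its progress in TensorBoard.

My question is: how can I resume training from the same step without manually specifying an epoch? Ideally, I would just load the saved model, and it would read global_step from the saved optimizer and continue training from there.

I have provided some code below that reproduces a similar error.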

import tensorflow as tf
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.models import load_model

mnist = tf.keras.datasets.mnist

(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(512, activation=tf.nn.relu),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=10, callbacks=[TensorBoard()])
model.save('./final_model.h5', include_optimizer=True)

del model

model = load_model('./final_model.h5')
model.fit(x_train, y_train, epochs=10, callbacks=[TensorBoard()])

You can run TensorBoard with the following command:

tensorboard --logdir ./logs

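【Comments】:

Even though you loaded the model, TensorFlow starts processing the metrics from the beginning again. That is because the epoch counter restarts at 0 instead of continuing from where the previous run ended.

【Answer 1】: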

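You can set the initial_epoch parameter of model.fit() to the epoch number from which you want training to start. Keep in mind that the model trains until the epoch with index epochs is reached (epochs is not the number of iterations to run). In your example, if you want to train for 10 more epochs, it should be: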

# The first run covered epoch indices 0-9, so training resumes at index 10.
model.fit(x_train, y_train, initial_epoch=10, epochs=20, callbacks=[TensorBoard()])

This lets you visualize your plots in TensorBoard correctly. For more details on these parameters, see the docs.
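For readers who would rather not hard-code the epoch, the arithmetic can be derived from the optimizer's step counter. The sketch below is only an illustration: the helper names `steps_per_epoch` and `resume_fit_args` are hypothetical, it assumes training on in-memory arrays with the default batch size of 32, and it assumes the reloaded model's optimizer step count (`model.optimizer.iterations`) was actually restored, which may depend on the TensorFlow version and save format.

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int = 32) -> int:
    """Batches per epoch for in-memory arrays (the last batch may be partial)."""
    return math.ceil(num_samples / batch_size)


def resume_fit_args(optimizer_steps: int, num_samples: int,
                    extra_epochs: int, batch_size: int = 32) -> dict:
    """Derive the initial_epoch/epochs arguments for fit() from a step count."""
    completed = optimizer_steps // steps_per_epoch(num_samples, batch_size)
    # epochs is the final epoch index, not the number of epochs to run.
    return {"initial_epoch": completed, "epochs": completed + extra_epochs}


# MNIST: 60000 samples at batch size 32 gives 1875 steps per epoch,
# so 10 finished epochs correspond to 18750 optimizer steps.
args = resume_fit_args(optimizer_steps=18750, num_samples=60000, extra_epochs=10)
print(args)  # {'initial_epoch': 10, 'epochs': 20}

# Hypothetical use with a reloaded model (assumes optimizer state was restored):
#   steps = int(model.optimizer.iterations.numpy())
#   model.fit(x_train, y_train, callbacks=[TensorBoard()],
#             **resume_fit_args(steps, len(x_train), extra_epochs=10))
```

If the step counter turns out not to be restored in your setup, the computed initial_epoch would be 0, so it is worth printing the value before relying on it.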

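【Comments】:

I know of this approach, but I would prefer one that does not require me to specify the epoch. I don't think this should be the only way to do it.

I don't see a problem with specifying the epoch myself: when you leave it out, it is zero only because zero is the default.

When loading a model from a checkpoint, training does start from the saved point, but unfortunately you still have to specify the initial_epoch parameter.

【Answer 2】: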

Here is sample code in case anyone needs it. It implements the idea proposed by Abhinav Anand:

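from glob import glob
from os.path import join

from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard
from tensorflow.keras.models import load_model

# fold_dir (the checkpoint/log directory), nn.model(), x_train and y_train
# are assumed to be defined elsewhere.
# Keras fills {epoch:03d} with the 1-based epoch number, e.g. model_007.h5.
mca = ModelCheckpoint(join(fold_dir, 'model_{epoch:03d}.h5'),
                      monitor = 'loss',
                      save_best_only = False)
tb = TensorBoard(log_dir = join(fold_dir, 'logs'),
                 write_graph = True,
                 write_images = True)
# Zero-padded names sort chronologically, so the last file is the newest.
files = sorted(glob(join(fold_dir, 'model_???.h5')))
if files:
    model_file = files[-1]
    # The three digits before '.h5' are the number of completed epochs.
    initial_epoch = int(model_file[-6:-3])
    print('Resuming using saved model %s.' % model_file)
    model = load_model(model_file)
else:
    model = nn.model()
    initial_epoch = 0
model.fit(x_train,
          y_train,
          epochs = 100,
          initial_epoch = initial_epoch,
          callbacks = [mca, tb])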

Replace nn.model() with your own function that defines the model.
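The filename parsing in this answer relies on slicing fixed character positions. As a rough alternative, the lookup could be isolated in a small helper that uses a regular expression and ignores files that do not match the pattern. This is a sketch only: `latest_checkpoint` and `CHECKPOINT_RE` are hypothetical names, and the demonstration uses empty placeholder files rather than real saved models.

```python
import re
import tempfile
from pathlib import Path

# Matches names such as model_007.h5 and captures the epoch digits.
CHECKPOINT_RE = re.compile(r"model_(\d{3})\.h5$")


def latest_checkpoint(directory):
    """Return (path, epoch) for the highest-numbered checkpoint, or (None, 0)."""
    best_path, best_epoch = None, 0
    for path in Path(directory).glob("model_*.h5"):
        match = CHECKPOINT_RE.search(path.name)
        if match and int(match.group(1)) > best_epoch:
            best_path, best_epoch = path, int(match.group(1))
    return best_path, best_epoch


# Demonstration with empty placeholder files instead of real saved models.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("model_001.h5", "model_002.h5", "model_010.h5", "model_best.h5"):
        (Path(tmp) / name).touch()
    path, initial_epoch = latest_checkpoint(tmp)
    print(path.name, initial_epoch)  # model_010.h5 10
```

The returned epoch can be passed as initial_epoch, and the number of additional epochs you want can be added to it to form the epochs argument.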

【Comments】:

Using epochs = training_epochs + initial_epoch in the fit call ensures that the model trains for the specified number of epochs, rather than for only training_epochs - initial_epoch epochs.

【Answer 3】:

It is simple: create checkpoints while training the model, then use those checkpoints to resume training from where you left off.

import tensorflow as tf
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import load_model

mnist = tf.keras.datasets.mnist

(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(512, activation=tf.nn.relu),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=10, callbacks=[TensorBoard()])
model.save('./final_model.h5', include_optimizer=True)

model = load_model('./final_model.h5')

callbacks = list()

tensorboard = TensorBoard()
callbacks.append(tensorboard)

file_path = "model-{epoch:02d}-{loss:.4f}.hdf5"

# Create checkpoints and save them according to your needs.
# save_freq='epoch' saves the model after every epoch (older code used the
# deprecated `period` argument for this).
# save_weights_only=False saves the full model rather than only the weights.
checkpoints = ModelCheckpoint(file_path, monitor='loss', verbose=1, save_freq='epoch', save_weights_only=False)
callbacks.append(checkpoints)

model.fit(x_train, y_train, epochs=10, callbacks=callbacks)

After that, just load the checkpoint from which you want to resume training:

model = load_model(checkpoint_of_choice)
model.fit(x_train, y_train, epochs=10, callbacks=callbacks)

That's it.

Let me know if you have more questions about this.
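Loading a checkpoint brings back the saved model, but fit() still counts epochs from initial_epoch, which defaults to 0. One possible way to avoid typing the epoch by hand is to record it in a small state file next to the checkpoints. The sketch below is an assumption-laden illustration, not part of this answer: `save_epoch`, `load_initial_epoch` and the `train_state.json` filename are hypothetical, and the Keras wiring is shown only in comments.

```python
import json
import os
import tempfile


def save_epoch(state_path, epoch):
    """Record the number of completed epochs (epoch is Keras's 0-based index)."""
    with open(state_path, "w") as fh:
        json.dump({"completed_epochs": epoch + 1}, fh)


def load_initial_epoch(state_path):
    """Return the epoch index to resume from, or 0 when no state file exists."""
    if not os.path.exists(state_path):
        return 0
    with open(state_path) as fh:
        return json.load(fh)["completed_epochs"]


state_path = os.path.join(tempfile.mkdtemp(), "train_state.json")
print(load_initial_epoch(state_path))  # 0
save_epoch(state_path, epoch=9)        # what the callback would do after the 10th epoch
print(load_initial_epoch(state_path))  # 10

# Hypothetical wiring into Keras (not executed here):
#   tracker = tf.keras.callbacks.LambdaCallback(
#       on_epoch_end=lambda epoch, logs: save_epoch(state_path, epoch))
#   start = load_initial_epoch(state_path)
#   model.fit(x_train, y_train, initial_epoch=start, epochs=start + 10,
#             callbacks=callbacks + [tracker])
```

With this in place the TensorBoard curves of the resumed run would continue at the next epoch index instead of overlapping the earlier ones, provided both runs write to the same log directory.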

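【Comments】:

Yes, it does. It starts from the exact point where the checkpoint was saved, not from the beginning. I suggest you try it and let me know if you run into any problems. You can verify this by looking at the loss after resuming training.

@HardianLawi Here you also do not need to specify the epoch manually, only the number of epochs you want to run from there.

Your code does not work for me. I am not sure why you suggest using the ModelCheckpoint callback. If you look at the checkpoint's source code, it basically calls the model's save method, which is the same as the code I provided in the question. I am not sure how it works for you. Did you re-instantiate the callbacks?

Because you cannot call model.save during training. Your question clearly states that you want to start from where you left off, and ModelCheckpoint does give you more flexibility. Yes, you are right, it does call model.save, because that is the only way to save the model.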
