Training a Keras model yields multiple optimizer errors

Posted: 2019-12-24 18:37:15

Question:

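I need to retrain Tiny YOLO on my own dataset. The model I am using can be found here: keras-yolo3.

When I start training I get multiple optimizer errors; I have included the error output below to avoid confusion. I also noticed that training is very slow even though it should be using the GPU, and after some digging I found that the GPU is not being used for training. I should note that the GPU is used when I train another, smaller network that I was learning with, so everything is set up correctly on that side, and there were no errors of this kind during that training.

Is the slow, apparently CPU-bound training caused by the errors above? Does anyone know how I can fix this?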

Using TensorFlow backend.
WARNING: Logging before flag parsing goes to stderr.
2019-08-19 09:45:08.057713: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library nvcuda.dll
2019-08-19 09:45:08.264577: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate(GHz): 1.8475
pciBusID: 0000:01:00.0
2019-08-19 09:45:08.270723: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2019-08-19 09:45:08.275827: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-08-19 09:45:09.214197: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-08-19 09:45:09.217605: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0
2019-08-19 09:45:09.219777: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N
2019-08-19 09:45:09.222399: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4712 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1)
Create Tiny YOLOv3 model with 6 anchors and 80 classes.
Load weights model_data/tiny_yolo_weights.h5.
Freeze the first 42 layers of total 44 layers.
Train on 8298 samples, val on 922 samples, with batch size 32.
Epoch 1/50
2019-08-19 09:45:19.742610: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] shape_optimizer failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)
2019-08-19 09:45:19.781035: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] remapper failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)
2019-08-19 09:45:19.935930: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] layout failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)
2019-08-19 09:45:20.168936: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] shape_optimizer failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)
2019-08-19 09:45:20.205304: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] remapper failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)
258/259 [============================>.] - ETA: 3s - loss: 41.82962019-08-19 10:01:51.053474: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] remapper failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)
2019-08-19 10:01:51.138957: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] layout failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)
2019-08-19 10:01:51.243888: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] remapper failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)
259/259 [==============================] - 1078s 4s/step - loss: 41.8008 - val_loss: 35.7122
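
As a rough sanity check on the log above, the step count and per-step time can be recomputed from the numbers it prints (8298 training samples, batch size 32, 1078 s for the epoch). This is only a sketch: the variable names are made up for illustration, and floor division is an assumption chosen because it reproduces the 259 steps shown in the progress bar.

```python
# Numbers copied from the training log; variable names are hypothetical.
num_train = 8298      # "Train on 8298 samples"
batch_size = 32       # "with batch size 32"
epoch_seconds = 1078  # "1078s 4s/step"

# Assumption: steps per epoch is the floor of samples / batch size,
# which matches the 259/259 progress bar in the log.
steps_per_epoch = num_train // batch_size

seconds_per_step = epoch_seconds / steps_per_epoch
images_per_second = batch_size / seconds_per_step

print(steps_per_epoch, round(seconds_per_step, 2), round(images_per_second, 1))
```

Roughly four seconds per batch of 32 images is consistent with the impression that training is slow, although this figure alone does not show whether the GPU is in use.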

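Comments:

Which version of tensorflow are you using?

What does the 'nvidia-smi' command show during training?

@ravikt I am using tensorflow version 1.14.0 (the stable release at the time).

@AshwinGeetD'Sa Unfortunately, the PC I train on has a problem, so I currently cannot start training and run the command you asked for.

What tells you that the GPU is not being used for training? Looking at the log, it seems the GPU is actually being used. Someone here appears to have the same problem as you and found a hacky solution: github.com/qqwweee/keras-yolo3/issues/…

Answer 1: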

I found the solution here: https://github.com/tensorflow/tensorrt/issues/118

You have to change these lines (140/141) in yolo3/model.py:

box_xy = (K.sigmoid(feats[..., :2]) + grid) / K.cast(grid_shape[::-1], K.dtype(feats))
box_wh = K.exp(feats[..., 2:4]) * anchors_tensor / K.cast(input_shape[::-1], K.dtype(feats))

to:

box_xy = (K.sigmoid(feats[..., :2]) + grid) / K.cast(grid_shape[...,::-1], K.dtype(feats))
box_wh = K.exp(feats[..., 2:4]) * anchors_tensor / K.cast(input_shape[...,::-1], K.dtype(feats))

In my case, it also helped to reduce the batch size from 8 to 4.
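
The change above only swaps `[::-1]` for `[...,::-1]`. A minimal sketch of why that does not alter the computed values, assuming `grid_shape` and `input_shape` are 1-D tensors holding two sizes (that is an assumption, not something shown in this thread). NumPy is used here as a stand-in for the Keras backend, and the array values are made up:

```python
import numpy as np

# Hypothetical stand-in for grid_shape: a 1-D array holding (height, width).
grid_shape = np.array([13, 26])

# Original form: reverse along the first (and only) axis.
reversed_plain = grid_shape[::-1]

# Form used in the fix: reverse along the last axis, written with an ellipsis.
reversed_ellipsis = grid_shape[..., ::-1]

# For a 1-D array the first and last axis are the same, so both agree.
same = np.array_equal(reversed_plain, reversed_ellipsis)
print(reversed_plain, reversed_ellipsis, same)
```

Since both forms reverse the same two elements, the fix presumably matters only for how TensorFlow's graph optimizer handles the slice operation, not for the resulting box coordinates.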

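Comments:

For anyone like me struggling to spot the difference: K.cast(grid_shape[::-1] has been changed to K.cast(grid_shape[...,::-1], and input_shape has been changed the same way on the second line.

@piotr-golinski Thanks! Where did you change the batch size from 8 to 4?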
