Why are the training and validation accuracies moving up and down with small gaps after the 35th epoch?

Posted: 2021-05-04 16:36:25

Question:

I have implemented this classification model using MobileNet as the base model. During training, the training and validation accuracy and loss move up and down after some epochs (note that the training accuracy starts oscillating from the 34th epoch, and then the other metrics do the same). As far as I can tell the curves look fine, but after some epochs the values fluctuate up and down. Is this normal, or do I need to change something?

Epoch 1/50
    6539/6539 [==============================] - 3379s 516ms/step - loss: 2.9090 - accuracy: 0.3196 - top3_acc: 0.4849 - top5_acc: 0.5721 - val_loss: 1.7767 - val_accuracy: 0.5191 - val_top3_acc: 0.7397 - val_top5_acc: 0.8286
    Epoch 2/50
    6539/6539 [==============================] - 3342s 511ms/step - loss: 1.7218 - accuracy: 0.5261 - top3_acc: 0.7464 - top5_acc: 0.8385 - val_loss: 1.5645 - val_accuracy: 0.5651 - val_top3_acc: 0.7857 - val_top5_acc: 0.8669
    Epoch 3/50
    6539/6539 [==============================] - 3337s 510ms/step - loss: 1.5500 - accuracy: 0.5611 - top3_acc: 0.7853 - top5_acc: 0.8693 - val_loss: 1.4635 - val_accuracy: 0.5869 - val_top3_acc: 0.8064 - val_top5_acc: 0.8816
    Epoch 4/50
    6539/6539 [==============================] - 3343s 511ms/step - loss: 1.4469 - accuracy: 0.5859 - top3_acc: 0.8040 - top5_acc: 0.8854 - val_loss: 1.3982 - val_accuracy: 0.6012 - val_top3_acc: 0.8186 - val_top5_acc: 0.8919
    Epoch 5/50
    6539/6539 [==============================] - 3348s 512ms/step - loss: 1.3882 - accuracy: 0.5966 - top3_acc: 0.8153 - top5_acc: 0.8939 - val_loss: 1.3538 - val_accuracy: 0.6126 - val_top3_acc: 0.8260 - val_top5_acc: 0.8981
    Epoch 6/50
    6539/6539 [==============================] - 3340s 511ms/step - loss: 1.3382 - accuracy: 0.6123 - top3_acc: 0.8251 - top5_acc: 0.9011 - val_loss: 1.3192 - val_accuracy: 0.6192 - val_top3_acc: 0.8326 - val_top5_acc: 0.9033
    Epoch 7/50
    6539/6539 [==============================] - 3319s 508ms/step - loss: 1.3060 - accuracy: 0.6195 - top3_acc: 0.8323 - top5_acc: 0.9052 - val_loss: 1.2918 - val_accuracy: 0.6264 - val_top3_acc: 0.8359 - val_top5_acc: 0.9070
    Epoch 8/50
    6539/6539 [==============================] - 3314s 507ms/step - loss: 1.2744 - accuracy: 0.6249 - top3_acc: 0.8383 - top5_acc: 0.9106 - val_loss: 1.2693 - val_accuracy: 0.6312 - val_top3_acc: 0.8399 - val_top5_acc: 0.9106
    Epoch 9/50
    6539/6539 [==============================] - 3316s 507ms/step - loss: 1.2547 - accuracy: 0.6323 - top3_acc: 0.8419 - top5_acc: 0.9133 - val_loss: 1.2502 - val_accuracy: 0.6359 - val_top3_acc: 0.8430 - val_top5_acc: 0.9135
    Epoch 10/50
    6539/6539 [==============================] - 3313s 507ms/step - loss: 1.2271 - accuracy: 0.6375 - top3_acc: 0.8477 - top5_acc: 0.9166 - val_loss: 1.2339 - val_accuracy: 0.6400 - val_top3_acc: 0.8461 - val_top5_acc: 0.9157
    Epoch 11/50
    6539/6539 [==============================] - 3309s 506ms/step - loss: 1.2081 - accuracy: 0.6422 - top3_acc: 0.8503 - top5_acc: 0.9196 - val_loss: 1.2203 - val_accuracy: 0.6429 - val_top3_acc: 0.8489 - val_top5_acc: 0.9169
    Epoch 12/50
    6539/6539 [==============================] - 3315s 507ms/step - loss: 1.1863 - accuracy: 0.6477 - top3_acc: 0.8550 - top5_acc: 0.9216 - val_loss: 1.2080 - val_accuracy: 0.6473 - val_top3_acc: 0.8505 - val_top5_acc: 0.9188
    Epoch 13/50
    6539/6539 [==============================] - 3329s 509ms/step - loss: 1.1789 - accuracy: 0.6497 - top3_acc: 0.8568 - top5_acc: 0.9239 - val_loss: 1.1973 - val_accuracy: 0.6500 - val_top3_acc: 0.8522 - val_top5_acc: 0.9201
    Epoch 14/50
    6539/6539 [==============================] - 3325s 508ms/step - loss: 1.1618 - accuracy: 0.6535 - top3_acc: 0.8590 - top5_acc: 0.9254 - val_loss: 1.1870 - val_accuracy: 0.6523 - val_top3_acc: 0.8546 - val_top5_acc: 0.9215
    Epoch 15/50
    6539/6539 [==============================] - 3324s 508ms/step - loss: 1.1558 - accuracy: 0.6563 - top3_acc: 0.8617 - top5_acc: 0.9262 - val_loss: 1.1783 - val_accuracy: 0.6551 - val_top3_acc: 0.8555 - val_top5_acc: 0.9229
    Epoch 16/50
    6539/6539 [==============================] - 3325s 508ms/step - loss: 1.1380 - accuracy: 0.6618 - top3_acc: 0.8647 - top5_acc: 0.9281 - val_loss: 1.1698 - val_accuracy: 0.6573 - val_top3_acc: 0.8576 - val_top5_acc: 0.9235
    Epoch 17/50
    6539/6539 [==============================] - 3331s 509ms/step - loss: 1.1260 - accuracy: 0.6622 - top3_acc: 0.8662 - top5_acc: 0.9304 - val_loss: 1.1625 - val_accuracy: 0.6590 - val_top3_acc: 0.8593 - val_top5_acc: 0.9248
    Epoch 18/50
    6539/6539 [==============================] - 3327s 509ms/step - loss: 1.1204 - accuracy: 0.6658 - top3_acc: 0.8672 - top5_acc: 0.9299 - val_loss: 1.1569 - val_accuracy: 0.6605 - val_top3_acc: 0.8600 - val_top5_acc: 0.9260
    Epoch 19/50
    6539/6539 [==============================] - 3308s 506ms/step - loss: 1.1093 - accuracy: 0.6667 - top3_acc: 0.8698 - top5_acc: 0.9334 - val_loss: 1.1495 - val_accuracy: 0.6625 - val_top3_acc: 0.8616 - val_top5_acc: 0.9263
    Epoch 20/50
    6539/6539 [==============================] - 3320s 508ms/step - loss: 1.0955 - accuracy: 0.6710 - top3_acc: 0.8726 - top5_acc: 0.9342 - val_loss: 1.1438 - val_accuracy: 0.6660 - val_top3_acc: 0.8621 - val_top5_acc: 0.9274
    Epoch 21/50
    6539/6539 [==============================] - 3362s 514ms/step - loss: 1.0892 - accuracy: 0.6724 - top3_acc: 0.8733 - top5_acc: 0.9355 - val_loss: 1.1385 - val_accuracy: 0.6667 - val_top3_acc: 0.8631 - val_top5_acc: 0.9280
    Epoch 22/50
    6539/6539 [==============================] - 3371s 515ms/step - loss: 1.0852 - accuracy: 0.6733 - top3_acc: 0.8735 - top5_acc: 0.9358 - val_loss: 1.1330 - val_accuracy: 0.6678 - val_top3_acc: 0.8643 - val_top5_acc: 0.9290
    Epoch 23/50
    6539/6539 [==============================] - 3367s 515ms/step - loss: 1.0733 - accuracy: 0.6768 - top3_acc: 0.8753 - top5_acc: 0.9367 - val_loss: 1.1284 - val_accuracy: 0.6686 - val_top3_acc: 0.8647 - val_top5_acc: 0.9293
    Epoch 24/50
    6539/6539 [==============================] - 3362s 514ms/step - loss: 1.0718 - accuracy: 0.6779 - top3_acc: 0.8768 - top5_acc: 0.9375 - val_loss: 1.1240 - val_accuracy: 0.6706 - val_top3_acc: 0.8663 - val_top5_acc: 0.9296
    Epoch 25/50
    6539/6539 [==============================] - 3374s 516ms/step - loss: 1.0589 - accuracy: 0.6805 - top3_acc: 0.8786 - top5_acc: 0.9392 - val_loss: 1.1198 - val_accuracy: 0.6712 - val_top3_acc: 0.8661 - val_top5_acc: 0.9300
    Epoch 26/50
    6539/6539 [==============================] - 3370s 515ms/step - loss: 1.0527 - accuracy: 0.6829 - top3_acc: 0.8786 - top5_acc: 0.9384 - val_loss: 1.1157 - val_accuracy: 0.6721 - val_top3_acc: 0.8669 - val_top5_acc: 0.9303
    Epoch 27/50
    6539/6539 [==============================] - 3349s 512ms/step - loss: 1.0490 - accuracy: 0.6837 - top3_acc: 0.8810 - top5_acc: 0.9391 - val_loss: 1.1118 - val_accuracy: 0.6727 - val_top3_acc: 0.8682 - val_top5_acc: 0.9307
    Epoch 28/50
    6539/6539 [==============================] - 3362s 514ms/step - loss: 1.0460 - accuracy: 0.6849 - top3_acc: 0.8800 - top5_acc: 0.9401 - val_loss: 1.1081 - val_accuracy: 0.6741 - val_top3_acc: 0.8689 - val_top5_acc: 0.9312
    Epoch 29/50
    6539/6539 [==============================] - 3357s 513ms/step - loss: 1.0361 - accuracy: 0.6883 - top3_acc: 0.8819 - top5_acc: 0.9405 - val_loss: 1.1048 - val_accuracy: 0.6751 - val_top3_acc: 0.8696 - val_top5_acc: 0.9318
    Epoch 30/50
    6539/6539 [==============================] - 3344s 511ms/step - loss: 1.0273 - accuracy: 0.6890 - top3_acc: 0.8842 - top5_acc: 0.9421 - val_loss: 1.1023 - val_accuracy: 0.6748 - val_top3_acc: 0.8703 - val_top5_acc: 0.9322
    Epoch 31/50
    6539/6539 [==============================] - 3352s 513ms/step - loss: 1.0210 - accuracy: 0.6911 - top3_acc: 0.8849 - top5_acc: 0.9438 - val_loss: 1.0996 - val_accuracy: 0.6758 - val_top3_acc: 0.8708 - val_top5_acc: 0.9324
    Epoch 32/50
    6539/6539 [==============================] - 3351s 512ms/step - loss: 1.0183 - accuracy: 0.6930 - top3_acc: 0.8861 - top5_acc: 0.9434 - val_loss: 1.0964 - val_accuracy: 0.6776 - val_top3_acc: 0.8711 - val_top5_acc: 0.9328
    Epoch 33/50
    6539/6539 [==============================] - 3334s 510ms/step - loss: 1.0110 - accuracy: 0.6955 - top3_acc: 0.8873 - top5_acc: 0.9432 - val_loss: 1.0939 - val_accuracy: 0.6780 - val_top3_acc: 0.8723 - val_top5_acc: 0.9334
    Epoch 34/50
    6539/6539 [==============================] - 3329s 509ms/step - loss: 1.0023 - accuracy: 0.6967 - top3_acc: 0.8886 - top5_acc: 0.9451 - val_loss: 1.0910 - val_accuracy: 0.6781 - val_top3_acc: 0.8727 - val_top5_acc: 0.9338
    Epoch 35/50
    6539/6539 [==============================] - 3322s 508ms/step - loss: 1.0021 - accuracy: 0.6966 - top3_acc: 0.8891 - top5_acc: 0.9447 - val_loss: 1.0885 - val_accuracy: 0.6785 - val_top3_acc: 0.8730 - val_top5_acc: 0.9342
    Epoch 36/50
    6539/6539 [==============================] - 3323s 508ms/step - loss: 0.9939 - accuracy: 0.6987 - top3_acc: 0.8903 - top5_acc: 0.9462 - val_loss: 1.0864 - val_accuracy: 0.6792 - val_top3_acc: 0.8738 - val_top5_acc: 0.9341
    Epoch 37/50
    6539/6539 [==============================] - 3363s 514ms/step - loss: 0.9941 - accuracy: 0.6988 - top3_acc: 0.8900 - top5_acc: 0.9458 - val_loss: 1.0842 - val_accuracy: 0.6794 - val_top3_acc: 0.8739 - val_top5_acc: 0.9344
    Epoch 38/50
    6539/6539 [==============================] - 3337s 510ms/step - loss: 0.9916 - accuracy: 0.6987 - top3_acc: 0.8904 - top5_acc: 0.9463 - val_loss: 1.0823 - val_accuracy: 0.6804 - val_top3_acc: 0.8743 - val_top5_acc: 0.9347
    Epoch 39/50
    6539/6539 [==============================] - 3323s 508ms/step - loss: 0.9797 - accuracy: 0.7035 - top3_acc: 0.8933 - top5_acc: 0.9469 - val_loss: 1.0800 - val_accuracy: 0.6809 - val_top3_acc: 0.8754 - val_top5_acc: 0.9355
    Epoch 40/50
    6539/6539 [==============================] - 3327s 509ms/step - loss: 0.9802 - accuracy: 0.7013 - top3_acc: 0.8924 - top5_acc: 0.9472 - val_loss: 1.0781 - val_accuracy: 0.6813 - val_top3_acc: 0.8748 - val_top5_acc: 0.9354
    Epoch 41/50
    6539/6539 [==============================] - 3352s 513ms/step - loss: 0.9724 - accuracy: 0.7032 - top3_acc: 0.8939 - top5_acc: 0.9484 - val_loss: 1.0758 - val_accuracy: 0.6819 - val_top3_acc: 0.8757 - val_top5_acc: 0.9354
    Epoch 42/50
    6539/6539 [==============================] - 3343s 511ms/step - loss: 0.9687 - accuracy: 0.7070 - top3_acc: 0.8945 - top5_acc: 0.9493 - val_loss: 1.0746 - val_accuracy: 0.6816 - val_top3_acc: 0.8755 - val_top5_acc: 0.9356
    Epoch 43/50
    6539/6539 [==============================] - 3354s 513ms/step - loss: 0.9641 - accuracy: 0.7090 - top3_acc: 0.8952 - top5_acc: 0.9489 - val_loss: 1.0723 - val_accuracy: 0.6826 - val_top3_acc: 0.8765 - val_top5_acc: 0.9359
    Epoch 44/50
    6539/6539 [==============================] - 3356s 513ms/step - loss: 0.9630 - accuracy: 0.7070 - top3_acc: 0.8963 - top5_acc: 0.9491 - val_loss: 1.0709 - val_accuracy: 0.6827 - val_top3_acc: 0.8765 - val_top5_acc: 0.9363
    Epoch 45/50
    6539/6539 [==============================] - 3346s 512ms/step - loss: 0.9561 - accuracy: 0.7091 - top3_acc: 0.8973 - top5_acc: 0.9499 - val_loss: 1.0694 - val_accuracy: 0.6831 - val_top3_acc: 0.8769 - val_top5_acc: 0.9363
    Epoch 46/50
    2189/6539 [=========>....................] - ETA: 33:10 - loss: 0.9623 - accuracy: 0.7072 - top3_acc: 0.8963 - top5_acc: 0.9485

[Figure: loss curve]

[Figure: training accuracy curve]

[Figure: validation accuracy curve]


Answer 1:

Rather than setting an lr schedule, it might be better to use an adjustable learning rate based on monitoring the validation loss. The Keras callback ReduceLROnPlateau makes this easy to do; the documentation is here. I also recommend using the Keras EarlyStopping callback; the documentation is here. Set both to monitor the validation loss. My recommended code is shown below:

import tensorflow as tf

# Halve the learning rate whenever val_loss has not improved for 1 epoch
rlronp = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                                              patience=1, verbose=1)
# Stop after 4 epochs without val_loss improvement, keeping the best weights
estop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=4,
                                         verbose=1, restore_best_weights=True)
callbacks = [rlronp, estop]
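
For completeness, a minimal sketch of wiring these callbacks into training; `model`, `train_gen`, and `val_gen` are placeholders standing in for the asker's actual model and data generators, not names from the question:

# Hypothetical model and data; the point here is only the callbacks argument
history = model.fit(train_gen,
                    validation_data=val_gen,
                    epochs=50,
                    callbacks=callbacks)  # [rlronp, estop] defined above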

It is normal for the validation loss to oscillate around some level in later epochs, and reducing the learning rate helps it reach a lower value. Think of the validation loss surface as a parabola in N-space, where N is the number of trainable parameters. Conceptually, in later epochs the validation loss decreases until it reaches a point where the learning rate is too large and the loss starts to oscillate around some level; reducing the lr lets it settle to a lower level. At some point, however, your model essentially starts fitting the noise in the data, so the oscillation begins again, or, if the model starts to overfit, the validation loss may begin to rise. This is the advantage of early stopping with restore_best_weights=True: when training finishes, your model has the weights from the epoch with the lowest validation loss.

Comments:

Sounds good, @Gerry P, thank you very much for your reply. One small question, though: I went through the documentation you mentioned above, and it seems ReduceLROnPlateau can monitor any quantity. Since my training loss fluctuates more than my val_loss, shouldn't I set it to monitor the training loss instead of the validation loss? Please correct me if I'm wrong.

You can set the callback to monitor any quantity you specified as a metric. For example, if you specified 'accuracy' as a metric, you can use monitor="accuracy". If you are satisfied, please mark the answer as accepted.

Got it. Thank you very much. :)

Answer 2:

The reason is a high learning rate. After that point you need to reduce the learning rate further.

Even if you are already using a low LR, you need to reduce it further. Take a look at the example on this page: https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/LearningRateScheduler
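
For illustration, a minimal sketch of such a schedule in the style of the linked docs; the 10-epoch hold and the exp(-0.1) decay factor are assumptions chosen for the example, not values from the question:

import math
import tensorflow as tf

def scheduler(epoch, lr):
    # Hold the current learning rate for the first 10 epochs,
    # then decay it exponentially every epoch thereafter
    if epoch < 10:
        return lr
    return lr * math.exp(-0.1)

lr_callback = tf.keras.callbacks.LearningRateScheduler(scheduler, verbose=1)
# Pass callbacks=[lr_callback] to model.fit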

Comments:

Thank you very much for your reply, @Maged. As you suggested, I set up a LearningRateScheduler and started training again, but I'll have to wait a day or two for the results :(. I'll let you know how it goes. Thanks for your help.
