How to understand the results of training a neural network type transformer (BERT)?

Posted: 2021-12-22 20:18:11

Question:

I am trying to fine-tune a BERT classifier for a classification task, but I cannot make sense of what is displayed during training.

Here is a small sample of the output I get:


'loss': 1.1328, 'learning_rate': 4.994266055045872e-05, 'epoch': 0.0

'loss': 1.0283, 'learning_rate': 4.942660550458716e-05, 'epoch': 0.02

'eval_loss': 0.994676947593689, 'eval_accuracy': 0.507755277897458, 'eval_f1': array([0.00770713, 0.6359277 , 0.44546742]), 'eval_f1_mi': 0.507755277897458, 'eval_f1_ma': 0.36303408438190915, 'eval_runtime': 10.8296, 'eval_samples_per_second': 428.642, 'eval_steps_per_second': 13.482, 'epoch': 0.02

'loss': 1.0075, 'learning_rate': 4.8853211009174314e-05, 'epoch': 0.05
'eval_loss': 1.0286471843719482, 'eval_accuracy': 0.46122361051271005, 'eval_f1': array([0.25      , 0.48133484, 0.51830986]), 'eval_f1_mi': 0.46122361051271005, 'eval_f1_ma': 0.41654823359462956, 'eval_runtime': 10.8256, 'eval_samples_per_second': 428.796, 'eval_steps_per_second': 13.486, 'epoch': 0.05

'loss': 0.9855, 'learning_rate': 4.827981651376147e-05, 'epoch': 0.07
'eval_loss': 0.9796209335327148, 'eval_accuracy': 0.5320982335200345, 'eval_f1': array([0.14783347, 0.6772202 , 0.2726257 ]), 'eval_f1_mi': 0.5320982335200345, 'eval_f1_ma': 0.36589312424069026, 'eval_runtime': 10.8505, 'eval_samples_per_second': 427.813, 'eval_steps_per_second': 13.456, 'epoch': 0.07

'loss': 1.0022, 'learning_rate': 4.7706422018348626e-05, 'epoch': 0.09
'eval_loss': 0.968146026134491, 'eval_accuracy': 0.5364067212408444, 'eval_f1': array([0.38389789, 0.60565553, 0.5487042 ]), 'eval_f1_mi': 0.5364067212408444, 'eval_f1_ma': 0.5127525387411823, 'eval_runtime': 10.9701, 'eval_samples_per_second': 423.15, 'eval_steps_per_second': 13.309, 'epoch': 0.09

'loss': 0.9891, 'learning_rate': 4.713302752293578e-05, 'epoch': 0.11
'eval_loss': 0.9413465261459351, 'eval_accuracy': 0.556872037914692, 'eval_f1': array([0.37663886, 0.68815745, 0.28154206]), 'eval_f1_mi': 0.556872037914692, 'eval_f1_ma': 0.4487794533693059, 'eval_runtime': 10.9316, 'eval_samples_per_second': 424.642, 'eval_steps_per_second': 13.356, 'epoch': 0.11

'loss': 0.9346, 'learning_rate': 4.655963302752294e-05, 'epoch': 0.14
'eval_loss': 0.9142090082168579, 'eval_accuracy': 0.5769065058164584, 'eval_f1': array([0.19836066, 0.68580399, 0.570319  ]), 'eval_f1_mi': 0.5769065058164584, 'eval_f1_ma': 0.4848278830170361, 'eval_runtime': 10.9471, 'eval_samples_per_second': 424.04, 'eval_steps_per_second': 13.337, 'epoch': 0.14

'loss': 0.9394, 'learning_rate': 4.5986238532110096e-05, 'epoch': 0.16
'eval_loss': 0.8802705407142639, 'eval_accuracy': 0.5857389056441189, 'eval_f1': array([0.30735931, 0.71269565, 0.4255121 ]), 'eval_f1_mi': 0.5857389056441189, 'eval_f1_ma': 0.4818556879387581, 'eval_runtime': 10.9824, 'eval_samples_per_second': 422.677, 'eval_steps_per_second': 13.294, 'epoch': 0.16

'loss': 0.8993, 'learning_rate': 4.541284403669725e-05, 'epoch': 0.18
'eval_loss': 0.8535333871841431, 'eval_accuracy': 0.5980180956484275, 'eval_f1': array([0.37174211, 0.7155305 , 0.41662443]), 'eval_f1_mi': 0.5980180956484275, 'eval_f1_ma': 0.5012990131553724, 'eval_runtime': 10.8245, 'eval_samples_per_second': 428.842, 'eval_steps_per_second': 13.488, 'epoch': 0.18

'loss': 0.9482, 'learning_rate': 4.483944954128441e-05, 'epoch': 0.21
'eval_loss': 0.9535377621650696, 'eval_accuracy': 0.541792330891857, 'eval_f1': array([0.31955151, 0.59248471, 0.57414105]), 'eval_f1_mi': 0.541792330891857, 'eval_f1_ma': 0.4953924209116825, 'eval_runtime': 10.9767, 'eval_samples_per_second': 422.896, 'eval_steps_per_second': 13.301, 'epoch': 0.21

'loss': 0.8488, 'learning_rate': 4.426605504587156e-05, 'epoch': 0.23
'eval_loss': 0.8357231020927429, 'eval_accuracy': 0.6214993537268418, 'eval_f1': array([0.35536603, 0.73122392, 0.50070588]), 'eval_f1_mi': 0.6214993537268418, 'eval_f1_ma': 0.5290986104916023, 'eval_runtime': 10.9206, 'eval_samples_per_second': 425.069, 'eval_steps_per_second': 13.369, 'epoch': 0.23

'loss': 0.8893, 'learning_rate': 4.369266055045872e-05, 'epoch': 0.25
'eval_loss': 0.7578970789909363, 'eval_accuracy': 0.6712623869021973, 'eval_f1': array([0.41198502, 0.77171541, 0.65677419]), 'eval_f1_mi': 0.6712623869021973, 'eval_f1_ma': 0.6134915401312347, 'eval_runtime': 10.9765, 'eval_samples_per_second': 422.902, 'eval_steps_per_second': 13.301, 'epoch': 0.25

'loss': 0.9003, 'learning_rate': 4.311926605504588e-05, 'epoch': 0.28
'eval_loss': 0.791412353515625, 'eval_accuracy': 0.6535975872468763, 'eval_f1': array([0.45641646, 0.76072942, 0.53744893]), 'eval_f1_mi': 0.6535975872468763, 'eval_f1_ma': 0.5848649380875267, 'eval_runtime': 10.9302, 'eval_samples_per_second': 424.696, 'eval_steps_per_second': 13.358, 'epoch': 0.28

'loss': 0.8345, 'learning_rate': 4.2545871559633024e-05, 'epoch': 0.3
'eval_loss': 0.7060380578041077, 'eval_accuracy': 0.6999138302455838, 'eval_f1': array([0.50152905, 0.79205975, 0.64349863]), 'eval_f1_mi': 0.6999138302455838, 'eval_f1_ma': 0.6456958112539298, 'eval_runtime': 10.9475, 'eval_samples_per_second': 424.023, 'eval_steps_per_second': 13.336, 'epoch': 0.3

'loss': 0.8149, 'learning_rate': 4.1972477064220184e-05, 'epoch': 0.32
'eval_loss': 0.6717478036880493, 'eval_accuracy': 0.7259801809564843, 'eval_f1': array([0.50805932, 0.81245738, 0.71325735]), 'eval_f1_mi': 0.7259801809564843, 'eval_f1_ma': 0.6779246805922554, 'eval_runtime': 10.7574, 'eval_samples_per_second': 431.519, 'eval_steps_per_second': 13.572, 'epoch': 0.32

'loss': 0.8343, 'learning_rate': 4.139908256880734e-05, 'epoch': 0.34
'eval_loss': 0.6306226253509521, 'eval_accuracy': 0.7455838000861698, 'eval_f1': array([0.58873995, 0.82795018, 0.70917226]), 'eval_f1_mi': 0.7455838000861698, 'eval_f1_ma': 0.7086207951089967, 'eval_runtime': 10.9006, 'eval_samples_per_second': 425.849, 'eval_steps_per_second': 13.394, 'epoch': 0.34

'loss': 0.7711, 'learning_rate': 4.0825688073394495e-05, 'epoch': 0.37
'eval_loss': 0.6052485108375549, 'eval_accuracy': 0.7619560534252477, 'eval_f1': array([0.62346588, 0.84259464, 0.73186813]), 'eval_f1_mi': 0.7619560534252476, 'eval_f1_ma': 0.7326428851759276, 'eval_runtime': 10.8422, 'eval_samples_per_second': 428.143, 'eval_steps_per_second': 13.466, 'epoch': 0.37
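The three F1 fields in the logs above fit together in a simple way: for single-label classification the micro-averaged F1 (`eval_f1_mi`) equals accuracy, which is why those two columns are always identical, and the macro-averaged F1 (`eval_f1_ma`) is the plain mean of the per-class scores in `eval_f1`. A quick check against the last logged evaluation (values copied from the log):

```python
# Per-class F1 scores from the last eval step in the log above
per_class_f1 = [0.62346588, 0.84259464, 0.73186813]

# Macro F1 = unweighted mean of the per-class scores
macro_f1 = sum(per_class_f1) / len(per_class_f1)
print(macro_f1)  # ≈ 0.7326, matching the logged eval_f1_ma

# Micro F1 for single-label classification counts every prediction once,
# so it collapses to plain accuracy: eval_f1_mi == eval_accuracy in every row.
```

This also explains why `eval_f1_ma` can lag well behind `eval_accuracy` early in training: a single weak class (like the ~0.008 score in the first eval) drags the macro average down even when overall accuracy looks reasonable.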

    Why does the loss start at 1.1328?
    Why does the learning rate change at every logging step instead of staying fixed? I set it to 5e-5 at the start.
    How should I interpret these results? To me the model seems to be learning, since the loss decreases steadily, but how does the changing learning rate fit into that picture?
training_args = TrainingArguments(
    output_dir='/gpfswork/rech/kpf/umg16uw/results_hf',          
    logging_dir='/gpfswork/rech/kpf/umg16uw/logs',
    do_train=True,
    do_eval=True,
    evaluation_strategy="steps",
    logging_first_step=True,
    logging_steps=10,
    num_train_epochs=2.0,              
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,   
    learning_rate=5e-5,
    weight_decay=0.01
)
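The declining `learning_rate` values in the logs are consistent with the Trainer's default schedule: linear decay from the configured `learning_rate` down to zero over the total number of training steps (with no warmup here). A minimal sketch of that schedule; the total step count of 8720 is an assumption inferred from the logged values, not something stated in the question:

```python
def linear_lr(step, initial_lr=5e-5, total_steps=8720, warmup_steps=0):
    """Linear schedule: ramp up over warmup_steps, then decay to zero."""
    if step < warmup_steps:
        return initial_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return initial_lr * remaining / max(1, total_steps - warmup_steps)

# With ~8720 total steps (an assumed value that fits the logs),
# step 10 reproduces the first logged rate and step 100 the second:
print(linear_lr(10))   # ≈ 4.994266e-05
print(linear_lr(100))  # ≈ 4.942661e-05
```

So the rate is not being tuned by any feedback from the loss; it follows this fixed, pre-computed decay from the first step onward.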

Comments:

Answer 1:
    The loss starting around 1.1 is essentially arbitrary: the classifier's weights are randomly initialized at the start of the first epoch, so any early accuracy is mostly luck.
    The learning rate you pass to TrainingArguments is only the initial learning rate; the Trainer adjusts it automatically over the course of training (by default it decays linearly toward zero), which is why it does not stay fixed at 5e-5.
    The decreasing learning rate lets the model take large steps early and smaller steps as it converges, which helps guard against over- and underfitting the data.
    Loss and accuracy are good measures to track across epochs: lower loss and higher accuracy are better. If you also log a training accuracy metric, you can compare it with eval_accuracy; once the training accuracy climbs above eval_accuracy, you are starting to overfit the data.
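That last overfitting check can be written as a small helper. This is an illustrative sketch, not part of the asker's code; the metric names mirror the ones in the question's logs, and the tolerance threshold is an arbitrary choice:

```python
def overfitting_gap(train_accuracy, eval_accuracy, tolerance=0.02):
    """Return the train/eval accuracy gap and whether it suggests overfitting.

    A training accuracy noticeably above eval_accuracy means the model fits
    the training data better than unseen data, the classic overfitting signal.
    """
    gap = train_accuracy - eval_accuracy
    return gap, gap > tolerance

gap, overfit = overfitting_gap(0.95, 0.76)
print(gap, overfit)  # gap ≈ 0.19, overfit is True
```

In practice you would call this after each evaluation step and stop (or regularize harder) once the gap keeps widening rather than crossing the threshold once.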

Comments:

Thanks for the valuable notes, I understand it now. One more question: when I train the model it sometimes stops on its own. Does that mean an "early stopping" mechanism is already set up, so training stops when the loss plateaus, or does it depend on the number of epochs? In my case training stopped after 3 epochs. Also, why does the epoch value range from 0.01 to 3.0? I thought the epochs would simply be 1, 2 and 3.
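On the comment above: the fractional epoch values in the logs are just progress markers (epoch 0.5 means halfway through the first pass over the training data), and the Hugging Face Trainer does not stop early by default; it runs for exactly `num_train_epochs`. Early stopping has to be opted in via `EarlyStoppingCallback`. A rough configuration sketch, not the asker's actual setup; `model`, `train_dataset`, and `eval_dataset` are assumed to be defined elsewhere:

```python
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="steps",
    load_best_model_at_end=True,        # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False,            # lower eval_loss is better
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    # Stop if eval_loss fails to improve for 3 consecutive evaluations
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
```

If no such callback is configured, a run that "stops on its own" most likely just reached `num_train_epochs` or hit an external limit (e.g. a cluster job timeout).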
