Problem faced during TensorFlow training (BatchNormV3 error)

Posted: 2021-11-15 00:02:08

While training a transformer network for machine translation, the GPU raised the error below. Why does this happen?

Traceback (most recent call last):
  File "D:/Transformer_MC__translation/model.py", line 64, in <module>
    output = model(train, label)
  File "C:\Users\Devanshu\AppData\Local\Programs\Python\Python38\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 1012, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "D:\Transformer_MC__translation\transformer.py", line 36, in call
    enc_src = self.encoder(src, src_mask)
  File "C:\Users\Devanshu\AppData\Local\Programs\Python\Python38\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 1012, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "D:\Transformer_MC__translation\encoder.py", line 23, in call
    output = layer(output, output, output, mask)
  File "C:\Users\Devanshu\AppData\Local\Programs\Python\Python38\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 1012, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "D:\Transformer_MC__translation\transformerblock.py", line 22, in call
    x = self.dropout(self.norm1(attention+query))
  File "C:\Users\Devanshu\AppData\Local\Programs\Python\Python38\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 1012, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "C:\Users\Devanshu\AppData\Local\Programs\Python\Python38\lib\site-packages\tensorflow\python\keras\layers\normalization.py", line 1293, in call
    outputs, _, _ = nn.fused_batch_norm(
  File "C:\Users\Devanshu\AppData\Local\Programs\Python\Python38\lib\site-packages\tensorflow\python\util\dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "C:\Users\Devanshu\AppData\Local\Programs\Python\Python38\lib\site-packages\tensorflow\python\ops\nn_impl.py", line 1660, in fused_batch_norm
    y, running_mean, running_var, _, _, _ = gen_nn_ops.fused_batch_norm_v3(
  File "C:\Users\Devanshu\AppData\Local\Programs\Python\Python38\lib\site-packages\tensorflow\python\ops\gen_nn_ops.py", line 4255, in fused_batch_norm_v3
    _ops.raise_from_not_ok_status(e, name)
  File "C:\Users\Devanshu\AppData\Local\Programs\Python\Python38\lib\site-packages\tensorflow\python\framework\ops.py", line 6862, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InternalError: cuDNN launch failure : input shape ([1,4928,256,1]) [Op:FusedBatchNormV3]

Here is the decoder block:

import tensorflow as tf
from selfattention import SelfAttention
from transformerblock import TransformerBlock

class DecoderBlock(tf.keras.layers.Layer):
    def __init__(self, embed_size, head, forward_expansion, dropout):
        super(DecoderBlock, self).__init__()
        self.attention = SelfAttention(embed_size, head)
        self.norm = tf.keras.layers.LayerNormalization()
        self.transformer_block = TransformerBlock(embed_size, head, dropout=dropout, forward_expansion=forward_expansion)
        self.dropout = tf.keras.layers.Dropout(dropout)

    def call(self, inputs, key, value, src_mask, trg_mask):
        attention = self.attention(inputs, inputs, inputs, trg_mask)
        # skip connection
        query = self.dropout(self.norm(attention + inputs))
        print(query.shape)

        output = self.transformer_block(value, key, query, src_mask)

        return output

The output shape of attention + input is (64, 80, 250) (batch size, sentence length, vocab size).

Question comments:

Answer 1:

Here is something you can try. I once hit this error when using a very large batch size, and fixed it by reducing it.

Reduce the batch_size parameter, then increase it gradually (2, 4, 8, 10, etc.). cuDNN internal errors like this can also be caused by a mismatched library installation.

Make sure all your dependencies (TF + cuDNN + CUDA) are installed correctly, and once you are sure the installation is right, reduce batch_size.

In your case, I suspect the problem is caused by a large batch size.
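The gradual batch-size search suggested above can be sketched framework-agnostically. Here `train_step` is a placeholder for one training step at a given batch size, and `oom_errors` stands in for whatever your framework raises on memory exhaustion (with TensorFlow on GPU this typically surfaces as `tf.errors.ResourceExhaustedError` or, as in the question, `tf.errors.InternalError`):

```python
def largest_working_batch_size(train_step, candidate_sizes, oom_errors=(RuntimeError,)):
    """Grow the batch size step by step (e.g. 2, 4, 8, ...) and return the
    largest size that trains without an OOM-style error (None if even the
    smallest one fails)."""
    best = None
    for bs in sorted(candidate_sizes):
        try:
            train_step(bs)   # run one training step at this batch size
            best = bs        # it fit in GPU memory; remember it and try bigger
        except oom_errors:
            break            # first failure: everything larger will fail too
    return best

# Hypothetical stand-in: pretend any batch size above 16 exhausts GPU memory.
def fake_train_step(bs):
    if bs > 16:
        raise RuntimeError("cuDNN launch failure / out of memory")

print(largest_working_batch_size(fake_train_step, [2, 4, 8, 16, 32, 64]))  # 16
```

In a real training script you would wrap one epoch (or a few steps) of your actual model in `train_step` and rebuild the `tf.data` pipeline with each candidate batch size.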

Discussion:

Thanks for the answer, @Timbus Calin; someone may well find the solutions you wrote useful. My problem turned out to have a different cause. I had added two lines to the program [ gpu_options = tf.compat.v1.GPUOptions(per_process_gpu_memory_fraction=True) , session = tf.compat.v1.InteractiveSession(config=tf.compat.v1.ConfigProto(gpu_options=gpu_options)) ] and they were causing the error. I thought I was running into a GPU out-of-memory problem, which is why I had put them in. After removing those two lines, the program runs fine.

Indeed, your solution applies to TF 1.X and is indirectly related to the memory problem. Honestly, I didn't expect you to be on TF 1.X. I strongly recommend updating to TensorFlow 2.

Thanks for appreciating the solution I wrote; many people get their problem solved and don't bother to upvote, accept, or answer.

No, it's TensorFlow 2.5.
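A side note on the two lines quoted above: `per_process_gpu_memory_fraction` expects a float between 0 and 1 (the fraction of GPU memory to reserve), not `True`, and the `tf.compat.v1` session API mixes the TF 1.x style into a TF 2.x program. On TF 2.x, the usual way to stop TensorFlow from grabbing all GPU memory up front is to enable memory growth instead; a minimal sketch (it is a no-op on machines without a visible GPU, and must run before any GPU is initialized):

```python
import tensorflow as tf

# Enable on-demand GPU memory allocation instead of reserving
# all device memory at startup.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```

This replaces both compat.v1 lines: TF 2.x manages its own session, so no `InteractiveSession` is needed.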
