tensorflow.python.framework.errors_impl.ResourceExhaustedError: failed to allocate memory [Op:AddV2]

Posted: 2021-12-07 00:31:59

Question:

Hello, I am a beginner in DL and TensorFlow.

I created a CNN (you can see the model below):

import tensorflow as tf

model = tf.keras.Sequential()

model.add(tf.keras.layers.Conv2D(filters=64, kernel_size=7, activation="relu", input_shape=[512, 640, 3]))
model.add(tf.keras.layers.MaxPooling2D(2))
model.add(tf.keras.layers.Conv2D(filters=128, kernel_size=3, activation="relu"))
model.add(tf.keras.layers.Conv2D(filters=128, kernel_size=3, activation="relu"))
model.add(tf.keras.layers.MaxPooling2D(2))
model.add(tf.keras.layers.Conv2D(filters=256, kernel_size=3, activation="relu"))
model.add(tf.keras.layers.Conv2D(filters=256, kernel_size=3, activation="relu"))
model.add(tf.keras.layers.MaxPooling2D(2))

model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dropout(0.5))
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dropout(0.5))
model.add(tf.keras.layers.Dense(2, activation='softmax'))

optimizer = tf.keras.optimizers.SGD(learning_rate=0.2) #, momentum=0.9, decay=0.1)
model.compile(optimizer=optimizer, loss='mse', metrics=['accuracy'])

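I built and trained it on the CPU and it completed successfully (but very slowly), so I decided to install tensorflow-gpu. I installed everything following the instructions at https://www.tensorflow.org/install/gpu.

But now, when I try to build the model, I get this error: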

> Traceback (most recent call last):   File
> "C:/Users/thano/Documents/Py_workspace/AI_tensorflow/fire_detection/main.py",
> line 63, in <module>
>     model = create_models.model1()   File "C:\Users\thano\Documents\Py_workspace\AI_tensorflow\fire_detection\create_models.py",
> line 20, in model1
>     model.add(tf.keras.layers.Dense(128, activation='relu'))   File "C:\Python37\lib\site-packages\tensorflow\python\training\tracking\base.py",
> line 530, in _method_wrapper
>     result = method(self, *args, **kwargs)   File "C:\Python37\lib\site-packages\keras\engine\sequential.py", line 217,
> in add
>     output_tensor = layer(self.outputs[0])   File "C:\Python37\lib\site-packages\keras\engine\base_layer.py", line 977,
> in __call__
>     input_list)   File "C:\Python37\lib\site-packages\keras\engine\base_layer.py", line 1115,
> in _functional_construction_call
>     inputs, input_masks, args, kwargs)   File "C:\Python37\lib\site-packages\keras\engine\base_layer.py", line 848,
> in _keras_tensor_symbolic_call
>     return self._infer_output_signature(inputs, args, kwargs, input_masks)   File
> "C:\Python37\lib\site-packages\keras\engine\base_layer.py", line 886,
> in _infer_output_signature
>     self._maybe_build(inputs)   File "C:\Python37\lib\site-packages\keras\engine\base_layer.py", line 2659,
> in _maybe_build
>     self.build(input_shapes)  # pylint:disable=not-callable   File "C:\Python37\lib\site-packages\keras\layers\core.py", line 1185, in
> build
>     trainable=True)   File "C:\Python37\lib\site-packages\keras\engine\base_layer.py", line 663,
> in add_weight
>     caching_device=caching_device)   File "C:\Python37\lib\site-packages\tensorflow\python\training\tracking\base.py",
> line 818, in _add_variable_with_custom_getter
>     **kwargs_for_getter)   File "C:\Python37\lib\site-packages\keras\engine\base_layer_utils.py", line
> 129, in make_variable
>     shape=variable_shape if variable_shape else None)   File "C:\Python37\lib\site-packages\tensorflow\python\ops\variables.py",
> line 266, in __call__
>     return cls._variable_v1_call(*args, **kwargs)   File "C:\Python37\lib\site-packages\tensorflow\python\ops\variables.py",
> line 227, in _variable_v1_call
>     shape=shape)   File "C:\Python37\lib\site-packages\tensorflow\python\ops\variables.py",
> line 205, in <lambda>
>     previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)   File "C:\Python37\lib\site-packages\tensorflow\python\ops\variable_scope.py",
> line 2626, in default_variable_creator
>     shape=shape)   File "C:\Python37\lib\site-packages\tensorflow\python\ops\variables.py",
> line 270, in __call__
>     return super(VariableMetaclass, cls).__call__(*args, **kwargs)   File
> "C:\Python37\lib\site-packages\tensorflow\python\ops\resource_variable_ops.py",
> line 1613, in __init__
>     distribute_strategy=distribute_strategy)   File "C:\Python37\lib\site-packages\tensorflow\python\ops\resource_variable_ops.py",
> line 1740, in _init_from_args
>     initial_value = initial_value()   File "C:\Python37\lib\site-packages\keras\initializers\initializers_v2.py",
> line 517, in __call__
>     return self._random_generator.random_uniform(shape, -limit, limit, dtype)   File
> "C:\Python37\lib\site-packages\keras\initializers\initializers_v2.py",
> line 973, in random_uniform
>     shape=shape, minval=minval, maxval=maxval, dtype=dtype, seed=self.seed)   File
> "C:\Python37\lib\site-packages\tensorflow\python\util\dispatch.py",
> line 206, in wrapper
>     return target(*args, **kwargs)   File "C:\Python37\lib\site-packages\tensorflow\python\ops\random_ops.py",
> line 315, in random_uniform
>     result = math_ops.add(result * (maxval - minval), minval, name=name)   File
> "C:\Python37\lib\site-packages\tensorflow\python\util\dispatch.py",
> line 206, in wrapper
>     return target(*args, **kwargs)   File "C:\Python37\lib\site-packages\tensorflow\python\ops\math_ops.py",
> line 3943, in add
>     return gen_math_ops.add_v2(x, y, name=name)   File "C:\Python37\lib\site-packages\tensorflow\python\ops\gen_math_ops.py",
> line 454, in add_v2
>     _ops.raise_from_not_ok_status(e, name)   File "C:\Python37\lib\site-packages\tensorflow\python\framework\ops.py",
> line 6941, in raise_from_not_ok_status
>     six.raise_from(core._status_to_exception(e.code, message), None)   File "<string>", line 3, in raise_from
> tensorflow.python.framework.errors_impl.ResourceExhaustedError: failed
> to allocate memory [Op:AddV2]

Any ideas what the problem might be?

Comments:

What GPU are you using, and how much VRAM does it have? Also, what batch_size are you using when training the model?

Please share your training code.

Answer 1:

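The error message you are getting, tensorflow.python.framework.errors_impl.ResourceExhaustedError: failed to allocate memory [Op:AddV2], likely indicates that your GPU does not have enough memory to run the training job you are trying to run. What GPU are you using, and how much VRAM does it have?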
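The traceback in the question ends inside the build of Dense(128), so it can help to estimate how large that layer's weight matrix is. The following is a rough sketch, not a measurement: it assumes the Keras defaults of 'valid' padding and stride-1 convolutions, pooling with stride equal to the window, and float32 weights. The helper names (conv_out, pool_out, flattened_features, dense_kernel_mib) are made up for this illustration.

```python
# Back-of-the-envelope size of the first Dense layer's kernel for the
# model in the question. Assumptions: 'valid' padding, stride-1 convs,
# 2x2 max pooling with stride 2, float32 (4 bytes per weight).

def conv_out(size, kernel):
    """Output length of a stride-1 'valid' convolution along one axis."""
    return size - kernel + 1


def pool_out(size, pool):
    """Output length of a 'valid' pool whose stride equals its window."""
    return size // pool


def flattened_features(height, width):
    """Trace the question's conv/pool stack and return the Flatten size."""
    # (layer kind, kernel or pool size, output channels)
    stack = [
        ("conv", 7, 64), ("pool", 2, None),
        ("conv", 3, 128), ("conv", 3, 128), ("pool", 2, None),
        ("conv", 3, 256), ("conv", 3, 256), ("pool", 2, None),
    ]
    channels = 3
    for kind, k, out_channels in stack:
        if kind == "conv":
            height, width = conv_out(height, k), conv_out(width, k)
            channels = out_channels
        else:
            height, width = pool_out(height, k), pool_out(width, k)
    return height * width * channels


def dense_kernel_mib(inputs, units, bytes_per_value=4):
    """Size of a Dense kernel (bias ignored) in MiB."""
    return inputs * units * bytes_per_value / 2**20


for h, w in [(512, 640), (256, 320)]:
    n = flattened_features(h, w)
    mib = dense_kernel_mib(n, 128)
    print(f"{h}x{w}: flatten={n}, Dense(128) kernel ~ {mib:.0f} MiB")
```

Under these assumptions the 512x640 input flattens to 1,167,360 features, so the Dense(128) kernel alone is about 570 MiB, while a 256x320 input would give about 126 MiB. The traceback shows the initializer computing result * (maxval - minval) and then adding minval, so more than one buffer of that size may exist at the same time, which could plausibly exhaust a small GPU before training even starts.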

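When it comes to "Out Of Memory" (OOM) errors during training, the most straightforward fix is to reduce the batch_size hyperparameter.

There is no direct way to determine the largest batch_size that fits in your GPU's available VRAM other than trial and error. However, a common rule of thumb is to use powers of 2 (e.g. 8, 16, 32).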
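The trial-and-error search can be sketched as a loop that halves the batch size after each failure. This is only an illustration: find_largest_batch_size and fake_training_step are hypothetical names, and the fake step raises Python's built-in MemoryError so the example runs without a GPU. With TensorFlow you would presumably pass oom_errors=(tf.errors.ResourceExhaustedError,) and have try_batch run a short Model.fit() call with the given batch_size.

```python
def find_largest_batch_size(try_batch, start=32, oom_errors=(MemoryError,)):
    """Halve the batch size until try_batch(batch_size) stops failing.

    try_batch is a hypothetical callable that runs one short training
    attempt and raises one of oom_errors when memory runs out.
    Returns the first batch size that succeeds, or None if even 1 fails.
    """
    batch_size = start
    while batch_size >= 1:
        try:
            try_batch(batch_size)
            return batch_size
        except oom_errors:
            # Stay on powers of 2 when start is a power of 2.
            batch_size //= 2
    return None


def fake_training_step(batch_size):
    """Stand-in for a real training attempt: pretends 8 is the limit."""
    if batch_size > 8:
        raise MemoryError(f"batch_size={batch_size} does not fit")


best = find_largest_batch_size(fake_training_step, start=32)
print(best)  # 8
```

A search like this only helps when the failure happens during training; if memory runs out while the model is being built, no batch size will succeed and the function returns None.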

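Comments:

Thanks for the reply! Yes, judging by the error message that makes sense, but the error occurs while the model is being created, at the first dense layer. It never reaches the training phase, so I don't think the problem is batch_size. (By the way, I use the SGD optimizer. That is one training instance at a time, right?) GPU: GTX 1050, dedicated video memory = 2048 MB, total video memory = 10206 MB

@Thanos It would be easiest for us to help if you could share your training code. If it is short enough, you can edit the original question to include it, or include a link to a GitHub repository with your code.

1. Ah, right, I missed that. 2 GB of video memory is not much, but it should be enough for TensorFlow to build the weights for your first dense layer. Could you try freeing up video memory by killing some other applications? 2. The optimizer does not determine the batch size; that is set when you call the training function, usually Model.fit(). See the documentation and note the batch_size argument.

Even though the model is simple, the input resolution seems high to me for a GTX 1050. Have you also tried reducing the [512, 640, 3] input shape?

@MrK. No, I have not tried that.

Answer 2: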

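The error is telling you that it could not allocate as much VRAM as you are trying to use. The simplest way to fix this kind of problem is to reduce the batch size to a number that fits in your GPU's VRAM.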
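To see why batch size matters for VRAM, here is a rough estimate of the memory needed just to hold the conv/pool layer outputs of the question's model. It is a sketch under stated assumptions: the shapes were worked out by hand for 'valid' padding and a 512x640x3 input (model.summary() would be the authoritative source), values are float32, and gradients, weights, and framework workspace are ignored, so real usage will differ. LAYER_OUTPUTS and activation_mib are illustrative names.

```python
# Output shapes (height, width, channels) of each conv/pool layer in the
# question's model for one 512x640x3 input, assuming 'valid' padding.
LAYER_OUTPUTS = [
    (506, 634, 64), (253, 317, 64),                     # Conv 7x7, MaxPool
    (251, 315, 128), (249, 313, 128), (124, 156, 128),  # Conv, Conv, MaxPool
    (122, 154, 256), (120, 152, 256), (60, 76, 256),    # Conv, Conv, MaxPool
]


def activation_mib(batch_size, bytes_per_value=4):
    """Rough MiB needed to store every layer output for one batch."""
    per_sample = sum(h * w * c for h, w, c in LAYER_OUTPUTS)
    return batch_size * per_sample * bytes_per_value / 2**20


for batch_size in (1, 8, 16, 32):
    mib = activation_mib(batch_size)
    print(f"batch_size={batch_size:>2}: ~{mib:.0f} MiB of activations")
```

The estimate grows linearly with batch_size and comes to a little over 200 MiB per sample for this input size, which suggests that on a 2048 MB card only very small batches, or a smaller input resolution, would be likely to fit.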
