Tensorflow GPU error CUDA_ERROR_OUT_OF_MEMORY: out of memory

Posted: 2019-01-08 14:58:46

[Question]:

I am new to tensorflow and I am having problems running it on the GPU; on the CPU everything works fine.

When I run the following command to check the tensorflow installation:

python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))" 

I get this error:

2019-01-08 18:49:51.551078: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/myUsername/.local/lib/python3.6/site-packages/tensorflow/python/ops/random_ops.py", line 73, in random_normal
    shape_tensor = _ShapeTensor(shape)
  File "/home/myUsername/.local/lib/python3.6/site-packages/tensorflow/python/ops/random_ops.py", line 44, in _ShapeTensor
    return ops.convert_to_tensor(shape, dtype=dtype, name="shape")
  File "/home/myUsername/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1050, in convert_to_tensor
    as_ref=False)
  File "/home/myUsername/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1146, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/home/myUsername/.local/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 229, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
  File "/home/myUsername/.local/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 179, in constant
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/home/myUsername/.local/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 99, in convert_to_eager_tensor
    handle = ctx._handle  # pylint: disable=protected-access
  File "/home/myUsername/.local/lib/python3.6/site-packages/tensorflow/python/eager/context.py", line 319, in _handle
    self._initialize_handle_and_devices()
  File "/home/myUsername/.local/lib/python3.6/site-packages/tensorflow/python/eager/context.py", line 267, in _initialize_handle_and_devices
    self._context_handle = pywrap_tensorflow.TFE_NewContext(opts)
tensorflow.python.framework.errors_impl.InternalError: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY: out of memory; total memory reported: 12788498432
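
(As a point of reference: the "total memory reported" value, 12788498432 bytes, works out to 12788498432 / 1024 / 1024 ≈ 12196 MiB, which matches the Titan Xp's total in the nvidia-smi output below, so device ordinal 0 appears to be the Titan Xp.)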

And with the following example:

import tensorflow as tf
mnist = tf.keras.datasets.mnist

(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation=tf.nn.relu),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)

I get this error:

2019-01-08 18:53:07.267303: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
Traceback (most recent call last):
  File "test_keras.py", line 17, in <module>
    model.fit(x_train, y_train, epochs=5)
  File "/home/myUsername/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1639, in fit
    validation_steps=validation_steps)
  File "/home/myUsername/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/training_arrays.py", line 215, in fit_loop
    outs = f(ins_batch)
  File "/home/myUsername/.local/lib/python3.6/site-packages/tensorflow/python/keras/backend.py", line 2947, in __call__
    session = get_session()
  File "/home/myUsername/.local/lib/python3.6/site-packages/tensorflow/python/keras/backend.py", line 465, in get_session
    _SESSION = session_module.Session(config=get_default_session_config())
  File "/home/myUsername/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1551, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/home/myUsername/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 676, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY: out of memory; total memory reported: 12788498432

Any hints on how to solve this?

My system details are:

python3 -V

Python 3.6.7

nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176

nvidia-smi

Tue Jan  8 18:37:03 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.87                 Driver Version: 390.87                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:17:00.0 Off |                  N/A |
| 23%   31C    P8    16W / 250W |  12176MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1070    Off  | 00000000:65:00.0  On |                  N/A |
|  0%   48C    P8    13W / 180W |   7768MiB /  8118MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

cuDNN version

 cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 0
#define CUDNN_PATCHLEVEL 5
--
#define CUDNN_VERSION    (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

#include "driver_types.h"

TensorFlow version

python3
Python 3.6.7 (default, Oct 22 2018, 11:32:17) 
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> tf.__version__
'1.12.0'
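
For reference, a common way to keep TF 1.x from reserving nearly all GPU memory when a session is created is the allow_growth option. The sketch below shows that setup for the Keras backend used in the example above; whether it applies to a context-initialization failure like this one is an assumption, not something verified here.

import tensorflow as tf

# Sketch (TF 1.12): grow GPU memory on demand instead of reserving almost
# all of it when the session is created.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

# Install this session as the Keras backend session so that model.fit()
# in the example above reuses it.
tf.keras.backend.set_session(tf.Session(config=config))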

[Comments]:

All of the Titan Xp's memory is in use (and the GTX 1070's as well). You also appear to have cut off the part of the nvidia-smi output that shows which processes are using the GPUs. Without knowing what else is happening on the machine, you could: 1. Reboot. 2. Run nvidia-smi again and verify that the Titan Xp's memory is mostly free. 3. Retry the first command in the question.

[Answer 1]:

Robert Crovella, thanks for your answer.

I followed the steps you suggested, but I still have the same problem. Here is the result: as you can see, memory usage is now very low, 2 MiB on the Titan Xp and 902 MiB on the GTX 1070.

nvidia-smi

Wed Jan  9 10:56:55 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.87                 Driver Version: 390.87                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:17:00.0 Off |                  N/A |
| 23%   22C    P8     8W / 250W |      2MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1070    Off  | 00000000:65:00.0  On |                  N/A |
|  0%   37C    P8    10W / 180W |    902MiB /  8118MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    1      1355      G   /usr/lib/xorg/Xorg                            40MiB |
|    1      1588      G   /usr/bin/gnome-shell                          81MiB |
|    1      3342      G   /usr/lib/xorg/Xorg                           439MiB |
|    1      3535      G   /usr/bin/gnome-shell                         227MiB |
|    1      8880      G   ...uest-channel-token=15629967551314695332   109MiB |
|    1     26921      G   /usr/bin/nvidia-settings                       0MiB |
+-----------------------------------------------------------------------------+

When I installed tensorflow I followed this tutorial link; the main difference is that I installed tensorflow 1.12 on Ubuntu 18.10.
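
In case it helps narrow this down, the sketch below is a small isolation test: hide all but one GPU before TensorFlow initializes CUDA, then list the devices it can see. The ordinal "0" is an assumption; CUDA device numbering does not necessarily match the nvidia-smi numbering.

import os

# Assumption: restrict TensorFlow to a single GPU before CUDA is initialized.
# "0" is a CUDA ordinal and may not correspond to GPU 0 in nvidia-smi.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow as tf
from tensorflow.python.client import device_lib

# Enumerating local devices forces CUDA context creation, so this either
# reproduces the CUDA_ERROR_OUT_OF_MEMORY failure for the selected GPU
# or prints the devices TensorFlow can actually use.
print(device_lib.list_local_devices())
print("GPU available:", tf.test.is_gpu_available())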

[Comments]:
