failed to allocate 158.06M (165740544 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY

Posted: 2018-05-04 06:57:56

Question:

How do I fix this error?

[jalal@goku bin]$ source activate deep_emotion
(deep_emotion) [jalal@goku bin]$ python
Python 3.5.4 | packaged by conda-forge | (default, Nov  4 2017, 10:11:29)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import keras
Using Theano backend.
>>> quit()
(deep_emotion) [jalal@goku bin]$ export KERAS_BACKEND=tensorflow
(deep_emotion) [jalal@goku bin]$ python
Python 3.5.4 | packaged by conda-forge | (default, Nov  4 2017, 10:11:29)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import keras
Using TensorFlow backend.
2017-11-20 17:49:18.666294: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 17:49:18.666337: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 17:49:18.666347: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 17:49:18.666354: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 17:49:18.666363: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 17:49:19.196610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties:
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.6705
pciBusID 0000:05:00.0
Total memory: 10.91GiB
Free memory: 158.06MiB
2017-11-20 17:49:19.426132: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x42e9db0 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-11-20 17:49:19.426768: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 1 with properties:
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.6705
pciBusID 0000:06:00.0
Total memory: 10.91GiB
Free memory: 398.44MiB
2017-11-20 17:49:19.427277: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 1
2017-11-20 17:49:19.427309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y Y
2017-11-20 17:49:19.427323: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 1:   Y Y
2017-11-20 17:49:19.427347: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:05:00.0)
2017-11-20 17:49:19.427362: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:06:00.0)
2017-11-20 17:49:19.429776: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to allocate 158.06M (165740544 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
>>> quit()
(deep_emotion) [jalal@goku bin]$ conda list | grep keras
keras                     2.0.9                    py35_0    conda-forge
(deep_emotion) [jalal@goku bin]$ conda list | grep tensorflow
tensorflow-gpu            1.3.0                         0
tensorflow-gpu-base       1.3.0           py35cuda8.0cudnn6.0_1
tensorflow-tensorboard    0.1.5                    py35_0
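The size of the failed allocation is not arbitrary: 165740544 bytes is exactly the 158.06 MiB that TensorFlow reported as free on device 0. That suggests TensorFlow tried to reserve all of the memory the driver still reported as free on that GPU, and even that request failed. A quick check of the arithmetic (`bytes_to_mib` is just an illustrative helper, not part of Keras or TensorFlow):

```python
def bytes_to_mib(n_bytes):
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return n_bytes / (1024 * 1024)

# Byte count copied from the CUDA_ERROR_OUT_OF_MEMORY log line.
failed_request = 165740544
print("{:.2f} MiB".format(bytes_to_mib(failed_request)))  # prints "158.06 MiB"
```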

System information:

$ uname -a
Linux goku.bu.edu 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

(deep_emotion) [jalal@goku bin]$ nvidia-smi
Mon Nov 20 17:51:50 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:05:00.0  On |                  N/A |
|  0%   25C    P8    19W / 250W |  10862MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:06:00.0 Off |                  N/A |
|  0%   36C    P8    19W / 250W |  10622MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2062      G   /usr/bin/X                                   183MiB |
|    0      2779      G   /usr/bin/gnome-shell                         176MiB |
|    0      3298      C   /cs/software/anaconda3/bin/python          10341MiB |
|    0      4350      G   ...-token=2BC290A510039A38C05EF3ECBAA5E5E5    78MiB |
|    0      5212      G   /usr/lib64/firefox/plugin-container            5MiB |
|    0     32257      G   /proc/self/exe                                64MiB |
|    1      3298      C   /cs/software/anaconda3/bin/python          10611MiB |
+-----------------------------------------------------------------------------+
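The process table is the key clue: PID 3298 (`/cs/software/anaconda3/bin/python`, a different interpreter from the `deep_emotion` environment) holds 10341 MiB on GPU 0 and 10611 MiB on GPU 1, leaving almost nothing for a new TensorFlow session. The sketch below lists compute processes sorted by GPU memory so the culprit stands out. It assumes your driver's `nvidia-smi` supports `--query-compute-apps` (see `nvidia-smi --help-query-compute-apps`); the helper names are hypothetical, and on some drivers the memory field reads `[N/A]`, which this sketch does not handle.

```python
import subprocess
from collections import namedtuple

# One compute process holding memory on a GPU (memory in MiB).
GpuProcess = namedtuple("GpuProcess", ["bus_id", "pid", "name", "used_mib"])

QUERY = ["nvidia-smi",
         "--query-compute-apps=gpu_bus_id,pid,process_name,used_memory",
         "--format=csv,noheader,nounits"]

def parse_compute_apps(csv_text):
    """Parse `nvidia-smi --query-compute-apps` CSV output (noheader, nounits)."""
    procs = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        bus_id, pid, rest = line.split(", ", 2)
        name, used = rest.rsplit(", ", 1)  # process names may themselves contain ", "
        procs.append(GpuProcess(bus_id, int(pid), name, int(used)))
    # Largest memory holders first: these are the candidates to stop.
    return sorted(procs, key=lambda p: p.used_mib, reverse=True)

def list_compute_apps():
    """Run nvidia-smi and return the parsed process list (requires an NVIDIA driver)."""
    out = subprocess.check_output(QUERY, universal_newlines=True)
    return parse_compute_apps(out)
```

Once the PID is known, stopping that process (or asking its owner to) frees the memory without a full reboot, assuming you have permission to do so on a shared machine.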

Comments:

1. Kick everyone else off the machine. 2. Reboot. 3. Rerun your python/keras/tensorflow script, without running Theano first.

Answer 1:

Thanks to Robert Crovella for the suggestion. Rebooting the machine solved the problem:

[jalal@goku ~]$ source activate deep_emotion
(deep_emotion) [jalal@goku ~]$ export KERAS_BACKEND=tensorflow
(deep_emotion) [jalal@goku ~]$ python
Python 3.5.4 | packaged by conda-forge | (default, Nov  4 2017, 10:11:29) 
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import keras
Using TensorFlow backend.
2017-11-20 18:43:28.424658: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 18:43:28.424690: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 18:43:28.424727: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 18:43:28.424734: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 18:43:28.424745: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 18:43:28.951509: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: 
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.6705
pciBusID 0000:05:00.0
Total memory: 10.91GiB
Free memory: 10.44GiB
2017-11-20 18:43:29.172079: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x31d6630 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-11-20 18:43:29.172825: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 1 with properties: 
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.6705
pciBusID 0000:06:00.0
Total memory: 10.91GiB
Free memory: 10.75GiB
2017-11-20 18:43:29.173970: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 1 
2017-11-20 18:43:29.174019: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y Y 
2017-11-20 18:43:29.174034: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 1:   Y Y 
2017-11-20 18:43:29.174055: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:05:00.0)
2017-11-20 18:43:29.174070: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:06:00.0)
>>> import tensorflow
>>> 
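After the reboot both GPUs report more than 10 GiB free, so TensorFlow initializes without errors. To catch the situation earlier next time, one option is to check free memory before importing Keras/TensorFlow and pin the process to a GPU that has room, using the `CUDA_VISIBLE_DEVICES` environment variable (it must be set before CUDA is initialized). This is a sketch under stated assumptions: `pin_to_free_gpu`, its helpers, and the 2048 MiB threshold are hypothetical, and `CUDA_DEVICE_ORDER=PCI_BUS_ID` is set so CUDA numbers devices the same way `nvidia-smi` does. In the original question both GPUs were nearly full, so this check would raise rather than pick a device, pointing you back to `nvidia-smi`.

```python
import os
import subprocess

def parse_free_memory(csv_text):
    """Parse `nvidia-smi --query-gpu=index,memory.free` output into {index: free MiB}."""
    free = {}
    for line in csv_text.strip().splitlines():
        index, mib = line.split(",")
        free[int(index)] = int(mib)  # int() tolerates the surrounding spaces
    return free

def pick_gpu(free_by_index, min_free_mib):
    """Return the GPU index with the most free memory, or None if none has min_free_mib."""
    candidates = {i: m for i, m in free_by_index.items() if m >= min_free_mib}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

def pin_to_free_gpu(min_free_mib=2048):
    """Restrict CUDA to one GPU with enough free memory; call before importing keras."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,memory.free", "--format=csv,noheader,nounits"],
        universal_newlines=True)
    gpu = pick_gpu(parse_free_memory(out), min_free_mib)
    if gpu is None:
        raise RuntimeError(
            "no GPU has {} MiB free; check nvidia-smi for other processes".format(min_free_mib))
    # Make CUDA enumerate devices in PCI bus order, matching nvidia-smi's indices.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu)
    return gpu
```

Typical use would be calling `pin_to_free_gpu()` at the very top of a training script, before `import keras`.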
