failed to allocate 158.06M (165740544 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
[Posted]: 2018-05-04 06:57:56
[Question]: How should I resolve this error?
[jalal@goku bin]$ source activate deep_emotion
(deep_emotion) [jalal@goku bin]$ python
Python 3.5.4 | packaged by conda-forge | (default, Nov 4 2017, 10:11:29)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import keras
Using Theano backend.
>>> quit()
(deep_emotion) [jalal@goku bin]$ export KERAS_BACKEND=tensorflow
(deep_emotion) [jalal@goku bin]$ python
Python 3.5.4 | packaged by conda-forge | (default, Nov 4 2017, 10:11:29)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import keras
Using TensorFlow backend.
2017-11-20 17:49:18.666294: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 17:49:18.666337: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 17:49:18.666347: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 17:49:18.666354: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 17:49:18.666363: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 17:49:19.196610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties:
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.6705
pciBusID 0000:05:00.0
Total memory: 10.91GiB
Free memory: 158.06MiB
2017-11-20 17:49:19.426132: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x42e9db0 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-11-20 17:49:19.426768: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 1 with properties:
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.6705
pciBusID 0000:06:00.0
Total memory: 10.91GiB
Free memory: 398.44MiB
2017-11-20 17:49:19.427277: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 1
2017-11-20 17:49:19.427309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y Y
2017-11-20 17:49:19.427323: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 1: Y Y
2017-11-20 17:49:19.427347: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:05:00.0)
2017-11-20 17:49:19.427362: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:06:00.0)
2017-11-20 17:49:19.429776: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to allocate 158.06M (165740544 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
>>> quit()
(deep_emotion) [jalal@goku bin]$ conda list | grep keras
keras 2.0.9 py35_0 conda-forge
(deep_emotion) [jalal@goku bin]$ conda list | grep tensorflow
tensorflow-gpu 1.3.0 0
tensorflow-gpu-base 1.3.0 py35cuda8.0cudnn6.0_1
tensorflow-tensorboard 0.1.5 py35_0
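One detail in the log above is worth spelling out: the failed allocation of 165740544 bytes is the same figure as the `Free memory: 158.06MiB` reported for device 0. That suggests the card was almost entirely occupied before this Python session asked for anything. A quick check of the arithmetic, with the numbers copied from the log (the variable names are only for illustration):

```python
# Numbers copied from the TensorFlow log in the question.
failed_bytes = 165740544      # size of the allocation that failed
total_gib = 10.91             # "Total memory" reported for device 0
free_mib = 158.06             # "Free memory" reported for device 0

MIB = 1024 ** 2               # TensorFlow's "M" here is a binary mebibyte

failed_mib = failed_bytes / MIB
fraction_free = free_mib / (total_gib * 1024)

print("failed allocation: {} MiB".format(failed_mib))
print("free share of device 0: {:.1%}".format(fraction_free))
```

The failed request works out to 158.0625 MiB, which rounds to the 158.06 shown in both the error message and the free-memory line.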
System information:
$ uname -a
Linux goku.bu.edu 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
and
(deep_emotion) [jalal@goku bin]$ nvidia-smi
Mon Nov 20 17:51:50 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81 Driver Version: 384.81 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:05:00.0 On | N/A |
| 0% 25C P8 19W / 250W | 10862MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:06:00.0 Off | N/A |
| 0% 36C P8 19W / 250W | 10622MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2062 G /usr/bin/X 183MiB |
| 0 2779 G /usr/bin/gnome-shell 176MiB |
| 0 3298 C /cs/software/anaconda3/bin/python 10341MiB |
| 0 4350 G ...-token=2BC290A510039A38C05EF3ECBAA5E5E5 78MiB |
| 0 5212 G /usr/lib64/firefox/plugin-container 5MiB |
| 0 32257 G /proc/self/exe 64MiB |
| 1 3298 C /cs/software/anaconda3/bin/python 10611MiB |
+-----------------------------------------------------------------------------+
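The process table above shows where the memory went: PID 3298, a Python compute process, holds 10341 MiB on GPU 0 and 10611 MiB on GPU 1. The sketch below totals the listed usage per GPU and picks out the largest holder. It assumes the rows are pasted in exactly as printed above; `PROCESS_ROWS`, `ROW`, and `memory_by_gpu` are names made up for this example, not part of any NVIDIA tool.

```python
import re
from collections import defaultdict

# Process rows copied verbatim from the nvidia-smi output in the question.
PROCESS_ROWS = """\
| 0 2062 G /usr/bin/X 183MiB |
| 0 2779 G /usr/bin/gnome-shell 176MiB |
| 0 3298 C /cs/software/anaconda3/bin/python 10341MiB |
| 0 4350 G ...-token=2BC290A510039A38C05EF3ECBAA5E5E5 78MiB |
| 0 5212 G /usr/lib64/firefox/plugin-container 5MiB |
| 0 32257 G /proc/self/exe 64MiB |
| 1 3298 C /cs/software/anaconda3/bin/python 10611MiB |
"""

# Columns: GPU index, PID, type, process name, memory in MiB.
ROW = re.compile(r"^\|\s*(\d+)\s+(\d+)\s+(\S+)\s+(\S+)\s+(\d+)MiB\s*\|$")


def memory_by_gpu(text):
    """Group (MiB, PID, type, name) tuples by GPU index."""
    usage = defaultdict(list)
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            gpu, pid, kind, name, mib = match.groups()
            usage[int(gpu)].append((int(mib), int(pid), kind, name))
    return usage


usage = memory_by_gpu(PROCESS_ROWS)
for gpu, procs in sorted(usage.items()):
    total = sum(mib for mib, *_ in procs)
    mib, pid, kind, name = max(procs)  # tuples sort by MiB first
    print("GPU {}: {} MiB listed; largest is PID {} ({}) with {} MiB".format(
        gpu, total, pid, name, mib))
```

On both cards the largest entry is the same PID, so freeing memory means ending that process (or rebooting), not changing anything in the new session.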
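[Comments]:
1. Kick the other users off the machine. 2. Reboot. 3. Re-run your python/keras/tensorflow script, without running theano first.
[Answer 1]: Thanks to Robert Crovella for the suggestion. Rebooting the machine solved the problem: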
[jalal@goku ~]$ source activate deep_emotion
(deep_emotion) [jalal@goku ~]$ export KERAS_BACKEND=tensorflow
(deep_emotion) [jalal@goku ~]$ python
Python 3.5.4 | packaged by conda-forge | (default, Nov 4 2017, 10:11:29)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import keras
Using TensorFlow backend.
2017-11-20 18:43:28.424658: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 18:43:28.424690: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 18:43:28.424727: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 18:43:28.424734: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 18:43:28.424745: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-11-20 18:43:28.951509: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties:
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.6705
pciBusID 0000:05:00.0
Total memory: 10.91GiB
Free memory: 10.44GiB
2017-11-20 18:43:29.172079: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x31d6630 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-11-20 18:43:29.172825: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 1 with properties:
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.6705
pciBusID 0000:06:00.0
Total memory: 10.91GiB
Free memory: 10.75GiB
2017-11-20 18:43:29.173970: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 1
2017-11-20 18:43:29.174019: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y Y
2017-11-20 18:43:29.174034: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 1: Y Y
2017-11-20 18:43:29.174055: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:05:00.0)
2017-11-20 18:43:29.174070: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:06:00.0)
>>> import tensorflow
>>>
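The working session above selects the backend with `export KERAS_BACKEND=tensorflow` before starting Python. Keras 2.x reads that environment variable when it is first imported, so the same choice can be made from inside a script. This is a minimal sketch, and `select_keras_backend` is a hypothetical helper written for this example, not a Keras function. It does not free any GPU memory; it only keeps Theano from being loaded by accident.

```python
import os
import sys


def select_keras_backend(name="tensorflow"):
    """Set KERAS_BACKEND for this process.

    Returns True if keras had not been imported yet (so the setting
    will take effect), False if it was already too late.
    """
    in_time = "keras" not in sys.modules
    os.environ["KERAS_BACKEND"] = name
    return in_time


if not select_keras_backend("tensorflow"):
    print("keras was already imported; restart Python for the change to apply")

# After this point, `import keras` should report
# "Using TensorFlow backend." as in the session above.
```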