RuntimeError: CUDA runtime implicit initialization on GPU:0 failed. Status: all CUDA-capable devices are busy or unavailable

Posted 2023-04-15

【Title】: RuntimeError: CUDA runtime implicit initialization on GPU:0 failed. Status: all CUDA-capable devices are busy or unavailable
【Posted】: 2021-07-12 22:29:19
【Question】:

Problem: when I run the following command

python -c "import tensorflow as tf; tf.test.is_gpu_available(); print('version :' + tf.__version__)"

Error:

RuntimeError: CUDA runtime implicit initialization on GPU:0 failed. Status: all CUDA-capable devices are busy or unavailable
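The command above calls tf.test.is_gpu_available(), which TensorFlow itself reports as deprecated in favor of tf.config.list_physical_devices('GPU'). A minimal sketch of that replacement follows; the helper name list_gpus is hypothetical, and the import guard is only there so the snippet also runs where TensorFlow is not installed. Listing a device is not the same as successfully running a kernel on it, so this check alone does not rule out the error in this question.

```python
def list_gpus():
    """Return the names of GPUs TensorFlow can enumerate, or None without TensorFlow."""
    try:
        import tensorflow as tf  # imported lazily so the sketch runs without TensorFlow
    except ImportError:
        return None
    # list_physical_devices is the replacement named in the deprecation warning.
    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = list_gpus()
if gpus is None:
    print("TensorFlow is not installed")
else:
    print("version: " + __import__("tensorflow").__version__)
    print("GPUs:", gpus)
```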

Details:

WARNING:tensorflow:From :1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.config.list_physical_devices('GPU') instead.
2021-04-18 21:02:51.839069: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2021-04-18 21:02:51.846775: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2500000000 Hz
2021-04-18 21:02:51.847076: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fc3bc000b20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-04-18 21:02:51.847104: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2021-04-18 21:02:51.849876: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-04-18 21:02:51.911161: W tensorflow/compiler/xla/service/platform_util.cc:210] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_UNKNOWN: unknown error
2021-04-18 21:02:51.911285: I tensorflow/compiler/jit/xla_gpu_device.cc:161] Ignoring visible XLA_GPU_JIT device. Device number is 0, reason: Internal: no supported devices found for platform CUDA
2021-04-18 21:02:51.911546: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-18 21:02:51.912210: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:00:07.0 name: GRID T4-4Q computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 3.97GiB deviceMemoryBandwidth: 298.08GiB/s
2021-04-18 21:02:51.912446: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2021-04-18 21:02:51.914362: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2021-04-18 21:02:51.916358: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2021-04-18 21:02:51.916679: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2021-04-18 21:02:51.918787: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2021-04-18 21:02:51.919993: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2021-04-18 21:02:51.924652: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-04-18 21:02:51.924792: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-18 21:02:51.925488: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-18 21:02:51.926100: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2021-04-18 21:02:51.926146: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
Traceback (most recent call last):
  File "", line 1, in
  File "/home/miniconda3/envs/py37/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "/home/miniconda3/envs/py37/lib/python3.7/site-packages/tensorflow/python/framework/test_util.py", line 1496, in is_gpu_available
    for local_device in device_lib.list_local_devices():
  File "/home/miniconda3/envs/py37/lib/python3.7/site-packages/tensorflow/python/client/device_lib.py", line 43, in list_local_devices
    _convert(s) for s in _pywrap_device_lib.list_devices(serialized_config)
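Most of the log above is informational; the one warning (the failed cuDevicePrimaryCtxRetain call) is the line that matters. A minimal sketch for pulling such lines out of a long TensorFlow log, assuming the "timestamp: LEVEL file:line] message" layout seen in this log. The names LOG_LINE and find_problems are hypothetical helpers, not part of TensorFlow.

```python
import re

# Matches lines such as:
# 2021-04-18 21:02:51.911161: W tensorflow/.../platform_util.cc:210] message
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): "
    r"(?P<level>[IWEF]) (?P<src>\S+?):(?P<line>\d+)\] (?P<msg>.*)$"
)


def find_problems(log_text):
    """Return (level, source, message) for W/E/F lines and any CUDA_ERROR message."""
    problems = []
    for raw in log_text.splitlines():
        match = LOG_LINE.match(raw.strip())
        if not match:
            continue  # skip the traceback and any continuation lines
        if match.group("level") in "WEF" or "CUDA_ERROR" in match.group("msg"):
            problems.append(
                (match.group("level"), match.group("src"), match.group("msg"))
            )
    return problems


# Three lines copied from the log in this question.
sample = """\
2021-04-18 21:02:51.849876: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-04-18 21:02:51.911161: W tensorflow/compiler/xla/service/platform_util.cc:210] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_UNKNOWN: unknown error
2021-04-18 21:02:51.912210: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
"""

for level, src, msg in find_problems(sample):
    print(level, src, msg)
```

Here the libraries load and the device is found, yet creating the primary context fails, which points at the driver or the virtual GPU setup rather than at a missing library.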

System information:

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): ubuntu 18.04
Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: cloud server
TensorFlow installed from (source or binary): source
TensorFlow version: 2.2.0
Python version: 3.7.7
Installed using virtualenv? pip? conda?: pip and conda
Bazel version (if compiling from source): 2..0.0
GCC/Compiler version (if compiling from source): 7.5
CUDA/cuDNN version: CUDA 10.1 & cuDNN 7.6.5
GPU model and memory:
00:07.0 VGA compatible controller: NVIDIA Corporation Device 1eb8 (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device 130e
Physical Slot: 7
Flags: bus master, fast devsel, latency 0, IRQ 37
Memory at fc000000 (32-bit, non-prefetchable) [size=16M]
Memory at e0000000 (64-bit, prefetchable) [size=256M]
Memory at fa000000 (64-bit, non-prefetchable) [size=32M]
I/O ports at c500 [size=128]
Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

I looked for solutions to this problem, but none of the following resolved it:

https://forums.developer.nvidia.com/t/all-cuda-capable-devices-are-busy-or-unavailable-what-is-wrong/112858

https://github.com/tensorflow/tensorflow/issues/41990

Tensorflow-GPU Error: "RuntimeError: CUDA runtime implicit initialization on GPU:0 failed. Status: all CUDA-capable devices are busy or unavailable"

https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#recommended-post

https://github.com/tensorflow/tensorflow/issues/48558

https://programmersought.com/article/94034772029/

【Comments】:

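Try running a CUDA sample code such as vectorAdd that actually uses the GPU. What result do you get?

OK, I tried to run vectorAdd by following this tutorial: olcf.ornl.gov/tutorials/cuda-vector-addition, and I got module: command not found and aprun: command not found. I also tried sudo apt-get install environment-modules, but that did not fix it. I ran vecAdd.out with ./vecAdd.out, and the output was final result: 0.000000

Yes, that wonderful tutorial has no error checking. It is almost useless. Run the CUDA sample code named vectorAdd (the same way you would run the CUDA sample code named deviceQuery). In any case, the tutorial indicates that the final result should be 1.000, so that code is not working correctly. Basically, you cannot run CUDA code correctly on that machine. It could be a broken CUDA installation, a GRID licensing problem, or something else.

OK, you are right. I went to /usr/local/cuda-10.1/samples/0_Simple/vectorAdd and ran ./vectorAdd, and I got: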

[Vector addition of 50000 elements] Failed to allocate device vector A (error code all CUDA-capable devices are busy or unavailable)!
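The check suggested in the comments (run a CUDA sample that reports its own errors, independent of TensorFlow) can be scripted. A minimal sketch, assuming the sample binary sits at the path mentioned above; run_cuda_sample and classify_sample_output are hypothetical helper names, and the status labels are made up for illustration.

```python
import subprocess

# Error text printed by the CUDA runtime in this question.
BUSY_MARKER = "all CUDA-capable devices are busy or unavailable"


def classify_sample_output(returncode, output):
    """Turn a sample's exit code and output into a coarse status label."""
    if BUSY_MARKER in output:
        return "cuda-unavailable"
    if returncode != 0:
        return "failed"
    return "ok"


def run_cuda_sample(binary):
    """Run a CUDA sample binary and return (status, combined output)."""
    try:
        proc = subprocess.run([binary], capture_output=True, text=True, timeout=60)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Binary missing, not executable, or hung.
        return "not-run", str(exc)
    output = proc.stdout + proc.stderr
    return classify_sample_output(proc.returncode, output), output


# Path taken from the comment above; adjust it for your installation.
status, output = run_cuda_sample(
    "/usr/local/cuda-10.1/samples/0_Simple/vectorAdd/vectorAdd"
)
print(status)
```

A "cuda-unavailable" result from a plain CUDA sample shows the problem lies below TensorFlow, in the driver, the CUDA installation, or the virtual GPU configuration.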

How can I tell whether this is caused by a broken CUDA installation, a GRID licensing problem, or something else?

【Answer 1】:

I can confirm what was mentioned in the comments.

I ran into the problem on an Ubuntu VM running on a VMware ESXi host, using a vGPU partition of an NVIDIA V100 GPU.

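I got the same error. I tried changing the CUDA version and downloading (pip) software compiled for that specific CUDA version, but that did not solve the problem. The error: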

tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: all CUDA-capable devices are busy or unavailable

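In my case, I had forgotten to set the license server in /etc/nvidia/grid.conf, and I got exactly the same error, so for me it was a GRID licensing problem. Fixing the grid configuration file and restarting solved it.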
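A minimal sketch of checking whether a license server is set in that file. It assumes the file uses simple key=value lines with a ServerAddress entry, which is how NVIDIA's legacy vGPU licensing configuration is laid out; both the key name and the default path are assumptions to verify against your driver's documentation. NVIDIA's documentation names the file /etc/nvidia/gridd.conf, so the path is a parameter. The function names are hypothetical.

```python
from pathlib import Path


def read_key_values(text):
    """Parse key=value lines, ignoring blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def license_server_configured(text, key="ServerAddress"):
    """True if the assumed license-server key is present and non-empty."""
    return bool(read_key_values(text).get(key))


def check_file(path="/etc/nvidia/grid.conf"):
    """Return True/False for an existing file, or None if the file is missing."""
    config = Path(path)
    if not config.is_file():
        return None
    return license_server_configured(config.read_text())


print(check_file())
```

A result of False or None on a vGPU guest would match the situation described in this answer; after editing the file, the answer's fix also required a restart.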
