Cuda Runtime Error cudaErrorNoDevice: no CUDA-capable device is detected
【Posted】: 2019-03-18 10:04:29
【Question】: I am using Chainer and CuPy for CUDA 8.0. I am trying to train a machine learning model with a python3.5 script, but I get this error:
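cupy.cuda.runtime.CUDARuntimeError: cudaErrorNoDevice: no CUDA-capable device is detected
What can I do to fix it?
Environment details of the machine on which I am trying to train the deep learning model, including the output of nvidia-smi, echo $CUDA_PATH, and echo $LD_LIBRARY_PATH: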
root@awsml04:~# nvidia-smi
Thu Mar 21 10:37:19 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130                Driver Version: 384.130                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   38C    P0    24W / 300W |      0MiB / 16152MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Checking the CUDA path:
root@awsml04:~# echo $CUDA_PATH
/usr/local/cuda/bin:/usr/local/cuda-9.0
Checking LD_LIBRARY_PATH:
root@awsml04:~# echo $LD_LIBRARY_PATH
/usr/local/cuda/lib64LD_LIBRARY_PATH:+:/usr/local/cuda-9.0/lib64:/usr/local/cuda/lib64LD_LIBRARY_PATH:+:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:
Checking env | grep CUDA:
root@awsml04:~# env | grep CUDA
CUDA_PATH=/usr/local/cuda/bin:
LD_LIBRARY_PATH_WITH_DEFAULT_CUDA=/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda-9.0/lib/:
LD_LIBRARY_PATH_WITHOUT_CUDA=/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:
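The values printed above look malformed: CUDA_PATH holds two colon-separated entries, and LD_LIBRARY_PATH contains the literal text "LD_LIBRARY_PATH" and a stray "+", which suggests a shell expansion that was never evaluated. The following is a minimal sketch of a check for these patterns. The function name `audit_path_var` and its rules are illustrative assumptions, not part of CuPy or Chainer, and it assumes CUDA_PATH is meant to name a single toolkit root.

```python
import os


def audit_path_var(name, value, single=False, check_exists=False):
    """Return a list of human-readable problems found in a ':'-separated variable.

    `single=True` means the variable is expected to hold exactly one directory
    (assumed here for CUDA_PATH). All rules are heuristics for illustration.
    """
    problems = []
    entries = [entry for entry in value.split(":") if entry]
    if single and len(entries) > 1:
        problems.append(
            "%s should name one directory but has %d entries" % (name, len(entries))
        )
    for entry in entries:
        if "LD_LIBRARY_PATH" in entry or entry == "+":
            # Leftover text from a shell expansion that was not evaluated.
            problems.append("suspicious entry %r in %s" % (entry, name))
        elif check_exists and not os.path.isdir(entry):
            problems.append("missing directory %r in %s" % (entry, name))
    return problems


if __name__ == "__main__":
    for var, is_single in (("CUDA_PATH", True), ("LD_LIBRARY_PATH", False)):
        for problem in audit_path_var(var, os.environ.get(var, ""), single=is_single):
            print(problem)
```

Run against the values shown above, this sketch would flag CUDA_PATH once and each "...LD_LIBRARY_PATH" and "+" fragment separately. Whether these variables caused the error is not established here.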
Checking the python3 path:
root@awsml04:~# which python3
/usr/bin/python3
Checking the pip path:
root@awsml04:~# which pip3
/usr/bin/pip3
Checking the installed Python libraries and their versions:
root@awsml04:~# pip3 freeze
absl-py==0.7.1
alabaster==0.7.12
alembic==1.0.8
appdirs==1.4.3
APScheduler==3.5.3
astor==0.7.1
astroid==2.1.0
awscli==1.16.76
Babel==2.6.0
backcall==0.1.0
beautifulsoup4==4.4.1
bleach==1.5.0
blinker==1.3
bokeh==1.0.3
boto==2.49.0
boto3==1.9.72
botocore==1.12.72
certifi==2018.11.29
chainer==5.3.0
chainerui==0.3.0
chardet==3.0.4
Click==7.0
cloud-init==18.5
cloudpickle==0.6.1
colorama==0.3.9
command-not-found==0.3
configobj==5.0.6
cpplint==1.3.0
cryptography==1.2.3
cycler==0.10.0
dask==1.0.0
decorator==4.3.0
defer==1.0.6
defusedxml==0.5.0
docutils==0.14
easydict==1.9
entrypoints==0.2.3
enum34==1.1.6
environment-kernels==1.1.1
fastrlock==0.4
filelock==2.0.13
Flask==1.0.2
future==0.17.1
gast==0.2.2
glog==0.3.1
graphviz==0.10.1
grpcio==1.19.0
h5py==2.7.1
hibagent==1.0.1
html5lib==0.9999999
idna==2.8
imagesize==1.1.0
ipykernel==5.1.0
ipyparallel==6.2.3
ipython==7.2.0
ipython-genutils==0.2.0
ipywidgets==7.4.2
isort==4.3.4
itsdangerous==1.1.0
jedi==0.13.2
Jinja2==2.10
jmespath==0.9.3
jsonpatch==1.10
jsonpointer==1.9
jsonschema==2.6.0
jupyter==1.0.0
jupyter-client==5.2.4
jupyter-console==6.0.0
jupyter-core==4.4.0
Keras==2.2.4
Keras-Applications==1.0.7
Keras-Preprocessing==1.0.9
kiwisolver==1.0.1
language-selector==0.1
lazy-object-proxy==1.3.1
lxml==3.5.0
Mako==1.0.7
Markdown==2.6.10
MarkupSafe==1.1.0
matplotlib==3.0.2
mccabe==0.6.1
mistune==0.8.4
mock==2.0.0
msgpack==0.6.1
nbconvert==5.4.0
nbformat==4.4.0
networkx==2.2
nose==1.3.7
notebook==5.7.4
numpy==1.15.1
oauthlib==1.0.3
olefile==0.44
opencv-python==3.4.1.15
packaging==18.0
pandas==0.23.4
pandocfilters==1.4.2
parso==0.3.1
pbr==5.1.3
pexpect==4.6.0
pickleshare==0.7.5
Pillow==4.3.0
prettytable==0.7.2
prometheus-client==0.5.0
prompt-toolkit==2.0.7
protobuf==3.7.0
ptyprocess==0.6.0
pyasn1==0.4.5
pycups==1.9.73
pycurl==7.43.0
pydot==1.4.1
pygal==2.4.0
Pygments==2.3.1
pygobject==3.20.0
PyJWT==1.3.0
pylint==2.2.2
pyparsing==2.2.0
pyserial==3.0.1
python-apt==1.1.0b1+ubuntu0.16.4.2
python-dateutil==2.6.1
python-debian==0.1.27
python-editor==1.0.4
python-gflags==3.1.2
python-systemd==231
pytz==2017.3
PyWavelets==1.0.1
pyxdg==0.25
PyYAML==3.13
pyzmq==17.1.2
qtconsole==4.4.3
requests==2.21.0
roman==2.0.0
rsa==3.4.2
s3transfer==0.1.13
scikit-image==0.14.1
scikit-learn==0.20.2
scipy==1.2.0
screen-resolution-extra==0.0.0
seaborn==0.9.0
Send2Trash==1.5.0
six==1.12.0
snowballstemmer==1.2.1
Sphinx==1.8.3
sphinx-rtd-theme==0.1.9
sphinxcontrib-websupport==1.1.0
SQLAlchemy==1.3.1
ssh-import-id==5.5
system-service==0.3
tensorboard==1.12.2
tensorflow==1.12.0
tensorflow-estimator==1.13.0
tensorflow-gpu==1.12.0
tensorflow-tensorboard==0.4.0rc3
termcolor==1.1.0
terminado==0.8.1
testpath==0.4.2
toolz==0.9.0
tornado==5.1.1
tqdm==4.19.5
traitlets==4.3.2
typed-ast==1.1.1
tzlocal==1.5.1
ufw==0.35
unattended-upgrades==0.1
urllib3==1.24.1
virtualenv==15.0.1
wcwidth==0.1.7
webencodings==0.5.1
Werkzeug==0.13
widgetsnbextension==3.4.2
wrapt==1.10.11
xkit==0.0.0
Chainer CUDA info:
root@awsml04:~# python3 -c "import chainer; print(chainer.print_runtime_info())"
/usr/lib/python3.5/site-packages/chainer/backends/cuda.py:98: UserWarning: cuDNN is not enabled.
Please reinstall CuPy after you install cudnn
(see https://docs-cupy.chainer.org/en/stable/install.html#install-cudnn).
'cuDNN is not enabled.\n'
/usr/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
Platform: Linux-4.4.0-1077-aws-x86_64-with-Ubuntu-16.04-xenial
Chainer: 5.3.0
NumPy: 1.15.1
CuPy:
CuPy Version : 5.3.0
CUDA Root : /usr/local/cuda/bin:/usr/local/cuda-9.0
CUDA Build Version : 9000
CUDA Driver Version : 9000
CUDA Runtime Version : 9000
cuDNN Build Version : None
cuDNN Version : None
NCCL Build Version : 2307
NCCL Runtime Version : 2307
iDeep: Not Available
None
root@awsml04:~# python3 -c "import cupy; print(cupy.empty((3, 3)))"
[[0. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]
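The one-liner above shows that the system python3 can allocate on the GPU. A slightly more informative probe is sketched below: it reports how many devices the CUDA runtime sees and the value of CUDA_VISIBLE_DEVICES, which can hide devices from a process when set. The helper name `visible_gpu_report` is hypothetical. It relies only on `cupy.cuda.runtime.getDeviceCount()` and the `CUDARuntimeError` class that appears in the error message, and it returns a placeholder when CuPy cannot be imported.

```python
import os


def visible_gpu_report():
    """Collect what this interpreter can see of the GPU, without raising."""
    report = {
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "device_count": None,
        "error": None,
    }
    try:
        import cupy
    except ImportError as exc:
        report["error"] = "cupy is not importable: %s" % exc
        return report
    try:
        report["device_count"] = cupy.cuda.runtime.getDeviceCount()
    except cupy.cuda.runtime.CUDARuntimeError as exc:
        # cudaErrorNoDevice and similar runtime errors end up here.
        report["device_count"] = 0
        report["error"] = str(exc)
    return report


if __name__ == "__main__":
    for key, value in visible_gpu_report().items():
        print(key, "=", value)
```

Running it with the same interpreter and environment that launch the training script is the point of the exercise, since a result from a different interpreter says little about the failing process.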
Full error traceback:
stacktrace.py
Exception in main training loop: cudaErrorNoDevice: no CUDA-capable device is detected
Traceback (most recent call last):
File "/root/.see-master/lib/python3.5/site-packages/chainer/training/trainer.py", line 302, in run
entry.extension(self)
File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
self.gen.throw(type, value, traceback)
File "/root/.see-master/lib/python3.5/site-packages/chainer/reporter.py", line 98, in scope
yield
File "/root/.see-master/lib/python3.5/site-packages/chainer/training/trainer.py", line 299, in run
update()
File "/root/.see-master/lib/python3.5/site-packages/chainer/training/updater.py", line 223, in update
self.update_core()
File "/root/.see-master/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 195, in update_core
self.setup_workers()
File "/root/.see-master/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 186, in setup_workers
with cuda.Device(self._devices[0]):
File "cupy/cuda/device.pyx", line 106, in cupy.cuda.device.Device.__enter__
File "cupy/cuda/runtime.pyx", line 164, in cupy.cuda.runtime.getDevice
File "cupy/cuda/runtime.pyx", line 136, in cupy.cuda.runtime.check_status
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
File "chainer/train_svhn.py", line 258, in <module>
trainer.run()
File "/root/.see-master/lib/python3.5/site-packages/chainer/training/trainer.py", line 313, in run
six.reraise(*sys.exc_info())
File "/usr/lib/python3.5/site-packages/six.py", line 693, in reraise
raise value
File "/root/.see-master/lib/python3.5/site-packages/chainer/training/trainer.py", line 302, in run
entry.extension(self)
File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
self.gen.throw(type, value, traceback)
File "/root/.see-master/lib/python3.5/site-packages/chainer/reporter.py", line 98, in scope
yield
File "/root/.see-master/lib/python3.5/site-packages/chainer/training/trainer.py", line 299, in run
update()
File "/root/.see-master/lib/python3.5/site-packages/chainer/training/updater.py", line 223, in update
self.update_core()
File "/root/.see-master/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 195, in update_core
self.setup_workers()
File "/root/.see-master/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 186, in setup_workers
with cuda.Device(self._devices[0]):
File "cupy/cuda/device.pyx", line 106, in cupy.cuda.device.Device.__enter__
File "cupy/cuda/runtime.pyx", line 164, in cupy.cuda.runtime.getDevice
File "cupy/cuda/runtime.pyx", line 136, in cupy.cuda.runtime.check_status
cupy.cuda.runtime.CUDARuntimeError: cudaErrorNoDevice: no CUDA-capable device is detected
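The traceback loads Chainer from /root/.see-master/lib/python3.5/site-packages, while the earlier runtime-info check imported it from /usr/lib/python3.5/site-packages, so the training script may be running in a different environment from the one that was inspected. A minimal sketch for confirming which files an interpreter would import is shown below. The helper name `module_origins` is an assumption for illustration and uses only the standard library.

```python
import importlib.util
import sys


def module_origins(names):
    """Map each module name to the file it would be imported from, or None."""
    origins = {}
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            spec = None
        origins[name] = spec.origin if spec is not None else None
    return origins


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    for name, origin in module_origins(["chainer", "cupy", "six"]).items():
        print(name, "->", origin)
```

Executing this with the interpreter that runs train_svhn.py and comparing the paths with those from /usr/bin/python3 would show whether two separate installations are involved.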
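【Comments】:
Does your machine have an NVIDIA GPU?
Yes, an NVIDIA Tesla V100, since I am using an AWS p3.2xlarge instance.
CUDA 8.0 does not support Volta (V100). Use CUDA 9.0 or later.
Yes, I am using CUDA 9.0.
@HarshalBhamare Your first sentence contains the words "I am using" and "CUDA 8.0".
【Answer 1】: There is not enough information to guess the cause of the error, but I suggest you try the following.
Important: do not log out, detach, or close your shell until you have completed all of the steps below.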
$ export CUDA_PATH=/usr/local/cuda-9.0
$ export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64
$ pip3 uninstall -y chainer cupy cupy-cuda80 cupy-cuda90 cupy-cuda92
$ pip3 install cupy-cuda90 --no-cache-dir && pip3 install chainer --no-cache-dir
$ git clone https://github.com/chainer/chainer.git && cd chainer && git checkout v5.3.0
$ python3 examples/mnist/train_mnist.py --gpu 0
If that works, try running your own script again.
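The uninstall command above removes every CuPy variant before installing a single wheel, presumably because several CuPy packages installed side by side can conflict. As a rough sketch, the installed variants can be listed before and after the reinstall. The helper name `installed_cupy_packages` is hypothetical, the package names are the ones from the uninstall command, and `importlib.metadata` requires Python 3.8 or later (newer than the Python 3.5 used in the question).

```python
from importlib import metadata

# Package names taken from the uninstall command above.
CUPY_NAMES = ("cupy", "cupy-cuda80", "cupy-cuda90", "cupy-cuda92")


def installed_cupy_packages(names=CUPY_NAMES):
    """Return {package name: version} for each listed package that is installed."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
    return found


if __name__ == "__main__":
    packages = installed_cupy_packages()
    print(packages)
    if len(packages) > 1:
        print("More than one CuPy package is installed; consider keeping only one.")
```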
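【Comments】:
How should I install Chainer? You gave a way to test from source; does that apply to the last two Chainer commands?
Sorry, I fixed the comment to use the GPU. By the way, did the last two commands work?
Important: this problem may already be solved, because no GPU-related error appears in the asker's current environment (github.com/chainer/chainer/issues/6596).
Yes, you are right. Thanks to your support, a different error now occurs. Sir, could you help me with that error?
1. It is better to post a separate question for a separate problem. 2. If you have overcome the problem this question is about, please mark it as solved.
【Answer 2】: On my side, my real code (somewhat complicated, with a bunch of imports) ran into this error: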
import A
import B
import cupy as cp
import ...
def main(...):
    ...(bunch of operations)...
    an_array = cp.zeros((10, 10, 10), dtype=cp.float64)
    cp.cuda.Stream.null.synchronize()  # <- Failed here, obtained: ..."cudaErrorNoDevice: no CUDA-capable device is detected"...
    ...
However, when I run a simple test.py like this, it works as expected:
import cupy as cp
x_gpu = cp.zeros((10, 10, 10), dtype=cp.float64)
cp.cuda.Stream.null.synchronize() # <- Now OK!
So after a few tests, I realized that my original code passes if I allocate an arbitrary, otherwise unused array before the real code runs:
import A
import B
import cupy as cp
import ...
useless_array_hack = cp.zeros((10, 10, 10), dtype=cp.float64)  # I guess this allows the code to load useful resources (like DLLs) that will be used by the real code as well
def main(...):
    ...(bunch of operations)...
    an_array = cp.zeros((10, 10, 10), dtype=cp.float64)
    cp.cuda.Stream.null.synchronize()  # Now OK!
    ...
This is not a perfect solution, but it serves its purpose.
Environment notes:
Windows 10, Python 3.8, one available GPU, cupy and cudatoolkit installed in a conda virtual environment.
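The workaround in this answer amounts to touching the GPU once, right after importing CuPy and before the rest of the program runs. One way to make that intent explicit is sketched below. The function name `warm_up_cuda` is hypothetical, and whether this addresses the underlying cause is not established by the answer. It uses only calls that appear in the answer (`cp.zeros`, `cp.cuda.Stream.null.synchronize`) plus `cp.cuda.Device`, which appears in the question's traceback.

```python
def warm_up_cuda(device_id=0):
    """Try a tiny allocation on the given device; return True if it succeeds.

    This mirrors the answer's "useless array" trick but reports failure
    instead of letting the error surface later in unrelated code.
    """
    try:
        import cupy as cp
    except ImportError:
        return False
    try:
        with cp.cuda.Device(device_id):
            cp.zeros((1,), dtype=cp.float64)
            cp.cuda.Stream.null.synchronize()
    except Exception:
        # Any CUDA initialization failure is reported as False in this sketch.
        return False
    return True


if __name__ == "__main__":
    print("CUDA ready:", warm_up_cuda())
```

Calling the helper immediately after `import cupy as cp`, before the other imports, would match the placement the answer found to work.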