【Title】Tensorflow is not detecting my GPUs. What shall I do (May 2021)?

【Posted】2021-07-29 21:15:07

【Question】:

TF version: 2.4.1
CUDA version: 11.1

tf.test.is_gpu_available() -- returns --> FALSE
tf.test.is_built_with_cuda() -- returns --> TRUE

I tried reverting TF to 2.4.0, but that did not help.

I also tried:

$ pip uninstall tensorflow

$ pip install tensorflow-gpu

But nothing seems to work; TF simply does not detect my GPUs.

Edit 1:

Output of nvcc --version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
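
The toolkit release can be read out of this output programmatically, for example to compare it with the CUDA version a particular TensorFlow build was compiled against. The following is a minimal sketch; the helper name `parse_nvcc_release` is hypothetical. Note that nvcc reports the installed toolkit version, whereas the "CUDA Version" field in nvidia-smi reports the highest CUDA version the driver supports, so the two can legitimately differ.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Return the CUDA toolkit release as a (major, minor) tuple, or None if absent."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Two lines copied from the nvcc output shown above.
sample = (
    "nvcc: NVIDIA (R) Cuda compiler driver\n"
    "Cuda compilation tools, release 11.1, V11.1.105\n"
)
print(parse_nvcc_release(sample))  # (11, 1)
```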

Output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3090    Off  | 00000000:01:00.0 Off |                  N/A |
| 30%   35C    P8    23W / 300W |     23MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 3090    Off  | 00000000:43:00.0 Off |                  N/A |
| 30%   40C    P8    27W / 300W |      5MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 3090    Off  | 00000000:81:00.0 Off |                  N/A |
| 64%   63C    P2   179W / 300W |  24043MiB / 24268MiB |     59%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2362      G   /usr/lib/xorg/Xorg                  9MiB |
|    0   N/A  N/A      2564      G   /usr/bin/gnome-shell               12MiB |
|    1   N/A  N/A      2362      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      2362      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A     14304      C   python3                         24035MiB |
+-----------------------------------------------------------------------------+

When running tf.test.is_gpu_available(), I get the following warnings:

WARNING:tensorflow:From Spell_correction.py:35: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2021-05-07 21:46:21.855460: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-05-07 21:46:21.856690: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:43:00.0 name: GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2021-05-07 21:46:21.856716: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-05-07 21:46:21.856735: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-05-07 21:46:21.856747: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-05-07 21:46:21.856759: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-05-07 21:46:21.856771: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-05-07 21:46:21.856829: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.1/lib64
2021-05-07 21:46:21.856846: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-05-07 21:46:21.856856: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-05-07 21:46:21.856863: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-05-07 21:46:21.942589: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-05-07 21:46:21.942626: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 
2021-05-07 21:46:21.942633: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N 
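
The decisive line in a log like this is the `Could not load dynamic library` warning, which is easy to miss among the successful loads. As a rough sketch (the function name `missing_libraries` and the regular expressions are illustrative assumptions based only on the log format shown above), the failed libraries and the search path TensorFlow reported can be extracted like this:

```python
import re

# Patterns modelled on the dso_loader warning format seen in the log above.
MISSING_LIB = re.compile(r"Could not load dynamic library '([^']+)'")
SEARCH_PATH = re.compile(r"LD_LIBRARY_PATH: (\S+)")


def missing_libraries(log_text):
    """Return (library, reported LD_LIBRARY_PATH or None) for each failed load."""
    results = []
    for line in log_text.splitlines():
        lib = MISSING_LIB.search(line)
        if lib:
            path = SEARCH_PATH.search(line)
            results.append((lib.group(1), path.group(1) if path else None))
    return results


log = (
    "2021-05-07 21:46:21.856829: W tensorflow/stream_executor/platform/default/"
    "dso_loader.cc:60] Could not load dynamic library 'libcusolver.so.10'; "
    "dlerror: libcusolver.so.10: cannot open shared object file: No such file "
    "or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.1/lib64\n"
    "2021-05-07 21:46:21.856856: I tensorflow/stream_executor/platform/default/"
    "dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8\n"
)
print(missing_libraries(log))
# [('libcusolver.so.10', '/usr/local/cuda-11.1/lib64')]
```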

Another observation:

PyTorch detects the GPUs, whereas TF does not.

torch.cuda.is_available() --> TRUE
tf.test.is_gpu_available() --> FALSE
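
The deprecation warning in the log above recommends `tf.config.list_physical_devices('GPU')` in place of `tf.test.is_gpu_available()`. A minimal sketch of that check follows; the wrapper `visible_gpus` is a hypothetical helper, and the import guard is only there so the snippet also runs in an environment without TensorFlow.

```python
def visible_gpus():
    """Return names of GPUs TensorFlow can see, or None if TensorFlow is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # An empty list means TensorFlow imported fine but registered no GPU devices.
    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = visible_gpus()
if gpus is None:
    print("TensorFlow is not installed in this environment")
else:
    print(f"{len(gpus)} GPU(s) visible to TensorFlow: {gpus}")
```

An empty list here, combined with a working `torch.cuda.is_available()`, points at TensorFlow's own CUDA library requirements rather than at the driver.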

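【Comments】:

What is the output of (nvcc --version)?

Please run some TensorFlow code and include its output in your question. That output contains the key information, such as which CUDA libraries were loaded and whether your GPUs were detected. Any other information is useless.

2.4.1 uses CUDA 11.0. You cannot use CUDA 11.1 as a drop-in replacement for CUDA 11.0.

@RobertCrovella, I checked with TF: 2.4.1, CUDA: 11.0, cuDNN: 8, but the result is still the same as above.

Do you mean you still see output like this: Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.1/lib64? In that case you have not set your LD_LIBRARY_PATH correctly.

【Answer 1】: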

If you are using Ubuntu 20.04, I suggest following the steps here. I ran into the same problem recently.

You have

NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3090    Off  | 00000000:01:00.0 Off |                  N/A |
| 30%   35C    P8    23W / 300W |     23MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A

Try getting the latest versions: NVIDIA driver 465 and CUDA 11.3. In my case, nvidia-smi shows the following:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |

What I did:

(1) I completely uninstalled NVIDIA and CUDA (see here). Be careful with this step.

sudo apt-get remove --purge '^nvidia-.*'
sudo apt-get install ubuntu-desktop
sudo rm /etc/X11/xorg.conf
echo 'nouveau' | sudo tee -a /etc/modules

(2) I downloaded the NVIDIA driver as a .run file and simply ran sudo bash NVIDIA*.run

(3) I downloaded cuDNN and did the following, as described here:

tar -xzvf cudnn-11.3-*.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
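
After copying, it can be worth confirming that the library files really are in a directory the loader will search. The sketch below is an assumption-laden illustration: `find_library` is a hypothetical helper that only scans the directories you pass in (for example the value of LD_LIBRARY_PATH), whereas the real dynamic loader also consults other locations such as the ldconfig cache.

```python
import os
import tempfile


def find_library(prefix, search_path):
    """Return full paths of files whose names start with `prefix`.

    `search_path` is a list of directories joined by os.pathsep (':' on Linux),
    the same layout LD_LIBRARY_PATH uses.
    """
    hits = []
    for directory in search_path.split(os.pathsep):
        if directory and os.path.isdir(directory):
            for name in sorted(os.listdir(directory)):
                if name.startswith(prefix):
                    hits.append(os.path.join(directory, name))
    return hits


# Demonstration with a throwaway directory standing in for /usr/local/cuda/lib64.
with tempfile.TemporaryDirectory() as fake_lib_dir:
    open(os.path.join(fake_lib_dir, "libcudnn.so.8"), "w").close()
    print(find_library("libcudnn", fake_lib_dir))           # one hit
    print(find_library("libcusolver.so.10", fake_lib_dir))  # []
```

On a real machine the call would look like `find_library("libcusolver", os.environ.get("LD_LIBRARY_PATH", ""))`; an empty result for a library named in a TensorFlow warning suggests the path or the installed CUDA version is not the one TensorFlow expects.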

Also check your .bashrc file, as described here:

cd ~

gedit .bashrc   # or: nano .bashrc

# Add this at the end:

export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export PATH=/usr/local/cuda-11.3/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-11.3/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-11.3/targets/x86_64-linux${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
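
The export lines above are presumably meant to use the shell idiom `${VAR:+:${VAR}}`, which appends a colon and the old value only when the variable is already set and non-empty. This avoids a trailing colon, which would leave an empty entry in the search path. The logic can be mirrored in a few lines of Python; `prepend_to_path` is a hypothetical name used only for illustration.

```python
def prepend_to_path(new_dir, current):
    """Mimic NEW${VAR:+:${VAR}}: add ':' plus the old value only if it is non-empty."""
    return f"{new_dir}:{current}" if current else new_dir


# Variable previously unset or empty: no trailing colon is produced.
print(prepend_to_path("/usr/local/cuda-11.3/lib64", ""))
# /usr/local/cuda-11.3/lib64

# Variable already set: the new directory is placed in front of the old value.
print(prepend_to_path("/usr/local/cuda-11.3/lib64", "/opt/lib"))
# /usr/local/cuda-11.3/lib64:/opt/lib
```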

Then run pip install tensorflow-gpu==2.4.1

【Comments】:

Here is my nvcc --version output:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Mar_21_19:15:46_PDT_2021
Cuda compilation tools, release 11.3, V11.3.58
Build cuda_11.3.r11.3/compiler.29745058_0
