TensorFlow issue when running code with GPU (CUDA-11.0) on Ubuntu 20.04 LTS


Posted: 2021-06-05 21:41:30

Question:

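Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory

Can someone help me resolve the error above?

It appears when I run the following code: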

import tensorflow as tf
if __name__ == '__main__':
    print(tf.test.is_built_with_cuda())
    print(tf.config.list_physical_devices('GPU'))

I get the following log output:

2021-03-07 23:47:41.236741: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
True
2021-03-07 23:47:41.953930: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-03-07 23:47:41.954322: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
[]
2021-03-07 23:47:41.981245: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-03-07 23:47:41.981758: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:06:00.0 name: GeForce GTX 970 computeCapability: 5.2
coreClock: 1.329GHz coreCount: 13 deviceMemorySize: 3.94GiB deviceMemoryBandwidth: 208.91GiB/s
2021-03-07 23:47:41.981769: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-03-07 23:47:41.983137: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-03-07 23:47:41.983159: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-03-07 23:47:41.984153: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-03-07 23:47:41.984274: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-03-07 23:47:41.985206: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-03-07 23:47:41.985276: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory
2021-03-07 23:47:41.985339: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-03-07 23:47:41.985344: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...

Process finished with exit code 0
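
In a log this long, the one warning that matters is easy to miss among the "Successfully opened" lines. The following is a minimal sketch that pulls out the names of libraries TensorFlow reported as unloadable. The function name `missing_libraries` is hypothetical, and the pattern assumes the exact `Could not load dynamic library '...'` wording shown in the log above, which may differ in other TensorFlow versions.

```python
import re

# Matches the dso_loader warning wording seen in the log above (an assumption
# about the message format; other TensorFlow versions may phrase it differently).
_MISSING_LIB = re.compile(r"Could not load dynamic library '([^']+)'")


def missing_libraries(log_text):
    """Return the library names reported as unloadable, in order, without duplicates."""
    seen = []
    for name in _MISSING_LIB.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


if __name__ == "__main__":
    sample = (
        "I dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10\n"
        "W dso_loader.cc:60] Could not load dynamic library 'libcusparse.so.11'; "
        "dlerror: libcusparse.so.11: cannot open shared object file: "
        "No such file or directory\n"
    )
    print(missing_libraries(sample))
```

Applied to the log above, this should leave libcusparse.so.11 as the only library to investigate.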

I manually checked the folder /usr/local/cuda-11.0/lib64, and the file mentioned in the log, libcusparse.so.11, is indeed not there.
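
The manual check can also be scripted. The sketch below lists any files in a directory whose names start with a given stem, shows where each one resolves to (useful for spotting a missing or dangling symlink), and asks the dynamic loader whether it can open the library by name. The helper names `find_library_files` and `can_dlopen` are hypothetical; the directory is the one from the question and may be different on other installations.

```python
import ctypes
import glob
import os


def find_library_files(lib_dir, stem):
    """List files in lib_dir whose names start with stem (e.g. 'libcusparse.so')."""
    return sorted(glob.glob(os.path.join(lib_dir, stem + "*")))


def can_dlopen(name):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        return False


if __name__ == "__main__":
    lib_dir = "/usr/local/cuda-11.0/lib64"  # path taken from the question
    for path in find_library_files(lib_dir, "libcusparse.so"):
        # realpath shows the file a symlink such as libcusparse.so.11 points to.
        print(path, "->", os.path.realpath(path))
    print("loadable:", can_dlopen("libcusparse.so.11"))
```

An empty listing together with `loadable: False` would match the warning in the log. A listing that contains the file but still fails to load would instead suggest that the directory is not on the loader's search path.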


I followed the official TensorFlow installation steps (link).

Environment:

Operating system: Ubuntu 20.04.2 LTS
GPU: GeForce GTX 970
Driver version: 450.102.04
CUDA toolkit: V11.0
cuDNN: V8.0.4.30 (not entirely sure how to check this)
Python: V3.7 in an Anaconda venv
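
On the cuDNN version: as far as I know, cuDNN 8 defines its version as the macros CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL in a header named cudnn_version.h. The sketch below reads those three values from the header text. The function name `parse_cudnn_version` is hypothetical, and the header path in the usage part is an assumption that depends on how cuDNN was installed (for example, it may sit under the CUDA include directory instead).

```python
import re


def parse_cudnn_version(header_text):
    """Extract (major, minor, patch) from cuDNN version header text, or None if absent."""
    parts = []
    for key in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(
            r"^\s*#define\s+" + key + r"\s+(\d+)", header_text, re.MULTILINE
        )
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


if __name__ == "__main__":
    # Assumed location; adjust to wherever cudnn_version.h lives on your system.
    path = "/usr/include/cudnn_version.h"
    try:
        with open(path) as fh:
            print(parse_cudnn_version(fh.read()))
    except OSError as exc:
        print("could not read", path, exc)
```

This only reports the version of the header that was found; if several cuDNN copies are installed, the library TensorFlow actually loads may be a different one.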

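Comments:

Your CUDA toolkit installation is broken somehow. According to the page linked here, you should have cuSparse 11.1 and a libcusparse.so.11 symlink. If you don't, something is broken.

How can I check this? If it is broken, should I completely remove the CUDA files from the PC (which I have already done with the "purge" command) and reinstall CUDA, or should I reinstall Ubuntu entirely?

Answer 1: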

Simply reinstalling Ubuntu and then using the Lambda Stack "one-liner" solved the problem.

LAMBDA_REPO=$(mktemp) && \
wget -O$LAMBDA_REPO https://lambdalabs.com/static/misc/lambda-stack-repo.deb && \
sudo dpkg -i $LAMBDA_REPO && rm -f $LAMBDA_REPO && \
sudo apt-get update && sudo apt-get install -y lambda-stack-cuda
sudo reboot

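Comments:

Which Ubuntu version do you have? Try running the commands above one at a time (each part up to the && separator) and let us know which one fails.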
