Setting up a deepo deep-learning environment on Ubuntu 18.04 (CUDA + cuDNN + nvidia-docker + deepo): a very detailed walkthrough, with every error encountered and its fix

Posted ERROR_LESS


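0 Get basic information about the machine

0 Reference documents

The overall procedure mainly follows this post
1. Installing CUDA and cuDNN
2. Installing CUDA and cuDNN
3. Installing CUDA and cuDNN
4. Installing CUDA and cuDNN
1. Installing nvidia-docker2
2. Installing nvidia-docker2
Using deepo as a deep-learning environment (official, English)
Using deepo as a deep-learning environment (Chinese translation)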

1 GPU information

nvidia-smi

root@master:/home/hqc# ubuntu-drivers devices
	== /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0 ==
	modalias : pci:v000010DEd00001B06sv000010DEsd0000120Fbc03sc00i00
	vendor   : NVIDIA Corporation
	model    : GP102 [GeForce GTX 1080 Ti]
	driver   : nvidia-driver-460-server - distro non-free
	driver   : nvidia-driver-450-server - distro non-free
	driver   : nvidia-driver-390 - distro non-free
	driver   : nvidia-driver-418-server - distro non-free
	driver   : nvidia-driver-470 - distro non-free
	driver   : nvidia-driver-470-server - distro non-free
	driver   : nvidia-driver-460 - distro non-free
	driver   : nvidia-driver-495 - distro non-free recommended
	driver   : xserver-xorg-video-nouveau - distro free builtin

The output recommends driver version 495, so the driver does not need to be reinstalled.
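To pull the recommended driver out of that listing without reading it by eye, a small filter along these lines should work. It is only a sketch: the function name `recommended_driver` is made up here, and it assumes the output keeps the `driver : <package> - ... recommended` layout shown above.

```shell
# Print the driver package that `ubuntu-drivers devices` marks as recommended.
# Reads the command's output on stdin; prints nothing if no line is marked.
recommended_driver() {
    awk '$1 == "driver" && $NF == "recommended" { print $3 }'
}

# Usage on a machine with ubuntu-drivers installed:
#   ubuntu-drivers devices | recommended_driver
```

Comparing the printed package with the driver version reported by nvidia-smi is a quick way to decide whether a reinstall is needed.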

2 Check whether CUDA/cuDNN is already installed

root@master:/home/hqc# cat /usr/local/cuda/version.txt
	cat: /usr/local/cuda/version.txt: No such file or directory
	
root@master:/home/hqc# nvcc -V
	
	Command 'nvcc' not found, but can be installed with:
	
	apt install nvidia-cuda-toolkit

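Neither is installed; see this blog post.

3 A note on CUDA and cuDNN

The deepo image already bundles CUDA and cuDNN and comes with most deep-learning environments preconfigured.

So why install CUDA and cuDNN on the host as well?
Because local development needs them, or because you may receive a ready-made deep-learning program and want to test locally whether it runs.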
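The two checks above can be wrapped into one helper. This is a minimal sketch: the name `cuda_status` is made up here, and the extra look at `version.json` is an assumption based on newer CUDA 11 toolkits shipping that file in place of `version.txt`.

```shell
# Report whether a CUDA toolkit is visible, checking the same places as above.
# $1: toolkit prefix (defaults to /usr/local/cuda, the usual symlink).
cuda_status() {
    prefix=${1:-/usr/local/cuda}
    if [ -f "$prefix/version.txt" ] || [ -f "$prefix/version.json" ]; then
        echo "version file: found under $prefix"
    else
        echo "version file: missing under $prefix"
    fi
    if command -v nvcc >/dev/null 2>&1; then
        echo "nvcc: $(command -v nvcc)"
    else
        echo "nvcc: not on PATH"
    fi
}

# Usage: cuda_status            (checks /usr/local/cuda)
#        cuda_status /usr/local/cuda-11.5
```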

1 Install CUDA

Download from the NVIDIA website

Downloads for other CUDA versions

1 Download

root@master:/home/hqc# wget https://developer.download.nvidia.com/compute/cuda/11.5.1/local_installers/cuda_11.5.1_495.29.05_linux.run


The download is painfully slow 🤪

2 Run the installer

root@master:/home/hqc# sudo sh cuda_11.5.1_495.29.05_linux.run


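Choose continue.
Why this prompt appears: possibly because a CUDA 11.0 image was pulled earlier while verifying nvidia-docker2.

Type accept.

Note: do not install the driver a second time.
How: move the cursor to the Driver entry and press Enter to uncheck it, then choose Install.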
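The same choice (toolkit only, no driver) can probably be made without the menus. The sketch below relies on the runfile's `--silent` and `--toolkit` options; `--silent` also implies accepting the EULA. The wrapper name `install_toolkit_only` and the `RUN` switch are made up for illustration, and I have only used the interactive route myself, so treat this as untested on this machine.

```shell
# Build the non-interactive installer command: toolkit only, no driver.
# Prints the command by default; set RUN=1 to actually execute it.
install_toolkit_only() {
    runfile=$1
    if [ "${RUN:-0}" = "1" ]; then
        sudo sh "$runfile" --silent --toolkit
    else
        echo "sudo sh $runfile --silent --toolkit"
    fi
}

# Usage:
#   install_toolkit_only cuda_11.5.1_495.29.05_linux.run          # dry run
#   RUN=1 install_toolkit_only cuda_11.5.1_495.29.05_linux.run    # real install
```

Because `--driver` is left out, the installer should skip the driver exactly as unchecking it in the menu does.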

3 Success

root@master:/home/hqc# sudo sh cuda_11.5.1_495.29.05_linux.run
	===========
	= Summary =
	===========
	
	Driver:   Not Selected
	Toolkit:  Installed in /usr/local/cuda-11.5/
	Samples:  Installed in /root/, but missing recommended libraries
	
	Please make sure that
	 -   PATH includes /usr/local/cuda-11.5/bin
	 -   LD_LIBRARY_PATH includes /usr/local/cuda-11.5/lib64, or, add /usr/local/cuda-11.5/lib64 to /etc/ld.so.conf and run ldconfig as root
	
	To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-11.5/bin
	***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 495.00 is required for CUDA 11.5 functionality to work.
	To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
	    sudo <CudaInstaller>.run --silent --driver
	
	Logfile is /var/log/cuda-installer.log

Seeing this output means the toolkit was installed.

4 Configure

root@master:/home/hqc# vi ~/.bashrc

# Append these two lines to the end of the file
export PATH="/usr/local/cuda-11.5/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-11.5/lib64:$LD_LIBRARY_PATH"

# Source the file so the changes take effect
root@master:/home/hqc# source ~/.bashrc
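Editing ~/.bashrc by hand works, but rerunning a setup script would append the exports again. A possible way around that, sketched with a made-up helper name `add_line_once`:

```shell
# Append a line to a file only if that exact line is not already present.
add_line_once() {
    file=$1
    line=$2
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage for the two exports above (single quotes keep $PATH unexpanded):
#   add_line_once ~/.bashrc 'export PATH="/usr/local/cuda-11.5/bin:$PATH"'
#   add_line_once ~/.bashrc 'export LD_LIBRARY_PATH="/usr/local/cuda-11.5/lib64:$LD_LIBRARY_PATH"'
```

`grep -x` matches whole lines and `-F` treats the text literally, so the `$` and `/` characters in the export lines need no escaping.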

5 Verify

root@master:/home/hqc# cd /usr/local/cuda-11.5/samples/1_Utilities/deviceQuery

root@master:/usr/local/cuda-11.5/samples/1_Utilities/deviceQuery# sudo make
	/usr/local/cuda/bin/nvcc -ccbin g++ -I../../common/inc  -m64    --threads 0 --std=c++11 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86 -o deviceQuery.o -c deviceQuery.cpp
	nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
	/usr/local/cuda/bin/nvcc -ccbin g++   -m64      -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86 -o deviceQuery deviceQuery.o 
	nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
	mkdir -p ../../bin/x86_64/linux/release
	cp deviceQuery ../../bin/x86_64/linux/release

root@master:/usr/local/cuda-11.5/samples/1_Utilities/deviceQuery# ./deviceQuery
	./deviceQuery Starting...
	
	 CUDA Device Query (Runtime API) version (CUDART static linking)
	
	Detected 1 CUDA Capable device(s)
	
	Device 0: "NVIDIA GeForce GTX 1080 Ti"
	  CUDA Driver Version / Runtime Version          11.5 / 11.5
	  CUDA Capability Major/Minor version number:    6.1
	  Total amount of global memory:                 11178 MBytes (11721506816 bytes)
	  (028) Multiprocessors, (128) CUDA Cores/MP:    3584 CUDA Cores
	  GPU Max Clock rate:                            1582 MHz (1.58 GHz)
	  Memory Clock rate:                             5505 Mhz
	  Memory Bus Width:                              352-bit
	  L2 Cache Size:                                 2883584 bytes
	  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
	  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
	  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
	  Total amount of constant memory:               65536 bytes
	  Total amount of shared memory per block:       49152 bytes
	  Total shared memory per multiprocessor:        98304 bytes
	  Total number of registers available per block: 65536
	  Warp size:                                     32
	  Maximum number of threads per multiprocessor:  2048
	  Maximum number of threads per block:           1024
	  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
	  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
	  Maximum memory pitch:                          2147483647 bytes
	  Texture alignment:                             512 bytes
	  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
	  Run time limit on kernels:                     Yes
	  Integrated GPU sharing Host Memory:            No
	  Support host page-locked memory mapping:       Yes
	  Alignment requirement for Surfaces:            Yes
	  Device has ECC support:                        Disabled
	  Device supports Unified Addressing (UVA):      Yes
	  Device supports Managed Memory:                Yes
	  Device supports Compute Preemption:            Yes
	  Supports Cooperative Kernel Launch:            Yes
	  Supports MultiDevice Co-op Kernel Launch:      Yes
	  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
	  Compute Mode:
	     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
	
	deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.5, CUDA Runtime Version = 11.5, NumDevs = 1
	Result = PASS

Only the final Result = PASS confirms that the installation really works.
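If this check needs to run from a script rather than by eye, something like the following could do it. The function name `device_query_passed` is an assumption for illustration; it simply looks for the `Result = PASS` line in whatever is piped in.

```shell
# Succeed only if the deviceQuery output on stdin contains a "Result = PASS" line.
device_query_passed() {
    grep -q '^[[:space:]]*Result = PASS[[:space:]]*$'
}

# Usage, from the deviceQuery directory:
#   ./deviceQuery | device_query_passed && echo "CUDA OK"
```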

6 Check the version

root@master:/usr/local/cuda-11.5/samples/1_Utilities/deviceQuery# nvcc -V
	nvcc: NVIDIA (R) Cuda compiler driver
	Copyright (c) 2005-2021 NVIDIA Corporation
	Built on Thu_Nov_18_09:45:30_PST_2021
	Cuda compilation tools, release 11.5, V11.5.119
	Build cuda_11.5.r11.5/compiler.30672275_0


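2 Install cuDNN

Download from the official website

You have to register an account before logging in, which may throw a few errors, and after registering you still have to fill in some forms when logging in. It is tedious, so just fill in anything. The connection is slow as well.

1 Download

Download "cuDNN Library for Linux"; the version installed here is cuDNN v8.3.0.


The download is slow too; just wait.

2 Install

# Go to the download directory and list its contents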
root@master:/home/hqc# cd 下载
root@master:/home/hqc/下载# ls
	Anaconda3-5.3.1-Linux-x86_64.sh     iwlwifi-cc-46.3cfab8da.0
	cudnn-11.5-linux-x64-v8.3.0.98.tgz  iwlwifi-cc-46.3cfab8da.0.tgz

# Extract the archive
root@master:/home/hqc/下载# tar -zxvf cudnn-11.5-linux-x64-v8.3.0.98.tgz
	cuda/include/cudnn.h
	cuda/include/cudnn_adv_infer.h
	cuda/include/cudnn_adv_infer_v8.h
	cuda/include/cudnn_adv_train.h
	cuda/include/cudnn_adv_train_v8.h
	cuda/include/cudnn_backend.h
	cuda/include/cudnn_backend_v8.h
	cuda/include/cudnn_cnn_infer.h
	cuda/include/cudnn_cnn_infer_v8.h
	...

# Copy the extracted cuda files into the CUDA installation directory
root@master:/home/hqc/下载# cp cuda/lib64/* /usr/local/cuda-11.5/lib64/
root@master:/home/hqc/下载# cp cuda/include/* /usr/local/cuda-11.5/include/
root@master:/home/hqc/下载# cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
# No output at all

# Tried a different approach; still no output
root@master:/home/hqc/下载# sudo cp cuda/include/cudnn.h /usr/local/cuda/include/
root@master:/home/hqc/下载# sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64/ 
root@master:/home/hqc/下载# sudo chmod a+r /usr/local/cuda/include/cudnn.h 
root@master:/home/hqc/下载# sudo chmod a+r /usr/local/cuda/lib64/libcudnn*
root@master:/home/hqc/下载# cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2

On the cuDNN version check printing nothing, see the comments of the referenced post.
This was unresolved at first. Update: solved.
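The copy steps above can be collected into one helper. This is a sketch, not what I actually ran: the name `install_cudnn` is made up, and it uses `cp -P` so that the `libcudnn*.so` symlinks stay symlinks instead of being copied as full files. It copies every `cudnn*.h` header, which includes `cudnn_version.h`.

```shell
# Copy cuDNN headers and libraries from an extracted archive into a CUDA prefix.
# cp -P keeps the libcudnn*.so symlinks as symlinks instead of duplicating files.
install_cudnn() {
    src=$1   # extracted "cuda" directory from the cuDNN archive
    dest=$2  # CUDA prefix, e.g. /usr/local/cuda-11.5
    cp -P "$src"/include/cudnn*.h "$dest/include/" &&
    cp -P "$src"/lib64/libcudnn* "$dest/lib64/" &&
    chmod a+r "$dest"/include/cudnn*.h "$dest"/lib64/libcudnn*
}

# Usage as root, from the directory where the archive was extracted:
#   install_cudnn cuda /usr/local/cuda-11.5
```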

3 Verify

root@master:/home/hqc/下载# cat /usr/local/cuda/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
	#define CUDNN_MAJOR 8
	#define CUDNN_MINOR 3
	#define CUDNN_PATCHLEVEL 0
	--
	#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
	
	#endif /* CUDNN_VERSION_H */
	
# This means the installed version is cuDNN 8.3.0

Switching to cudnn_version.h fixes it: in recent cuDNN releases the version macros live in cudnn_version.h, not in cudnn.h.
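A version check that works for either header layout might look like this. It is a sketch under the assumption that the macros are written as plain `#define CUDNN_MAJOR 8` lines, as in the output above; the function name `cudnn_version` is made up here.

```shell
# Print the cuDNN version found under an include directory as MAJOR.MINOR.PATCH.
# Looks in cudnn_version.h first (cuDNN 8), then falls back to cudnn.h (older).
cudnn_version() {
    inc=${1:-/usr/local/cuda/include}
    for header in "$inc/cudnn_version.h" "$inc/cudnn.h"; do
        [ -f "$header" ] || continue
        ver=$(awk '
            $1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
            $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
            $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
            END { if (major != "") print major "." minor "." patch }
        ' "$header")
        if [ -n "$ver" ]; then
            echo "$ver"
            return 0
        fi
    done
    echo "cuDNN version macros not found under $inc" >&2
    return 1
}

# Usage: cudnn_version                      (checks /usr/local/cuda/include)
#        cudnn_version /usr/local/cuda-11.5/include
```

Matching on the macro name as a whole field avoids picking up the `CUDNN_VERSION` line, which mentions `CUDNN_MAJOR` only inside its expression.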

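3 Install nvidia-docker2

Follow the official installation guide

The official site says CUDA does not need to be installed on the host; only the driver is required.
So I decided to install nvidia-docker while CUDA and cuDNN were still downloading.

Commands for installing nvidia-docker2

1 Add the repository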

root@master:/home/hqc# distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
>    && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
>    && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
	OK
	deb https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/$(ARCH) /
	#deb https://nvidia.github.io/libnvidia-container/experimental/ubuntu18.04/$(ARCH) /
	deb https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/$(ARCH) /
	#deb https://nvidia.github.io/nvidia-container-runtime/experimental/ubuntu18.04/$(ARCH) /
	deb https://nvidia.github.io/nvidia-docker/ubuntu18.04/$(ARCH) /
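The `distribution` variable is what selects the repository list, so it is worth checking what it expands to before running curl. A small sketch (the function name `distribution_id` is made up; it reads the same `ID` and `VERSION_ID` fields as the command above):

```shell
# Print the distribution string the repository URL uses, e.g. "ubuntu18.04".
# $1: os-release file to read (defaults to /etc/os-release).
distribution_id() {
    # Source the file in a subshell so its variables do not leak out.
    ( . "${1:-/etc/os-release}" && echo "$ID$VERSION_ID" )
}

# Usage: distribution_id
```

On this machine it should print ubuntu18.04, matching the paths in the list shown above.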

2 Update the package index

root@master:/home/hqc# sudo apt-get update

3 Install

root@master:/home/hqc# sudo apt-get install -y nvidia-docker2
	Reading package lists... Done
	Building dependency tree       
	Reading state information... Done       
	The following packages were automatically installed and are no longer required:
	  chromium-codecs-ffmpeg-extra lib32gcc1 libc6-i386 libopencore-amrnb0 libopencore-amrwb0
	  linux-hwe-5.4-headers-5.4.0-42
	Use 'sudo apt autoremove' to remove them.
	The following additional packages will be installed:
	  libnvidia-container-tools libnvidia-container1 nvidia-container-toolkit
	The following NEW packages will be installed:
	  libnvidia-container-tools libnvidia-container1 nvidia-container-toolkit nvidia-docker2
	0 upgraded, 4 newly installed, 0 to remove and 123 not upgraded.
	Need to get 1,075 kB of archives.
	After this operation, 4,747 kB of additional disk space will be used.
	Get:1 https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/amd64  libnvidia-container1 1.7.0-1 [69.5 kB]
	Get:2 https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/amd64  libnvidia-container-tools 1.7.0-1 [22.7 kB]
	Get:3 https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/amd64  nvidia-container-toolkit 1.7.0-1 [977 kB]
	Get:4 https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/amd64  nvidia-docker2 2.8.0-1 [5,528 B]
	Fetched 1,075 kB in 6s (170 kB/s)
	Selecting previously unselected package libnvidia-container1:amd64.
	(Reading database ... 221226 files and directories currently installed.)
	Preparing to unpack .../libnvidia-container1_1.7.0-1_amd64.deb ...
	Unpacking libnvidia-container1:amd64 (1.7.0-1) ...
	Selecting previously unselected package libnvidia-container-tools.
	Preparing to unpack .../libnvidia-container-tools_1.7.0-1_amd64.deb ...
	Unpacking libnvidia-container-tools (1.7.0-1) ...
	Selecting previously unselected package nvidia-container-toolkit.
	Preparing to unpack .../nvidia-container-toolkit_1.7.0-1_amd64.deb ...
	Unpacking nvidia-container-toolkit (1.7.0-1) ...
	Selecting previously unselected package nvidia-docker2.
	Preparing to unpack .../nvidia-docker2_2.8.0-1_all.deb ...
	Unpacking nvidia-docker2 (2.8.0-1) ...
	Setting up libnvidia-container1:amd64 (1.7.0-1) ...
	Setting up libnvidia-container-tools (1.7.0-1) ...
	Setting up nvidia-container-toolkit (1.7.0-1) ...
	Setting up nvidia-docker2 (2.8.0-1) ...
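After the packages are set up, the usual next step is to restart Docker and run nvidia-smi inside a CUDA container. The sketch below only prints the commands unless RUN=1 is set. The function name `verify_nvidia_docker` and the `RUN` switch are made up, and the image tag `nvidia/cuda:11.0-base` is an assumption taken from the CUDA 11.0 image mentioned earlier, so substitute any CUDA image that your driver supports and that is still published.

```shell
# Print (or, with RUN=1, execute) the usual post-install check for nvidia-docker2.
# The image tag is an assumption; substitute any CUDA image your driver supports.
verify_nvidia_docker() {
    image=${1:-nvidia/cuda:11.0-base}
    if [ "${RUN:-0}" = "1" ]; then
        sudo systemctl restart docker &&
        sudo docker run --rm --gpus all "$image" nvidia-smi
    else
        echo "sudo systemctl restart docker"
        echo "sudo docker run --rm --gpus all $image nvidia-smi"
    fi
}

# Usage:
#   verify_nvidia_docker            # dry run, prints the two commands
#   RUN=1 verify_nvidia_docker      # restart docker and run the check
```

If the container prints the same GPU table as nvidia-smi on the host, the runtime is wired up correctly.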
