nvidia-smi: Failed to initialize NVML: Driver/library version mismatch

Posted by 刘润森


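On my work machine I often run into Failed to initialize NVML: Driver/library version mismatch.

What it actually means is that the NVIDIA kernel module that is currently loaded and the user-space driver libraries are at different versions.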

(base) ng@ng-Z390:/home/lrs/KAIR-master$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch

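Some people suggest uninstalling the driver. That is unnecessary: if a driver is already installed, removing it only wastes time.

Check nvcc first to confirm that the CUDA toolkit is installed. Note that nvcc reports the toolkit version; the driver version itself is checked in the next step.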


(base) ng@ng-Z390:/home/lrs/KAIR-master$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Thu_Jun_11_22:26:38_PDT_2020
Cuda compilation tools, release 11.0, V11.0.194
Build cuda_11.0_bu.TC445_37.28540450_0

Check the version of the loaded NVIDIA kernel module:

(base) ng@ng-Z390:/home/lrs/KAIR-master$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  460.73.01  Thu Apr  1 21:40:36 UTC 2021
GCC version:  gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04) 
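A minimal sketch for comparing the loaded module with the module file on disk, which is one way to confirm the mismatch before rebuilding anything. The function name nvrm_version is hypothetical, and the parsing assumes the "NVRM version:" line has the same layout as the output shown above; other driver flavors may print it differently.

```shell
#!/bin/sh
# Print the driver version from an "NVRM version:" line read on stdin.
# Assumes the layout "... Kernel Module  <version>  <date>" shown in this post.
nvrm_version() {
  sed -n 's/^NVRM version:.*Kernel Module  *\([0-9][0-9.]*\).*/\1/p'
}

# Only query the real system when the NVIDIA proc file is present.
if [ -r /proc/driver/nvidia/version ]; then
  loaded=$(nvrm_version < /proc/driver/nvidia/version)
  # modinfo reads the module file on disk, which may differ from the loaded one.
  on_disk=$(modinfo -F version nvidia 2>/dev/null || true)
  echo "loaded kernel module: ${loaded:-unknown}"
  echo "module on disk:       ${on_disk:-unknown}"
fi
```

If the two values differ, the running kernel is still using an older module than the one currently installed.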

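The most reliable fix for Failed to initialize NVML: Driver/library version mismatch is to rebuild the module with sudo dkms install -m nvidia -v 460.73.01, where 460.73.01 is the driver version.

If the install fails, check the log it points to.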

unset ARCH; [ ! -h /usr/bin/cc ] && export CC=/usr/bin/gcc; env NV_VERBOSE=1 'make' -j16 NV_EXCLUDE_BUILD_MODULES='' KERNEL_UNAME=5.4.0-73-generic IGNORE_XEN_PRESENCE=1 IGNORE_CC_MISMATCH=1 SYSSRC=/lib/modules/5.4.0-73-generic/build LD=/usr/bin/ld.bfd modules....(bad exit status: 2)
Error! Bad return status for module build on kernel: 5.4.0-73-generic (x86_64)
Consult /var/lib/dkms/nvidia/460.73.01/build/make.log for more information.

My log is at /var/lib/dkms/nvidia/460.73.01/build/make.log
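The build log can be long, so it helps to pull out only the lines that mention an error. A small sketch, assuming GNU grep; the helper name first_errors is hypothetical and the path is the one from this post.

```shell
#!/bin/sh
# Print up to N lines (default 5) of a build log that contain "error",
# case-insensitively, prefixed with their line numbers.
first_errors() {
  log=$1
  max=${2:-5}
  grep -n -i -m "$max" 'error' "$log"
}

# Path taken from the dkms message above; adjust the version as needed.
log=/var/lib/dkms/nvidia/460.73.01/build/make.log
if [ -r "$log" ]; then
  first_errors "$log" || echo "no error lines found in $log"
fi
```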

Here is the error reported in the log:

cc: error: unrecognized command line option ‘-fstack-protector-strong’
make[2]: *** [/var/lib/dkms/nvidia/460.73.01/build/nvidia/nv-acpi.o] Error 1
Makefile:1760: recipe for target '/var/lib/dkms/nvidia/460.73.01/build' failed
make[1]: *** [/var/lib/dkms/nvidia/460.73.01/build] Error 2
make[1]: Leaving directory '/usr/src/linux-headers-5.4.0-73-generic'
Makefile:80: recipe for target 'modules' failed
make: *** [modules] Error 2

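The error cc: error: unrecognized command line option ‘-fstack-protector-strong’ is a C compiler problem: the compiler being invoked as cc is too old to recognize that flag, so the fix is to switch to a different gcc version.

Mine was gcc 4.7; after switching to 4.8 or 7 the build worked.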
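Before rerunning dkms, it can be worth checking whether a given compiler accepts the flag at all. This is a sketch only: cc_supports_flag is a hypothetical helper, and the compiler names in the loop (cc, gcc, gcc-7) are assumptions about what is installed.

```shell
#!/bin/sh
# Return success if compiler $1 accepts flag $2.
# A trivial translation unit is syntax-checked; an unrecognized flag
# (or a missing compiler) makes the command exit non-zero.
cc_supports_flag() {
  cc=$1
  flag=$2
  echo 'int x;' | "$cc" "$flag" -fsyntax-only -x c - >/dev/null 2>&1
}

for cc in cc gcc gcc-7; do
  if cc_supports_flag "$cc" -fstack-protector-strong; then
    echo "$cc: accepts -fstack-protector-strong"
  else
    echo "$cc: missing, or does not accept -fstack-protector-strong"
  fi
done
```

The dkms build in the log above calls the compiler through /usr/bin/cc and /usr/bin/gcc, so those are the entries that need to pass.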

Installing gcc on Ubuntu:

sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update 
sudo apt-get install gcc-7
sudo apt-get install g++-7


(base) ng@ng-Z390:~/miniconda3$ sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.7 99
(base) ng@ng-Z390:~/miniconda3$ sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-7 100


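Setting up the gcc symlinks this way can fail. For how to fix the symlinks, see this blog post: https://blog.csdn.net/recher_He1107/article/details/106739850

If there are no errors, set the default gcc version and then run sudo dkms install -m nvidia -v 460.73.01 again: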

sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-7 100
(base) ng@ng-Z390:~/miniconda3$ sudo dkms install -m nvidia -v 460.73.01

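Once the install succeeds, things generally just work. If it complains that some file already exists, that file is left over from the earlier failed install; delete it and run the install again.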

(base) ng@ng-Z390:~$ nvidia-smi
Mon Jun 28 14:03:35 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:02:00.0 Off |                  N/A |
| 25%   64C    P0    50W / 250W |      0MiB / 11016MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

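A few days later the problem came back. I checked and the NVIDIA driver version had suddenly changed to 460.80, so I followed the same steps and reinstalled for 460.80: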

 sudo dkms install -m nvidia -v 460.80
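The value passed to -v has to match a source tree that dkms can find, and dkms keeps module sources under /usr/src/<module>-<version>. A sketch that lists the candidate versions; the function name list_nvidia_dkms_versions is hypothetical, and dkms status gives the authoritative view.

```shell
#!/bin/sh
# List the versions of NVIDIA dkms source trees found under a directory
# (default /usr/src), one per line, e.g. "460.80" for /usr/src/nvidia-460.80.
list_nvidia_dkms_versions() {
  src_root=${1:-/usr/src}
  for d in "$src_root"/nvidia-[0-9]*; do
    # Skip the unexpanded pattern when nothing matches.
    [ -d "$d" ] || continue
    basename "$d" | sed 's/^nvidia-//'
  done
}

list_nvidia_dkms_versions
```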

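So I configured Ubuntu 18.04 to stop upgrading automatically, so that the installed NVIDIA driver stays at one version.

Edit the config file /etc/apt/apt.conf.d/10periodic
# 0 disables, 1 enables; set every value to 0 (first listing: before the change, second listing: after)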
(base) ng@ng-Z390:/etc/apt/apt.conf.d$ cat 10periodic 
APT::Periodic::Update-Package-Lists "0";
APT::Periodic::Download-Upgradeable-Packages "0";
APT::Periodic::AutocleanInterval "0";
APT::Periodic::Unattended-Upgrade "1";



(base) ng@ng-Z390:/etc/apt/apt.conf.d$ cat 10periodic
APT::Periodic::Update-Package-Lists "0";
APT::Periodic::Download-Upgradeable-Packages "0";
APT::Periodic::AutocleanInterval "0";
APT::Periodic::Unattended-Upgrade "0";
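The same edit can be previewed with sed instead of a text editor. This is a sketch: disable_apt_periodic is a hypothetical helper that only prints the rewritten file and does not modify anything. On some Ubuntu installs a later file such as /etc/apt/apt.conf.d/20auto-upgrades may set the same keys, so it is worth checking that one as well.

```shell
#!/bin/sh
# Print the contents of an apt config file with every
# APT::Periodic::<Key> "<number>"; entry rewritten to "0".
# The file itself is left untouched.
disable_apt_periodic() {
  sed 's/^\(APT::Periodic::[A-Za-z-]*\) *"[0-9]*";/\1 "0";/' "$1"
}

# Preview the result for the file used in this post, if it exists.
conf=/etc/apt/apt.conf.d/10periodic
if [ -r "$conf" ]; then
  disable_apt_periodic "$conf"
fi
```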


(base) ng@ng-Z390:/etc/apt/apt.conf.d$ sudo apt-mark hold linux-image-generic linux-headers-generic 
linux-image-generic set on hold.
linux-headers-generic set on hold.
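Holding the kernel packages stops kernel upgrades, but a jump like 460.73.01 to 460.80 would normally arrive through the NVIDIA driver packages themselves. A sketch for holding those too, assuming the driver was installed from apt packages whose names start with nvidia- or libnvidia- (an assumption; this post does not say how the driver was installed). The helper name nvidia_pkgs is hypothetical.

```shell
#!/bin/sh
# Keep only package names that look like NVIDIA driver packages.
# Reads one package name per line on stdin.
nvidia_pkgs() {
  grep -E '^(lib)?nvidia-'
}

# Usage, left commented out because it changes system state:
#   dpkg-query -W -f='${Package}\n' | nvidia_pkgs | xargs -r sudo apt-mark hold
#   apt-mark showhold
```

Running apt-mark showhold afterwards lists everything currently on hold, which makes it easy to undo later with apt-mark unhold.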




