TensorFlow:在Linux上安装nvidia-docker环境,解决显卡切换问题,只需要几步就可以成功安装,安装之后登陆不了界面,只能变成服务器模式命令后执行了。

Posted fly-iot

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了TensorFlow:在Linux上安装nvidia-docker环境,解决显卡切换问题,只需要几步就可以成功安装,安装之后登陆不了界面,只能变成服务器模式命令后执行了。相关的知识,希望对你有一定的参考价值。

目录

前言


TensorFlow分栏:
https://blog.csdn.net/freewebsys/category_6872378.html
本文的原文连接是:
https://blog.csdn.net/freewebsys/article/details/108971807

未经博主允许不得转载。
博主CSDN地址是:https://blog.csdn.net/freewebsys
博主掘金地址是:https://juejin.cn/user/585379920479288
博主知乎地址是:https://www.zhihu.com/people/freewebsystem

1,关于nvidia-docker


做模型训练,最好还是使用intel的CPU,保不齐有啥问题再AMD上的跑不起来。
然后最好是带核显的CPU,这样界面使用核显。
然后显卡就专门用来做模型训练使用。
同时因为不同的算法,都需要使用显卡,还是用docker切换环境最方便。

提示:最好使用没有用的电脑折腾,有点风险!!!做好数据备份!!!
而且一旦安装了nvidia驱动,就无法登录桌面了。报错:
提示错误:
Failed to use bus name org.freedesktop.DisplayManager, do you have appropriate permissions?

其实也可以把bios的设置切换回去,但是这样显卡就被占用了,资源就更少了。

之前的安装经验,这次精简了下。

https://blog.csdn.net/freewebsys/article/details/105269765

2,首先要关闭切换bios,默认使用集成显卡,禁用nouveau


依次进入CHIPSET–>System Agent configuration 将primary display设定为PEG或者是IGFX;
Internal graphics 设定为AUTO
这样就修改成集成显卡使用了。

然后禁用:Disable Nouveau,是个开源
sudo vim /etc/modprobe.d/blacklist-nouveau.conf

blacklist nouveau
options nouveau modeset=0

修改完成,更新再重启:

sudo update-initramfs -u
sudo reboot

然后从nvidia的官网找到自己显卡的驱动,我的这个是个老显卡 gtx1650 4G显存的

下载驱动:

https://www.nvidia.com/Download/index.aspx?lang=en-us

然后就可以执行了安装驱动了:

3,安装nvidia的驱动和nvidia-docker2


必须关闭x-server

sudo  /etc/init.d/lightdm stop
#还依赖 gcc 库直接把工具包都安装上:
$ sudo apt install build-essential 

然后 按住 ctrl + alt + F1 切换到另外一个 tty1 终端上进行安装。

sudo ./NVIDIA-Linux-x86_64-525.89.02.run


之后就可以执行 nvidia-smi 查看设备了:

$ nvidia-smi 
Tue Mar  7 22:15:34 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 46%   54C    P0    13W /  75W |      0MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

安装完成驱动之后就可以安装nvidia-docker了也是特别简单:
直接增加源进行安装:

# 先安装docker:
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
# 把当前用户加入到 docker 组;
sudo gpasswd -a $USER docker
# 更新docker组
newgrp docker
# 增加自动启动
sudo systemctl enable docker

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit nvidia-docker2

查看配置,已经又nvidia runtime了,再加上中国镜像和配置数据路径:

# cat /etc/docker/daemon.json 

   "runtimes": 
        "nvidia": 
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        
   ,
  "data-root": "/data/docker",
  "registry-mirrors" : [
      "http://registry.docker-cn.com"
    ],
  "insecure-registries" : [
      "registry.docker-cn.com"
    ]

执行简单测试,使用TensorFlow 官方的GPU镜像即可:

docker run --name gpt2gpu -itd -v `pwd`:/data --gpus all  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility -e NVIDIA_VISIBLE_DEVICES=all  tensorflow/tensorflow:latest-gpu

执行 python 脚本测试下:

# 先登录到 tensorflow gpu 容器中
docker exec -it gpt2gpu bash
# 执行测试脚本:

# python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
2023-03-07 14:05:45.153075: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-07 14:05:47.126280: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-07 14:05:47.166058: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-07 14:05:47.166419: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-07 14:05:47.167285: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-07 14:05:47.168368: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-07 14:05:47.168727: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-07 14:05:47.169057: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-07 14:05:48.338063: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-07 14:05:48.338243: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-07 14:05:48.338375: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-07 14:05:48.338496: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 2622 MB memory:  -> device: 0, name: NVIDIA GeForce GTX 1650, pci bus id: 0000:01:00.0, compute capability: 7.5
tf.Tensor(45.326996, shape=(), dtype=float32)

可以看到已经在使用 NVIDIA GeForce GTX 1650 显卡了。

3,最后可以执行gpt-2-simple的项目了,然后显存太小OOM了


运行之前的gpt-2-simple的项目,使用gpu镜像,结果就OOM了。
看来4G内存还是太小了。

return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.ResourceExhaustedError: failed to allocate memory
	 [[node gradients/model/h3/attn/Max_grad/Cast]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

本文的原文连接是:
https://blog.csdn.net/freewebsys/article/details/108971807

以上是关于TensorFlow:在Linux上安装nvidia-docker环境,解决显卡切换问题,只需要几步就可以成功安装,安装之后登陆不了界面,只能变成服务器模式命令后执行了。的主要内容,如果未能解决你的问题,请参考以下文章

为 tensorflow 升级 CUDA 和 cuDNN 的最佳实践

在linux/ubuntu上安装Tensorflow

在Linux服务器上安装tensorflow记录

tensorflow安装

tensorflow gpu安装配置(windows,linux)

为啥 conda 无法在 Windows 上正确安装 tensorflow gpu?