docker-compose cannot find the NVIDIA driver

【Title】docker-compose can't find the NVIDIA driver 【Posted】2021-08-24 09:06:42 【Question】:

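I am trying to run the Clara Train example, but when I execute startClaraTrainNoteBooks.sh, the container cannot find the NVIDIA driver. I know that the script runs docker-compose.yml, so I tested whether docker-compose itself can find the NVIDIA driver: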

services:
  test:
    image: nvidia/cuda:10.2-base
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            capabilities: [gpu]
            device_ids: ['0']

Output:

USER@test:~$ docker-compose up
WARNING: Found orphan containers (hp_nvsmi_1) for this project. If you removed or renamed this service in your compose file, you can run this command with the --remove-orphans flag to clean it up.
Starting hp_test_1 ... done
Attaching to hp_test_1
test_1  | Mon Jun  7 09:01:44 2021
test_1  | +-----------------------------------------------------------------------------+
test_1  | | NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
test_1  | |-------------------------------+----------------------+----------------------+
test_1  | | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
test_1  | | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
test_1  | |                               |                      |               MIG M. |
test_1  | |===============================+======================+======================|
test_1  | |   0  GeForce RTX 206...  Off  | 00000000:01:00.0 Off |                  N/A |
test_1  | |  0%   34C    P8    17W / 215W |    100MiB /  7979MiB |      0%      Default |
test_1  | |                               |                      |                  N/A |
test_1  | +-------------------------------+----------------------+----------------------+
test_1  |
test_1  | +-----------------------------------------------------------------------------+
test_1  | | Processes:                                                                  |
test_1  | |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
test_1  | |        ID   ID                                                   Usage      |
test_1  | |=============================================================================|
test_1  | +-----------------------------------------------------------------------------+
hp_test_1 exited with code 0

However, the container started by startClaraTrainNoteBooks.sh cannot find it; nvidia-smi returns without printing anything:

root@claratrain:/claraDevDay# nvidia-smi 
root@claratrain:/claraDevDay# 

In contrast, the container started by startDocker.sh can find the driver:

root@c7c2d5597eb8:/claraDevDay# nvidia-smi 
Mon Jun  7 09:11:43 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 206...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   35C    P8    17W / 215W |    100MiB /  7979MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
root@c7c2d5597eb8:/claraDevDay# 

What should I do?

【Comments】:

【Answer 1】:

The docker-compose.yml file needs to be rewritten as follows; with this version it works correctly:

# SPDX-License-Identifier: Apache-2.0

version: "3.8"
services:
  claratrain:
    container_name: claradevday-pt
    hostname: claratrain
    ##### use vanilla clara train docker
    #image: nvcr.io/nvidia/clara-train-sdk:v4.0
    ##### to build image with GPU dashboard inside jupyter lab
    build:
      context: ./dockerWGPUDashboardPlugin/    # Project root
      dockerfile: ./Dockerfile                 # Relative to context
    image: clara-train-nvdashboard:v4.0
    depends_on:
      - tritonserver
    ports:
      - "3030:8888"  # Jupyter lab port
      - "3031:5000"  # AIAA port
    ipc: host
    volumes:
      - $TRAIN_DEV_DAY_ROOT:/claraDevDay/
      - /raid/users/aharouni/data:/data/
    command: "jupyter lab /claraDevDay --ip 0.0.0.0 --allow-root --no-browser --config /claraDevDay/scripts/jupyter_notebook_config.py"
#    command: tail -f /dev/null
#    tty: true
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [ gpu ]
              # To specify certain GPU uncomment line below
              #device_ids: ['0,3']
#############################################################
  tritonserver:
    image: nvcr.io/nvidia/tritonserver:21.02-py3
    container_name: aiaa-triton
    hostname: tritonserver
    restart: unless-stopped
    command: >
      sh -c "chmod 777 /triton_models &&
        /opt/tritonserver/bin/tritonserver \
          --model-store /triton_models \
          --model-control-mode="poll" \
          --repository-poll-secs=5 \
          --log-verbose $TRITON_VERBOSE"
    volumes:
      - $TRAIN_DEV_DAY_ROOT/AIAA/workspace/triton_models:/triton_models
#    shm_size: 1gb
#    ulimits:
#      memlock: -1
#      stack: 67108864
#    logging:
#      driver: json-file
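
The key part of the file above is the `deploy.resources.reservations.devices` entry on the `claratrain` service. As a rough sketch (not from the original answer), the check below reports which services in a parsed compose file lack such an entry. The function name `services_without_gpu` is hypothetical, and it simplifies by requiring `driver: nvidia` together with the `gpu` capability. The dict is hand-written to mirror only the relevant keys; in practice it could come from a YAML parser such as PyYAML's `yaml.safe_load`, assuming that package is installed.

```python
def services_without_gpu(compose):
    """Return names of services that declare no NVIDIA GPU device reservation."""
    missing = []
    for name, service in (compose.get("services") or {}).items():
        devices = (
            (service or {})
            .get("deploy", {})
            .get("resources", {})
            .get("reservations", {})
            .get("devices", [])
        )
        # Simplifying assumption: a GPU reservation names the nvidia driver
        # and lists "gpu" among its capabilities.
        has_gpu = any(
            d.get("driver") == "nvidia" and "gpu" in d.get("capabilities", [])
            for d in devices
        )
        if not has_gpu:
            missing.append(name)
    return missing


# Hand-written dict mirroring only the relevant keys of the file above.
compose = {
    "services": {
        "claratrain": {
            "deploy": {"resources": {"reservations": {"devices": [
                {"driver": "nvidia", "capabilities": ["gpu"]}
            ]}}}
        },
        "tritonserver": {"image": "nvcr.io/nvidia/tritonserver:21.02-py3"},
    }
}
print(services_without_gpu(compose))  # prints ['tritonserver']
```

Applied to the compose file shipped with the example, a check like this would show whether the `claratrain` service requests a GPU at all.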

【Comments】:
