docker-compose 找不到 nvidia 驱动程序
Posted
技术标签:
【中文标题】docker-compose 找不到 nvidia 驱动程序【英文标题】:docker-compose can't found nvidia dirver 【发布时间】:2021-08-24 09:06:42 【问题描述】:我正在尝试运行clara train example,但是当我执行startClaraTrainNoteBooks.sh 时,容器找不到nvidia 驱动程序。
我已经知道脚本执行docker-compose.yml。于是我测试了docker-compose
是否能找到nvidia驱动:
services:
test:
image: nvidia/cuda:10.2-base
command: nvidia-smi
deploy:
resources:
reservations:
devices:
- driver: nvidia
capabilities: [gpu]
device_ids: ['0']
输出:
USER@test:~$ docker-compose up
WARNING: Found orphan containers (hp_nvsmi_1) for this project. If you removed or renamed this service in your compose file, you can run this command with the --remove-orphans flag to clean it up.
Starting hp_test_1 ... done
Attaching to hp_test_1
test_1 | Mon Jun 7 09:01:44 2021
test_1 | +-----------------------------------------------------------------------------+
test_1 | | NVIDIA-SMI 460.27.04 Driver Version: 460.27.04 CUDA Version: 11.2 |
test_1 | |-------------------------------+----------------------+----------------------+
test_1 | | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
test_1 | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
test_1 | | | | MIG M. |
test_1 | |===============================+======================+======================|
test_1 | | 0 GeForce RTX 206... Off | 00000000:01:00.0 Off | N/A |
test_1 | | 0% 34C P8 17W / 215W | 100MiB / 7979MiB | 0% Default |
test_1 | | | | N/A |
test_1 | +-------------------------------+----------------------+----------------------+
test_1 |
test_1 | +-----------------------------------------------------------------------------+
test_1 | | Processes: |
test_1 | | GPU GI CI PID Type Process name GPU Memory |
test_1 | | ID ID Usage |
test_1 | |=============================================================================|
test_1 | +-----------------------------------------------------------------------------+
hp_test_1 exited with code 0
但是startClaraTrainNoteBooks.sh
cna 找不到它。
root@claratrain:/claraDevDay# nvidia-smi
root@claratrain:/claraDevDay#
其实startDocker.sh可以找到驱动。
root@c7c2d5597eb8:/claraDevDay# nvidia-smi
Mon Jun 7 09:11:43 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04 Driver Version: 460.27.04 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 206... Off | 00000000:01:00.0 Off | N/A |
| 0% 35C P8 17W / 215W | 100MiB / 7979MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
root@c7c2d5597eb8:/claraDevDay#
我该怎么办?
【问题讨论】:
【参考方案1】:docker-compose.yml
脚本需要像这样重写并正常工作:
# SPDX-License-Identifier: Apache-2.0
version: "3.8"
services:
claratrain:
container_name: claradevday-pt
hostname: claratrain
##### use vanilla clara train docker
#image: nvcr.io/nvidia/clara-train-sdk:v4.0
##### to build image with GPU dashboard inside jupyter lab
build:
context: ./dockerWGPUDashboardPlugin/ # Project root
dockerfile: ./Dockerfile # Relative to context
image: clara-train-nvdashboard:v4.0
depends_on:
- tritonserver
ports:
- "3030:8888" # Jupyter lab port
- "3031:5000" # AIAA port
ipc: host
volumes:
- $TRAIN_DEV_DAY_ROOT:/claraDevDay/
- /raid/users/aharouni/data:/data/
command: "jupyter lab /claraDevDay --ip 0.0.0.0 --allow-root --no-browser --config /claraDevDay/scripts/jupyter_notebook_config.py"
# command: tail -f /dev/null
# tty: true
deploy:
resources:
reservations:
devices:
- driver: nvidia
capabilities: [ gpu ]
# To specify certain GPU uncomment line below
#device_ids: ['0,3']
#############################################################
tritonserver:
image: nvcr.io/nvidia/tritonserver:21.02-py3
container_name: aiaa-triton
hostname: tritonserver
restart: unless-stopped
command: >
sh -c "chmod 777 /triton_models &&
/opt/tritonserver/bin/tritonserver \
--model-store /triton_models \
--model-control-mode="poll" \
--repository-poll-secs=5 \
--log-verbose $TRITON_VERBOSE"
volumes:
- $TRAIN_DEV_DAY_ROOT/AIAA/workspace/triton_models:/triton_models
# shm_size: 1gb
# ulimits:
# memlock: -1
# stack: 67108864
# logging:
# driver: json-file
【讨论】:
以上是关于docker-compose 找不到 nvidia 驱动程序的主要内容,如果未能解决你的问题,请参考以下文章
Docker-compose:/usr/local/bin/docker-compose:第 1 行:Not:找不到命令
Nginx 从 docker-compose 运行返回“在上游找不到主机”
由于绑定挂载,Dockerfile 和 docker-compose 找不到节点模块