NCCL distributed training errors

Posted 2023-05-17

Reference A: While debugging, I ran into this error:
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled system error, NCCL version 2.7.8

So I added the following environment variable to get more detailed NCCL logs:
export NCCL_DEBUG=info
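
If exporting the variable in the shell is inconvenient (for example when the job is started from an IDE), the same setting can be applied from inside the training script. This is a minimal sketch: the helper name enable_nccl_debug is hypothetical, and it assumes the call runs before the process group is initialized so that NCCL sees the variable.

```python
import os


def enable_nccl_debug(level: str = "INFO") -> str:
    """Set NCCL_DEBUG for this process unless it is already set.

    Hypothetical helper: call it before the process group is created.
    """
    # setdefault keeps any value that was already exported in the shell.
    return os.environ.setdefault("NCCL_DEBUG", level)


active_level = enable_nccl_debug()
print(f"NCCL_DEBUG={active_level}")
```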

The log showed that the cause was a lack of free shared memory:
include/shm.h:48 NCCL WARN Error while creating shared memory segment ...

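So I increased the shared memory size of the Docker container.
The simplest approach is to create a new container and pass --shm-size 6G to docker run. Because that would have meant reconfiguring the intranet tunneling (NAT traversal), I edited the container's Docker configuration file directly instead.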

The default size is usually 64M, which is nowhere near enough.

Appending "22" to the existing ShmSize value (64M is 67108864 bytes, which becomes 6710886422) enlarges it roughly 100-fold.
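
The edit can also be scripted. The sketch below rests on several assumptions that the original post does not state: that ShmSize is stored in the container's hostconfig.json (typically under /var/lib/docker/containers/<container-id>/), that the Docker service is stopped while the file is edited, and that the script runs with enough privileges. The function name set_shm_size is hypothetical.

```python
import json
from pathlib import Path

DEFAULT_SHM_BYTES = 64 * 1024 * 1024  # 64M default, i.e. 67108864 bytes


def set_shm_size(hostconfig_path, new_size_bytes):
    """Rewrite the ShmSize field of a container host config file.

    Returns the previous value (None if the field was absent).
    Assumption: the file is plain JSON with a top-level "ShmSize" key.
    """
    path = Path(hostconfig_path)
    config = json.loads(path.read_text())
    old_size = config.get("ShmSize")
    config["ShmSize"] = int(new_size_bytes)
    path.write_text(json.dumps(config))
    return old_size


# The trick from the text: append "22" to the default value.
enlarged = int(str(DEFAULT_SHM_BYTES) + "22")
print(enlarged, "bytes is about", round(enlarged / 1024**3, 2), "GiB")
```

A call would look like set_shm_size("/var/lib/docker/containers/<container-id>/hostconfig.json", enlarged), where <container-id> stands for the full ID of the container. Back up the file first.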

Checking the shared memory size again afterwards shows that it is now 6.3G.
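
One way to confirm the new size from inside the container is to read the capacity of the shared memory mount. This is a small sketch that assumes shared memory is mounted at /dev/shm, as it usually is in Linux containers; the helper name shm_size_gib is hypothetical.

```python
import os
import shutil


def shm_size_gib(path: str = "/dev/shm") -> float:
    """Return the total capacity of the filesystem at `path` in GiB."""
    return shutil.disk_usage(path).total / 1024**3


# Only query /dev/shm when it exists (it does not on every platform).
if os.path.isdir("/dev/shm"):
    print(f"/dev/shm capacity: {shm_size_gib():.2f} GiB")
```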

Fixing raise RuntimeError("Distributed package doesn't have NCCL built in") on Windows

The following error appeared during training:

File "C:\\Users\\urser\\anaconda3\\lib\\site-packages\\torch\\distributed\\distributed_c10d.py", line 597, in _new_process_group_helper
raise RuntimeError("Distributed package doesn't have NCCL "
RuntimeError: Distributed package doesn't have NCCL built in

The message is explicit: this PyTorch build has no NCCL.

Windows does not support the NCCL backend.

The official documentation says:

As of PyTorch v1.8, Windows supports all collective communications backend but NCCL, If the init_method argument of init_process_group() points to a file it must adhere to the following schema:

The fix is simple: do not use the NCCL backend.

A single line of code is enough to solve the problem.
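
The post does not show that line. A common way to avoid NCCL is to pass backend="gloo" to init_process_group, which is an assumption here and not necessarily the linked solution. The sketch below picks the backend by platform; pick_backend and init_distributed are hypothetical helper names, and init_distributed assumes the launcher has already set the usual rendezvous environment variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE).

```python
import sys


def pick_backend(platform: str = sys.platform, nccl_available: bool = False) -> str:
    """Choose a torch.distributed backend name.

    Falls back to "gloo" on Windows or whenever NCCL is not built in.
    """
    if platform.startswith("win") or not nccl_available:
        return "gloo"
    return "nccl"


def init_distributed() -> str:
    """Initialize the default process group with the chosen backend."""
    # Imported lazily so the selection logic works without PyTorch installed.
    import torch.distributed as dist

    backend = pick_backend(nccl_available=dist.is_nccl_available())
    dist.init_process_group(backend=backend, init_method="env://")
    return backend
```

In an existing script, the equivalent one-line change is usually replacing backend="nccl" with backend="gloo" in the init_process_group call.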

Get the solution: https://ai.52learn.online/11955
