Error: Some NCCL operations have failed or timed out

[Posted]: 2021-12-10 02:32:33

[Question]:

When running distributed training on 4 A6000 GPUs, I get the following error:

[E ProcessGroupNCCL.cpp:630] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1803710 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:390] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.

terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804406 milliseconds before timing out.

[E ProcessGroupNCCL.cpp:390] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.

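I am using the standard NVIDIA PyTorch Docker image. Interestingly, training works fine on small datasets, but on larger datasets I get this error. So I can confirm that the training code is correct and does work.

There is no actual runtime error, and I cannot find any other output anywhere that would point to the real error message.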


[Answer 1]:

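The following two changes solved the problem:

1. Increase the default SHM (shared memory) for CUDA to 10g (I think 1g would also have worked). You can do this by passing --shm-size=10g to the docker run command. I also pass --ulimit memlock=-1.
2. export NCCL_P2P_LEVEL=NVL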
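The two changes above can be combined into a single container launch. The following is a minimal sketch, not the answerer's exact command: the image tag is left as a placeholder, `--gpus all` and `--rm` are assumptions about a typical setup, passing the variable with `-e` is just one way to make `NCCL_P2P_LEVEL` visible inside the container, and `run_training_container` is a hypothetical helper name.

```shell
# IMAGE is a placeholder: substitute the NVIDIA PyTorch image tag you actually use.
IMAGE="${IMAGE:-nvcr.io/nvidia/pytorch:<tag>}"
# Set DOCKER=echo for a dry run that only prints the command instead of running it.
DOCKER="${DOCKER:-docker}"

# Hypothetical helper: starts the container with the settings from the answer
# and runs whatever command is passed as arguments inside it.
run_training_container() {
  "$DOCKER" run --gpus all --rm \
    --shm-size=10g \
    --ulimit memlock=-1 \
    -e NCCL_P2P_LEVEL=NVL \
    "$IMAGE" "$@"
}
```

For example, `run_training_container python train.py` would start training, where `train.py` stands in for your own script.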

Debugging tips

To check the current SHM:

df -h
# see the row for shm
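To avoid scanning the full `df -h` table, the check can be narrowed to the shared-memory mount. This is a small sketch that assumes shared memory is mounted at `/dev/shm` (the usual location inside Docker containers, where the default size is 64 MB unless `--shm-size` is set). The function names and the 1 GiB default threshold are illustrative choices, not values from the answer.

```shell
# Print the total size of a mount in KiB (column 2 of POSIX-format df output).
shm_size_kb() {
  df -Pk "${1:-/dev/shm}" | awk 'NR==2 {print $2}'
}

# Report whether the mount is at least min_kb KiB in size.
# Arguments: mount path (default /dev/shm), minimum size in KiB (default 1 GiB).
check_shm() {
  size_kb=$(shm_size_kb "${1:-/dev/shm}")
  min_kb="${2:-1048576}"
  if [ "$size_kb" -lt "$min_kb" ]; then
    echo "shared memory is small: ${size_kb} KiB (wanted at least ${min_kb} KiB)"
    return 1
  fi
  echo "shared memory looks OK: ${size_kb} KiB"
}
```

Running `check_shm` inside the container before launching training gives a quick signal on whether the `--shm-size` setting took effect.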

To see NCCL debug messages:

export NCCL_DEBUG=INFO
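With four ranks, the INFO output can get noisy. As a hedged sketch, NCCL's `NCCL_DEBUG_SUBSYS` and `NCCL_DEBUG_FILE` variables can narrow the output and split it per process. The subsystem selection and the log path below are assumptions chosen for this shared-memory and P2P problem, not settings from the answer.

```shell
export NCCL_DEBUG=INFO
# Limit output to initialization, peer-to-peer, and shared-memory subsystems (illustrative choice).
export NCCL_DEBUG_SUBSYS=INIT,P2P,SHM
# Write one log file per process; NCCL expands %h to the hostname and %p to the PID.
export NCCL_DEBUG_FILE=/tmp/nccl.%h.%p.log
```

These need to be set in the environment of every training process, so export them before launching the job.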

To run the p2p bandwidth test for the GPU-to-GPU communication links:

cd /usr/local/cuda/samples/1_Utilities/p2pBandwidthLatencyTest
sudo make
./p2pBandwidthLatencyTest

For a 4-GPU A6000 box, this prints a matrix showing the bandwidth between each pair of GPUs; with P2P enabled, it should be high.
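Because `NCCL_P2P_LEVEL=NVL` restricts P2P to GPU pairs connected by NVLink, it can also help to see how the GPUs are wired. A minimal sketch using `nvidia-smi topo -m`, which prints the interconnect type for each GPU pair; the wrapper function name is hypothetical and only guards against the tool being absent.

```shell
# Print the GPU interconnect topology matrix if nvidia-smi is available.
show_gpu_topology() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi topo -m
  else
    echo "nvidia-smi not found; run this on the GPU host or inside the container" >&2
    return 1
  fi
}
```

Pairs that show an NVLink connection in that matrix are the ones expected to report high bandwidth in the p2p test.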
