Error: Some NCCL operations have failed or timed out
[Posted]: 2021-12-10 02:32:33
[Question]: When running distributed training on 4 A6000 GPUs, I get the following error:
[E ProcessGroupNCCL.cpp:630] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1803710 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:390] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 2] Watchdog caught collective operation timeout:
WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804406 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:390] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
I am using the standard NVIDIA PyTorch Docker image. Interestingly, training works fine on a small dataset, but with a larger dataset I get this error, so I can confirm that the training code itself is correct and does work.
There is no actual runtime error or any other information anywhere that would reveal the real error message.
[Answer 1]: The following two things fixed this problem for me:
1. Increase the container's default SHM (shared memory) to 10g (I think 1g would work as well). You can do this by passing --shm-size=10g to the docker run command. I also pass --ulimit memlock=-1. See the example invocation after this list.
2. export NCCL_P2P_LEVEL=NVL
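For reference, a minimal docker run sketch that combines both settings could look like the following; the image tag and the torchrun command with train.py are placeholders, not part of the original answer:
# Sketch only: the image tag and the training command are placeholders
docker run --gpus all \
    --shm-size=10g \
    --ulimit memlock=-1 \
    -e NCCL_P2P_LEVEL=NVL \
    nvcr.io/nvidia/pytorch:21.10-py3 \
    torchrun --nproc_per_node=4 train.py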
Debugging tips
To check the current SHM:
df -h
# see the row for shm
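Inside a container started without --shm-size, the shm row typically shows Docker's 64 MB default; the output below is illustrative and the exact values will differ:
df -h /dev/shm
# Filesystem      Size  Used Avail Use% Mounted on
# tmpfs            64M     0   64M   0% /dev/shm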
To see NCCL debug messages:
export NCCL_DEBUG=INFO
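If the INFO output is too noisy, NCCL also supports filtering by subsystem; this is a standard NCCL environment variable, not something from the original answer:
export NCCL_DEBUG_SUBSYS=INIT,NET   # e.g. restrict logs to the init and network subsystems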
To run a p2p bandwidth test on the GPU-to-GPU communication links:
cd /usr/local/cuda/samples/1_Utilities/p2pBandwidthLatencyTest
sudo make
./p2pBandwidthLatencyTest
For a 4-GPU A6000 box, the test prints a matrix showing the bandwidth and P2P status between each pair of GPUs; the numbers should be high.
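If /usr/local/cuda/samples is not present (recent CUDA toolkits distribute the samples separately on GitHub), the same test can be built from the cuda-samples repository; the exact path inside the repository is an assumption and may differ between releases:
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest   # path may vary by release
make
./p2pBandwidthLatencyTest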