Some NCCL operations have failed or timed out.

Posted by Alex_996


Background: distributed model training with torchrun across two servers.
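For context, a two-node job of this shape is typically launched once per machine. The addresses and GPU count below are illustrative (eight processes per node matches the eight workers that later receive SIGHUP in the log); the second machine would use `--node_rank=1`:

```shell
# Run on node 0 (the rendezvous host); on node 1, change --node_rank to 1.
torchrun \
  --nnodes=2 \
  --nproc_per_node=8 \
  --node_rank=0 \
  --master_addr=192.168.1.10 \
  --master_port=29500 \
  finetune_gpt2.py
```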

Error: Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.


Full error log:

Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL error: unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
Fatal Python error: Aborted

Thread 0x00007ff9b3476700 (most recent call first):
<no Python frame>

Thread 0x00007ff9b2c75700 (most recent call first):
<no Python frame>

Thread 0x00007ff9b3c77700 (most recent call first):
<no Python frame>

Thread 0x00007ff9b547a700 (most recent call first):
<no Python frame>

Thread 0x00007ff9b4478700 (most recent call first):
<no Python frame>

Thread 0x00007ff9b5c7b700 (most recent call first):
<no Python frame>

Thread 0x00007ff9b4c79700 (most recent call first):
<no Python frame>

Thread 0x00007ff9b67fc700 (most recent call first):
<no Python frame>

Thread 0x00007ff9b7fff700 (most recent call first):
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 316 in wait
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/queue.py", line 180 in get
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/tensorboard/summary/writer/event_file_writer.py", line 227 in run
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 973 in _bootstrap_inner
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 930 in _bootstrap

Thread 0x00007ff9d3ae2700 (most recent call first):
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 316 in wait
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 574 in wait
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/tqdm/_monitor.py", line 60 in run
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 973 in _bootstrap_inner
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 930 in _bootstrap

Thread 0x00007ffa68d63180 (most recent call first):
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2070 in all_gather
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 79 in all_gather
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 152 in all_gather
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/utils.py", line 1016 in all_gather_dp_groups
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1805 in step
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1814 in _take_model_step
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1913 in step
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1756 in _inner_training_loop
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1498 in train
  File "/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py", line 126 in train
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345 in wrapper
  File "/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py", line 132 in <module>
[E ProcessGroupNCCL.cpp:414] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL error: unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
Fatal Python error: Aborted

Thread 0x00007f75b7fff700 (most recent call first):
<no Python frame>

Thread 0x00007f75c4ffd700 (most recent call first):
<no Python frame>

Thread 0x00007f75c57fe700 (most recent call first):
<no Python frame>

Thread 0x00007f75c5fff700 (most recent call first):
<no Python frame>

Thread 0x00007f75def7d700 (most recent call first):
<no Python frame>

Thread 0x00007f75df77e700 (most recent call first):
<no Python frame>

Thread 0x00007f75dff7f700 (most recent call first):
<no Python frame>

Thread 0x00007f72f37fe700 (most recent call first):
<no Python frame>

Thread 0x00007f7655632700 (most recent call first):
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 316 in wait
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 574 in wait
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/tqdm/_monitor.py", line 60 in run
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 973 in _bootstrap_inner
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 930 in _bootstrap

Thread 0x00007f76ea8b2180 (most recent call first):
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/cuda/__init__.py", line 496 in synchronize
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/deepspeed/utils/timer.py", line 189 in stop
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1915 in step
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1756 in _inner_training_loop
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1498 in train
  File "/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py", line 126 in train
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345 in wrapper
  File "/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py", line 132 in <module>
WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4120955 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4120956 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4120957 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4120958 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4120959 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4120960 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4120961 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4120962 closing signal SIGHUP
Traceback (most recent call last):
  File "/home/liuzhaofeng/anaconda3/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 236, in launch_agent
    result = agent.run()
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 850, in _invoke_run
    time.sleep(monitor_interval)
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 4120889 got signal: 1

The key part of the log:

Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL error: unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
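Since ncclSystemError points at a socket-level failure, a quick check that the rendezvous port is reachable from the other machine can rule out basic connectivity problems before re-launching. A minimal sketch (the host/port values you pass in should be the job's actual `--master_addr`/`--master_port`):

```python
import socket

def can_reach(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False

# Hypothetical values; substitute the real rendezvous host and port:
# print(can_reach("192.168.1.10", 29500))
```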


It looks like the error is caused by the two machines falling out of sync with each other. The problem is also intermittent, so restarting the training job is worth trying first to see whether it resolves the issue.
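Beyond restarting, two settings are commonly adjusted when chasing this class of failure: turning on NCCL's own logging so the underlying socket error becomes visible, and raising the collective timeout so a temporarily slow node is not killed outright. A sketch (the interface name `eth0` is an assumption; check the real NIC with `ip addr` on both servers):

```python
import os
from datetime import timedelta

# Make NCCL print its own warnings/errors so the real cause of
# "unhandled system error" shows up in the worker logs.
os.environ["NCCL_DEBUG"] = "INFO"
# Assumption: the two servers talk over eth0; pinning NCCL to that
# interface stops it from picking a virtual/docker NIC by mistake.
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"

def init_with_long_timeout():
    """Init the process group with a longer collective timeout (default is 30 min)."""
    import torch.distributed as dist  # lazy import: only needed when run under torchrun
    dist.init_process_group(backend="nccl", timeout=timedelta(hours=1))
```

Both environment variables must be set before the process group is initialized, which is why they sit at module level here rather than inside the function.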
