Some NCCL operations have failed or timed out.

Posted by Alex_996


Background: distributed model training with torchrun across two servers.
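For context, a two-node job of this shape is typically launched once per machine. The addresses and GPU count below are illustrative (eight processes per node matches the eight workers that later receive SIGHUP in the log); the second machine would use `--node_rank=1`:

```shell
# Run on node 0 (the rendezvous host); on node 1, change --node_rank to 1.
torchrun \
  --nnodes=2 \
  --nproc_per_node=8 \
  --node_rank=0 \
  --master_addr=192.168.1.10 \
  --master_port=29500 \
  finetune_gpt2.py
```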

Error: Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.


Full error log:

Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL error: unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
Fatal Python error: Aborted

Thread 0x00007ff9b3476700 (most recent call first):
<no Python frame>

Thread 0x00007ff9b2c75700 (most recent call first):
<no Python frame>

Thread 0x00007ff9b3c77700 (most recent call first):
<no Python frame>

Thread 0x00007ff9b547a700 (most recent call first):
<no Python frame>

Thread 0x00007ff9b4478700 (most recent call first):
<no Python frame>

Thread 0x00007ff9b5c7b700 (most recent call first):
<no Python frame>

Thread 0x00007ff9b4c79700 (most recent call first):
<no Python frame>

Thread 0x00007ff9b67fc700 (most recent call first):
<no Python frame>

Thread 0x00007ff9b7fff700 (most recent call first):
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 316 in wait
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/queue.py", line 180 in get
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/tensorboard/summary/writer/event_file_writer.py", line 227 in run
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 973 in _bootstrap_inner
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 930 in _bootstrap

Thread 0x00007ff9d3ae2700 (most recent call first):
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 316 in wait
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 574 in wait
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/tqdm/_monitor.py", line 60 in run
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 973 in _bootstrap_inner
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 930 in _bootstrap

Thread 0x00007ffa68d63180 (most recent call first):
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2070 in all_gather
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 79 in all_gather
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 152 in all_gather
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/utils.py", line 1016 in all_gather_dp_groups
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1805 in step
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1814 in _take_model_step
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1913 in step
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1756 in _inner_training_loop
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1498 in train
  File "/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py", line 126 in train
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345 in wrapper
  File "/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py", line 132 in <module>
[E ProcessGroupNCCL.cpp:414] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL error: unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
Fatal Python error: Aborted

Thread 0x00007f75b7fff700 (most recent call first):
<no Python frame>

Thread 0x00007f75c4ffd700 (most recent call first):
<no Python frame>

Thread 0x00007f75c57fe700 (most recent call first):
<no Python frame>

Thread 0x00007f75c5fff700 (most recent call first):
<no Python frame>

Thread 0x00007f75def7d700 (most recent call first):
<no Python frame>

Thread 0x00007f75df77e700 (most recent call first):
<no Python frame>

Thread 0x00007f75dff7f700 (most recent call first):
<no Python frame>

Thread 0x00007f72f37fe700 (most recent call first):
<no Python frame>

Thread 0x00007f7655632700 (most recent call first):
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 316 in wait
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 574 in wait
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/tqdm/_monitor.py", line 60 in run
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 973 in _bootstrap_inner
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/threading.py", line 930 in _bootstrap

Thread 0x00007f76ea8b2180 (most recent call first):
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/cuda/__init__.py", line 496 in synchronize
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/deepspeed/utils/timer.py", line 189 in stop
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1915 in step
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1756 in _inner_training_loop
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1498 in train
  File "/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py", line 126 in train
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345 in wrapper
  File "/home/liuzhaofeng/nlg_pipeline/gpt2/dialog/finetune_gpt2.py", line 132 in <module>
WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4120955 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4120956 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4120957 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4120958 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4120959 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4120960 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4120961 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4120962 closing signal SIGHUP
Traceback (most recent call last):
  File "/home/liuzhaofeng/anaconda3/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 236, in launch_agent
    result = agent.run()
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 850, in _invoke_run
    time.sleep(monitor_interval)
  File "/home/liuzhaofeng/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 4120889 got signal: 1

The key part of the log:

Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL error: unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
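Since ncclSystemError points at a socket-level failure, a quick check that the rendezvous port is reachable from the other machine can rule out basic connectivity problems before re-launching. A minimal sketch (the host/port values you pass in should be the job's actual `--master_addr`/`--master_port`):

```python
import socket

def can_reach(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False

# Hypothetical values; substitute the real rendezvous host and port:
# print(can_reach("192.168.1.10", 29500))
```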


It looks like the error is caused by the two machines falling out of sync with each other. The problem is also intermittent, so restarting the training job is worth trying first to see whether it resolves the issue.
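Beyond restarting, two settings are commonly adjusted when chasing this class of failure: turning on NCCL's own logging so the underlying socket error becomes visible, and raising the collective timeout so a temporarily slow node is not killed outright. A sketch (the interface name `eth0` is an assumption; check the real NIC with `ip addr` on both servers):

```python
import os
from datetime import timedelta

# Make NCCL print its own warnings/errors so the real cause of
# "unhandled system error" shows up in the worker logs.
os.environ["NCCL_DEBUG"] = "INFO"
# Assumption: the two servers talk over eth0; pinning NCCL to that
# interface stops it from picking a virtual/docker NIC by mistake.
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"

def init_with_long_timeout():
    """Init the process group with a longer collective timeout (default is 30 min)."""
    import torch.distributed as dist  # lazy import: only needed when run under torchrun
    dist.init_process_group(backend="nccl", timeout=timedelta(hours=1))
```

Both environment variables must be set before the process group is initialized, which is why they sit at module level here rather than inside the function.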
