Running PyTorch on multiple GPUs inside a Docker container fails with RuntimeError: NCCL Error 2: unhandled system error


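Error 1. Running PyTorch on multiple GPUs inside a Docker container fails with RuntimeError: NCCL Error 2: unhandled system error

Add -e NVIDIA_VISIBLE_DEVICES=0,1,2,3 when starting the container, so that all four GPUs are exposed to it: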

docker run --runtime=nvidia --net="host" -e NVIDIA_VISIBLE_DEVICES=0,1,2,3 --shm-size 8g -it huangzc/reid:v1 /bin/bash
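Once inside the container, it can help to confirm which GPUs were actually exposed before starting training. The following is a minimal sketch and not part of the original answer: `visible_gpus` is a hypothetical helper that parses `NVIDIA_VISIBLE_DEVICES`, and its handling of `all`, `none`, and `void` assumes the usual conventions of the NVIDIA container runtime.

```python
import os


def visible_gpus(env=None):
    """Parse NVIDIA_VISIBLE_DEVICES into a list of GPU ids (hypothetical helper).

    Returns "all" when every GPU is exposed, and an empty list when the
    variable is unset, empty, "none", or "void".
    """
    env = os.environ if env is None else env
    raw = env.get("NVIDIA_VISIBLE_DEVICES", "").strip()
    if raw == "all":
        return "all"
    if raw in ("", "none", "void"):
        return []
    # Entries may be indices ("0") or GPU UUIDs; keep them as strings.
    return [item.strip() for item in raw.split(",") if item.strip()]


if __name__ == "__main__":
    print("GPUs exposed to this container:", visible_gpus())
```

If the list looks right and the NCCL error persists, exporting `NCCL_DEBUG=INFO` before launching training makes NCCL log more detail about the underlying system error.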

Error 2. RuntimeError: DataLoader worker (pid 53617) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

Increase the container's shared memory (/dev/shm, not swap) when starting it: --shm-size 8g
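To check whether the larger limit took effect, the size of `/dev/shm` can be read from inside the container. This is a minimal sketch using only the standard library; `shm_usage_gib` is a hypothetical helper name, and the path argument exists only so the function can be pointed at another mount. For comparison, Docker's default `/dev/shm` size is 64 MB when `--shm-size` is not given.

```python
import os
import shutil

GIB = 1024 ** 3


def shm_usage_gib(path="/dev/shm"):
    """Return (total, free) space of the given mount in GiB."""
    usage = shutil.disk_usage(path)
    return usage.total / GIB, usage.free / GIB


# Only report when /dev/shm exists (it does on Linux containers).
if os.path.isdir("/dev/shm"):
    total, free = shm_usage_gib()
    print(f"/dev/shm: {total:.2f} GiB total, {free:.2f} GiB free")
```

With `--shm-size 8g`, the reported total should be close to 8 GiB. DataLoader workers pass batches through shared memory, so a total near the default is a likely cause of the bus error.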

Getting error in multi-gpu training with pytorch lightning

Posted: 2021-05-03 14:29:06

Question:

The following code runs on a single GPU, but with multiple GPUs it raises RuntimeError: grad can be implicitly created only for scalar outputs

Code:

    def forward(                                                                
            self,                                                               
            input_ids,                                                          
            attention_mask=None,                                                
            decoder_input_ids=None,                                             
            decoder_attention_mask=None,                                        
            lm_labels=None                                                      
    ):                                                                          
        return self.model(                                                      
            input_ids,                                                          
            attention_mask=attention_mask,                                      
            decoder_input_ids=decoder_input_ids,                                
            decoder_attention_mask=decoder_attention_mask,                      
            labels=lm_labels,                                                   
        )                                                                       
                                                                                
    def _step(self, batch):                                                     
        lm_labels = batch["target_ids"]                                         
        # lm_labels[lm_labels[:, :] == self.tokenizer.pad_token_id] = -100      
        outputs = self(                                                         
            input_ids=batch["source_ids"],                                      
            attention_mask=batch["source_mask"],                                
            lm_labels=lm_labels,                                                
            decoder_attention_mask=batch['target_mask']                         
        )                                                                       
                                                                                
        loss = outputs[0]                                                       
                                                                                
        return loss

    def training_step(self, batch, batch_idx):
        loss = self._step(batch)
        return {"loss": loss}

The loss value is a scalar: tensor(12.8875, device='cuda:1', grad_fn=<NllLossBackward>). What could be the reason behind this error?

    Traceback (most recent call last):
      File "training_trial.py", line 390, in <module>
        trainer.fit(model)
      File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 510, in fit
        results = self.accelerator_backend.train()
      File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 57, in train
        return self.train_or_test()
      File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 74, in train_or_test
        results = self.trainer.train()
      File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 561, in train
        self.train_loop.run_training_epoch()
      File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 549, in run_training_epoch
        batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
      File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 704, in run_training_batch
        self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
      File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 490, in optimizer_step
        using_lbfgs=is_lbfgs,
      File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1296, in optimizer_step
        optimizer.step(closure=optimizer_closure)
      File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 286, in step
        self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
      File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 144, in __optimizer_step
        optimizer.step(closure=closure, *args, **kwargs)
      File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/torch/optim/lr_scheduler.py", line 67, in wrapper
        return wrapped(*args, **kwargs)
      File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/transformers/optimization.py", line 318, in step
        loss = closure()
      File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 699, in train_step_and_backward_closure
        self.trainer.hiddens
      File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 802, in training_step_and_backward
        self.backward(result, optimizer, opt_idx)
      File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 829, in backward
        result.closure_loss, optimizer, opt_idx, *args, **kwargs
      File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 109, in backward
        model.backward(closure_loss, optimizer, opt_idx, *args, **kwargs)
      File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1162, in backward
        loss.backward(*args, **kwargs)
      File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/torch/tensor.py", line 221, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph)
      File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/torch/autograd/__init__.py", line 126, in backward
        grad_tensors = _make_grads(tensors, grad_tensors)
      File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/torch/autograd/__init__.py", line 50, in _make_grads
        raise RuntimeError("grad can be implicitly created only for scalar outputs")
    RuntimeError: grad can be implicitly created only for scalar outputs


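Answer 1:

Add training_step_end(). With several GPUs, each device likely returns its own loss, so the combined loss is no longer a scalar and must be reduced before backward() is called. Reference: https://github.com/PyTorchLightning/pytorch-lightning/issues/4073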

    def training_step_end(self, training_step_outputs):
        # Reduce the per-GPU losses to a single scalar.
        return {'loss': training_step_outputs['loss'].sum()}
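To see why the reduction matters, here is a minimal sketch in plain PyTorch that runs on CPU. It assumes the per-GPU losses arrive as one tensor with one entry per device, which is my reading of the linked issue rather than something the question confirms. The two loss values are made up for illustration.

```python
import torch

# Hypothetical per-GPU losses, as they might look after being gathered from two devices.
gathered = torch.tensor([12.8875, 11.9021], requires_grad=True)

try:
    # Non-scalar tensor: autograd cannot create an implicit gradient for it.
    gathered.backward()
    error_message = None
except RuntimeError as err:
    error_message = str(err)
print(error_message)

# Reducing to a scalar first, as training_step_end does, makes backward() valid.
reduced = gathered.sum()
reduced.backward()
print(reduced.dim(), gathered.grad)
```

Using `.mean()` instead of `.sum()` would also yield a scalar. The answer uses `.sum()`, and the choice changes the gradient scale by a factor equal to the number of GPUs.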
