Running PyTorch multi-GPU training inside a Docker container fails with RuntimeError: NCCL Error 2: unhandled system error
Error 1: RuntimeError: NCCL Error 2: unhandled system error when running PyTorch multi-GPU inside a Docker container. Fix: pass -e NVIDIA_VISIBLE_DEVICES=0,1,2,3 when starting the container:
docker run --runtime=nvidia --net="host" -e NVIDIA_VISIBLE_DEVICES=0,1,2,3 --shm-size 8g -it huangzc/reid:v1 /bin/bash
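If the error persists after restarting the container, it can help to confirm from inside the container that the GPUs were actually exposed. A minimal check (plain Python, no GPU libraries needed; the environment variable names are the standard NVIDIA container-runtime ones):

```python
import os

# NVIDIA_VISIBLE_DEVICES is consumed by the NVIDIA container runtime;
# CUDA_VISIBLE_DEVICES is consumed by CUDA itself. If both are unset
# inside the container, NCCL has no devices to work with.
visible = os.environ.get("NVIDIA_VISIBLE_DEVICES", "<unset>")
cuda_visible = os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>")
print(f"NVIDIA_VISIBLE_DEVICES={visible}")
print(f"CUDA_VISIBLE_DEVICES={cuda_visible}")
```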
Error 2: RuntimeError: DataLoader worker (pid 53617) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
Fix: start the container with a larger shared-memory segment, e.g. --shm-size 8g. (Note that --shm-size sets the size of /dev/shm, the shared memory that DataLoader workers use; it is not swap.)
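To see whether the shared-memory mount is the problem, its size can be checked from inside the container. A hedged sketch (Linux only; Docker defaults /dev/shm to 64 MB, which is far too small for multi-worker data loading):

```python
import os
import shutil

# /dev/shm backs the tensors that DataLoader worker processes hand
# back to the main process; a bus error in a worker usually means
# this mount filled up.
if os.path.isdir("/dev/shm"):
    shm_bytes = shutil.disk_usage("/dev/shm").total
    shm_info = f"/dev/shm size: {shm_bytes / 2**20:.0f} MiB"
else:
    shm_info = "/dev/shm not available on this system"
print(shm_info)
```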
Error in multi-GPU training with PyTorch Lightning
Title: Getting error in multi-gpu training with pytorch lightning (posted 2021-05-03 14:29:06)
Question: The following code works on a single GPU but, when using multiple GPUs, raises RuntimeError: grad can be implicitly created only for scalar outputs.
Code:
def forward(
    self,
    input_ids,
    attention_mask=None,
    decoder_input_ids=None,
    decoder_attention_mask=None,
    lm_labels=None,
):
    return self.model(
        input_ids,
        attention_mask=attention_mask,
        decoder_input_ids=decoder_input_ids,
        decoder_attention_mask=decoder_attention_mask,
        labels=lm_labels,
    )

def _step(self, batch):
    lm_labels = batch["target_ids"]
    # lm_labels[lm_labels[:, :] == self.tokenizer.pad_token_id] = -100
    outputs = self(
        input_ids=batch["source_ids"],
        attention_mask=batch["source_mask"],
        lm_labels=lm_labels,
        decoder_attention_mask=batch["target_mask"],
    )
    loss = outputs[0]
    return loss

def training_step(self, batch, batch_idx):
    loss = self._step(batch)
    return {"loss": loss}
The loss value is a scalar: tensor(12.8875, device='cuda:1', grad_fn=<NllLossBackward>). What could be the reason behind this error?
Traceback (most recent call last):
  File "training_trial.py", line 390, in <module>
    trainer.fit(model)
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 510, in fit
    results = self.accelerator_backend.train()
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 57, in train
    return self.train_or_test()
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 74, in train_or_test
    results = self.trainer.train()
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 561, in train
    self.train_loop.run_training_epoch()
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 549, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 704, in run_training_batch
    self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 490, in optimizer_step
    using_lbfgs=is_lbfgs,
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1296, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 286, in step
    self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 144, in __optimizer_step
    optimizer.step(closure=closure, *args, **kwargs)
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/torch/optim/lr_scheduler.py", line 67, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/transformers/optimization.py", line 318, in step
    loss = closure()
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 699, in train_step_and_backward_closure
    self.trainer.hiddens
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 802, in training_step_and_backward
    self.backward(result, optimizer, opt_idx)
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 829, in backward
    result.closure_loss, optimizer, opt_idx, *args, **kwargs
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 109, in backward
    model.backward(closure_loss, optimizer, opt_idx, *args, **kwargs)
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1162, in backward
    loss.backward(*args, **kwargs)
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/torch/autograd/__init__.py", line 126, in backward
    grad_tensors = _make_grads(tensors, grad_tensors)
  File "/home/nvarshn2/.conda/envs/pytorch_lightning_new_env/lib/python3.7/site-packages/torch/autograd/__init__.py", line 50, in _make_grads
    raise RuntimeError("grad can be implicitly created only for scalar outputs")
RuntimeError: grad can be implicitly created only for scalar outputs
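The traceback ends in _make_grads: calling .backward() with no gradient argument is only valid on a 0-dim (scalar) tensor. On a single GPU the loss is a scalar, but with multi-GPU data parallelism each replica returns its own loss, so the gathered "loss" is a vector and the implicit grad cannot be created. The mechanism can be reproduced with plain tensors (a minimal sketch, assuming PyTorch is installed; the tensors here are placeholders, not the model's real loss):

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
loss = x * 2  # non-scalar "loss", like per-GPU losses gathered by data parallelism

try:
    loss.backward()  # implicit grad only exists for 0-dim tensors
    implicit_ok = True
except RuntimeError:
    implicit_ok = False  # "grad can be implicitly created only for scalar outputs"

# Reducing to a scalar first (as the accepted fix below does) works:
loss_scalar = (x * 2).sum()
loss_scalar.backward()
print(implicit_ok, x.grad.tolist())
```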
Answer 1: Add a training_step_end() hook. Reference: https://github.com/PyTorchLightning/pytorch-lightning/issues/4073
def training_step_end(self, training_step_outputs):
    return {'loss': training_step_outputs['loss'].sum()}