Pytorch Parallel KeyError Bug
Posted by SoaringPigeon
The error:
Traceback (most recent call last):
File "train_point_corr.py", line 122, in <module>
main()
File "train_point_corr.py", line 44, in main
return main_train(model_class_pointer, hparams, parser)
File "train_point_corr.py", line 111, in main_train
trainer.fit(model)
File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
self.dispatch()
File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
self.accelerator.start_training(self)
File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
self.training_type_plugin.start_training(trainer)
File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 114, in start_training
self._results = trainer.run_train()
File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 637, in run_train
self.train_loop.run_training_epoch()
File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 492, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 631, in run_training_batch
split_batch, batch_idx, opt_idx, optimizer, self.trainer.hiddens
File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 742, in training_step_and_backward
result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 293, in training_step
training_step_output = self.trainer.accelerator.training_step(args)
File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 156, in training_step
return self.training_type_plugin.training_step(*args)
File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/dp.py", line 94, in training_step
return self.model(*args, **kwargs)
File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
output.reraise()
File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
raise self.exc_type(msg)
KeyError: Caught KeyError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/overrides/data_parallel.py", line 74, in forward
output = super().forward(*inputs, **kwargs)
File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/overrides/base.py", line 48, in forward
output = self.module.training_step(*inputs, **kwargs)
File "/home2/djc/DPC/models/shape_corr_trainer.py", line 60, in training_step
batch = self(batch)
File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home2/djc/DPC/models/DeepPointCorr/CrossPointCorr.py", line 230, in forward
# dense features, similarity, and cross reconstruction
File "/home2/djc/DPC/models/DeepPointCorr/CrossPointCorr.py", line 130, in forward_source_target
###transformers
File "/home2/djc/DPC/models/DeepPointCorr/CrossPointCorr.py", line 118, in compute_cross_features
src_pos=source_pe.transpose(0,1) if self.hparams.transformer_encoder_has_pos_emb else None,
File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home2/djc/DPC/models/sub_models/cross_attention/transformers.py", line 41, in forward
src_pos=src_pos, tgt_pos=tgt_pos)
File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home2/djc/DPC/models/sub_models/cross_attention/transformers.py", line 256, in forward
src_key_padding_mask, tgt_key_padding_mask, src_pos, tgt_pos)
File "/home2/djc/DPC/models/sub_models/cross_attention/transformers.py", line 214, in forward_pre
src_w_pos = self.with_pos_embed(src2, src_pos)
File "/home2/djc/DPC/models/sub_models/cross_attention/transformers.py", line 119, in with_pos_embed
return tensor if pos is None else tensor + pos
File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/traceback.py", line 197, in format_stack
return format_list(extract_stack(f, limit=limit))
File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/traceback.py", line 211, in extract_stack
stack = StackSummary.extract(walk_stack(f), limit=limit)
File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/traceback.py", line 360, in extract
linecache.checkcache(filename)
File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/linecache.py", line 79, in checkcache
del cache[filename]
KeyError: '/home2/djc/DPC/models/DeepPointCorr/CrossPointCorr.py'
The key part of the error:
KeyError: Caught KeyError in replica 0 on device 0.
...
...
return format_list(extract_stack(f, limit=limit))
stack = StackSummary.extract(walk_stack(f), limit=limit)
linecache.checkcache(filename)
del cache[filename]
KeyError: '/home2/djc/DPC/models/DeepPointCorr/CrossPointCorr.py'
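The trailing frames show the failure happening inside Python's own linecache module while it is formatting the traceback: checkcache notices that the source file on disk has changed and evicts its cached entry with `del cache[filename]`. A minimal sketch of that eviction (the temp file and its contents are made up for illustration):

```python
import linecache
import os
import tempfile
import time

# Create a small "source" file and let linecache cache it,
# as it would when formatting a traceback frame.
fd, path = tempfile.mkstemp(suffix=".py")
with os.fdopen(fd, "w") as f:
    f.write("x = 1\n")

first_line = linecache.getline(path, 1)  # populates linecache.cache[path]
assert path in linecache.cache

# Edit the file on disk, as editing a model file mid-run would.
time.sleep(0.01)  # let the size/mtime change register
with open(path, "w") as f:
    f.write("x = 1  # edited while the program was running\n")

# checkcache sees the stale entry and evicts it. In Python 3.6 this
# eviction is `del cache[filename]`; when several DataParallel worker
# threads run it at once, the losing thread's `del` raises the
# KeyError seen in the traceback above.
linecache.checkcache(path)
assert path not in linecache.cache
os.remove(path)
```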
Characteristics of the bug:
- It does not occur when running without DataParallel.
- The model's source files were being edited in trivial ways while training was running, e.g. adding hyperparameters, changing training schedule options, or testing the model.
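The parallel-only part is explained by linecache not being thread-safe: each crashing DataParallel replica formats its traceback in its own thread, each thread snapshots the cache's keys, and only one of them can successfully delete the stale entry. A sequential replay of that interleaving (the plain dict below stands in for linecache.cache):

```python
# Replay, step by step, what two traceback-formatting threads do
# concurrently inside linecache.checkcache on the same stale entry.
cache = {"CrossPointCorr.py": ("stale size", "stale mtime", ["..."], "path")}

snapshot_a = list(cache)      # thread A: filenames = list(cache.keys())
snapshot_b = list(cache)      # thread B takes the same snapshot

del cache[snapshot_a[0]]      # thread A evicts the stale entry

try:
    del cache[snapshot_b[0]]  # thread B: entry is already gone
except KeyError as exc:
    result = f"KeyError: {exc}"

print(result)
```

In the real crash the interleaving is nondeterministic, which is why the bug appears only sometimes and only under parallel execution.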
Solution:
Don't modify or test the model's source files while training is running. In theory the running process should stick with the previously loaded version of the code rather than the updated one, but since this bug exists anyway, just go enjoy life while the program runs ~lol
Reference:
When modified the model python file, the pytorch will raise the KeyError of this file #43120