PyTorch Parallel KeyError Bug

Posted SoaringPigeon

Error:

Traceback (most recent call last):                                                                                                                                                                                          
  File "train_point_corr.py", line 122, in <module>                                                                                                                                                                         
    main()                                                                                                                                                                                                                  
  File "train_point_corr.py", line 44, in main                                                                                                                                                                              
    return main_train(model_class_pointer, hparams, parser)                                                                                                                                                                 
  File "train_point_corr.py", line 111, in main_train                                                                                                                                                                       
    trainer.fit(model)                                                                                                                                                                                                      
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit                                                                                                   
    self.dispatch()                                                                                                                                                                                                         
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch                                                                                              
    self.accelerator.start_training(self)                                                                                                                                                                                   
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training                                                                                
    self.training_type_plugin.start_training(trainer)                                                                                                                                                                       
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 114, in start_training                                                             
    self._results = trainer.run_train()                                                                                                                                                                                     
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 637, in run_train                                                                                             
    self.train_loop.run_training_epoch()                                                                                                                                                                                    
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 492, in run_training_epoch                                                                              
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)                                                                                                                                                
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 631, in run_training_batch                                                                              
    split_batch, batch_idx, opt_idx, optimizer, self.trainer.hiddens                                                                                                                                                        
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 742, in training_step_and_backward                                                                      
    result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)                                                                                                                                                   
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 293, in training_step
    training_step_output = self.trainer.accelerator.training_step(args)                                       
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 156, in training_step
    return self.training_type_plugin.training_step(*args)                                                     
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/dp.py", line 94, in training_step
    return self.model(*args, **kwargs)                 
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)                                                                   
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)                                                   
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()                                   
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)                           
KeyError: Caught KeyError in replica 0 on device 0.                                                           
Original Traceback (most recent call last):                                                                   
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)                  
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)                                                                   
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/overrides/data_parallel.py", line 74, in forward
    output = super().forward(*inputs, **kwargs)                                                               
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/pytorch_lightning/overrides/base.py", line 48, in forward
    output = self.module.training_step(*inputs, **kwargs)                                                     
  File "/home2/djc/DPC/models/shape_corr_trainer.py", line 60, in training_step
    batch = self(batch)                                
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)                                                                   
  File "/home2/djc/DPC/models/DeepPointCorr/CrossPointCorr.py", line 230, in forward
    # dense features, similarity, and cross reconstruction                                                    
  File "/home2/djc/DPC/models/DeepPointCorr/CrossPointCorr.py", line 130, in forward_source_target
    ###transformers                                    
  File "/home2/djc/DPC/models/DeepPointCorr/CrossPointCorr.py", line 118, in compute_cross_features
    src_pos=source_pe.transpose(0,1) if self.hparams.transformer_encoder_has_pos_emb else None,
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)                                                                   
  File "/home2/djc/DPC/models/sub_models/cross_attention/transformers.py", line 41, in forward
    src_pos=src_pos, tgt_pos=tgt_pos)                  
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)                                                                   
  File "/home2/djc/DPC/models/sub_models/cross_attention/transformers.py", line 256, in forward
    src_key_padding_mask, tgt_key_padding_mask, src_pos, tgt_pos)                                             
  File "/home2/djc/DPC/models/sub_models/cross_attention/transformers.py", line 214, in forward_pre
    src_w_pos = self.with_pos_embed(src2, src_pos)                                                            
  File "/home2/djc/DPC/models/sub_models/cross_attention/transformers.py", line 119, in with_pos_embed
    return tensor if pos is None else tensor + pos                                                            
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/traceback.py", line 197, in format_stack
    return format_list(extract_stack(f, limit=limit))                                                         
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/traceback.py", line 211, in extract_stack
    stack = StackSummary.extract(walk_stack(f), limit=limit)                                                  
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/traceback.py", line 360, in extract
    linecache.checkcache(filename)                     
  File "/home2/djc/anaconda3/envs/DPC/lib/python3.6/linecache.py", line 79, in checkcache
    del cache[filename]                                
KeyError: '/home2/djc/DPC/models/DeepPointCorr/CrossPointCorr.py'

Key lines of the error:

KeyError: Caught KeyError in replica 0 on device 0. 
...
...
return format_list(extract_stack(f, limit=limit))
stack = StackSummary.extract(walk_stack(f), limit=limit) 
linecache.checkcache(filename)
del cache[filename]
KeyError: '/home2/djc/DPC/models/DeepPointCorr/CrossPointCorr.py'

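When the bug appears:

  1. It does not occur when training without Parallel.
  2. While the model was running, trivial changes were made to the model code, for example adding hyperparameters, changing the training schedule options, or testing the model.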
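
The last frames of the traceback end in linecache.checkcache, the standard-library routine that drops a cached source file once the copy on disk no longer matches. The sketch below shows only that invalidation step, using a throwaway file named model_stub.py (a hypothetical stand-in for CrossPointCorr.py). It does not reproduce the multi-GPU failure itself. In Python 3.6 the drop is a plain `del cache[filename]`, as the traceback shows, so it presumably fails when another replica thread has already removed the same entry.

```python
import linecache
import os
import tempfile


def cache_entry_dropped_after_edit():
    """Return (cached_before_edit, cached_after_edit) for a file edited after caching."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical stand-in for the model source file.
        path = os.path.join(tmp, "model_stub.py")
        with open(path, "w") as f:
            f.write("x = 1\n")

        # Populate the cache, as traceback formatting does for every frame.
        linecache.getlines(path)
        cached_before = path in linecache.cache

        # Edit the file while it is "in use": its size on disk changes.
        with open(path, "a") as f:
            f.write("y = 2\n")

        # checkcache compares size/mtime with the disk and drops the stale entry.
        linecache.checkcache(path)
        cached_after = path in linecache.cache

    return cached_before, cached_after


print(cache_entry_dropped_after_edit())  # prints (True, False)
```

In a single thread the entry is simply removed and re-read on the next lookup, which fits the observation that the bug does not show up without Parallel.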

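Solution:
Don't modify or test the model's source while the model is running. In theory the running code should stick with the previous version instead of the updated one, but since this bug exists, what can you do? Enjoy life while the program runs~ lol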
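
If waiting is not an option, one possible workaround (an assumption on my part, not something from the referenced issue) is to launch training from a frozen copy of the project, so that edits to the working copy never touch the files the running process loaded. A minimal sketch follows; snapshot_source and the directory names are hypothetical helpers, not part of PyTorch or PyTorch Lightning.

```python
import shutil
import tempfile
from pathlib import Path


def snapshot_source(src_dir, dest_root=None):
    """Copy src_dir into a fresh directory and return the path of the copy."""
    src = Path(src_dir).resolve()
    if dest_root is None:
        # A new temporary directory per run; nothing else writes to it.
        dest_root = Path(tempfile.mkdtemp(prefix="run_snapshot_"))
    dest = Path(dest_root) / src.name
    # Skip bytecode caches and VCS metadata; they are not needed to run.
    shutil.copytree(
        src, dest,
        ignore=shutil.ignore_patterns("__pycache__", "*.pyc", ".git"),
    )
    return dest


if __name__ == "__main__":
    # Demo on a throwaway project directory (hypothetical layout).
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp) / "project"
        project.mkdir()
        (project / "model.py").write_text("version = 1\n")

        frozen = snapshot_source(project, Path(tmp) / "snapshots")

        # Editing the working copy leaves the frozen copy untouched.
        (project / "model.py").write_text("version = 2\n")
        print((frozen / "model.py").read_text().strip())
```

The training script would then be started with the returned directory as its working directory, while the original checkout stays free for editing.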

Reference:
When modified the model python file, the pytorch will raise the KeyError of this file #43120
