Runtime error "storage has wrong size" when using torch.load

Posted 2023-03-29


[Original title]: There is a Runtime error "storage has wrong size:" when I use torch.load [Asked]: 2021-07-14 09:31:22 [Question]:

I get "RuntimeError: storage has wrong size" when I call torch.load("pthfilename"). My model was trained on multiple GPUs, and I saved it with the following code:

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"
device = torch.device(arg.local_rank)
net = Net().to(device)
net = torch.nn.parallel.DistributedDataParallel(net, device_ids=[arg.local_rank])
torch.save(net.state_dict(), "0.pth")

The error is:

Traceback (most recent call last):
  File "/root/PycharmProjects/test.py", line 8, in <module>
    model_dict = torch.load("0.pth")
  File "torch/serialization.py", line 529, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "torch/serialization.py", line 709, in _legacy_load
    deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: storage has wrong size: expected -4916312287391674656 got 24

[Comments]:

Can you help me? [Answer 1]:

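If you train your model with a multi-process setup (for example, DistributedDataParallel), save the model from a single process only, selected by its local_rank. Otherwise every process writes to the same file at the same time, which is the likely cause of the corrupted checkpoint.

See this link, this, and this. I hope this solution helps.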

import torch

def save_checkpoint(epoch, model, best_top5, optimizer,
                    is_best=False,
                    filename='checkpoint.pth.tar'):
    # Bundle everything needed to resume training into one dict.
    state = {
        'epoch': epoch + 1, 'state_dict': model.state_dict(),
        'best_top5': best_top5, 'optimizer': optimizer.state_dict(),
    }
    torch.save(state, filename)

# Only the process with local_rank 0 writes the checkpoint.
if args.local_rank == 0:
    if is_best:
        save_checkpoint(epoch, model, best_top5, optimizer, is_best=True, filename='model_best.pth.tar')
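
The same rank guard can be wrapped in a small helper. The following is only a minimal sketch, not part of the original answer: the name save_on_main_rank and the save_fn parameter are hypothetical, and the pickle-based default exists only so the sketch runs without PyTorch installed. It also writes to a temporary file and renames it into place, on the assumption that a half-written file is what torch.load is choking on.

```python
import os
import pickle
import tempfile


def save_on_main_rank(obj, filename, local_rank, save_fn=None):
    """Write `obj` to `filename` from the local_rank 0 process only.

    Returns True if this process wrote the file, False if it skipped.
    `save_fn(obj, path)` is a hypothetical hook; pass torch.save in practice.
    """
    if local_rank != 0:
        # All other ranks skip the write so they cannot clobber the file.
        return False

    if save_fn is None:
        # Stand-in serializer so the sketch runs without PyTorch.
        def save_fn(o, path):
            with open(path, "wb") as f:
                pickle.dump(o, f)

    # Write to a temporary file in the target directory, then rename it
    # into place, so a reader never sees a partially written checkpoint.
    directory = os.path.dirname(os.path.abspath(filename))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    try:
        save_fn(obj, tmp_path)
        os.replace(tmp_path, filename)
    except BaseException:
        # Clean up the temporary file if saving or renaming failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return True
```

With the code from the question, the call would look like save_on_main_rank(net.state_dict(), "0.pth", arg.local_rank, save_fn=torch.save). Note that in a multi-node job each node has its own local_rank 0, so on a shared filesystem the global rank may be the safer thing to check.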

[Comments]:
