How can I solve this PyTorch two devices error

Posted 2023-03-16


【Title】: How can I solve this PyTorch two devices error 【Posted】: 2022-01-05 13:32:52 【Question】:

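I'm running into a problem with PyTorch: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_addmm)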

model = nn.Sequential(
    nn.Linear(622, 512),
    nn.ReLU(),
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 5),
).to(device)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

train_loader = Data.DataLoader(
    dataset=train_dataset,
    batch_size=32,
    shuffle=True,
    num_workers=0,
)

test_loader = Data.DataLoader(
    dataset=test_dataset,
    batch_size=100,
    shuffle=True,
    num_workers=0,
)

best_acc = 0
best_model = model.cpu().state_dict().copy()
# train_acc = 0
# test_acc = 0
for epoch in range(20):
    for step, (batch_x, batch_y) in enumerate(train_loader):
        batch_x = batch_x.to(device)
        batch_y = batch_y.to(device)
        print(batch_x)
        print(batch_x.device, 0)
        out = model(batch_x.to(device)).cuda()
        print(out.device, 1)
        loss = loss_fn(out, batch_y.long())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train_acc = np.mean((torch.argmax(out, 1) == batch_y).cpu().numpy())

        with torch.no_grad():
            for batch_x, batch_y in test_loader:
                batch_x = batch_x.to(device)
                batch_y = batch_y.to(device)
                print(batch_x.device, 2)
                out = model(batch_x)
                print(batch_x.device, 3)
                test_acc = np.mean((torch.argmax(out, 1) == batch_y).cpu().numpy())
        if test_acc > best_acc:
            best_acc = test_acc
            best_model = model.cpu().state_dict().copy()

Can anyone help explain this? I've been working on it all day...

【Comments】:

The original code was 'out = model(batch_x)', which triggered this error, so I changed it to 'out = model(batch_x.to(device)).cuda()', but I still get the same error.

【Answer 1】:

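Note that .to() behaves differently on an nn.Module and on a torch.Tensor: for a torch.Tensor, .to(device) returns a copy of the tensor on that device and leaves the original where it was, whereas for an nn.Module, .to(device) operates in place. The same holds for model.cpu(): it moves the model's own parameters to the CPU rather than returning a CPU copy.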
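The difference can be checked with a small sketch. It is an illustration under one assumption, not part of the original answer: a dtype conversion stands in for the device move so that it also runs on a machine without a GPU. The names x, y and layer are made up for the example.

```python
import torch
from torch import nn

# Tensor.to(...) returns a new tensor; the original is left unchanged.
x = torch.zeros(2, 3)        # float32 by default
y = x.to(torch.float64)      # dtype conversion used as a stand-in for a device move
tensor_unchanged = x.dtype == torch.float32 and y.dtype == torch.float64

# Module.to(...) converts the module's own parameters and returns the same object.
layer = nn.Linear(3, 1)
returned = layer.to(torch.float64)
module_changed_in_place = returned is layer and layer.weight.dtype == torch.float64

# Module.cpu() follows the same pattern: it returns the module itself, not a copy.
cpu_returns_self = layer.cpu() is layer

print(tensor_unchanged, module_changed_in_place, cpu_returns_self)
```

So a call such as model.cpu().state_dict() does not leave model untouched on the GPU; the model itself ends up on the CPU.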

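In your code, you move the model to the CPU, once before the training loop starts and again every time a new best accuracy is recorded. Because the call works in place, the weights stay on the CPU while the batches are sent to cuda:0, which is exactly the mismatch the error reports: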

best_model = model.cpu().state_dict().copy()

Make sure you move the model back to device after you have moved it to the CPU.
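A minimal sketch of two ways to do this is given here. It assumes a small nn.Linear as a stand-in for the model in the question, and snapshot_state_dict is a hypothetical helper, not a PyTorch API. The second option clones each tensor because dict.copy() is a shallow copy, so the saved tensors can share storage with the live parameters.

```python
import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(4, 2).to(device)   # stand-in for the model in the question


def snapshot_state_dict(module):
    """Hypothetical helper: copy the weights to the CPU without moving the module."""
    return {name: t.detach().cpu().clone() for name, t in module.state_dict().items()}


# Option 1: keep the original call, then move the model back to the device.
best_model = model.cpu().state_dict().copy()
model.to(device)

# Option 2: leave the model where it is and copy only the tensors.
best_model = snapshot_state_dict(model)

# Weights and inputs are on the same device again, so the forward pass works.
batch_x = torch.randn(8, 4).to(device)
out = model(batch_x)
```

With the second option the model never leaves the device, so there is no move-back step to forget inside the training loop.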
