System hangs after first epoch training in PyTorch

Posted: 2018-11-27 10:36:36

【Question】

So, I am trying to train a ResNet model in PyTorch using the ImageNet example from the GitHub repository.

This is what my train method looks like (it is almost identical to the one in the example):

def train(train_loader, model, criterion, optimizer, epoch):
    batch_time = AverageMeter()
    data_time = AverageMeter()
    losses = AverageMeter()
    top1 = AverageMeter()
    top5 = AverageMeter()

    args = get_args()

    # switch to train mode
    model.train()

    end = time.time()

    for i, (input, target) in enumerate(train_loader):
        print(i)
        # data loading time
        data_time.update(time.time() - end)

        if cuda:
            target = target.cuda(async = True)
            input_var = torch.autograd.Variable(input).cuda()
        else:
            input_var = torch.autograd.Variable(input)

        target_var = torch.autograd.Variable(target)

        # compute output
        output = model(input_var)
        loss = criterion(output, target_var)

        # measure accuracy and record loss
        prec1, prec5 = accuracy(output.data, target, topk=(1, 5))
        losses.update(loss.item(), input.size(0))
        top1.update(prec1.item(), input.size(0))
        # top5.update(prec5.item(), input.size(0))

        # compute gradient and do optimizer step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        #measure elapsed time
        batch_time.update(time.time() - end)
        end = time.time()

        # print to console and write logs to tensorboard
        if i % args.print_freq == 0:
            print('Epoch: [{0}][{1}/{2}]\t'
                  'Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t'
                  'Data {data_time.val:.3f} ({data_time.avg:.3f})\t'
                  'Loss {loss.val:.4f} ({loss.avg:.4f})\t'
                  'Prec@1 {top1.val:.3f} ({top1.avg:.3f})\t'.format(
                epoch, i, len(train_loader), batch_time=batch_time,
                data_time=data_time, loss=losses, top1=top1, top5=top5))
            niter = epoch * len(train_loader) + i
            # writer.add_scalar('Train/Loss', losses.val, niter)
            # writer.add_scalar('Train/Prec@1', top1.val, niter)
            # writer.add_scalar('Train/Prec@5', top5.val, niter)

System info: GPU: Nvidia Titan XP, RAM: 32 GB

PyTorch: 0.4.0

When I run this code, training starts at epoch 0:

Epoch: [0][0/108]   Time 5.644 (5.644)  Data 1.929 (1.929)  Loss 6.9052 (6.9052)    Prec@1 0.000 (0.000)

Then the remote server disconnects automatically. This has happened five times.

Here is the data loader:

    # Load the data --> TRAIN
    traindir = 'train'
    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])
    train_dataset = datasets.ImageFolder(traindir, transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            normalize,
        ]))
    train_loader = torch.utils.data.DataLoader(
        train_dataset,
        batch_size=args.batch_size, shuffle=True, num_workers=args.num_workers,
        pin_memory=cuda
    )

    # Load the data --> Validation
    valdir = 'valid'
    valid_loader = torch.utils.data.DataLoader(
        datasets.ImageFolder(valdir, transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            normalize,
        ])),
        batch_size=args.batch_size, shuffle=False, num_workers=args.num_workers,
        pin_memory=cuda
    )

    if args.evaluate:
        validate(valid_loader, model, criterion, epoch=0)
        return

    # Start
    for epoch in range(args.start_epoch, args.epochs):
        adjust_learning_rate(optimizer, epoch)

        # train for epoch
        train(train_loader, model, criterion, optimizer, epoch)

        # evaluate on valid
        prec1 = validate(valid_loader, model, criterion, epoch)

        # remember best prec1 and save checkpoint
        is_best = prec1 > best_prec1
        best_prec1 = max(prec1, best_prec1)
        save_checkpoint({
            'epoch': epoch + 1,
            'arch': args.arch,
            'state_dict': model.state_dict(),
            'best_prec1': best_prec1,
            'optimizer': optimizer.state_dict()
        }, is_best)

With these parameters for the loader:

args.num_workers = 4
args.batch_size = 32
pin_memory = torch.cuda.is_available()
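
If it helps, here is a minimal check I can run to see whether simply iterating the loader (no model, no GPU work) already hangs; it is only a sketch reusing the train_loader defined above:

    # iterate the loader by itself to isolate data loading from the model/GPU
    for i, (input, target) in enumerate(train_loader):
        if i % 10 == 0:
            print(i, input.size(), target.size())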

Is there something wrong with my approach?

【Comments】:

【Answer 1】:

It seems to be a bug in PyTorch's DataLoader.

Try args.num_workers = 0
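
For example, the training loader can be rebuilt with worker processes disabled; this is only a sketch that reuses the train_dataset, args, and cuda flag from the question:

    train_loader = torch.utils.data.DataLoader(
        train_dataset,
        batch_size=args.batch_size,
        shuffle=True,
        num_workers=0,      # load every batch in the main process; no worker subprocesses are spawned
        pin_memory=cuda
    )

With num_workers=0 the batches are loaded in the main process, so the multiprocessing workers that appear to hang after the first iterations are never created. Loading becomes slower, but it confirms whether the DataLoader workers are the cause.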

【Discussion】:
