RuntimeError: cuda runtime error (710): device-side assert triggered at


Posted: 2020-04-07 10:09:30

Problem description:

Training an image classifier with PyTorch, I get the following error message:


RuntimeError                              Traceback (most recent call last)
 in 
     29             print(len(train_loader.dataset), len(valid_loader.dataset))
     30             #break
---> 31             train_loss, train_acc, model = train(model, device, train_loader, optimizer, criterion)
     32             valid_loss, valid_acc, model = evaluate(model, device, valid_loader, criterion)
     33

 in train(model, device, iterator, optimizer, criterion)
     21     acc = calculate_accuracy(fx, y)
     22     #print("5.")
---> 23     loss.backward()
     24
     25     optimizer.step()

~/venv/lib/python3.7/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
    164                 products. Defaults to ``False``.
    165         """
--> 166         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    167
    168     def register_hook(self, hook):

~/venv/lib/python3.7/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
     97     Variable._execution_engine.run_backward(
     98         tensors, grad_tensors, retain_graph, create_graph,
---> 99         allow_unreachable=True)  # allow_unreachable flag
    100
    101

RuntimeError: cuda runtime error (710) : device-side assert triggered at /pytorch/aten/src/THC/generic/THCTensorMath.cu:26

The relevant code blocks are here:

def train(model, device, iterator, optimizer, criterion):
    print('train')
    epoch_loss = 0
    epoch_acc = 0

    model.train()

    for (x, y) in iterator:
        #print(x,y)
        x, y = x.cuda(), y.cuda()
        #x = x.to(device)
        #y = y.to(device)
        #print('1')
        optimizer.zero_grad()
        #print('2')
        fx = model(x)
        #print('3')
        loss = criterion(fx, y)
        #print("4.loss->",loss)
        acc = calculate_accuracy(fx, y)
        #print("5.")
        loss.backward()

        optimizer.step()

        epoch_loss += loss.item()
        epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator), model


EPOCHS = 5
SAVE_DIR = 'models'
MODEL_SAVE_PATH = os.path.join(SAVE_DIR, 'please.pt')
from torch.utils.data import DataLoader
best_valid_loss = float('inf')

if not os.path.isdir(f'{SAVE_DIR}'):
    os.makedirs(f'{SAVE_DIR}')
print("start")
for epoch in range(EPOCHS):
    print('================================', epoch, '================================')
    for i, (train_idx, valid_idx) in enumerate(zip(train_indexes, valid_indexes)):
        print(i, train_idx, valid_idx, len(train_idx), len(valid_idx))

        traindf = df_train.iloc[train_idx, :].reset_index()
        validdf = df_train.iloc[valid_idx, :].reset_index()

        #traindf = df_train
        #validdf = df_train

        train_dataset = TrainDataset(traindf, mode='train', transforms=data_transforms)
        valid_dataset = TrainDataset(validdf, mode='valid', transforms=data_transforms)

        train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
        valid_loader = DataLoader(valid_dataset, batch_size=batch_size, shuffle=False)

        print(len(train_loader.dataset), len(valid_loader.dataset))
        #break
        train_loss, train_acc, model = train(model, device, train_loader, optimizer, criterion)
        valid_loss, valid_acc, model = evaluate(model, device, valid_loader, criterion)

        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            torch.save(model, MODEL_SAVE_PATH)

        print(f'| Epoch: {epoch+1:02} | Train Loss: {train_loss:.3f} | Train Acc: {train_acc*100:05.2f}% | Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:05.2f}% |')

splits = zip(train_indexes, valid_indexes)

Printing each (train index, valid index) pair gives:

[ 3692  3696  3703 ... 30733 30734 30735] [    0     1     2 ...  4028  4041  4046]
[    0     1     2 ... 30733 30734 30735] [ 3692  3696  3703 ...  7986  7991  8005]
[    0     1     2 ... 30733 30734 30735] [ 7499  7500  7502 ... 11856 11858 11860]
[    0     1     2 ... 30733 30734 30735] [11239 11274 11280 ... 15711 15716 15720]
[    0     1     2 ... 30733 30734 30735] [15045 15051 15053 ... 19448 19460 19474]
[    0     1     2 ... 30733 30734 30735] [18919 18920 18926 ... 23392 23400 23402]
[    0     1     2 ... 30733 30734 30735] [22831 22835 22846 ... 27118 27120 27124]
[    0     1     2 ... 27118 27120 27124] [26718 26721 26728 ... 30733 30734 30735]
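The question does not show how train_indexes and valid_indexes are built. For context only, here is a minimal sketch that produces fold index arrays of this shape, assuming a plain scikit-learn KFold over df_train; the actual splitter may well have been a stratified or grouped variant:

from sklearn.model_selection import KFold

# eight folds, matching the eight index pairs printed above (an assumption, not confirmed by the post)
kf = KFold(n_splits=8, shuffle=False)

train_indexes, valid_indexes = [], []
for train_idx, valid_idx in kf.split(df_train):
    train_indexes.append(train_idx)
    valid_indexes.append(valid_idx)

splits = zip(train_indexes, valid_indexes)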


Answer 1:

What is your loss function?

I ran into this error too. My problem was multi-class classification, and I was using CrossEntropy loss.

As the documentation says, the labels must lie in the range [0, C-1], where C is the number of classes. My labels were outside that range; once I used correct values for the labels, everything worked.
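A quick way to confirm whether this is the cause is to check the label range before training. Below is a minimal sketch, assuming the integer targets live in a 'label' column of traindf and that num_classes matches the number of outputs of the model's final layer; both names are placeholders, not taken from the question:

import torch

num_classes = 10  # placeholder: set to the size of the model's output layer

# placeholder column name; adjust to wherever the integer class targets are stored
labels = torch.as_tensor(traindf['label'].values)

if labels.min() < 0 or labels.max() >= num_classes:
    raise ValueError(
        f"nn.CrossEntropyLoss expects targets in [0, {num_classes - 1}], "
        f"found [{int(labels.min())}, {int(labels.max())}]"
    )

Note that CUDA kernels run asynchronously, which is why the device-side assert only surfaces at loss.backward() in the traceback above rather than at the criterion(fx, y) call that actually triggers it.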

