Runtime error: CUDA out of memory by the end of training and doesn't save model; pytorch

Posted: 2021-08-21 10:28:21

Question:

I don't have much experience with data science or pytorch, and I'm having trouble getting pretty much anything to work here (currently I'm building a NN for a segmentation task). There is some kind of memory problem, even though it doesn't make sense - each epoch on its own uses far less memory than the amount the usage keeps climbing to.

import torch
from torch import nn
from torch.autograd import Variable
from torch.nn import Linear, ReLU6, CrossEntropyLoss, Sequential, Conv2d, MaxPool2d, Module, Softmax, Softplus ,BatchNorm2d, Dropout, ConvTranspose2d
import torch.nn.functional as F
from torch.nn import LeakyReLU,Tanh
from torch.optim import Adam, SGD
import numpy as np
import cv2 as cv
def train(epoch,model,criterion, x_train, y_train, loss_val):
    model.train()
    tr_loss = 0
    # getting the training set
    x_train, y_train = Variable(x_train), Variable(y_train)
    # converting the data into GPU format

    # clearing the Gradients of the model parameters
    optimizer.zero_grad()
    
    # prediction for training and validation set
    output_train = model(x_train)
    # computing the training and validation loss
    loss_train = criterion(output_train, y_train)
    train_losses.append(loss_train)
    # computing the updated weights of all the model parameters
    loss_train.backward()
    optimizer.step()
    tr_loss = loss_train.item()
    return loss_train
        # printing the validation loss
        
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 96, (3,3), padding=1)
        self.conv11= nn.Conv2d(96, 96, (3,3), padding=1)
        self.conv12= nn.Conv2d(96, 96, (3,3), padding=1)
        self.pool  = nn.MaxPool2d((2,2), 2)
        self.conv2 = nn.Conv2d(96, 192, (3,3), padding=1)
        self.conv21 = nn.Conv2d(192, 192, (3,3), padding=1)
        self.conv22 = nn.Conv2d(192, 192, (3,3), padding=1)
        self.b = BatchNorm2d(96)
        self.b1 = BatchNorm2d(192)
        self.b2 = BatchNorm2d(384)
        self.conv3 = nn.Conv2d(192,384,(3,3), padding=1)
        self.conv31= nn.Conv2d(384,384,(3,3), padding=1)
        self.conv32= nn.Conv2d(384,384,(3,3), padding=1)
        self.lin1   = nn.Linear(384*16*16, 256*2*2, 1)
        self.lin2   = nn.Linear(256*2*2, 16*16, 1)
        self.uppool = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.upconv1= nn.ConvTranspose2d(385,192,(3,3), padding=1)
        self.upconv11=nn.ConvTranspose2d(192,32,(3,3), padding=1)
        self.upconv12=nn.ConvTranspose2d(32,1,(3,3), padding=1)
        self.upconv2= nn.ConvTranspose2d(193,96,(3,3), padding=1)
        self.upconv21= nn.ConvTranspose2d(96,16,(3,3), padding=1)
        self.upconv22= nn.ConvTranspose2d(16,1,(3,3), padding=1)
        self.upconv3= nn.ConvTranspose2d(97,16,(3,3), padding=1)
        self.upconv4= nn.ConvTranspose2d(16,8,(3,3), padding=1)
        self.upconv6= nn.ConvTranspose2d(8,1,(3,3), padding=1)
    def forward(self, x):
        m=Tanh()
        x1=self.b(m(self.conv12(m(self.conv11(m(self.conv1(x)))))))
        x = self.pool(x1)
        x2=self.b1(m(self.conv22(m(self.conv21(m(self.conv2(x)))))))
        x = self.pool(x2)
        x3=self.b2(m(self.conv32(m(self.conv31(m(self.conv3(x)))))))
        x=self.pool(x3)
        x = x.view(-1, 16*16*384)
        x = m(self.lin1(x))
        x = m(self.lin2(x))
        x = x.view(1, 1, 16, 16)
        x=torch.cat((x,self.pool(x3)),1)
        x = self.uppool(m(self.upconv12(m(self.upconv11(m(self.upconv1(x)))))))
        
        x=torch.cat((x,self.pool(x2)),1)
        x = self.uppool(m(self.upconv22(m(self.upconv21(m(self.upconv2(x)))))))
        
        x=torch.cat((x,self.pool(x1)),1)
        x = (self.uppool(m(self.upconv3(x))))
        x = (m(self.upconv4(x)))
        l=Softplus()
        x= l(self.upconv6(x))
        return x
train_data=[]
for path in range(1000):
    n="".join(["0" for i in range(5-len(str(path)))])+str(path)
    paths="00000\\"+n+".png"
    train_data.append(cv.imread(paths))
for path in range(2000,3000):
    n="".join(["0" for i in range(5-len(str(path)))])+str(path)
    paths="02000\\"+n+".png"
    train_data.append(cv.imread(paths))
train_output=[]
for path in range(1,2001):
    n="outputs\\"+str(path)+".jpg"
    train_output.append(cv.imread(n))
data=torch.from_numpy((np.array(train_data,dtype=float).reshape(2000,3,128,128)/255)).reshape(2000,3,128,128)
data_cuda=torch.tensor(data.to('cuda'), dtype=torch.float32)

output=torch.from_numpy(np.array(train_output,dtype=float).reshape(2000,3,128,128))[:,2].view(2000,1,128,128)*2
output_cuda=torch.tensor(output.to('cuda'),dtype=torch.float32)
model=Net()
optimizer = Adam(model.parameters(), lr=0.1)
criterion = nn.BCEWithLogitsLoss()
if torch.cuda.is_available():
    model = model.cuda()
    criterion = criterion.cuda()
print(model)
epochs=3
n_epochs = 1
train_losses = []
val_losses = []
for epoch in range(n_epochs):
    loss_train=0
    for i in range(data.shape[0]):
        loss_train1=train(epoch,model,criterion,data_cuda[i].reshape(1,3,128,128),output_cuda[i].reshape(1,1,128,128),train_losses)
        loss_train+=loss_train1
    print('Epoch : ',epoch+1, '\t', 'loss :', loss_train/data.shape[0])
with torch.no_grad():
    torch.save(model.state_dict(), "C:\\Users\\jugof\\Desktop\\Python\\pytorch_models")
    a=np.array(model(data_cuda).to('cpu').numpy())*255
    cv.imshow('',a.reshape(128,128))
    cv.waitKey(0)"""

Here is the error:

PS C:\Users\jugof\Desktop\Python> & C:/Users/jugof/anaconda3/python.exe c:/Users/jugof/Desktop/Python/3d_visual_effect1.py
c:/Users/jugof/Desktop/Python/3d_visual_effect1.py:98: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  data_cuda=torch.tensor(data.to('cuda'), dtype=torch.float32)
c:/Users/jugof/Desktop/Python/3d_visual_effect1.py:101: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  output_cuda=torch.tensor(output.to('cuda'),dtype=torch.float32)
Epoch :  1 	 loss : tensor(0.6933, device='cuda:0', grad_fn=<...>)
Traceback (most recent call last):
  File "c:/Users/jugof/Desktop/Python/3d_visual_effect1.py", line 120, in <module>
    a=np.array(model(data_cuda).to('cpu').numpy())*255
  File "C:\Users\jugof\anaconda3\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "c:/Users/jugof/Desktop/Python/3d_visual_effect1.py", line 62, in forward
    x1=self.b(m(self.conv12(m(self.conv11(m(self.conv1(x)))))))
  File "C:\Users\jugof\anaconda3\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\jugof\anaconda3\lib\site-packages\torch\nn\modules\conv.py", line 399, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "C:\Users\jugof\anaconda3\lib\site-packages\torch\nn\modules\conv.py", line 395, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: CUDA out of memory. Tried to allocate 11.72 GiB (GPU 0; 6.00 GiB total capacity; 2.07 GiB already allocated; 1.55 GiB free; 2.62 GiB reserved in total by PyTorch)

I feed in a numpy array (an image) of shape 128*128 and receive another one of the same shape; it is a segmentation model (again).

I'm using the Flickr-Faces-HQ dataset (FFHQ) with the 128*128 downsampled labels - I used the 00000, 01000 and 02000 folders, and the masks were obtained with the opencv haarscascades_eye.

Comments:

Answer 1:

The problem lies in your train_losses list, which stores every loss since the very start of the experiment. If what you appended to it were plain floats, that would not be a problem, but since you never return a float from the train function you are actually storing loss tensors with the whole computation graph embedded in them. A tensor keeps pointers to every tensor that took part in its computation, and as long as those pointers exist the allocated memory cannot be freed.

So basically, you are keeping every tensor from every epoch and preventing pytorch from cleaning them up; it behaves like a (deliberate) memory leak.
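Here is a minimal self-contained sketch of the difference (toy model and data, not from the question): appending the raw loss tensor keeps each step's graph reachable, while appending loss.item() stores only a plain float.

import torch
from torch import nn

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(8, 10), torch.randn(8, 1)

losses_as_tensors, losses_as_floats = [], []
for step in range(100):
    optimizer.zero_grad()
    loss = criterion(model(x), y)           # a tensor with grad_fn, still attached to its graph
    loss.backward()
    optimizer.step()
    losses_as_tensors.append(loss)          # keeps every step's graph alive -> the leak pattern
    losses_as_floats.append(loss.item())    # plain Python float -> the graph can be freed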

You can easily monitor this kind of issue by running nvidia-smi -l 1 once the experiment has started. You will see the memory usage grow linearly until the GPU runs out of memory (nvidia-smi is a good tool to keep around whenever you are doing anything on a GPU).
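As a rough in-script alternative (an illustrative sketch, not part of the original answer, assuming a CUDA device is available), PyTorch's own counters can be printed once per epoch:

import torch

if torch.cuda.is_available():
    alloc = torch.cuda.memory_allocated() / 1024**2      # memory currently held by tensors
    reserved = torch.cuda.memory_reserved() / 1024**2    # memory reserved by the caching allocator
    print(f"allocated: {alloc:.1f} MiB | reserved by PyTorch: {reserved:.1f} MiB")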

To prevent this, simply replace the last line of the train function with return loss_train.item(), and the memory problem will go away.
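Applied to the code in the question, the train function would end up looking roughly like this (a sketch that converts both the stored and the returned loss to plain floats, and still relies on the global optimizer and train_losses from the original code):

def train(epoch, model, criterion, x_train, y_train, loss_val):
    model.train()
    optimizer.zero_grad()
    output_train = model(x_train)
    loss_train = criterion(output_train, y_train)
    loss_train.backward()
    optimizer.step()
    train_losses.append(loss_train.item())   # store a float, not the graph-carrying tensor
    return loss_train.item()                 # return a float as well

The accumulation in the epoch loop (loss_train += loss_train1) then adds up plain floats and no longer retains any computation graphs.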

Comments:
