Memory Leak in PyTorch Autograd of WGAN-GP

Posted: 2021-11-04 22:23:21

【Problem Description】:

I want to use WGAN-GP, and when I run the code I get the following error:

def calculate_gradient_penalty(real_images, fake_images):
    # D (the critic) and lambda_term are defined elsewhere in the notebook.

    # Per-sample interpolation coefficient, broadcast to the full image shape
    t = torch.rand(real_images.size(0), 1, 1, 1).to(real_images.device)
    t = t.expand(real_images.size())

    # Random points between real and fake samples; gradients are taken w.r.t. these
    interpolates = t * real_images + (1 - t) * fake_images
    interpolates.requires_grad_(True)

    disc_interpolates = D(interpolates)

    # Gradient of the critic output w.r.t. the interpolated inputs;
    # create_graph=True so the penalty itself can be backpropagated through D
    grad = torch.autograd.grad(
        outputs=disc_interpolates, inputs=interpolates,
        grad_outputs=torch.ones_like(disc_interpolates),
        create_graph=True, retain_graph=True, allow_unused=True)[0]

    # Penalize deviation of the per-sample gradient norm from 1
    grad_norm = torch.norm(torch.flatten(grad, start_dim=1), dim=1)
    loss_gp = torch.mean((grad_norm - 1) ** 2) * lambda_term

    return loss_gp

RuntimeError                              Traceback (most recent call last)

/opt/conda/lib/python3.8/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph, inputs)
    243 
    244                 inputs=inputs)
--> 245         torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
    246 
    247     def register_hook(self, hook):

/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    143 
    144 
--> 145     Variable._execution_engine.run_backward(
    146         tensors, grad_tensors_, retain_graph, create_graph, inputs,
    147         allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag

RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 2; 15.75 GiB total capacity; 13.76 GiB already allocated; 2.75 MiB free; 14.50 GiB reserved in total by PyTorch)

Training code:

%%time

d_progress = []
d_fake_progress = []
d_real_progress = []
penalty = []
g_progress = []

data = get_infinite_batches(benign_data_loader)
one = torch.FloatTensor([1]).to(device) 
mone = (one * -1).to(device) 

for g_iter in range(generator_iters):

    print('----------G Iter {}-----------'.format(g_iter+1))

    for p in D.parameters():
        p.requires_grad = True # This is by Default
    
    d_loss_real = 0
    d_loss_fake = 0
    Wasserstein_D = 0

    for d_iter in range(critic_iter):
        D.zero_grad()
         
        images = data.__next__()
        if images.size()[0] != batch_size:
            continue
    
        # Train Discriminator
        # Real Images
        images = images.to(device)
        z = torch.randn(batch_size, 100, 1, 1).to(device)
        d_loss_real = D(images)
        d_loss_real = d_loss_real.mean(0).view(1)
        d_loss_real.backward(mone)
    
        # Fake Images
        fake_images = G(z)
        d_loss_fake = D(fake_images)
        d_loss_fake = d_loss_fake.mean(0).view(1)
        d_loss_fake.backward(one)
    
        # Calculate Penalty
        gradient_penalty = calculate_gradient_penalty(images.data, fake_images.data)
        gradient_penalty.backward()
    
        # Total Loss
        d_loss = d_loss_fake - d_loss_real + gradient_penalty
        Wasserstein_D = d_loss_real - d_loss_fake
        d_optimizer.step()
        print(f'D Iter: {d_iter+1}/{critic_iter} Loss: {d_loss.detach().cpu().numpy()}')
    
        time.sleep(0.1)
        d_progress.append(d_loss) # Store Loss
        d_fake_progress.append(d_loss_fake)
        d_real_progress.append(d_loss_real)
        penalty.append(gradient_penalty)

    # Generator Update
    for p in D.parameters():
        p.requires_grad = False  # Avoid Computation

    # Train Generator
    # Compute with Fake
    G.zero_grad()
    z = torch.randn(batch_size, 100, 1, 1).to(device)
    fake_images = G(z)
    g_loss = D(fake_images)
    g_loss = g_loss.mean().mean(0).view(1)
    g_loss.backward(one)
    # g_cost = -g_loss
    g_optimizer.step()
    print(f'G Iter: {g_iter+1}/{generator_iters} Loss: {g_loss.detach().cpu().numpy()}')
    
    g_progress.append(g_loss) # Store Loss    

Does anyone know how to solve this problem?

【Question Comments】:

Do you need to retain the graph when computing grad? Do you want to backpropagate on loss_gp afterwards?

I think so. If I set it to False, I get: "RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling .backward() or autograd.grad() the first time." And the gradient (the GP part of the loss) needs to backpropagate through the whole D network.

Then there simply isn't enough memory on your device. Can you show the part where you backpropagate from loss_gp?

OK, I have uploaded the training code.

Yes, you will need more memory. Otherwise, you will have to reduce the batch size or the input size of your model.

【Answer 1】:

All loss tensors that are kept outside of the optimization cycle (i.e. outside the for g_iter in range(generator_iters) loop) need to be detached from the graph. Otherwise, you are keeping all previous computation graphs in memory.
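To confirm that this accumulation, rather than a single oversized batch, is what exhausts the GPU, you can log the allocated CUDA memory once per critic step and watch it grow across generator iterations. Below is a minimal diagnostic sketch using PyTorch's built-in memory counters; the helper name and where you call it are illustrative, not part of the original code:

import torch

def log_cuda_memory(tag, device=None):
    # Report currently allocated and reserved CUDA memory in MiB.
    allocated = torch.cuda.memory_allocated(device) / 1024 ** 2
    reserved = torch.cuda.memory_reserved(device) / 1024 ** 2
    print(f'[{tag}] allocated: {allocated:.1f} MiB | reserved: {reserved:.1f} MiB')

# Example: call once at the end of every critic iteration, e.g.
# log_cuda_memory(f'g_iter {g_iter} d_iter {d_iter}')
# If the allocated number climbs every generator iteration, old graphs are being retained.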

Therefore, you should detach anything that gets appended to d_progress, d_fake_progress, d_real_progress, penalty, and g_progress.

You can do this by converting the tensors to scalar values with torch.Tensor.item; the graph will then free itself on the following iterations. Change the following lines:

    d_progress.append(d_loss) # Store Loss
    d_fake_progress.append(d_loss_fake)
    d_real_progress.append(d_loss_real)
    penalty.append(gradient_penalty)

#######

g_progress.append(g_loss) # Store Loss  

to:

    d_progress.append(d_loss.item()) # Store Loss
    d_fake_progress.append(d_loss_fake.item())
    d_real_progress.append(d_loss_real.item())
    penalty.append(gradient_penalty.item())

#######

g_progress.append(g_loss.item()) # Store Loss  
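If you prefer to keep the stored values as tensors rather than Python floats (for example to torch.stack them for plotting later), detaching them and moving them to the CPU cuts the graph reference just as well. A sketch of that variant, reusing the list names from the training loop above:

    d_progress.append(d_loss.detach().cpu())         # detached: no graph is kept alive
    d_fake_progress.append(d_loss_fake.detach().cpu())
    d_real_progress.append(d_loss_real.detach().cpu())
    penalty.append(gradient_penalty.detach().cpu())

#######

    g_progress.append(g_loss.detach().cpu())

Either way, the key point is that nothing with a grad_fn survives past the iteration that created it.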

【Comments】:
