Pytorch model running out of memory on both CPU and GPU, can't figure out what I'm doing wrong

Posted: 2021-05-18 02:15:04

Question:

I'm trying to implement a simple multi-label image classifier using Pytorch Lightning. Here is the model definition:

import torch
from torch import nn
import pytorch_lightning as pl  # provides the `pl` name used below

# creates network class
class Net(pl.LightningModule):
    def __init__(self):
        super().__init__()

        # defines conv layers
        self.conv_layer_b1 = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=32,
                      kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Flatten(),
        )

        # passes dummy x matrix to find the input size of the fc layer
        x = torch.randn(1, 3, 800, 600)
        self._to_linear = None
        self.forward(x)

        # defines fc layer
        self.fc_layer = nn.Sequential(
            nn.Linear(in_features=self._to_linear,
                      out_features=256),
            nn.ReLU(),
            nn.Linear(256, 5),
        )

        # defines accuracy metric
        self.accuracy = pl.metrics.Accuracy()
        self.confusion_matrix = pl.metrics.ConfusionMatrix(num_classes=5)

    def forward(self, x):
        x = self.conv_layer_b1(x)

        if self._to_linear is None:
            # does not run fc layer if input size is not determined yet
            self._to_linear = x.shape[1]
        else:
            x = self.fc_layer(x)
        return x

    def cross_entropy_loss(self, logits, y):
        criterion = nn.CrossEntropyLoss()
        
        return criterion(logits, y)

    def training_step(self, train_batch, batch_idx):
        x, y = train_batch
        logits = self.forward(x)

        train_loss = self.cross_entropy_loss(logits, y)
        train_acc = self.accuracy(logits, y)
        train_cm = self.confusion_matrix(logits, y)

        self.log('train_loss', train_loss)
        self.log('train_acc', train_acc)
        self.log('train_cm', train_cm)

        return train_loss

    def validation_step(self, val_batch, batch_idx):
        x, y = val_batch
        logits = self.forward(x)

        val_loss = self.cross_entropy_loss(logits, y)
        val_acc = self.accuracy(logits, y)

        return {'val_loss': val_loss, 'val_acc': val_acc}

    def validation_epoch_end(self, outputs):
        avg_val_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        avg_val_acc = torch.stack([x['val_acc'] for x in outputs]).mean()

        self.log("val_loss", avg_val_loss)
        self.log("val_acc", avg_val_acc)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=0.0008)

        return optimizer

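The problem is probably not the machine, since I'm using a cloud instance with 60 GB of RAM and 12 GB of VRAM. Whenever I run this model, even for a single epoch, I get an out-of-memory error. On the CPU it looks like this: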

RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 1966080000 bytes. Error code 12 (Cannot allocate memory)

On the GPU it looks like this:

RuntimeError: CUDA out of memory. Tried to allocate 7.32 GiB (GPU 0; 11.17 GiB total capacity; 4.00 KiB already allocated; 2.56 GiB free; 2.00 MiB reserved in total by PyTorch)

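Clearing the cache and reducing the batch size did not help. I'm a novice, and clearly something here is blowing up, but I can't tell what. Any help would be appreciated.

Thank you!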


Answer 1:

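Indeed, this is not a machine issue; the model itself is simply unreasonably large. If you look at common CNN models, the fc layers typically appear near the end, after the input has already passed through quite a few convolutional blocks and had its spatial resolution reduced.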

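Assume the input has shape (batch, 3, 800, 600). After passing through conv_layer_b1, the feature map has shape (batch, 32, 400, 300) once the MaxPool operation has been applied. After flattening, it becomes (batch, 32 * 400 * 300), i.e. (batch, 3840000).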
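The shape arithmetic above can be checked without allocating any tensors. This is a minimal sketch; the helper names `conv2d_out` and `pool2d_out` are hypothetical and simply apply the usual output-size formula floor((n + 2p - k) / s) + 1 for a convolution or pooling layer with dilation 1:

```python
def conv2d_out(n, kernel_size=3, padding=1, stride=1):
    # Output length along one spatial axis for a convolution (dilation 1).
    return (n + 2 * padding - kernel_size) // stride + 1


def pool2d_out(n, kernel_size=2, stride=2):
    # Output length along one spatial axis for max pooling (no padding).
    return (n - kernel_size) // stride + 1


height, width = 800, 600
out_channels = 32

# Conv2d(kernel_size=3, padding=1) keeps the size; MaxPool2d(2, 2) halves it.
h = pool2d_out(conv2d_out(height))
w = pool2d_out(conv2d_out(width))
flattened = out_channels * h * w

print((out_channels, h, w), flattened)  # (32, 400, 300) 3840000
```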

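The fc_layer that follows therefore contains nn.Linear(3840000, 256), which is simply absurd. This single linear layer holds about 983 million trainable parameters (3,840,000 x 256 weights plus 256 biases)! For reference, popular image classification CNNs have roughly 3 to 30 million parameters on average, with larger variants reaching 60 to 80 million. Very few actually cross the 100 million mark.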
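To put that number in terms of memory, here is a rough back-of-the-envelope sketch. It assumes float32 parameters (4 bytes each) and that training with Adam holds, for every parameter, the weight, its gradient, and two running-moment buffers. Activations and framework overhead are ignored, so treat the result as a lower bound rather than a measurement:

```python
in_features, out_features = 3_840_000, 256

# Weights plus biases of nn.Linear(3840000, 256).
params = in_features * out_features + out_features

bytes_per_param = 4  # assumption: float32
weights_gb = params * bytes_per_param / 1e9

# Assumption: weight + gradient + two Adam moment buffers per parameter.
training_gb = 4 * weights_gb

print(params)  # 983040256
print(round(weights_gb, 2), round(training_gb, 2))  # 3.93 15.73
```

Under these assumptions, this one layer alone would need more than the 12 GB of VRAM mentioned in the question once gradients and optimizer state are included.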

You can count your model's parameters with this:

def count_params(model):
    return sum(map(lambda p: p.data.numel(), model.parameters()))

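My suggestion: 800 x 600 is a really massive input size. If possible, reduce it to something like 400 x 300. In addition, add several convolutional blocks similar to conv_layer_b1 before the FC layer. For example: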

def get_conv_block(C_in, C_out):
    return nn.Sequential(
            nn.Conv2d(in_channels=C_in, out_channels=C_out,
                      kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )

class Net(pl.LightningModule):
    def __init__(self):
        super().__init__()

        # defines conv layers
        self.conv_layer_b1 = get_conv_block(3, 16)
        self.conv_layer_b2 = get_conv_block(16, 32)
        self.conv_layer_b3 = get_conv_block(32, 64)
        self.conv_layer_b4 = get_conv_block(64, 128)
        self.conv_layer_b5 = get_conv_block(128, 256)

        # passes dummy x matrix to find the input size of the fc layer
        x = torch.randn(1, 3, 800, 600)
        self._to_linear = None
        self.forward(x)

        # defines fc layer
        self.fc_layer = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_features=self._to_linear,
                      out_features=256),
            nn.ReLU(),
            nn.Linear(256, 5)
        )

        # defines accuracy metric
        self.accuracy = pl.metrics.Accuracy()
        self.confusion_matrix = pl.metrics.ConfusionMatrix(num_classes=5)

    def forward(self, x):
        
        x = self.conv_layer_b1(x)
        x = self.conv_layer_b2(x)
        x = self.conv_layer_b3(x)
        x = self.conv_layer_b4(x)
        x = self.conv_layer_b5(x)
        
        if self._to_linear is None:
            # does not run fc layer if input size is not determined yet
            self._to_linear = nn.Flatten()(x).shape[1]
        else:
            x = self.fc_layer(x)
        return x

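Here, because more conv-relu-pool layers are applied, the input is reduced to a much smaller feature map of shape (batch, 256, 25, 18), and the total number of trainable parameters drops to about 30 million.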
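These figures can be reproduced with plain integer arithmetic. This is a sketch under the assumption that each block is a 3x3 convolution with padding 1 (which preserves height and width) followed by a 2x2 max pool with stride 2 (which halves them, rounding down). `summarize` is a hypothetical helper, not part of PyTorch:

```python
def summarize(height, width, channels=(3, 16, 32, 64, 128, 256),
              hidden=256, classes=5):
    """Return the final feature-map shape and the total parameter count."""
    conv_params = 0
    for c_in, c_out in zip(channels, channels[1:]):
        conv_params += c_in * c_out * 3 * 3 + c_out  # 3x3 kernels + biases
        height, width = height // 2, width // 2      # MaxPool2d(2, 2) floors
    flattened = channels[-1] * height * width
    # Linear(flattened, hidden) followed by Linear(hidden, classes).
    fc_params = flattened * hidden + hidden + hidden * classes + classes
    return (channels[-1], height, width), conv_params + fc_params


print(summarize(800, 600))  # ((256, 25, 18), 29885349)
print(summarize(400, 300))  # ((256, 12, 9), 7472037)
```

With the 800 x 600 input this gives roughly 29.9 million parameters, in line with the estimate above. Shrinking the input to the suggested 400 x 300 would bring the total down to about 7.5 million.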

Comments:

This was a super helpful answer, thanks a lot, my guy.
