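3.4 Model Structure

This article comes from the introductory object detection tutorial created by the CV group of the open-source organization DataWhale 🐳.

It corresponds to Chapter 3 of the open-source project 《动手学CV-Pytorch》. The code used in the tutorial can also be found in the project, more content will be added over time, and stars are welcome ⭐️.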




3.4.1 VGG16 as the Backbone


Figure 3-17 The Tiny-Detector backbone



import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision


class VGGBase(nn.Module):
    """
    VGG base convolutions to produce feature maps.
    """

    def __init__(self):
        super(VGGBase, self).__init__()

        # Standard convolutional layers in VGG16
        self.conv1_1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)  # stride = 1, by default
        self.conv1_2 = nn.Conv2d(64, 64, kernel_size=3, padding=1)
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)    # 224->112

        self.conv2_1 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.conv2_2 = nn.Conv2d(128, 128, kernel_size=3, padding=1)
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)    # 112->56

        self.conv3_1 = nn.Conv2d(128, 256, kernel_size=3, padding=1)
        self.conv3_2 = nn.Conv2d(256, 256, kernel_size=3, padding=1)
        self.conv3_3 = nn.Conv2d(256, 256, kernel_size=3, padding=1)
        self.pool3 = nn.MaxPool2d(kernel_size=2, stride=2)    # 56->28

        self.conv4_1 = nn.Conv2d(256, 512, kernel_size=3, padding=1)
        self.conv4_2 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.conv4_3 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.pool4 = nn.MaxPool2d(kernel_size=2, stride=2)    # 28->14

        self.conv5_1 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.conv5_2 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.conv5_3 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.pool5 = nn.MaxPool2d(kernel_size=2, stride=2)    # 14->7

        # Load pretrained weights on ImageNet
        self.load_pretrained_layers()

    def forward(self, image):
        """
        Forward propagation.

        :param image: images, a tensor of dimensions (N, 3, 224, 224)
        :return: feature maps pool5
        """
        out = F.relu(self.conv1_1(image))  # (N, 64, 224, 224)
        out = F.relu(self.conv1_2(out))  # (N, 64, 224, 224)
        out = self.pool1(out)  # (N, 64, 112, 112)

        out = F.relu(self.conv2_1(out))  # (N, 128, 112, 112)
        out = F.relu(self.conv2_2(out))  # (N, 128, 112, 112)
        out = self.pool2(out)  # (N, 128, 56, 56)

        out = F.relu(self.conv3_1(out))  # (N, 256, 56, 56)
        out = F.relu(self.conv3_2(out))  # (N, 256, 56, 56)
        out = F.relu(self.conv3_3(out))  # (N, 256, 56, 56)
        out = self.pool3(out)  # (N, 256, 28, 28)

        out = F.relu(self.conv4_1(out))  # (N, 512, 28, 28)
        out = F.relu(self.conv4_2(out))  # (N, 512, 28, 28)
        out = F.relu(self.conv4_3(out))  # (N, 512, 28, 28)
        out = self.pool4(out)  # (N, 512, 14, 14)

        out = F.relu(self.conv5_1(out))  # (N, 512, 14, 14)
        out = F.relu(self.conv5_2(out))  # (N, 512, 14, 14)
        out = F.relu(self.conv5_3(out))  # (N, 512, 14, 14)
        out = self.pool5(out)  # (N, 512, 7, 7)

        # Return the 7x7 feature map
        return out

    def load_pretrained_layers(self):
        """
        We use a VGG-16 pretrained on the ImageNet task as the base network.
        There's one available in PyTorch, see https://pytorch.org/docs/stable/torchvision/models.html#torchvision.models.vgg16
        We copy these parameters into our network. It's straightforward for conv1 to conv5.
        """
        # Current state of base
        state_dict = self.state_dict()
        param_names = list(state_dict.keys())

        # Pretrained VGG base
        pretrained_state_dict = torchvision.models.vgg16(pretrained=True).state_dict()
        pretrained_param_names = list(pretrained_state_dict.keys())

        # Transfer conv. parameters from pretrained model to current model
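        # Our 13 conv layers line up, in order, with the first 26 pretrained entries (weight and bias each)
        for i, param in enumerate(param_names):
            state_dict[param] = pretrained_state_dict[pretrained_param_names[i]]

        # Write the copied parameters back into the module; without this call the copy has no effect
        self.load_state_dict(state_dict)

        print("\nLoaded base model.\n")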
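
The spatial sizes annotated in the comments of VGGBase can be checked by hand. The following is a minimal sketch in plain Python (no PyTorch needed) that applies the standard output-size formula, floor((n + 2p - k) / s) + 1, to each layer. The helper names `conv_out` and `vgg_feature_size` are hypothetical and are not part of the project code.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output size of a conv or pooling layer along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def vgg_feature_size(size=224):
    """Trace the spatial size after each of the five VGG16 stages."""
    convs_per_stage = [2, 2, 3, 3, 3]  # conv1_x ... conv5_x, as in VGGBase
    sizes = []
    for n_convs in convs_per_stage:
        for _ in range(n_convs):
            # 3x3 conv with padding 1 and stride 1 keeps the size unchanged
            size = conv_out(size, kernel=3, stride=1, padding=1)
        # 2x2 max pooling with stride 2 halves the size
        size = conv_out(size, kernel=2, stride=2)
        sizes.append(size)
    return sizes


print(vgg_feature_size(224))  # [112, 56, 28, 14, 7]
```

A 224x224 input therefore shrinks by a factor of 2^5 = 32, which is where the 7x7 output comes from.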

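Thus, the feature extraction layers of our Tiny_Detector output a 7x7 feature map. Next, we set up the corresponding prior boxes, also called anchors, on this feature map.


  • Divide the original image evenly into 7x7 cells
  • Use 3 different scales: 0.2, 0.4, 0.6
  • Use 3 different aspect ratios: 1:1, 1:2, 2:1

As a result, we place 7x7x9 = 441 anchor boxes on this 7x7 feature map, 9 per cell, as shown in Figure 3-18: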
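
The anchor layout described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: boxes are stored as (cx, cy, w, h) in fractional image coordinates, the aspect ratio is read as width:height, and a box of scale s and ratio r gets width s*sqrt(r) and height s/sqrt(r) (the SSD-style convention). The function name `create_prior_boxes` is hypothetical here, and boxes that extend past the image border are not clipped.

```python
from math import sqrt


def create_prior_boxes(fmap_dims=7, scales=(0.2, 0.4, 0.6), aspect_ratios=(1.0, 0.5, 2.0)):
    """Return a list of (cx, cy, w, h) prior boxes, 9 per cell of the feature map."""
    priors = []
    for i in range(fmap_dims):          # row index of the cell
        for j in range(fmap_dims):      # column index of the cell
            # Centre of the cell, as a fraction of the image size
            cx = (j + 0.5) / fmap_dims
            cy = (i + 0.5) / fmap_dims
            for s in scales:
                for r in aspect_ratios:
                    # Assumed convention: area stays s*s, width/height equals r
                    priors.append((cx, cy, s * sqrt(r), s / sqrt(r)))
    return priors


priors = create_prior_boxes()
print(len(priors))  # 441
```

Each group of 9 consecutive entries belongs to one cell, so the ordering matches a row-by-row walk over the 7x7 grid.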



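In our experiments, the class information consists of scores for 21 classes (the 20 classes of the VOC dataset plus one background class). The model finally takes the class with the highest predicted score as the class of the object in the bounding box.


Figure 3-19 Example of the Tiny-Detector output


In fact, we only need to attach two 3x3 convolutional layers after the 7x7 feature map to perform the classification and regression predictions, respectively.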
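
As a small illustration of this selection step, the sketch below picks the highest-scoring class for one box from a list of 21 scores. The label order, with "background" at index 0 followed by the 20 VOC classes in alphabetical order, is an assumption made for this example, and `predict_class` is a hypothetical helper.

```python
VOC_LABELS = (
    "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat",
    "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor",
)
# Assumption: index 0 is the background class
LABELS = ("background",) + VOC_LABELS


def predict_class(scores):
    """Return (label, score) of the highest-scoring class for one box."""
    best = max(range(len(scores)), key=lambda k: scores[k])
    return LABELS[best], scores[best]


scores = [0.0] * len(LABELS)
scores[LABELS.index("dog")] = 3.0   # pretend the model is confident about "dog"
print(predict_class(scores))        # ('dog', 3.0)
```

If "background" wins, the box is treated as containing no object.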


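3.4.2 Classification Head and Regression Head

Encoding and decoding bounding boxes



Figure 3-21 Example of a target box and a predicted box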



$$g_{cx} = \frac{c_x - \hat{c}_x}{\hat{w}}$$

$$g_{cy} = \frac{c_y - \hat{c}_y}{\hat{h}}$$

$$g_w = \log\left(\frac{w}{\hat{w}}\right)$$

$$g_h = \log\left(\frac{h}{\hat{h}}\right)$$

What the model predicts and outputs is this encoded offset ($g_{cx}, g_{cy}, g_w, g_h$). To obtain the predicted box, we simply decode it by inverting the formulas.
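
To make the encode and decode steps concrete, here is a minimal round-trip sketch in plain Python for a single box. It assumes that the hatted symbols in the formulas refer to the prior box and the plain symbols to the target box, both in (cx, cy, w, h) form, and it leaves out any empirical scaling factors. The names `encode` and `decode` are hypothetical.

```python
from math import exp, log


def encode(box, prior):
    """Encode a (cx, cy, w, h) box as offsets relative to a prior box."""
    cx, cy, w, h = box
    p_cx, p_cy, p_w, p_h = prior
    return ((cx - p_cx) / p_w, (cy - p_cy) / p_h, log(w / p_w), log(h / p_h))


def decode(offsets, prior):
    """Invert encode(): recover the (cx, cy, w, h) box from its offsets."""
    g_cx, g_cy, g_w, g_h = offsets
    p_cx, p_cy, p_w, p_h = prior
    return (g_cx * p_w + p_cx, g_cy * p_h + p_cy, exp(g_w) * p_w, exp(g_h) * p_h)


prior = (0.5, 0.5, 0.4, 0.4)      # an example prior box
box = (0.55, 0.45, 0.5, 0.3)      # an example target box
offsets = encode(box, prior)
recovered = decode(offsets, prior)
print(offsets)
print(recovered)                  # matches `box` up to floating-point error
```

Dividing the centre offset by the prior's size makes the offset independent of how large the prior is, and the log keeps the size ratio symmetric around zero.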


def cxcy_to_gcxgcy(cxcy, priors_cxcy):
    """
    Encode bounding boxes (that are in center-size form) w.r.t. the corresponding prior boxes (that are in center-size form).

    For the center coordinates, find the offset with respect to the prior box, and scale by the size of the prior box.
    For the size coordinates, scale by the size of the prior box, and convert to the log-space.

    In the model, we are predicting bounding box coordinates in this encoded form.

    :param cxcy: bounding boxes in center-size coordinates, a tensor of size (n_priors, 4)
    :param priors_cxcy: prior boxes with respect to which the encoding must be performed, a tensor of size (n_priors, 4)
    :return: encoded bounding boxes, a tensor of size (n_priors, 4)
    """

    # The 10 and 5 below are referred to as 'variances' in the original SSD Caffe repo, completely empirical
    # They are for some sort of numerical conditioning, for 'scaling the localization gradient'
    # See https://github.com/weiliu89/caffe/issues/155
    return torch.cat([(cxcy[:, :2] - priors_cxcy[:, :2]) / (priors_cxcy[:, 2:] / 10),  # g_c_x, g_c_y
                      torch.log(cxcy[:, 2:] / priors_cxcy[:, 2:]) * 5], 1)  # g_w, g_h

def gcxgcy_to_cxcy(gcxgcy, priors_cxcy):
    """
    Decode bounding box coordinates predicted by the model, since they are encoded in the form mentioned above.

    They are decoded into center-size coordinates.

    This is the inverse of the function above.

    :param gcxgcy: encoded bounding boxes, i.e. output of the model, a tensor of size (n_priors, 4)
    :param priors_cxcy: prior boxes with respect to which the encoding is defined, a tensor of size (n_priors, 4)
    :return: decoded bounding boxes in center-size form, a tensor of size (n_priors, 4)
    """

    return torch.cat([gcxgcy[:, :2] * priors_cxcy[:, 2:] / 10 + priors_cxcy[:, :2],  # c_x, c_y
                      torch.exp(gcxgcy[:, 2:] / 5) * priors_cxcy[:, 2:]], 1)  # w, h

Classification head and regression head predictions

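As described earlier, for each prior box on the 7x7 output feature map we want to predict:

1) a set of 21 class scores for the bounding box (the 20 VOC classes plus one background class);

2) the encoded offsets of the bounding box ($g_{cx}, g_{cy}, g_w, g_h$).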

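To obtain the class scores and offsets we want to predict, we attach two separate convolutional layers after the feature map: a classification head and a regression head.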




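Figure 3-22 Example of the Tiny-Detector output

The outputs of the regression head and the classification head are shown in blue and yellow, respectively. The spatial size of the feature map stays 7x7. What we really care about is the third dimension, the number of channels. Expanding it gives the layout shown in Figure 3-23 below:

Figure 3-23 Encoded offsets predicted for the 9 anchors in each cell



Figure 3-24 Class scores predicted for the 9 anchors in each cell

The structure of the classification head and the regression head is defined by the PredictionConvolutions class in model.py. The code is as follows: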
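
The channel counts behind these two figures follow from simple arithmetic, sketched below in plain Python. The helper `head_output_shapes` is hypothetical, and the flattened per-box layouts it reports, (441, 4) and (441, 21), are an assumption about how the head outputs are later rearranged rather than something taken from the project code.

```python
def head_output_shapes(fmap_dims=7, n_boxes=9, n_classes=21):
    """Shapes of the two head outputs for one image, before and after flattening."""
    loc_channels = n_boxes * 4           # 4 encoded offsets per anchor
    cls_channels = n_boxes * n_classes   # one score per class per anchor
    n_priors = fmap_dims * fmap_dims * n_boxes
    return {
        "loc_map": (loc_channels, fmap_dims, fmap_dims),
        "cls_map": (cls_channels, fmap_dims, fmap_dims),
        "locs": (n_priors, 4),              # assumed flattened layout
        "scores": (n_priors, n_classes),    # assumed flattened layout
    }


for name, shape in head_output_shapes().items():
    print(name, shape)
# loc_map (36, 7, 7)
# cls_map (189, 7, 7)
# locs (441, 4)
# scores (441, 21)
```

The element counts agree in both views: 36 x 7 x 7 = 441 x 4 = 1764 for the offsets, and 189 x 7 x 7 = 441 x 21 = 9261 for the class scores.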

class PredictionConvolutions(nn.Module):
    """
    Convolutions to predict class scores and bounding boxes using feature maps.

    The bounding boxes (locations) are predicted as encoded offsets w.r.t each of the 441 prior (default) boxes.
    See 'cxcy_to_gcxgcy' in utils.py for the encoding definition.

    The class scores represent the scores of each object class in each of the 441 bounding boxes located.
    A high score for 'background' = no object.
    """

    def __init__(self, n_classes):
        """
        :param n_classes: number of different types of objects
        """
        super(PredictionConvolutions, self).__init__()

        self.n_classes = n_classes

        # Number of prior-boxes we are considering per position in the feature map
        # 9 prior-boxes = 3 scales x 3 aspect ratios
        n_boxes = 9

        # Localization prediction convolutions (predict offsets w.r.t prior-boxes)
        self.loc_conv = nn.Conv2d(512, n_boxes * 4, kernel_size=3, padding=1)

        # Class prediction convolutions (predict classes in localization boxes)
        self.cl_conv = nn.Conv2d(512, n_boxes * n_classes, kernel_size=3, padding=1)

        # Initialize convolutions' parameters





