Faster_RCNN 2. Model Preparation (Part 1)

Posted 2020-11-02 三年一梦

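Summarized from the Faster_RCNN paper and a PyTorch implementation:

This post covers the second part of the code, model/utils. It first goes through the main operations in theory, then walks through their concrete implementation in the code analysis.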

I. Main operations

1. Bounding box regression:

The goal is to improve localization performance. It is used in both DPM and RCNN.

1) RCNN version:

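RCNN uses class-specific bounding box regressors: one regressor is learned per class and used to refine the bbox predictions of that class. Note that for regression the bbox coordinates (top-left, bottom-right) are converted to a center point (x, y) plus width and height (w, h). Given a predicted bbox P and a ground-truth box G (gt_bbox), we want to learn a transformation that maps P to a new position as close to G as possible. Unlike in Faster-RCNN, the parameter group of this transformation is:

$$d_x(P),\quad d_y(P),\quad d_w(P),\quad d_h(P)$$

All four parameters are functions of the features. The first two specify a scale-invariant translation of the bbox center, and the last two specify log-space transformations of the bbox width and height. Once these four parameters (functions) are learned, P can be mapped to G', with G' as close as possible to the ground truth G:

$$G'_x = P_w\, d_x(P) + P_x$$
$$G'_y = P_h\, d_y(P) + P_y$$
$$G'_w = P_w \exp(d_w(P))$$
$$G'_h = P_h \exp(d_h(P))$$

So how is this parameter group obtained? It is a function of the pool5 features of the proposal P. Denoting the pool5 features of P by $\phi_5(P)$, we have:

$$d_*(P) = W_*^{\top} \phi_5(P), \quad * \in \{x, y, w, h\}$$

where W is the learnable parameter. In other words, the parameter group we want to learn is the product of W and the features. What, then, are the regression targets? They are the inverse of the four mapping equations above:

$$t_x = (G_x - P_x) / P_w$$
$$t_y = (G_y - P_y) / P_h$$
$$t_w = \log(G_w / P_w)$$
$$t_h = \log(G_h / P_h)$$

$t_*$ is the target parameter group of the regression. That is, for each class we learn a W such that the product of W and the pool5 features of that class's proposals is as close as possible to $t_*$. This gives the regularized least-squares (ridge regression) objective:

$$W_* = \arg\min_{\hat{W}_*} \sum_{i=1}^{N} \left(t_*^i - \hat{W}_*^{\top} \phi_5(P^i)\right)^2 + \lambda \lVert \hat{W}_* \rVert^2$$

Because this is ridge regression, there is an L2 penalty on W; the RCNN paper sets the penalty factor lambda=1000. The training pairs (P, G) are also not chosen arbitrarily: the predicted P must be close to at least one ground truth G for the learned parameters to be meaningful. "Close" here means that the IoU of P and G is greater than 0.6.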
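The ridge regression objective above has a closed-form solution, which can be sketched as follows. This is a minimal illustration, not the RCNN implementation: the function name `fit_ridge` is hypothetical, and random vectors stand in for the pool5 features.

```python
import numpy as np


def fit_ridge(features, targets, lam=1000.0):
    """Closed-form ridge regression.

    Solves argmin_W ||targets - features @ W||^2 + lam * ||W||^2.
    features: (N, D) array standing in for the pool5 features of N proposals.
    targets:  (N, 4) array of regression targets (t_x, t_y, t_w, t_h).
    Returns W with shape (D, 4), one column per target.
    """
    d = features.shape[1]
    gram = features.T @ features + lam * np.eye(d)
    return np.linalg.solve(gram, features.T @ targets)


# Synthetic stand-in data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 8))
true_w = rng.normal(size=(8, 4))
targets = features @ true_w

w_small = fit_ridge(features, targets, lam=1e-6)   # almost no penalty
w_large = fit_ridge(features, targets, lam=1000.0)  # penalty factor from the text
```

With a tiny penalty the solution recovers the weights that generated the data; with lambda=1000 the weights are shrunk toward zero, which is the regularizing effect of the L2 term.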

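As can be seen, in RCNN every proposal goes through its own feature extraction pass, which is inefficient. The later Fast/Faster-RCNN instead perform bounding box regression on regions of a single feature map computed for the whole image.

2) Faster RCNN version:

Compared with RCNN, there are two main differences.

First, the features differ. In RCNN, the regression features are the pool5 features of each proposal, whereas Faster-RCNN slides a 3*3 convolution over the feature map of the whole image, and each 3*3 window of the feature map corresponds to 9 anchors. Two parallel 1*1 convolutions then reduce the feature channels to 4*9 (four coordinates for each of the 9 anchors) and 2*9 (foreground/background scores for each of the 9 anchors), used for regression and classification respectively. This is part of the job of the RPN, which is also the main advantage of Faster-RCNN.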

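Second, the number of regressors and the regression objective differ. Faster-RCNN no longer uses class-specific regressors; instead it uses 9 regressors, because every point on the feature map corresponds to 9 anchors. The 9 anchors cover 9 different combinations of scale and aspect ratio, and each regressor handles exactly one of them. So although the candidate boxes start from only 9 kinds of anchor, after regression they can predict bounding boxes of all sizes and shapes, which is also thanks to the anchor design. As for the regression loss, first look at the prediction and target formulas:

$$t_x = (x - x_a) / w_a, \quad t_y = (y - y_a) / h_a$$
$$t_w = \log(w / w_a), \quad t_h = \log(h / h_a)$$
$$t_x^* = (x^* - x_a) / w_a, \quad t_y^* = (y^* - y_a) / h_a$$
$$t_w^* = \log(w^* / w_a), \quad t_h^* = \log(h^* / h_a)$$

Here x, y, w, h are the center coordinates, width and height of a box, and $x$, $x_a$, $x^*$ refer to the predicted box, the anchor box and the ground-truth box respectively (likewise for y, w, h). The computation is similar to RCNN: the first two rows are the offsets and scales of the predicted box relative to the anchor, and the last two rows are those of the ground-truth box relative to the anchor. The goal of the regression is then clear: make $t$ as close as possible to $t^*$. The regression loss uses the smooth L1 function defined in Fast-RCNN, which is less sensitive to outliers:

$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5\,x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$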

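Minimizing this loss optimizes the weights W so that at test time applying W to the features of a bbox yields good offsets and scales, which are then used to fine-tune the original predicted bbox into a better prediction.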
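The smooth L1 function can be sketched in a few lines of NumPy. This is the basic form from Fast-RCNN without any extra scaling parameter; the helper name `bbox_regression_loss` is hypothetical.

```python
import numpy as np


def smooth_l1(x):
    """Elementwise smooth L1: 0.5 * x^2 if |x| < 1, else |x| - 0.5."""
    x = np.asarray(x, dtype=np.float64)
    abs_x = np.abs(x)
    return np.where(abs_x < 1.0, 0.5 * x ** 2, abs_x - 0.5)


def bbox_regression_loss(t_pred, t_true):
    """Sum of smooth L1 over the 4 parameters of every box (hypothetical helper)."""
    return smooth_l1(np.asarray(t_pred) - np.asarray(t_true)).sum()


values = smooth_l1([0.5, -2.0, 1.0])
loss = bbox_regression_loss([[0.1, 0.2, 0.0, 3.0]], [[0.1, 0.0, 0.0, 0.0]])
```

Small errors are penalized quadratically and large errors only linearly, which is why an outlier such as the 3.0 above contributes 2.5 rather than 4.5 to the loss.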

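2. RPN in detail:

[Figures: RPN network; RoIHead network]

The RPN is also the biggest improvement in Faster-RCNN. Its input is the image features, and it is a fully convolutional network. It has two tasks: training itself and providing RoIs.

Training itself: binary classification and bounding box regression (targets provided by AnchorTargetCreator)

Providing RoIs: supplying the RoIs that Fast-RCNN needs for training (implemented by ProposalCreator)

1) RPN overview:

As mentioned earlier, the whole training process uses batchsize=1, i.e. one image per step, so the feature map has shape (1, 512, hh, ww), and that is also the input of the RPN. After 512 padded 3*3 convolutions the shape is still (1, 512, hh, ww). Since this convolution does not change the shape, its purpose is presumably to transform the features into a different semantic space. The network then splits into two branches: the left branch is 18 1*1 convolutions and the right branch is 36 1*1 convolutions. A 1*1 convolution changes the feature dimension, so the left branch outputs shape (1, 18, hh, ww) and the right branch outputs shape (1, 36, hh, ww). The left branch has 18 channels because each of the 9 anchors at every point gets a 2-class probability prediction, so 9*2 = 18. The right branch has 36 channels because each of the 9 anchors at every point gets 4 coordinate predictions, so 9*4 = 36.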
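The shape bookkeeping of the two 1*1 branches can be checked with a small NumPy sketch. A 1*1 convolution is a per-pixel matrix multiply, so `einsum` is enough here. The helper `conv1x1`, the random weights and the feature map size 37×50 are assumptions for illustration; the actual model uses PyTorch convolution layers.

```python
import numpy as np


def conv1x1(x, weight):
    """Apply a 1x1 convolution (no bias) as a per-pixel matrix multiply.

    x: (N, C_in, H, W), weight: (C_out, C_in) -> (N, C_out, H, W).
    """
    return np.einsum("oc,nchw->nohw", weight, x)


n_anchor = 9
hh, ww = 37, 50  # hypothetical feature map size
rng = np.random.default_rng(0)

h = rng.normal(size=(1, 512, hh, ww))            # stands in for the 3*3 conv output
score_weight = rng.normal(size=(n_anchor * 2, 512))
loc_weight = rng.normal(size=(n_anchor * 4, 512))

rpn_scores = conv1x1(h, score_weight)  # (1, 18, hh, ww)
rpn_locs = conv1x1(h, loc_weight)      # (1, 36, hh, ww)

# Move channels last and flatten: one row of 4 values per anchor.
rpn_locs_flat = rpn_locs.transpose(0, 2, 3, 1).reshape(1, -1, 4)
```

After flattening, the location branch has one row per anchor, `hh * ww * 9` rows in total, which is the layout the later anchor-based steps work with.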

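2) AnchorTargetCreator in the RPN:

AnchorTargetCreator selects 256 of the 20000+ candidate anchors for binary classification and computes location regression targets for the anchors. It provides the ground-truth values for the predictions above. The selection works as follows:

For each ground truth bounding box (gt_bbox), the anchor with the highest overlap (IoU) with it is selected as a positive sample.
Among the remaining anchors, those whose IoU with any gt_bbox exceeds 0.7 are also selected as positive samples. The number of positive samples is capped at 128.
Anchors whose IoU with the gt_bboxes is below 0.3 are randomly selected as negative samples. Negative and positive samples total 256.

Each selected anchor has a gt_label of either 1 (foreground) or 0 (background), which makes this a binary classification problem. The regression loss is computed only for positive (foreground) samples; negative samples contribute no location loss.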
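The three labeling rules can be sketched on a precomputed IoU matrix. This is a simplified illustration with a hypothetical function name `label_anchors`; it omits the random subsampling down to 256 anchors and uses -1 for anchors that are ignored.

```python
import numpy as np


def label_anchors(ious, pos_thresh=0.7, neg_thresh=0.3):
    """Label anchors from an (n_anchor, n_gt) IoU matrix.

    Returns 1 (positive), 0 (negative) or -1 (ignored) for each anchor.
    """
    max_ious = ious.max(axis=1)                    # best IoU of each anchor
    label = np.full(ious.shape[0], -1, dtype=np.int32)
    label[max_ious < neg_thresh] = 0               # negatives first
    gt_max = ious.max(axis=0)                      # best IoU for each gt box
    label[np.where(ious == gt_max)[0]] = 1         # best anchor for each gt
    label[max_ious >= pos_thresh] = 1              # anchors above the threshold
    return label


# 4 anchors, 2 ground-truth boxes (made-up IoU values).
ious = np.array([[0.80, 0.10],
                 [0.50, 0.40],
                 [0.20, 0.60],
                 [0.10, 0.05]])
labels = label_anchors(ious)
```

Anchor 2 becomes positive even though its IoU of 0.6 is below 0.7, because it is the best match for the second ground-truth box. Anchor 1 falls between the two thresholds and is ignored.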

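3) ProposalCreator in the RPN:

While the RPN trains itself using AnchorTargetCreator, it also provides RoIs (regions of interest) to Fast RCNN (RoIHead) as training samples. The RPN generates RoIs (ProposalCreator) as follows:

For each image, use its feature map to compute the foreground probability and the corresponding location parameters of the (H/16) × (W/16) × 9 (roughly 20000) anchors.
Select the 12000 anchors with the highest probabilities.
Use the regressed location parameters to refine the positions of these 12000 anchors, giving the RoIs.
Apply non-maximum suppression (NMS) and keep the 2000 RoIs with the highest probabilities.

Note: at inference time, to speed up processing, 12000 and 2000 become 6000 and 300 respectively.

Note: this part does not need backpropagation, so it can be implemented with numpy/tensor operations.

Output of the RPN: RoIs (a tensor of shape 2000×4 or 300×4)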
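The selection steps above can be sketched with a plain NumPy greedy NMS. This is an illustration under assumptions, not the repository's implementation: the function names `nms` and `propose` are hypothetical, and the boxes are assumed to be already decoded from the anchors and the predicted location parameters.

```python
import numpy as np


def nms(boxes, scores, thresh):
    """Greedy NMS on (N, 4) boxes given as (ymin, xmin, ymax, xmax).

    Returns the kept indices, highest score first.
    """
    order = scores.argsort()[::-1]
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        tl = np.maximum(boxes[i, :2], boxes[rest, :2])
        br = np.minimum(boxes[i, 2:], boxes[rest, 2:])
        inter = np.prod(np.clip(br - tl, 0, None), axis=1)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= thresh]  # drop boxes overlapping box i too much
    return np.array(keep, dtype=np.int64)


def propose(boxes, scores, n_pre_nms, n_post_nms, nms_thresh=0.7):
    """Keep the top n_pre_nms boxes by score, run NMS, keep n_post_nms."""
    order = scores.argsort()[::-1][:n_pre_nms]
    boxes, scores = boxes[order], scores[order]
    keep = nms(boxes, scores, nms_thresh)[:n_post_nms]
    return boxes[keep]


boxes = np.array([[0., 0., 10., 10.],
                  [0., 1., 10., 11.],    # IoU with the first box is 90/110
                  [20., 20., 30., 30.],
                  [0., 0., 9., 9.]])
scores = np.array([0.9, 0.8, 0.7, 0.1])
rois = propose(boxes, scores, n_pre_nms=3, n_post_nms=2)
```

In training the two limits would be 12000 and 2000, and at inference 6000 and 300, as described in the text.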

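3. From the RPN to the RoIHead

ProposalTargetCreator:

ProposalTargetCreator is the transition between the RPN and the RoIHead. As described above, the RPN produces about 2000 RoIs. Not all of them are used for training; ProposalTargetCreator selects 128 RoIs for training according to the following rules:

RoIs whose IoU with the gt_bboxes is at least 0.5: select some (e.g. 32) as positive samples
RoIs whose IoU with the gt_bboxes is below 0.5 and at least a lower bound of 0 (or 0.1 in py-faster-rcnn): select some (e.g. 128-32=96) as negative samples

To make training easier, the gt_roi_loc of the 128 selected RoIs is also normalized (subtract the mean, divide by the standard deviation).

The classification loss is plain cross-entropy. The location regression loss again uses Smooth_L1Loss, but it is computed only for positive samples, and only on the 4 parameters that belong to the sample's own class. For example:

After the FC layer with 84 outputs, each RoI gets an 84-dimensional loc vector (4 values for each of 21 classes). If the RoI is a negative sample, none of these 84 values take part in the Smooth L1 loss.
If the RoI is a positive sample with label K, only elements K×4, K×4+1, K×4+2 and K×4+3 take part in the loss; the rest do not.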
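Picking the 4 class-specific values out of the 84-dimensional loc vector can be sketched with NumPy indexing. The array names are chosen for illustration and the values are random; only the indexing pattern matters.

```python
import numpy as np

n_class = 21   # 84 / 4
n_sample = 3
rng = np.random.default_rng(0)

# Stand-in for the FC output: one 84-dimensional loc vector per RoI.
roi_cls_loc = rng.normal(size=(n_sample, n_class * 4))
gt_roi_label = np.array([0, 5, 20])  # 0 is background (negative sample)

# View as (S, 21, 4) and pick row K of each RoI, i.e. elements K*4 .. K*4+3.
roi_loc = roi_cls_loc.reshape(n_sample, n_class, 4)[
    np.arange(n_sample), gt_roi_label]

# Only positive samples contribute to the location loss.
pos_mask = gt_roi_label > 0
```

For the RoI with label 5, `roi_loc` holds elements 20 to 23 of its 84-dimensional vector, and `pos_mask` excludes the background RoI from the regression loss.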

II. Code analysis

1. bbox_tools.py

Operations for generating and refining bounding boxes

import numpy as np
import numpy as xp

import six


def loc2bbox(src_bbox, loc):
    """Decode bounding boxes from bounding box offsets and scales.

    Given bounding box offsets and scales computed by
    :meth:`bbox2loc`, this function decodes the representation to
    coordinates in 2D image coordinates.

    Given scales and offsets :math:`t_y, t_x, t_h, t_w` and a bounding
    box whose center is :math:`(y, x) = p_y, p_x` and size :math:`p_h, p_w`,
    the decoded bounding box's center :math:`\\hat{g}_y`, :math:`\\hat{g}_x`
    and size :math:`\\hat{g}_h`, :math:`\\hat{g}_w` are calculated
    by the following formulas.

    * :math:`\\hat{g}_y = p_h t_y + p_y`
    * :math:`\\hat{g}_x = p_w t_x + p_x`
    * :math:`\\hat{g}_h = p_h \\exp(t_h)`
    * :math:`\\hat{g}_w = p_w \\exp(t_w)`

    The decoding formulas are used in works such as R-CNN [#]_.

    The output is same type as the type of the inputs.

    .. [#] Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik. \
    Rich feature hierarchies for accurate object detection and semantic \
    segmentation. CVPR 2014.

    Args:
        src_bbox (array): A coordinates of bounding boxes.
            Its shape is :math:`(R, 4)`. These coordinates are
            :math:`p_{ymin}, p_{xmin}, p_{ymax}, p_{xmax}`.
        loc (array): An array with offsets and scales.
            The shapes of :obj:`src_bbox` and :obj:`loc` should be same.
            This contains values :math:`t_y, t_x, t_h, t_w`.

    Returns:
        array:
        Decoded bounding box coordinates. Its shape is :math:`(R, 4)`. \
        The second axis contains four values \
        :math:`\\hat{g}_{ymin}, \\hat{g}_{xmin},
        \\hat{g}_{ymax}, \\hat{g}_{xmax}`.

    """

    if src_bbox.shape[0] == 0:
        return xp.zeros((0, 4), dtype=loc.dtype)

    src_bbox = src_bbox.astype(src_bbox.dtype, copy=False)

    src_height = src_bbox[:, 2] - src_bbox[:, 0]
    src_width = src_bbox[:, 3] - src_bbox[:, 1]
    src_ctr_y = src_bbox[:, 0] + 0.5 * src_height
    src_ctr_x = src_bbox[:, 1] + 0.5 * src_width

    dy = loc[:, 0::4]
    dx = loc[:, 1::4]
    dh = loc[:, 2::4]
    dw = loc[:, 3::4]

    ctr_y = dy * src_height[:, xp.newaxis] + src_ctr_y[:, xp.newaxis]
    ctr_x = dx * src_width[:, xp.newaxis] + src_ctr_x[:, xp.newaxis]
    h = xp.exp(dh) * src_height[:, xp.newaxis]
    w = xp.exp(dw) * src_width[:, xp.newaxis]

    dst_bbox = xp.zeros(loc.shape, dtype=loc.dtype)
    dst_bbox[:, 0::4] = ctr_y - 0.5 * h
    dst_bbox[:, 1::4] = ctr_x - 0.5 * w
    dst_bbox[:, 2::4] = ctr_y + 0.5 * h
    dst_bbox[:, 3::4] = ctr_x + 0.5 * w

    return dst_bbox


def bbox2loc(src_bbox, dst_bbox):
    """Encodes the source and the destination bounding boxes to "loc".

    Given bounding boxes, this function computes offsets and scales
    to match the source bounding boxes to the target bounding boxes.
    Mathematically, given a bounding box whose center is
    :math:`(y, x) = p_y, p_x` and
    size :math:`p_h, p_w` and the target bounding box whose center is
    :math:`g_y, g_x` and size :math:`g_h, g_w`, the offsets and scales
    :math:`t_y, t_x, t_h, t_w` can be computed by the following formulas.

    * :math:`t_y = \\frac{(g_y - p_y)} {p_h}`
    * :math:`t_x = \\frac{(g_x - p_x)} {p_w}`
    * :math:`t_h = \\log(\\frac{g_h} {p_h})`
    * :math:`t_w = \\log(\\frac{g_w} {p_w})`

    The output is same type as the type of the inputs.
    The encoding formulas are used in works such as R-CNN [#]_.

    .. [#] Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik. \
    Rich feature hierarchies for accurate object detection and semantic \
    segmentation. CVPR 2014.

    Args:
        src_bbox (array): An image coordinate array whose shape is
            :math:`(R, 4)`. :math:`R` is the number of bounding boxes.
            These coordinates are
            :math:`p_{ymin}, p_{xmin}, p_{ymax}, p_{xmax}`.
        dst_bbox (array): An image coordinate array whose shape is
            :math:`(R, 4)`.
            These coordinates are
            :math:`g_{ymin}, g_{xmin}, g_{ymax}, g_{xmax}`.

    Returns:
        array:
        Bounding box offsets and scales from :obj:`src_bbox` \
        to :obj:`dst_bbox`. \
        This has shape :math:`(R, 4)`.
        The second axis contains four values :math:`t_y, t_x, t_h, t_w`.

    """

    height = src_bbox[:, 2] - src_bbox[:, 0]
    width = src_bbox[:, 3] - src_bbox[:, 1]
    ctr_y = src_bbox[:, 0] + 0.5 * height
    ctr_x = src_bbox[:, 1] + 0.5 * width

    base_height = dst_bbox[:, 2] - dst_bbox[:, 0]
    base_width = dst_bbox[:, 3] - dst_bbox[:, 1]
    base_ctr_y = dst_bbox[:, 0] + 0.5 * base_height
    base_ctr_x = dst_bbox[:, 1] + 0.5 * base_width

    eps = xp.finfo(height.dtype).eps
    height = xp.maximum(height, eps)
    width = xp.maximum(width, eps)

    dy = (base_ctr_y - ctr_y) / height
    dx = (base_ctr_x - ctr_x) / width
    dh = xp.log(base_height / height)
    dw = xp.log(base_width / width)

    loc = xp.vstack((dy, dx, dh, dw)).transpose()
    return loc


def bbox_iou(bbox_a, bbox_b):
    """Calculate the Intersection over Union (IoU) between bounding boxes.

    IoU is calculated as a ratio of area of the intersection
    and area of the union.

    This function accepts both :obj:`numpy.ndarray` and :obj:`cupy.ndarray` as
    inputs. Please note that both :obj:`bbox_a` and :obj:`bbox_b` need to be
    same type.
    The output is same type as the type of the inputs.

    Args:
        bbox_a (array): An array whose shape is :math:`(N, 4)`.
            :math:`N` is the number of bounding boxes.
            The dtype should be :obj:`numpy.float32`.
        bbox_b (array): An array similar to :obj:`bbox_a`,
            whose shape is :math:`(K, 4)`.
            The dtype should be :obj:`numpy.float32`.

    Returns:
        array:
        An array whose shape is :math:`(N, K)`. \
        An element at index :math:`(n, k)` contains IoUs between \
        :math:`n` th bounding box in :obj:`bbox_a` and :math:`k` th bounding \
        box in :obj:`bbox_b`.

    """
    if bbox_a.shape[1] != 4 or bbox_b.shape[1] != 4:
        raise IndexError

    # top left
    tl = xp.maximum(bbox_a[:, None, :2], bbox_b[:, :2])
    # bottom right
    br = xp.minimum(bbox_a[:, None, 2:], bbox_b[:, 2:])

    area_i = xp.prod(br - tl, axis=2) * (tl < br).all(axis=2)
    area_a = xp.prod(bbox_a[:, 2:] - bbox_a[:, :2], axis=1)
    area_b = xp.prod(bbox_b[:, 2:] - bbox_b[:, :2], axis=1)
    return area_i / (area_a[:, None] + area_b - area_i)


def __test():
    pass


if __name__ == '__main__':
    __test()


def generate_anchor_base(base_size=16, ratios=[0.5, 1, 2],
                         anchor_scales=[8, 16, 32]):
    """Generate anchor base windows by enumerating aspect ratio and scales.

    Generate anchors that are scaled and modified to the given aspect ratios.
    Area of a scaled anchor is preserved when modifying to the given aspect
    ratio.

    :obj:`R = len(ratios) * len(anchor_scales)` anchors are generated by this
    function.
    The :obj:`i * len(anchor_scales) + j` th anchor corresponds to an anchor
    generated by :obj:`ratios[i]` and :obj:`anchor_scales[j]`.

    For example, if the scale is :math:`8` and the ratio is :math:`0.25`,
    the width and the height of the base window will be stretched by :math:`8`.
    For modifying the anchor to the given aspect ratio,
    the height is halved and the width is doubled.

    Args:
        base_size (number): The width and the height of the reference window.
        ratios (list of floats): This is ratios of height to width of
            the anchors.
        anchor_scales (list of numbers): This is areas of anchors.
            Those areas will be the product of the square of an element in
            :obj:`anchor_scales` and the original area of the reference
            window.

    Returns:
        ~numpy.ndarray:
        An array of shape :math:`(R, 4)`.
        Each element is a set of coordinates of a bounding box.
        The second axis corresponds to
        :math:`(y_{min}, x_{min}, y_{max}, x_{max})` of a bounding box.

    """
    py = base_size / 2.
    px = base_size / 2.

    anchor_base = np.zeros((len(ratios) * len(anchor_scales), 4),
                           dtype=np.float32)
    for i in six.moves.range(len(ratios)):
        for j in six.moves.range(len(anchor_scales)):
            h = base_size * anchor_scales[j] * np.sqrt(ratios[i])
            w = base_size * anchor_scales[j] * np.sqrt(1. / ratios[i])

            index = i * len(anchor_scales) + j
            anchor_base[index, 0] = py - h / 2.
            anchor_base[index, 1] = px - w / 2.
            anchor_base[index, 2] = py + h / 2.
            anchor_base[index, 3] = px + w / 2.
    return anchor_base


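bbox2loc takes the source bbox and the target bbox and outputs the parameter group, i.e. the offsets and scales of the target bbox relative to the source bbox. It implements formula 3) above (the regression targets t). Note the coordinate conversion (corner coordinates to center, width and height).

loc2bbox takes the source bbox and the parameter group and outputs the target bbox. It is exactly the inverse of the above and implements formula 2) above. If bbox2loc is called encoding, then loc2bbox is decoding.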
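The encode/decode relationship can be checked on a single box with a compact re-statement of the two formulas. The names `to_center`, `encode` and `decode` are hypothetical stand-ins for illustration, not the functions listed above, and the eps safeguard is left out.

```python
import numpy as np


def to_center(b):
    """(ymin, xmin, ymax, xmax) -> center y, center x, height, width."""
    h = b[:, 2] - b[:, 0]
    w = b[:, 3] - b[:, 1]
    return b[:, 0] + 0.5 * h, b[:, 1] + 0.5 * w, h, w


def encode(src, dst):
    """Offsets and scales (t_y, t_x, t_h, t_w) that map src onto dst."""
    py, px, ph, pw = to_center(src)
    gy, gx, gh, gw = to_center(dst)
    return np.stack([(gy - py) / ph, (gx - px) / pw,
                     np.log(gh / ph), np.log(gw / pw)], axis=1)


def decode(src, loc):
    """Apply offsets and scales to src and return corner coordinates."""
    py, px, ph, pw = to_center(src)
    cy = loc[:, 0] * ph + py
    cx = loc[:, 1] * pw + px
    h = np.exp(loc[:, 2]) * ph
    w = np.exp(loc[:, 3]) * pw
    return np.stack([cy - 0.5 * h, cx - 0.5 * w,
                     cy + 0.5 * h, cx + 0.5 * w], axis=1)


src = np.array([[0., 0., 10., 10.]])   # center (5, 5), size 10 x 10
dst = np.array([[2., 4., 12., 24.]])   # center (7, 14), size 10 x 20
loc = encode(src, dst)
restored = decode(src, loc)
```

Here the center moves by 0.2 of the source height and 0.9 of the source width, the height is unchanged and the width doubles, so `loc` is (0.2, 0.9, 0, log 2), and decoding it recovers `dst`.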

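bbox_iou computes the intersection over union (IoU): given two sets of bboxes of shapes (N, 4) and (K, 4), it outputs an array of shape (N, K) containing the IoU of every pair of boxes across the two sets.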
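A small worked example makes the (N, K) output concrete. The function below restates the same broadcasting computation under the hypothetical name `pairwise_iou` so that the snippet runs on its own; the boxes are made up.

```python
import numpy as np


def pairwise_iou(bbox_a, bbox_b):
    """IoU of every box in bbox_a (N, 4) with every box in bbox_b (K, 4)."""
    # Broadcasting (N, 1, 2) against (K, 2) gives (N, K, 2).
    tl = np.maximum(bbox_a[:, None, :2], bbox_b[:, :2])  # intersection top-left
    br = np.minimum(bbox_a[:, None, 2:], bbox_b[:, 2:])  # intersection bottom-right
    # Zero out pairs that do not overlap at all.
    area_i = np.prod(br - tl, axis=2) * (tl < br).all(axis=2)
    area_a = np.prod(bbox_a[:, 2:] - bbox_a[:, :2], axis=1)
    area_b = np.prod(bbox_b[:, 2:] - bbox_b[:, :2], axis=1)
    return area_i / (area_a[:, None] + area_b - area_i)


bbox_a = np.array([[0., 0., 10., 10.]])
bbox_b = np.array([[0., 0., 10., 10.],     # identical box
                   [5., 5., 15., 15.],     # overlaps in a 5 x 5 square
                   [20., 20., 30., 30.]])  # disjoint
iou = pairwise_iou(bbox_a, bbox_b)
```

The result has shape (1, 3): the identical box gives 1, the partially overlapping box gives 25 / (100 + 100 - 25) = 1/7, and the disjoint box gives 0.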

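generate_anchor_base generates the 9 base anchors. Why "base"? Because for every point on the feature map plane, 9 anchors are generated centered on that point. The figure below shows the ones for (0, 0):

The figure above follows the paper: the 9 anchors correspond to 3 scales (areas of 128², 256² and 512²) and 3 aspect ratios (1:1, 1:2 and 2:1). The shapes of the 9 anchors should be:

90.50967 * 181.01933 = 128²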
181.01933 * 362.03867 = 256²
362.03867 * 724.07733 = 512²
128.0 * 128.0 = 128²
256.0 * 256.0 = 256²
512.0 * 512.0 = 512²
181.01933 * 90.50967 = 128²
362.03867 * 181.01933 = 256²
724.07733 * 362.03867 = 512²

The function returns anchor_base, of shape 9*4, holding the top-left and bottom-right coordinates of the 9 anchors:

-37.2548 -82.5097 53.2548 98.5097
-82.5097 -173.019 98.5097 189.019
-173.019 -354.039 189.019 370.039
-56 -56 72 72
-120 -120 136 136
-248 -248 264 264
-82.5097 -37.2548 98.5097 53.2548
-173.019 -82.5097 189.019 98.5097
-354.039 -173.019 370.039 189.019
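The anchor shapes and the first row of coordinates can be reproduced with the same arithmetic as generate_anchor_base, using its default arguments. This is only a check of the numbers listed above.

```python
import numpy as np

base_size = 16
ratios = [0.5, 1, 2]
anchor_scales = [8, 16, 32]

shapes = []
for ratio in ratios:
    for scale in anchor_scales:
        # Scaling by sqrt(ratio) and sqrt(1 / ratio) keeps the area fixed.
        h = base_size * scale * np.sqrt(ratio)
        w = base_size * scale * np.sqrt(1.0 / ratio)
        shapes.append((h, w))
shapes = np.array(shapes)          # (9, 2) heights and widths
areas = shapes[:, 0] * shapes[:, 1]

# Corner coordinates of the first anchor around the center (8, 8).
center = base_size / 2.0
h0, w0 = shapes[0]
first_anchor = np.array([center - h0 / 2, center - w0 / 2,
                         center + h0 / 2, center + w0 / 2])
```

Each group of three anchors has areas 128², 256² and 512², and `first_anchor` matches the first row of the table above. The center is (8, 8) because `py = px = base_size / 2`.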

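This raises a question: the above only produces the anchors for the top-left point (0, 0) of the feature map (with base_size=16, the values above are centered at (8, 8) in image coordinates). How do we get the anchors centered on every point of the feature map?

A function in model / region_proposal_network implements this: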

self.anchor_base = generate_anchor_base(anchor_scales=anchor_scales, ratios=ratios) # first generate the 9 base anchors described above (for the top-left cell)

...

n, _, hh, ww = x.shape                    # x is the feature map; n is batch_size (1 in this version of the code); hh, ww are its height and width
anchor = _enumerate_shifted_anchor(       # call the function below
            np.array(self.anchor_base),
            self.feat_stride, hh, ww)     # feat_stride=16: the features come after 4 poolings, so the feature map is 16x smaller than the input image

...

def _enumerate_shifted_anchor(anchor_base, feat_stride, height, width):          # build all anchors of the feature map from the base anchors
    # Enumerate all shifted anchors:                                             # anchor_base: (9, 4) coordinates, so A=9 here
    #
    # add A anchors (1, A, 4) to
    # cell K shifts (K, 1, 4) to get
    # shift anchors (K, A, 4)
    # reshape to (K*A, 4) shifted anchors
    # return (K*A, 4)

    # !TODO: add support for torch.CudaTensor
    # xp = cuda.get_array_module(anchor_base)
    # it seems that it can't be boosted using GPU
    import numpy as xp
    shift_y = xp.arange(0, height * feat_stride, feat_stride)           # vertical shifts (0, 16, 32, ...)
    shift_x = xp.arange(0, width * feat_stride, feat_stride)            # horizontal shifts (0, 16, 32, ...)
    shift_x, shift_y = xp.meshgrid(shift_x, shift_y)
    shift = xp.stack((shift_y.ravel(), shift_x.ravel(),
                      shift_y.ravel(), shift_x.ravel()), axis=1)

    A = anchor_base.shape[0]    # 9
    K = shift.shape[0]          # K = hh*ww, so K*A is about 20000
    anchor = anchor_base.reshape((1, A, 4)) + \
             shift.reshape((1, K, 4)).transpose((1, 0, 2))
    anchor = anchor.reshape((K * A, 4)).astype(np.float32)
    return anchor     # shape (K*A, 4): the coordinates of all anchors

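Analysis of the code above: _enumerate_shifted_anchor first generates the horizontal and vertical shifts. Each point of the feature map has to be scaled by 16 to map it back to the original image:

Figure: anchor centers in the original image, spaced 16 pixels apart (image source: 机器之心)

Once we have the shifts, we add each shift to the coordinates of the base anchors, which gives the top-left and bottom-right coordinates of all anchors. Each image yields about hh*ww*9 ≈ 20000 anchors!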
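The shift-and-add step can be followed on a toy case: 2 made-up base anchors on a 2×3 feature map. The function below restates the broadcasting in a slightly simpler form under the name `enumerate_shifted_anchor`; it is a sketch for illustration, not a drop-in replacement for the repository function.

```python
import numpy as np


def enumerate_shifted_anchor(anchor_base, feat_stride, height, width):
    """Shift A base anchors to every cell of a height x width feature map."""
    shift_y = np.arange(0, height * feat_stride, feat_stride)
    shift_x = np.arange(0, width * feat_stride, feat_stride)
    shift_x, shift_y = np.meshgrid(shift_x, shift_y)
    shift = np.stack((shift_y.ravel(), shift_x.ravel(),
                      shift_y.ravel(), shift_x.ravel()), axis=1)  # (K, 4)
    # (K, 1, 4) + (1, A, 4) broadcasts to (K, A, 4).
    anchor = shift[:, None, :] + anchor_base[None, :, :]
    return anchor.reshape(-1, 4).astype(np.float32)


# A = 2 toy base anchors (made-up values), K = 2 * 3 = 6 cells.
anchor_base = np.array([[-8., -8., 24., 24.],
                        [0., -8., 16., 24.]], dtype=np.float32)
anchors = enumerate_shifted_anchor(anchor_base, feat_stride=16,
                                   height=2, width=3)
```

The result has K*A = 12 rows. Rows 0 and 1 are the base anchors themselves, rows 2 and 3 are the same anchors shifted 16 pixels in x, and rows 6 and 7 are shifted 16 pixels in y. With a hypothetical 37×50 feature map and 9 base anchors the same computation gives 37*50*9 = 16650 anchors.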

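Figure: left, the anchors; center, a single feature-map anchor expressed in the original image; right, all anchors expressed in the original image (image source: 机器之心)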

2. creator_tool.py

import numpy as np
import cupy as cp

from model.utils.bbox_tools import bbox2loc, bbox_iou, loc2bbox
from model.utils.nms import non_maximum_suppression


class ProposalTargetCreator(object):
    """Assign ground truth bounding boxes to given RoIs.

    The :meth:`__call__` of this class generates training targets
    for each object proposal.
    This is used to train Faster RCNN [#]_.

    .. [#] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. \
    Faster R-CNN: Towards Real-Time Object Detection with \
    Region Proposal Networks. NIPS 2015.

    Args:
        n_sample (int): The number of sampled regions.
        pos_ratio (float): Fraction of regions that is labeled as a
            foreground.
        pos_iou_thresh (float): IoU threshold for a RoI to be considered as a
            foreground.
        neg_iou_thresh_hi (float): RoI is considered to be the background
            if IoU is in
            [:obj:`neg_iou_thresh_lo`, :obj:`neg_iou_thresh_hi`).
        neg_iou_thresh_lo (float): See above.

    """

    def __init__(self,
                 n_sample=128,
                 pos_ratio=0.25, pos_iou_thresh=0.5,
                 neg_iou_thresh_hi=0.5, neg_iou_thresh_lo=0.0
                 ):
        self.n_sample = n_sample
        self.pos_ratio = pos_ratio
        self.pos_iou_thresh = pos_iou_thresh
        self.neg_iou_thresh_hi = neg_iou_thresh_hi
        self.neg_iou_thresh_lo = neg_iou_thresh_lo  # NOTE: the default value in py-faster-rcnn is 0.1

    def __call__(self, roi, bbox, label,
                 loc_normalize_mean=(0., 0., 0., 0.),
                 loc_normalize_std=(0.1, 0.1, 0.2, 0.2)):
        """Assigns ground truth to sampled proposals.

        This function samples total of :obj:`self.n_sample` RoIs
        from the combination of :obj:`roi` and :obj:`bbox`.
        The RoIs are assigned with the ground truth class labels as well as
        bounding box offsets and scales to match the ground truth bounding
        boxes. As many as :obj:`pos_ratio * self.n_sample` RoIs are
        sampled as foregrounds.

        Offsets and scales of bounding boxes are calculated using
        :func:`model.utils.bbox_tools.bbox2loc`.
        Also, types of input arrays and output arrays are same.

        Here are notations.

        * :math:`S` is the total number of sampled RoIs, which equals \
            :obj:`self.n_sample`.
        * :math:`L` is number of object classes possibly including the \
            background.

        Args:
            roi (array): Region of Interests (RoIs) from which we sample.
                Its shape is :math:`(R, 4)`
            bbox (array): The coordinates of ground truth bounding boxes.
                Its shape is :math:`(R', 4)`.
            label (array): Ground truth bounding box labels. Its shape
                is :math:`(R',)`. Its range is :math:`[0, L - 1]`, where
                :math:`L` is the number of foreground classes.
            loc_normalize_mean (tuple of four floats): Mean values to normalize
                coordinates of bounding boxes.
            loc_normalize_std (tuple of four floats): Standard deviation of
                the coordinates of bounding boxes.

        Returns:
            (array, array, array):

            * **sample_roi**: Regions of interests that are sampled. \
                Its shape is :math:`(S, 4)`.
            * **gt_roi_loc**: Offsets and scales to match \
                the sampled RoIs to the ground truth bounding boxes. \
                Its shape is :math:`(S, 4)`.
            * **gt_roi_label**: Labels assigned to sampled RoIs. Its shape is \
                :math:`(S,)`. Its range is :math:`[0, L]`. The label with \
                value 0 is the background.

        """
        n_bbox, _ = bbox.shape

        roi = np.concatenate((roi, bbox), axis=0)

        pos_roi_per_image = np.round(self.n_sample * self.pos_ratio)
        iou = bbox_iou(roi, bbox)
        gt_assignment = iou.argmax(axis=1)
        max_iou = iou.max(axis=1)
        # Offset range of classes from [0, n_fg_class - 1] to [1, n_fg_class].
        # The label with value 0 is the background.
        gt_roi_label = label[gt_assignment] + 1

        # Select foreground RoIs as those with >= pos_iou_thresh IoU.
        pos_index = np.where(max_iou >= self.pos_iou_thresh)[0]
        pos_roi_per_this_image = int(min(pos_roi_per_image, pos_index.size))
        if pos_index.size > 0:
            pos_index = np.random.choice(
                pos_index, size=pos_roi_per_this_image, replace=False)

        # Select background RoIs as those within
        # [neg_iou_thresh_lo, neg_iou_thresh_hi).
        neg_index = np.where((max_iou < self.neg_iou_thresh_hi) &
                             (max_iou >= self.neg_iou_thresh_lo))[0]
        neg_roi_per_this_image = self.n_sample - pos_roi_per_this_image
        neg_roi_per_this_image = int(min(neg_roi_per_this_image,
                                         neg_index.size))
        if neg_index.size > 0:
            neg_index = np.random.choice(
                neg_index, size=neg_roi_per_this_image, replace=False)

        # The indices that we're selecting (both positive and negative).
        keep_index = np.append(pos_index, neg_index)
        gt_roi_label = gt_roi_label[keep_index]
        gt_roi_label[pos_roi_per_this_image:] = 0  # negative labels --> 0
        sample_roi = roi[keep_index]

        # Compute offsets and scales to match sampled RoIs to the GTs.
        gt_roi_loc = bbox2loc(sample_roi, bbox[gt_assignment[keep_index]])
        gt_roi_loc = ((gt_roi_loc - np.array(loc_normalize_mean, np.float32)
                       ) / np.array(loc_normalize_std, np.float32))

        return sample_roi, gt_roi_loc, gt_roi_label


class AnchorTargetCreator(object):
    """Assign the ground truth bounding boxes to anchors.

    Assigns the ground truth bounding boxes to anchors for training Region
    Proposal Networks introduced in Faster R-CNN [#]_.

    Offsets and scales to match anchors to the ground truth are
    calculated using the encoding scheme of
    :func:`model.utils.bbox_tools.bbox2loc`.

    .. [#] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. \
    Faster R-CNN: Towards Real-Time Object Detection with \
    Region Proposal Networks. NIPS 2015.

    Args:
        n_sample (int): The number of regions to produce.
        pos_iou_thresh (float): Anchors with IoU above this
            threshold will be assigned as positive.
        neg_iou_thresh (float): Anchors with IoU below this
            threshold will be assigned as negative.
        pos_ratio (float): Ratio of positive regions in the
            sampled regions.

    """

    def __init__(self,
                 n_sample=256,
                 pos_iou_thresh=0.7, neg_iou_thresh=0.3,
                 pos_ratio=0.5):
        self.n_sample = n_sample
        self.pos_iou_thresh = pos_iou_thresh
        self.neg_iou_thresh = neg_iou_thresh
        self.pos_ratio = pos_ratio

    def __call__(self, bbox, anchor, img_size):
        """Assign ground truth supervision to sampled subset of anchors.

        Types of input arrays and output arrays are same.

        Here are notations.

        * :math:`S` is the number of anchors.
        * :math:`R` is the number of bounding boxes.

        Args:
            bbox (array): Coordinates of bounding boxes. Its shape is
                :math:`(R, 4)`.
            anchor (array): Coordinates of anchors. Its shape is
                :math:`(S, 4)`.
            img_size (tuple of ints): A tuple :obj:`H, W`, which
                is a tuple of height and width of an image.

        Returns:
            (array, array):

            #NOTE: it's scale not only  offset
            * **loc**: Offsets and scales to match the anchors to \
                the ground truth bounding boxes. Its shape is :math:`(S, 4)`.
            * **label**: Labels of anchors with values \
                :obj:`(1=positive, 0=negative, -1=ignore)`. Its shape \
                is :math:`(S,)`.

        """

        img_H, img_W = img_size

        n_anchor = len(anchor)
        inside_index = _get_inside_index(anchor, img_H, img_W)
        anchor = anchor[inside_index]
        argmax_ious, label = self._create_label(
            inside_index, anchor, bbox)

        # compute bounding box regression targets
        loc = bbox2loc(anchor, bbox[argmax_ious])

        # map up to original set of anchors
        label = _unmap(label, n_anchor, inside_index, fill=-1)
        loc = _unmap(loc, n_anchor, inside_index, fill=0)

        return loc, label

    def _create_label(self, inside_index, anchor, bbox):
        # label: 1 is positive, 0 is negative, -1 is dont care
        label = np.empty((len(inside_index),), dtype=np.int32)
        label.fill(-1)

        argmax_ious, max_ious, gt_argmax_ious = \
            self._calc_ious(anchor, bbox, inside_index)

        # assign negative labels first so that positive labels can clobber them
        label[max_ious < self.neg_iou_thresh] = 0

        # positive label: for each gt, anchor with highest iou
        label[gt_argmax_ious] = 1

        # positive label: above threshold IOU
        label[max_ious >= self.pos_iou_thresh] = 1

        # subsample positive labels if we have too many
        n_pos = int(self.pos_ratio * self.n_sample)
        pos_index = np.where(label == 1)[0]
        if len(pos_index) > n_pos:
            disable_index = np.random.choice(
                pos_index, size=(len(pos_index) - n_pos), replace=False)
            label[disable_index] = -1

        # subsample negative labels if we have too many
        n_neg = self.n_sample - np.sum(label == 1)
        neg_index = np.where(label == 0)[0]
        if len(neg_index) > n_neg:
            disable_index = np.random.choice(
                neg_index, size=(len(neg_index) - n_neg), replace=False)
            label[disable_index] = -1

        return argmax_ious, label

    def _calc_ious(self, anchor, bbox, inside_index):
        # ious between the anchors and the gt boxes
        ious = bbox_iou(anchor, bbox)
        argmax_ious = ious.argmax(axis=1)
        max_ious = ious[np.arange(len(inside_index)), argmax_ious]
        gt_argmax_ious = ious.argmax(axis=0)
        gt_max_ious = ious[gt_argmax_ious, np.arange(ious.shape[1])]
        gt_argmax_ious = np.where(ious == gt_max_ious)[0]

        return argmax_ious, max_ious, gt_argmax_ious


def _unmap(data, count, index, fill=0):
    # Unmap a subset of item (data) back to the original set of items (of
    # size count)

    if len(data.shape) == 1:
        ret = np.empty((count,), dtype=data.dtype)
        ret.fill(fill)
        ret[index] = data
    else:
        ret = np.empty((count,) + data.shape[1:], dtype=data.dtype)
        ret.fill(fill)
        ret[index, :] = data
    return ret


def _get_inside_index(anchor, H, W):
    # Calc indices of anchors which are located completely inside of the image
    # whose size is specified.
    index_inside = np.where(
        (anchor[:, 0] >= 0) &
        (anchor[:, 1] >= 0) &
        (anchor[:, 2] <= H) &
        (anchor[:, 3] <= W)
    )[0]
    return index_inside


class ProposalCreator:
    # unNOTE: I'll make it undifferential
    # unTODO: make sure it's ok
    # It's ok
    """Proposal regions are generated by calling this object.

    The :meth:`__call__` of this object outputs object detection proposals by
    applying estimated bounding box offsets
    to a set of anchors.

    This class takes parameters to control number of bounding boxes to
    pass to NMS and keep after NMS.
    If the parameters are negative, it uses all the bounding boxes supplied
    or keep all the bounding boxes returned by NMS.

    This class is used for Region Proposal Networks introduced in
    Faster R-CNN [#]_.

    .. [#] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. \
    Faster R-CNN: Towards Real-Time Object Detection with \
    Region Proposal Networks. NIPS 2015.

    Args:
        nms_thresh (float): Threshold value used when calling NMS.
        n_train_pre_nms (int): Number of top scored bounding boxes
            to keep before passing to NMS in train mode.
        n_train_post_nms (int): Number of top scored bounding boxes
            to keep after passing to NMS in train mode.
        n_test_pre_nms (int): Number of top scored bounding boxes
            to keep before passing to NMS in test mode.
        n_test_post_nms (int): Number of top scored bounding boxes
            to keep after passing to NMS in test mode.
        force_cpu_nms (bool): If this is :obj:`True`,
            always use NMS in CPU mode. If :obj:`False`,
            the NMS mode is selected based on the type of inputs.
        min_size (int): A parameter to determine the threshold on
            discarding bounding boxes based on their sizes.

    """

    def __init__(self,
                 parent_model,
                 nms_thresh=0.7,
                 n_train_pre_nms=12000,
                 n_train_post_nms=2000,
                 n_test_pre_nms=6000,
                 n_test_post_nms=300,
                 min_size=16
                 ):
        self.parent_model = parent_model
        self.nms_thresh = nms_thresh
        self.n_train_pre_nms = n_train_pre_nms
        self.n_train_post_nms = n_train_post_nms
        self.n_test_pre_nms = n_test_pre_nms
        self.n_test_post_nms = n_test_post_nms
        self.min_size = min_size

    def __call__(self, loc, score,
                 anchor, img_size, scale=1.):
        """input should  be ndarray
        Propose RoIs.

        Inputs :obj:`loc, score, anchor` refer to the same anchor when indexed
        by the same index.

        On notations, :math:`R` is the total number of anchors. This is equal
        to product of the height and the width of an image and the number of
        anchor bases per pixel.

        Type of the output is same as the inputs.

        Args:
            loc (array): Predicted offsets and scaling to anchors.
                Its shape is :math:`(R, 4)`.
            score (array): Predicted foreground probability for anchors.
                Its shape is :math:`(R,)`.
            anchor (array): Coordinates of anchors. Its shape is
                :math:`(R, 4)`.
            img_size (tuple of ints): A tuple :obj:`height, width`,
                which contains image size after scaling.
            scale (float): The scaling factor used to scale an image after
                reading it from a file.

        Returns:
            array:
            An array of coordinates of proposal boxes.
            Its shape is :math