OverFeat(译)

Posted leebxo


OverFeat:用卷积网络同时进行图像识别、定位和检测

作者:P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun.

发表:ICLR, 2014

原文:Overfeat: Integrated recognition, localization and detection using convolutional networks. 

Abstract 摘要
We present an integrated framework for using Convolutional Networks for classification, localization and detection. We show how a multiscale and sliding window
approach can be efficiently implemented within a ConvNet. We also introduce a novel deep learning approach to localization by learning to predict object boundaries. Bounding boxes are then accumulated rather than suppressed in order to increase detection confidence. We show that different tasks can be learned simultaneously using a single shared network. This integrated framework is the winner of the localization task of the ImageNet Large Scale Visual Recognition Challenge 2013 (ILSVRC2013) and obtained very competitive results for the detection and classifications tasks. In post-competition work, we establish a new state of the art for the detection task. Finally, we release a feature extractor from our best model called OverFeat.

我们提出了一个使用卷积网络进行分类、定位和检测的集成框架。我们展示了如何在一个卷积网络内部高效地实现多尺度和滑动窗口方法。我们还介绍了一种新颖的深度学习定位方法:通过学习来预测物体的边界。为了提高检测的置信度,边界框被累积而不是被抑制。我们展示了使用单个共享网络可以同时学习不同的任务。【文章充分利用了卷积神经网络的特征提取功能,它把分类网络的第一层到第五层看做是特征提取层,只需要改变网络的最后几层,就可以分别实现分类、定位和检测的任务。】这个集成框架获得了ILSVRC2013(2013年ImageNet大型视觉识别挑战赛)定位任务的冠军,并且在检测和分类任务中也获得了相当有竞争力的结果。在赛后的工作中,我们在检测任务上取得了新的最佳成绩。最后,我们发布了基于最佳模型的特征提取器,取名叫OverFeat。

1 Introduction 引言

Recognizing the category of the dominant object in an image is a tasks to which Convolutional Networks (ConvNets) [17] have been applied for many years, whether the objects were handwritten characters [16], house numbers [24], textureless toys [18], traffic signs [3, 26], objects from the Caltech-101 dataset [14], or objects from the 1000-category ImageNet dataset [15]. The accuracy of ConvNets on small datasets such as Caltech-101, while decent, has not been record-breaking.However, the advent of larger datasets has enabled ConvNets to significantly advance the state of the art on datasets such as the 1000-category ImageNet[5].
 

在识别图像中主要物体类别的这类任务上,卷积网络(ConvNets)已经成功应用了很多年,无论识别对象是手写字符、门牌号码、无纹理的玩具、交通标志、Caltech-101数据集中的物体,还是ImageNet数据集中的1000类物体。ConvNets在Caltech-101这样的小数据集上的准确率虽然不错,但并未打破纪录。然而,更大数据集的出现使得ConvNets能够显著提升诸如1000类ImageNet这类数据集上的最优效果。

The main advantage of ConvNets for many such tasks is that the entire system is trained end to end, from raw pixels to ultimate categories, thereby alleviating the requirement to manually design a suitable feature extractor. The main disadvantage is their ravenous appetite for labeled training samples.

ConvNets对于许多此类任务的主要优点是提供了从原始像素到最终分类的端到端的解决方案,因此缓解了需要手动设计一个合适特征提取器的需求【ConvNets可以自动提取特征】。主要劣势是ConvNets需要超大量的带标签训练样本。

The main point of this paper is to show that training a convolutional network to simultaneously classify, locate and detect objects in images can boost the classification accuracy and the detection and localization accuracy of all tasks. The paper proposes a new integrated approach to object detection, recognition, and localization with a single ConvNet. We also introduce a novel method for localization and detection by accumulating predicted bounding boxes. We suggest that by combining
many localization predictions, detection can be performed without training on background samples and that it is possible to avoid the time-consuming and complicated bootstrapping training passes. Not training on background also lets the network focus solely on positive classes for higher accuracy.Experiments are conducted on the ImageNet ILSVRC 2012 and 2013 datasets and establish state of the art results on the ILSVRC 2013 localization and detection tasks.


这篇论文的主要目的是证明:训练一个卷积网络同时对图像中的物体进行分类、定位和检测,可以提升所有这些任务的准确率。论文提出了一种用单个ConvNet进行物体检测、识别和定位的新的集成方法。我们还介绍了一种通过累积预测边界框来进行定位和检测的新方法。我们指出,通过组合大量的定位预测,检测任务可以在不使用背景样本训练的情况下完成,并且可以避免耗时且复杂的bootstrapping训练过程。不在背景样本上训练也让网络只专注于正类,从而获得更高的准确率。我们在ImageNet ILSVRC 2012和2013数据集上进行实验,并在ILSVRC 2013的定位和检测任务上取得了最优结果。

While images from the ImageNet classification dataset are largely chosen to contain a roughly centered object that fills much of the image, objects of interest sometimes vary significantly in size and position within the image. The first idea in addressing this is to apply a ConvNet at multiple locations in the image, in a sliding window fashion, and over multiple scales. Even with this, however, many viewing windows may contain a perfectly identifiable portion of the object (say, the head of a dog), but not the entire object, nor even the center of the object. This leads to decent classification but poor localization and detection. Thus, the second idea is to train the system to not
only produce a distribution over categories for each window, but also to produce a prediction of the location and size of the bounding box containing the object relative to the window. The third idea is to accumulate the evidence for each category at each location and size.


尽管ImageNet分类数据集中的图片大多选取了大致位于画面中心、占据大部分图像的物体,但感兴趣物体在图像中的尺寸和位置有时变化很大。解决这个问题的第一个想法是以滑动窗口的方式,在图像的多个位置和多个尺度上应用ConvNet。但即便如此,许多观察窗口可能只包含物体中完全可辨认的一部分(比如狗的头部),而不包含整个物体,甚至不包含物体的中心,这导致分类效果尚可,但定位和检测效果较差。因此,第二个想法是训练系统不仅为每个窗口生成类别上的概率分布,还要预测包含物体的边界框相对于该窗口的位置和大小。第三个想法是在每个位置和尺寸上累积每个类别的证据(置信度)。

Many authors have proposed to use ConvNets for detection and localization with a sliding window over multiple scales, going back to the early 1990’s for multi-character strings [20], faces [30], and hands [22]. More recently, ConvNets have been shown to yield state of the art performance on text detection in natural images [4], face detection [8, 23] and pedestrian detection [25].

很多作者都提出过在多个尺度上以滑动窗口的方式使用ConvNets来做检测和定位,这一想法可以追溯到1990年代早期,用于多字符字符串、人脸和手的检测。最近,ConvNets已被证明在自然图像中的文本检测、人脸检测和行人检测任务上能达到最佳水平。

Several authors have also proposed to train ConvNets to directly predict the instantiation parameters of the objects to be located, such as the position relative to the viewing window, or the pose of the object. For example Osadchy et al. [23] describe a ConvNet for simultaneous face detection and pose estimation. Faces are represented by a 3D manifold in the nine-dimensional output space. Positions on the manifold indicate the pose (pitch, yaw, and roll). When the training image is a
face, the network is trained to produce a point on the manifold at the location of the known pose. If the image is not a face, the output is pushed away from the manifold. At test time, the distance to the manifold indicate whether the image contains a face, and the position of the closest point on the manifold indicates pose. Taylor et al. [27, 28] use a ConvNet to estimate the location of body parts (hands, head, etc) so as to derive the human body pose. They use a metric learning criterion to train the network to produce points on a body pose manifold. Hinton et al. have also proposed to train networks to compute explicit instantiation parameters of features as part of a recognition
process [12].

一些作者还提出训练ConvNets直接预测待定位物体的实例化参数,比如物体相对于观察窗口的位置、物体的姿态等。例如Osadchy等人用一个卷积网络同时进行人脸检测和姿态估计:人脸在九维输出空间中由一个三维流形表示,流形上的位置对应姿态(俯仰、偏航和滚转)。当训练图像是人脸时,网络被训练为在流形上已知姿态对应的位置输出一个点;如果图像不是人脸,输出则被推离该流形。测试时,输出到流形的距离表明图像是否包含人脸,而流形上最近点的位置则给出姿态。【可参考"流形学习"(manifold learning)】Taylor等人使用ConvNet估计身体部位(手、头等)的位置,从而推断人体姿态,他们使用度量学习(metric learning)准则训练网络,使其输出落在人体姿态流形上的点。Hinton等人也提出过训练网络来计算特征的显式实例化参数,作为识别过程的一部分。

 

Other authors have proposed to perform object localization via ConvNet-based segmentation. The simplest approach consists in training the ConvNet to classify the central pixel (or voxel for volumetric images) of its viewing window as a boundary between regions or not [13]. But when the regions must be categorized, it is preferable to perform semantic segmentation. The main idea is to train the ConvNet to classify the central pixel of the viewing window with the category of the object it belongs to, using the window as context for the decision. Applications range from biological image analysis [21], to obstacle tagging for mobile robots [10] to tagging of photos [7]. The advantage of this approach is that the bounding contours need not be rectangles, and the regions need not be well-circumscribed objects. The disadvantage is that it requires dense pixel-level labels for training. This segmentation pre-processing or object proposal step has recently gained popularity in traditional computer vision to reduce the search space of position, scale and aspect ratio for detection [19, 2, 6, 29]. Hence an expensive classification method can be applied at the optimal location
in the search space, thus increasing recognition accuracy. Additionally, [29, 1] suggest that these methods improve accuracy by drastically reducing unlikely object regions, hence reducing potential false positives. Our dense sliding window method, however, is able to outperform object proposal methods on the ILSVRC13 detection dataset.
 

还有一些作者提出通过基于ConvNet的分割来实现物体定位。最简单的方式是训练ConvNet判断其观察窗口的中心像素(对于体数据则是体素)是否为区域之间的边界。但当需要对区域本身进行分类时,最好进行语义分割:其主要思想是以窗口作为决策的上下文,训练ConvNet预测窗口中心像素所属物体的类别。应用范围包括生物图像分析、移动机器人的障碍物标注以及照片标注。这种方式的优点是物体轮廓不必是矩形,区域也不必是边界清晰的物体;缺点是需要密集的像素级标签用于训练。这种分割预处理或候选区域(object proposal)生成步骤近来在传统计算机视觉中开始流行,用来缩小检测任务中位置、尺度和长宽比的搜索空间。这样就可以把代价高昂的分类方法用在搜索空间中的最优位置,从而提升识别准确率。此外,有文献指出这些方法通过大幅削减不太可能包含物体的区域来提高精度,从而减少潜在的误检。然而,我们的密集滑动窗口方法在ILSVRC13检测数据集上能够超过这类object proposal方法。

Krizhevsky et al. [15] recently demonstrated impressive classification performance using a large ConvNet. The authors also entered the ImageNet 2012 competition, winning both the classification and localization challenges. Although they demonstrated an impressive localization performance, there has been no published work describing how their approach works. Our paper is thus the first to provide a clear explanation how ConvNets can be used for localization and detection for ImageNet data.
 

Krizhevsky等人最近使用大型ConvNet展示了令人印象深刻的分类性能。作者也参加了ImageNet 2012竞赛,同时获得了分类和定位任务的冠军。尽管他们的定位表现很出色,但并没有公开发表过描述其方法的工作。因此我们的论文是第一个清晰阐释ConvNets如何被用于ImageNet数据的定位和检测任务的。

In this paper we use the terms localization and detection in a way that is consistent with their use in the ImageNet 2013 competition, namely that the only difference is the evaluation criterion used and both involve predicting the bounding box for each object in the image.

在本文中,我们对"定位"和"检测"这两个术语的使用与ImageNet 2013竞赛中的用法保持一致:两者唯一的区别在于所用的评估标准,它们都需要预测图像中每个物体的边界框。


2 Vision Tasks 视觉任务
In this paper, we explore three computer vision tasks in increasing order of difficulty: (i) classification, (ii) localization, and (iii) detection. Each task is a sub-task of the next. While all tasks are adressed using a single framework and a shared feature learning base, we will describe them separately in the following sections.

在这篇论文里,我们按难度递增的顺序探索了三项计算机视觉任务:(i) 分类,(ii) 定位,(iii) 检测。每个任务都是下一个任务的子任务。尽管所有任务都用同一个框架和共享的特征学习基础来解决,我们还是会在接下来的章节中分别描述它们。

Throughout the paper, we report results on the 2013 ImageNet Large Scale Visual Recognition Challenge (ILSVRC2013). In the classification task of this challenge, each image is assigned a single label corresponding to the main object in the image. Five guesses are allowed to find the correct answer (this is because images can also contain multiple unlabeled objects). The localization task is similar in that 5 guesses are allowed per image, but in addition, a bounding box for the predicted object must be returned with each guess. To be considered correct, the predicted box must match the groundtruth by at least 50% (using the PASCAL criterion of union over intersection), as well as be labeled with the correct class (i.e. each prediction is a label and bounding box that are associated together). The detection task differs from localization in that there can be any number of objects in each image (including zero), and false positives are penalized by the mean average precision (mAP) measure. The localization task is a convenient intermediate step between classification and detection, and allows us to evaluate our localization method independently of challenges specific to detection (such as learning a background class). In Fig. 1, we show examples of images with our  localization/detection predictions as well as corresponding groundtruth. Note that classification and localization share the same dataset, while detection also has additional data where objects can be smaller. The detection data also contain a set of images where certain objects are absent. This can be used for bootstrapping, but we have not made use of it in this work.

在本文中,我们报告的是2013年ImageNet大型视觉识别挑战赛(ILSVRC2013)上的结果。在这次挑战赛的分类任务中,每张图像被赋予一个对应图中主要物体的标签,允许给出五个猜测来命中正确答案(这是因为图像中还可能包含多个未标注的物体)。定位任务与之类似,每张图像同样允许五个猜测,但每个猜测还必须返回所预测物体的边界框。要被判定为正确,预测框必须与真实框至少有50%的重合(采用PASCAL的交并比IoU标准),并且类别标签正确(即每个预测是由相互关联的一个标签和一个边界框组成的)。检测任务与定位任务的不同之处在于,每张图像中可以有任意数量的物体(包括零个),并且误检会通过平均精度均值(mAP)指标受到惩罚。定位任务是介于分类和检测之间的一个方便的中间步骤,它使我们可以独立于检测特有的难点(例如学习一个背景类)来评估我们的定位方法。图1展示了我们的定位/检测预测结果以及对应的真实标注。注意分类和定位使用相同的数据集,而检测还有额外的数据,其中的物体可能更小;检测数据中还包含一组不含特定物体的图像,这些可以用于bootstrapping,但我们在本文中没有使用。
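
【译者注】上面提到的PASCAL交并比(IoU)判定标准可以用下面的Python小草图来说明;这只是帮助理解的示意代码,函数名和阈值参数均为译者虚构,并非论文或官方评测工具的实现:

```python
def iou(box_a, box_b):
    """计算两个边界框 (x1, y1, x2, y2) 的交并比 IoU。"""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_correct(pred_box, pred_label, gt_box, gt_label, thresh=0.5):
    # 按定位任务的标准:类别正确且 IoU >= 0.5 才算命中
    return pred_label == gt_label and iou(pred_box, gt_box) >= thresh
```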

 

3 Classification 分类任务


Our classification architecture is similar to the best ILSVRC12 architecture by Krizhevsky et al. [15]. However, we improve on the network design and the inference step. Because of time constraints, some of the training features in Krizhevsky’s model were not explored, and so we expect our results can be improved even further. These are discussed in the future work section 6.

我们的分类网络架构与Krizhevsky等人设计的ILSVRC12最佳架构相似,但我们改进了网络设计和推理步骤。由于时间限制,Krizhevsky模型中的一些训练技巧没有被尝试,所以我们预计结果还有进一步提升的空间,这些将在第6节中讨论。


3.1 Model Design and Training【模型设计和训练】


We train the network on the ImageNet 2012 training set (1.2 million images and C = 1000 classes) [5]. Our model uses the same fixed input size approach proposed by Krizhevsky et al. [15] during training but turns to multi-scale for classification as described in the next section. Each image is downsampled so that the smallest dimension is 256 pixels. We then extract 5 random crops (and their horizontal flips) of size 221x221 pixels and present these to the network in mini-batches of size 128. The weights in the network are initialized randomly with (μ, σ) = (0, 1 × 10⁻²). They are then updated by stochastic gradient descent, accompanied by momentum term of 0.6 and an ℓ2 weight decay of 1 × 10⁻⁵. The learning rate is initially 5 × 10⁻² and is successively decreased by a factor of 0.5 after (30, 50, 60, 70, 80) epochs. DropOut [11] with a rate of 0.5 is employed on the fully connected layers (6th and 7th) in the classifier.

我们在ImageNet 2012训练集(120万张图片,C = 1000个类别)上训练网络。训练时我们采用与Krizhevsky等人相同的固定输入尺寸,但在分类(推理)阶段会改用下一节描述的多尺度方式。每张图像先经过下采样,使其最短边为256像素;然后随机截取5个221×221的图像块(及其水平翻转),以128为mini-batch大小输入网络。网络权重用(μ, σ) = (0, 1×10⁻²)的正态分布随机初始化,之后用随机梯度下降法更新,动量项为0.6,ℓ2权重衰减为1×10⁻⁵;学习率初始为5×10⁻²,并在第30、50、60、70、80个epoch时依次乘以0.5;分类器的全连接层(第6、7层)使用比率为0.5的DropOut。
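
【译者注】上述训练超参数大致可以写成如下PyTorch风格的草图;这只是按论文给出的数字拼出的示意配置(`build_optimizer`、`model`等名字为译者假设),并非作者发布的训练代码:

```python
import torch

def build_optimizer(model: torch.nn.Module):
    # 权重按 (mu, sigma) = (0, 1e-2) 的正态分布随机初始化
    for p in model.parameters():
        if p.dim() > 1:
            torch.nn.init.normal_(p, mean=0.0, std=1e-2)
    # 随机梯度下降:初始学习率 5e-2,动量 0.6,L2 权重衰减 1e-5
    optimizer = torch.optim.SGD(model.parameters(), lr=5e-2,
                                momentum=0.6, weight_decay=1e-5)
    # 在第 30、50、60、70、80 个 epoch 时学习率乘以 0.5
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[30, 50, 60, 70, 80], gamma=0.5)
    return optimizer, scheduler
```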

We detail the architecture sizes in tables 1 and 3. Note that during training, we treat this architecture as non-spatial (output maps of size 1x1), as opposed to the inference step, which produces spatial outputs. Layers 1-5 are similar to Krizhevsky et al. [15], using rectification (“relu”) non-linearities and max pooling, but with the following differences: (i) no contrast normalization is used; (ii) pooling regions are non-overlapping and (iii) our model has larger 1st and 2nd layer feature maps, thanks to a smaller stride (2 instead of 4). A larger stride is beneficial for speed but will hurt accuracy.

表格1和表格3详细列出了快速型和精确型两种网络的结构和尺寸。注意在训练过程中,我们把这个结构当作非空间的来对待(输出特征图尺寸为1x1);而在推理(测试)阶段则会以滑动卷积的方式产生带空间信息的输出。网络的前5层与Krizhevsky等人的设计类似,使用ReLU激活函数和最大池化,但有以下区别:

  • 没有使用对比度归一化(局部响应归一化)层;
  • 池化区域不重叠;
  • 得益于更小的步长(2而不是4),我们模型第一、二层的特征图更大。步长越大速度越快,但会损失精度。
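
【译者注】下面给出前5层特征提取器的一个PyTorch示意草图,仅用来体现上面三点区别(无LRN、池化不重叠、第一层小步长);各层的卷积核大小与通道数只是示意,并非严格照搬表1/表3:

```python
import torch.nn as nn

# 示意草图:类 AlexNet 的前 5 层特征提取器。
# 输入 3x221x221 时,最后一层输出空间尺寸恰好为 5x5,与正文中分类器 5x5 的输入一致。
feature_extractor = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=7, stride=2), nn.ReLU(inplace=True),   # 小步长 2
    nn.MaxPool2d(kernel_size=3, stride=3),                              # 不重叠池化
    nn.Conv2d(96, 256, kernel_size=7, stride=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(256, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(512, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=3),
)
```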
In Fig. 2, we show the filter coefficients from the first two convolutional layers. The first layer filters capture orientated edges, patterns and blobs. In the second layer, the filters have a variety of forms,some diffuse, others with strong line structures or oriented edges.

在图示2,我们展示了前两个卷积层的滤波系数。第一个卷积层捕捉定向边缘、图案和斑点。 在第二层中,滤波器有多种形式,有些是漫射的,有些则具有强线性结构或定向边缘

Table 1: Architecture specifics for fast model. The spatial size of the feature maps depends on the input image size, which varies during our inference step (see Table 5 in the Appendix). Here we show training spatial sizes. Layer 5 is the top convolutional layer. Subsequent layers are fully connected, and applied in sliding window fashion at test time. The fully-connected layers can also be seen as 1x1 convolutions in a spatial setting. Similar sizes for accurate model can be found in the Appendix Table 3.
表格1:快速模型的网络结构。特征图的空间尺寸取决于输入图像的大小,在推理阶段会随输入变化(见附录中的表格5),这里展示的是训练阶段的空间尺寸。第5层是最后一个卷积层,后面的层是全连接层,在测试阶段以滑动窗口的方式应用;这些全连接层也可以看作空间上的1x1卷积。精确模型的类似尺寸信息见附录表格3。
 
Table 3: Architecture specifics for accurate model. It differs from the fast model mainly in the stride of the first convolution, the number of stages and the number of feature maps.
表格3:精确模型的网络结构。它与快速模型的区别主要在于第一层卷积的步长、网络阶段(层)的数目以及特征图的数目。

 

 

3.2 Feature Extractor 特征提取器

Along with this paper, we release a feature extractor named “OverFeat” in order to provide powerful features for computer vision research. Two models are provided, a fast and accurate one. Each architecture is described in tables 1 and 3. We also compare their sizes in Table 4 in terms of parameters and connections. The accurate model is more accurate than the fast one (14.18% classification error as opposed to 16.39% in Table 2), however it requires nearly twice as many connections. Using a committee of 7 accurate models reaches 13.6% classification error as shown in Fig. 4.

本文中,我们发布了名为"OverFeat"的特征提取器,以便为计算机视觉研究提供强有力的特征。我们提供了两个模型:一个快速版,一个精确版,网络结构分别见表格1和表格3。表格4对比了它们的参数量和连接数。精确模型比快速模型更准确(分类错误率14.18%对16.39%,见表格2),但需要接近两倍的连接数。如图4所示,使用由7个精确模型组成的委员会(集成)可以把分类错误率降到13.6%。


3.3 Multi-Scale Classification 多尺度的分类方法


In [15], multi-view voting is used to boost performance: a fixed set of 10 views (4 corners and center,with horizontal flip) is averaged. However, this approach can ignore many regions of the image, and is computationally redundant when views overlap. Additionally, it is only applied at a single scale,which may not be the scale at which the ConvNet will respond with optimal confidence.

在[15]中,作者使用multi-view voting(多视图投票)来提升性能:对固定的10个视图(4个角和中心,以及各自的水平翻转)取平均。但是,这种方法会忽略图像的很多区域,并且当视图重叠时计算是冗余的。此外,它只在单一尺度上进行,而该尺度未必是ConvNet响应置信度最高的尺度。

Instead, we explore the entire image by densely running the network at each location and at multiple scales. While the sliding window approach may be computationally prohibitive for certain types of model, it is inherently efficient in the case of ConvNets (see section 3.5). This approach yields significantly more views for voting, which increases robustness while remaining efficient. The result of convolving a ConvNet on an image of arbitrary size is a spatial map of C-dimensional vectors at each scale.

与此不同,我们在多个尺度下、在图像的每个位置上密集地运行网络来探索整幅图像。虽然对某些类型的模型来说,滑动窗口方式的计算量大到不可行,但对ConvNets而言它天然就是高效的(见3.5节)。这种方式会产生明显更多的视图参与投票,在保持高效的同时增强了鲁棒性。在任意尺寸的图像上对ConvNet做卷积,会在每个尺度上得到一张由C维向量组成的空间特征图。

However, the total subsampling ratio in the network described above is 2x3x2x3, or 36. Hence when applied densely, this architecture can only produce a classification vector every 36 pixels in the input dimension along each axis. This coarse distribution of outputs decreases performance compared to the 10-view scheme because the network windows are not well aligned with the objects in the images. The better aligned the network window and the object, the strongest the confidence of the network response. To circumvent this problem, we take an approach similar to that introduced by Giusti et al. [9], and apply the last subsampling operation at every offset. This removes the loss of resolution from this layer, yielding a total subsampling ratio of x12 instead of x36.

但是,上述网络的总降采样比例为2x3x2x3,即36。因此在密集应用时,这个结构沿每个坐标轴每36个输入像素才能产生一个分类向量。与10-view方案相比,这种粗糙的输出分布会降低性能,因为网络窗口与图像中的物体不能很好地对齐;网络窗口与物体对齐得越好,网络响应的置信度就越高。为了绕开这个问题,我们采用与Giusti等人类似的方法,在每个偏移位置上都执行最后一次降采样操作。这样就消除了这一层带来的分辨率损失,使总的降采样比例变为12,而不是36。

We now explain in detail how the resolution augmentation is performed. We use 6 scales of input which result in unpooled layer 5 maps of varying resolution (see Table 5 for details). These are then pooled and presented to the classifier using the following procedure, illustrated in Fig. 3:

(a) For a single image, at a given scale, we start with the unpooled layer 5 feature maps.


(b) Each of unpooled maps undergoes a 3x3 max pooling operation (non-overlapping regions), repeated 3x3 times for (Δx, Δy) pixel offsets of {0, 1, 2}.

(c) This produces a set of pooled feature maps, replicated (3x3) times for different (Δx, Δy) combinations.

(d) The classifier (layers 6,7,8) has a fixed input size of 5x5 and produces a C-dimensional output vector for each location within the pooled maps. The classifier is applied in sliding-window fashion to the pooled maps, yielding C-dimensional output maps (for a given (Δx, Δy) combination).

(e) The output maps for different (Δx, Δy) combinations are reshaped into a single 3D output map (two spatial dimensions x C classes)

接下来我们详细解释分辨率是如何增大的。我们使用6种尺寸的输入,这样第5层在池化之前会得到不同分辨率的特征图(详见表格5),然后按以下步骤进行池化并送入分类器,如图3所示:

(a)对于给定尺度下的单张图像,我们从第5层未池化的特征图出发。

(b)对每一个未池化的特征图做3x3的最大池化(池化区域不重叠),并对(Δx, Δy)像素偏移分别重复这一操作【Δx, Δy ∈ {0, 1, 2},所以一共3x3=9次】。

(c)这样就产生了一组池化后的特征图,针对不同的(Δx, Δy)组合,特征图份数变为原来的(3x3)倍。

(d)分类器(第6、7、8层)的输入尺寸固定为5x5,它对池化后特征图的每个位置都会产生一个C维输出(C为类别数目)。分类器以滑动窗口的方式作用于池化后的特征图,对每一个给定的(Δx, Δy)组合输出一张C维的输出图。【原始网络被分成特征提取器和分类器:前5层为特征提取器,第6、7、8层为分类器】

(e)把不同(Δx, Δy)组合的输出图交错排列,重组成一张三维输出图(两个空间维度 x C个类别)。
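
【译者注】下面用NumPy给出上述offset max-pooling(步骤b、c)和输出交错拼接(步骤e)的示意实现;这只是帮助理解的草图,假设各偏移下的输出尺寸相同,并非论文的官方代码:

```python
import numpy as np

def offset_max_pool(fmap, k=3):
    """对形状为 (C, H, W) 的特征图做带偏移的 kxk 不重叠最大池化,
    返回字典 {(dy, dx): 池化后的特征图},对应步骤 (b)(c)。"""
    C, H, W = fmap.shape
    out = {}
    for dy in range(k):
        for dx in range(k):
            h, w = (H - dy) // k, (W - dx) // k
            crop = fmap[:, dy:dy + h * k, dx:dx + w * k]
            out[(dy, dx)] = crop.reshape(C, h, k, w, k).max(axis=(2, 4))
    return out

def interleave(maps, k=3):
    """把 kxk 个偏移对应的输出图交错拼回一张大图,对应步骤 (e)。
    这里假设 maps[(dy, dx)] 的形状都相同,均为 (C, h, w)。"""
    C, h, w = maps[(0, 0)].shape
    full = np.zeros((C, h * k, w * k), dtype=maps[(0, 0)].dtype)
    for (dy, dx), m in maps.items():
        full[:, dy::k, dx::k] = m
    return full
```

这样交错后的输出图的空间分辨率是只做一次3x3池化的3倍,对应正文中总降采样比例由36变为12的说法。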

Figure 3: 1D illustration (to scale) of output map computation for classification, using y-dimension from scale 2 as an example (see Table 5). (a): 20 pixel unpooled layer 5 feature map. (b): max pooling over non-overlapping 3 pixel groups, using offsets of Δ = {0, 1, 2} pixels (red, green, blue respectively). (c): The resulting 6 pixel pooled maps, for different Δ. (d): 5 pixel classifier (layers 6,7) is applied in sliding window fashion to pooled maps, yielding 2 pixel by C maps for each Δ. (e): reshaped into 6 pixel by C output maps.
图例3:从未池化的第5层特征图到最终用于分类的特征图的计算过程1维空间图示(针对特定scale),我们使用scale种类为2的y-维空间作为图例(表格5中列举了不同scale计算的输入和输出尺寸)。
(a)未池化以前,第五层的特征图在y-空间上有20个像素。
(b)对这20个像素,分别采用offset=0,1 ,2,以3个像素为一组,进行没有重叠的最大池化操作,输出结果分别对应图中的红、绿、蓝。
(c)最终针对不同的offset产生对应的6个像素的特征图
(d)针对每个池化后6个像素的特征图,在网络的6,7层,以5个像素为一组进行滑窗,产生2个像素xC的映射结果
(e)重新交错整合这3x2个像素,输出最终的特征图
 
Table 5: Spatial dimensions of our multi-scale approach. 6 different sizes of input images are used, resulting in layer 5 unpooled feature maps of differing spatial resolution (although not indicated in the table, all have 256 feature channels). The (3x3) results from our dense pooling operation with (Δx, Δy) = {0, 1, 2}. See text and Fig. 3 for details for how these are converted into output maps.
表格5:我们多尺度方法中各层的空间维度。我们使用了6种不同尺寸的输入图像,因此第5层未池化的特征图具有不同的空间分辨率(尽管表中没有标出,所有尺度下第5层都有256个特征通道)。表中的(3x3)来自于我们采用(Δx, Δy) = {0, 1, 2}的密集池化操作。这些特征图如何转换为输出图的细节请参考正文和图3。

These operations can be viewed as shifting the classifier's viewing window by 1 pixel through pooling layers without subsampling and using skip-kernels in the following layer (where values in the neighborhood are non-adjacent). Or equivalently, as applying the final pooling layer and fully-connected stack at every possible offset, and assembling the results by interleaving the outputs.

以上操作可以看成:借助不做降采样的池化,并在随后的层中使用跳跃核(即邻域中取值的位置彼此不相邻),把分类器的观察窗口每次移动1个像素。或者等价地,看成在每一个可能的偏移处应用最后的池化层和全连接层堆栈,再把各自的输出交错拼接起来。

The procedure above is repeated for the horizontally flipped version of each image. We then produce the final classification by (i) taking the spatial max for each class, at each scale and flip; (ii) averaging the resulting C-dimensional vectors from different scales and flips and (iii) taking the top-1 or top-5 elements (depending on the evaluation criterion) from the mean class vector.

对每张图像的水平翻转版本也重复上述过程。然后我们通过以下步骤得到最终分类结果:(i) 对每个尺度和每个翻转版本,分别取每个类别在空间位置上的最大值;(ii) 对不同尺度和翻转得到的C维向量求平均;(iii) 从平均后的类别向量中取top-1或top-5(取决于评估标准)。
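
【译者注】这三步可以写成如下几行Python/NumPy草图(仅为示意,`score_maps`表示各尺度及其翻转版本输出的类别得分图):

```python
import numpy as np

def combine_predictions(score_maps, topk=5):
    """score_maps: 每个元素是一个尺度(或其水平翻转)输出的得分图,形状 (C, H_s, W_s)。"""
    # (i) 每个尺度/翻转内,对空间位置取最大,得到一个 C 维向量
    per_view = [m.reshape(m.shape[0], -1).max(axis=1) for m in score_maps]
    # (ii) 不同尺度/翻转之间取平均
    mean_scores = np.mean(per_view, axis=0)
    # (iii) 取平均后得分最高的 top-k 个类别
    return np.argsort(mean_scores)[::-1][:topk]
```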

 At an intuitive level, the two halves of the network — i.e. feature extraction layers (1-5) and classifier layers (6-output) — are used in opposite ways. In the feature extraction portion, the filters are convolved across the entire image in one pass. From a computational perspective, this is far more efficient than sliding a fixed-size feature extractor over the image and then aggregating the results from different locations2. However, these principles are reversed for the classifier portion of the network. Here, we want to hunt for a fixed-size representation in the layer 5 feature maps across different positions and scales. Thus the classifier has a fixed-size 5x5 input and is exhaustively applied to the layer 5 maps. The exhaustive pooling scheme (with single pixel shifts (?x, ?y)) ensures that we can obtain fine alignment between the classifier and the representation of the object in the feature map
 

直观地看,网络的两个部分——特征提取层(1-5层)和分类层(第6层到输出层)——的使用方式是相反的。在特征提取部分,滤波器一次性在整幅图像上完成卷积;从计算角度看,这比用固定尺寸的特征提取器在图像上滑动、再汇总不同位置的结果要高效得多。而分类器部分的原则恰好相反:我们要在不同位置和尺度的第5层特征图中搜寻一个固定尺寸的表示。因此分类器的输入固定为5x5,并被穷举式地应用到第5层特征图的所有位置上。这种穷举式的池化方案(对(Δx, Δy)做单像素平移)保证了分类器可以与特征图中物体的表示精细对齐。

3.4 Results 实验结果


In Table 2, we experiment with different approaches, and compare them to the single network model of Krizhevsky et al. [15] for reference. The approach described above, with 6 scales, achieves a top-5 error rate of 13.6%. As might be expected, using fewer scales hurts performance: the single-scale model is worse with 16.97% top-5 error. The fine stride technique illustrated in Fig. 3 brings a relatively small improvement in the single scale regime, but is also of importance for the multi-scale gains shown here.

【Our network with 6 scales takes around 2 secs on a K20x GPU to process one image】

在表格2中,我们试验了不同的方法,并与Krizhevsky等人的单网络模型做对比作为参照。上面描述的方法在使用6个尺度时取得了13.6%的top-5错误率。正如预期,使用较少的尺度会损害性能:单尺度模型较差,top-5错误率为16.97%。图3所示的精细步长技术(offset max-pooling)在单尺度情形下带来的改进相对较小,但对于这里展示的多尺度增益同样重要。【我们带有6个尺度的网络在一块K20x GPU上处理一张图片大约需要2秒】

Table 2: Classification experiments on validation set. Fine/coarse stride refers to the number of Δ values used when applying the classifier. Fine: Δ = 0, 1, 2; coarse: Δ = 0.
表格2:在验证集上的分类实验结果。Fine/coarse stride 是指应用分类器时所用 Δ 取值的数量。精细(fine):Δ = 0, 1, 2;粗糙(coarse):Δ = 0。
Figure 4: Test set classification results. During the competition, OverFeat yielded 14.2% top 5 error rate using an average of 7 fast models. In post-competition work, OverFeat ranks fifth with 13.6% error using bigger models (more features and more layers).

We report the test set results of the 2013 competition in Fig. 4 where our model (OverFeat) obtained 14.2% accuracy by voting of 7 ConvNets (each trained with different initializations) and ranked 5th out of 18 teams. The best accuracy using only ILSVRC13 data was 11.7%. Pre-training with extra data from the ImageNet Fall11 dataset improved this number to 11.2%. In post-competition work,we improve the OverFeat results down to 13.6% error by using bigger models (more features and more layers). Due to time constraints, these bigger models are not fully trained, more improvements are expected to appear in time.

我们在图4中报告了2013年竞赛的测试集结果:我们的模型(OverFeat)通过7个ConvNet(每个用不同的初始化训练)投票,取得了14.2%的top-5错误率,在18支队伍中排名第5。仅使用ILSVRC13数据的最好成绩(错误率)为11.7%;使用ImageNet Fall11数据集的额外数据做预训练可以把这个数字提升到11.2%。在赛后的工作中,我们使用更大的模型(更多特征和更多层)把OverFeat的错误率降到了13.6%。由于时间限制,这些大模型没有训练充分,假以时日应该还会有更多提升。

3.5 ConvNets and Sliding Window Efficiency 卷积网络和滑动窗口效率


In contrast to many sliding-window approaches that compute an entire pipeline for each window of the input one at a time, ConvNets are inherently efficient when applied in a sliding fashion because they naturally share computations common to overlapping regions. When applying our network to larger images at test time, we simply apply each convolution over the extent of the full image. This extends the output of each layer to cover the new image size, eventually producing a map of output class  predictions, with one spatial location for each “window” (field of view) of input. This is diagrammed in Fig. 5. Convolutions are applied bottom-up, so that the computations common to neighboring windows need only be done once.

许多滑动窗口方法需要对输入的每个窗口完整地重新计算一遍整个流水线;与之相比,ConvNets以滑动方式应用时天然就是高效的,因为重叠区域的公共计算是共享的。在测试阶段把网络应用到更大的图像上时,我们只需把每个卷积作用到整幅图像的范围上。这会把每一层的输出扩展到新的图像尺寸,最终产生一张输出类别预测图,其中每个空间位置对应输入中的一个"窗口"(感受野),如图5所示。卷积是自底向上进行的,所以相邻窗口之间的公共计算只需要做一次。

Note that the last layers of our architecture are fully connected linear layers. At test time, these layers are effectively replaced by convolution operations with kernels of 1x1 spatial extent. The entire ConvNet is then simply a sequence of convolutions, max-pooling and thresholding operations exclusively.

请注意,我们架构的最后几层是全连接的线性层。在测试时,这些层被等效地替换为核空间尺寸为1x1的卷积操作。这样整个ConvNet就只是一连串的卷积、最大池化和阈值(ReLU)操作。
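
【译者注】"把全连接层替换为卷积"可以用下面的PyTorch草图说明:它把一个原本作用在 spatial x spatial x in_channels 输入上的 nn.Linear 改写成等价的卷积层(结合3.3节,第6层对应 spatial=5,第7、8层对应 spatial=1)。这只是译者给出的示意转换,函数名为虚构,并假设全连接层的输入是按 C×H×W 顺序展平的:

```python
import torch
import torch.nn as nn

def fc_to_conv(fc: nn.Linear, in_channels: int, spatial: int) -> nn.Conv2d:
    """把全连接层改写为等价的卷积层,使其可以在任意大小的特征图上滑动。"""
    assert fc.in_features == in_channels * spatial * spatial
    conv = nn.Conv2d(in_channels, fc.out_features, kernel_size=spatial)
    with torch.no_grad():
        conv.weight.copy_(
            fc.weight.view(fc.out_features, in_channels, spatial, spatial))
        conv.bias.copy_(fc.bias)
    return conv
```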

Figure 5: The efficiency of ConvNets for detection. During training, a ConvNet produces only a single spatial output (top). But when applied at test time over a larger image, it produces a spatial output map, e.g. 2x2 (bottom). Since all layers are applied convolutionally, the extra computation required for the larger image is limited to the yellow regions. This diagram omits the feature dimension for simplicity.
图5:ConvNets用于检测时的高效性。训练时,ConvNet只产生一个空间输出(上图);但在测试时应用到更大的图像上,它会产生一张空间输出图,例如2x2(下图)。由于所有层都是以卷积方式应用的,处理更大图像所需的额外计算仅限于黄色标注的区域。为简洁起见,图中省略了特征维度(通道数)。

 

4 Localization 定位任务


Starting from our classification-trained network, we replace the classifier layers by a regression network and train it to predict object bounding boxes at each spatial location and scale. We then combine the regression predictions together, along with the classification results at each location, as we now describe.

基于已经训练完的分类网络,我们用回归网络替换了后面的分类层,并对其进行训练,以预测每个空间位置和尺度上的物体边界框。然后我们将每个位置的分类结果和回归预测结合在一起输出结果。

4.1 Generating Predictions  生成原始的预测边框


To generate object bounding box predictions, we simultaneously run the classifier and regressor networks across all locations and scales. Since these share the same feature extraction layers, only the final regression layers need to be recomputed after computing the classification network. The output of the final softmax layer for a class c at each location provides a score of confidence that an object of class c is present (though not necessarily fully contained) in the corresponding field of view. Thus we can assign a confidence to each bounding box.

为了生成物体边界框预测,我们在所有位置和尺度上同时运行分类器和回归网络。 由于它们共享相同的特征提取层,所以在计算分类网络之后,只需要重新计算最终的回归层。最终softmax层输出每个位置处属于某个类别c的概率,提供了在相应的视野中存在类别c(虽然不一定完全包含)的置信度分数。 因此我们可以为每个边界框分配一个置信度。
4.2 Regressor Training 回归训练


The regression network takes as input the pooled feature maps from layer 5. It has 2 fully-connected hidden layers of size 4096 and 1024 channels, respectively. The final output layer has 4 units which specify the coordinates for the bounding box edges. As with classification, there are (3x3) copies throughout, resulting from the Δx, Δy shifts. The architecture is shown in Fig. 8.

回归网络以第5层池化后的特征图作为输入。它有两个全连接的隐藏层,通道数分别为4096和1024。最终的输出层有4个单元,给出边界框各条边的坐标。与分类网络一样,由于Δx, Δy偏移,整个网络共有(3x3)份拷贝。网络结构如图8所示。

Figure 8: Application of the regression network to layer 5 features, at scale 2, for example. (a) The input to the regressor at this scale are 6x7 pixels spatially by 256 channels for each of the (3x3) Δx, Δy shifts. (b) Each unit in the 1st layer of the regression net is connected to a 5x5 spatial neighborhood in the layer 5 maps, as well as all 256 channels. Shifting the 5x5 neighborhood around results in a map of 2x3 spatial extent, for each of the 4096 channels in the layer, and for each of the (3x3) Δx, Δy shifts. (c) The 2nd regression layer has 1024 units and is fully connected (i.e. the purple element only connects to the purple element in (b), across all 4096 channels). (d) The output of the regression network is a 4-vector (specifying the edges of the bounding box) for each location in the 2x3 map, and for each of the (3x3) Δx, Δy shifts.
图8:将回归网络应用于第5层特征图的示例(以表格5中scale=2为例)。
(a)在这个尺度下,对每一个(3x3)的(Δx, Δy)偏移,回归器的输入在空间上是6x7个像素,通道数为256。
(b)回归网络第一层中的每个单元都连接到第5层特征图中一个5x5的空间邻域以及全部256个通道。在6x7的特征图上移动这个5x5邻域,对该层的4096个通道中的每一个、以及每个(3x3)的(Δx, Δy)偏移,都得到一张2x3空间范围的特征图。
(c)回归网络的第二层有1024个单元,与第一层全连接(即图中紫色单元只与(b)中跨全部4096个通道的紫色单元相连)。
(d)回归网络的输出是一个4维向量(给出边界框各条边的位置),对应2x3特征图中的每个位置以及每个(3x3)的(Δx, Δy)偏移。
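
【译者注】把上面的回归网络写成全卷积形式,大致相当于下面的PyTorch草图;激活函数等细节论文未明确给出,这里假设使用ReLU,而且按论文,最后的4维输出层是每个类别各有一份的,此处只画出单个类别的版本:

```python
import torch.nn as nn

# 单个类别的回归头:第一层在第 5 层特征图(256 通道)的 5x5 邻域上卷积得到 4096 通道;
# 第二层为 1x1 卷积得到 1024 通道;输出层为 1x1 卷积,在每个空间位置输出 4 个边框坐标。
bbox_regressor = nn.Sequential(
    nn.Conv2d(256, 4096, kernel_size=5), nn.ReLU(inplace=True),
    nn.Conv2d(4096, 1024, kernel_size=1), nn.ReLU(inplace=True),
    nn.Conv2d(1024, 4, kernel_size=1),
)
```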

We fix the feature extraction layers (1-5) from the classification network and train the regression network using an ℓ2 loss between the predicted and true bounding box for each example. The final regressor layer is class-specific, having 1000 different versions, one for each class. We train this network using the same set of scales as described in Section 3. We compare the prediction of the regressor net at each spatial location with the ground-truth bounding box, shifted into the frame of reference of the regressor's translation offset within the convolution (see Fig. 8). However, we do not train the regressor on bounding boxes with less than 50% overlap with the input field of view: since the object is mostly outside of these locations, it will be better handled by regression windows that do contain the object.

我们固定来自分类网络的特征提取层(1-5层)不变,用每个样本的预测边界框与真实边界框之间的ℓ2损失来训练回归网络。最后的回归层是按类别区分的,共有1000个不同版本,每个类别一个。我们使用与第3节相同的一组尺度来训练该网络,并把回归网络在每个空间位置的预测,与平移到该回归器在卷积中的偏移参考系下的真实边界框进行比较(见图8)。不过,当真实边界框与输入视野的重叠小于50%时,我们不在其上训练回归器:因为此时物体大部分位于这些位置之外,交给真正包含该物体的回归窗口处理会更好。


Training the regressors in a multi-scale manner is important for the across-scale prediction combination. Training on a single scale will perform well on that scale and still perform reasonably on other scales. However training multi-scale will make predictions match correctly across scales and exponentially increase the confidence of the merged predictions. In turn, this allows us to perform well with a few scales only, rather than many scales as is typically the case in detection. The typical ratio from one scale to another in pedestrian detection [25] is about 1.05 to 1.1, here however we use a large ratio of approximately 1.4 (this number differs for each scale since dimensions are adjusted to fit exactly the stride of our network) which allows us to run our system faster.

以多尺度的方式训练回归器对于跨尺度的预测组合很重要。只在单一尺度上训练的网络会在该尺度上表现良好,在其他尺度上也还算合理;但多尺度训练能让预测在不同尺度之间正确匹配,并成倍地提升合并后预测的置信度。反过来,这也使我们只用少数几个尺度就能表现良好,而不像检测任务中通常需要很多尺度。行人检测中相邻尺度之间的典型比例约为1.05到1.1,而我们这里使用了约1.4的较大比例(由于各尺度的尺寸要恰好匹配网络的步长,这个数字在不同尺度间略有不同),这让我们的系统可以运行得更快。

4.3 Combining Predictions 预测边框的合并

Figure 7: Examples of bounding boxes produced by the regression network, before being combined into final predictions. The examples shown here are at a single scale. Predictions may be more optimal at other scales depending on the objects. Here, most of the bounding boxes which are initially organized as a grid, converge to a single location and scale. This indicates that the network is very confident in the location of the object, as opposed to being spread out randomly. The top left image shows that it can also correctly identify multiple location if several objects are present. The various aspect ratios of the predicted bounding boxes shows that the network is able to cope with various object poses.
图7:回归网络产生的边界框示例,此时它们尚未被合并成最终预测。这里展示的例子来自单一尺度;取决于物体的不同,在其他尺度上的预测可能更优。图中,大多数最初按网格排布的边界框都收敛到了同一个位置和尺度上,这说明网络对物体的位置非常确定,而不是随机散布的。左上角的图片表明,如果图中有多个物体,它也能正确识别出多个位置。预测边界框呈现出各种长宽比,说明网络能够应对物体的各种姿态。


We combine the individual predictions (see Fig. 7) via a greedy merge strategy applied to the regressor bounding boxes, using the following algorithm.

我们使用以下算法,通过针对于回归边界框的贪婪合并策略,来组合单个预测(见图7)。 

(a) Assign to Cs the set of classes in the top k for each scale s ∈ 1 . . . 6, found by taking the maximum detection class outputs across spatial locations for that scale.

(b) Assign to Bs the set of bounding boxes predicted by the regressor network for each class in Cs,across all spatial locations at scale s

(c) Assign B ← ∪s Bs

(d) Repeat merging until done:

(e) (b1*, b2*) = argmin_{b1 ≠ b2 ∈ B} match_score(b1, b2)

(f) If match_score(b1*, b2*) > t, stop.

(g) Otherwise, set B ← B \ {b1*, b2*} ∪ box_merge(b1*, b2*)

(a) 在6个缩放比例上运行密集型分类网络,在每个比例s上选取top-k个类别,作为图片所属类别的集合,设为Cs

(b)  在每个缩放比例的所有空间位置上运行回归网络,产生类别集合Cs中每个类别对应的bounding box集合,设为Bs

(c) 各个比例的Bs的并集设为B

(d)重复执行以下合并策略直到结束:

(e)计算B中任意两个边界框b1、b2的匹配分数match_score,找到分数最小(差异最小)的一对边框(b1*, b2*)

(f)如果match_score(b1*, b2*) > t,停止合并

(g)否则,从B中删除b1*和b2*,并把box_merge(b1*, b2*)合并后的边框放入B中

 

In the above, we compute match score using the sum of the distance between centers of the two bounding boxes and the intersection area of the boxes. box merge compute the average of the bounding boxes’ coordinates.

在以上过程中,match_score用两个边界框中心之间的距离与两框相交面积的组合来计算;box_merge则通过对两个边界框的坐标取平均来得到合并后的边框。
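
【译者注】上述贪婪合并策略大致可以写成下面的Python草图。论文没有给出match_score的精确公式,这里用"中心距离减去相交面积"来示意(分数越小越匹配),阈值t等参数也都是译者的假设:

```python
import itertools

def center_dist(b1, b2):
    cx1, cy1 = (b1[0] + b1[2]) / 2, (b1[1] + b1[3]) / 2
    cx2, cy2 = (b2[0] + b2[2]) / 2, (b2[1] + b2[3]) / 2
    return ((cx1 - cx2) ** 2 + (cy1 - cy2) ** 2) ** 0.5

def intersection(b1, b2):
    w = max(0.0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    h = max(0.0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    return w * h

def match_score(b1, b2):
    # 论文只说明 match_score 由两框中心距离与相交面积共同决定,未给出公式;
    # 这里用"中心距离 - 相交面积"示意:分数越小越匹配。
    return center_dist(b1, b2) - intersection(b1, b2)

def box_merge(b1, b2):
    # 合并方式:对两个边框的坐标取平均
    return tuple((a + b) / 2 for a, b in zip(b1, b2))

def greedy_merge(boxes, t=0.0):
    """boxes: [(x1, y1, x2, y2), ...];t 为停止合并的阈值(假设值)。"""
    B = list(boxes)
    while len(B) > 1:
        b1, b2 = min(itertools.combinations(B, 2),
                     key=lambda pair: match_score(*pair))
        if match_score(b1, b2) > t:
            break
        B.remove(b1)
        B.remove(b2)
        B.append(box_merge(b1, b2))
    return B
```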

The final prediction is given by taking the merged bounding boxes with maximum class scores. This is computed by cumulatively adding the detection class outputs associated with the input windows from which each bounding box was predicted. See Fig. 6 for an example of bounding boxes merged into a single high-confidence bounding box. In that example, some turtle and whale bounding boxes appear in the intermediate multi-scale steps, but disappear in the final detection image. Not only do these bounding boxes have low classification confidence (at most 0.11 and 0.12 respectively), their collection is not as coherent as the bear bounding boxes to get a significant confidence boost. The bear boxes have a strong confidence (approximately 0.5 on average per scale) and high matching scores. Hence after merging, many bear bounding boxes are fused into a single very high confidence box, while false positives disappear below the detection threshold due their lack of bounding box coherence and confidence. This analysis suggest that our approach is naturally more robust to false positives coming from the pure-classification model than traditional non-maximum suppression, by rewarding bounding box coherence

最终预测由合并后类别分数最大的边界框给出:这个分数是把预测出每个边界框的那些输入窗口所对应的检测类别输出累加得到的。图6给出了把多个边界框合并为单个高置信度边界框的例子。在这个例子中,一些龟和鲸鱼的边界框出现在中间的多尺度步骤里,但在最终的检测图像中消失了:这些边界框不仅类别置信度低(至多分别只有0.11和0.12),它们的集合也不像熊的边界框那样相互一致,因而得不到显著的置信度提升。熊的边界框有很强的置信度(每个尺度平均约为0.5)和很高的匹配分数,因此合并之后,许多熊的边界框融合成一个置信度非常高的边界框;而误报由于缺乏边界框的一致性和置信度,跌到检测阈值之下而消失。这一分析表明,通过奖励边界框的一致性,我们的方法对来自纯分类模型的误报,天然地比传统的非极大值抑制更加稳健。

 

Figure 6: Localization/Detection pipeline. The raw classifier/detector outputs a class and a confidence for each location (1st diagram). The resolution of these predictions can be increased using the method described in section 3.3 (2nd diagram). The regression then predicts the location scale of the object with respect to each window (3rd diagram). These bounding boxes are then merge and accumulated to a small number of objects (4th diagram)
图6:定位/检测流水线。原始的分类器/检测器在每个位置输出一个类别和一个置信度(第一行);利用3.3节描述的方法可以提高这些预测的分辨率(第二行);然后回归网络预测物体相对于每个窗口的位置和尺度(第三行);最后这些边界框被合并、累积成少量的物体(第四行)。

 

4.4 Experiments 实验结果

Figure 9: Localization experiments on ILSVRC12 validation set. We experiment with different number of scales and with the use of single-class regression (SCR) or per-class regression (PCR)
图9:在ILSVRC12验证集上的定位实验。我们试验了不同数量的尺度,并分别使用单类回归(SCR)和逐类回归(PCR)。


We apply our network to the Imagenet 2012 validation set using the localization criterion specified for the competition. The results for this are shown in Fig. 9. Fig. 10 shows the results of the 2012 and 2013 localization competitions (the train and test data are the same for both of these years). Our method is the winner of the 2013 competition with 29.9% error.

我们按照竞赛规定的定位评估标准,将网络应用于ImageNet 2012验证集,结果见图9。图10给出了2012和2013年定位竞赛的结果(这两年的训练和测试数据相同)。我们的方法以29.9%的错误率获得了2013年竞赛的冠军。


Our multiscale and multi-view approach was critical to obtaining good performance, as can be seen in Fig. 9: Using only a single centered crop, our regressor network achieves an error rate of 40%. By combining regressor predictions from all spatial locations at two scales, we achieve a vastly better error rate of 31.5%. Adding a third and fourth scale further improves performance to 30.0% error.

我们的多尺度、多视图方法是取得良好性能的关键,如图9所示:只使用单个居中裁剪时,我们的回归网络错误率为40%;当组合两个尺度上所有空间位置的回归预测时,错误率大幅降低到31.5%;再加入第三和第四个尺度,可进一步把错误率降到30.0%。


Using a different top layer for each class in the regressor network for each class (Per-Class Regressor (PCR) in Fig. 9) surprisingly did not outperform using only a single network shared among all classes (44.1% vs. 31.3%). This may be because there are relatively few examples per class annotated with bounding boxes in the training set, while the network has 1000 times more top-layer parameters, resulting in insufficient training. It is possible this approach may be improved by sharing parameters only among similar classes (e.g. training one network for all classes of dogs, another for vehicles, etc.).

令人意外的是,在回归网络中为每个类别使用不同的顶层(图9中的逐类回归PCR)并没有胜过在所有类别间共享单一网络的做法(错误率44.1%对31.3%)。这可能是因为训练集中每个类别带边界框标注的样本相对较少,而网络顶层的参数却多了1000倍,导致训练不充分。只在相似类别之间共享参数(例如为所有品种的狗训练一个网络,为各种车辆训练另一个网络等)或许可以改进这一方法。

5 Detection 检测任务

Detection training is similar to classification training but in a spatial manner. Multiple location of an image may be trained simultaneously. Since the model is convolutional, all weights are shared among all locations. The main difference with the localization task, is the necessity to predict a background class when no object is present. Traditionally, negative examples are initially taken at random for training. Then the most offending negative errors are added to the training set in bootstrapping passes. Independent bootstrapping passes render training complicated and risk potential mismatches between the negative examples collection and training times. Additionally, the size of bootstrapping passes needs to be tuned to make sure training does not overfit on a small set. To circumvent all these problems, we perform negative training on the fly, by selecting a few interesting negative examples per image such as random ones or most offending ones. This approach is more computationally expensive, but renders the procedure much simpler. And since the feature extraction is initially trained with the classification task, the detection fine-tuning is not as long anyway.

检测任务的训练与分类任务类似,只是以带空间信息的方式进行:一张图像中的多个位置可以同时参与训练。由于模型是卷积式的,所有位置共享全部权重。与定位任务最主要的区别是,当图中没有物体时必须预测一个背景类。传统做法是先随机选取负样本进行训练,然后在bootstrapping轮次中把最容易被误判的负样本加入训练集。独立的bootstrapping轮次使训练流程变得复杂,并存在负样本收集与训练时机不匹配的风险;此外,bootstrapping轮次的规模也需要调节,以确保训练不会在小样本集合上过拟合。为了绕开这些问题,我们即时(on the fly)地进行负样本训练:在每张图像中选取少量有意义的负样本,例如随机的或最容易被误判的负样本。这种方式计算代价更高,但流程简单得多;而且由于特征提取层最初是用分类任务训练的,检测任务的微调本来也不需要太长时间。
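
【译者注】"即时选取负样本"的思路大致如下面的Python草图所示(每张图随机取若干负例、再取若干当前模型最容易误判的"难"负例);函数名与采样数量均为译者的假设,并非论文给出的实现:

```python
import numpy as np

def sample_negatives(bg_scores, n_random=4, n_hard=4, rng=None):
    """bg_scores: 一张图中所有不与任何真实物体重叠的窗口被当前模型
    判为前景的得分(越高越容易误报)。返回被选作负样本的窗口下标。"""
    rng = rng or np.random.default_rng()
    idx = np.arange(len(bg_scores))
    # "最难"负例:当前得分最高、最容易被误判的窗口
    hard = idx[np.argsort(bg_scores)[::-1][:n_hard]]
    # 其余窗口中再随机取若干个
    rest = np.setdiff1d(idx, hard)
    n_rand = min(n_random, len(rest))
    rand = rng.choice(rest, size=n_rand, replace=False) if n_rand else rest[:0]
    return np.concatenate([hard, rand])
```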

In Fig. 11, we report the results of the ILSVRC 2013 competition where our detection system ranked 3rd with 19.4% mean average precision (mAP). We later established a new detection state of the art with 24.3% mAP. Note that there is a large gap between the top 3 methods and other teams (the 4th method yields 11.5% mAP). Additionally, our approach is considerably different from the top 2 other systems which use an initial segmentation step to reduce candidate windows from approximately 200,000 to 2,000. This technique speeds up inference and substantially reduces the number of potential false positives. [29, 1] suggest that detection accuracy drops when using dense sliding window as opposed to selective search which discards unlikely object locations hence reducing false positives. Combined with our method, we may observe similar improvements as seen here between traditional dense methods and segmentation based methods. It should also be noted that we did not fine tune on the detection validation set as NEC and UvA did. The validation and test set distributions differ significantly enough from the training set that this alone improves results by approximately 1 point. The improvement between the two OverFeat results in Fig. 11 are due to longer training times and the use of context, i.e. each scale also uses lower resolution scales as input.

Figure 10: ILSVRC12 and ILSVRC13 competitions results (test set). Our entry is the winner of the ILSVRC13 localization competition with 29.9% error (top 5). Note that training and testing data is the same for both years. The OverFeat entry uses 4 scales and a single-class regression approach.
Figure 11: ILSVRC13 test set Detection results. During the competition, UvA ranked first with 22.6% mAP. In post competition work, we establish a new state of the art with 24.3% mAP. Systems marked with * were pre-trained with the ILSVRC12 classification data.

6 Discussion 探讨

We have presented a multi-scale, sliding window approach that can be used for classification, localization and detection. We applied it to the ILSVRC 2013 datasets, and it currently ranks 4th in classification, 1st in localization and 1st in detection. A second important contribution of our paper is explaining how ConvNets can be effectively used for detection and localization tasks. These were never addressed in [15] and thus we are the first to explain how this can be done in the context of ImageNet 2012. The scheme we propose involves substantial modifications to networks designed for classification, but clearly demonstrate that ConvNets are capable of these more challenging tasks. Our localization approach won the 2013 ILSVRC competition and significantly outperformed all 2012 and 2013 approaches. The detection model was among the top performers during the competition, and ranks first in post-competition results. We have proposed an integrated pipeline that can perform different tasks while sharing a common feature extraction base, entirely learned directly from the pixels. Our approach might still be improved in several ways. (i) For localization, we are not currently back-propping through the whole network; doing so is likely to improve performance. (ii) We are using ℓ2 loss, rather than directly optimizing the intersection-over-union (IOU) criterion on which performance is measured. Swapping the loss to this should be possible since IOU is still differentiable, provided there is some overlap. (iii) Alternate parameterizations of the bounding box may help to decorrelate the outputs, which will aid network training.

我们提出了一种可用于分类、定位和检测的多尺度滑动窗口方法。我们把它应用到ILSVRC 2013数据集上,目前在分类任务中排名第4,在定位和检测任务中均排名第1。本文的第二个重要贡献是解释了ConvNets如何被有效地用于检测和定位任务:这些内容在[15]中从未涉及,因此我们是第一个说明如何在ImageNet 2012的背景下做到这一点的。我们提出的方案对面向分类设计的网络做了大量修改,但清楚地证明了ConvNets有能力完成这些更具挑战性的任务。我们的定位方法赢得了2013年ILSVRC竞赛的冠军,并显著超过了2012和2013年的所有其他方法;检测模型在比赛中名列前茅,在赛后的结果中排名第一。我们提出了一个集成的流水线,不同任务共享一个完全从像素直接学到的特征提取基础。我们的方法仍可能在以下几个方面得到改进:

(i)对于定位任务,我们目前没有对整个网络做端到端的反向传播训练;这样做很可能会进一步提升性能。【目前是在固定分类特征提取层的情况下只训练后面的回归层】

(ii)我们使用的是ℓ2损失,而没有直接优化用于评测性能的交并比(IoU)标准。把损失换成IoU应该是可行的,因为只要存在一定的重叠,IoU仍然是可微的。

(iii)采用边界框的其他参数化方式可能有助于去除输出之间的相关性,从而有利于网络的训练。

 

【完结】

 
