The SPP structure was proposed by Kaiming He (who later proposed ResNet) in the paper "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition". Its main purpose is to remove the requirement that a CNN's input have a fixed size, and it achieves good results in both classification and object detection.
The idea behind SPP had actually been proposed earlier, namely extracting features from an image at several different scales; papers such as "The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features" and "Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories" already describe it.
SPP is essentially an improvement on an even older feature-extraction-and-classification approach: the Bag-of-Words (BoW) model.
Each channel of the feature map coming out of pool5 goes through the SPP layer. Suppose there are k channels, say 64, and the pyramid has 35 bins in total; then the SPP layer's output vector for the feature map is 35 × 64 = 2240-dimensional.
Original text: "In each spatial bin, we pool the responses of each filter (throughout this paper we use max pooling). The outputs of the spatial pyramid pooling are kM-dimensional vectors with the number of bins denoted as M (k is the number of filters in the last convolutional layer). The fixed-dimensional vectors are the input to the fully-connected layer."
A spatial bin is simply one of the small cells that each region is divided into.
The top of the pyramid is typically a single 1 × 1 bin, i.e. a pooling layer that pools over the entire image, which can be called "global pooling". The authors note that many algorithms use global pooling to improve performance, and SPP integrates it in exactly this elegant way: "Interestingly, the coarsest pyramid level has a single bin that covers the entire image. This is in fact a "global pooling" operation, which is also investigated in several concurrent works. In [31], [32] a global average pooling is used to reduce the model size and also reduce overfitting; in [33], a global average pooling is used on the testing stage after all fc layers to improve accuracy; in [34], a global max pooling is used for weakly supervised object recognition. The global pooling operation corresponds to the traditional Bag-of-Words method."
From the description above, the input to the FC layer is k × M, where k is the number of channels and M is the number of spatial bins. Once these two are fixed, the FC layer receives an input of the same size no matter what image size comes in: even if the width and height differ, the dimensionality is guaranteed to be uniform because the bins are the same.
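As a quick sanity check on the k × M relation, here is a tiny Python sketch; the pyramid levels and the channel count k = 256 are chosen purely for illustration (they match the {4×4, 3×3, 2×2, 1×1} example discussed further below):

```python
# Illustrative pyramid: bins per side at each level (16 + 9 + 4 + 1 = 30 bins).
levels = (4, 3, 2, 1)
k = 256                              # channels (filters) of the last conv layer
M = sum(n * n for n in levels)       # total number of spatial bins
print(M, k * M)                      # 30 bins -> FC input is 30 * 256 = 7680-d
```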
The size of each spatial bin, also called the window size in the paper, is ⌈a/n⌉, and the spacing between windows, i.e. the stride, is ⌊a/n⌋, where a × a is the spatial size of the conv feature map and n × n is the number of bins at that pyramid level.
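A minimal PyTorch sketch of this pooling rule (my own illustration, not the paper's code; it assumes a square a × a feature map and ignores the extra padding needed to guarantee exactly n × n outputs at every level):

```python
import math
import torch
import torch.nn.functional as F

def spp_forward(feature_map, levels=(3, 2, 1)):
    """Spatial pyramid pooling over a conv feature map.

    feature_map: tensor of shape (N, k, a, a); levels: bins per side per level.
    Returns a tensor of shape (N, k * M), with M = sum(n * n for n in levels).
    """
    batch, k, a, _ = feature_map.shape
    pooled = []
    for n in levels:
        win = math.ceil(a / n)       # window (bin) size = ceil(a / n)
        stride = math.floor(a / n)   # spacing between windows = floor(a / n)
        out = F.max_pool2d(feature_map, kernel_size=win, stride=stride)
        pooled.append(out.reshape(batch, -1))   # flatten the n x n x k bins
    return torch.cat(pooled, dim=1)             # fixed-length k * M vector
```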
The training procedure and hyper-parameters are similar to R-CNN's, including how positive and negative samples are selected: "We use the ground-truth windows to generate the positive samples. The negative samples are those overlapping a positive window by at most 30% (measured by the intersection-over-union (IoU) ratio). Any negative sample is removed if it overlaps another negative sample by more than 70%."
The key question is how to map a candidate window to a window on the feature map. The paper's approach: "S is the product of all previous strides. In our models, S = 16 for ZF-5 on conv5, and S = 12 for Overfeat-5/7 on conv5/7. Given a window in the image domain, we project the left (top) boundary by: x′ = ⌊x/S⌋ + 1 and the right (bottom) boundary x′ = ⌈x/S⌉ − 1. If the padding is not ⌊p/2⌋, we need to add a proper offset to x." The y coordinates are handled the same way.
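A direct transcription of these two projection formulas (a sketch of my own; it ignores the padding offset mentioned above and defaults to S = 16 as for ZF-5):

```python
import math

def project_to_conv5(x_left, y_top, x_right, y_bottom, S=16):
    """Project an image-domain window onto conv5 feature-map coordinates."""
    return (math.floor(x_left / S) + 1,    # left:   floor(x / S) + 1
            math.floor(y_top / S) + 1,     # top:    floor(y / S) + 1
            math.ceil(x_right / S) - 1,    # right:  ceil(x / S) - 1
            math.ceil(y_bottom / S) - 1)   # bottom: ceil(y / S) - 1
```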
Likewise, SPP-net is also fine-tuned for the object detection task; the details are not reproduced here.
How SPPNet is trained
In principle, SPPNet can be trained on images of arbitrary size. For efficiency, however (GPU/CUDA implementations run faster on batches of fixed-size inputs), the paper trains with two image sizes: 180 × 180 and 224 × 224. "Theoretically, the above network structure can be trained with standard back-propagation [1], regardless of the input image size. But in practice the GPU implementations (such as cuda-convnet [3] and Caffe [35]) are preferably run on fixed input images. Next we describe our training solution that takes advantage of these GPU implementations while still preserving the spatial pyramid pooling behaviors."
The paper presents two training strategies:
Single-size training
Multi-size training
Single-size training
Only 224 × 224 crops are used for training; the cropping is done for data augmentation. "We first consider a network taking a fixed-size input (224×224) cropped from images. The cropping is for the purpose of data augmentation."
The preset pyramid has three levels of bins: (3×3, 2×2, 1×1).
The feature map coming out of pool5 is 13 × 13, which gives the per-level windows and strides worked out below.
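Plugging a = 13 and the three pyramid levels into the window/stride rule above gives the concrete per-level pooling configuration (a quick illustrative check):

```python
import math

a = 13                          # pool5 output size for a 224x224 input
for n in (3, 2, 1):             # the (3x3, 2x2, 1x1) pyramid
    win, stride = math.ceil(a / n), math.floor(a / n)
    print(f"{n}x{n} bins: window = {win}, stride = {stride}")
# 3x3 bins: window = 5,  stride = 4
# 2x2 bins: window = 7,  stride = 6
# 1x1 bins: window = 13, stride = 13
```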
Multi-size training
Two sizes, 180 × 180 and 224 × 224, are used for training. The 180 × 180 input is obtained by resizing the 224 × 224 region (same aspect ratio). "Rather than crop a smaller 180×180 region, we resize the aforementioned 224×224 region to 180×180. So the regions at both scales differ only in resolution but not in content/layout."
Because the input is 180 × 180, the output of pool5 is 10 × 10. But as noted above, the size of the FC input depends only on the number of bins, not on the spatial size of the preceding conv output; only the width and height of each bin change, and max pooling does not care about the bin size. A quick check with the SPP sketch above is shown below.
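Reusing the spp_forward sketch from earlier (with a toy channel count of 64), the output length is indeed the same at both scales:

```python
import torch

fm_224 = torch.randn(1, 64, 13, 13)    # pool5 for a 224x224 input
fm_180 = torch.randn(1, 64, 10, 10)    # pool5 for a 180x180 input
print(spp_forward(fm_224).shape)        # torch.Size([1, 896]) = 64 * (9 + 4 + 1)
print(spp_forward(fm_180).shape)        # torch.Size([1, 896]) -- same fixed length
```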
Training alternates between the two sizes: each size is trained for a full epoch, then training switches to the other size, and so on. "To reduce the overhead to switch from one network (e.g., 224) to the other (e.g., 180), we train each full epoch on one network, and then switch to the other one (keeping all weights) for the next full epoch. This is iterated."
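A hedged sketch of that alternating-epoch schedule (toy model and random data, purely illustrative; AdaptiveMaxPool2d(1) stands in for the full SPP layer so the same weights accept either input size):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2), nn.ReLU(),
                      nn.AdaptiveMaxPool2d(1), nn.Flatten(), nn.Linear(8, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(4):
    size = 224 if epoch % 2 == 0 else 180        # switch scale every full epoch
    for _ in range(10):                          # one toy "epoch" per scale
        x = torch.randn(8, 3, size, size)        # same weights, different input size
        y = torch.randint(0, 10, (8,))
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
```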
The paper also tests training with s × s inputs where s is drawn uniformly at random from [180, 224]. "Besides the above two-scale implementation, we have also tested a variant using s × s as input where s is randomly and uniformly sampled from [180, 224] at each epoch."
All of the baselines produce a 6×6 feature map after the last pooling layer (the channel count is not stated here), followed by two 4096-d FC layers and a final 1000-way softmax output layer. "In the baseline models, the pooling layer after the last convolutional layer generates 6×6 feature maps, with two 4096-d fc layers and a 1000-way softmax layer following."
Adding the SPP structure does not improve accuracy simply by adding parameters to the network. The paper tests a {4×4, 3×3, 2×2, 1×1} pyramid (30 bins in total) on ZF-5; with this structure the fc6 input shrinks from 36×256 to 30×256, so the network actually has fewer parameters than its no-SPP counterpart, yet its accuracy is higher, though still slightly behind the 50-bin SPP. In other words, SPP's gain comes from the structure itself rather than from a larger parameter count. "This network has fewer parameters than its no-SPP counterpart, because its fc6 layer has 30×256-d inputs instead of 36×256-d. The top-1/top-5 errors of this network are 35.06/14.04. This result is similar to the 50-bin pyramid above (34.98/14.14), but considerably better than the no-SPP counterpart (35.99/14.76)."
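To make the parameter comparison concrete, here is the fc6 weight count implied by the quoted input dimensions (bias terms omitted; the 4096-d fc6 width is taken from the baseline description above):

```python
k, fc6_out = 256, 4096
for name, bins in [("no-SPP (6x6)", 36), ("SPP {4,3,2,1}", 30), ("SPP 50-bin", 50)]:
    print(f"{name}: {bins * k * fc6_out:,} fc6 weights")
# no-SPP (6x6):  37,748,736
# SPP {4,3,2,1}: 31,457,280
# SPP 50-bin:    52,428,800
```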
Full-image comparison: up to this point, the images fed into the testing stage were still cropped (10-view). The paper also evaluates full-image views; these are in fact resized, so the full-image information is kept, just at a different resolution. "We resize the image so that min(w, h) = 256 while maintaining its aspect ratio. The SPP-net is applied on this full image to compute the scores of the full view."
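The resize rule quoted above is easy to express; a small Pillow-based sketch (my own helper, not code from the paper):

```python
from PIL import Image

def resize_min_side(img: Image.Image, target: int = 256) -> Image.Image:
    """Resize so that min(w, h) == target while keeping the aspect ratio."""
    w, h = img.size
    scale = target / min(w, h)
    return img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
```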