
CVPR-2020
Deploying convolutional neural networks (CNNs) on embedded devices is difficult due to limited memory and compute resources.
The redundancy in feature maps is an important characteristic of those successful CNNs, but has rarely been investigated in neural architecture design.

The authors propose GhostNet: keep a subset of intrinsic features and generate the relatively redundant features (ghost features) from them with cheap linear transformations (cheap operations), cutting computation while preserving feature diversity. In Figure 1, each pair of same-colored boxes can be read as one intrinsic feature and one ghost feature obtained from it by a linear transformation.
Why keep redundant features at all?
Abundant and even redundant information in the feature maps of well-trained deep neural networks often guarantees a comprehensive understanding of the input data.
Redundancy in feature maps could be an important characteristic for a successful deep neural network.
The authors embrace redundant features, but in a cost-efficient way.
They propose GhostNet, which achieves a better speed-accuracy trade-off than MobileNetV3 on image classification.
Given an input $X \in \mathbb{R}^{c \times h \times w}$,
an ordinary convolution produces the feature map $Y \in \mathbb{R}^{h' \times w' \times n}$ via $Y = X * f + b$,
where $*$ is the convolution operation, $b$ is the bias, and the convolution filters are $f \in \mathbb{R}^{c \times k \times k \times n}$; its FLOPs are $n \cdot h' \cdot w' \cdot c \cdot k \cdot k$.

We point out that it is unnecessary to generate these redundant feature maps one by one with large number of FLOPs and parameters.
Suppose that the output feature maps are “ghosts” of a handful of intrinsic feature maps with some cheap transformations
These intrinsic feature maps are often of smaller size and produced by ordinary convolution filters.
The intrinsic feature maps $Y'$ with $m$ channels are produced by an ordinary convolution:
$Y' = X * f'$, with $Y' \in \mathbb{R}^{h' \times w' \times m}$,
where $f' \in \mathbb{R}^{c \times k \times k \times m}$ and $m \leq n$.
The ghost features are then generated from the intrinsic features $Y'$ as

$y_{ij} = \Phi_{ij}(y'_i), \quad \forall\, i = 1, \ldots, m, \; j = 1, \ldots, s$

where $y'_i$ is the $i$-th intrinsic feature map and $\Phi_{ij}$ is the $j$-th linear operation generating the $j$-th ghost feature map $y_{ij}$; the last operation $\Phi_{is}$ is fixed to be the identity mapping, which preserves the intrinsic feature map.
Each intrinsic feature map thus yields several ghost maps $\{y_{ij}\}_{j=1}^{s}$.
With the ghost mechanism, the final output feature map is
$Y = [y_{11}, y_{12}, \ldots, y_{1s}, y_{21}, \ldots, y_{ms}]$,
where $m$ is the number of intrinsic feature maps (so $n = m \cdot s$).
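To make the two formulas above concrete, here is a minimal functional sketch (my own, not the paper's code) that takes each $\Phi_{ij}$ to be a $3 \times 3$ depth-wise convolution, with the identity branch realized by concatenating $Y'$ itself:

import torch
import torch.nn.functional as F

c, n, s = 16, 64, 2
m = n // s                                  # number of intrinsic feature maps
x = torch.randn(1, c, 32, 32)

f_prime = torch.randn(m, c, 1, 1)           # ordinary conv filters f' (k = 1 here)
y_prime = F.conv2d(x, f_prime)              # intrinsic maps Y', shape (1, m, 32, 32)

phi = torch.randn(m * (s - 1), 1, 3, 3)     # one depth-wise kernel per ghost map
ghosts = F.conv2d(y_prime, phi, padding=1, groups=m)   # ghost maps, (1, m*(s-1), 32, 32)

y = torch.cat([y_prime, ghosts], dim=1)     # identity branch + ghosts -> n output maps
print(y.shape)                              # torch.Size([1, 64, 32, 32])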

After reading the code, the structure looks like this:

(image from https://zhuanlan.zhihu.com/p/115844245)
The cheap operations are implemented as depth-wise convolutions.
1)How does the proposed Ghost module differ from an ordinary convolution?
2)What are the complexities of the Ghost module?
A Ghost module has one identity mapping and $m \cdot (s-1) = \frac{n}{s} \cdot (s-1)$ linear operations.
Compared with an ordinary convolution, the theoretical speed-up ratio is

$r_s = \frac{n \cdot h' \cdot w' \cdot c \cdot k \cdot k}{\frac{n}{s} \cdot h' \cdot w' \cdot c \cdot k \cdot k + (s-1) \cdot \frac{n}{s} \cdot h' \cdot w' \cdot d \cdot d} = \frac{c \cdot k \cdot k}{\frac{1}{s} \cdot c \cdot k \cdot k + \frac{s-1}{s} \cdot d \cdot d} \approx \frac{s \cdot c}{s + c - 1} \approx s$

Of the two terms in the denominator, the first is the ordinary convolution that produces the intrinsic feature maps (input channels $c$, output channels $m = \frac{n}{s}$); the second applies $s-1$ linear operations (e.g. depth-wise convolutions) to each channel of the $m$-channel intrinsic feature maps, where the averaged kernel size of each linear operation is $d \times d$.
The approximation assumes $k \approx d$ and $s \ll c$.
we suggest to take linear operations of the same size (e.g. 3x3 or 5x5) in one Ghost module for efficient implementation.
The parameter compression ratio is

$r_c = \frac{n \cdot c \cdot k \cdot k}{\frac{n}{s} \cdot c \cdot k \cdot k + (s-1) \cdot \frac{n}{s} \cdot d \cdot d} \approx \frac{s \cdot c}{s + c - 1} \approx s$

So both the FLOPs and the parameter count are reduced by roughly a factor of $s$ (the number of maps generated from each intrinsic feature); a larger $s$ leads to a larger compression and speed-up ratio.
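As a quick numerical sanity check on the ratio above (my own sketch, just plugging concrete numbers into the FLOPs formulas):

# With k = d the exact ratio equals s*c / (s + c - 1), which is close to s.
def vanilla_conv_flops(c, n, h, w, k):
    return n * h * w * c * k * k

def ghost_module_flops(c, n, h, w, k, d, s):
    m = n // s                              # intrinsic channels
    primary = m * h * w * c * k * k         # ordinary conv producing the intrinsic maps
    cheap = (s - 1) * m * h * w * d * d     # depth-wise ops producing the ghost maps
    return primary + cheap

c, n, h, w, k, d, s = 128, 128, 56, 56, 3, 3, 2
r = vanilla_conv_flops(c, n, h, w, k) / ghost_module_flops(c, n, h, w, k, d, s)
print(r)                                    # 1.98... = 256/129, i.e. roughly s = 2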
1)Ghost Bottlenecks

A Ghost bottleneck consists of two stacked Ghost modules:
the first acts as an expansion layer, increasing the number of channels;
the second reduces the number of channels to match the shortcut path.
The second Ghost module is not followed by ReLU, an idea borrowed from MobileNetV2 (do not apply ReLU when the number of channels is small).
2)GhostNet

GhostNet-$\alpha$: scale the number of channels in each layer by a width factor $\alpha$.
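A width multiplier of this kind is usually applied roughly as in the sketch below; the make_divisible rounding helper is an assumption on my part (modeled on common MobileNet-style code), not necessarily the exact GhostNet implementation:

# Hypothetical sketch of applying a width factor alpha to per-stage channel counts.
def make_divisible(v, divisor=4, min_value=None):
    if min_value is None:
        min_value = divisor
    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
    if new_v < 0.9 * v:                     # never shrink a layer by more than 10%
        new_v += divisor
    return new_v

base_channels = [16, 24, 40, 80, 112, 160]  # example stage widths, not the official config
alpha = 1.3                                 # e.g. a GhostNet-1.3x variant
print([make_divisible(ch * alpha) for ch in base_channels])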
1)Toy Experiments

Depth-wise convolution is used as the cheap operation here.

there are strong correlations between feature maps in deep neural networks and these redundant feature maps could be generated from several intrinsic feature maps.
Irregular modules (mixing various kinds of linear operations) would reduce the efficiency of computing units, so the authors recommend fixing $d$ and using depth-wise convolution.
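To get a feel for the redundancy claim above, here is a small illustrative sketch (not the paper's toy experiment) that measures pairwise similarity between the channels of a feature map; highly similar channels are exactly the kind of "ghosts" the module aims to generate cheaply:

import torch
import torch.nn.functional as F

def channel_cosine_similarity(feat):
    # feat: (C, H, W) activation of one layer of a trained CNN
    c = feat.size(0)
    flat = F.normalize(feat.reshape(c, -1), dim=1)
    return flat @ flat.t()                  # (C, C) pairwise cosine similarity

feat = torch.randn(16, 28, 28)              # stand-in tensor; use real activations in practice
sim = channel_cosine_similarity(feat)
print(sim.abs().mean())                     # large values would indicate redundant channels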
2)CIFAR-10
a)Analysis on Hyper-parameters
Fix $s = 2$ (two branches) and ablate $d$, the kernel size of the depth-wise convolution in the non-identity-mapping branch.

Here $d = 3$ works best: $1 \times 1$ kernels cannot introduce spatial information, while $d = 5$ or $d = 7$ leads to overfitting and more computation.
Next, fix $d = 3$ and ablate $s$.

FLOPs and parameters drop markedly as $s$ increases, while accuracy degrades only slowly; a larger $s$ leads to a larger compression and speed-up ratio.
b)Comparison with State-of-the-arts

c)Visualization of Feature Maps

Although the generated feature maps are from the primary feature maps, they exactly have significant difference which means the generated features are flexible enough to satisfy the need for the specific task.

3)Large Models on ImageNet
Comparison against different compression approaches.

1)ImageNet Classification

Not much needs to be said here: it is state of the art.
Actual Inference Speed


2)Object Detection

Roughly on par with MobileNetV3.
import math
import torch
import torch.nn as nn


class GhostModule(nn.Module):
    def __init__(self, inp, oup, kernel_size=1, ratio=2, dw_size=3, stride=1, relu=True):
        super(GhostModule, self).__init__()
        self.oup = oup
        init_channels = math.ceil(oup / ratio)        # intrinsic channels m = ceil(n / s)
        new_channels = init_channels * (ratio - 1)    # ghost channels m * (s - 1)

        # ordinary convolution producing the intrinsic feature maps
        self.primary_conv = nn.Sequential(
            nn.Conv2d(inp, init_channels, kernel_size, stride, kernel_size // 2, bias=False),
            nn.BatchNorm2d(init_channels),
            nn.ReLU(inplace=True) if relu else nn.Sequential(),
        )

        # cheap operation: depth-wise convolution generating the ghost feature maps
        self.cheap_operation = nn.Sequential(
            nn.Conv2d(init_channels, new_channels, dw_size, 1, dw_size // 2, groups=init_channels, bias=False),
            nn.BatchNorm2d(new_channels),
            nn.ReLU(inplace=True) if relu else nn.Sequential(),
        )

    def forward(self, x):
        x1 = self.primary_conv(x)
        x2 = self.cheap_operation(x1)
        out = torch.cat([x1, x2], dim=1)
        return out[:, :self.oup, :, :]
class GhostBottleneck(nn.Module):
    """Ghost bottleneck w/ optional SE (SqueezeExcite comes from the official repo and is not shown here)."""

    def __init__(self, in_chs, mid_chs, out_chs, dw_kernel_size=3,
                 stride=1, act_layer=nn.ReLU, se_ratio=0.):
        super(GhostBottleneck, self).__init__()
        has_se = se_ratio is not None and se_ratio > 0.
        self.stride = stride

        # Point-wise expansion
        self.ghost1 = GhostModule(in_chs, mid_chs, relu=True)

        # Depth-wise convolution (only for stride-2 blocks)
        if self.stride > 1:
            self.conv_dw = nn.Conv2d(mid_chs, mid_chs, dw_kernel_size, stride=stride,
                                     padding=(dw_kernel_size - 1) // 2,
                                     groups=mid_chs, bias=False)
            self.bn_dw = nn.BatchNorm2d(mid_chs)

        # Squeeze-and-excitation
        if has_se:
            self.se = SqueezeExcite(mid_chs, se_ratio=se_ratio)
        else:
            self.se = None

        # Point-wise linear projection
        self.ghost2 = GhostModule(mid_chs, out_chs, relu=False)

        # Shortcut
        if in_chs == out_chs and self.stride == 1:
            self.shortcut = nn.Sequential()
        else:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_chs, in_chs, dw_kernel_size, stride=stride,
                          padding=(dw_kernel_size - 1) // 2, groups=in_chs, bias=False),
                nn.BatchNorm2d(in_chs),
                nn.Conv2d(in_chs, out_chs, 1, stride=1, padding=0, bias=False),
                nn.BatchNorm2d(out_chs),
            )

    def forward(self, x):
        residual = x
        # 1st Ghost module (expansion)
        x = self.ghost1(x)
        # Depth-wise convolution
        if self.stride > 1:
            x = self.conv_dw(x)
            x = self.bn_dw(x)
        # Squeeze-and-excitation
        if self.se is not None:
            x = self.se(x)
        # 2nd Ghost module (projection, no ReLU)
        x = self.ghost2(x)
        x += self.shortcut(residual)
        return x
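A quick shape check for the two modules above (my own snippet; keeping the default se_ratio=0. avoids the SqueezeExcite dependency that is not shown in this post):

x = torch.randn(1, 16, 56, 56)

gm = GhostModule(16, 32, ratio=2, dw_size=3)
print(gm(x).shape)                          # torch.Size([1, 32, 56, 56])

gb = GhostBottleneck(in_chs=16, mid_chs=48, out_chs=24, dw_kernel_size=3, stride=2)
print(gb(x).shape)                          # torch.Size([1, 24, 28, 28])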
Some fun takes excerpted from other write-ups:
Haha, it is basically a depth-wise separable convolution run in reverse order. Fair enough.