Generalized Focal Loss v2: Principles and Code Walkthrough

Paper: Generalized Focal Loss V2: Learning Reliable Localization Quality Estimation for Dense Object Detection

Code: GitHub, implus/GFocalV2 (Generalized Focal Loss V2: Learning Reliable Localization Quality Estimation for Dense Object Detection, CVPR 2021)

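Background

Besides the classification and regression branches, one-stage object detectors often include a Localization Quality Estimation (LQE) branch. At inference time the LQE score is usually multiplied by the classification score to give the final score. With LQE, high-quality boxes therefore tend to score higher than low-quality ones, which greatly reduces the risk that a high-quality box is wrongly suppressed during NMS.

LQE in earlier models includes Objectness in YOLO, IoU in IoU-Net, and Centerness in FCOS. These methods share one trait: they all estimate localization quality from raw convolutional features, such as point, border, or region features, as shown in panels (a)-(g) of the figure below.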
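To make the effect of multiplying the two scores concrete, here is a minimal sketch in plain Python. The two candidate boxes and their scores are hypothetical values chosen for illustration, not numbers from the paper.

```python
# Hypothetical candidates for one object: (name, classification score, LQE score).
candidates = [
    ("loose_box", 0.90, 0.40),  # confident class, poor localization
    ("tight_box", 0.80, 0.90),  # slightly lower class score, accurate box
]


def rank(boxes, use_lqe):
    """Sort boxes by the score NMS would rank them with, highest first."""
    def score(box):
        _, cls_score, lqe_score = box
        return cls_score * lqe_score if use_lqe else cls_score

    return [name for name, _, _ in sorted(boxes, key=score, reverse=True)]


by_cls = rank(candidates, use_lqe=False)   # ranking by classification score alone
by_joint = rank(candidates, use_lqe=True)  # ranking by classification * LQE
```

Ranked by classification score alone, the loosely localized box comes first and would suppress the tight one in NMS; with the joint score the order flips.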

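Contributions of this paper

This paper instead estimates localization quality directly from the statistics of the bounding-box distribution. The box distribution was introduced in Generalized Focal Loss v1: it learns a discrete probability distribution for each predicted side to describe the uncertainty of box regression, as shown in panel (a) of the figure below. The authors observe that the statistics of this general distribution correlate strongly with the true localization quality, as shown in panel (b). Specifically, the shape (flatness) of the distribution clearly reflects the localization quality of the prediction: the sharper the distribution, the more accurate the prediction, and vice versa. It is therefore natural to use the distribution statistics to guide the learning of localization quality estimation. The authors propose a lightweight subnetwork, the Distribution-Guided Quality Predictor (DGQP), which uses the box distribution statistics to produce a more reliable LQE score. Building on Generalized Focal Loss v1, the paper adds the DGQP module to obtain a new dense object detector, Generalized Focal Loss v2, which further improves accuracy.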

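Method

As noted above, the flatness of the learned box regression distribution is highly correlated with the quality of the final predicted box, and certain statistics can capture this flatness. As in GFLv1, the regression targets are the distances from the anchor point to the four sides of the ground-truth box. Denote the left, right, top, and bottom sides by $\{l, r, t, b\}$ and define the discrete distribution of side $w$ as $P^{w} = [P^{w}(y_{0}), P^{w}(y_{1}), \ldots, P^{w}(y_{n})]$, $w \in \{l, r, t, b\}$. The authors take the Top-k values of each side's distribution together with their mean and concatenate them into the statistical feature $F \in R^{4(k+1)}$:

$$F = \mathrm{Concat}\left(\{\mathrm{Topkm}(P^{w}) \mid w \in \{l, r, t, b\}\}\right)$$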

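where Topkm(·) denotes the joint operation of taking the Top-k values and their mean, and Concat(·) denotes channel concatenation. Using the Top-k values and their mean as the statistical input has two benefits: (1) Because the sum of $P^{w}$ is fixed, i.e. $\sum_{i=0}^{n} P^{w}(y_{i}) = 1$, the Top-k values and their mean reflect the flatness of the distribution: the larger they are, the sharper the distribution; the smaller they are, the flatter it is. (2) The Top-k values and their mean make the statistical feature insensitive to the relative offset of the distribution within its domain, as shown in the figure below, which gives a robust representation that is not affected by object scale.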
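Both properties can be checked on toy data. The sketch below is plain Python rather than the PyTorch used in the repository; the function name `topkm` and the three example distributions are assumptions made for illustration. Following the released code, the mean is taken over the Top-k values.

```python
def topkm(dist, k=4):
    """Return the k largest probabilities of `dist` followed by their mean."""
    top = sorted(dist, reverse=True)[:k]
    return top + [sum(top) / k]


# Three hypothetical 8-bin distributions, each summing to 1.
sharp = [0.0, 0.0, 0.05, 0.9, 0.05, 0.0, 0.0, 0.0]
shifted = [0.0, 0.0, 0.0, 0.0, 0.0, 0.05, 0.9, 0.05]  # same shape, different position
flat = [0.125] * 8

sharp_stat = topkm(sharp)
shifted_stat = topkm(shifted)
flat_stat = topkm(flat)

# Statistical feature F for one location: Topkm of each side, concatenated.
sides = {"l": sharp, "r": sharp, "t": flat, "b": flat}
feature = [value for w in ("l", "r", "t", "b") for value in topkm(sides[w])]
```

The sharp and shifted distributions yield identical statistics because sorting discards bin positions, while the flat distribution yields a smaller mean. The concatenated feature has 4(k+1) = 20 entries.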

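Given the statistical feature $F$ of the general distribution as input, the authors design a very lightweight subnetwork $\mathcal{F}(\cdot)$ to predict the final IoU quality estimate. It contains only two fully connected layers, followed by ReLU and Sigmoid respectively. The final IoU scalar $I$ is computed as

$$I = \mathcal{F}(F) = \sigma(W_{2}\,\delta(W_{1} F))$$

where $\delta$ and $\sigma$ denote ReLU and Sigmoid respectively, $W_{1} \in R^{p \times 4 (k + 1)}$, $W_{2} \in R^{1 \times p}$, $k$ is the Top-k parameter, and $p$ is the hidden-layer dimension (the paper sets $k = 4, p = 64$). The overall structure of GFLv2 is shown in the figure below, where the red dashed box marks the DGQP.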
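The two-layer computation described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `dgqp` is hypothetical, the weights are random placeholders rather than trained values, and biases are omitted to match the formula.

```python
import math
import random


def dgqp(feature, w1, w2):
    """Compute sigmoid(W2 . relu(W1 . F)) for one location (no biases)."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, feature))) for row in w1]
    logit = sum(w * h for w, h in zip(w2, hidden))
    return 1.0 / (1.0 + math.exp(-logit))


k, p = 4, 64  # Top-k parameter and hidden dimension from the text
rng = random.Random(0)

# Placeholder weights: W1 has shape p x 4(k+1), W2 has shape 1 x p.
w1 = [[rng.uniform(-0.1, 0.1) for _ in range(4 * (k + 1))] for _ in range(p)]
w2 = [rng.uniform(-0.1, 0.1) for _ in range(p)]

# Hypothetical statistical feature: the same Topkm statistics on all four sides.
feature = [0.9, 0.05, 0.05, 0.0, 0.25] * 4
iou_estimate = dgqp(feature, w1, w2)
```

Because the last activation is a Sigmoid, the output always lies strictly between 0 and 1, which matches the range of an IoU target.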

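Code walkthrough

Below is the implementation of the final classification and regression head in GFL v1, where the input x is a single-level feature map output by the backbone and neck.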


def forward_single(self, x, scale):
    """Forward feature of a single scale level.
    Args:
        x (Tensor): Features of a single scale level.
        scale (:obj: `mmcv.cnn.Scale`): Learnable scale module to resize
            the bbox prediction.
    Returns:
        tuple:
            cls_score (Tensor): Cls and quality joint scores for a single
                scale level the channel number is num_classes.
            bbox_pred (Tensor): Box distribution logits for a single scale
                level, the channel number is 4*(n+1), n is max value of
                integral set.
    """
    cls_feat = x  # (2,256,38,38)
    reg_feat = x
    for cls_conv in self.cls_convs:
        cls_feat = cls_conv(cls_feat)
    # (2,256,38,38)
    for reg_conv in self.reg_convs:
        reg_feat = reg_conv(reg_feat)
    # (2,256,38,38)
    cls_score = self.gfl_cls(cls_feat)  # (2,20,38,38)
    bbox_pred = scale(self.gfl_reg(reg_feat)).float()  # (2,68,38,38), 68=4x(16+1)
    return cls_score, bbox_pred

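Below is the GFL v2 implementation.

Here the regression head output bbox_pred is passed through softmax, the Top-k values (k=4) are taken and concatenated with their mean to form the statistical input stat, and stat is fed to the subnetwork reg_conf. The subnetwork consists of two fully connected layers (implemented as 1x1 convolutions), with self.total_dim=k+1=5 and self.reg_channels=64. The resulting quality estimate quality_score is then multiplied by the classification score to give the final classification score.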


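# Excerpt from the head's layer setup: the DGQP subnetwork.
# The two "fully connected" layers are 1x1 convolutions applied at every location.
conf_vector = [nn.Conv2d(4 * self.total_dim, self.reg_channels, 1)]
conf_vector += [self.relu]
conf_vector += [nn.Conv2d(self.reg_channels, 1, 1), nn.Sigmoid()]

self.reg_conf = nn.Sequential(*conf_vector)

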
def forward_single(self, x, scale):
    """Forward feature of a single scale level.
    Args:
        x (Tensor): Features of a single scale level.
        scale (:obj: `mmcv.cnn.Scale`): Learnable scale module to resize
            the bbox prediction.
    Returns:
        tuple:
            cls_score (Tensor): Cls and quality joint scores for a single
                scale level the channel number is num_classes.
            bbox_pred (Tensor): Box distribution logits for a single scale
                level, the channel number is 4*(n+1), n is max value of
                integral set.
    """
    cls_feat = x
    reg_feat = x
    for cls_conv in self.cls_convs:
        cls_feat = cls_conv(cls_feat)
    for reg_conv in self.reg_convs:
        reg_feat = reg_conv(reg_feat)
 
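    # Box distribution logits: (N, 4*(reg_max+1), H, W)
    bbox_pred = scale(self.gfl_reg(reg_feat)).float()
    N, C, H, W = bbox_pred.size()
    # Per-side discrete distribution: (N, 4, reg_max+1, H, W)
    prob = F.softmax(bbox_pred.reshape(N, 4, self.reg_max + 1, H, W), dim=2)
    # k largest probabilities of each side: (N, 4, k, H, W)
    prob_topk, _ = prob.topk(self.reg_topk, dim=2)

    if self.add_mean:
        # Append the mean of the Top-k values: (N, 4, k+1, H, W)
        stat = torch.cat([prob_topk, prob_topk.mean(dim=2, keepdim=True)],
                         dim=2)
    else:
        stat = prob_topk

    # DGQP output, one quality score per location: (N, 1, H, W)
    quality_score = self.reg_conf(stat.reshape(N, -1, H, W))
    # Joint score: classification probability times estimated quality
    cls_score = self.gfl_cls(cls_feat).sigmoid() * quality_score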
 
    return cls_score, bbox_pred
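To get a feel for how lightweight the reg_conf subnetwork is, the short calculation below counts its learnable parameters. It assumes both 1x1 convolutions keep PyTorch's default bias term; the helper name `conv1x1_params` is hypothetical.

```python
def conv1x1_params(in_channels, out_channels, bias=True):
    """Parameter count of a 1x1 convolution: one weight per channel pair plus biases."""
    return in_channels * out_channels + (out_channels if bias else 0)


k, p = 4, 64       # reg_topk and reg_channels from the text
total_dim = k + 1  # Top-k values plus their mean

first_layer = conv1x1_params(4 * total_dim, p)  # 20 input channels -> 64
second_layer = conv1x1_params(p, 1)             # 64 -> 1 quality score
dgqp_params = first_layer + second_layer
```

Under these assumptions the subnetwork has 1409 parameters, shared across all spatial locations of a feature map.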