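1. Backbone: Darknet-53
YOLOv3 uses Darknet-53 to extract features. Darknet-53 builds on the Darknet-19 structure, but unlike Darknet-19 it introduces a large number of residual structures and replaces the pooling layers (MaxPooling2D) with 3×3 convolution layers (Conv2D) that use a stride of 2. Judging by its classification performance on ImageNet, these changes let Darknet-53 maintain accuracy while greatly improving running speed, which demonstrates its effectiveness as a feature extractor.
Figure 1
Figure 1 shows the Darknet-53 network. It contains 6 standalone convolution layers and 23 Residual blocks. Each Residual block contains 2 convolution layers (one 1×1 and one 3×3), so Darknet-53 has 6 + 23 × 2 = 52 convolution layers in total. It is called Darknet-53 because, in YOLOv3, the first 52 layers are used only for feature extraction and the last layer outputs the prediction; counting that output layer gives 53.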
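The layer count can be checked with a few lines of arithmetic. The sketch below assumes the residual-block repetition counts [1, 2, 8, 8, 4] that the DarkNet code later in this article uses; the variable names are illustrative only.

```python
# Repetition counts of the five residual stages (taken from the DarkNet
# configuration shown later in this article).
layer_nums = [1, 2, 8, 8, 4]

standalone_convs = 6                 # convolution layers outside residual blocks
residual_blocks = sum(layer_nums)    # total number of residual blocks
convs_per_block = 2                  # one 1x1 conv + one 3x3 conv

conv_layers = standalone_convs + residual_blocks * convs_per_block
total_layers = conv_layers + 1       # plus the final output layer

print(residual_blocks, conv_layers, total_layers)  # prints: 23 52 53
```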
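Figure 2
Figure 2 shows the residual structure used in Darknet-53. Residual blocks are added so that the network can be made deeper and extract higher-level semantic features, while also helping to avoid vanishing or exploding gradients. Because of the shortcut path, gradients in backpropagation can travel directly to much earlier layers, which weakens the multiplicative chain-rule effect of differentiating through many layers. As Figure 2 shows, the input first passes through a 1×1 convolution layer, Conv(1×1, stride=1), which halves the number of channels to In_channels/2. It then passes through a 3×3 convolution layer, Conv(3×3, stride=1), for feature extraction, which restores the number of channels from In_channels/2 to In_channels. Finally, the output of the 3×3 convolution is added to the Input carried over by the Shortcut to produce the final Output (the 3×3 output and the Input have the same shape, (In_channels, h, w), so they can be added directly). After the Residual operation, the shape of the feature map is therefore unchanged.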
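Why the shortcut helps gradients can be seen in a toy scalar model. The sketch below is an illustration only, not part of the YOLOv3 code: each "layer" is assumed to be a scalar map f(x) = w * x, so a plain stack of n layers has gradient w^n, while a residual stack y = x + w * x has gradient (1 + w)^n. The function names are made up for this example.

```python
def plain_stack_gradient(w, depth):
    """d(output)/d(input) for `depth` stacked layers f(x) = w * x."""
    grad = 1.0
    for _ in range(depth):
        grad *= w          # chain rule: multiply by each layer's derivative
    return grad


def residual_stack_gradient(w, depth):
    """d(output)/d(input) for `depth` stacked layers y = x + w * x."""
    grad = 1.0
    for _ in range(depth):
        grad *= 1.0 + w    # the identity shortcut contributes the constant 1
    return grad


w, depth = 0.1, 10
plain = plain_stack_gradient(w, depth)        # shrinks towards zero
residual = residual_stack_gradient(w, depth)  # stays of order one
print(f"plain: {plain:.3e}, residual: {residual:.3f}")
```

With small per-layer derivatives the plain product vanishes, while the shortcut keeps a direct path for the gradient. Real networks are not scalar, so this only illustrates the mechanism.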
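2. Overall network architecture
Figure 3
Figure 3 shows the overall architecture of YOLOv3. It uses Darknet-53 for feature extraction and makes predictions from three of its branches. The first branch outputs a 13x13 feature map (for a 416x416 input, for example) and is mainly used to detect large objects (the larger the receptive field, the better it is at detecting large objects). The second branch outputs a 26x26 feature map; the first branch's feature map is upsampled (from 13x13 to 26x26) and fused with the second branch's output to form the final second-branch prediction, which is mainly used to detect medium-sized objects. The third branch outputs a 52x52 feature map; the second branch's feature map is upsampled and fused with the third branch's output to form the final third-branch prediction, which is mainly used to detect small objects. Further details of the network are analyzed step by step in the code walkthrough.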
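To make the three scales concrete, the sketch below computes the grid size of each branch and the total number of predicted boxes. It assumes a 416x416 input (which yields the 13x13, 26x26 and 52x52 grids), downsampling strides of 32, 16 and 8, and three anchors per grid cell as stated in the YoloBlock code comments later in this article; the names are illustrative.

```python
input_size = 416          # assumed square input resolution
anchors_per_cell = 3      # anchors predicted at every grid cell

# Downsampling stride of each prediction branch (large, medium, small objects).
strides = {"large": 32, "medium": 16, "small": 8}

grid_sizes = {name: input_size // s for name, s in strides.items()}
boxes_per_branch = {name: g * g * anchors_per_cell for name, g in grid_sizes.items()}
total_boxes = sum(boxes_per_branch.values())

for name in strides:
    print(name, f"{grid_sizes[name]}x{grid_sizes[name]}", boxes_per_branch[name])
print("total boxes:", total_boxes)
```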
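3. Loss function
Figure 4
The YOLOv3 loss function has three parts: bounding-box loss, confidence loss, and classification loss. Training continues until the network converges and all three losses reach relatively small values.
Project repository: models: Models of MindSpore - Gitee.com
Code overview:
Figure 5
The following mainly analyzes three files, darknet.py, yolo.py, and loss.py, which correspond to the YOLOv3 feature-extraction network, the YOLOv3 network, and the YOLOv3 loss function, respectively.
darknet.py defines two functions and two classes. The conv_block function defines the basic Conv2d+BatchNorm2d+ReLU combination, which is used heavily throughout Darknet-53; the ResidualBlock class defines the residual structure; the DarkNet class assembles the darknet53 network block by block, following the network diagram; and the darknet53 function returns a DarkNet object, serving as the external interface.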
conv_block
- # Imports needed by the darknet.py snippets in this section.
- import mindspore.nn as nn
- import mindspore.ops as ops
-
- # Builds the Conv2d+BatchNorm2d+ReLU combination.
- def conv_block(in_channels, out_channels, kernel_size, stride, dilation=1):
-     """Get a conv2d batchnorm and relu layer"""
-     pad_mode = 'same'
-     padding = 0
-
-     return nn.SequentialCell(
-         [nn.Conv2d(in_channels,
-                    out_channels,
-                    kernel_size=kernel_size,
-                    stride=stride,
-                    padding=padding,
-                    dilation=dilation,
-                    pad_mode=pad_mode),
-          nn.BatchNorm2d(out_channels, momentum=0.1),
-          nn.ReLU()]
-     )
ResidualBlock
- class ResidualBlock(nn.Cell):
-     """
-     DarkNet V1 residual block definition.
-     Args:
-         in_channels: Integer. Input channel.
-         out_channels: Integer. Output channel.
-     Returns:
-         Tensor, output tensor.
-     Examples:
-         ResidualBlock(3, 208)
-     """
-     expansion = 4
-
-     def __init__(self,
-                  in_channels,
-                  out_channels):
-
-         super(ResidualBlock, self).__init__()
-         out_chls = out_channels//2
-         # Define the residual structure: a 1x1 convolution first halves the number of channels,
-         # then a 3x3 convolution restores it.
-         # Neither convolution changes the feature-map size (pad_mode='same', stride=1).
-         # The feature-map size and channel count are unchanged by the block, so the result
-         # can be added to the input with ops.Add().
-         self.conv1 = conv_block(in_channels, out_chls, kernel_size=1, stride=1)
-         self.conv2 = conv_block(out_chls, out_channels, kernel_size=3, stride=1)
-         self.add = ops.Add()
-
-     def construct(self, x):
-         identity = x
-         out = self.conv1(x)
-         out = self.conv2(out)
-         out = self.add(out, identity)
-
-         return out
DarkNet
- class DarkNet(nn.Cell):
-     """
-     DarkNet V1 network.
-     Args:
-         block: Cell. Block for network.
-         layer_nums: List. Numbers of different layers.
-         in_channels: Integer. Input channel.
-         out_channels: Integer. Output channel.
-         detect: Bool. Whether detect or not. Default:False.
-     Returns:
-         Tuple, tuple of output tensor,(f1,f2,f3,f4,f5).
-     Examples:
-         DarkNet(ResidualBlock,
-                 [1, 2, 8, 8, 4],
-                 [32, 64, 128, 256, 512],
-                 [64, 128, 256, 512, 1024],
-                 100)
-     """
-     def __init__(self,
-                  block,
-                  layer_nums,
-                  in_channels,
-                  out_channels,
-                  detect=False):
-         super(DarkNet, self).__init__()
-
-         self.outchannel = out_channels[-1]
-         self.detect = detect
-         # layer_nums lists, in order, how many times the ResidualBlock is repeated in each stage. Actual value: [1, 2, 8, 8, 4]
-         # in_channels gives the number of filters of the first Conv in each stage's ResidualBlock. Actual value: [32, 64, 128, 256, 512]
-         # out_channels gives the number of filters of the second Conv in each stage's ResidualBlock. Actual value: [64, 128, 256, 512, 1024]
-         if not len(layer_nums) == len(in_channels) == len(out_channels) == 5:
-             raise ValueError("the length of layer_num, inchannel, outchannel list must be 5!")
-         # convX are the 6 standalone convolution layers of Darknet53; layerX are its 5 residual stages,
-         # repeated [1, 2, 8, 8, 4] times respectively.
-         self.conv0 = conv_block(3,
-                                 in_channels[0],
-                                 kernel_size=3,
-                                 stride=1)
-         self.conv1 = conv_block(in_channels[0],
-                                 out_channels[0],
-                                 kernel_size=3,
-                                 stride=2)
-         self.layer1 = self._make_layer(block,
-                                        layer_nums[0],
-                                        in_channel=out_channels[0],
-                                        out_channel=out_channels[0])
-         self.conv2 = conv_block(in_channels[1],
-                                 out_channels[1],
-                                 kernel_size=3,
-                                 stride=2)
-         self.layer2 = self._make_layer(block,
-                                        layer_nums[1],
-                                        in_channel=out_channels[1],
-                                        out_channel=out_channels[1])
-         self.conv3 = conv_block(in_channels[2],
-                                 out_channels[2],
-                                 kernel_size=3,
-                                 stride=2)
-         self.layer3 = self._make_layer(block,
-                                        layer_nums[2],
-                                        in_channel=out_channels[2],
-                                        out_channel=out_channels[2])
-         self.conv4 = conv_block(in_channels[3],
-                                 out_channels[3],
-                                 kernel_size=3,
-                                 stride=2)
-         self.layer4 = self._make_layer(block,
-                                        layer_nums[3],
-                                        in_channel=out_channels[3],
-                                        out_channel=out_channels[3])
-         self.conv5 = conv_block(in_channels[4],
-                                 out_channels[4],
-                                 kernel_size=3,
-                                 stride=2)
-         self.layer5 = self._make_layer(block,
-                                        layer_nums[4],
-                                        in_channel=out_channels[4],
-                                        out_channel=out_channels[4])
-
-     def _make_layer(self, block, layer_num, in_channel, out_channel):
-         """
-         Make Layer for DarkNet.
-         :param block: Cell. DarkNet block.
-         :param layer_num: Integer. Layer number.
-         :param in_channel: Integer. Input channel.
-         :param out_channel: Integer. Output channel.
-         Examples:
-             _make_layer(ConvBlock, 1, 128, 256)
-         """
-         layers = []
-         darkblk = block(in_channel, out_channel)
-         layers.append(darkblk)
-
-         for _ in range(1, layer_num):
-             darkblk = block(out_channel, out_channel)
-             layers.append(darkblk)
-
-         return nn.SequentialCell(layers)
-
-     def construct(self, x):
-         c1 = self.conv0(x)
-         c2 = self.conv1(c1)
-         c3 = self.layer1(c2)
-         c4 = self.conv2(c3)
-         c5 = self.layer2(c4)
-         c6 = self.conv3(c5)
-         c7 = self.layer3(c6)
-         c8 = self.conv4(c7)
-         c9 = self.layer4(c8)
-         c10 = self.conv5(c9)
-         c11 = self.layer5(c10)
-         # When used as the YOLOv3 feature extractor, return the feature maps of three branches for later detection.
-         if self.detect:
-             return c7, c9, c11
-
-         return c11
-
-     def get_out_channels(self):
-         return self.outchannel
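The shape bookkeeping in DarkNet.construct can be traced without MindSpore. The sketch below is a simplified stand-in, not the repository code: it assumes pad_mode='same', for which a stride-s convolution produces ceil(size / s) outputs per spatial dimension, and it assumes a 416x416 input. Residual stages leave the shape unchanged, so only the six standalone convolutions are modeled. The helper names are hypothetical.

```python
import math

in_channels = [32, 64, 128, 256, 512]
out_channels = [64, 128, 256, 512, 1024]


def same_conv_shape(shape, out_ch, stride):
    """Output (channels, h, w) of a convolution with pad_mode='same'."""
    _, h, w = shape
    return (out_ch, math.ceil(h / stride), math.ceil(w / stride))


def trace_darknet53(h, w):
    """Return the shapes of the three feature maps used for detection."""
    shape = same_conv_shape((3, h, w), in_channels[0], stride=1)   # conv0
    stage_outputs = []
    for out_ch in out_channels:
        # conv1..conv5 halve the resolution; the residual stage that
        # follows each of them keeps the shape, so it is skipped here.
        shape = same_conv_shape(shape, out_ch, stride=2)
        stage_outputs.append(shape)
    # DarkNet returns c7, c9, c11: the outputs of layer3, layer4, layer5.
    return stage_outputs[2], stage_outputs[3], stage_outputs[4]


f1, f2, f3 = trace_darknet53(416, 416)
print(f1, f2, f3)
```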
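yolo.py defines one function and several classes. _conv_bn_relu defines the Conv2d+BatchNorm2d+LeakyReLU combination. The YoloBlock and YOLOv3 classes form a pair: the former can be regarded as the base building block, and the latter builds on it to produce the outputs for the three feature maps. DetectionBlock and YOLOV3DarkNet53 are paired in the same way, as are YoloLossBlock and YoloWithLossCell. The IOU class computes IoU values. The rest of this section focuses on the YoloBlock and YOLOv3 classes.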
YoloBlock
- # The feature map output by Darknet53 passes through a series of Conv2d+BatchNorm2d+LeakyReLU
- # combinations whose Conv2d layers alternate 1x1 and 3x3 kernels, and finally yields a feature map
- # with 255 channels, where 255 = 3*(80+5): each grid cell is assigned three anchors, and each anchor
- # predicts 80 class scores + 4 box coordinates + 1 objectness confidence.
- class YoloBlock(nn.Cell):
-     """
-     YoloBlock for YOLOv3.
-     Args:
-         in_channels: Integer. Input channel.
-         out_chls: Integer. Middle channel.
-         out_channels: Integer. Output channel.
-     Returns:
-         Tuple, tuple of output tensor,(f1,f2,f3).
-     Examples:
-         YoloBlock(1024, 512, 255)
-     """
-     def __init__(self, in_channels, out_chls, out_channels):
-         super(YoloBlock, self).__init__()
-         out_chls_2 = out_chls*2
-
-         self.conv0 = _conv_bn_relu(in_channels, out_chls, ksize=1)
-         self.conv1 = _conv_bn_relu(out_chls, out_chls_2, ksize=3)
-
-         self.conv2 = _conv_bn_relu(out_chls_2, out_chls, ksize=1)
-         self.conv3 = _conv_bn_relu(out_chls, out_chls_2, ksize=3)
-
-         self.conv4 = _conv_bn_relu(out_chls_2, out_chls, ksize=1)
-         self.conv5 = _conv_bn_relu(out_chls, out_chls_2, ksize=3)
-
-         self.conv6 = nn.Conv2d(out_chls_2, out_channels, kernel_size=1, stride=1, has_bias=True)
-
-     def construct(self, x):
-         c1 = self.conv0(x)
-         c2 = self.conv1(c1)
-
-         c3 = self.conv2(c2)
-         c4 = self.conv3(c3)
-
-         c5 = self.conv4(c4)
-         c6 = self.conv5(c5)
-
-         out = self.conv6(c6)
-         return c5, out
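The 255 output channels can be derived and unpacked as follows. This is a plain-Python sketch based on the figures in the comment at the top of YoloBlock (3 anchors per cell, 80 classes). The per-anchor ordering of 4 box values, then 1 confidence, then the class scores is an assumption made for illustration, and the helper names are hypothetical.

```python
def head_channels(num_anchors=3, num_classes=80):
    """Channels needed per grid cell: anchors * (4 box + 1 conf + classes)."""
    return num_anchors * (4 + 1 + num_classes)


def split_cell_prediction(values, num_anchors=3, num_classes=80):
    """Split one grid cell's channel vector into per-anchor dictionaries."""
    per_anchor = 4 + 1 + num_classes
    if len(values) != num_anchors * per_anchor:
        raise ValueError("unexpected number of channels")
    anchors = []
    for a in range(num_anchors):
        chunk = values[a * per_anchor:(a + 1) * per_anchor]
        anchors.append({
            "box": chunk[:4],          # assumed order: x, y, w, h
            "confidence": chunk[4],    # objectness score
            "classes": chunk[5:],      # one score per class
        })
    return anchors


channels = head_channels()
cell = list(range(channels))           # dummy values standing in for one cell
parts = split_cell_prediction(cell)
print(channels, len(parts), len(parts[0]["classes"]))  # prints: 255 3 80
```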
YOLOv3
- # Built on YoloBlock: produces the outputs for large, medium and small objects from the three
- # feature maps, after upsampling and Conv2d+BatchNorm2d+LeakyReLU combinations.
- class YOLOv3(nn.Cell):
-     """
-     YOLOv3 Network.
-     Note:
-         backbone = darknet53
-     Args:
-         backbone_shape: List. Darknet output channels shape.
-         backbone: Cell. Backbone Network.
-         out_channel: Integer. Output channel.
-     Returns:
-         Tensor, output tensor.
-     Examples:
-         YOLOv3(backbone_shape=[64, 128, 256, 512, 1024],
-                backbone=darknet53(),
-                out_channel=255)
-     """
-     def __init__(self, backbone_shape, backbone, out_channel):
-         super(YOLOv3, self).__init__()
-         self.out_channel = out_channel
-         self.backbone = backbone
-         # The first feature map is 13x13: 1024 input channels, 512 intermediate channels, 255 output channels. Used for large objects.
-         self.backblock0 = YoloBlock(backbone_shape[-1], out_chls=backbone_shape[-2], out_channels=out_channel)
-
-         self.conv1 = _conv_bn_relu(in_channel=backbone_shape[-2], out_channel=backbone_shape[-2]//2, ksize=1)
-         # The second feature map is 26x26, fused with the upsampled output of the first branch:
-         # 512+256 input channels, 256 intermediate channels, 255 output channels.
-         self.backblock1 = YoloBlock(in_channels=backbone_shape[-2]+backbone_shape[-3],
-                                     out_chls=backbone_shape[-3],
-                                     out_channels=out_channel)
-
-         self.conv2 = _conv_bn_relu(in_channel=backbone_shape[-3], out_channel=backbone_shape[-3]//2, ksize=1)
-         # The third feature map is 52x52, fused with the upsampled output of the second branch:
-         # 256+128 input channels, 128 intermediate channels, 255 output channels.
-         self.backblock2 = YoloBlock(in_channels=backbone_shape[-3]+backbone_shape[-4],
-                                     out_chls=backbone_shape[-4],
-                                     out_channels=out_channel)
-         self.concat = ops.Concat(axis=1)
-
-     def construct(self, x):
-         # input_shape of x is (batch_size, 3, h, w)
-         # feature_map1 is (batch_size, backbone_shape[2], h/8, w/8)
-         # feature_map2 is (batch_size, backbone_shape[3], h/16, w/16)
-         # feature_map3 is (batch_size, backbone_shape[4], h/32, w/32)
-         img_hight = ops.Shape()(x)[2]
-         img_width = ops.Shape()(x)[3]
-         feature_map1, feature_map2, feature_map3 = self.backbone(x)
-         con1, big_object_output = self.backblock0(feature_map3)
-
-         con1 = self.conv1(con1)
-         # Upsample with nearest-neighbor interpolation, resizing the feature map from 13x13 to 26x26.
-         ups1 = ops.ResizeNearestNeighbor((img_hight // 16, img_width // 16))(con1)
-         # Both feature maps are now 26x26, so concatenate them along the channel axis.
-         con1 = self.concat((ups1, feature_map2))
-         con2, medium_object_output = self.backblock1(con1)
-
-         con2 = self.conv2(con2)
-         # Upsample with nearest-neighbor interpolation, resizing the feature map from 26x26 to 52x52.
-         ups2 = ops.ResizeNearestNeighbor((img_hight // 8, img_width // 8))(con2)
-         # Both feature maps are now 52x52, so concatenate them along the channel axis.
-         con3 = self.concat((ups2, feature_map1))
-         _, small_object_output = self.backblock2(con3)
-
-         return big_object_output, medium_object_output, small_object_output
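The upsample-and-concatenate step can be illustrated on tiny nested lists. The sketch below is a pure-Python stand-in for ops.ResizeNearestNeighbor and ops.Concat(axis=1) on a single sample, assuming an exact 2x scale factor; the real operators work on batched tensors, and their coordinate rounding may differ for non-integer scales. The function names are hypothetical.

```python
def upsample_nearest_2x(feature):
    """feature: list of channels, each an h x w nested list. Returns 2h x 2w."""
    out = []
    for channel in feature:
        h, w = len(channel), len(channel[0])
        # Every output pixel copies the input pixel at (i // 2, j // 2).
        out.append([[channel[i // 2][j // 2] for j in range(2 * w)]
                    for i in range(2 * h)])
    return out


def concat_channels(a, b):
    """Concatenate two (channels, h, w) features along the channel axis."""
    if (len(a[0]), len(a[0][0])) != (len(b[0]), len(b[0][0])):
        raise ValueError("spatial sizes must match before concatenation")
    return a + b


coarse = [[[1, 2],
           [3, 4]]]                                       # 1 channel, 2x2
fine = [[[0] * 4 for _ in range(4)] for _ in range(2)]    # 2 channels, 4x4

up = upsample_nearest_2x(coarse)
fused = concat_channels(up, fine)
print(up[0])
print(len(fused), len(fused[0]), len(fused[0][0]))
```

In YOLOv3 the same pattern gives 256 + 512 = 768 input channels for backblock1 and 128 + 256 = 384 input channels for backblock2.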
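loss.py defines the concrete formulas of the loss function, covering the object center-coordinate loss, the object width-and-height loss, the box confidence loss, and the object class loss.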
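How the four terms combine can be sketched for a single anchor. The code below is a simplified illustration, not the contents of loss.py: it assumes sigmoid cross-entropy for the center offsets, the confidence and the class scores, a squared error for width and height, and that the box and class terms count only when the anchor is responsible for an object. All names are hypothetical.

```python
import math


def bce_with_logits(logit, target):
    """Numerically stable sigmoid cross-entropy for one value."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def anchor_loss(pred, target):
    """pred/target: dicts with 'xy', 'wh', 'conf', 'cls' for one anchor."""
    obj = target["conf"]   # 1.0 if this anchor is responsible for an object
    xy = obj * sum(bce_with_logits(p, t) for p, t in zip(pred["xy"], target["xy"]))
    wh = obj * 0.5 * sum((t - p) ** 2 for p, t in zip(pred["wh"], target["wh"]))
    conf = bce_with_logits(pred["conf"], target["conf"])
    cls = obj * sum(bce_with_logits(p, t) for p, t in zip(pred["cls"], target["cls"]))
    return {"xy": xy, "wh": wh, "conf": conf, "cls": cls,
            "total": xy + wh + conf + cls}


# Dummy raw predictions (logits) and targets for one anchor with two classes.
pred = {"xy": [0.0, 0.0], "wh": [0.2, -0.1], "conf": 2.0, "cls": [3.0, -3.0]}
target = {"xy": [0.5, 0.5], "wh": [0.2, -0.1], "conf": 1.0, "cls": [1.0, 0.0]}
losses = anchor_loss(pred, target)
print({k: round(v, 4) for k, v in losses.items()})
```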
Training parameters are configured in the default_config.yaml file.
- # Training options
-
- # dataset related
- data_dir: "/cache/data/coco2014/"
- per_batch_size: 32
-
- # network related
- pretrained_backbone: "/cache/checkpoint_path/0-148_92000.ckpt"
- resume_yolov3: ""
-
- # optimizer and lr related
- lr_scheduler: "exponential"
- lr: 0.001
- lr_epochs: "220,250"
- lr_gamma: 0.1
- eta_min: 0.0
- T_max: 320
- max_epoch: 320
- warmup_epochs: 0
- weight_decay: 0.0005
- momentum: 0.9
-
- # loss related
- loss_scale: 1024
- label_smooth: 0
- label_smooth_factor: 0.1
-
- # logging related
- log_interval: 100
- ckpt_path: "outputs/"
- ckpt_interval: -1
- is_save_on_master: 1
-
- # distributed related
- is_distributed: 1
- rank: 0
- group_size: 1
- bind_cpu: True
- device_num: 8
-
- # profiler init
- need_profiler: 0
-
- # reset default config
- training_shape: ""
The YOLOv3 training script train.py first loads the training parameters and environment variables:
- def run_train():
-     """Train function."""
-     if config.lr_scheduler == 'cosine_annealing' and config.max_epoch > config.T_max:
-         config.T_max = config.max_epoch
-     config.lr_epochs = list(map(int, config.lr_epochs.split(',')))
-     config.data_root = os.path.join(config.data_dir, 'train2014')
-     config.annFile = os.path.join(config.data_dir, 'annotations/instances_train2014.json')
-
-     profiler = network_init(config)
train.py then defines the network structure and the loss function:
- network = YOLOV3DarkNet53(is_training=True)
- # default is kaiming-normal
- default_recurisive_init(network)
- load_yolov3_params(config, network)
-
- network = YoloWithLossCell(network)
Define the learning rate and the optimizer:
- lr = get_lr(config)
- opt = nn.Momentum(params=get_param_groups(network), momentum=config.momentum, learning_rate=ms.Tensor(lr),
-                   weight_decay=config.weight_decay, loss_scale=config.loss_scale)
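As a rough picture of what lr_epochs and lr_gamma control, the sketch below implements a milestone-based step decay using the values from default_config.yaml. It is an assumption about the general shape of the schedule, not the repository's get_lr: warm-up and per-step granularity are ignored, and step_decay_lr is a hypothetical helper.

```python
def step_decay_lr(base_lr, milestones, gamma, epoch):
    """Learning rate at `epoch`: multiplied by gamma at every passed milestone."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


# Values copied from default_config.yaml; lr_epochs is parsed the same way
# as in run_train().
lr_epochs = list(map(int, "220,250".split(',')))
base_lr, lr_gamma, max_epoch = 0.001, 0.1, 320

schedule = [step_decay_lr(base_lr, lr_epochs, lr_gamma, e) for e in range(max_epoch)]
for epoch in (0, 219, 220, 250, 319):
    print(epoch, schedule[epoch])
```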
Call the nn.TrainOneStepCell interface to train:
- network = nn.TrainOneStepCell(network, opt, sens=config.loss_scale)
- network.set_train()
-
- t_end = time.time()
- data_loader = ds.create_dict_iterator(output_numpy=True)
- first_step = True
- stop_profiler = False
-
- for epoch_idx in range(config.max_epoch):
-     for step_idx, data in enumerate(data_loader):
-         images = data["image"]
-         input_shape = images.shape[2:4]
-         config.logger.info('iter[{}], shape{}'.format(step_idx, input_shape[0]))
-         images = ms.Tensor.from_numpy(images)
-
-         batch_y_true_0 = ms.Tensor.from_numpy(data['bbox1'])
-         batch_y_true_1 = ms.Tensor.from_numpy(data['bbox2'])
-         batch_y_true_2 = ms.Tensor.from_numpy(data['bbox3'])
-         batch_gt_box0 = ms.Tensor.from_numpy(data['gt_box1'])
-         batch_gt_box1 = ms.Tensor.from_numpy(data['gt_box2'])
-         batch_gt_box2 = ms.Tensor.from_numpy(data['gt_box3'])
-
-         loss = network(images, batch_y_true_0, batch_y_true_1, batch_y_true_2, batch_gt_box0, batch_gt_box1,
-                        batch_gt_box2)
-         loss_meter.update(loss.asnumpy())
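The loop feeds each loss value into loss_meter, whose definition is not shown in this excerpt. A meter of this kind is typically a small running-average helper; the class below is a hypothetical minimal version written for illustration, not the repository's implementation.

```python
class AverageMeter:
    """Keeps a running average of the values passed to update()."""

    def __init__(self, name):
        self.name = name
        self.reset()

    def reset(self):
        """Forget all values seen so far."""
        self.sum = 0.0
        self.count = 0

    def update(self, value, n=1):
        """Add `value`, weighted as if it had been observed `n` times."""
        self.sum += float(value) * n
        self.count += n

    @property
    def avg(self):
        return self.sum / self.count if self.count else 0.0


loss_meter = AverageMeter("loss")
for step_loss in (8.0, 4.0, 3.0):     # dummy per-step losses
    loss_meter.update(step_loss)
print(loss_meter.name, loss_meter.avg)
```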
Sample training output for YOLOv3 on the COCO2014 dataset:
- epoch[0], iter[0], loss:7809.262695, 0.15 imgs/sec, lr:9.746589057613164e-06
- epoch[0], iter[100], loss:2778.349033, 133.92 imgs/sec, lr:0.0009844054002314806
- epoch[0], iter[200], loss:535.517361, 130.54 imgs/sec, lr:0.0019590642768889666
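The inference script first defines the network, loads the weights obtained from training, and then calls set_train(False) to switch to inference mode. An outline of the flow: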
- # define Net
- network = YOLOV3DarkNet53(is_training=False)
- # load checkpoint
- load_parameters(network, config.pretrained)
- # init detection engine
- detection = DetectionEngine(config)
- # eval
- network.set_train(False)
- output_big, output_me, output_small = network(image)
- detection.detect([output_small, output_me, output_big], config.per_batch_size, image_shape, image_id)
Sample inference output for YOLOv3 on the COCO2014 dataset:
- Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.311
- Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.528
- Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.322
- Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.127
- Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.323
- Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.428
- Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.259
- Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.398
- Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.423
- Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.224
- Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.442
- Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.551