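Figure 2 shows the residual structure used in Darknet-53. Residual blocks are added so that the network can be made deeper and extract higher-level semantic features, while the residual structure also helps avoid vanishing or exploding gradients. Because the shortcut adds the input directly to the output, the backward pass has a direct path through which gradients can reach layers far earlier in the network, which weakens the effect of the long chain of multiplications in backpropagation. As Figure 2 shows, the input first passes through a 1×1 convolution, Conv(1×1, stride=1), which halves the channel count to In_channels/2, and then through a 3×3 convolution, Conv(3×3, stride=1), which extracts features and restores the channel count from In_channels/2 to In_channels. Finally, the output of the 3×3 convolution is added to the Input carried over by the shortcut to produce the final Output (at this point the 3×3 output and the Input have the same shape (In_channels, h, w), so they can be added directly). A Residual operation therefore leaves the shape of the input feature map unchanged.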
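The shape bookkeeping described above can be checked without any deep learning framework. The following is a minimal sketch in plain Python, not the MindSpore implementation; the helper names `conv_same_shape` and `residual_block_shape` are made up for illustration, and it assumes 'same' padding so that the spatial size depends only on the stride.

```python
import math


def conv_same_shape(shape, out_channels, stride=1):
    """Output shape of a 'same'-padded convolution on a (channels, h, w) input."""
    _, h, w = shape
    # With 'same' padding the spatial size depends only on the stride.
    return (out_channels, math.ceil(h / stride), math.ceil(w / stride))


def residual_block_shape(shape):
    """Trace shapes through the 1x1 -> 3x3 -> add structure of the residual block."""
    in_channels = shape[0]
    reduced = conv_same_shape(shape, in_channels // 2)   # 1x1 conv halves the channels
    restored = conv_same_shape(reduced, in_channels)     # 3x3 conv restores them
    # The shortcut addition is element-wise, so both operands must match exactly.
    if restored != shape:
        raise ValueError(f"cannot add {restored} to {shape}")
    return reduced, restored


reduced, output = residual_block_shape((64, 208, 208))
print(reduced)  # (32, 208, 208)
print(output)   # (64, 208, 208)
```

Since Output = Input + F(Input), the derivative of Output with respect to Input is the identity plus the derivative of F; that identity term is the path along which gradients pass through the block unchanged.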
Project repository: models: Models of MindSpore - Gitee.com
- # Builds a Conv2d + BatchNorm2d + ReLU combination.
- def conv_block(in_channels, out_channels, kernel_size, stride,dilation=1):
- """Get a conv2d batchnorm and relu layer"""
- pad_mode = 'same'
- padding = 0
- return nn.SequentialCell(
- [nn.Conv2d(in_channels,
- out_channels,
- kernel_size=kernel_size,
- stride=stride,
- padding=padding,
- dilation=dilation,
- pad_mode=pad_mode),
- nn.BatchNorm2d(out_channels, momentum=0.1),
- nn.ReLU()]
- )
- class ResidualBlock(nn.Cell):
- """
- DarkNet V1 residual block definition.
- Args:
- in_channels: Integer. Input channel.
- out_channels: Integer. Output channel.
- Returns:
- Tensor, output tensor.
- Examples:
- ResidualBlock(3, 208)
- """
- expansion = 4
- def __init__(self,
- in_channels,
- out_channels):
- super(ResidualBlock, self).__init__()
- out_chls = out_channels//2
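- # Define the residual structure: a 1x1 convolution first halves the channel count, then a 3x3 convolution restores it.
- # Neither the 1x1 nor the 3x3 convolution changes the feature-map size (pad_mode='same', stride=1).
- # The feature-map size and channel count are unchanged by the residual structure, so ops.Add() can be applied.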
- self.conv1 = conv_block(in_channels, out_chls, kernel_size=1, stride=1)
- self.conv2 = conv_block(out_chls, out_channels, kernel_size=3, stride=1)
- self.add = ops.Add()
- def construct(self, x):
- identity = x
- out = self.conv1(x)
- out = self.conv2(out)
- out = self.add(out, identity)
- return out
- class DarkNet(nn.Cell):
- """
- DarkNet V1 network.
- Args:
- block: Cell. Block for network.
- layer_nums: List. Numbers of different layers.
- in_channels: Integer. Input channel.
- out_channels: Integer. Output channel.
- detect: Bool. Whether detect or not. Default:False.
- Returns:
- Tuple, tuple of output tensor,(f1,f2,f3,f4,f5).
- Examples:
- DarkNet(ResidualBlock,
- [1, 2, 8, 8, 4],
- [32, 64, 128, 256, 512],
- [64, 128, 256, 512, 1024],
- 100)
- """
- def __init__(self,
- block,
- layer_nums,
- in_channels,
- out_channels,
- detect=False):
- super(DarkNet, self).__init__()
- self.outchannel = out_channels[-1]
- self.detect = detect
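- # layer_nums lists, in order, how many times the ResidualBlock repeats in each stage of Darknet53. Actual value: [1, 2, 8, 8, 4]
- # in_channels corresponds to the number of kernels of the first Conv in the ResidualBlock of each Darknet53 stage. Actual value: [32, 64, 128, 256, 512]
- # out_channels corresponds to the number of kernels of the second Conv in the ResidualBlock of each Darknet53 stage. Actual value: [64, 128, 256, 512, 1024]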
- if not len(layer_nums) == len(in_channels) == len(out_channels) == 5:
- raise ValueError("the length of layer_num, inchannel, outchannel list must be 5!")
- # convx are the 6 standalone convolution layers of Darknet53; layerx are its 5 residual stages, repeated [1, 2, 8, 8, 4] times respectively.
- self.conv0 = conv_block(3,
- in_channels[0],
- kernel_size=3,
- stride=1)
- self.conv1 = conv_block(in_channels[0],
- out_channels[0],
- kernel_size=3,
- stride=2)
- self.layer1 = self._make_layer(block,
- layer_nums[0],
- in_channel=out_channels[0],
- out_channel=out_channels[0])
- self.conv2 = conv_block(in_channels[1],
- out_channels[1],
- kernel_size=3,
- stride=2)
- self.layer2 = self._make_layer(block,
- layer_nums[1],
- in_channel=out_channels[1],
- out_channel=out_channels[1])
- self.conv3 = conv_block(in_channels[2],
- out_channels[2],
- kernel_size=3,
- stride=2)
- self.layer3 = self._make_layer(block,
- layer_nums[2],
- in_channel=out_channels[2],
- out_channel=out_channels[2])
- self.conv4 = conv_block(in_channels[3],
- out_channels[3],
- kernel_size=3,
- stride=2)
- self.layer4 = self._make_layer(block,
- layer_nums[3],
- in_channel=out_channels[3],
- out_channel=out_channels[3])
- self.conv5 = conv_block(in_channels[4],
- out_channels[4],
- kernel_size=3,
- stride=2)
- self.layer5 = self._make_layer(block,
- layer_nums[4],
- in_channel=out_channels[4],
- out_channel=out_channels[4])
- def _make_layer(self, block, layer_num, in_channel, out_channel):
- """
- Make Layer for DarkNet.
- :param block: Cell. DarkNet block.
- :param layer_num: Integer. Layer number.
- :param in_channel: Integer. Input channel.
- :param out_channel: Integer. Output channel.
- Examples:
- _make_layer(ConvBlock, 1, 128, 256)
- """
- layers = []
- darkblk = block(in_channel, out_channel)
- layers.append(darkblk)
- for _ in range(1, layer_num):
- darkblk = block(out_channel, out_channel)
- layers.append(darkblk)
- return nn.SequentialCell(layers)
- def construct(self, x):
- c1 = self.conv0(x)
- c2 = self.conv1(c1)
- c3 = self.layer1(c2)
- c4 = self.conv2(c3)
- c5 = self.layer2(c4)
- c6 = self.conv3(c5)
- c7 = self.layer3(c6)
- c8 = self.conv4(c7)
- c9 = self.layer4(c8)
- c10 = self.conv5(c9)
- c11 = self.layer5(c10)
- # When used as the feature extractor for YOLOv3, return the feature maps of three branches for later detection.
- if self.detect:
- return c7, c9, c11
- return c11
- def get_out_channels(self):
- return self.outchannel
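To make the data flow of the backbone above concrete, the sketch below traces the feature-map shape after each residual stage and counts the convolution layers. It is a plain-Python illustration, not part of the MindSpore code: the function name `trace_darknet53` is hypothetical, and the 416×416 input size is an assumption chosen because it yields the 52x52, 26x26 and 13x13 grids mentioned in the comments of this article.

```python
def trace_darknet53(height, width,
                    layer_nums=(1, 2, 8, 8, 4),
                    out_channels=(64, 128, 256, 512, 1024)):
    """Return the (name, channels, h, w) of each residual stage and the conv count."""
    stages = []
    h, w = height, width      # conv0 uses stride 1, so the size is unchanged
    conv_layers = 1           # conv0
    for idx, (repeats, channels) in enumerate(zip(layer_nums, out_channels), start=1):
        # Each stage starts with a stride-2 3x3 convolution ('same' padding),
        # which halves the height and width (rounded up).
        h, w = -(-h // 2), -(-w // 2)
        conv_layers += 1
        # Every ResidualBlock holds two convolutions and keeps the shape.
        conv_layers += 2 * repeats
        stages.append((f"layer{idx}", channels, h, w))
    return stages, conv_layers


stages, conv_layers = trace_darknet53(416, 416)
for stage in stages:
    print(stage)
print("convolution layers:", conv_layers)  # convolution layers: 52
```

Under these assumptions layer3, layer4 and layer5 (c7, c9 and c11 in `construct`) come out as 256×52×52, 512×26×26 and 1024×13×13. The count of 52 convolutions is one short of the "53" in the name because the original classification network ends with a fully connected layer, which this backbone does not include.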
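- # The feature maps output by Darknet53 pass through a series of Conv2d + BatchNorm2d + LeakyReLU combinations, whose Conv2d layers use 1x1 and 3x3 kernels,
- # and end as a feature map with 255 channels, where 255 = 3*(80+5). Each pixel (grid cell) is assigned three anchors,
- # and each anchor predicts 80 class scores + 4 box coordinates + 1 objectness confidence score.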
- class YoloBlock(nn.Cell):
- """
- YoloBlock for YOLOv3.
- Args:
- in_channels: Integer. Input channel.
- out_chls: Integer. Middle channel.
- out_channels: Integer. Output channel.
- Returns:
- Tuple, tuple of output tensor,(f1,f2,f3).
- Examples:
- YoloBlock(1024, 512, 255)
- """
- def __init__(self, in_channels, out_chls, out_channels):
- super(YoloBlock, self).__init__()
- out_chls_2 = out_chls*2
- self.conv0 = _conv_bn_relu(in_channels, out_chls, ksize=1)
- self.conv1 = _conv_bn_relu(out_chls, out_chls_2, ksize=3)
- self.conv2 = _conv_bn_relu(out_chls_2, out_chls, ksize=1)
- self.conv3 = _conv_bn_relu(out_chls, out_chls_2, ksize=3)
- self.conv4 = _conv_bn_relu(out_chls_2, out_chls, ksize=1)
- self.conv5 = _conv_bn_relu(out_chls, out_chls_2, ksize=3)
- self.conv6 = nn.Conv2d(out_chls_2, out_channels, kernel_size=1, stride=1, has_bias=True)
- def construct(self, x):
- c1 = self.conv0(x)
- c2 = self.conv1(c1)
- c3 = self.conv2(c2)
- c4 = self.conv3(c3)
- c5 = self.conv4(c4)
- c6 = self.conv5(c5)
- out = self.conv6(c6)
- return c5, out
- # Built on YoloBlock: produces the outputs for the large, medium, and small feature maps after upsampling and Conv2d + BatchNorm2d + LeakyReLU combinations.
- class YOLOv3(nn.Cell):
- """
- YOLOv3 Network.
- Note:
- backbone = darknet53
- Args:
- backbone_shape: List. Darknet output channels shape.
- backbone: Cell. Backbone Network.
- out_channel: Integer. Output channel.
- Returns:
- Tensor, output tensor.
- Examples:
- YOLOv3(backbone_shape=[64, 128, 256, 512, 1024],
- backbone=darknet53(),
- out_channel=255)
- """
- def __init__(self, backbone_shape, backbone, out_channel):
- super(YOLOv3, self).__init__()
- self.out_channel = out_channel
- self.backbone = backbone
- # First feature map: 13x13 input, 1024 input channels, 512 middle channels, 255 output channels. Used for detecting large objects.
- self.backblock0 = YoloBlock(backbone_shape[-1], out_chls=backbone_shape[-2], out_channels=out_channel)
- self.conv1 = _conv_bn_relu(in_channel=backbone_shape[-2], out_channel=backbone_shape[-2]//2, ksize=1)
- # Second feature map: 26x26 input, fused with the upsampled output of the first branch; 512+256 input channels, 256 middle channels, 255 output channels.
- self.backblock1 = YoloBlock(in_channels=backbone_shape[-2]+backbone_shape[-3],
- out_chls=backbone_shape[-3],
- out_channels=out_channel)
- self.conv2 = _conv_bn_relu(in_channel=backbone_shape[-3], out_channel=backbone_shape[-3]//2, ksize=1)
- # Third feature map: 52x52 input, fused with the upsampled output of the second branch; 256+128 input channels, 128 middle channels, 255 output channels.
- self.backblock2 = YoloBlock(in_channels=backbone_shape[-3]+backbone_shape[-4],
- out_chls=backbone_shape[-4],
- out_channels=out_channel)
- self.concat = ops.Concat(axis=1)
- def construct(self, x):
- # input_shape of x is (batch_size, 3, h, w)
- # feature_map1 is (batch_size, backbone_shape[2], h/8, w/8)
- # feature_map2 is (batch_size, backbone_shape[3], h/16, w/16)
- # feature_map3 is (batch_size, backbone_shape[4], h/32, w/32)
- img_hight = ops.Shape()(x)[2]
- img_width = ops.Shape()(x)[3]
- feature_map1, feature_map2, feature_map3 = self.backbone(x)
- con1, big_object_output = self.backblock0(feature_map3)
- con1 = self.conv1(con1)
- # Upsample with nearest-neighbor interpolation, resizing the feature map from 13x13 to 26x26.
- ups1 = ops.ResizeNearestNeighbor((img_hight // 16, img_width // 16))(con1)
- # Both feature maps are 26x26, so concatenate them directly along the channel axis.
- con1 = self.concat((ups1, feature_map2))
- con2, medium_object_output = self.backblock1(con1)
- con2 = self.conv2(con2)
- # Upsample with nearest-neighbor interpolation, resizing the feature map from 26x26 to 52x52.
- ups2 = ops.ResizeNearestNeighbor((img_hight // 8, img_width // 8))(con2)
- # Both feature maps are 52x52, so concatenate them directly along the channel axis.
- con3 = self.concat((ups2, feature_map1))
- _, small_object_output = self.backblock2(con3)
- return big_object_output, medium_object_output, small_object_output
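The channel arithmetic of the three detection branches is easy to lose track of, so here is a small plain-Python sketch of it. It is an illustration only: `yolo_head_shapes` and `upsample_nearest` are hypothetical helpers, the 416×416 input is an assumption, and `upsample_nearest` shows only the general idea of nearest-neighbor resizing, not how `ops.ResizeNearestNeighbor` is implemented.

```python
def yolo_head_shapes(height, width, backbone_shape=(64, 128, 256, 512, 1024),
                     num_classes=80, anchors_per_cell=3):
    """Trace input channels and output shapes of the three detection branches."""
    # Each anchor predicts 4 box coordinates + 1 objectness score + class scores.
    out_channel = anchors_per_cell * (num_classes + 5)
    # Branch 0 works directly on the deepest backbone feature map (stride 32).
    in0 = backbone_shape[-1]
    # Branch 1 concatenates the channel-halved, upsampled branch-0 features
    # with the stride-16 backbone feature map.
    in1 = backbone_shape[-2] // 2 + backbone_shape[-2]
    # Branch 2 repeats this one level up, with the stride-8 feature map.
    in2 = backbone_shape[-3] // 2 + backbone_shape[-3]
    return [
        ("big", in0, (out_channel, height // 32, width // 32)),
        ("medium", in1, (out_channel, height // 16, width // 16)),
        ("small", in2, (out_channel, height // 8, width // 8)),
    ]


def upsample_nearest(grid, out_h, out_w):
    """Nearest-neighbor resize of a 2-D list: each output cell copies one input cell."""
    in_h, in_w = len(grid), len(grid[0])
    return [[grid[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]


for branch in yolo_head_shapes(416, 416):
    print(branch)
print(upsample_nearest([[1, 2], [3, 4]], 4, 4))
# [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```

With the default channel list, the concatenated inputs come out as 768 and 384 channels, the same values the class computes as `backbone_shape[-2]+backbone_shape[-3]` and `backbone_shape[-3]+backbone_shape[-4]`, because each backbone level doubles the channel count of the previous one.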
- # Training options
- # dataset related
- data_dir: "/cache/data/coco2014/"
- per_batch_size: 32
- # network related
- pretrained_backbone: "/cache/checkpoint_path/0-148_92000.ckpt"
- resume_yolov3: ""
- # optimizer and lr related
- lr_scheduler: "exponential"
- lr: 0.001
- lr_epochs: "220,250"
- lr_gamma: 0.1
- eta_min: 0.0
- T_max: 320
- max_epoch: 320
- warmup_epochs: 0
- weight_decay: 0.0005
- momentum: 0.9
- # loss related
- loss_scale: 1024
- label_smooth: 0
- label_smooth_factor: 0.1
- # logging related
- log_interval: 100
- ckpt_path: "outputs/"
- ckpt_interval: -1
- is_save_on_master: 1
- # distributed related
- is_distributed: 1
- rank: 0
- group_size: 1
- bind_cpu: True
- device_num: 8
- # profiler init
- need_profiler: 0
- # reset default config
- training_shape: ""
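The options `lr`, `lr_epochs`, `lr_gamma` and `max_epoch` together describe a decaying learning rate. The sketch below shows one plausible reading of them, under the assumption (not confirmed by this article) that the rate is multiplied by `lr_gamma` once each epoch listed in `lr_epochs` is reached. The function `milestone_lr` is hypothetical and ignores warmup; the repository's own `get_lr` may behave differently.

```python
def milestone_lr(base_lr, lr_epochs, lr_gamma, max_epoch):
    """Per-epoch learning rates that decay by lr_gamma at each milestone epoch."""
    # The config stores the milestones as a string: "220,250" -> [220, 250].
    milestones = [int(e) for e in lr_epochs.split(",")]
    rates = []
    for epoch in range(max_epoch):
        passed = sum(1 for m in milestones if epoch >= m)  # milestones already reached
        rates.append(base_lr * lr_gamma ** passed)
    return rates


rates = milestone_lr(base_lr=0.001, lr_epochs="220,250", lr_gamma=0.1, max_epoch=320)
print(rates[0], rates[220], rates[250])
```

Under this assumption the rate stays at 0.001 for the first 220 epochs, drops by a factor of ten at epoch 220, and drops by another factor of ten at epoch 250.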
- def run_train():
- """Train function."""
- if config.lr_scheduler == 'cosine_annealing' and config.max_epoch > config.T_max:
- config.T_max = config.max_epoch
- config.lr_epochs = list(map(int, config.lr_epochs.split(',')))
- config.data_root = os.path.join(config.data_dir, 'train2014')
- config.annFile = os.path.join(config.data_dir, 'annotations/instances_train2014.json')
- profiler = network_init(config)
- network = YOLOV3DarkNet53(is_training=True)
- # default is kaiming-normal
- default_recurisive_init(network)
- load_yolov3_params(config, network)
- network = YoloWithLossCell(network)
- lr = get_lr(config)
- opt = nn.Momentum(params=get_param_groups(network), momentum=config.momentum, learning_rate=ms.Tensor(lr),
- weight_decay=config.weight_decay, loss_scale=config.loss_scale)
- network = nn.TrainOneStepCell(network, opt, sens=config.loss_scale)
- network.set_train()
- t_end = time.time()
- data_loader = ds.create_dict_iterator(output_numpy=True)
- first_step = True
- stop_profiler = False
- for epoch_idx in range(config.max_epoch):
- for step_idx, data in enumerate(data_loader):
- images = data["image"]
- input_shape = images.shape[2:4]
- config.logger.info('iter[{}], shape{}'.format(step_idx, input_shape[0]))
- images = ms.Tensor.from_numpy(images)
- batch_y_true_0 = ms.Tensor.from_numpy(data['bbox1'])
- batch_y_true_1 = ms.Tensor.from_numpy(data['bbox2'])
- batch_y_true_2 = ms.Tensor.from_numpy(data['bbox3'])
- batch_gt_box0 = ms.Tensor.from_numpy(data['gt_box1'])
- batch_gt_box1 = ms.Tensor.from_numpy(data['gt_box2'])
- batch_gt_box2 = ms.Tensor.from_numpy(data['gt_box3'])
- loss = network(images, batch_y_true_0, batch_y_true_1, batch_y_true_2, batch_gt_box0, batch_gt_box1,
- batch_gt_box2)
- loss_meter.update(loss.asnumpy())
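The training function passes `config.loss_scale` both to the optimizer (`loss_scale=`) and to `nn.TrainOneStepCell` (`sens=`). This looks like the usual static loss-scaling pattern: gradients are computed from a scaled-up loss and divided by the same factor before the weight update, so the update itself is unchanged. The toy scalar example below illustrates that idea in plain Python. It is a sketch under that assumption, not MindSpore's implementation; `momentum_step` and `loss_grad` are hypothetical names, and weight decay is left out.

```python
def momentum_step(weight, velocity, grad, lr, momentum, loss_scale=1.0):
    """One SGD-with-momentum update; grad is assumed to come from a scaled loss."""
    grad = grad / loss_scale                  # undo the scaling before the update
    velocity = momentum * velocity + grad
    return weight - lr * velocity, velocity


def loss_grad(weight, target=3.0, scale=1.0):
    """Gradient of scale * 0.5 * (weight - target) ** 2 with respect to weight."""
    return scale * (weight - target)


w, v = 1.0, 0.0
plain = momentum_step(w, v, loss_grad(w), lr=0.001, momentum=0.9)
scaled = momentum_step(w, v, loss_grad(w, scale=1024),
                       lr=0.001, momentum=0.9, loss_scale=1024)
print(plain == scaled)  # True
```

The point of scaling is numerical: with reduced-precision arithmetic, small gradient values are less likely to underflow when the loss is multiplied by a large constant such as 1024 during the backward pass.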
Example training output for YOLOv3 on the COCO2014 dataset:
- epoch[0], iter[0], loss:7809.262695, 0.15 imgs/sec, lr:9.746589057613164e-06
- epoch[0], iter[100], loss:2778.349033, 133.92 imgs/sec, lr:0.0009844054002314806
- epoch[0], iter[200], loss:535.517361, 130.54 imgs/sec, lr:0.0019590642768889666
- # define Net
- network = YOLOV3DarkNet53(is_training=False)
- # load checkpoint
- load_parameters(network, config.pretrained)
- # init detection engine
- detection = DetectionEngine(config)
- # eval
- network.set_train(False)
- output_big, output_me, output_small = network(image)
- detection.detect([output_small, output_me, output_big], config.per_batch_size, image_shape, image_id)
Example inference output for YOLOv3 on the COCO2014 dataset:
- Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.311
- Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.528
- Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.322
- Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.127
- Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.323
- Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.428
- Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.259
- Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.398
- Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.423
- Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.224
- Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.442
- Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.551
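Every line of the report above is defined in terms of IoU (intersection over union) between a predicted box and a ground-truth box, and "IoU=0.50:0.95" means the metric is averaged over ten thresholds from 0.50 to 0.95 in steps of 0.05. A minimal sketch of the IoU computation follows. The function `box_iou` is hypothetical, and it assumes boxes given as (x1, y1, x2, y2) corners, which is a choice made for this example rather than the format used by the evaluation code.

```python
def box_iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    # Corners of the overlapping rectangle (if any).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The ten thresholds behind "IoU=0.50:0.95": 0.50, 0.55, ..., 0.95.
thresholds = [round(0.50 + 0.05 * i, 2) for i in range(10)]

iou = box_iou((0, 0, 10, 10), (0, 0, 10, 5))
print(iou)                                  # 0.5
print([t for t in thresholds if iou >= t])  # [0.5]
```

In this example the prediction would count as a match only at the loosest threshold, which is why the AP at IoU=0.50 (0.528) is much higher than the AP averaged over 0.50:0.95 (0.311).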