Mixed precision training accelerates deep neural network training by mixing single-precision (FP32) and half-precision (FP16) data formats, while preserving the accuracy achievable with single-precision training. It speeds up computation, reduces memory usage and memory traffic, and makes it possible to train larger models or batch sizes on a given piece of hardware. A typical MindSpore mixed precision computation flow is as follows:
1. Parameters are stored in FP32.
2. In the forward pass, when an FP16 operator is encountered, the operator's inputs and parameters are cast from FP32 to FP16 for computation.
3. The loss layer is set to compute in FP32.
4. In the backward pass, the loss is first multiplied by the loss scale value so that the backpropagated gradients do not underflow.
5. FP16 parameters take part in the gradient computation, and the resulting gradients are cast back to FP32.
6. Gradients are divided by the loss scale value to restore the amplified gradients to their original magnitude.
7. Gradients are checked for overflow; if an overflow is detected, the update is skipped, otherwise the optimizer updates the original parameters in FP32.
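To make the scaling arithmetic in steps 4 to 7 concrete, here is a minimal NumPy sketch of a single step for a toy linear model. It is not MindSpore code; the loss scale of 1024 and the learning rate of 0.01 are arbitrary illustrative values:

import numpy as np

LOSS_SCALE = 1024.0

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4)).astype(np.float32)
y = rng.standard_normal((8, 1)).astype(np.float32)
w_fp32 = rng.standard_normal((4, 1)).astype(np.float32)   # step 1: FP32 master weights

# Steps 2-3: forward pass in FP16, loss computed in FP32.
w_fp16 = w_fp32.astype(np.float16)
pred = x.astype(np.float16) @ w_fp16
loss = np.mean((pred.astype(np.float32) - y) ** 2)

# Steps 4-5: backward pass on the scaled loss; the FP16 gradient is cast back to FP32.
err = (pred.astype(np.float32) - y) * LOSS_SCALE
grad_fp16 = (x.astype(np.float16).T @ err.astype(np.float16)) * np.float16(2.0 / len(x))
grad_fp32 = grad_fp16.astype(np.float32)

# Step 6: divide by the loss scale to restore the true gradient magnitude.
grad_fp32 /= LOSS_SCALE

# Step 7: skip the update on overflow, otherwise update the FP32 master weights.
if np.isfinite(grad_fp32).all():
    w_fp32 -= 0.01 * grad_fp32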
Because mixed precision accelerates computation and reduces memory usage, consider using it in the following situations:
1. Memory resources are insufficient;
2. Training is too slow.
This document targets users in the following two scenarios:
1. Users who are about to start migrating training code to MindSpore and have a basic understanding of MindSpore;
2. Users who have already finished migrating training code to MindSpore, i.e. who have working MindSpore training code.
MindSpore wraps mixed precision in the mindspore.Model interface, which makes it easy to use. The implementation steps are no different from writing ordinary training code; you only need to set the mixed precision related parameters on Model, such as amp_level, loss_scale_manager and keep_batchnorm_fp32.
In the high-level API code, modify the Model call and set amp_level to "O3"; the network is then trained with mixed precision.
model = Model(net, loss, opt, metrics=metrics, amp_level="O3")
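The remaining parameters mentioned above are passed through the same Model call. Below is a minimal sketch, assuming a FixedLossScaleManager with an illustrative loss scale of 1024; besides "O3" (cast the whole network to FP16), amp_level also accepts "O0" (no mixed precision) and "O2" (a more conservative mode that keeps BatchNorm in FP32).

from mindspore import Model
from mindspore.train.loss_scale_manager import FixedLossScaleManager

# Fixed loss scale of 1024 (illustrative value, not a recommendation).
loss_scale_manager = FixedLossScaleManager(loss_scale=1024)
model = Model(net, loss, opt, metrics=metrics, amp_level="O3",
              loss_scale_manager=loss_scale_manager, keep_batchnorm_fp32=True)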
To use mixed precision with the MindSpore low-level API, you only need to switch the network to mixed precision in the model construction step of the low-level API code. The two versions of model construction are compared below.
Model construction in the MindSpore low-level API code:
import numpy as np
import mindspore
import mindspore.nn as nn
from mindspore import Tensor
from mindspore.nn import TrainOneStepCell
from mindspore.ops import operations as P
from mindspore.ops import functional as F


class BuildTrainNetwork(nn.Cell):
    '''Build train network.'''
    def __init__(self, my_network, my_criterion, train_batch_size, class_num):
        super(BuildTrainNetwork, self).__init__()
        self.network = my_network
        self.criterion = my_criterion
        self.print = P.Print()
        # Initialize self.output
        self.output = mindspore.Parameter(Tensor(np.ones((train_batch_size, class_num)), mindspore.float32), requires_grad=False)

    def construct(self, input_data, label):
        output = self.network(input_data)
        # Get the network output and assign it to self.output
        self.output = output
        loss0 = self.criterion(output, label)
        return loss0


class TrainOneStepCellV2(TrainOneStepCell):
    '''Train one step, returning both the loss and the network output.'''
    def __init__(self, network, optimizer, sens=1.0):
        super(TrainOneStepCellV2, self).__init__(network, optimizer, sens=sens)

    def construct(self, *inputs):
        weights = self.weights
        loss = self.network(*inputs)
        # Obtain self.network from BuildTrainNetwork
        output = self.network.output
        sens = P.Fill()(P.DType()(loss), P.Shape()(loss), self.sens)
        # Get the gradient of the network parameters
        grads = self.grad(self.network, weights)(*inputs, sens)
        grads = self.grad_reducer(grads)
        # Optimize model parameters
        loss = F.depend(loss, self.optimizer(grads))
        return loss, output


model_constructed = BuildTrainNetwork(net, loss_function, TRAIN_BATCH_SIZE, CLASS_NUM)
model_constructed = TrainOneStepCellV2(model_constructed, opt)
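For context, the constructed cell is then driven by an ordinary training loop. The following minimal sketch assumes a MindSpore dataset object named dataset and an epoch count EPOCHS from the surrounding tutorial code:

model_constructed.set_train()
for epoch in range(EPOCHS):
    for input_data, label in dataset.create_tuple_iterator():
        # TrainOneStepCellV2 returns both the loss and the network output.
        loss, output = model_constructed(input_data, label)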
Model construction in the MindSpore low-level API mixed precision code:
import numpy as np
import mindspore
import mindspore.nn as nn
from mindspore import Tensor, context
from mindspore.common import dtype as mstype
from mindspore.nn import TrainOneStepCell
from mindspore.ops import operations as P
from mindspore.ops import functional as F
from mindspore.train import amp


class BuildTrainNetwork(nn.Cell):
    '''Build train network.'''
    def __init__(self, my_network, my_criterion, train_batch_size, class_num):
        super(BuildTrainNetwork, self).__init__()
        self.network = my_network
        self.criterion = my_criterion
        self.print = P.Print()
        # Initialize self.output
        self.output = mindspore.Parameter(Tensor(np.ones((train_batch_size, class_num)), mindspore.float32), requires_grad=False)

    def construct(self, input_data, label):
        output = self.network(input_data)
        # Get the network output and assign it to self.output
        self.output = output
        loss0 = self.criterion(output, label)
        return loss0


class TrainOneStepCellV2(TrainOneStepCell):
    '''Train one step, returning both the loss and the network output.'''
    def __init__(self, network, optimizer, sens=1.0):
        super(TrainOneStepCellV2, self).__init__(network, optimizer, sens=sens)

    def construct(self, *inputs):
        weights = self.weights
        loss = self.network(*inputs)
        # Obtain self.network from BuildTrainNetwork
        output = self.network.output
        sens = P.Fill()(P.DType()(loss), P.Shape()(loss), self.sens)
        # Get the gradient of the network parameters
        grads = self.grad(self.network, weights)(*inputs, sens)
        grads = self.grad_reducer(grads)
        # Optimize model parameters
        loss = F.depend(loss, self.optimizer(grads))
        return loss, output


def build_train_network_step2(network, optimizer, loss_fn=None, level='O0', **kwargs):
    """
    Build the mixed precision training cell automatically.
    """
    amp.validator.check_value_type('network', network, nn.Cell)
    amp.validator.check_value_type('optimizer', optimizer, nn.Optimizer)
    amp.validator.check('level', level, "", ['O0', 'O2', 'O3', "auto"], amp.Rel.IN)

    if level == "auto":
        device_target = context.get_context('device_target')
        if device_target == "GPU":
            level = "O2"
        elif device_target == "Ascend":
            level = "O3"
        else:
            raise ValueError("Level `auto` only support when `device_target` is GPU or Ascend.")

    amp._check_kwargs(kwargs)
    config = dict(amp._config_level[level], **kwargs)
    config = amp.edict(config)

    # Cast the network to FP16, optionally keeping BatchNorm in FP32
    if config.cast_model_type == mstype.float16:
        network.to_float(mstype.float16)
        if config.keep_batchnorm_fp32:
            amp._do_keep_batchnorm_fp32(network)

    if loss_fn:
        network = amp._add_loss_network(network, loss_fn, config.cast_model_type)

    if amp._get_parallel_mode() in (amp.ParallelMode.SEMI_AUTO_PARALLEL, amp.ParallelMode.AUTO_PARALLEL):
        network = amp._VirtualDatasetCell(network)

    # loss_scale is resolved from the config but is not consumed by TrainOneStepCellV2 here
    loss_scale = 1.0
    if config.loss_scale_manager is not None:
        loss_scale_manager = config.loss_scale_manager
        loss_scale = loss_scale_manager.get_loss_scale()
        update_cell = loss_scale_manager.get_update_cell()
        if update_cell is not None:
            # Only the CPU backend does not support `TrainOneStepWithLossScaleCell` for control flow.
            if not context.get_context("enable_ge") and context.get_context("device_target") == "CPU":
                raise ValueError("Only `loss_scale_manager=None` and "
                                 "`loss_scale_manager=FixedLossScaleManager` "
                                 "are supported in current version. If you use `O2` option, "
                                 "use `loss_scale_manager=None` or `FixedLossScaleManager`.")
            network = TrainOneStepCellV2(network, optimizer)
            return network
    network = TrainOneStepCellV2(network, optimizer)
    return network


model_constructed = BuildTrainNetwork(net, loss_function, TRAIN_BATCH_SIZE, CLASS_NUM)
model_constructed = build_train_network_step2(model_constructed, opt, level="O3")
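Note that build_train_network_step2 keeps the structure of MindSpore's amp.build_train_network: validate the arguments, resolve an "auto" level from the device target, cast the network to FP16 (optionally keeping BatchNorm in FP32), attach the loss function, and finally wrap everything in a training cell. The only substitution is that the custom TrainOneStepCellV2 replaces the stock training cells, so the network output is still returned alongside the loss, exactly as in the full precision version.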
Compared with full precision training, switching to mixed precision yields a noticeable throughput gain.
Low-level API: 2000 imgs/sec; low-level API with mixed precision: 3200 imgs/sec
High-level API: 2200 imgs/sec; high-level API with mixed precision: 3300 imgs/sec