
Readers familiar with BERT will certainly have come across RoBERTa. RoBERTa builds on BERT with a number of improvements, the main ones being:
1. Training corpus: BERT was trained on only ~16 GB of text (the BookCorpus dataset plus English Wikipedia), whereas RoBERTa adds CC-News, OpenWebText, Stories and other corpora, for a total of roughly 160 GB of plain text.
2. Batch size: RoBERTa trains with a much larger batch size, going from 256 up to 8,000.
3. Training time: RoBERTa was trained on 1,024 V100 GPUs for roughly a full day, i.e. with far more compute and for far longer than BERT.
RoBERTa also changes the training recipe itself:
1. Dynamic masking instead of BERT's static masking
2. The NSP (next sentence prediction) task is removed
3. The tokenizer is switched to a byte-level BPE, very similar to the one used by GPT-2 (see the quick check below)
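As a quick illustration of point 3, the byte-level BPE behaviour can be inspected with the Hugging Face tokenizer (this uses the transformers package purely for demonstration; it is not part of the migration code):
```python
from transformers import AutoTokenizer

# RoBERTa uses a byte-level BPE vocabulary of 50,265 tokens, the same design as GPT-2.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
print(tokenizer.vocab_size)                                     # 50265
print(tokenizer.tokenize("MindSpore makes model migration easier."))
# prints a list of subword pieces; tokens that follow a space carry a leading "Ġ" marker
```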
RoBERTa source (Hugging Face):
https://huggingface.co/roberta-base
RoBERTa paper:
https://arxiv.org/abs/1907.11692
In this article we use Huawei's MindSpore framework and migrate the PyTorch version of the RoBERTa model to it. Everyone is welcome to join the development of the MindSpore open-source community!
Environment used in this article:
OS: Ubuntu 18
GPU: RTX 3090
MindSpore version: 1.3
Dataset: SST-2 (sentiment analysis task)
About SST-2:
This is a binary classification dataset; each sentence in the training and validation sets carries a label of 0 or 1.
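For reference, the GLUE release of SST-2 ships as tab-separated files; a quick look at the training split confirms the layout (a small sketch assuming the standard GLUE file names under the dataset path used later in this article):
```python
import pandas as pd

# SST-2 comes as tab-separated files with a text column and a 0/1 label column.
train = pd.read_csv("./mindtext/dataset/SST-2/train.tsv", sep="\t")
print(train.columns.tolist())   # expected: ['sentence', 'label']
print(train['label'].unique())  # expected: [0 1]
```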
Model Weight Conversion
We need to convert the PyTorch RoBERTa weights into weights that MindSpore can use. The approach below is one way to do it, largely following the official API mapping documentation.
Official docs: PyTorch-to-MindSpore API mapping
(https://www.mindspore.cn/docs/migration_guide/zh-CN/r1.5/api_mapping/pytorch_api_mapping.html)
```python
from mindspore.train.serialization import save_checkpoint

# `update_torch_to_ms(torch_param_dict, ms_param_dict, torch_key, ms_key)` is a small
# helper (not shown here) that copies the PyTorch tensor stored under `torch_key`
# into the MindSpore parameter named `ms_key`.


def torch_to_ms(model, torch_model):
    """
    Update the MindSpore RoBERTa model's parameters from the PyTorch model's parameters.
    Args:
        model: MindSpore model
        torch_model: PyTorch model state dict
    """
    print("start load")
    # load torch parameters and mindspore parameters
    torch_param_dict = torch_model
    ms_param_dict = model.parameters_dict()
    count = 0
    for ms_key in ms_param_dict.keys():
        ms_key_tmp = ms_key.split('.')
        if ms_key_tmp[0] == 'roberta_embedding_lookup':
            count += 1
            update_torch_to_ms(torch_param_dict, ms_param_dict, 'embeddings.word_embeddings.weight', ms_key)
        elif ms_key_tmp[0] == 'roberta_embedding_postprocessor':
            if ms_key_tmp[1] == "token_type_embedding":
                count += 1
                update_torch_to_ms(torch_param_dict, ms_param_dict, 'embeddings.token_type_embeddings.weight', ms_key)
            elif ms_key_tmp[1] == "full_position_embedding":
                count += 1
                update_torch_to_ms(torch_param_dict, ms_param_dict, 'embeddings.position_embeddings.weight', ms_key)
            elif ms_key_tmp[1] == "layernorm":
                if ms_key_tmp[2] == "gamma":
                    count += 1
                    update_torch_to_ms(torch_param_dict, ms_param_dict, 'embeddings.LayerNorm.weight', ms_key)
                else:
                    count += 1
                    update_torch_to_ms(torch_param_dict, ms_param_dict, 'embeddings.LayerNorm.bias', ms_key)
        elif ms_key_tmp[0] == "roberta_encoder":
            if ms_key_tmp[3] == 'attention':
                par = ms_key_tmp[4].split('_')[0]
                count += 1
                update_torch_to_ms(torch_param_dict, ms_param_dict,
                                   'encoder.layer.' + ms_key_tmp[2] + '.' + ms_key_tmp[3] + '.self.' + par + '.' + ms_key_tmp[5],
                                   ms_key)
            elif ms_key_tmp[3] == 'attention_output':
                if ms_key_tmp[4] == 'dense':
                    count += 1
                    update_torch_to_ms(torch_param_dict, ms_param_dict,
                                       'encoder.layer.' + ms_key_tmp[2] + '.attention.output.' + ms_key_tmp[4] + '.' + ms_key_tmp[5],
                                       ms_key)
                elif ms_key_tmp[4] == 'layernorm':
                    if ms_key_tmp[5] == 'gamma':
                        count += 1
                        update_torch_to_ms(torch_param_dict, ms_param_dict,
                                           'encoder.layer.' + ms_key_tmp[2] + '.attention.output.LayerNorm.weight',
                                           ms_key)
                    else:
                        count += 1
                        update_torch_to_ms(torch_param_dict, ms_param_dict,
                                           'encoder.layer.' + ms_key_tmp[2] + '.attention.output.LayerNorm.bias',
                                           ms_key)
            elif ms_key_tmp[3] == 'intermediate':
                count += 1
                update_torch_to_ms(torch_param_dict, ms_param_dict,
                                   'encoder.layer.' + ms_key_tmp[2] + '.intermediate.dense.' + ms_key_tmp[4],
                                   ms_key)
            elif ms_key_tmp[3] == 'output':
                if ms_key_tmp[4] == 'dense':
                    count += 1
                    update_torch_to_ms(torch_param_dict, ms_param_dict,
                                       'encoder.layer.' + ms_key_tmp[2] + '.output.dense.' + ms_key_tmp[5],
                                       ms_key)
                else:
                    if ms_key_tmp[5] == 'gamma':
                        count += 1
                        update_torch_to_ms(torch_param_dict, ms_param_dict,
                                           'encoder.layer.' + ms_key_tmp[2] + '.output.LayerNorm.weight',
                                           ms_key)
                    else:
                        count += 1
                        update_torch_to_ms(torch_param_dict, ms_param_dict,
                                           'encoder.layer.' + ms_key_tmp[2] + '.output.LayerNorm.bias',
                                           ms_key)
        if ms_key_tmp[0] == 'dense':
            if ms_key_tmp[1] == 'weight':
                count += 1
                update_torch_to_ms(torch_param_dict, ms_param_dict, 'pooler.dense.weight', ms_key)
            else:
                count += 1
                update_torch_to_ms(torch_param_dict, ms_param_dict, 'pooler.dense.bias', ms_key)

    save_checkpoint(model, "./model/roberta-base.ckpt")
    print(count)
    print("finish load")
```
One thing to watch out for: the converted parameters must correspond exactly, otherwise the weights will fail to load later on, or the loss will refuse to go down during training. After converting, it is worth printing the corresponding parameter keys to make sure nothing is missing or mismatched.
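A quick sanity check is to compare the key sets on both sides after conversion, for example (a minimal sketch; the output path follows the conversion code above, and the PyTorch file name assumes the usual Hugging Face pytorch_model.bin):
```python
import torch
from mindspore.train.serialization import load_checkpoint

# Keys of the original PyTorch state dict (downloaded from the huggingface roberta-base repo).
torch_param_dict = torch.load("./model/pytorch_model.bin", map_location="cpu")
print(len(torch_param_dict), list(torch_param_dict.keys())[:3])

# Keys of the converted MindSpore checkpoint written by torch_to_ms above.
ms_param_dict = load_checkpoint("./model/roberta-base.ckpt")
print(len(ms_param_dict), list(ms_param_dict.keys())[:3])
```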

With that we have a converted checkpoint (roberta-base.ckpt in the code above) ready for loading the weights. Make sure not to mix this checkpoint up with TensorFlow weight files!
Data Processing
The way data is fed in also differs between MindSpore and PyTorch. Here we use our own dataset module, which wraps the preprocessing for several NLP tasks: it converts a dataset into mindrecord form, which is then used for training and evaluation. Let's walk through it.
- """
- SST-2 dataset
- """
- from typing import Union, Dict, List
- import mindspore.dataset as ds
- from ..base_dataset import CLSBaseDataset
-
-
- class SST2Dataset(CLSBaseDataset):
- """
- SST2 dataset.
- Args:
- paths (Union[str, Dict[str, str]], Optional): Dataset file path or Dataset directory path, default None.
- tokenizer (Union[str]): Tokenizer function, default 'spacy'.
- lang (str): Tokenizer language, default 'en'.
- max_size (int, Optional): Vocab max size, default None.
- min_freq (int, Optional): Min word frequency, default None.
- padding (str): Padding token, default `<pad>`.
- unknown (str): Unknown token, default `<unk>`.
- buckets (List[int], Optional): Padding row to the length of buckets, default None.
- Examples:
- >>> sst2 = SST2Dataset(tokenizer='spacy', lang='en')
- # sst2 = SST2Dataset(tokenizer='spacy', lang='en', buckets=[16,32,64])
- >>> ds = sst2()
- """
-
- def __init__(self, paths: Union[str, Dict[str, str]] = None,
- tokenizer: Union[str] = 'spacy', lang: str = 'en', max_size: int = None, min_freq: int = None,
- padding: str = '<pad>', unknown: str = '<unk>',
- buckets: List[int] = None):
- super(SST2Dataset, self).__init__(sep='\t', name='SST-2')
- self._paths = paths
- self._tokenize = tokenizer
- self._lang = lang
- self._vocab_max_size = max_size
- self._vocab_min_freq = min_freq
- self._padding = padding
- self._unknown = unknown
- self._buckets = buckets
-
- def __call__(self) -> Dict[str, ds.MindDataset]:
- self.load(self._paths)
- self.process(tokenizer=self._tokenize, lang=self._lang, max_size=self._vocab_max_size,
- min_freq=self._vocab_min_freq, padding=self._padding,
- unknown=self._unknown, buckets=self._buckets)
- return self.mind_datasets
- from mindtext.dataset.classification import SST2Dataset
-
- #对SST-2情感分析的数据集进行处理 如果本来缓存就有只需要直接读缓存
- dataset = SST2Dataset(paths='./mindtext/dataset/SST-2',
- tokenizer="roberta-base",
- max_length=128,
- truncation_strategy=True,
- columns_list=['input_ids', 'attention_mask','label'],
- test_columns_list=['input_ids', 'attention_mask'],
- batch_size=64 )
-
-
- ds = dataset() #生成对应的train、dev的mindrecord文件
- ds = dataset.from_cache( columns_list=['input_ids', 'attention_mask','label'],
- test_columns_list=['input_ids', 'attention_mask'],
- batch_size=64
- )
- dev_dataset = ds['dev'] #取出转成mindrecord的验证集用于验证
The generated output consists of two files per split, a .mindrecord file and a .mindrecord.db file. Do not rename them arbitrarily: the two files reference each other, and renaming them will make the mindrecord unreadable.
Also, the columns produced by the data pipeline must match the model's inputs; for our RoBERTa model those are ['input_ids', 'attention_mask', 'label'].
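A one-batch sanity check on the dev split is enough to confirm that the column names and shapes line up with what the model expects:
```python
# Iterate a single batch of the mindrecord dev split and inspect its columns.
for batch in dev_dataset.create_dict_iterator(num_epochs=1):
    print(list(batch.keys()))        # expected: ['input_ids', 'attention_mask', 'label']
    print(batch['input_ids'].shape)  # (batch_size, max_length), e.g. (64, 128)
    break
```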
Model Architecture
Project architecture diagram:

For the architecture we mainly follow fastNLP's way of splitting things up, dividing the model into three parts: Encoder, Embedding and Tokenizer. The architecture will be refined further in future work.
1. Embedding
- """Roberta Embedding."""
- import logging
- from typing import Tuple
- import mindspore.nn as nn
- from mindspore import Tensor
- from mindspore.train.serialization import load_checkpoint, load_param_into_net
- from mindtext.modules.encoder.roberta import RobertaModel, RobertaConfig
-
-
- logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
- logger = logging.getLogger(__name__)
-
- class RobertaEmbedding(nn.Cell):
- """
- This is a class that loads pre-trained weight files into the model.
- """
- def __init__(self, roberta_config: RobertaConfig, is_training: bool = False):
- super(RobertaEmbedding, self).__init__()
- self.roberta = RobertaModel(roberta_config, is_training)
-
- def init_robertamodel(self,roberta):
- """
- Manual initialization BertModel
- """
- self.roberta=roberta
-
- def from_pretrain(self, ckpt_file):
- """
- Load the model parameters from checkpoint
- """
- param_dict = load_checkpoint(ckpt_file)
- load_param_into_net(self.roberta, param_dict)
-
- def construct(self, input_ids: Tensor, input_mask: Tensor)-> Tuple[Tensor, Tensor]:
- """
- Returns the result of the model after loading the pre-training weights
- Args:
- input_ids:A vector containing the transformation of characters into corresponding ids.
- input_mask:the mask for input_ids.
- Returns:
- sequence_output:the sequence output .
- pooled_output:the pooled output of first token:cls..
- """
- sequence_output, pooled_output, _ = self.roberta(input_ids, input_mask)
- return sequence_output, pooled_output
This part is responsible for loading the pre-trained weights: the converted roberta-base.ckpt checkpoint we produced in the weight-conversion step is exactly what from_pretrain expects.
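A minimal usage sketch of this wrapper, assuming the yaml config and the converted checkpoint from the earlier sections:
```python
# Build the embedding wrapper from the yaml config and load the converted weights.
roberta_config = RobertaConfig.from_yaml_file("./mindtext/config/test.yaml")
embedding = RobertaEmbedding(roberta_config, is_training=True)
embedding.from_pretrain("./model/roberta-base.ckpt")
```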
2. Encoder
```python
# Excerpt from mindtext/modules/encoder/roberta.py. Helper cells such as
# EmbeddingPostprocessor, RobertaTransformer, SecurityCast and
# CreateAttentionMaskFromInputMask are defined in the same file.
class RobertaModel(nn.Cell):
    """
    RoBERTa encoder, used from mindtext.modules.encoder.roberta.
    Args:
        config (Class): Configuration for RobertaModel.
        is_training (bool): True for training mode, False for eval mode.
        use_one_hot_embeddings (bool): Specifies whether to use one-hot encoding for the embeddings.
            Default: False.
    """

    def __init__(self,
                 config: RobertaConfig,
                 is_training: bool,
                 use_one_hot_embeddings: bool = False):
        super().__init__()
        config = copy.deepcopy(config)
        if not is_training:
            config.hidden_dropout_prob = 0.0
            config.attention_probs_dropout_prob = 0.0

        self.seq_length = config.seq_length
        self.hidden_size = config.hidden_size
        self.num_hidden_layers = config.num_hidden_layers
        self.embedding_size = config.hidden_size
        self.token_type_ids = None
        self.compute_type = numbtpye2mstype(config.compute_type)
        self.last_idx = self.num_hidden_layers - 1
        output_embedding_shape = [-1, self.seq_length, self.embedding_size]

        self.roberta_embedding_lookup = nn.Embedding(
            vocab_size=config.vocab_size,
            embedding_size=self.embedding_size,
            use_one_hot=use_one_hot_embeddings,
            embedding_table=TruncatedNormal(config.initializer_range))

        self.roberta_embedding_postprocessor = EmbeddingPostprocessor(
            embedding_size=self.embedding_size,
            embedding_shape=output_embedding_shape,
            use_token_type=False,
            token_type_vocab_size=config.type_vocab_size,
            use_one_hot_embeddings=use_one_hot_embeddings,
            max_position_embeddings=config.max_position_embeddings,
            dropout_prob=config.hidden_dropout_prob)

        self.roberta_encoder = RobertaTransformer(
            hidden_size=self.hidden_size,
            seq_length=self.seq_length,
            num_attention_heads=config.num_attention_heads,
            num_hidden_layers=self.num_hidden_layers,
            intermediate_size=config.intermediate_size,
            attention_probs_dropout_prob=config.attention_probs_dropout_prob,
            initializer_range=config.initializer_range,
            hidden_dropout_prob=config.hidden_dropout_prob,
            hidden_act=config.hidden_act,
            compute_type=self.compute_type,
            return_all_encoders=True)

        self.cast = P.Cast()
        self.dtype = numbtpye2mstype(config.dtype)
        self.cast_compute_type = SecurityCast()
        self.slice = P.StridedSlice()

        self.squeeze_1 = P.Squeeze(axis=1)
        self.dense = nn.Dense(self.hidden_size, self.hidden_size,
                              activation="tanh",
                              weight_init=TruncatedNormal(config.initializer_range)).to_float(mstype.float32)
        self._create_attention_mask_from_input_mask = CreateAttentionMaskFromInputMask(config)

    def construct(self, input_ids: Tensor, input_mask: Tensor) -> Tuple[Tensor, Tensor, Tensor]:
        """Bidirectional Encoder Representations from Transformers.
        Args:
            input_ids: token ids of the input sequence.
            input_mask: the attention mask for input_ids.
        Returns:
            sequence_output: the per-token sequence output.
            pooled_output: the pooled output of the first ([CLS]) token.
            embedding_tables: the embedding table.
        """
        # embedding
        embedding_tables = self.roberta_embedding_lookup.embedding_table
        word_embeddings = self.roberta_embedding_lookup(input_ids)
        embedding_output = self.roberta_embedding_postprocessor(input_ids,
                                                                word_embeddings)
        # attention mask [batch_size, seq_length, seq_length]
        attention_mask = self._create_attention_mask_from_input_mask(input_mask)

        # roberta encoder
        encoder_output = self.roberta_encoder(self.cast_compute_type(embedding_output),
                                              attention_mask)

        sequence_output = self.cast(encoder_output[self.last_idx], self.dtype)

        # pooler: take the hidden state of the first token and project it through a tanh dense layer
        batch_size = P.Shape()(input_ids)[0]
        sequence_slice = self.slice(sequence_output,
                                    (0, 0, 0),
                                    (batch_size, 1, self.hidden_size),
                                    (1, 1, 1))
        first_token = self.squeeze_1(sequence_slice)
        pooled_output = self.dense(first_token)
        pooled_output = self.cast(pooled_output, self.dtype)
        return sequence_output, pooled_output, embedding_tables
```
This is the model's main body, which lives in roberta.py under the encoder directory. The construct method produces the per-token sequence_output, a pooled_output for the first ([CLS]) token, and the embedding table. Much of this implementation follows the BERT model in MindSpore ModelZoo (when migrating a model, it is well worth browsing ModelZoo on the MindSpore website for reference), so it mostly comes down to wiring together a few key building blocks:
EncoderOutput: applies the residual connection and layer normalization after each sub-layer (see the sketch below).
RobertaAttention: a single multi-head self-attention layer.
RobertaEncoderCell: a single RoBERTa encoder layer.
RobertaTransformer: stacks multiple RobertaEncoderCell layers into the complete RoBERTa encoder.
Reusing these pieces saves a lot of time on model construction and is also a good way to get familiar with the MindSpore framework.
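To make the EncoderOutput pattern concrete, here is a simplified, self-contained sketch of the residual-connection-plus-LayerNorm sub-layer; it illustrates the idea rather than reproducing the exact mindtext implementation:
```python
import mindspore.nn as nn
from mindspore import Tensor


class EncoderOutput(nn.Cell):
    """Sub-layer output: dense projection + dropout, then residual connection and LayerNorm.
    A simplified sketch of the pattern, not the mindtext implementation."""
    def __init__(self, in_channels: int, out_channels: int, dropout_prob: float = 0.1):
        super().__init__()
        self.dense = nn.Dense(in_channels, out_channels)
        self.dropout = nn.Dropout(keep_prob=1.0 - dropout_prob)
        self.layernorm = nn.LayerNorm((out_channels,))

    def construct(self, hidden_states: Tensor, input_tensor: Tensor) -> Tensor:
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        # add the residual connection back in before normalizing
        return self.layernorm(hidden_states + input_tensor)
```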
3. Tokenizer
Tokenization is wrapped inside the dataset module: by specifying the name of a pre-trained model available on the Hugging Face hub, the vocabulary and related files are loaded directly, which is very convenient (see the data-processing code above). If you want to implement a downstream task that the dataset module does not cover yet, you can also use the tokenizer we provide to build mindrecord data in your own way.
```python
# _split, _cn_split and parse_version are helpers defined elsewhere in the module.
def get_tokenizer(tokenize_method: str, lang: str = 'en'):
    """
    Get a tokenizer.
    Args:
        tokenize_method (str): Select a tokenizer method.
        lang (str): Tokenizer language, default English.
    Returns:
        function: A tokenizer function.
    """
    tokenizer_dict = {
        'spacy': None,
        'raw': _split,
        'cn-char': _cn_split,
    }
    if tokenize_method == 'spacy':
        import spacy
        spacy.prefer_gpu()
        if lang != 'en':
            raise RuntimeError("Spacy only supports english")
        if parse_version(spacy.__version__) >= parse_version('3.0'):
            en = spacy.load('en_core_web_sm')
        else:
            en = spacy.load(lang)

        def _spacy_tokenizer(text):
            return [w.text for w in en.tokenizer(text)]

        tokenizer = _spacy_tokenizer
    elif tokenize_method in tokenizer_dict:
        tokenizer = tokenizer_dict[tokenize_method]
    else:
        raise RuntimeError(f"Only support {tokenizer_dict.keys()} tokenizer.")
    return tokenizer
```
Loading the Model Configuration
In the model's roberta.py, the RobertaConfig class loads the corresponding parameters from a yaml file into RobertaModel. The yaml configuration looks like this:
```yaml
seq_length: 128
vocab_size: 50265
hidden_size: 768
bos_token_id: 0
eos_token_id: 2
num_hidden_layers: 12
num_attention_heads: 12
intermediate_size: 3072
hidden_act: "gelu"
hidden_dropout_prob: 0.1
attention_probs_dropout_prob: 0.1
max_position_embeddings: 514
pad_token_id: 1
type_vocab_size: 1
initializer_range: 0.02
use_relative_positions: False
dtype: mstype.float32
compute_type: mstype.float32
```
Model Training
Now for the part everyone has been waiting for: training. First we load the converted MindSpore weights into the RoBERTa model and initialize it. The yaml hyperparameter configuration is read into RobertaModel, which is then instantiated, and the weights are loaded into it with MindSpore's built-in load_checkpoint and load_param_into_net functions. Note that we do not load into the downstream model directly: a RobertaEmbedding wrapper sits in between. Also, is_training must be set to True for training, and num_labels is set according to the downstream task.
```python
roberta_config_file = "./mindtext/config/test.yaml"
roberta_config = RobertaConfig.from_yaml_file(roberta_config_file)
rbm = RobertaModel(roberta_config, True)

# Load the converted MindSpore checkpoint into the backbone.
param_dict = load_checkpoint('./mindtext/pretrain/roberta-base-ms.ckpt')
p = load_param_into_net(rbm, param_dict)

# Wrap the backbone in RobertaEmbedding, then hand it to the classification head.
em = RobertaEmbedding(roberta_config, True)
em.init_robertamodel(rbm)
roberta = RobertaforSequenceClassification(roberta_config, is_training=True, num_labels=2)
roberta.init_embedding(em)
```
Once the weights are loaded, we set the learning rate, number of epochs, loss function, optimizer and so on, and training can begin. Following the paper, the learning rate is set to 3e-5. We also use learning-rate warm-up: the learning rate ramps up first and, once the model has stabilized, training continues from the configured learning rate and decays, which speeds up convergence and improves the final result. The number of epochs is set to 3.
```python
epoch_num = 3
save_path = "./mindtext/pretrain/output/roberta-base_sst.ckpt"
lr_schedule = RobertaLearningRate(learning_rate=3e-5,
                                  end_learning_rate=1e-5,
                                  warmup_steps=int(train_dataset.get_dataset_size() * epoch_num * 0.1),
                                  decay_steps=train_dataset.get_dataset_size() * epoch_num,
                                  power=1.0)
params = roberta.trainable_params()
optimizer = AdamWeightDecay(params, lr_schedule, eps=1e-8)
```
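For intuition, RobertaLearningRate behaves like the familiar warm-up-then-polynomial-decay schedule used for BERT fine-tuning; here is a plain-Python approximation of that rule (an illustration, not the actual RobertaLearningRate source):
```python
def warmup_polynomial_lr(step, learning_rate=3e-5, end_learning_rate=1e-5,
                         warmup_steps=100, decay_steps=1000, power=1.0):
    """Linearly warm up to learning_rate, then decay polynomially towards end_learning_rate."""
    if step < warmup_steps:
        return learning_rate * (step + 1) / warmup_steps
    progress = min(step / decay_steps, 1.0)
    return (learning_rate - end_learning_rate) * (1.0 - progress) ** power + end_learning_rate
```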
With everything in place we enter the training stage; at the end, the fine-tuned model parameters are saved to a ckpt file at the given path.
```python
def train(train_data, roberta, optimizer, save_path, epoch_num):
    # dynamic loss scaling for mixed-precision fine-tuning
    update_cell = DynamicLossScaleUpdateCell(loss_scale_value=2 ** 32, scale_factor=2, scale_window=1000)
    netwithgrads = RobertaFinetuneCell(roberta, optimizer=optimizer, scale_update_cell=update_cell)
    callbacks = [TimeMonitor(train_data.get_dataset_size()), LossCallBack(train_data.get_dataset_size())]
    model = Model(netwithgrads)
    model.train(epoch_num, train_data, callbacks=callbacks, dataset_sink_mode=False)
    save_checkpoint(model.train_network, save_path)
```
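With the dataset, model and optimizer ready, fine-tuning is a single call (here train_dataset is the train split from the data-processing step, i.e. ds['train']):
```python
train_dataset = ds['train']  # mindrecord train split
train(train_dataset, roberta, optimizer, save_path, epoch_num)
```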
Model Evaluation
Evaluation follows roughly the same flow as training: read the dev split that was converted to mindrecord and use it to measure the model's performance.
```python
dataset = SST2Dataset(paths='./mindtext/dataset/SST-2',
                      tokenizer="roberta-base",
                      max_length=128,
                      truncation_strategy=True,
                      columns_list=['input_ids', 'attention_mask', 'label'],
                      test_columns_list=['input_ids', 'attention_mask'],
                      batch_size=64)
ds = dataset.from_cache(columns_list=['input_ids', 'attention_mask', 'label'],
                        test_columns_list=['input_ids', 'attention_mask'],
                        batch_size=64)
dev_dataset = ds['dev']
```
Next, load the hyperparameter configuration and the fine-tuned weight file, which from_pretrain loads into the RoBERTa model. As before, the downstream task needs is_training and num_labels set according to the task (here is_training=False for evaluation).
```python
roberta_config_file = "./mindtext/conf/test.yaml"
roberta_config = RobertaConfig.from_yaml_file(roberta_config_file)

roberta = RobertaforSequenceClassification(roberta_config, is_training=False, num_labels=2, dropout_prob=0.0)
model_path = "./mindtext/pretrain/output/roberta_trainsst2.ckpt"
roberta.from_pretrain(model_path)
```
And then we can evaluate. Since this is a binary text-classification task, the predicted labels are compared against the ground-truth labels, and the model's accuracy is a value between 0 and 1.
```python
import mindspore
from mindspore.nn import Accuracy  # or the project's own Accuracy metric
from tqdm import tqdm


def eval(eval_data, model):
    metric = Accuracy('classification')
    metric.clear()
    squeeze = mindspore.ops.Squeeze(1)
    for batch in tqdm(eval_data.create_dict_iterator(num_epochs=1), total=eval_data.get_dataset_size()):
        input_ids = batch['input_ids']
        input_mask = batch['attention_mask']
        label_ids = batch['label']
        inputs = {"input_ids": input_ids,
                  "input_mask": input_mask}
        output = model(**inputs)
        # turn the logits into probabilities before feeding the accuracy metric
        sm = mindspore.nn.Softmax(axis=-1)
        output = sm(output)
        metric.update(output, squeeze(label_ids))
    print(metric.eval())
```
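Running the evaluation on the dev split prepared above then prints the final accuracy:
```python
eval(dev_dataset, roberta)  # prints SST-2 dev accuracy, a value between 0 and 1
```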