Fine-tuning LLaMA with MindSpore on the OpenI Platform

Clone the pretrained model

Clone the open_llama_7b repository and download the model weight files:

    git lfs install
    git clone https://huggingface.co/openlm-research/open_llama_7b
    
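The weight files are pulled through Git LFS and are several gigabytes, so it is worth confirming they were fully downloaded rather than left as small pointer stubs. A minimal sketch, assuming the repository was cloned to the path used later in this walkthrough:

    # Sanity check (hypothetical helper): confirm the LFS-managed weight files
    # were actually downloaded and are not still tiny pointer stubs.
    import os

    repo_dir = "/home/ma-user/work/models/open_llama_7b"  # assumed clone location
    for name in sorted(os.listdir(repo_dir)):
        path = os.path.join(repo_dir, name)
        if os.path.isfile(path):
            size_mb = os.path.getsize(path) / 1024 / 1024
            print(f"{name:40s} {size_mb:10.1f} MB")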

Prepare the environment

Install Transformers

    pip install transformers
    

Run the weight conversion script

    python mindformers/models/glm/convert_weight.py --pt_ckpt_path /home/ma-user/work/models/mindspore/pt_glm_6b.pth --ms_ckpt_path ../models/mindspore/ms_glm_6b.ckpt
    

Running the conversion script produces the converted output file ms_glm_6b.ckpt.
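Once the conversion finishes, a quick sanity check is to load the checkpoint back with MindSpore and inspect a few parameters. A minimal sketch, assuming MindSpore is importable in the current environment and using the output path from the command above:

    # Load the converted MindSpore checkpoint and print a few parameter names/shapes.
    import mindspore as ms

    param_dict = ms.load_checkpoint("../models/mindspore/ms_glm_6b.ckpt")
    print(f"{len(param_dict)} parameters loaded")
    for name, param in list(param_dict.items())[:5]:
        print(name, tuple(param.shape))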

Note: you may encounter an error related to loading torch's bundled libgomp library. Workaround:

    export LD_PRELOAD=$LD_PRELOAD:/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/torch/lib/libgomp-d22c30c5.so.1 
    

How it works: find libgomp-d22c30c5.so.1 inside the installed torch package and add it to the LD_PRELOAD environment variable. This error appears to occur only on ARM platforms.
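The exact location of the bundled library differs between environments; the following sketch (assuming torch is importable in the current Python environment) prints the path to put in LD_PRELOAD:

    # Print the libgomp shared object bundled with the installed torch wheel,
    # so the correct path can be added to LD_PRELOAD.
    import glob
    import os
    import torch

    torch_lib_dir = os.path.join(os.path.dirname(torch.__file__), "lib")
    for so_path in glob.glob(os.path.join(torch_lib_dir, "libgomp*")):
        print(so_path)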

Prepare the fine-tuning dataset

Fine-tuning method: LoRA

A preprocessing script for the alpaca dataset is currently provided for full-parameter fine-tuning and LoRA fine-tuning tasks.

Dataset URL: https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json

Sample of the raw alpaca dataset format:

    # alpaca examples:
        {
            "instruction": "Describe a time when you had to make a difficult decision.",
            "input": "",
            "output": "I had to make a difficult decision when I was working as a project manager at a construction company. I was in charge of a project that needed to be completed by a certain date in order to meet the client\u2019s expectations. However, due to unexpected delays, we were not able to meet the deadline and so I had to make a difficult decision. I decided to extend the deadline, but I had to stretch the team\u2019s resources even further and increase the budget. Although it was a risky decision, I ultimately decided to go ahead with it to ensure that the project was completed on time and that the client\u2019s expectations were met. The project was eventually successfully completed and this was seen as a testament to my leadership and decision-making abilities."
        },
        {
            "instruction": "Identify the odd one out.",
            "input": "Twitter, Instagram, Telegram",
            "output": "Telegram"
        },
    
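Before converting, it can help to confirm the downloaded file has the expected instruction/input/output fields. A minimal sketch, assuming the dataset was saved to the path used in the conversion command below:

    # Load the raw alpaca dataset and print basic statistics about its fields.
    import json

    with open("/home/ma-user/work/data/alpaca_data.json", "r", encoding="utf-8") as f:
        records = json.load(f)

    print(f"{len(records)} records")
    print("fields:", sorted(records[0].keys()))
    with_input = sum(1 for r in records if r.get("input"))
    print(f"{with_input} records have a non-empty 'input' field")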

Run alpaca_converter.py, which uses the fastchat tool to add prompt templates and convert the raw dataset into a multi-turn conversation format:

    # Script path: tools/dataset_preprocess/llama/alpaca_converter.py
    # Run the conversion script
    python alpaca_converter.py \
    --data_path /home/ma-user/work/data/alpaca_data.json \
    --output_path /home/ma-user/work/data/alpaca-data-conversation.json
    

Parameter description

    # Parameters
    data_path: path to the raw alpaca data
    output_path: output path for the converted conversation-format data
    

Sample after conversion:

    {
        "id": "1",
        "conversations": [
          {
            "from": "human",
            "value": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:"
          },
          {
            "from": "gpt",
            "value": "1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule."
          }
        ]
      },
    
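Conceptually, the conversion wraps each alpaca record in the prompt template shown above and pairs it with the original output as the assistant turn. The following sketch illustrates that mapping only; it is not the actual alpaca_converter.py implementation, and the with-input template text here is the standard alpaca prompt rather than something taken from this article:

    # Illustration: convert raw alpaca records into the conversation format shown above.
    import json

    PROMPT_NO_INPUT = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Response:"
    )
    PROMPT_WITH_INPUT = (
        "Below is an instruction that describes a task, paired with an input that "
        "provides further context. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
    )

    def convert(records):
        conversations = []
        for i, rec in enumerate(records, start=1):
            template = PROMPT_WITH_INPUT if rec.get("input") else PROMPT_NO_INPUT
            conversations.append({
                "id": str(i),
                "conversations": [
                    {"from": "human", "value": template.format(**rec)},
                    {"from": "gpt", "value": rec["output"]},
                ],
            })
        return conversations

    if __name__ == "__main__":
        with open("/home/ma-user/work/data/alpaca_data.json", encoding="utf-8") as f:
            converted = convert(json.load(f))
        with open("/home/ma-user/work/data/alpaca-data-conversation.json", "w", encoding="utf-8") as f:
            json.dump(converted, f, ensure_ascii=False, indent=2)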

Run llama_preprocess.py to preprocess the data and generate MindRecord files, converting the prompt-templated data into MindRecord format.

Install the dependency:

    pip install "fschat[model_worker,webui]"
    

Run the script:

    # Script path: tools/dataset_preprocess/llama/llama_preprocess.py
    # This tool relies on the fschat package to parse prompt templates; install fschat >= 0.2.13 (Python 3.9) beforehand
    python llama_preprocess.py \
    --dataset_type qa \
    --input_glob /home/ma-user/work/data/alpaca-data-conversation.json \
    --model_file /home/ma-user/work/models/open_llama_7b/tokenizer.model \
    --seq_length 2048 \
    --output_file /home/ma-user/work/models/alpaca-fastchat2048.mindrecord
    
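After preprocessing, the generated MindRecord file can be spot-checked with MindSpore's dataset API. A minimal sketch, assuming MindSpore is available; the column names match the input_columns used in the training configuration below:

    # Read back one sample from the generated MindRecord file.
    import mindspore.dataset as ds

    dataset = ds.MindDataset(
        dataset_files="/home/ma-user/work/models/alpaca-fastchat2048.mindrecord",
        columns_list=["input_ids", "labels"],
    )
    print("number of samples:", dataset.get_dataset_size())
    for sample in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
        print("input_ids shape:", sample["input_ids"].shape)
        print("labels shape:", sample["labels"].shape)
        break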

LoRA fine-tuning

LoRA fine-tuning currently supports the llama_7b model, and a default configuration file is provided at configs/llama/run_llama_7b_lora.yaml.

• Step 1. Modify the configuration file: as with full-parameter fine-tuning, update the training dataset path and the pretrained weight path (a sketch of how to patch these keys follows this list).
• Step 2. Launch the LoRA fine-tuning task.
  Note: the llama_7b_lora model supports single-card launch; set use_parallel to False in the configuration file.
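The paths in step 1 can be edited by hand, or patched programmatically. A minimal sketch using PyYAML; the key names match the run_llama_7b_lora.yaml listed further below, and the checkpoint and MindRecord paths are the ones produced earlier in this walkthrough:

    # Patch the checkpoint path, dataset path, and parallel switch in the LoRA config.
    import yaml

    config_path = "./configs/llama/run_llama_7b_lora.yaml"
    with open(config_path, "r", encoding="utf-8") as f:
        cfg = yaml.safe_load(f)

    cfg["load_checkpoint"] = "/home/ma-user/work/models/mindspore/open_llama_7b_ms.ckpt"
    cfg["train_dataset"]["data_loader"]["dataset_dir"] = (
        "/home/ma-user/work/models/alpaca-fastchat2048.mindrecord"
    )
    cfg["use_parallel"] = False  # single-card launch

    with open(config_path, "w", encoding="utf-8") as f:
        yaml.safe_dump(cfg, f, sort_keys=False)

Note that a yaml.safe_load / safe_dump round trip discards comments and may rewrite the YAML anchors, so for a one-off change hand-editing is usually simpler; the sketch only shows which keys matter.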

Launch via script

    python run_mindformer.py --config=./configs/llama/run_llama_7b_lora.yaml --use_parallel=False --run_mode=finetune
    

run_llama_7b_lora.yaml

    seed: 0
    output_dir: './output'  # Customization is not currently supported; do not change this default value
    load_checkpoint: '/home/ma-user/work/models/mindspore/open_llama_7b_ms.ckpt'
    src_strategy_path_or_dir: ''
    auto_trans_ckpt: False  # If true, auto transform load_checkpoint to load in distributed model
    only_save_strategy: False
    resume_training: False
    run_mode: 'finetune'
    
    # trainer config
    trainer:
      type: CausalLanguageModelingTrainer
      model_name: 'llama_7b_lora'
    
    # runner config
    runner_config:
      epochs: 1
      batch_size: 2
      sink_mode: True
      sink_size: 2
    
    # optimizer
    optimizer:
      type: FP32StateAdamWeightDecay
      beta1: 0.9
      beta2: 0.95
      eps: 1.e-8
      learning_rate: 1.e-4
    
    # lr schedule
    lr_schedule:
      type: CosineWithWarmUpLR
      learning_rate: 1.e-4
      warmup_ratio: 0.03
      total_steps: -1 # -1 means it will load the total steps of the dataset
    
    # dataset
    train_dataset: &train_dataset
      data_loader:
        type: MindDataset
        dataset_dir: "/home/ma-user/work/models/alpaca-fastchat2048.mindrecord"
        shuffle: True
      input_columns: ["input_ids", "labels"]  # "input_ids", "labels" , labels are used in instruction finetune.
      num_parallel_workers: 8
      python_multiprocessing: False
      drop_remainder: True
      batch_size: 2
      repeat: 1
      numa_enable: False
      prefetch_size: 1
    
    train_dataset_task:
      type: CausalLanguageModelDataset
      dataset_config: *train_dataset
    # if True, do evaluate during the training process. if false, do nothing.
    # note that the task trainer should support _evaluate_in_training function.
    do_eval: False
    
    # eval dataset
    eval_dataset: &eval_dataset
      data_loader:
        type: MindDataset
        dataset_dir: "/home/ma-user/work/models/alpaca-fastchat2048.mindrecord"
        shuffle: False
      input_columns: ["input_ids", "labels"]
      num_parallel_workers: 8
      python_multiprocessing: False
      drop_remainder: False
      repeat: 1
      numa_enable: False
      prefetch_size: 1
    eval_dataset_task:
      type: CausalLanguageModelDataset
      dataset_config: *eval_dataset
    
    use_parallel: False
    # parallel context config
    parallel:
      parallel_mode: 1 # 0-data parallel, 1-semi-auto parallel, 2-auto parallel, 3-hybrid parallel
      gradients_mean: False
      enable_alltoall: False
      full_batch: True
      search_mode: "sharding_propagation"
      enable_parallel_optimizer: False
      strategy_ckpt_save_file: "./ckpt_strategy.ckpt"
      parallel_optimizer_config:
        gradient_accumulation_shard: False
        parallel_optimizer_threshold: 64
    # default parallel of device num = 8 910A
    parallel_config:
      data_parallel: 8
      model_parallel: 1
      pipeline_stage: 1
      use_seq_parallel: False
      optimizer_shard: False
      micro_batch_num: 1
      vocab_emb_dp: True
      gradient_aggregation_group: 4
    # when model parallel is greater than 1, we can set micro_batch_interleave_num=2, that may accelerate the train process.
    micro_batch_interleave_num: 1
    
    # recompute config
    recompute_config:
      recompute: True
      select_recompute: False
      parallel_optimizer_comm_recompute: False
      mp_comm_recompute: True
      recompute_slice_activation: True
    
    # callbacks
    callbacks:
      - type: MFLossMonitor
      - type: CheckpointMointor
        prefix: "llama_7b_lora"
        save_checkpoint_steps: 20000
        integrated_save: False
        async_save: False
      - type: ObsMonitor
    
    # mindspore context init config
    context:
      mode: 0 #0--Graph Mode; 1--Pynative Mode
      device_target: "Ascend"
      enable_graph_kernel: False
      graph_kernel_flags: "--disable_expand_ops=Softmax,Dropout --enable_parallel_fusion=true --reduce_fuse_depth=8 --enable_auto_tensor_inplace=true"
      max_call_depth: 10000
      max_device_memory: "31GB"
      save_graphs: False
      save_graphs_path: "./graph"
      device_id: 0
    
    # model config
    model:
      model_config:
        type: LlamaConfig
        batch_size: 1 # add for increase predict
        seq_length: 2048
        hidden_size: 4096
        num_layers: 32
        num_heads: 32
        vocab_size: 32000
        multiple_of: 256
        rms_norm_eps: 1.0e-6
        bos_token_id: 1
        eos_token_id: 2
        pad_token_id: 0
        ignore_token_id: -100
        compute_dtype: "float16"
        layernorm_compute_dtype: "float32"
        softmax_compute_dtype: "float16"
        rotary_dtype: "float16"
        param_init_type: "float16"
        use_past: False
        pretrain_seqlen: 2048 # seqlen of the pretrain checkpoint: 2048 for llama and 4096 for llama2
        extend_method: "None" # support "None", "PI", "NTK"
        compute_in_2d: False
        use_flash_attention: False
        offset: 0
        use_past_shard: False
        checkpoint_name_or_path: "llama_7b_lora"
        repetition_penalty: 1
        max_decode_length: 512
        top_k: 3
        top_p: 1
        do_sample: False
        pet_config:
          pet_type: lora
          # configuration of lora
          in_channels: 4096
          out_channels: 4096
          lora_rank: 16
          lora_alpha: 16
          lora_dropout: 0.05
      arch:
        type: LlamaForCausalLMWithLora
    
    processor:
      return_tensors: ms
      tokenizer:
        unk_token: '<unk>'
        bos_token: '<s>'
        eos_token: '</s>'
        pad_token: '<unk>'
        type: LlamaTokenizer
    
    # metric
    metric:
      type: PerplexityMetric
    
    # wrapper cell config
    runner_wrapper:
      type: MFTrainOneStepCell
      scale_sense:
        type: DynamicLossScaleUpdateCell
        loss_scale_value: 4294967296
        scale_factor: 2
        scale_window: 1000
      use_clip_grad: True
    
    eval_callbacks:
      - type: ObsMonitor
    
    auto_tune: False
    filepath_prefix: './autotune'
    autotune_per_step: 10
    
    profile: False
    profile_start_step: 1
    profile_stop_step: 10
    init_start_profile: False
    profile_communication: False
    profile_memory: True
    layer_scale: False
    layer_decay: 0.65
    lr_scale_factor: 256
    
    # cfts init config
    remote_save_url: "Please input obs url on AICC platform."
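The pet_config block in the configuration above is what turns the base model into a LoRA model: the 4096x4096 projection weights stay frozen, and only a rank-16 low-rank update scaled by lora_alpha / lora_rank is trained. A minimal NumPy sketch of the idea, for illustration only and not MindFormers' actual implementation:

    # LoRA forward pass for one frozen linear layer, using the dimensions from
    # pet_config above: in_channels=4096, out_channels=4096, lora_rank=16, lora_alpha=16.
    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_out, rank, alpha = 4096, 4096, 16, 16

    W = rng.standard_normal((d_out, d_in)) * 0.02  # frozen pretrained weight
    A = rng.standard_normal((rank, d_in)) * 0.01   # trainable LoRA matrix, small init
    B = np.zeros((d_out, rank))                    # trainable LoRA matrix, zero init

    x = rng.standard_normal(d_in)
    y = W @ x + (alpha / rank) * (B @ (A @ x))     # base output + scaled low-rank update
    print(y.shape)  # (4096,)

Because B starts at zero, the output is unchanged before training begins, and only A and B (plus their optimizer state) are updated during fine-tuning, which is why the LoRA task fits on a single card.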
    
Original article: https://blog.csdn.net/yichao_ding/article/details/133810561