• chinese_llama_alpaca training and code analysis


    Reference links:

    • Training details · ymcui/Chinese-LLaMA-Alpaca Wiki (Chinese LLaMA & Alpaca LLMs, local CPU/GPU training and deployment): https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/%E8%AE%AD%E7%BB%83%E7%BB%86%E8%8A%82
    • Chinese LLaMA & Alpaca LLMs: vocabulary expansion + pretraining + instruction fine-tuning (Zhihu): https://zhuanlan.zhihu.com/p/631360711
    • GitHub - liguodongiot/llm-action: a project sharing LLM techniques and hands-on experience: https://github.com/liguodongiot/llm-action
    • SentencePiece, an essential tool for LLM vocabulary expansion (Zhihu): https://zhuanlan.zhihu.com/p/630696264

    The pipeline consists of one pretraining run and one instruction fine-tuning run: pretraining uses the expanded tokenizer, and instruction fine-tuning uses the chinese_llama_alpaca tokenizer.

    1. Vocabulary Expansion

    Why expand the vocabulary? Can't we just continue pretraining the original LLaMA on Chinese text directly?

    The original LLaMA vocabulary has 32K tokens and was trained mainly on English (see the LLaMA paper for details), so its support for other languages is limited (for comparison, the classic multilingual model XLM-R has a 250K-token vocabulary). A quick inspection shows that the LLaMA vocabulary contains very few Chinese characters, so the tokenizer splits Chinese text into small fragments: several byte-level tokens are needed to assemble one Chinese character, which lowers the information density. For example, with the expanded vocabulary a single Chinese character is usually encoded as one token, whereas in the original LLaMA it may take 2-3 tokens to represent one character, which significantly hurts encoding/decoding efficiency.
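    To make the efficiency gap concrete, the hedged sketch below (the two local paths are placeholders, not from the original post) tokenizes the same Chinese sentence with the original 32K LLaMA tokenizer and with the merged tokenizer from section 1.2 and compares the token counts.

    ```
    # Sketch: compare tokenization before and after vocabulary expansion.
    # Both paths are placeholders -- point them at the original LLaMA-7B HF checkpoint
    # and at the merged Chinese tokenizer produced in section 1.2.
    from transformers import LlamaTokenizer

    orig_tok = LlamaTokenizer.from_pretrained("/path/to/LLaMA-7B-Base-hf")    # 32000 tokens
    zh_tok = LlamaTokenizer.from_pretrained("/path/to/merged_tokenizer_hf")   # 49953 tokens

    text = "大语言模型的词表扩充"
    print(len(orig_tok.tokenize(text)))  # many byte-level pieces, roughly 2-3 per character
    print(len(zh_tok.tokenize(text)))    # close to one token per character
    ```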

    Chinese-LLaMA-Alpaca trains a 20K-entry Chinese vocabulary with sentencepiece on general Chinese corpora and merges it with the original 32K LLaMA vocabulary. After removing duplicate tokens, the final Chinese LLaMA vocabulary has 49953 entries. In the fine-tuning stage, Alpaca has one more pad token than LLaMA, so the Chinese Alpaca vocabulary size is 49954. For merging the Chinese expansion vocabulary with the original 32K LLaMA vocabulary, the officially released chinese_sp.model is used directly here.

    1.1 SentencePiece training

    spm_train --input=/workspace/data/book/hongluomeng_clean.txt --model_prefix=/workspace/model/book/hongluomeng-tokenizer --vocab_size=4000 --character_coverage=0.9995 --model_type=bpe

    • --input: training corpus file; a comma-separated list of files may be passed. The format is one sentence per line. There is no need to run a tokenizer, normalizer, or preprocessor beforehand; by default SentencePiece normalizes the input with Unicode NFKC.
    • --model_prefix: prefix for the output model name. After training, .model and .vocab files are generated.
    • --vocab_size: vocabulary size after training, e.g. 8000, 16000, or 32000.
    • --character_coverage: fraction of characters covered by the model. The recommended value is 0.9995 for languages with rich character sets (such as Japanese or Chinese) and 1.0 for languages with small character sets.
    • --model_type: model type. Options: unigram (default), bpe, char, or word. With the word type, the input sentences must be pre-tokenized.
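    Once training finishes, the resulting .model file can be loaded with the sentencepiece Python API to sanity-check the segmentation. A minimal sketch (the model path follows the spm_train command above; adjust to your own output location):

    ```
    # Minimal sanity check of a trained SentencePiece model.
    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file="/workspace/model/book/hongluomeng-tokenizer.model")
    print(sp.vocab_size())                           # should be 4000 for the command above
    pieces = sp.encode("满纸荒唐言,一把辛酸泪", out_type=str)
    ids = sp.encode("满纸荒唐言,一把辛酸泪", out_type=int)
    print(pieces)
    print(sp.decode(ids))                            # should round-trip back to (approximately) the input text
    ```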

    1.2 Merging the trained SentencePiece model with the original vocabulary

    1. Convert the weights to HF format:
    ```
    python convert_llama_weights_to_hf.py --input_dir /home/image_team/image_team_docker_home/lgd/e_commerce_llm/weights/LLaMA-7B-Base/ --model_size 7B --output_dir /home/image_team/image_team_docker_home/lgd/e_commerce_llm/weights/LLaMA-7B-Base-hf/
    ```
    2. Merge the vocabularies (a sketch of the core merge idea follows this list):
    ```
    python merge_tokenizers.py --llama_tokenizer_dir /home/image_team/image_team_docker_home/lgd/e_commerce_llm/weights/LLaMA-7B-Base-hf/ --chinese_sp_model_file /home/image_team/image_team_docker_home/lgd/common/Chinese-LLaMA-Alpaca-main/scripts/merge_tokenizer/chinese_sp.model
    ```
    The outputs are:
    • merged_tokenizer_sp: the trained (merged) vocabulary model in SentencePiece format
    • merged_tokenizer_hf: the trained (merged) vocabulary model in HF format
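    Conceptually, merge_tokenizers.py appends every Chinese piece that is missing from the LLaMA vocabulary to the LLaMA SentencePiece model proto. The following is a simplified sketch of that idea, not the full official script; paths are placeholders.

    ```
    # Simplified sketch of the vocabulary-merge idea behind merge_tokenizers.py.
    import os
    os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"  # may be needed depending on your protobuf build
    import sentencepiece as spm
    from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model
    from transformers import LlamaTokenizer

    llama_tokenizer = LlamaTokenizer.from_pretrained("/path/to/LLaMA-7B-Base-hf")
    chinese_sp = spm.SentencePieceProcessor(model_file="/path/to/chinese_sp.model")

    llama_proto = sp_pb2_model.ModelProto()
    llama_proto.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto())
    chinese_proto = sp_pb2_model.ModelProto()
    chinese_proto.ParseFromString(chinese_sp.serialized_model_proto())

    existing = {p.piece for p in llama_proto.pieces}
    for p in chinese_proto.pieces:
        if p.piece not in existing:                       # skip tokens LLaMA already has
            new_piece = sp_pb2_model.ModelProto.SentencePiece()
            new_piece.piece = p.piece
            new_piece.score = 0
            llama_proto.pieces.append(new_piece)

    with open("chinese_llama_merged.model", "wb") as f:
        f.write(llama_proto.SerializeToString())          # ~49953 pieces after merging
    ```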

    2. Training

    Training has three stages: the first and second are pretraining, and the third is instruction fine-tuning.

    2.1 Stage 1

    Freeze the transformer parameters and train only the embeddings, so that the newly added Chinese word vectors are adapted while disturbing the original model as little as possible. Convergence in this stage is slow, so unless you have plenty of time and compute, the authors recommend skipping it. The official repository also does not provide code for this stage; if you want to run it, you need to modify the code yourself.

    Step 1: before training, set param.requires_grad = False for every layer except the embedding, as shown below:

    ```
    for name, param in model.named_parameters():
        if "model.embed_tokens" not in name:
            param.requires_grad = False
    ```

    Step 2: during training, add a filter to the optimizer that drops parameters with requires_grad = False, so these parameters are never updated:

    optimizer = AdamW(filter(lambda p: p.requires_grad, model.parameters()))
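    As a quick sanity check (a sketch, not from the original post), you can count how many parameters remain trainable after the freeze; for LLaMA-7B with the 49953-token vocabulary, only the embedding matrix should show up. `model` here is the same model object as in the snippets above.

    ```
    # Sketch: verify that only the embedding weights remain trainable after freezing.
    trainable = [(n, p.numel()) for n, p in model.named_parameters() if p.requires_grad]
    for name, numel in trainable:
        print(name, numel)           # expect only model.embed_tokens.weight (~49953 x hidden_size)
    print("trainable params:", sum(numel for _, numel in trainable))
    ```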

    2.2 Stage 2

    Use LoRA: add LoRA weights to the model and update them while also training the embeddings. The launch script follows; a peft sketch of the equivalent LoRA configuration appears right after it.

    ```
    lr=2e-4
    lora_rank=8
    lora_alpha=32
    lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
    modules_to_save="embed_tokens,lm_head"
    lora_dropout=0.05
    pretrained_model=/home/image_team/image_team_docker_home/lgd/e_commerce_llm/weights/LLaMA-7B-Base-hf/
    chinese_tokenizer_path=/home/image_team/image_team_docker_home/lgd/common/Chinese-LLaMA-Alpaca-main/scripts/merge_tokenizer/merged_tokenizer_hf/
    dataset_dir=/home/image_team/image_team_docker_home/lgd/common/Chinese-LLaMA-Alpaca-main/data/
    data_cache=/home/image_team/image_team_docker_home/lgd/common/Chinese-LLaMA-Alpaca-main/data_cache/
    per_device_train_batch_size=1
    per_device_eval_batch_size=1
    gradient_accumulation_steps=1
    training_step=100
    output_dir=/home/image_team/image_team_docker_home/lgd/common/Chinese-LLaMA-Alpaca-main/output_dir/
    deepspeed_config_file=ds_zero2_no_offload.json

    torchrun --nnodes 1 --nproc_per_node 3 run_clm_pt_with_peft.py \
        --deepspeed ${deepspeed_config_file} \
        --model_name_or_path ${pretrained_model} \
        --tokenizer_name_or_path ${chinese_tokenizer_path} \
        --dataset_dir ${dataset_dir} \
        --data_cache_dir ${data_cache} \
        --validation_split_percentage 0.001 \
        --per_device_train_batch_size ${per_device_train_batch_size} \
        --per_device_eval_batch_size ${per_device_eval_batch_size} \
        --do_train \
        --seed $RANDOM \
        --fp16 \
        --num_train_epochs 1 \
        --lr_scheduler_type cosine \
        --learning_rate ${lr} \
        --warmup_ratio 0.05 \
        --weight_decay 0.01 \
        --logging_strategy steps \
        --logging_steps 10 \
        --save_strategy steps \
        --save_total_limit 3 \
        --save_steps 200 \
        --gradient_accumulation_steps ${gradient_accumulation_steps} \
        --preprocessing_num_workers 8 \
        --block_size 512 \
        --output_dir ${output_dir} \
        --overwrite_output_dir \
        --ddp_timeout 30000 \
        --logging_first_step True \
        --lora_rank ${lora_rank} \
        --lora_alpha ${lora_alpha} \
        --trainable ${lora_trainable} \
        --modules_to_save ${modules_to_save} \
        --lora_dropout ${lora_dropout} \
        --torch_dtype float16 \
        --gradient_checkpointing \
        --ddp_find_unused_parameters False
    ```
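    The key point of this stage is that, besides the LoRA adapters on the attention and MLP projections, embed_tokens and lm_head are passed as modules_to_save, so they are trained in full alongside the adapters. Below is a hedged peft sketch of the equivalent configuration (parameter values copied from the script above; the model path is a placeholder):

    ```
    # Sketch of the LoRA setup that the stage-2 script configures via command-line flags.
    import torch
    from transformers import LlamaForCausalLM
    from peft import LoraConfig, TaskType, get_peft_model

    model = LlamaForCausalLM.from_pretrained("/path/to/LLaMA-7B-Base-hf", torch_dtype=torch.float16)
    model.resize_token_embeddings(49953)                # match the merged Chinese vocabulary

    peft_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=8,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "down_proj", "up_proj"],
        modules_to_save=["embed_tokens", "lm_head"],    # trained in full, not as low-rank adapters
    )
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()
    ```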

    2.3 Merging the LoRA model into the base model

    ```
    python merge_llama_with_chinese_lora.py \
        --base_model /home/image_team/image_team_docker_home/lgd/e_commerce_llm/weights/LLaMA-7B-Base-hf/ \
        --lora_model /home/image_team/image_team_docker_home/lgd/common/Chinese-LLaMA-Alpaca-main/output_dir/pt_lora_model/ \
        --output_type huggingface \
        --output_dir /home/image_team/image_team_docker_home/lgd/common/Chinese-LLaMA-Alpaca-main/output_dir/
    ```
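    Conceptually the merge folds the low-rank updates back into the base weights and saves a plain HF checkpoint. An equivalent way to do this directly with peft is sketched below; this is an illustration of what the official merge script achieves, not the script itself, and the paths are placeholders.

    ```
    # Sketch: merge a LoRA adapter back into the base weights with peft.
    import torch
    from transformers import LlamaForCausalLM, LlamaTokenizer
    from peft import PeftModel

    base = LlamaForCausalLM.from_pretrained("/path/to/LLaMA-7B-Base-hf", torch_dtype=torch.float16)
    tokenizer = LlamaTokenizer.from_pretrained("/path/to/output_dir/pt_lora_model")
    base.resize_token_embeddings(len(tokenizer))        # match the expanded 49953-token vocabulary

    model = PeftModel.from_pretrained(base, "/path/to/output_dir/pt_lora_model")
    merged = model.merge_and_unload()                   # folds B*A*(alpha/r) into the base weights
    merged.save_pretrained("/path/to/merged_model")
    tokenizer.save_pretrained("/path/to/merged_model")
    ```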

    2.4 Stage 3: instruction fine-tuning

    The training data is alpaca_data_zh_51k.json. The vocabulary obtained in the expansion stage has 49953 entries, but in the SFT stage the Alpaca tokenizer has one more pad token than LLaMA, giving 49954. Note that this chinese_llama_alpaca tokenizer is pulled directly from the author's repository.
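    The extra pad token is added at tokenizer-load time, as the SFT code walkthrough in section 3.2 shows. A hedged sketch of that step (the tokenizer path is a placeholder pointing at the 49953-token merged tokenizer):

    ```
    # Sketch: where the 49953 -> 49954 difference comes from in the SFT stage.
    from transformers import LlamaTokenizer

    tokenizer = LlamaTokenizer.from_pretrained("/path/to/merged_tokenizer_hf")   # 49953 tokens
    if tokenizer.pad_token is None:
        tokenizer.add_special_tokens(dict(pad_token="[PAD]"))                    # LLaMA has no pad token by default
    print(len(tokenizer))                                                        # 49954
    # The model's embedding matrix must be resized to match:
    # model.resize_token_embeddings(len(tokenizer))
    ```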

    ```
    lr=1e-4
    lora_rank=8
    lora_alpha=32
    lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
    modules_to_save="embed_tokens,lm_head"
    lora_dropout=0.05
    pretrained_model=/home/image_team/image_team_docker_home/lgd/common/Chinese-LLaMA-Alpaca-main/output_dir/
    chinese_tokenizer_path=/home/image_team/image_team_docker_home/lgd/common/Chinese-LLaMA-Alpaca-main/chinese_alpaca_tokenizer/
    dataset_dir=/home/image_team/image_team_docker_home/lgd/common/Chinese-LLaMA-Alpaca-main/data/
    per_device_train_batch_size=1
    per_device_eval_batch_size=1
    gradient_accumulation_steps=8
    output_dir=/home/image_team/image_team_docker_home/lgd/common/Chinese-LLaMA-Alpaca-main/output_dir_sft/
    #peft_model=/home/image_team/image_team_docker_home/lgd/common/Chinese-LLaMA-Alpaca-main/output_dir_sft/
    validation_file=/home/image_team/image_team_docker_home/lgd/common/Chinese-LLaMA-Alpaca-main/data/alpaca_valid.json
    deepspeed_config_file=ds_zero2_no_offload.json

    torchrun --nnodes 1 --nproc_per_node 3 run_clm_sft_with_peft.py \
        --deepspeed ${deepspeed_config_file} \
        --model_name_or_path ${pretrained_model} \
        --tokenizer_name_or_path ${chinese_tokenizer_path} \
        --dataset_dir ${dataset_dir} \
        --validation_split_percentage 0.001 \
        --per_device_train_batch_size ${per_device_train_batch_size} \
        --per_device_eval_batch_size ${per_device_eval_batch_size} \
        --do_train \
        --do_eval \
        --seed $RANDOM \
        --fp16 \
        --num_train_epochs 1 \
        --lr_scheduler_type cosine \
        --learning_rate ${lr} \
        --warmup_ratio 0.03 \
        --weight_decay 0 \
        --logging_strategy steps \
        --logging_steps 10 \
        --save_strategy steps \
        --save_total_limit 3 \
        --evaluation_strategy steps \
        --eval_steps 100 \
        --save_steps 2000 \
        --gradient_accumulation_steps ${gradient_accumulation_steps} \
        --preprocessing_num_workers 8 \
        --max_seq_length 512 \
        --output_dir ${output_dir} \
        --overwrite_output_dir \
        --ddp_timeout 30000 \
        --logging_first_step True \
        --lora_rank ${lora_rank} \
        --lora_alpha ${lora_alpha} \
        --trainable ${lora_trainable} \
        --modules_to_save ${modules_to_save} \
        --lora_dropout ${lora_dropout} \
        --torch_dtype float16 \
        --validation_file ${validation_file} \
        --gradient_checkpointing \
        --ddp_find_unused_parameters False
        # --peft_path ${peft_model}
    ```

    2.5 Merging the pretraining LoRA and the fine-tuning LoRA into the base model

    ```
    python merge_llama_with_chinese_lora.py \
        --base_model /home/image_team/image_team_docker_home/lgd/e_commerce_llm/weights/LLaMA-7B-Base-hf/ \
        --lora_model /home/image_team/image_team_docker_home/lgd/common/Chinese-LLaMA-Alpaca-main/output_dir/pt_lora_model/,"/home/image_team/image_team_docker_home/lgd/common/Chinese-LLaMA-Alpaca-main/output_dir_sft/sft_lora_model/" \
        --output_type huggingface \
        --output_dir /home/image_team/image_team_docker_home/lgd/common/Chinese-LLaMA-Alpaca-main/output_dir_all/
    ```

    2.6 Inference

    ```
    python inference_hf.py \
        --base_model /home/image_team/image_team_docker_home/lgd/common/Chinese-LLaMA-Alpaca-main/output_dir_all/ \
        --with_prompt \
        --interactive
    ```

    transformers==4.31.0

    3. Code Analysis

    3.1 Pretraining code

    ```
    parser = HfArgumentParser((ModelArguments, DataTrainingArguments, MyTrainingArguments))
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
    set_seed(training_args.seed)
    config = AutoConfig.from_pretrained(model_args.model_name_or_path, ...)
    tokenizer = LlamaTokenizer.from_pretrained(model_args.tokenizer_name_or_path, ...)
    block_size = tokenizer.model_max_length
    with training_args.main_process_first():
        files = [file.name for file in path.glob("*.txt")]
        for idx, file in enumerate(files):
            raw_dataset = load_dataset("text", data_files=data_file, cache_dir=cache_dir, keep_in_memory=False)
            tokenized_dataset = raw_dataset.map(tokenize_function, ...)   # tokenize_function -> tokenizer(examples["text"])
            grouped_dataset = tokenized_dataset.map(group_texts, ...)     # chunk into block_size pieces (see the sketch after this block)
            processed_dataset.save_to_disk(cache_path)
            lm_datasets = concatenate_datasets([lm_datasets, processed_dataset["train"]])
        lm_datasets = lm_datasets.train_test_split(test_size=data_args.validation_split_percentage)
    train_dataset = lm_datasets["train"]
    model = LlamaForCausalLM.from_pretrained(model_args.model_name_or_path, ...)
    model_vocab_size = model.get_output_embeddings().weight.size(0)
    model.resize_token_embeddings(len(tokenizer))
    target_modules = training_args.trainable.split(",")
    modules_to_save = training_args.modules_to_save
    lora_rank = training_args.lora_rank
    lora_dropout = training_args.lora_dropout
    lora_alpha = training_args.lora_alpha
    peft_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        target_modules=target_modules,
        r=lora_rank, lora_alpha=lora_alpha,
        lora_dropout=lora_dropout,
        modules_to_save=modules_to_save)
    model = get_peft_model(model, peft_config)
    old_state_dict = model.state_dict
    model.state_dict = (
        lambda self, *_, **__: get_peft_model_state_dict(self, old_state_dict())
    ).__get__(model, type(model))
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        data_collator=fault_tolerance_data_collator,
        compute_metrics=compute_metrics,
        preprocess_logits_for_metrics=preprocess_logits_for_metrics)
    trainer.add_callback(SavePeftModelCallback)
    checkpoint = training_args.resume_from_checkpoint
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
    metrics = train_result.metrics
    trainer.log_metrics("train", metrics)
    trainer.save_metrics("train", metrics)
    trainer.save_state()
    ```
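    group_texts is the standard causal-LM preprocessing step: it concatenates all tokenized texts and slices them into block_size chunks, with labels copied from input_ids. Below is a sketch of what it does, following the usual Hugging Face run_clm recipe rather than code copied verbatim from the repository (block_size is 512 in the stage-2 script).

    ```
    # Sketch of the group_texts step used during pretraining preprocessing.
    # Applied as: tokenized_dataset.map(group_texts, batched=True, ...)
    def group_texts(examples, block_size=512):
        # Concatenate every field (input_ids, attention_mask, ...) across the batch.
        concatenated = {k: sum(examples[k], []) for k in examples.keys()}
        total_length = (len(concatenated["input_ids"]) // block_size) * block_size
        # Slice into contiguous block_size chunks; drop the trailing remainder.
        result = {
            k: [v[i:i + block_size] for i in range(0, total_length, block_size)]
            for k, v in concatenated.items()
        }
        # For causal LM, labels are simply a copy of input_ids.
        result["labels"] = result["input_ids"].copy()
        return result
    ```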

    3.2 Instruction fine-tuning code

    ```
    parser = HfArgumentParser((ModelArguments, DataTrainingArguments, MyTrainingArguments))
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
    set_seed(training_args.seed)
    config = AutoConfig.from_pretrained(model_args.model_name_or_path, ...)
    tokenizer = LlamaTokenizer.from_pretrained(model_args.tokenizer_name_or_path, ...)
    tokenizer.add_special_tokens(dict(pad_token="[PAD]"))
    data_collator = DataCollatorForSupervisedDataset(tokenizer)
    # Inside the collator:
    #   input_ids, labels = tuple([instance[key] for instance in instances] for key in ("input_ids", "labels"))
    #   input_ids = torch.nn.utils.rnn.pad_sequence(input_ids, batch_first=True, padding_value=self.tokenizer.pad_token_id)
    #   labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True, padding_value=-100)
    with training_args.main_process_first():
        files = [os.path.join(path, file.name) for file in path.glob("*.json")]
        train_dataset = build_instruction_dataset(files, tokenizer, data_args.max_seq_length, ...)
        # Inside build_instruction_dataset (see the worked example after this block):
        #   for file in data_path:
        #       raw_dataset = load_dataset("json", data_files=file, ...)
        #       tokenized_dataset = raw_dataset.map(tokenization, ...)
        #       inside tokenization:
        #           for instruction, input, output in zip(examples["instruction"], examples["input"], examples["output"]):
        #               if input is not None and input != "":
        #                   instruction = instruction + "\n" + input
        #               source = prompt.format_map({"instruction": instruction})
        #               target = f"{output}{tokenizer.eos_token}"
        #           tokenized_sources = tokenizer(sources, return_attention_mask=False)
        #           tokenized_targets = tokenizer(targets, return_attention_mask=False, add_special_tokens=False)
        #           for s, t in zip(tokenized_sources["input_ids"], tokenized_targets["input_ids"]):
        #               input_ids = torch.LongTensor(s + t)[:max_seq_length]
        #               labels = torch.LongTensor([IGNORE_INDEX] * len(s) + t)[:max_seq_length]
        #           results = {"input_ids": all_input_ids, "labels": all_labels}
    model = LlamaForCausalLM.from_pretrained(model_args.model_name_or_path, config=config, ...)
    embedding_size = model.get_input_embeddings().weight.shape[0]
    model.resize_token_embeddings(len(tokenizer))
    target_modules = training_args.trainable.split(",")
    modules_to_save = training_args.modules_to_save
    if modules_to_save is not None:
        modules_to_save = modules_to_save.split(",")
    lora_rank = training_args.lora_rank
    lora_dropout = training_args.lora_dropout
    lora_alpha = training_args.lora_alpha
    peft_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        target_modules=target_modules,
        inference_mode=False,
        r=lora_rank, lora_alpha=lora_alpha,
        lora_dropout=lora_dropout,
        modules_to_save=modules_to_save)
    model = get_peft_model(model, peft_config)
    old_state_dict = model.state_dict
    model.state_dict = (
        lambda self, *_, **__: get_peft_model_state_dict(self, old_state_dict())
    ).__get__(model, type(model))
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        data_collator=data_collator,
    )
    trainer.add_callback(SavePeftModelCallback)
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
    metrics = train_result.metrics
    trainer.log_metrics("train", metrics)
    trainer.save_metrics("train", metrics)
    trainer.save_state()
    ```
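    The key detail in the SFT data pipeline is that prompt tokens are masked with IGNORE_INDEX (-100) so the loss is computed only on the response tokens. The worked example below mirrors the tokenization step above for a single (instruction, output) pair; the prompt template is the usual Alpaca-style wording and may differ slightly from the repo, and the tokenizer path is a placeholder.

    ```
    # Sketch: how one (instruction, output) pair becomes input_ids/labels in the SFT stage.
    import torch
    from transformers import LlamaTokenizer

    IGNORE_INDEX = -100
    PROMPT = ("Below is an instruction that describes a task. "
              "Write a response that appropriately completes the request.\n\n"
              "### Instruction:\n{instruction}\n\n### Response: ")

    tokenizer = LlamaTokenizer.from_pretrained("/path/to/chinese_alpaca_tokenizer")  # placeholder path

    source = PROMPT.format_map({"instruction": "把下面的句子翻译成英文:今天天气很好"})
    target = "The weather is nice today." + tokenizer.eos_token

    s = tokenizer(source, return_attention_mask=False)["input_ids"]
    t = tokenizer(target, return_attention_mask=False, add_special_tokens=False)["input_ids"]

    max_seq_length = 512
    input_ids = torch.LongTensor(s + t)[:max_seq_length]
    labels = torch.LongTensor([IGNORE_INDEX] * len(s) + t)[:max_seq_length]
    # Prompt tokens carry label -100, so only the response tokens contribute to the loss.
    ```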

    3.3 Inference code

    ```
    apply_attention_patch(use_memory_efficient_attention=True)
    apply_ntk_scaling_patch(args.alpha)
    generation_config = dict(
        temperature=0.2,
        top_k=40,
        top_p=0.9,
        do_sample=True,
        num_beams=1,
        repetition_penalty=1.1,
        max_new_tokens=400)
    tokenizer = LlamaTokenizer.from_pretrained(args.tokenizer_path)
    base_model = LlamaForCausalLM.from_pretrained(
        args.base_model, load_in_8bit=args.load_in_8bit,
        torch_dtype=torch.float16, low_cpu_mem_usage=True)
    model_vocab_size = base_model.get_input_embeddings().weight.size(0)
    tokenizer_vocab_size = len(tokenizer)
    if model_vocab_size != tokenizer_vocab_size:
        base_model.resize_token_embeddings(tokenizer_vocab_size)
    model = base_model
    model.eval()
    with torch.no_grad():
        while True:
            raw_input_text = input("Input:")
            input_text = generate_prompt(instruction=raw_input_text)   # see the sketch after this block
            inputs = tokenizer(input_text, return_tensors="pt")
            generation_output = model.generate(
                input_ids=inputs["input_ids"].to(device),
                attention_mask=inputs["attention_mask"].to(device),
                eos_token_id=tokenizer.eos_token_id,
                pad_token_id=tokenizer.pad_token_id,
                **generation_config)
            s = generation_output[0]
            output = tokenizer.decode(s, skip_special_tokens=True)
            response = output.split("### Response:")[1].strip()
    ```
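    generate_prompt wraps the raw user input in the same Alpaca-style template used during SFT, which is why splitting on "### Response:" at the end recovers only the model's answer. A hedged sketch (the exact template wording may differ slightly from the repository):

    ```
    # Sketch of generate_prompt: wrap the user input in the SFT prompt template so the
    # model sees the same format it was fine-tuned on.
    def generate_prompt(instruction: str) -> str:
        return ("Below is an instruction that describes a task. "
                "Write a response that appropriately completes the request.\n\n"
                f"### Instruction:\n{instruction}\n\n### Response: ")
    ```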

     

  • Original article: https://blog.csdn.net/u012193416/article/details/134119566