• Sharded Deployment of a 72B Large Model


    I. Outline

    1. Purpose
    2. Official tutorial
    3. Example
    4. Modifying device_map for a small model (method 2)

    II. Implementation

    1. Purpose: deploy a 72B large model across two GPUs.
    2. Official tutorial
      Post: https://huggingface.co/blog/accelerate-large-models
    3. Implementation
      1. Automatic placement
    from transformers import AutoModelForCausalLM

    # device_map="auto" lets accelerate spread the layers over all visible GPUs
    model = AutoModelForCausalLM.from_pretrained(
        model_dir, revision='master',
        device_map="auto", trust_remote_code=True
    ).eval()
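
    If you want to limit how much of each card automatic placement may use, from_pretrained also accepts a max_memory argument that accelerate honors when computing the placement. A minimal sketch; the caps below are assumptions, tune them to your GPUs:

    model = AutoModelForCausalLM.from_pretrained(
        model_dir, revision='master',
        device_map="auto", trust_remote_code=True,
        max_memory={0: "38GiB", 1: "38GiB", "cpu": "60GiB"},  # hypothetical limits
    ).eval()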
    

    2. Manual placement
    For example:

    device_map={'transformer.wte': 0, 'transformer.drop': 0, 'transformer.rotary_emb': 0, 'transformer.h.0': 0,'transformer.h.1': 0, 'transformer.h.2': 0, 'transformer.h.3': 0, 'transformer.h.4': 0, 'transformer.h.5': 0, 'transformer.h.6': 0, 'transformer.h.7': 0, 'transformer.h.8': 0, 'transformer.h.9': 0, 'transformer.h.10': 0, 'transformer.h.11': 0, 'transformer.h.12': 0, 'transformer.h.13': 0, 'transformer.h.14': 0, 'transformer.h.15': 0, 'transformer.h.16': 0, 'transformer.h.17': 0, 'transformer.h.18': 0, 'transformer.h.19': 0, 'transformer.h.20': 0, 'transformer.h.21': 0, 'transformer.h.22': 0, 'transformer.h.23': 0, 'transformer.h.24': 0, 'transformer.h.25': 0, 'transformer.h.26': 0, 'transformer.h.27': 0, 'transformer.h.28': 0, 'transformer.h.29': 0, 'transformer.h.30': 0, 'transformer.h.31': 0, 'transformer.h.32': 0, 'transformer.h.33': 0, 'transformer.h.34': 0, 'transformer.h.35': 0, 'transformer.h.36': 0, 'transformer.h.37': 0, 'transformer.h.38': 0, 'transformer.h.39': 0, 'transformer.h.40': 0, 'transformer.h.41': 1, 'transformer.h.42': 1, 'transformer.h.43': 1, 'transformer.h.44': 1, 'transformer.h.45': 1, 'transformer.h.46': 1, 'transformer.h.47': 1, 'transformer.h.48': 1, 'transformer.h.49': 1, 'transformer.h.50': 1, 'transformer.h.51': 1, 'transformer.h.52': 1, 'transformer.h.53': 1, 'transformer.h.54': 1, 'transformer.h.55': 1, 'transformer.h.56': 1, 'transformer.h.57': 1, 'transformer.h.58': 1, 'transformer.h.59': 1, 'transformer.h.60': 1, 'transformer.h.61': 1, 'transformer.h.62': 1, 'transformer.h.63': 1, 'transformer.h.64': 1, 'transformer.h.65': 1, 'transformer.h.66': 1, 'transformer.h.67': 1, 'transformer.h.68': 1, 'transformer.h.69': 1, 'transformer.h.70': 1, 'transformer.h.71': 1, 'transformer.h.72': 1, 'transformer.h.73': 1, 'transformer.h.74': 1, 'transformer.h.75': 1, 'transformer.h.76': 1, 'transformer.h.77': 1, 'transformer.h.78': 1, 'transformer.h.79': 1, 'transformer.ln_f': 1, 'lm_head': 1}
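
    To load with this hand-written map, pass it straight to from_pretrained; a sketch, assuming the same model_dir as in the automatic example:

    model = AutoModelForCausalLM.from_pretrained(
        model_dir, revision='master',
        device_map=device_map,  # the manual placement above
        trust_remote_code=True
    ).eval()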
    

    https://huggingface.co/blog/accelerate-large-models
    1. Obtaining the device_map:

    from accelerate import infer_auto_device_map, init_empty_weights
    from transformers import AutoConfig, AutoModelForCausalLM

    config = AutoConfig.from_pretrained("/home/Llama2-Chinese-7b-Chat")
    # Build the model skeleton without allocating any weights
    with init_empty_weights():
        model = AutoModelForCausalLM.from_config(config)
    # The HF blog example uses "OPTDecoderLayer" (an OPT model); for a Llama
    # model the class that must not be split is "LlamaDecoderLayer"
    device_map = infer_auto_device_map(model, no_split_module_classes=["LlamaDecoderLayer"])
    print(device_map)
    

    Problem encountered: it returns OrderedDict([('', 0)]). Reason: this is normal, the model is simply small. The empty-string key means the whole model is placed on device index 0, which is the CUDA device when one is available, otherwise the CPU.
    In other words, OrderedDict([('', 0)]) just says the entire model fits on the default device (cuda:0, provided a CUDA build of PyTorch can see a GPU).
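
    To get a non-trivial map even for a small model, you can cap per-device memory so infer_auto_device_map is forced to split; a minimal sketch, the limits below are made-up and should be tuned to your hardware:

    # Tight, hypothetical caps force a split across GPU 0, GPU 1 and the CPU
    device_map = infer_auto_device_map(
        model,
        max_memory={0: "4GiB", 1: "4GiB", "cpu": "30GiB"},
        no_split_module_classes=["LlamaDecoderLayer"],
    )
    print(device_map)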
    When the model is genuinely large, infer_auto_device_map instead returns a detailed map:

    from accelerate import infer_auto_device_map, init_empty_weights
    from transformers import AutoConfig, AutoModelForCausalLM

    config = AutoConfig.from_pretrained("/home/Qwen2-72B-Instruct")
    with init_empty_weights():  # no weights are materialized here
        model = AutoModelForCausalLM.from_config(config)
    device_map = infer_auto_device_map(model)
    print(device_map)
    
    OrderedDict([('model.embed_tokens', 0), ('model.layers.0', 0), ('model.layers.1', 0), ('model.layers.2', 0), ('model.layers.3', 0), ('model.layers.4', 0), ('model.layers.5', 0), ('model.layers.6', 0), ('model.layers.7', 0), ('model.layers.8', 0), ('model.layers.9', 0), ('model.layers.10', 0), ('model.layers.11', 0), ('model.layers.12', 0), ('model.layers.13', 0), ('model.layers.14', 0), ('model.layers.15', 0), ('model.layers.16', 0), ('model.layers.17', 0), ('model.layers.18', 0), ('model.layers.19.self_attn', 0), ('model.layers.19.mlp.gate_proj', 0), ('model.layers.19.mlp.up_proj', 0), ('model.layers.19.mlp.down_proj', 1), ('model.layers.19.mlp.act_fn', 1), ('model.layers.19.input_layernorm', 1), ('model.layers.19.post_attention_layernorm', 1), ('model.layers.20', 1), ('model.layers.21', 1), ('model.layers.22', 1), ('model.layers.23', 1), ('model.layers.24', 1), ('model.layers.25', 1), ('model.layers.26', 1), ('model.layers.27', 1), ('model.layers.28', 1), ('model.layers.29', 1), ('model.layers.30', 1), ('model.layers.31', 1), ('model.layers.32.self_attn', 1), ('model.layers.32.input_layernorm', 2), ('model.layers.32.post_attention_layernorm', 2), ('model.layers.33', 2), ('model.layers.34', 2), ('model.layers.35', 2), ('model.layers.36', 2), ('model.layers.37', 2), ('model.layers.38', 2), ('model.layers.39', 2), ('model.layers.40', 2), ('model.layers.41.input_layernorm', 'cpu'), ('model.layers.41.post_attention_layernorm', 'cpu'), ('model.layers.42', 'cpu'), ('model.layers.43', 'cpu'), ('model.layers.44', 'cpu'), ('model.layers.45', 'cpu'), ('model.layers.46', 'cpu'), ('model.layers.47', 'cpu'), ('model.layers.48', 'cpu'), ('model.layers.49', 'cpu'), ('model.layers.50', 'cpu'), ('model.layers.51', 'cpu'), ('model.layers.52', 'cpu'), ('model.layers.53', 'cpu'), ('model.layers.54', 'cpu'), ('model.layers.55', 'cpu'), ('model.layers.56', 'cpu'), ('model.layers.57', 'cpu'), ('model.layers.58', 'cpu'), ('model.layers.59', 'cpu'), ('model.layers.60', 'cpu'), ('model.layers.61', 'cpu'), ('model.layers.62', 'cpu'), ('model.layers.63', 'cpu'), ('model.layers.64', 'cpu'), ('model.layers.65', 'cpu'), ('model.layers.66', 'cpu'), ('model.layers.67', 'cpu'), ('model.layers.68', 'cpu'), ('model.layers.69.self_attn', 'cpu'), ('model.layers.69.mlp.gate_proj', 'cpu'), ('model.layers.69.mlp.up_proj', 'cpu'), ('model.layers.69.mlp.down_proj', 'disk'), ('model.layers.69.mlp.act_fn', 'disk'), ('model.layers.69.input_layernorm', 'disk'), ('model.layers.69.post_attention_layernorm', 'disk'), ('model.layers.70', 'disk'), ('model.layers.71', 'disk'), ('model.layers.72', 'disk'), ('model.layers.73', 'disk'), ('model.layers.74', 'disk'), ('model.layers.75', 'disk'), ('model.layers.76', 'disk'), ('model.layers.77', 'disk'), ('model.layers.78', 'disk'), ('model.layers.79', 'disk'), ('model.norm', 'disk'), ('lm_head', 'disk'), ('model.layers.41.mlp', 'cpu'), ('model.layers.41.self_attn', 3), ('model.layers.32.mlp', 2)])
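
    Note the 'cpu' and 'disk' entries: the GPUs ran out of budget, so accelerate plans to offload those layers. If you load with a map that still contains 'disk', from_pretrained additionally needs a scratch folder for the offloaded weights; a sketch (the offload path is an assumption):

    model = AutoModelForCausalLM.from_pretrained(
        "/home/Qwen2-72B-Instruct",
        device_map=device_map,
        offload_folder="/home/qwen_offload",  # hypothetical directory for 'disk' weights
        torch_dtype="auto",
    )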
    

    Hand-edit the device_map so nothing lands on 'disk' (here: GPUs 0 and 1 plus the CPU), then load the model:

    device_map = {'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0,
                'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0, 'model.layers.9': 0, 'model.layers.10': 0, 'model.layers.11': 0,
                'model.layers.12': 0, 'model.layers.13': 0, 'model.layers.14': 0, 'model.layers.15': 0, 'model.layers.16': 0, 'model.layers.17': 0, 'model.layers.18': 0,
                'model.layers.19': 0, 'model.layers.20': 0, 'model.layers.21': 0, 'model.layers.22': 0, 'model.layers.23': 0, 'model.layers.24': 0, 'model.layers.25': 0,
                'model.layers.26': 0, 'model.layers.27': 0, 'model.layers.28': 0, 'model.layers.29': 0, 'model.layers.30': 0, 'model.layers.31': 0, 'model.layers.32': 0,
                'model.layers.33': 0, 'model.layers.34': 0, 'model.layers.35': 0, 'model.layers.36': 0, 'model.layers.37': 0, 'model.layers.38': 0, 'model.layers.39': 0,
                'model.layers.40': 0, 'model.layers.41': 0, 'model.layers.42': 1, 'model.layers.43': 1, 'model.layers.44': 1, 'model.layers.45': 1, 'model.layers.46': 1,
                'model.layers.47': 1, 'model.layers.48': 1, 'model.layers.49': 1, 'model.layers.50': 1, 'model.layers.51': 1, 'model.layers.52': 1, 'model.layers.53': 1,
                'model.layers.54': 1, 'model.layers.55': 1, 'model.layers.56': 1, 'model.layers.57': 1, 'model.layers.58': 1, 'model.layers.59': 1, 'model.layers.60': 1,
                'model.layers.61': 1, 'model.layers.62': 1, 'model.layers.63': 1, 'model.layers.64': 1, 'model.layers.65': 'cpu', 'model.layers.66': 'cpu',
                'model.layers.67': 'cpu', 'model.layers.68': 'cpu', 'model.layers.69': 'cpu', 'model.layers.70': 'cpu', 'model.layers.71': 'cpu', 'model.layers.72': 'cpu',
                'model.layers.73': 'cpu', 'model.layers.74': 'cpu', 'model.layers.75': 'cpu', 'model.layers.76': 'cpu',
                'model.layers.77': 'cpu', 'model.layers.78': 'cpu', 'model.layers.79': 'cpu', 'model.norm': 'cpu', 'lm_head': 'cpu'}
    
    
    from transformers import AutoModelForCausalLM, AutoTokenizer

    device = "cuda"  # device for the input tensors; the model itself follows device_map
    
    model = AutoModelForCausalLM.from_pretrained(
        "/home/Qwen2-72B-Instruct",
        torch_dtype="auto",
        device_map=device_map
    )
    print(model.hf_device_map)
    
    tokenizer = AutoTokenizer.from_pretrained("/home/Qwen2-72B-Instruct")
    
    prompt = "Give me a short introduction to large language model."
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)
    
    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=128
    )
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print(response)
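
    To confirm how the shards actually landed, a quick check of per-GPU memory after loading (a small sketch):

    import torch

    # Rough per-card footprint of the loaded weights
    for i in range(torch.cuda.device_count()):
        print(f"cuda:{i}: {torch.cuda.memory_allocated(i) / 1024**3:.1f} GiB allocated")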
    
    4. Modifying device_map for a small model (method 2)
      Load the model first, then inspect the placement it was given:
    path = "/home/Llama2-Chinese-7b-Chat"
    model = AutoModelForCausalLM.from_pretrained(path, torch_dtype="auto", device_map="auto")
    print(model.hf_device_map)  # the placement accelerate chose
    
    device_map={'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0, 'model.layers.9': 0, 'model.layers.10': 0, 'model.layers.11': 0, 'model.layers.12': 0, 'model.layers.13': 0, 'model.layers.14': 0, 'model.layers.15': 0, 'model.layers.16': 1, 'model.layers.17': 1, 'model.layers.18': 1, 'model.layers.19': 1, 'model.layers.20': 1, 'model.layers.21': 1, 'model.layers.22': 1, 'model.layers.23': 1, 'model.layers.24': 1, 'model.layers.25': 1, 'model.layers.26': 1, 'model.layers.27': 1, 'model.layers.28': 1, 'model.layers.29': 1, 'model.layers.30': 1, 'model.layers.31': 1, 'model.norm': 1, 'lm_head': 1}
    

    Why can a hand-written device_map raise an error?
    If you hit RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument mat2 in method wrapper_mm),
    it means that an operation involving several tensors (an addition, a matrix multiplication, or a larger layer computation) received operands living on different devices; every tensor taking part in a single operation must be on the same device.
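
    The same error is easy to reproduce with two bare tensors; a minimal sketch of the failure and the fix:

    import torch

    a = torch.randn(2, 2, device="cuda:0")
    b = torch.randn(2, 2, device="cuda:1")
    # a @ b   # RuntimeError: Expected all tensors to be on the same device ...
    c = a @ b.to(a.device)  # move one operand over first, then the matmul succeeds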
    In this case, rewriting the map to spread the layers over devices 0, 1, and 2 resolved it:

    device_map={'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0, 'model.layers.9': 0, 'model.layers.10': 0, 'model.layers.11': 0, 'model.layers.12': 0, 'model.layers.13': 0, 'model.layers.14': 0, 'model.layers.15': 0, 'model.layers.16': 1, 'model.layers.17': 1, 'model.layers.18': 1, 'model.layers.19': 1, 'model.layers.20': 1, 'model.layers.21': 1, 'model.layers.22': 1, 'model.layers.23': 2, 'model.layers.24': 2, 'model.layers.25': 2, 'model.layers.26': 2, 'model.layers.27': 1, 'model.layers.28': 1, 'model.layers.29': 1, 'model.layers.30': 1, 'model.layers.31': 1, 'model.norm': 2, 'lm_head': 2}
    model = AutoModelForCausalLM.from_pretrained(path, device_map=device_map, torch_dtype="auto")
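
    Printing the placement after loading should now echo the hand-written map:

    print(model.hf_device_map)  # should match the corrected device_map above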
    
  • Original article: https://blog.csdn.net/weixin_40777649/article/details/140459377