    accelerate Distributed Tricks in Practice: Deploying ChatGLM-6B (Part 3)

    Environment

    torch==2.0.0+cu118
    transformers==4.28.1
    accelerate==0.18.0
    GPU: Tesla T4, 15.3 GB
    RAM: 11.8 GB
    

    Download the required files:

    git clone https://github.com/THUDM/ChatGLM-6B
    cd ChatGLM-6B
    
    git clone --depth=1 https://huggingface.co/THUDM/chatglm-6b THUDM/chatglm-6b
    git clone --depth=1 https://huggingface.co/THUDM/chatglm-6b-int4 THUDM/chatglm-6b-int4
    
    pip install -r requirements.txt
    pip install gradio
    pip install accelerate
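    
    Note: huggingface.co repositories store the multi-GB checkpoint files with Git LFS, so make sure it is installed and enabled before cloning, otherwise the clones will only contain small pointer files:
    
    # one-time setup; re-run the clones afterwards if they only fetched pointers
    git lfs install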
    

    Normally, running ChatGLM-6B in FP16 needs more than 13 GB of GPU memory. We never profiled the exact RAM requirement, but the 11.8 GB above is certainly not enough; 16 GB should be sufficient.
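    Before choosing one of the options below, it helps to check what the machine actually has. A minimal sketch (psutil is an extra dependency we assume here; the 13 GB threshold comes from the estimate above):

    import torch
    import psutil

    # total memory of GPU 0, in GiB
    gpu_total = torch.cuda.get_device_properties(0).total_memory / 1024**3
    # total system RAM, in GiB
    ram_total = psutil.virtual_memory().total / 1024**3
    print(f"GPU 0: {gpu_total:.1f} GiB, RAM: {ram_total:.1f} GiB")

    if gpu_total < 13:
        print("FP16 won't fit: consider Option 1 (int4) or Option 2 (CPU offload)")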

    Option 1: quantized model

    from accelerate import infer_auto_device_map, init_empty_weights, load_checkpoint_and_dispatch
    from transformers import AutoConfig, AutoModel, AutoModelForCausalLM, AutoTokenizer
    import gradio as gr
    import torch
    import time
    
    # load the pre-quantized int4 checkpoint in half precision on the GPU
    tokenizer = AutoTokenizer.from_pretrained("./THUDM/chatglm-6b-int4", trust_remote_code=True)
    model = AutoModel.from_pretrained("./THUDM/chatglm-6b-int4", trust_remote_code=True).half().cuda()
    
    model = model.eval()
    
    # left over from the gradio demo; unused by the console loop below
    def predict(input, history=None):
        print(f'predict started: {time.time()}')
        if history is None:
            history = []
        response, history = model.chat(tokenizer, input, history)
        return response, history
    
    history = []
    while True:
      text = input(">>User: ")
      response, history = model.chat(tokenizer, text, history)
      print(">>ChatGLM:", response)
    

    The int4 model uses 4.9 GB of GPU memory and 5.5 GB of RAM.
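    If you only have the full-precision checkpoint on disk, the ChatGLM-6B README also documents quantizing it at load time instead of cloning the separate int4 repo; a sketch following that README (loading still needs enough RAM to hold the FP16 weights temporarily):

    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("./THUDM/chatglm-6b", trust_remote_code=True)
    # quantize the FP16 weights to int4 at load time; use quantize(8) for int8
    model = AutoModel.from_pretrained("./THUDM/chatglm-6b", trust_remote_code=True).half().quantize(4).cuda()
    model = model.eval()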

    Option 2: a single GPU with CPU offload

    from accelerate import infer_auto_device_map, init_empty_weights, load_checkpoint_and_dispatch
    from transformers import AutoConfig, AutoModel, AutoModelForCausalLM, AutoTokenizer
    import gradio as gr
    import torch
    import time
    
    
    tokenizer = AutoTokenizer.from_pretrained("./THUDM/chatglm-6b", trust_remote_code=True)
    config = AutoConfig.from_pretrained("./THUDM/chatglm-6b", trust_remote_code=True)
    # build the model skeleton on the meta device; no weights are materialized yet
    with init_empty_weights():
      model = AutoModel.from_config(config, trust_remote_code=True)
    
    # print parameter names to see which modules a device map must cover
    for name, _ in model.named_parameters():
      print(name)
    # infer_auto_device_map can generate the mapping below automatically:
    # device_map = infer_auto_device_map(model, no_split_module_classes=["GLMBlock"])
    # print(device_map)
    # manual map: transformer layers 0-20 on GPU 0, the rest offloaded to CPU RAM
    device_map = {'transformer.word_embeddings': 0, 'transformer.layers.0': 0, 'transformer.layers.1': 0, 'transformer.layers.2': 0, 'transformer.layers.3': 0, 'transformer.layers.4': 0, 'transformer.layers.5': 0, 'transformer.layers.6': 0, 'transformer.layers.7': 0, 'transformer.layers.8': 0, 'transformer.layers.9': 0, 'transformer.layers.10': 0, 'transformer.layers.11': 0, 'transformer.layers.12': 0, 'transformer.layers.13': 0, 'transformer.layers.14': 0, 'transformer.layers.15': 0, 'transformer.layers.16': 0, 'transformer.layers.17': 0, 'transformer.layers.18': 0, 'transformer.layers.19': 0, 'transformer.layers.20': 0, 'transformer.layers.21': 'cpu', 'transformer.layers.22': 'cpu', 'transformer.layers.23': 'cpu', 'transformer.layers.24': 'cpu', 'transformer.layers.25': 'cpu', 'transformer.layers.26': 'cpu', 'transformer.layers.27': 'cpu', 'transformer.final_layernorm': 'cpu', 'lm_head': 'cpu'}
    # GLMBlock must never be split across devices; offload_folder receives any disk-offloaded weights
    model = load_checkpoint_and_dispatch(model, "./THUDM/chatglm-6b", device_map=device_map, offload_folder="offload", offload_state_dict=True, no_split_module_classes=["GLMBlock"]).half()
    
    # left over from the gradio demo; unused by the console loop below
    def predict(input, history=None):
        print(f'predict started: {time.time()}')
        if history is None:
            history = []
        response, history = model.chat(tokenizer, input, history)
        return response, history
    
    history = []
    while True:
      text = input(">>User: ")
      response, history = model.chat(tokenizer, text, history)
      print(">>ChatGLM:", response)
    

    GPU memory usage is 9.7 GB and RAM usage is 5.9 GB. After the first-turn input "你好" (hello), GPU usage rises to 11.2 GB.
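    Rather than hardcoding which layers live on GPU 0 and which are offloaded, you can let accelerate derive the split from a memory budget. A sketch using infer_auto_device_map on the empty-weight model (the budgets below are our guesses for a 15 GB T4, not measured values):

    from accelerate import infer_auto_device_map

    # leave headroom on the GPU for activations and the KV cache
    device_map = infer_auto_device_map(
        model,
        max_memory={0: "9GiB", "cpu": "12GiB"},
        no_split_module_classes=["GLMBlock"],  # keep each transformer block on one device
        dtype="float16",                       # budget for the half-precision weights
    )
    print(device_map)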

    Option 3: accelerate with multiple GPUs

    import os
    # must be set before CUDA is initialized, and the variable name is case-sensitive
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
    
    from accelerate import infer_auto_device_map, init_empty_weights, load_checkpoint_and_dispatch
    from transformers import AutoConfig, AutoModel, AutoModelForCausalLM, AutoTokenizer
    # import gradio as gr
    # import torch
    import time
    
    
    tokenizer = AutoTokenizer.from_pretrained("./THUDM/chatglm-6b", trust_remote_code=True)
    config = AutoConfig.from_pretrained("./THUDM/chatglm-6b", trust_remote_code=True)
    # build the model skeleton on the meta device; no weights are materialized yet
    with init_empty_weights():
      model = AutoModel.from_config(config, trust_remote_code=True)
    
    for name, _ in model.named_parameters():
      print(name)
    # with more than one GPU there is no need for the manual map from Option 2:
    # device_map = infer_auto_device_map(model, no_split_module_classes=["GLMBlock"])
    # print(device_map)
    # device_map="balanced" spreads the GLMBlocks evenly across the visible GPUs
    model = load_checkpoint_and_dispatch(model, "./THUDM/chatglm-6b", device_map="balanced", offload_folder="offload", offload_state_dict=True, no_split_module_classes=["GLMBlock"]).half()
    
    def predict(input, history=None):
        print(f'predict started: {time.time()}')
        if history is None:
            history = []
        response, history = model.chat(tokenizer, input, history)
        return response, history
    
    history = []
    while True:
      text = input(">>User: ")
      response, history = model.chat(tokenizer, text, history)
      print(">>ChatGLM:", response)
    

    Note that here we set the device map to "balanced" and make only the first two GPUs visible. The original post shows a screenshot of the resulting per-GPU usage; the sketch below shows how to inspect it programmatically.
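    A minimal sketch for checking the placement and per-GPU memory from Python (hf_device_map is the attribute accelerate's dispatch attaches to the model, to the best of our knowledge):

    import torch

    # module-to-device placement chosen by accelerate
    print(getattr(model, "hf_device_map", None))

    # memory actually allocated by this process on each visible GPU
    for i in range(torch.cuda.device_count()):
        used = torch.cuda.memory_allocated(i) / 1024**3
        print(f"cuda:{i}: {used:.1f} GiB allocated")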

    References

    https://cloud.tencent.com/developer/article/2274903?areaSource=102001.17&traceId=dUu9a81soH3zQ5nQGczRV

    Original article: https://blog.csdn.net/weixin_42486623/article/details/132740746