• [Lessons Learned] Installing Microsoft DeepSpeed from Source on Ubuntu


    1. Background

    Use DeepSpeed to train large models in a distributed fashion across multiple servers.

    2. Installation

    2.1 Check the GPU parameters

    ~$ CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_capability())"
    (8, 0)
    ~$ CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_properties(torch.device('cuda')))"
    _CudaDeviceProperties(name='NVIDIA A800 80GB PCIe', major=8, minor=0, total_memory=81050MB, multi_processor_count=108)
    ~$ CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_arch_list())"
    ['sm_50', 'sm_60', 'sm_61', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_37', 'sm_90', 'compute_37']
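
    The capability reported above, (8, 0), is what later becomes TORCH_CUDA_ARCH_LIST="8.0". A minimal sketch of that conversion follows. The helper names arch_list_from_capabilities and local_gpu_capabilities are hypothetical and only for illustration; the torch calls are the same ones used in the commands above, plus torch.cuda.device_count().

```python
def arch_list_from_capabilities(capabilities):
    """Turn (major, minor) tuples into a TORCH_CUDA_ARCH_LIST style string.

    Duplicates are dropped and entries are sorted, so a node with several
    identical GPUs still yields a single entry such as "8.0".
    """
    unique = sorted(set(capabilities))
    return ";".join(f"{major}.{minor}" for major, minor in unique)


def local_gpu_capabilities():
    """Query every visible GPU; returns [] when torch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        torch.cuda.get_device_capability(index)
        for index in range(torch.cuda.device_count())
    ]


# Example: arch_list_from_capabilities(local_gpu_capabilities())
```

    On a machine with a single A800, the example call would be expected to produce the same "8.0" value that is typed by hand in section 2.2.4.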
    

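    2.2 Install from source

    2.2.1 Create a virtual environment

    Create a dedicated Anaconda environment for DeepSpeed by cloning an existing one (here, an environment named peft):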

    ~$ conda create -n deepspeed --clone peft
    

    2.2.2 Activate the environment

    ~$ conda activate deepspeed
    

    2.2.3 Install Transformers from source

    Following the official documentation, install Transformers with the command below:

    ~$ pip install git+https://github.com/huggingface/transformers
    

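    2.2.4 Install DeepSpeed from source

    Set TORCH_CUDA_ARCH_LIST to match the compute capability of the actual GPU (8.0 for the A800 above).
    To use CPU offload for optimizer states, set DS_BUILD_CPU_ADAM=1.
    DS_BUILD_UTILS=1 pre-builds the utility ops. To use NVMe offload, the Hugging Face documentation additionally calls for DS_BUILD_AIO=1 and a system-wide libaio-dev package; neither is part of the build shown here.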

    ~$ git clone https://github.com/microsoft/DeepSpeed/
    Cloning into 'DeepSpeed'...
    remote: Enumerating objects: 45020, done.
    remote: Counting objects: 100% (3618/3618), done.
    remote: Compressing objects: 100% (413/413), done.
    remote: Total 45020 (delta 3387), reused 3299 (delta 3202), pack-reused 41402
    Receiving objects: 100% (45020/45020), 207.74 MiB | 14.32 MiB/s, done.
    Resolving deltas: 100% (32479/32479), done.
    Updating files: 100% (1554/1554), done.
    
    ~$ cd DeepSpeed/
    ~$ TORCH_CUDA_ARCH_LIST="8.0" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 pip install . \
    --global-option="build_ext" --global-option="-j8" --no-cache -v \
    --disable-pip-version-check 2>&1 | tee build.log
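
    When the build has to be repeated on several servers, the same environment variables can be assembled in a script instead of being typed inline. The sketch below is one possible way to do that, not part of DeepSpeed itself: deepspeed_build_env and PIP_COMMAND are hypothetical names, and the variables and pip arguments are copied from the command above.

```python
import os


def deepspeed_build_env(arch_list, cpu_adam=True, utils=True, base_env=None):
    """Return a copy of the environment with the DeepSpeed build flags set.

    Flags are only added when enabled, so a disabled option is simply
    absent, as it would be on the command line.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["TORCH_CUDA_ARCH_LIST"] = arch_list
    if cpu_adam:
        env["DS_BUILD_CPU_ADAM"] = "1"
    if utils:
        env["DS_BUILD_UTILS"] = "1"
    return env


# Same pip arguments as the shell command above, as an argument list.
PIP_COMMAND = [
    "pip", "install", ".",
    "--global-option=build_ext", "--global-option=-j8",
    "--no-cache", "-v", "--disable-pip-version-check",
]

# Example (run from the parent of the cloned DeepSpeed directory):
#   import subprocess
#   subprocess.run(PIP_COMMAND, env=deepspeed_build_env("8.0"),
#                  cwd="DeepSpeed", check=True)
```

    Unlike the shell version, this example does not tee the output into build.log; redirecting stdout and stderr would have to be added separately.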
    

    Installation succeeded:

    ~$ pip show deepspeed
    Name: deepspeed
    Version: 0.14.3+fbdf0eaf
    Summary: DeepSpeed library
    Home-page: http://deepspeed.ai
    Author: DeepSpeed Team
    Author-email: deepspeed-info@microsoft.com
    License: Apache Software License 2.0
    Location: /public/home/acc5trotmy/.conda/envs/deepspeed/lib/python3.10/site-packages
    Requires: hjson, ninja, numpy, packaging, psutil, py-cpuinfo, pydantic, pynvml, torch, tqdm
    Required-by: 
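
    The version string 0.14.3+fbdf0eaf carries a local suffix after the "+", which distinguishes this source build from a plain release. The sketch below shows one way to check that from Python. The helper names are hypothetical, and reading the suffix as a git commit hash is an assumption based on how the string looks, not something the output above confirms.

```python
from importlib import metadata


def installed_deepspeed_version():
    """Return the installed deepspeed version string, or None if missing."""
    try:
        return metadata.version("deepspeed")
    except metadata.PackageNotFoundError:
        return None


def split_local_version(version):
    """Split "X.Y.Z+local" into (public, local); local is None if absent."""
    public, _, local = version.partition("+")
    return public, (local or None)


# Example:
#   version = installed_deepspeed_version()
#   if version is not None:
#       print(split_local_version(version))
```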
    

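    Running the deepspeed command with no arguments prints its usage. The warnings concern optional components (async_io, CUTLASS, sparse_attn, triton), and the final error only reflects that no user script was given: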

    ~$ deepspeed 
    [2024-04-24 12:05:52,629] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    df: /public/home/acc5trotmy/.triton/autotune: No such file or directory
     [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
     [WARNING]  async_io: please install the libaio-dev package with apt
     [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
     [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
     [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
     [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
    usage: deepspeed [-h] [-H HOSTFILE] [-i INCLUDE] [-e EXCLUDE] [--num_nodes NUM_NODES] [--min_elastic_nodes MIN_ELASTIC_NODES]
                     [--max_elastic_nodes MAX_ELASTIC_NODES] [--num_gpus NUM_GPUS] [--master_port MASTER_PORT] [--master_addr MASTER_ADDR]
                     [--launcher LAUNCHER] [--launcher_args LAUNCHER_ARGS] [--module] [--no_python] [--no_local_rank] [--no_ssh_check] [--force_multi]
                     [--save_pid] [--enable_each_rank_log ENABLE_EACH_RANK_LOG] [--autotuning {tune,run}] [--elastic_training] [--bind_cores_to_rank]
                     [--bind_core_list BIND_CORE_LIST] [--ssh_port SSH_PORT]
                     user_script ...
    deepspeed: error: the following arguments are required: user_script, user_args
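
    The async_io warnings above point at a missing libaio. As a rough pre-check before rebuilding, the standard library can be asked whether a libaio shared library is visible at all. This is only a sketch under a loose assumption: ctypes.util.find_library locates the runtime library, not the development headers that the warning also asks for, so a positive result does not guarantee that the async_io op will build. The function names are hypothetical.

```python
import ctypes.util


def find_shared_library(name):
    """Return the library name/path found by the system lookup, or None."""
    return ctypes.util.find_library(name)


def libaio_runtime_present():
    """True if some libaio shared library is visible to the dynamic linker."""
    return find_shared_library("aio") is not None


# Example:
#   if not libaio_runtime_present():
#       print("libaio not found; install libaio-dev before enabling async_io")
```

    DeepSpeed also ships the ds_report command, which lists each op together with whether it is installed and compatible, and is the more direct way to confirm the result after installing libaio-dev.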
    
  • Original article: https://blog.csdn.net/weixin_40378209/article/details/138152137