• Using GPU + RocksDB in Docker



    Environment setup


     dell@dell-Precision-3630-Tower  ~  lsb_release -a
    No LSB modules are available.
    Distributor ID:	Ubuntu
    Description:	Ubuntu 20.04.6 LTS
    Release:	20.04
    Codename:	focal
    
    dell@dell-Precision-3630-Tower  ~  nvcc --version
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2022 NVIDIA Corporation
    Built on Wed_Sep_21_10:33:58_PDT_2022
    Cuda compilation tools, release 11.8, V11.8.89
    Build cuda_11.8.r11.8/compiler.31833905_0
    
     dell@dell-Precision-3630-Tower  ~  docker version
    Client: Docker Engine - Community
     Version:           24.0.6
     API version:       1.43
     Go version:        go1.20.7
    
    
     OS/Arch:           linux/amd64
     Context:           default
    
    Server: Docker Engine - Community
     Engine:
      Version:          24.0.6
      API version:      1.43 (minimum version 1.12)
      Go version:       go1.20.7
    
      OS/Arch:          linux/amd64
      Experimental:     false
     containerd:
      Version:          1.6.24
    
     runc:
      Version:          1.1.9
    
     docker-init:
      Version:          0.19.0
    
    
    # Installed with: sudo apt-get install libcudnn8-dev=8.9.2.26-1+cuda11.8
    cuDNN: libcudnn8-dev=8.9.2.26-1+cuda11.8
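The `+cuda11.8` suffix of the cuDNN package states which CUDA release it was built for, so it should agree with what `nvcc` reports. A minimal sketch of that check, assuming the `nvcc --version` output format shown above; the function names are hypothetical helpers, not part of any tool:

```shell
# Hypothetical helpers: check that the cuDNN package targets the same CUDA
# release as the installed toolkit.

# Reads `nvcc --version` output on stdin and prints the release, e.g. "11.8".
cuda_release_from_nvcc() {
  sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

# Prints the CUDA release encoded in a cuDNN package version,
# e.g. "8.9.2.26-1+cuda11.8" -> "11.8".
cuda_release_from_cudnn() {
  printf '%s\n' "${1##*+cuda}"
}

# Reads nvcc output on stdin; $1 is the cuDNN package version.
# Returns 0 when both releases agree, otherwise reports them on stderr and returns 1.
check_cudnn_matches_cuda() {
  local nvcc_release cudnn_release
  nvcc_release=$(cuda_release_from_nvcc)
  cudnn_release=$(cuda_release_from_cudnn "$1")
  if [ -n "$nvcc_release" ] && [ "$nvcc_release" = "$cudnn_release" ]; then
    return 0
  fi
  echo "mismatch: nvcc reports CUDA ${nvcc_release:-unknown}, cuDNN targets CUDA ${cudnn_release}" >&2
  return 1
}
```

On the host this could be used as `nvcc --version | check_cudnn_matches_cuda "$(dpkg-query -W -f='${Version}' libcudnn8-dev)"`.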
    
    

    Directory structure


    [Figure: project directory structure]

    Differences between nvidia-docker and nvidia-container-toolkit (used by the native GPU support available since Docker 19.03):


    nvidia-docker

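    • Overview: nvidia-docker is the original tool for providing GPU support in Docker containers.
    • Command: nvidia-docker has its own command-line wrapper and was originally designed as a replacement for the docker command. You start a GPU container with nvidia-docker run.
    • Integration: version 1 relied on a Docker volume plugin (nvidia-docker-plugin) to mount the driver files into containers. Version 2 (nvidia-docker2) instead registers an nvidia runtime with the Docker daemon, so GPUs can be used from the standard docker command with the --runtime=nvidia flag.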

    nvidia-container-toolkit

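    • Overview: Docker 19.03 added native GPU support in the form of device requests, exposed as the --gpus flag. nvidia-container-toolkit is what lets Docker fulfil these requests, so the custom runtime from nvidia-docker is no longer needed.
    • Command: instead of nvidia-docker, you use the regular docker command and add the --gpus flag to enable GPUs, for example: docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu20.04 nvidia-smi
    • Integration: it is integrated more tightly with the Docker CLI, which gives better compatibility and a smoother experience.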

    Comparison and recommendation

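    • nvidia-docker version 1 is deprecated; version 2 is still used in some setups but is gradually being replaced by nvidia-container-toolkit.
    • For Docker 19.03 and later, the officially recommended tool is nvidia-container-toolkit, because it provides a simpler, standard way to use GPUs in containers.
    • With nvidia-container-toolkit, developers and operations teams can add GPU support to their existing Docker containers without changing their workflow.
    • You may still see nvidia-docker in older code and projects, but new projects should generally use nvidia-container-toolkit unless there is a specific reason not to.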
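The recommendation above boils down to a version check on the Docker server. A sketch, assuming the version string looks like the `24.0.6` printed by `docker version`; `gpu_flags` is a hypothetical helper:

```shell
# Hypothetical helper: pick the GPU options for `docker run` from the
# Docker server version.
#   19.03 and later: native --gpus support, backed by nvidia-container-toolkit
#   older versions:  the "nvidia" runtime registered by nvidia-docker2
gpu_flags() {
  local major rest minor
  major=${1%%.*}     # "19.03.15" -> "19"
  rest=${1#*.}       # "19.03.15" -> "03.15"
  minor=${rest%%.*}  # "03.15"    -> "03" (test compares it as decimal 3)
  if [ "$major" -gt 19 ] || { [ "$major" -eq 19 ] && [ "$minor" -ge 3 ]; }; then
    echo "--gpus all"
  else
    # With the legacy runtime, GPUs are selected through this variable.
    echo "--runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all"
  fi
}
```

Usage: `docker run --rm $(gpu_flags "$(docker version --format '{{.Server.Version}}')") nvidia/cuda:11.8.0-base-ubuntu20.04 nvidia-smi`. The outer `$(gpu_flags ...)` is left unquoted on purpose so that the flags split into separate arguments.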

    Installing the GPU toolkit nvidia-container-toolkit for Docker


    Reference:

    https://zhuanlan.zhihu.com/p/544713249

    sudo apt install curl
    distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
    sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
    sudo systemctl restart docker
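The commands above use `apt-key` and the `nvidia-docker` apt repository, both of which are deprecated. NVIDIA's current installation guide instead stores the key in a keyring file and pins the repository to it. A sketch of those steps wrapped in functions; the function names are hypothetical, and the URLs and commands should be checked against the current guide before running:

```shell
# Sketch of the keyring-based setup from NVIDIA's install guide for
# nvidia-container-toolkit. Sourcing this file installs nothing.
KEYRING=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
REPO_URL=https://nvidia.github.io/libnvidia-container

# Pin every "deb https://..." source line read on stdin to the keyring $1.
add_signed_by() {
  sed "s#deb https://#deb [signed-by=$1] https://#g"
}

# Runs in a subshell with `set -e`, so a failing step stops the install
# (for a pipeline, only the last command's status is checked).
install_nvidia_container_toolkit() (
  set -e
  # --yes lets a re-run overwrite an existing keyring file
  curl -fsSL "${REPO_URL}/gpgkey" | sudo gpg --batch --yes --dearmor -o "${KEYRING}"
  curl -fsSL "${REPO_URL}/stable/deb/nvidia-container-toolkit.list" \
    | add_signed_by "${KEYRING}" \
    | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
  sudo apt-get update
  sudo apt-get install -y nvidia-container-toolkit
  # Register the nvidia runtime in /etc/docker/daemon.json, then restart dockerd.
  sudo nvidia-ctk runtime configure --runtime=docker
  sudo systemctl restart docker
)
```

Source the file, then call `install_nvidia_container_toolkit` explicitly.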
    

    Pulling a CUDA image and building our own image


    NVIDIA publishes CUDA images on Docker Hub: https://hub.docker.com/r/nvidia/cuda

    The images come in three flavors:

    • base: Includes the CUDA runtime (cudart)

    • runtime: Builds on the base and includes the CUDA math libraries, and NCCL. A runtime image that also includes cuDNN is available.

    • devel: Builds on the runtime and includes headers, development tools for building CUDA images. These images are particularly useful for multi-stage builds.

    • NVIDIA Container Toolkit

    The NVIDIA Container Toolkit for Docker is required to run CUDA images.

    For CUDA 10.0, nvidia-docker2 (v2.1.0) or greater is recommended. It is also recommended to use Docker 19.03.
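An nvidia/cuda tag encodes the CUDA version, an optional cuDNN major version, the flavor and the OS. A hypothetical helper that assembles such a tag, assuming the `<version>[-cudnn<N>]-<flavor>-<os>` pattern seen on Docker Hub (for example `11.8.0-cudnn8-runtime-ubuntu20.04`); cuDNN variants exist only for runtime and devel:

```shell
# Hypothetical helper: build an nvidia/cuda image reference from its parts.
#   $1 CUDA version, $2 flavor (base|runtime|devel), $3 OS, $4 optional cuDNN major
cuda_image() {
  local version=$1 flavor=$2 os=$3 cudnn=${4:-}
  case $flavor in
    base|runtime|devel) ;;
    *) echo "unknown flavor: $flavor" >&2; return 1 ;;
  esac
  if [ -n "$cudnn" ] && [ "$flavor" = base ]; then
    echo "cuDNN tags exist only for runtime and devel" >&2
    return 1
  fi
  if [ -n "$cudnn" ]; then
    echo "nvidia/cuda:${version}-cudnn${cudnn}-${flavor}-${os}"
  else
    echo "nvidia/cuda:${version}-${flavor}-${os}"
  fi
}
```

`cuda_image 11.8.0 devel ubuntu20.04` yields the base image used by the Dockerfile below; devel is needed because building CUDA code requires the headers and nvcc.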

    In the end I wrote my own image, which provides a CUDA + RocksDB build environment:

    # from official ubuntu 20.04
    # FROM ubuntu:20.04
    # docker pull nvidia/cuda:11.8.0-devel-ubuntu20.04
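    FROM nvidia/cuda:11.8.0-devel-ubuntu20.04
    # keep apt/debconf from prompting (e.g. for tzdata) during the build;
    # unlike ENV, an ARG is not kept as an environment variable in the final image
    ARG DEBIAN_FRONTEND=noninteractive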
    
    # RUN mv /etc/apt/sources.list /etc/apt/sources_backup.list && \
    # echo "deb http://mirrors.ustc.edu.cn/ubuntu/ focal main restricted " >> /etc/apt/sources.list && \
    # echo "deb http://mirrors.ustc.edu.cn/ubuntu/ focal-updates main restricted " >> /etc/apt/sources.list && \
    # echo "deb http://mirrors.ustc.edu.cn/ubuntu/ focal universe " >> /etc/apt/sources.list && \
    # echo "deb http://mirrors.ustc.edu.cn/ubuntu/ focal-updates universe " >> /etc/apt/sources.list && \
    # echo "deb http://mirrors.ustc.edu.cn/ubuntu/ focal multiverse " >> /etc/apt/sources.list && \
    # echo "deb http://mirrors.ustc.edu.cn/ubuntu/ focal-updates multiverse " >> /etc/apt/sources.list && \
    # echo "deb http://mirrors.ustc.edu.cn/ubuntu/ focal-backports main restricted universe multiverse " >> /etc/apt/sources.list && \
    # echo "deb http://mirrors.ustc.edu.cn/ubuntu/ focal-security main restricted " >> /etc/apt/sources.list && \
    # echo "deb http://mirrors.ustc.edu.cn/ubuntu/ focal-security universe " >> /etc/apt/sources.list && \
    # echo "deb http://mirrors.ustc.edu.cn/ubuntu/ focal-security multiverse " >> /etc/apt/sources.list && \
    # echo "deb http://archive.canonical.com/ubuntu focal partner " >> /etc/apt/sources.list
    # update system
    RUN ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime && echo 'Asia/Shanghai' >/etc/timezone \
        && apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub \
        && apt clean && apt update && apt install -yq --no-install-recommends sudo \
        && sudo apt install -yq --no-install-recommends python3 python3-pip libgl1-mesa-glx libglib2.0-0 libsm6 libxext6 libxrender-dev openssh-server \
        && sudo pip3 install --upgrade pip \
        && sudo pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple \
        && sudo pip3 install setuptools
    
    RUN apt-get update && apt-get upgrade -y
    # install basic tools
    RUN apt-get install -y vim wget curl
    # install tzdata noninteractive
    RUN DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC apt-get -y install tzdata
    # install git and default compilers
    RUN apt-get install -y git gcc g++ clang clang-tools
    # install basic package
    RUN apt-get install -y lsb-release software-properties-common gnupg
    # install gflags, tbb
    RUN apt-get install -y libgflags-dev libtbb-dev
    # install compression libs
    RUN apt-get install -y libsnappy-dev zlib1g-dev libbz2-dev liblz4-dev libzstd-dev
    # install cmake
    RUN apt-get install -y cmake
    RUN apt-get install -y libssl-dev
    # install clang-13
    WORKDIR /root
    RUN wget https://apt.llvm.org/llvm.sh
    RUN chmod +x llvm.sh
    RUN ./llvm.sh 13 all
    # install gcc-7, 8 and 10 (default is 9; gcc-11 is commented out below)
    RUN apt-get install -y gcc-7 g++-7
    RUN apt-get install -y gcc-8 g++-8
    RUN apt-get install -y gcc-10 g++-10
    RUN echo "deb https://ppa.launchpadcontent.net/ubuntu-toolchain-r/test/ubuntu focal main" |tee -a /etc/apt/sources.list
    RUN echo "deb-src https://ppa.launchpadcontent.net/ubuntu-toolchain-r/test/ubuntu focal main" |tee -a /etc/apt/sources.list
    RUN curl -sL "http://keyserver.ubuntu.com/pks/lookup?op=get&search=0x60C317803A41BA51845E371A1E9377A2BA9EF27F" |apt-key add
    #RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 60C317803A41BA51845E371A1E9377A2BA9EF27F
    RUN add-apt-repository -y ppa:ubuntu-toolchain-r/test
    RUN apt-get update && apt-get upgrade -y
    #RUN apt-get install -y gcc-11 g++-11
    # install valgrind
    RUN apt-get install -y valgrind
    # install folly dependencies
    RUN apt-get install -y libgoogle-glog-dev
    # install openjdk 8
    RUN apt-get install -y openjdk-8-jdk
    ENV JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
    # install mingw
    RUN apt-get install -y mingw-w64
    
    # install gtest-parallel package
    RUN git clone --single-branch --branch master --depth 1 https://github.com/google/gtest-parallel.git ~/gtest-parallel
    ENV PATH=$PATH:/root/gtest-parallel
    
    # install libprotobuf for fuzzers test
    RUN apt-get install -y ninja-build binutils liblzma-dev libz-dev pkg-config autoconf libtool
    # work around "GnuTLS recv error" when cloning from GitHub over HTTPS
    RUN apt-get update
    RUN apt-get upgrade -y
    RUN apt-get install -y --reinstall ca-certificates
    RUN git clone --branch v1.0 https://github.com/google/libprotobuf-mutator.git ~/libprotobuf-mutator && cd ~/libprotobuf-mutator && git checkout ffd86a32874e5c08a143019aad1aaf0907294c9f && mkdir build && cd build && cmake .. -GNinja -DCMAKE_C_COMPILER=clang-13 -DCMAKE_CXX_COMPILER=clang++-13 -DCMAKE_BUILD_TYPE=Release -DLIB_PROTO_MUTATOR_DOWNLOAD_PROTOBUF=ON && ninja && ninja install
    ENV PKG_CONFIG_PATH=/usr/local/OFF/:/root/libprotobuf-mutator/build/external.protobuf/lib/pkgconfig/
    ENV PROTOC_BIN=/root/libprotobuf-mutator/build/external.protobuf/bin/protoc
    
    # install google benchmark v1.7.0
    RUN git clone --depth 1 --branch v1.7.0 https://github.com/google/benchmark.git ~/benchmark
    RUN cd ~/benchmark && mkdir build && cd build && cmake .. -GNinja -DCMAKE_BUILD_TYPE=Release -DBENCHMARK_ENABLE_GTEST_TESTS=0 && ninja && ninja install
    
    # # clean up
    # RUN rm -rf /var/lib/apt/lists/*
    # RUN rm -rf /root/benchmark
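After the image is built, a quick sanity check is to confirm that the tools installed by the Dockerfile can be found on PATH inside the container. A minimal sketch; `missing_tools` is a hypothetical helper:

```shell
# Hypothetical smoke test: print every requested command that is not on PATH
# and return 1 if any is missing, 0 otherwise.
missing_tools() {
  local status=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
      status=1
    fi
  done
  return "$status"
}
```

Inside the container, for example: `missing_tools nvcc gcc-10 g++-10 clang-13 cmake ninja valgrind java || echo "toolchain incomplete"`.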
    
    #!/usr/bin/env bash
    # build-image.sh: build the image from the Dockerfile in this directory
    
    
    SHELL_HOME=$(
      cd "$(dirname "$0")" || exit
      pwd
    )
    source "${SHELL_HOME}/../dev.conf"
    
    # To build behind a proxy, pass the proxy settings as build args:
    # docker build \
    #   --build-arg http_proxy=xxx \
    #   --build-arg https_proxy=xxx \
    #   --build-arg all_proxy=socks5://xxx \
    #   --tag "${IMAGE_NAME}:${IMAGE_VERSION}" "${SHELL_HOME}"
    
    docker build --tag "${IMAGE_NAME}:${IMAGE_VERSION}" "${SHELL_HOME}"
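build-image.sh, start.sh and init-docker-network.sh all source a dev.conf that the post does not show. From the variables they use, it has to define at least the four below. The values are assumptions, except that BRIDGE_NAME must match the `--network=rocksdb-br` option in start.sh:

```shell
# dev.conf (hypothetical contents; not shown in the original post).
# Sourced by build-image.sh, start.sh and init-docker-network.sh.
IMAGE_NAME="rocksdb-gpu-dev"   # assumed image name
IMAGE_VERSION="0.1"            # assumed tag
BRIDGE_NAME="rocksdb-br"       # must match --network=rocksdb-br in start.sh
SUBNET="172.28.0.0/16"         # assumed; choose a range not used on the host
```

With these values, build-image.sh would tag the image `rocksdb-gpu-dev:0.1`.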
    

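    Running the container


    Reference: https://blog.csdn.net/Maid_Li/article/details/124952650

    When starting the container, remember to pass a few CUDA-related options:

    • --gpus all and -e NVIDIA_VISIBLE_DEVICES=all select which GPUs the container can see; exposing all of them is the simplest choice.
    • -e NVIDIA_DRIVER_CAPABILITIES=compute,utility enables the driver capabilities the container needs: compute for CUDA and utility for tools such as nvidia-smi.
    • The following is start.sh: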
    #!/usr/bin/env bash
    
    # directory that contains this script
    SHELL_HOME=$(
      cd "$(dirname "$0")" || exit
      pwd
    )
    source "${SHELL_HOME}"/../dev.conf
    source "${SHELL_HOME}"/utilities/rocks.conf
    
    CONTAINER_NAME="rocksdb-gpu"
    
    # work dir inside the dev container
    SOURCE_DIR_INSIDE="/home/baum/GPU_ROCKS"
    # source directory on the host
    SOURCE_DIR="/nvme/baum/git-project/GPU_ROCKS"
    WORK_DIR=/rocks
    RECREATE_CONTAINER=""
    
    # How I run it: ./start.sh -s /nvme/baum/git-project/GPU_ROCKS
    function show_usage() {
      echo "
      Start a gdb container for RocksDB.

      Usage:
        ./start.sh
        ./start.sh -s /path/to/your/rocksdb/source


      Options:
        -s                Project path of the RocksDB source, default is '/nvme/baum/git-project/GPU_ROCKS'.
        -r                Recreate the dev container.
        -h                Show this message.
      "
      exit
    }
    
    while getopts "s:hr" opt; do
      case $opt in
      s)
        SOURCE_DIR=${OPTARG}
        ;;
      r)
        RECREATE_CONTAINER="true"
        ;;
      h)
        show_usage
        ;;
      *)
        show_usage
        ;;
      esac
    done
    
    CONTAINER_RUNNING=$(docker container ls | grep "${CONTAINER_NAME}")
    CONTAINER_EXISTED=$(docker container ls -a | grep "${CONTAINER_NAME}")
    
    if [[ ${RECREATE_CONTAINER} == "true" && -n ${CONTAINER_EXISTED} ]]; then
      echo "remove the existing rocksdb-gpu container ..."
      docker rm -f "${CONTAINER_NAME}"
      CONTAINER_EXISTED=""
    fi
    
    echo "current SOURCE_DIR is '${SOURCE_DIR}'"
    
    if [[ -z ${CONTAINER_EXISTED} ]]; then
      echo "starting the rocksdb-gpu environment 1 ..."
      # -v mounts a host directory into the container: host path on the left, container path on the right
      docker run -it -v "${SOURCE_DIR}":/rocks \
        -v "${SOURCE_DIR}":${SOURCE_DIR_INSIDE} \
        --name ${CONTAINER_NAME} \
        --publish "${ROCKS_PORT}"-"${GDB_PORT}":"${ROCKS_PORT}"-"${GDB_PORT}" \
        --network=rocksdb-br \
        --gpus all \
        -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
        -e NVIDIA_VISIBLE_DEVICES=all \
        --workdir ${WORK_DIR} \
        "${IMAGE_NAME}:${IMAGE_VERSION}" \
        bash
      exit
    fi
    
    if [[ -z ${CONTAINER_RUNNING} ]]; then
      echo "starting rocksdb-gpu environment 2 ..."
      docker start "${CONTAINER_NAME}"
    fi
    
    echo "logging into rocksdb-gpu environment '${CONTAINER_NAME}' ..."
    docker exec -it "${CONTAINER_NAME}" bash
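start.sh detects the container with `docker container ls | grep`, which also matches any container whose name or image merely contains `rocksdb-gpu`. A sketch of an exact-name check based on `docker container inspect`; `container_state` is a hypothetical helper:

```shell
# Hypothetical helper: report "running", "stopped" (created or exited) or
# "missing" for the container with exactly the given name.
container_state() {
  local running
  if ! running=$(docker container inspect --format '{{.State.Running}}' "$1" 2>/dev/null); then
    echo missing
  elif [ "$running" = true ]; then
    echo running
  else
    echo stopped
  fi
}
```

start.sh could then branch on `container_state "${CONTAINER_NAME}"`, running `docker run` for `missing` and `docker start` for `stopped` before the final `docker exec`.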
    

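    Network configuration


    Host ports 16017-16019 are mapped to the same ports 16017-16019 in the container (the --publish option in start.sh).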
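start.sh builds this mapping from ROCKS_PORT and GDB_PORT, which come from utilities/rocks.conf (not shown in the post). A sketch assuming ROCKS_PORT=16017 and GDB_PORT=16019, consistent with the mapping above; `publish_range` is a hypothetical helper that produces the `--publish` value:

```shell
# utilities/rocks.conf (hypothetical contents matching 16017-16019)
ROCKS_PORT=16017
GDB_PORT=16019

# Prints "<first>-<last>:<first>-<last>": the host port range mapped
# one-to-one onto the same container range, as start.sh passes to --publish.
publish_range() {
  if [ "$1" -gt "$2" ]; then
    echo "invalid port range: $1-$2" >&2
    return 1
  fi
  echo "$1-$2:$1-$2"
}
```

For example, `docker run --publish "$(publish_range "$ROCKS_PORT" "$GDB_PORT")" ...` expands to the same `--publish 16017-16019:16017-16019` that start.sh builds inline.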

    #!/usr/bin/env bash
    # init-docker-network.sh: create the bridge network that start.sh joins
    
    
    SHELL_HOME=$(
      cd "$(dirname "$0")" || exit
      pwd
    )
    source "${SHELL_HOME}"/dev.conf
    
    
    
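    echo "create network bridge for rocks ..."
    # skip creation if the network already exists (docker network create would fail)
    if ! docker network inspect "${BRIDGE_NAME}" >/dev/null 2>&1; then
      docker network create --subnet="${SUBNET}" "${BRIDGE_NAME}"
    fi
    docker network list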
    

    References:

    https://github.com/cnstark/pytorch-docker/blob/main/scripts/build_2.0.1_py3.9.17_cuda11.8.0_devel_ubuntu20.04.sh

    https://zhuanlan.zhihu.com/p/544713249

  • Original article: https://blog.csdn.net/weixin_40579705/article/details/133829638