• Using GPUs in k8s Pods


    Prerequisites for a k8s pod to use a GPU

    1. The k8s node has a GPU card
    2. The GPU driver is installed on the node
    3. The node's Docker or containerd runtime uses nvidia-container-runtime
    4. The GPU device plugin DaemonSet is deployed to the cluster
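
    Once all four prerequisites are met, the node advertises nvidia.com/gpu as an allocatable resource. A quick way to check this from any machine with kubectl access (a sketch; the GPU column name here is just a label chosen for readability):

    kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'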

    1. Install the GPU driver

    Check the GPU model on the node:

    nvidia-smi  -L
    GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-f2b15a66-0630-5f77-1f17-28abb3854f1c)
    
    # If no driver is installed yet, the command above won't work; use lspci instead
    lspci | grep -i nvidia
    00:03.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
    00:04.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1) 
    

    Look up the device ID 1eb8 on this site:
    http://pci-ids.ucw.cz/mods/PC/10de?action=help?help=pci
    [screenshot: PCI ID lookup result on pci-ids.ucw.cz]
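
    Alternatively, lspci -nn prints the vendor:device ID pair directly (10de is NVIDIA's PCI vendor ID), which saves the manual lookup; the output line below is illustrative:

    lspci -nn | grep -i nvidia
    # 00:03.0 3D controller [0302]: NVIDIA Corporation Device [10de:1eb8] (rev a1)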

    Then find the driver installer for that model here:
    https://www.nvidia.com/Download/Find.aspx#

    # Download the installer
    wget https://us.download.nvidia.com/tesla/515.65.01/NVIDIA-Linux-x86_64-515.65.01.run
    
    chmod +x NVIDIA-Linux-x86_64-515.65.01.run
    
    # The installer depends on these packages; install them first
    apt install gcc linux-headers-$(uname -r) dkms
    sh NVIDIA-Linux-x86_64-515.65.01.run --ui=none --disable-nouveau --no-install-libglvnd --dkms -s
    
    # Verify the installation with the command below
    nvidia-smi
    Thu Nov  3 19:17:50 2022       
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM2...  On   | 00000000:00:08.0 Off |                    0 |
    | N/A   36C    P0    37W / 300W |      4MiB / 32768MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
                                                                                   
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |    0   N/A  N/A      1499      G   /usr/lib/xorg/Xorg                  4MiB |
    +-----------------------------------------------------------------------------+
    
    

    2. Install nvidia-container-runtime

    curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | sudo apt-key add -
    
    curl -s -L https://nvidia.github.io/nvidia-container-runtime/$(. /etc/os-release;echo $ID$VERSION_ID)/nvidia-container-runtime.list | sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
    
    apt update
    
    apt install nvidia-container-runtime -y
    
    
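    Before wiring the runtime into Kubernetes, it is worth confirming that Docker alone can reach the GPU. A minimal sketch, assuming Docker 19.03+ (for the --gpus flag) and that the package above pulled in nvidia-container-toolkit; the CUDA image tag is only an example:

    docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi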

    2.1 Change the default runtime

    2.1.1 When the CRI is Docker

    Edit /etc/docker/daemon.json, adding the default-runtime and runtimes entries:

    {
      "default-runtime": "nvidia",
      "runtimes": {
          "nvidia": {
              "path": "/usr/bin/nvidia-container-runtime",
              "runtimeArgs": []
          }
      },
      "registry-mirrors": [
         "https://docker.mirrors.ustc.edu.cn/",
         "https://hub-mirror.c.163.com/"
      ],
      "max-concurrent-downloads": 10,
      "log-driver": "json-file",
      "log-level": "warn",
      "log-opts": {
        "max-size": "10m",
        "max-file": "3"
        },
      "insecure-registries":
            ["127.0.0.1","192.168.12.12:8888"],
      "data-root":"/data/docker",
      "features":{"buildkit": true}
    }
    

    Restart Docker:
    systemctl restart docker
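
    To confirm the default runtime actually changed, check docker info; a sketch:

    docker info | grep -i 'default runtime'
    # Default Runtime: nvidia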

    2.1.2 When the CRI is containerd

    Edit /etc/containerd/config.toml. If the file does not exist, generate the default configuration first:

    mkdir /etc/containerd
    containerd config default > /etc/containerd/config.toml
    vi /etc/containerd/config.toml
    ...
        [plugins."io.containerd.grpc.v1.cri".containerd]
          snapshotter = "overlayfs"
          default_runtime_name = "runc"
          no_pivot = false
    ...
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
            [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
              runtime_type = "io.containerd.runtime.v1.linux" # change runtime_type here to io.containerd.runtime.v1.linux
    ...
      [plugins."io.containerd.runtime.v1.linux"]
        shim = "containerd-shim"
        runtime = "nvidia-container-runtime" # change runtime here to nvidia-container-runtime
    ...
    

    Restart containerd:
    systemctl restart containerd
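
    After the restart, containerd config dump prints the merged configuration containerd is actually running with, which makes it easy to verify the edits took effect; a sketch:

    containerd config dump | grep nvidia
    # runtime = "nvidia-container-runtime"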

    3. Deploy the nvidia-device-plugin

    kubectl apply -f nvidia-device-plugin.yaml

    # Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: nvidia-device-plugin-daemonset
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          name: nvidia-device-plugin-ds
      updateStrategy:
        type: RollingUpdate
      template:
        metadata:
          # This annotation is deprecated. Kept here for backward compatibility
          # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
          annotations:
            scheduler.alpha.kubernetes.io/critical-pod: ""
          labels:
            name: nvidia-device-plugin-ds
        spec:
          tolerations:
          # This toleration is deprecated. Kept here for backward compatibility
          # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
          - key: CriticalAddonsOnly
            operator: Exists
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule
          # Mark this pod as a critical add-on; when enabled, the critical add-on
          # scheduler reserves resources for critical add-on pods so that they can
          # be rescheduled after a failure.
          # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
          priorityClassName: "system-node-critical"
          containers:
          - image: nvidia/k8s-device-plugin:v0.7.1
            name: nvidia-device-plugin-ctr
            args: ["--fail-on-init-error=false"]
            securityContext:
              allowPrivilegeEscalation: false
              capabilities:
                drop: ["ALL"]
            volumeMounts:
              - name: device-plugin
                mountPath: /var/lib/kubelet/device-plugins
          volumes:
            - name: device-plugin
              hostPath:
                path: /var/lib/kubelet/device-plugins
    
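    If the DaemonSet is Running but no GPUs show up on the node, the plugin's logs usually explain why (for example, it cannot initialize NVML when the driver or runtime is misconfigured). A sketch using the pod label from the manifest above:

    kubectl -n kube-system logs -l name=nvidia-device-plugin-ds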

    4. Verify the kubelet recognizes the GPU

    Check that the device plugin pod started normally:
    kubectl get pod -n kube-system -o wide
    [screenshot: nvidia-device-plugin pod in Running state]
    Use describe node to check whether the GPU is recognized:
    kubectl describe node vm-1-5-ubuntu
    [screenshot: describe node output listing nvidia.com/gpu]
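
    With the driver, runtime, and device plugin all in place, nvidia.com/gpu should appear under both Capacity and Allocatable; a grep sketch, where the count depends on the hardware:

    kubectl describe node vm-1-5-ubuntu | grep nvidia.com/gpu
    # nvidia.com/gpu:  1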
    Test launching a GPU pod:
    kubectl apply -f gpu-pod.yaml

    
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-pod
    spec:
      containers:
        - name: cuda-container
          image: nvidia/cuda:9.0-devel
          resources:
            limits:
              nvidia.com/gpu: 1
        - name: digits-container
          image: nvidia/digits:6.0
          resources:
            limits:
              nvidia.com/gpu: 1
    
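    Note that this example pod requests two GPUs in total (one per container), so it only schedules on a node with at least two free GPUs. Once the pod is Running, a quick functional check is to run nvidia-smi inside one of the containers; a sketch:

    kubectl exec gpu-pod -c cuda-container -- nvidia-smi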

    5. GPU sharing

    The approach above gives pods exclusive use of GPUs: GPU resources are exposed to the Kubernetes cluster as whole-device counts for scheduling. This means that if two application pods both need GPU resources but the node has only one physical GPU card, only one of the two will run normally; the other pod stays in the Pending state.
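
    This whole-device behavior follows from nvidia.com/gpu being a Kubernetes extended resource, and Kubernetes restricts extended resources to integer quantities, so a fractional request like the sketch below is rejected by the API server:

    # Invalid: extended resource quantities must be whole integers
    resources:
      limits:
        nvidia.com/gpu: 0.5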

    Alibaba Cloud, UCloud, and other providers publish documentation on configuring GPU sharing; for non-cloud clusters there are open-source solutions available online as well.

  • Original article: https://blog.csdn.net/ledrsnet/article/details/127676495