• k8s deployment problems and solutions


    1. Using GPUs in a k8s cluster

    First, install nvidia-container-runtime:

    yum install nvidia-container-runtime
    

    Then edit /etc/docker/daemon.json and add the following:

      "default-runtime": "nvidia",
      "runtimes": {
            "nvidia": {
                    "path": "/usr/bin/nvidia-container-runtime",
                    "runtimeArgs": []
            }
      },
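
The fragment above is not a complete JSON document on its own: it ends with a trailing comma, so it has to sit before other keys inside the top-level object of daemon.json. As a minimal sketch, assuming the file holds no other settings (merge by hand if yours does), the helper below prints a complete file. The function name `nvidia_daemon_json` is hypothetical and only used for illustration.

```shell
# Hypothetical helper: print a complete daemon.json that contains only the
# NVIDIA runtime settings from the fragment above.
# Assumption: no other Docker daemon options need to be preserved.
nvidia_daemon_json() {
    cat <<'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF
}
```

Review the output first, then redirect it, for example `nvidia_daemon_json > /etc/docker/daemon.json`, after backing up any existing file.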
    

    Then reload the systemd configuration and restart Docker:

    systemctl daemon-reload
    systemctl restart docker
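
After the restart it may be worth confirming that Docker actually picked up the new default runtime. The sketch below checks the text printed by `docker info`, which includes a `Default Runtime:` line; the function name `has_nvidia_default_runtime` is hypothetical.

```shell
# Hypothetical helper: succeed if the given `docker info` text reports
# nvidia as the default runtime.
has_nvidia_default_runtime() {
    printf '%s\n' "$1" | grep -q 'Default Runtime: nvidia'
}
```

A possible invocation is `has_nvidia_default_runtime "$(docker info 2>/dev/null)" && echo "nvidia is the default runtime"`. If nothing is printed, recheck daemon.json for syntax errors.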
    

    Next, install the k8s plugin nvidia-device-plugin. Save the following manifest as nvidia-device-plugin.yml:

    # Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: nvidia-device-plugin-daemonset
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          name: nvidia-device-plugin-ds
      updateStrategy:
        type: RollingUpdate
      template:
        metadata:
          # This annotation is deprecated. Kept here for backward compatibility
          # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
          annotations:
            scheduler.alpha.kubernetes.io/critical-pod: ""
          labels:
            name: nvidia-device-plugin-ds
        spec:
          tolerations:
          # This toleration is deprecated. Kept here for backward compatibility
          # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
          - key: CriticalAddonsOnly
            operator: Exists
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule
          # Mark this pod as a critical add-on; when enabled, the critical add-on
          # scheduler reserves resources for critical add-on pods so that they can
          # be rescheduled after a failure.
          # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
          priorityClassName: "system-node-critical"
          containers:
          - image: nvidia/k8s-device-plugin:1.11
            name: nvidia-device-plugin-ctr
            securityContext:
              allowPrivilegeEscalation: false
              capabilities:
                drop: ["ALL"]
            volumeMounts:
              - name: device-plugin
                mountPath: /var/lib/kubelet/device-plugins
          volumes:
            - name: device-plugin
              hostPath:
                path: /var/lib/kubelet/device-plugins
    

    Finally, run:

    kubectl create -f nvidia-device-plugin.yml
    

    At this point the cluster can use GPUs.
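
One way to check this is to schedule a pod that requests the `nvidia.com/gpu` resource advertised by the device plugin. The sketch below prints such a manifest. The function name `gpu_test_pod_manifest` and the pod name `gpu-test` are hypothetical, and the image is passed in as an argument because no particular CUDA image is assumed here.

```shell
# Hypothetical helper: print a manifest for a one-shot pod that requests a
# single GPU and runs nvidia-smi.
# $1 = a CUDA-capable image of your choice (no specific image is assumed).
gpu_test_pod_manifest() {
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: gpu-test
    image: $1
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
}
```

To try it, pipe the output into `kubectl apply -f -` and read the result with `kubectl logs gpu-test`. Running `kubectl describe nodes | grep nvidia.com/gpu` should also show the GPU count on each GPU node.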

    2. DNS configuration when using hostNetwork

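    When a pod's YAML sets hostNetwork: true, the pod communicates over the host's network. The default dnsPolicy is ClusterFirst, but for a hostNetwork pod that setting falls back to the node's own DNS configuration, so the pod can no longer resolve the cluster's services by name. To fix this, set the DNS policy to ClusterFirstWithHostNet: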

          hostNetwork: true
          dnsPolicy: ClusterFirstWithHostNet
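
To confirm the policy took effect, the pod's /etc/resolv.conf can be inspected: with cluster DNS in use, its search line lists the service domain. The sketch below assumes the default cluster domain `cluster.local`; the function name `uses_cluster_dns` is hypothetical.

```shell
# Hypothetical helper: succeed if the given resolv.conf text searches the
# cluster service domain.
# Assumption: the cluster domain is the default "cluster.local".
uses_cluster_dns() {
    printf '%s\n' "$1" | grep -Eq '^search .*svc\.cluster\.local'
}
```

A possible invocation is `uses_cluster_dns "$(kubectl exec {pod-name} -- cat /etc/resolv.conf)" && echo "cluster DNS in use"`, where {pod-name} is a placeholder for your own pod.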
    

    3. Setting up an image pull secret

    First, log in to the image registry on the server:

    docker login {imageurl}  # replace {imageurl} with your own registry address
    

    Then create a secret in the target namespace:

    kubectl create secret generic {image-secret} --from-file=.dockerconfigjson=/root/.docker/config.json  --type=kubernetes.io/dockerconfigjson -n test
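
The secret only takes effect once a pod in the same namespace references it through `imagePullSecrets`. As a minimal sketch, the helper below prints such a pod manifest for the `test` namespace used above; the function name `pod_with_pull_secret` and its arguments are hypothetical.

```shell
# Hypothetical helper: print a pod manifest that pulls its image with the
# secret created above.
# $1 = pod name, $2 = image in the private registry, $3 = secret name
pod_with_pull_secret() {
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: $1
  namespace: test
spec:
  imagePullSecrets:
  - name: $3
  containers:
  - name: $1
    image: $2
EOF
}
```

Pipe the output into `kubectl apply -f -` to create the pod. The secret must live in the same namespace as the pod that uses it.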
    
  • Original article: https://blog.csdn.net/LPJCSY/article/details/126484421