Reference for the first two steps: "Ubuntu cloud-native environment installation, docker+k8s+kubeedge (tested and working)" — CSDN blog (ubuntu kubeedge deployment)
Configure GPU support on the nodes:
"K8S GPU resource configuration guide" — CSDN blog (using GPUs with k8s)
- kubeadm reset
- iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X
Run the following only on the master node; set apiserver-advertise-address to the master node's IP
- sudo kubeadm init \
- --apiserver-advertise-address=192.168.1.117 \
- --control-plane-endpoint=node4212 \
- --image-repository registry.cn-hangzhou.aliyuncs.com/google_containers \
- --kubernetes-version v1.21.10 \
- --service-cidr=10.96.0.0/12 \
- --pod-network-cidr=10.244.0.0/16
- mkdir -p $HOME/.kube
- sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
- sudo chown $(id -u):$(id -g) $HOME/.kube/config
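The same init flags can equivalently be captured in a kubeadm config file (a sketch using the kubeadm.k8s.io/v1beta2 API that v1.21 accepts; the IP, hostname, and subnets are the ones from the command above):

```yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 192.168.1.117
---
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
kubernetesVersion: v1.21.10
controlPlaneEndpoint: node4212
imageRepository: registry.cn-hangzhou.aliyuncs.com/google_containers
networking:
  serviceSubnet: 10.96.0.0/12
  podSubnet: 10.244.0.0/16
```

With this saved as e.g. kubeadm.yaml, `sudo kubeadm init --config kubeadm.yaml` replaces the long flag list.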
After the master finishes initializing, run the join command it prints on each worker node
- curl https://docs.projectcalico.org/manifests/calico.yaml -O
- kubectl apply -f calico.yaml
- sudo kubectl apply -f /dashbord.yaml
-
- sudo kubectl edit svc kubernetes-dashboard -n kubernetes-dashboard
- Change type: ClusterIP to type: NodePort
- # find the assigned port and disable the corresponding firewall rule
- sudo kubectl get svc -A |grep kubernetes-dashboard

Any node's IP plus port 31678 is the actual access URL (e.g. https://192.168.1.109:31678/)
- kubectl get pods --all-namespaces -o wide
- # check pod status
- kubectl describe pod kubernetes-dashboard-57c9bfc8c8-lmb67 --namespace kubernetes-dashboard
- # print the logs
- kubectl logs nvidia-device-plugin-daemonset-xn7hx --namespace kube-system
kubectl apply -f /dashuser.yaml
kubectl -n kubernetes-dashboard get secret $(kubectl -n kubernetes-dashboard get sa/admin-user -o jsonpath="{.secrets[0].name}") -o go-template="{{.data.token | base64decode}}"
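The contents of /dashuser.yaml are not shown above; a typical admin-user manifest for the dashboard (a sketch based on the standard kubernetes-dashboard docs, not necessarily the author's exact file) looks like:

```yaml
# ServiceAccount the token above belongs to, bound to cluster-admin.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: admin-user
  namespace: kubernetes-dashboard
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: admin-user
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: admin-user
  namespace: kubernetes-dashboard
```

The `get secret … base64decode` command above then prints the login token for this account.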

View local images: docker images
Log in to your Docker Hub account: docker login
Tag the image — left: local-name:tag, right: hub-username/repo-name:tag
docker tag deeplabv3plus:1.0.0 chenzishu/deepmodel:labv3
Push to Docker Hub:
docker push chenzishu/deepmodel:labv3

The application name is arbitrary; for the image address, use the corresponding image on Docker Hub (chenzishu/deepmodel:pytorch)
Wait for the container to start; this takes some time

- ########
- # If the pod keeps restarting after startup and reports Back-off restarting failed container,
- # add the following to the corresponding deployment:
- command: ["/bin/bash", "-ce", "tail -f /dev/null"]
- ########

List local containers: docker ps -a
Find the pod:
kubectl get pods --all-namespaces -o wide
Enter the container:
kubectl exec -it segnet-747b798bf5-4bjqk -- /bin/bash
List the files inside the container:
ls
Run nvidia-smi to check whether the container can access the GPU
First install nvidia-docker2:
- distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
- && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
- && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
- sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
- sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
- sudo apt-get update
-
- sudo apt-get install -y nvidia-docker2
-
- sudo systemctl restart docker
-
- #测试
- sudo docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
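The sed step in the pipeline above rewrites each repo line so apt uses the pinned keyring; its effect can be checked in isolation (the sample line below is illustrative):

```shell
# Apply the same substitution the install pipeline uses to one sample repo line;
# "deb https://" gains a [signed-by=...] option pointing at the keyring file.
line='deb https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/$(ARCH) /'
echo "$line" | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g'
```

This is also what to look for when debugging the Signed-By conflict described next: every nvidia.github.io source line should carry exactly this one signed-by option.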
apt-get update may fail; see the official docs section "Conflicting values set for option Signed-By error when running apt update":
- E: Conflicting values set for option Signed-By regarding source https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/amd64/ /: /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg !=
- E: The list of sources could not be read.
Fix:
grep -l "nvidia.github.io" /etc/apt/sources.list.d/* | grep -vE "/nvidia-container-toolkit.list\$"
Delete the files this command lists
Before installing gpushare-device-plugin, make sure the GPU node already has the NVIDIA driver and nvidia-docker2 installed, and that docker's default runtime is set to nvidia:
Configure the runtime in /etc/docker/daemon.json:
- {
-   "default-runtime": "nvidia",
-   "runtimes": {
-     "nvidia": {
-       "path": "/usr/bin/nvidia-container-runtime",
-       "runtimeArgs": []
-     }
-   }
- }
Then follow the Alibaba Cloud developer article, which is very detailed, on configuring and using nvidia-share: https://developer.aliyun.com/article/690623
"Using Alibaba Cloud GPU sharing for GPU scheduling in a K8S cluster" (dianjilingqu.com)
Deploy the GPU-sharing scheduler extender gpushare-schd-extender
- cd /tmp/
- curl -O https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/config/gpushare-schd-extender.yaml
- kubectl create -f gpushare-schd-extender.yaml
- # To allow scheduling on the master node, delete these two lines
- # nodeSelector:
- #   node-role.kubernetes.io/master: ""
- # from gpushare-schd-extender.yaml so that k8s can schedule GPU workloads on the master
-
- ### If the download fails, use this mirror instead:
- wget http://49.232.8.65/yaml/gpushare-schd-extender.yaml
Deploy the device plugin gpushare-device-plugin
- cd /tmp/
- wget https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-rbac.yaml
- kubectl create -f device-plugin-rbac.yaml
- wget https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-ds.yaml
- # By default GPU memory is counted in GiB; to use MiB instead, change --memory-unit=GiB to --memory-unit=MiB in this file
- kubectl create -f device-plugin-ds.yaml
-
- ### If the download fails, use these mirrors instead:
- wget http://49.232.8.65/yaml/device-plugin-rbac.yaml
- wget http://49.232.8.65/yaml/device-plugin-ds.yaml
Label the GPU nodes
- # To schedule GPU programs onto servers that have GPUs, label those nodes with gpushare=true
-
- kubectl get nodes
- # pick the GPU node(s) and label them
- kubectl label node <target_node> gpushare=true
- kubectl describe node <target_node>
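The gpushare=true label matters because the device-plugin DaemonSet only schedules onto labeled nodes; the relevant part of device-plugin-ds.yaml looks roughly like this (a fragment, not the full manifest):

```yaml
spec:
  template:
    spec:
      nodeSelector:
        gpushare: "true"
```

A node without the label simply never runs the plugin pod, so its GPUs are invisible to the gpushare scheduler.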
Install the kubectl-inspect-gpushare plugin (extends kubectl)
- wget https://github.com/AliyunContainerService/gpushare-device-plugin/releases/download/v0.3.0/kubectl-inspect-gpushare
- chmod u+x kubectl-inspect-gpushare
- mv kubectl-inspect-gpushare /usr/local/bin
-
- ### If the download fails, use this mirror instead:
- wget http://49.232.8.64/k8s/kubectl-inspect-gpushare
Check the GPU info: if GPU info is shown, the installation succeeded
- root@dell21[/root]# kubectl inspect gpushare
- NAME IPADDRESS GPU0(Allocated/Total) PENDING(Allocated) GPU Memory(GiB)
- 10.45.61.22 10.45.61.22 0/7 2 2/7
- ------------------------------------------------------
- Allocated/Total GPU Memory In Cluster:
- 2/7 (28%)
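Once the plugin reports GPU memory, a pod requests a share of it through the aliyun.com/gpu-mem extended resource (a minimal sketch; the pod name, image, and amount are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-share-demo
spec:
  containers:
  - name: demo
    image: chenzishu/deepmodel:pytorch
    resources:
      limits:
        aliyun.com/gpu-mem: 2   # 2 units of GPU memory (GiB or MiB per --memory-unit)
```

After such a pod is scheduled, `kubectl inspect gpushare` shows the requested amount under Allocated, as in the sample output above.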
Pod cannot start: insufficient resources
- # relax the kubelet eviction threshold
- systemctl status -l kubelet
- # config file location:
- /etc/systemd/system/kubelet.service.d/
- # relax the threshold:
- # edit the config file and add the extra flag --eviction-hard=nodefs.available<3%
- Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --eviction-hard=nodefs.available<3%"
-
- systemctl daemon-reload
- systemctl restart kubelet
Pod keeps restarting:
- If the pod keeps restarting after startup and reports Back-off restarting failed container,
- add this command to the corresponding deployment:
- command: ["/bin/bash", "-ce", "tail -f /dev/null"]
-
- spec:
-   containers:
-   - name: test-file
-     image: xxx:v1
-     command: ["/bin/bash", "-ce", "tail -f /dev/null"]
-     imagePullPolicy: IfNotPresent