• Server - YAML Configuration for Running a PyTorchJob on Kubernetes (K8S)


    Follow me on CSDN: https://spike.blog.csdn.net/
    Article URL: https://blog.csdn.net/caroline_wendy/article/details/136499768

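    PyTorchJob is a Kubernetes custom resource for running PyTorch training jobs on Kubernetes. It is part of Kubeflow and has stable status. PyTorchJob lets you create and manage PyTorch jobs the same way you manage the built-in Kubernetes resources. To use PyTorchJob, you first need to install the PyTorch Operator. By default, the PyTorch Operator is deployed as a controller inside the training operator.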

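    The YAML configuration is shown below, where:

    • kind is PyTorchJob
    • metadata/name is the name of the Job to run; it must not duplicate an existing Job name
    • Nodes are declared under Worker: replicas is the number of replicated nodes, and resources sets the number of GPUs per node, so both 2 machines with 1 GPU each and 1 machine with 2 GPUs are supported
    • command is the command to run

    Source: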

    apiVersion: "kubeflow.org/v1"
    kind: PyTorchJob
    metadata:
      name: pytorch-simple-001
    spec:
      pytorchReplicaSpecs:
        Worker:
          replicas: 1
          template:
            metadata:
              annotations:
                sidecar.istio.io/inject: "false"
              labels:
                file-mount: "true"
                user-mount: "true"
            spec:
              # hostNetwork: false  # New
              containers:
                - name: pytorch
                  command:
                    - /bin/sh
                    - -cl
                    - "bash k8s/run_grid0_for_gpu1.sh > nohup.test.log 2>&1"
                  image: "harbor.[xxx].com/cryoem:v1.3.1"
                  imagePullPolicy: Always
                  securityContext:  # New
                    privileged: false
                    capabilities:
                      add: [ "IPC_LOCK" ]
                  resources:
                    limits:
                      rdma/hca: 1
                      cpu: 12
                      memory: "100G"
                      nvidia.com/gpu: 2
                  workingDir: "workspace/cryoem-project/"
                  volumeMounts:
                    - name: cache-volume  # change the name to your volume on k8s
                      mountPath: /dev/shm
              nodeSelector:
                gpu.device: "a100"  # support 'a10' or 'a100'
                group: "algo2"
              tolerations:
              - effect: NoSchedule
                key: role
                operator: Equal
                value: "algo2"
              volumes:
               - name: cache-volume  # change the name to your volume on k8s
                 emptyDir:
                     medium: Memory
                     sizeLimit: "960G"
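
    The manifest above is the 1-machine, 2-GPU case (replicas: 1, nvidia.com/gpu: 2). As a rough sketch of the other case mentioned in the list, 2 machines with 1 GPU each, only those two values need to change. The job name pytorch-simple-002 is a hypothetical example, and this trimmed-down version leaves out the node selector, tolerations, and volumes from the original for brevity. Whether run_grid0_for_gpu1.sh actually supports multi-node training is an assumption that is not confirmed by the original. Note also that two replicas are not guaranteed to land on different machines: the scheduler may place both pods on one node if it has enough free GPUs.

    ```yaml
    apiVersion: "kubeflow.org/v1"
    kind: PyTorchJob
    metadata:
      name: pytorch-simple-002  # hypothetical name; must differ from existing jobs
    spec:
      pytorchReplicaSpecs:
        Worker:
          replicas: 2  # two worker pods instead of one
          template:
            metadata:
              annotations:
                sidecar.istio.io/inject: "false"
            spec:
              containers:
                - name: pytorch
                  command:
                    - /bin/sh
                    - -cl
                    # assumption: the same script is reused; adjust for multi-node runs
                    - "bash k8s/run_grid0_for_gpu1.sh > nohup.test.log 2>&1"
                  image: "harbor.[xxx].com/cryoem:v1.3.1"
                  imagePullPolicy: Always
                  resources:
                    limits:
                      cpu: 12
                      memory: "100G"
                      nvidia.com/gpu: 1  # one GPU per worker pod
    ```

    With this variant, the worker pods would be expected to appear as pytorch-simple-002-worker-0 and pytorch-simple-002-worker-1, following the same naming pattern as the pod used in the commands below.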
    

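    Check the job status:

    # list PyTorchJobs
    kubectl get pytorchjobs
    # delete the job when it is no longer needed
    # kubectl delete pytorchjobs pytorch-simple-001
    # list the pods created for the job
    kubectl get pods
    # open a shell inside the worker pod
    kubectl exec -it -n [your namespace] pytorch-simple-001-worker-0 -- bash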
    

    Result (nvidia-smi output):

    Thu Mar  7 07:39:13 2024       
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA A800-SXM...  On   | 00000000:58:00.0 Off |                    0 |
    | N/A   52C    P0   259W / 400W |   7833MiB / 81920MiB |     93%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   1  NVIDIA A800-SXM...  On   | 00000000:D0:00.0 Off |                    0 |
    | N/A   52C    P0   235W / 400W |  12917MiB / 81920MiB |     93%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
                                                                                   
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    +-----------------------------------------------------------------------------+
    
  • Original article: https://blog.csdn.net/u012515223/article/details/136536695