• MindSpore multi-node, multi-card distributed training: RuntimeError at pipeline.cc:1456


    The run_cluster.sh script that was invoked (source: https://gitee.com/mindspore/docs/tree/r1.7/docs/sample_code/distributed_training/run_cluster.sh) is as follows:

    #!/bin/bash
    # applicable to Ascend
    echo "=============================================================================================================="
    echo "Please run the script as: "
    echo "bash run.sh DATA_PATH RANK_TABLE_FILE RANK_SIZE RANK_START"
    echo "For example: bash run.sh /path/dataset /path/rank_table.json 16 0"
    echo "It is better to use the absolute path."
    echo "=============================================================================================================="
    execute_path=$(pwd)
    echo ${execute_path}
    script_self=$(readlink -f "$0")
    self_path=$(dirname "${script_self}")
    echo ${self_path}
    # Positional arguments: dataset path, rank table file, total rank count, first rank on this server
    export DATA_PATH=$1
    export RANK_TABLE_FILE=$2
    export RANK_SIZE=$3
    RANK_START=$4
    DEVICE_START=0
    # Launch one training process per local device (8 per server)
    for((i=0;i<=7;i++));
    do
        export RANK_ID=$[i+RANK_START]
        export DEVICE_ID=$[i+DEVICE_START]
        rm -rf ${execute_path}/device_$RANK_ID
        mkdir ${execute_path}/device_$RANK_ID
        # Each process runs from its own working directory
        cd ${execute_path}/device_$RANK_ID || exit
        pytest -s ${self_path}/resnet50_distributed_training.py >train$RANK_ID.log 2>&1 &
    done
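
    To make the effect of the loop concrete, here is a minimal Python sketch of the RANK_ID and DEVICE_ID values that one invocation of the script exports. The function name rank_layout and the devices_per_server parameter are illustrative and not part of the sample code; only the arithmetic is taken from the script.

```python
from typing import List, Tuple


def rank_layout(rank_start: int, devices_per_server: int = 8) -> List[Tuple[int, int]]:
    """Return the (RANK_ID, DEVICE_ID) pairs exported by one run_cluster.sh call."""
    device_start = 0  # DEVICE_START in the script
    return [
        # Mirrors RANK_ID=$[i+RANK_START] and DEVICE_ID=$[i+DEVICE_START]
        (i + rank_start, i + device_start)
        for i in range(devices_per_server)  # mirrors for((i=0;i<=7;i++))
    ]


if __name__ == "__main__":
    # RANK_START values for a two-server, sixteen-card run
    for name, start in (("server0", 0), ("server1", 8)):
        print(name, rank_layout(start))
```

    With RANK_START set to 0 and 8, the two servers cover ranks 0 to 7 and 8 to 15, while DEVICE_ID runs from 0 to 7 on both. The rank_id entries in the rank table presumably need to agree with this numbering.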

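    I am using MindSpore for multi-node, multi-card distributed training on two machines, each with eight Ascend 910A cards, for sixteen cards in total. While following the distributed training tutorial at https://www.mindspore.cn/tutorials/experts/zh-CN/r1.7/parallel/train_ascend.html, I ran into a problem.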

    The run fails. The commands used are shown first, followed by the error log:

    # server0
    bash run_cluster.sh /path/dataset /path/rank_table.json 16 0
    # server1
    bash run_cluster.sh /path/dataset /path/rank_table.json 16 8
    

    [ERROR] DEVICE(19823,ffff9ad057e0,python3.7):2022-08-03-14:47:37.722.356 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:1268] HcclInit] Invalid environment variable 'MINDSPORE_HCCL_CONFIG_PATH' or 'RANK_TABLE_FILE', the path is: cluster_rank_table_16pcs.json. Please check (1) whether the path exists, (2) whether the path has the access permission, (3) whether the path is too long.
    [ERROR] DEVICE(19823,ffff9ad057e0,python3.7):2022-08-03-14:47:37.722.434 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:1183] InitDevice] HcclInit init failed
    [CRITICAL] PIPELINE(19823,ffff9ad057e0,python3.7):2022-08-03-14:47:37.722.459 [mindspore/ccsrc/pipeline/jit/pipeline.cc:1456] InitHccl] Runtime init failed.
    ============================= test session starts ==============================
    platform linux -- Python 3.7.5, pytest-7.1.2, pluggy-1.0.0
    rootdir: /sunhanyuan/docs-r1.7/docs/sample_code/distributed_training
    collected 0 items / 1 error

    ==================================== ERRORS ====================================
    ______________ ERROR collecting resnet50_distributed_training.py _______________
    ../resnet50_distributed_training.py:35: in <module>
        init()
    /usr/local/python37/lib/python3.7/site-packages/mindspore/communication/management.py:142: in init
        init_hccl()
    E   RuntimeError: mindspore/ccsrc/pipeline/jit/pipeline.cc:1456 InitHccl] Runtime init failed.
    =========================== short test summary info ============================
    ERROR ../resnet50_distributed_training.py - RuntimeError: mindspore/ccsrc/pip...
    !!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!
    =============================== 1 error in 7.40s
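
    The first [ERROR] line lists three things to check about the rank table path. Below is a minimal Python sketch of those checks. The helper name check_rank_table_path is hypothetical, and max_len=4096 is only a placeholder, since the log does not state the limit MindSpore enforces. Note also that the path printed in the log, cluster_rank_table_16pcs.json, is relative, and run_cluster.sh changes into device_$RANK_ID before launching each process, so a relative value would be resolved from that directory.

```python
import os


def check_rank_table_path(path: str, max_len: int = 4096) -> list:
    """Collect readable descriptions of problems with a rank table path.

    An empty list means none of the checks below found a problem.
    max_len is an assumed placeholder, not a documented MindSpore limit.
    """
    problems = []
    if not os.path.isabs(path):
        problems.append("path is relative; it is resolved from the process working directory")
    if not os.path.isfile(path):
        problems.append("file does not exist at this path")
    elif not os.access(path, os.R_OK):
        problems.append("file exists but is not readable by the current user")
    if len(path) > max_len:
        problems.append("path is longer than the assumed limit")
    return problems
```

    Calling check_rank_table_path(os.environ["RANK_TABLE_FILE"]) from inside one of the device_N directories would show how the path looks to a training process.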

     Answer:

    After some investigation, the cause turned out to be the IP settings in cluster_rank_table_16pcs.json.

  • Original article: https://blog.csdn.net/weixin_45666880/article/details/127689415