• [MindSpore Distributed Training] Error during training: Destroy info store failed


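    Multi-node, multi-card distributed training and validation on ModelArts

    [Steps & Symptoms]

    After training for a little more than one epoch, with validation already run a dozen or so times, a problem occurred when EvalCallback switched the model from the training network to the validation network and ran validation.

    So far it appears that node 0 (8 Ascend 910 cards) completed model validation and reported the validation accuracy. Node 1 (8 Ascend 910 cards), however, raised errors and did not print the validation accuracy. The error output is as follows: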

    [ERROR] HCCL_ADPT(113,fffdfd7fa160,python):2021-12-22-02:04:11.583.220 [mindspore/ccsrc/runtime/hccl_adapter/hccl_adapter.cc:310] FinalizeKernelInfoStore] Destroy info store failed, ret = 1343225860

    [ERROR] HCCL_ADPT(119,fffde1ffb160,python):2021-12-22-02:04:11.599.194 [mindspore/ccsrc/runtime/hccl_adapter/hccl_adapter.cc:310] FinalizeKernelInfoStore] Destroy info store failed, ret = 1343225860

    [ERROR] HCCL_ADPT(111,fffdfbfff160,python):2021-12-22-02:04:11.602.991 [mindspore/ccsrc/runtime/hccl_adapter/hccl_adapter.cc:310] FinalizeKernelInfoStore] Destroy info store failed, ret = 1343225860

    [ERROR] HCCL_ADPT(121,fffdecff9160,python):2021-12-22-02:04:11.616.258 [mindspore/ccsrc/runtime/hccl_adapter/hccl_adapter.cc:310] FinalizeKernelInfoStore] Destroy info store failed, ret = 1343225860

    [ERROR] HCCL_ADPT(117,fffe18ff9160,python):2021-12-22-02:04:11.629.527 [mindspore/ccsrc/runtime/hccl_adapter/hccl_adapter.cc:310] FinalizeKernelInfoStore] Destroy info store failed, ret = 1343225860

    [ERROR] HCCL_ADPT(115,fffde27fc160,python):2021-12-22-02:04:11.735.578 [mindspore/ccsrc/runtime/hccl_adapter/hccl_adapter.cc:310] FinalizeKernelInfoStore] Destroy info store failed, ret = 1343225860

    [ERROR] HCCL_ADPT(109,fffded7fa160,python):2021-12-22-02:04:11.760.305 [mindspore/ccsrc/runtime/hccl_adapter/hccl_adapter.cc:310] FinalizeKernelInfoStore] Destroy info store failed, ret = 1343225860

    [ERROR] HCCL_ADPT(107,fffdde7fc160,python):2021-12-22-02:04:11.933.854 [mindspore/ccsrc/runtime/hccl_adapter/hccl_adapter.cc:310] FinalizeKernelInfoStore] Destroy info store failed, ret = 1343225860

    [Modelarts Service Log]2021-12-22 02:04:17,442 - ERROR - proc-rank-9-device-1 (pid: 109) has exited with non-zero code: -11

    [Modelarts Service Log]2021-12-22 02:04:17,443 - INFO - Begin destroy training processes
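The service log line above names the failing process as `proc-rank-9-device-1` with exit code `-11`. The following is a minimal sketch for reading those two values, under two assumptions that the original post does not confirm: that the exit code follows the Python `subprocess` convention (a negative value `-N` means the process was terminated by signal `N`), and that ranks are numbered consecutively with 8 devices per node. The helper names `describe_exit` and `locate_rank` are hypothetical.

```python
import signal


def describe_exit(returncode: int) -> str:
    """Interpret an exit code, assuming the subprocess convention
    where a negative value -N means 'terminated by signal N'."""
    if returncode < 0:
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"unknown signal {-returncode}"
        return f"killed by signal {name}"
    return f"exited with status {returncode}"


def locate_rank(rank_id: int, devices_per_node: int = 8) -> tuple:
    """Map a global rank to (node index, local device index),
    assuming consecutive rank numbering across nodes."""
    return divmod(rank_id, devices_per_node)


if __name__ == "__main__":
    print(describe_exit(-11))
    print(locate_rank(9))
```

Under these assumptions, exit code `-11` corresponds to a segmentation fault (SIGSEGV on Linux), and rank 9 maps to node 1, device 1, which agrees with the `device-1` suffix in the log and with the observation that the failure is on the second node.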

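    This error in the plog appears because other cards exited abnormally; it is a consequence rather than the root cause, and the message is identical on every card.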
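To check that the HCCL_ADPT errors really are identical across processes, the log lines can be summarized with a small script. This is only a sketch: the regular expression is written against the line format shown in this post and may need adjusting for other MindSpore versions, and the function name `summarize_hccl_errors` is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
# [ERROR] HCCL_ADPT(113,fffdfd7fa160,python):2021-12-22-02:04:11.583.220 [...] ... ret = 1343225860
LINE_RE = re.compile(
    r"\[ERROR\] HCCL_ADPT\((?P<pid>\d+),[0-9a-f]+,python\):(?P<ts>\S+) "
    r".*ret = (?P<ret>\d+)"
)


def summarize_hccl_errors(lines):
    """Return (sorted pids, Counter of return codes) for HCCL_ADPT error lines."""
    pids = []
    codes = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            pids.append(int(match.group("pid")))
            codes[int(match.group("ret"))] += 1
    return sorted(pids), codes


if __name__ == "__main__":
    sample = [
        "[ERROR] HCCL_ADPT(113,fffdfd7fa160,python):2021-12-22-02:04:11.583.220 "
        "[mindspore/ccsrc/runtime/hccl_adapter/hccl_adapter.cc:310] "
        "FinalizeKernelInfoStore] Destroy info store failed, ret = 1343225860",
        "[ERROR] HCCL_ADPT(109,fffded7fa160,python):2021-12-22-02:04:11.760.305 "
        "[mindspore/ccsrc/runtime/hccl_adapter/hccl_adapter.cc:310] "
        "FinalizeKernelInfoStore] Destroy info store failed, ret = 1343225860",
    ]
    pids, codes = summarize_hccl_errors(sample)
    print(pids)
    for code, count in codes.items():
        # Hex form is often easier to look up in error-code tables.
        print(hex(code), count)
```

In the log above, all eight processes report the same return code (1343225860, which is 0x50100004 in hexadecimal), and pid 109 is also the process that the ModelArts service log reports as having exited with code -11. A single shared code across all cards is consistent with the errors being a teardown symptom, so the earlier output of the process that exited first is the more useful place to look.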

  • Original article: https://blog.csdn.net/weixin_45666880/article/details/126500998