Distributed code:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
After launching distributed training, the following error is reported:

[ERROR] MD(80689,fffea34fd1e0,python):2022-09-28-20:38:48.601.876 [mindspore/ccsrc/minddata/dataset/util/task_manager.cc:217] InterruptMaster] Task is terminated with err msg(more detail in info level log):Exception thrown from PyFunc. The actual amount of data read from generator 350 is different from generator.len 160146, you should adjust generator.len to make them match.
Line of code : 217
File : /home/jenkins/agent-working-dir/workspace/Compile_Ascend_ARM_CentOS/mindspore/mindspore/ccsrc/minddata/dataset/engine/datasetops/source/generator_op.cc
--------------------------------------------------------------------------------
What causes this problem, and how can it be fixed?
****************************************************Answer*****************************************************
This log means that the number of data items sent by one card differs from the number sent by the other cards. Possible causes:

1. One card genuinely produces a different number of items than the others. In distributed training, Ascend requires every card to send the same number of items.
2. Another card crashed, causing this card to exit early, so it sent fewer items than expected.
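For cause 1, the usual culprit is that the source object's `__len__` does not match the number of items each shard actually yields (the exact mismatch the error message reports: "actual amount of data read ... different from generator.len"). Below is a minimal plain-Python sketch, with hypothetical class and variable names, of a per-shard source whose `__len__` stays consistent with what it serves; the same pattern applies when passing a source to MindSpore's `GeneratorDataset`:

```python
# Hypothetical sketch: a random-access source that holds only one shard's
# slice of the data, so __len__ always equals the item count it can serve.
class ShardedSource:
    def __init__(self, data, num_shards, shard_id):
        # Stride sharding: shard_id takes every num_shards-th item.
        self.items = data[shard_id::num_shards]

    def __getitem__(self, index):
        return self.items[index]

    def __len__(self):
        # Must equal the number of items __getitem__ can actually serve;
        # a mismatch triggers the "adjust generator.len" error above.
        return len(self.items)

# With a total count divisible by num_shards, every card gets the same
# number of items, satisfying the Ascend requirement from cause 1.
full = list(range(12))
shards = [ShardedSource(full, num_shards=4, shard_id=i) for i in range(4)]
counts = [len(s) for s in shards]  # all shards report the same length
```

If the total sample count is not divisible by the number of cards, either drop the remainder or pad the dataset so all cards yield the same count; otherwise the faster cards wait on a card that sends fewer items.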