• 【MindSpore Features】 Running mindspore.ops operators on Ascend 910: AI Core utilization is 0, and launching multiple processes raises an error


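    When running operators from mindspore.ops on an Ascend 910A training card, the NPU AI Core utilization stays at 0 during execution. Launching multiple processes does not help: AI Core utilization is still 0, and the processes fail with an error.
    How can AI Core utilization be increased?
    Operator code example: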
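    import numpy as np
    import mindspore
    from mindspore import Tensor, ops
    from multiprocessing import Process

    def foo():
        # Run a single Conv2D operator on constant inputs.
        x = Tensor(np.ones([10, 32, 32, 32]), mindspore.float32)
        weight = Tensor(np.ones([32, 32, 3, 3]), mindspore.float32)
        conv2d = ops.Conv2D(out_channel=32, kernel_size=3)
        output = conv2d(x, weight)
        print(output.shape)

    if __name__ == "__main__":
        # Launch 10 processes that each run the operator once.
        for i in range(10):
            p = Process(target=foo)
            p.start()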

    【Steps and Symptoms】

    Process Process-6:
    Traceback (most recent call last):
    File "/usr/local/python3.7.5/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
    File "/usr/local/python3.7.5/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
    File "testops.py", line 12, in foo
    output = conv2d(x, weight)
    File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/ops/primitive.py", line 247, in __call__
    return _run_op(self, self.name, args)
    File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/common/api.py", line 78, in wrapper
    results = fn(*arg, **kwargs)
    File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/ops/primitive.py", line 682, in _run_op
    output = real_run_op(obj, op_name, args)
    RuntimeError: mindspore/ccsrc/runtime/device/ascend/ascend_kernel_runtime.cc:364 Init] Ascend error occurred, error message: EE9999: Inner Error!
    [driver interface] halMemAlloc failed: device_id=0, size=32212254720, type=2, env_type=3, drvRetCode=6![FUNC:DevMemAllocHugePageManaged][FILE:npu_driver.cc][LINE:700]
    [driver interface] halMemAlloc failed: size=32212254720, deviceId=0, type=2, env_type=3, drvRetCode=6![FUNC:DevMemAllocManaged][FILE:npu_driver.cc][LINE:739]
    DevMemAlloc huge page failed: deviceId=0, type=2, size=32212254720, retCode=117571606![FUNC:DevMemAllocOnline][FILE:npu_driver.cc][LINE:858]
    Device malloc failed, size=32212254720, type=2.[FUNC:DevMalloc][FILE:logger.cc][LINE:349]
    rtMalloc execute failed, reason=[driver error:out of memory][FUNC:ReportFuncErrorReason][FILE:error_message_manage.cc][LINE:41]

    First error scene API: mindspore/ccsrc/runtime/device/ascend/ascend_memory_manager.cc:64 MallocDeviceMemory] Malloc device memory failed, size[32212254720], ret[207001], Device 0 may be other processes occupying this card, check as: ps -ef|grep python
    #
    #
    [EXCEPTION] DEVICE(118536,ffff66fdf200,python3.7):2021-12-28-11:19:16.296.973 [mindspore/ccsrc/runtime/device/ascend/ascend_memory_manager.cc:64] MallocDeviceMemory] Malloc device memory failed, size[32212254720], ret[207001], Device 0 may be other processes occupying this card, check as: ps -ef|grep python
    [EXCEPTION] DEVICE(118536,ffff66fdf200,python3.7):2021-12-28-11:19:16.297.154 [mindspore/ccsrc/runtime/device/ascend/ascend_kernel_runtime.cc:364] Init] Ascend error occurred, error message: EE9999: Inner Error!
    [driver interface] halMemAlloc failed: device_id=0, size=32212254720, type=2, env_type=3, drvRetCode=6![FUNC:DevMemAllocHugePageManaged][FILE:npu_driver.cc][LINE:700]
    [driver interface] halMemAlloc failed: size=32212254720, deviceId=0, type=2, env_type=3, drvRetCode=6![FUNC:DevMemAllocManaged][FILE:npu_driver.cc][LINE:739]
    DevMemAlloc huge page failed: deviceId=0, type=2, size=32212254720, retCode=117571606![FUNC:DevMemAllocOnline][FILE:npu_driver.cc][LINE:858]
    Device malloc failed, size=32212254720, type=2.[FUNC:DevMalloc][FILE:logger.cc][LINE:349]
    rtMalloc execute failed, reason=[driver error:out of memory][FUNC:ReportFuncErrorReason][FILE:error_message_manage.cc][LINE:41]

    First error scene API: mindspore/ccsrc/runtime/device/ascend/ascend_memory_manager.cc:64 MallocDeviceMemory] Malloc device memory failed, size[32212254720], ret[207001], Device 0 may be other processes occupying this card, check as: ps -ef|grep python
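
To make the numbers in the log easier to read, here is a small sketch that extracts the requested size from the `Malloc device memory failed` line and converts it to GiB. The 32 GiB device capacity and the idea that every process requests its own block of this size are assumptions made for illustration; the log itself only states the requested size, the out-of-memory reason, and the hint that other processes may be occupying the card.

```python
import re

# One line copied from the log above (shortened to the relevant fields).
LOG_LINE = (
    "Malloc device memory failed, size[32212254720], ret[207001], "
    "Device 0 may be other processes occupying this card"
)

GIB = 1024 ** 3


def requested_bytes(line):
    """Return the integer inside size[...] from a MindSpore malloc error line."""
    match = re.search(r"size\[(\d+)\]", line)
    if match is None:
        raise ValueError("no size[...] field found in log line")
    return int(match.group(1))


# Assumptions for illustration only (not stated in the log):
ASSUMED_DEVICE_GIB = 32   # assumed on-board memory of one Ascend 910A card
NUM_PROCESSES = 10        # number of processes started by the test script

size_gib = requested_bytes(LOG_LINE) / GIB
total_gib = NUM_PROCESSES * size_gib
fits_at_once = int(ASSUMED_DEVICE_GIB // size_gib)

print(size_gib)      # 30.0
print(total_gib)     # 300.0
print(fits_at_once)  # 1
```

Under these assumptions a single request already takes 30 GiB, so only one process at a time could obtain its memory on device 0. That would be consistent with the `out of memory` reason and the `ps -ef|grep python` hint in the log.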

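    Check whether the test case is actually running operators on Ascend, that is, whether the target (`device_target`) set in the script's context is Ascend. Also, a single operator executes very quickly, so the test case may already have finished by the time you look at the utilization. If you do need to see the utilization, open two terminal windows: run the test case in one and monitor the device in real time in the other.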
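
The two suggestions above (keep the operator busy long enough to be observed, and watch the device from a second terminal) can be sketched with two small helpers. This is only an illustration: `repeat_for` and `sample_monitor` are hypothetical names, the workload and monitor command used here are stand-ins so the script runs anywhere, and `npu-smi info` is assumed to be the monitoring command available on the Ascend host.

```python
import subprocess
import sys
import time


def repeat_for(fn, seconds):
    """Call fn() repeatedly until `seconds` have elapsed; return the call count."""
    deadline = time.monotonic() + seconds
    calls = 0
    while True:
        fn()
        calls += 1
        if time.monotonic() >= deadline:
            return calls


def sample_monitor(cmd, samples=3, interval=1.0):
    """Run a monitoring command `samples` times and return each captured output."""
    outputs = []
    for i in range(samples):
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        outputs.append(result.stdout)
        if i < samples - 1:
            time.sleep(interval)
    return outputs


if __name__ == "__main__":
    # Terminal 1: stand-in workload. On the Ascend host, pass a function that
    # calls the operator instead, e.g. lambda: conv2d(x, weight).
    calls = repeat_for(lambda: sum(range(1000)), seconds=0.2)
    print("workload calls:", calls)

    # Terminal 2: stand-in monitor command. On the Ascend host, the command
    # would be ["npu-smi", "info"] (assumed to be installed with the driver).
    for text in sample_monitor([sys.executable, "-c", "print('sample')"],
                               samples=2, interval=0.1):
        print(text, end="")
```

In practice the workload loop would run in one terminal for several seconds or more while the monitor is sampled in the other, so that the short-lived operator is still executing when the utilization is read.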

  • Original article: https://blog.csdn.net/weixin_45666880/article/details/126501684