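When running operators from mindspore.ops on an Ascend 910A training card, the NPU AI Core utilization stays at 0 during execution. Launching multiple processes does not help: AI Core utilization is still 0, and an error is reported.
How can I increase AI Core utilization?
Operator code example: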
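import numpy as np
import mindspore
from mindspore import Tensor, ops
from multiprocessing import Process

def foo():
    # Run a single Conv2D on constant inputs and print the output shape.
    x = Tensor(np.ones([10, 32, 32, 32]), mindspore.float32)
    weight = Tensor(np.ones([32, 32, 3, 3]), mindspore.float32)
    conv2d = ops.Conv2D(out_channel=32, kernel_size=3)
    output = conv2d(x, weight)
    print(output.shape)

# Start 10 processes, each running the operator once.
for i in range(10):
    p = Process(target=foo)
    p.start()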
[Steps and observed symptoms]
Process Process-6:
Traceback (most recent call last):
  File "/usr/local/python3.7.5/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/local/python3.7.5/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "testops.py", line 12, in foo
    output = conv2d(x, weight)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/ops/primitive.py", line 247, in __call__
    return _run_op(self, self.name, args)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/common/api.py", line 78, in wrapper
    results = fn(*arg, **kwargs)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/ops/primitive.py", line 682, in _run_op
    output = real_run_op(obj, op_name, args)
RuntimeError: mindspore/ccsrc/runtime/device/ascend/ascend_kernel_runtime.cc:364 Init] Ascend error occurred, error message: EE9999: Inner Error!
[driver interface] halMemAlloc failed: device_id=0, size=32212254720, type=2, env_type=3, drvRetCode=6![FUNC:DevMemAllocHugePageManaged][FILE:npu_driver.cc][LINE:700]
[driver interface] halMemAlloc failed: size=32212254720, deviceId=0, type=2, env_type=3, drvRetCode=6![FUNC:DevMemAllocManaged][FILE:npu_driver.cc][LINE:739]
DevMemAlloc huge page failed: deviceId=0, type=2, size=32212254720, retCode=117571606![FUNC:DevMemAllocOnline][FILE:npu_driver.cc][LINE:858]
Device malloc failed, size=32212254720, type=2.[FUNC:DevMalloc][FILE:logger.cc][LINE:349]
rtMalloc execute failed, reason=[driver error:out of memory][FUNC:ReportFuncErrorReason][FILE:error_message_manage.cc][LINE:41]
First error scene API: mindspore/ccsrc/runtime/device/ascend/ascend_memory_manager.cc:64 MallocDeviceMemory] Malloc device memory failed, size[32212254720], ret[207001], Device 0 may be other processes occupying this card, check as: ps -ef|grep python
#
#
[EXCEPTION] DEVICE(118536,ffff66fdf200,python3.7):2021-12-28-11:19:16.296.973 [mindspore/ccsrc/runtime/device/ascend/ascend_memory_manager.cc:64] MallocDeviceMemory] Malloc device memory failed, size[32212254720], ret[207001], Device 0 may be other processes occupying this card, check as: ps -ef|grep python
[EXCEPTION] DEVICE(118536,ffff66fdf200,python3.7):2021-12-28-11:19:16.297.154 [mindspore/ccsrc/runtime/device/ascend/ascend_kernel_runtime.cc:364] Init] Ascend error occurred, error message: EE9999: Inner Error!
[driver interface] halMemAlloc failed: device_id=0, size=32212254720, type=2, env_type=3, drvRetCode=6![FUNC:DevMemAllocHugePageManaged][FILE:npu_driver.cc][LINE:700]
[driver interface] halMemAlloc failed: size=32212254720, deviceId=0, type=2, env_type=3, drvRetCode=6![FUNC:DevMemAllocManaged][FILE:npu_driver.cc][LINE:739]
DevMemAlloc huge page failed: deviceId=0, type=2, size=32212254720, retCode=117571606![FUNC:DevMemAllocOnline][FILE:npu_driver.cc][LINE:858]
Device malloc failed, size=32212254720, type=2.[FUNC:DevMalloc][FILE:logger.cc][LINE:349]
rtMalloc execute failed, reason=[driver error:out of memory][FUNC:ReportFuncErrorReason][FILE:error_message_manage.cc][LINE:41]
First error scene API: mindspore/ccsrc/runtime/device/ascend/ascend_memory_manager.cc:64 MallocDeviceMemory] Malloc device memory failed, size[32212254720], ret[207001], Device 0 may be other processes occupying this card, check as: ps -ef|grep python
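One way to read the log above, offered as a rough sketch rather than a diagnosis: every failing call asks for size=32212254720 bytes on device 0, and the message itself suggests that other processes may already occupy the card. The arithmetic below converts that size to GiB and checks how many requests of that size would fit on one device. The 32 GiB device capacity is an assumption for illustration and is not stated in the log; max_concurrent is a hypothetical helper name.

```python
GIB = 1024 ** 3

# Size taken verbatim from the halMemAlloc / MallocDeviceMemory lines in the log.
REQUESTED_BYTES = 32212254720

# ASSUMPTION: total device memory of 32 GiB, used only for illustration.
ASSUMED_DEVICE_BYTES = 32 * GIB

# The script starts this many processes (for i in range(10)).
NUM_PROCESSES = 10


def max_concurrent(device_bytes, per_process_bytes):
    """Return how many allocations of per_process_bytes fit in device_bytes."""
    return device_bytes // per_process_bytes


requested_gib = REQUESTED_BYTES / GIB
fits = max_concurrent(ASSUMED_DEVICE_BYTES, REQUESTED_BYTES)

print(f"Each process requests {requested_gib:.1f} GiB")
print(f"Requests that fit on the assumed device: {fits} of {NUM_PROCESSES}")
```

Under that assumption only one such request fits at a time, which would be consistent with processes in the script failing with out of memory when they all target device 0.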
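Check whether the test case is actually running operators on Ascend, that is, whether the target set in the script's context is Ascend. Also, a single operator takes very little time to execute, so the test case may already have finished by the time you look at the utilization. If you do need to see the utilization, open two terminal windows: run the test case in one and monitor in real time in the other.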
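The two-terminal approach can be sketched as a small poller that you start in the second terminal before launching the test case in the first. This is a minimal sketch, not an official tool: poll_command is a hypothetical helper that reruns any command at a fixed interval and prints its output with a timestamp. Running it with npu-smi info assumes that command is available on the Ascend host; the output is printed as-is and not parsed.

```python
import subprocess
import time


def poll_command(cmd, interval_s=1.0, count=5):
    """Run cmd `count` times, `interval_s` seconds apart; return each run's output."""
    samples = []
    for i in range(count):
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # keep error text together with normal output
            universal_newlines=True,   # decode output as text
        )
        samples.append(result.stdout)
        print(time.strftime("%H:%M:%S"), "sample", i + 1)
        print(result.stdout)
        if i + 1 < count:
            time.sleep(interval_s)
    return samples


# Example for the second terminal on the Ascend host (assumes npu-smi is installed):
# poll_command(["npu-smi", "info"], interval_s=1.0, count=30)
```

Because a single operator call finishes quickly, a short polling interval gives a better chance of catching non-zero utilization while the test case is still running.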