• Ascend 910 Usage Notes


    I. Compressing and Extracting Files

    1. Compress a directory

    tar -czvf UNITE-main.tar.gz ./UNITE-main/
    

    2. Extract the archive

    tar -xzvf UNITE-main.tar.gz
    

    II. Changing CUDA to NPU

    import torch_npu
    
    data['label'] = data['label'].cuda()
    data['instance'] = data['instance'].cuda()
    data['image'] = data['image'].cuda()
    

    Change to

    data['label'] = data['label'].npu()
    data['instance'] = data['instance'].npu()
    data['image'] = data['image'].npu()
    
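Beyond the `.cuda()` → `.npu()` swap, the script can stay portable across both back ends with a small device-selection helper. This is an illustrative sketch, not part of the original code; the availability flags would come from `torch.cuda.is_available()` and the `torch_npu` equivalent:

```python
def pick_device(npu_available: bool, cuda_available: bool) -> str:
    """Choose a device string, preferring NPU, then CUDA, then CPU."""
    if npu_available:
        return "npu"
    if cuda_available:
        return "cuda"
    return "cpu"

# Wiring it into the snippet above (torch / torch_npu assumed installed):
#   device = pick_device(torch.npu.is_available(), torch.cuda.is_available())
#   data['label'] = data['label'].to(device)
```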

    III. Configuring Environment Variables

    1. Create env.sh

    touch env.sh
    

    2. Open env.sh

    vi env.sh
    

    3. Set the environment variables

    # CANN environment variables
    CANN_INSTALL_PATH_CONF='/etc/Ascend/ascend_cann_install.info'
    if [ -f $CANN_INSTALL_PATH_CONF ]; then
     DEFAULT_CANN_INSTALL_PATH=$(grep Install_Path $CANN_INSTALL_PATH_CONF | cut -d "=" -f 2)
    else
     DEFAULT_CANN_INSTALL_PATH="/usr/local/Ascend/"
    fi
    CANN_INSTALL_PATH=${1:-${DEFAULT_CANN_INSTALL_PATH}}
    if [ -d ${CANN_INSTALL_PATH}/ascend-toolkit/latest ]; then
     source ${CANN_INSTALL_PATH}/ascend-toolkit/set_env.sh
    else
     source ${CANN_INSTALL_PATH}/nnae/set_env.sh
    fi
    # Dependency library paths
    export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/openblas/lib
    export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/lib/
    export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/lib64/
    export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/lib/
    export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/lib/aarch64-linux-gnu
    # Custom variables
    export HCCL_WHITELIST_DISABLE=1
    # Logging
    export ASCEND_SLOG_PRINT_TO_STDOUT=0 # print logs to stdout (optional)
    export ASCEND_GLOBAL_LOG_LEVEL=3 # log level: 1 = INFO, 3 = ERROR
    export ASCEND_GLOBAL_EVENT_ENABLE=0 # event logging disabled by default
    

    Then save and quit with

    :wq!
    

    4. Apply the environment

    source env.sh
    
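To check that sourcing worked, the exported variables can be probed. `check_var` is a hypothetical helper, not part of the original notes:

```shell
# Report whether each expected variable is set in the current shell.
check_var() {
    if [ -n "$(printenv "$1")" ]; then
        echo "$1 OK"
    else
        echo "$1 MISSING"
    fi
}

check_var ASCEND_GLOBAL_LOG_LEVEL
check_var HCCL_WHITELIST_DISABLE
```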

    IV. RuntimeError: ACL stream synchronize failed, error code:507018

    E39999: Inner Error, Please contact support engineer!
    E39999  Aicpu kernel execute failed, device_id=0, stream_id=0, task_id=6394, fault op_name=ScatterElements[FUNC:GetError][FILE:stream.cc][LINE:1044]
            TraceBack (most recent call last):
            rtStreamSynchronize execute failed, reason=[aicpu exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:49]
            synchronize stream failed, runtime result = 507018[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
    
    
    DEVICE[0] PID[41411]: 
    EXCEPTION TASK:
      Exception info:TGID=2593324, model id=65535, stream id=0, stream phase=SCHEDULE, task id=742, task type=aicpu kernel, recently received task id=742, recently send task id=741, task phase=RUN
      Message info[0]:aicpu=0,slot_id=0,report_mailbox_flag=0x5a5a5a5a,state=0x5210
        Other info[0]:time=2023-10-12-11:22:01.273.951, function=proc_aicpu_task_done, line=972, error code=0x2a 
    EXCEPTION TASK:
      Exception info:TGID=2593324, model id=65535, stream id=0, stream phase=3, task id=6394, task type=aicpu kernel, recently received task id=6406, recently send task id=6393, task phase=RUN
      Message info[0]:aicpu=0,slot_id=0,report_mailbox_flag=0x5a5a5a5a,state=0x5210
        Other info[0]:time=2023-10-12-11:41:20.661.958, function=proc_aicpu_task_done, line=972, error code=0x2a
    Traceback (most recent call last):
      File "train.py", line 40, in <module>
        trainer.run_generator_one_step(data_i)
      File "/home/ma-user/work/SPADE-master/trainers/pix2pix_trainer.py", line 35, in run_generator_one_step
        g_losses, generated = self.pix2pix_model(data, mode='generator')
      File "/home/ma-user/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/ma-user/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
        return self.module(*inputs, **kwargs)
      File "/home/ma-user/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/ma-user/work/SPADE-master/models/pix2pix_model.py", line 43, in forward
        input_semantics, real_image = self.preprocess_input(data)
      File "/home/ma-user/work/SPADE-master/models/pix2pix_model.py", line 113, in preprocess_input
        data['label'] = data['label'].npu()
      File "/home/ma-user/anaconda3/envs/py38/lib/python3.8/site-packages/torch_npu/utils/device_guard.py", line 38, in wrapper
        return func(*args, **kwargs)
      File "/home/ma-user/anaconda3/envs/py38/lib/python3.8/site-packages/torch_npu/utils/tensor_methods.py", line 66, in _npu
        return torch_npu._C.npu(self, *args, **kwargs)
    RuntimeError: ACL stream synchronize failed, error code:507018
    THPModule_npu_shutdown success.
    

    The guess is that mixed precision (AMP) was not enabled.

    V. Enabling Mixed Precision

    1. Before building the network, import the AMP module from torch_npu

    import time
    import torch
    import torch.nn as nn
    import torch_npu
    from torch_npu.npu import amp    # import the AMP module
    

    2. After the model and optimizer are defined, create AMP's GradScaler

    model = CNN().to(device)
    train_dataloader = DataLoader(train_data, batch_size=batch_size)    # build the DataLoader
    loss_func = nn.CrossEntropyLoss().to(device)    # loss function
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)    # optimizer
    scaler = amp.GradScaler()    # create the GradScaler after model and optimizer
    

    3. Add the AMP-related code to the training loop to enable AMP

    for epo in range(epochs):
        for imgs, labels in train_dataloader:
            imgs = imgs.to(device)
            labels = labels.to(device)
            with amp.autocast():
                outputs = model(imgs)    # forward pass
                loss = loss_func(outputs, labels)    # compute the loss
            optimizer.zero_grad()
            # scale the loss, backpropagate, then update parameters
            scaler.scale(loss).backward()    # scaled loss and backward pass
            scaler.step(optimizer)    # update parameters (unscales automatically)
            scaler.update()    # adjust the dynamic loss-scaling factor
    

    VI. Unknown Error

    E39999: Inner Error, Please contact support engineer!
    E39999  An exception occurred during AICPU execution, stream_id:78, task_id:742, errcode:21008, msg:inner error[FUNC:ProcessAicpuErrorInfo][FILE:device_error_proc.cc][LINE:673]
            TraceBack (most recent call last):
            Kernel task happen error, retCode=0x2a, [aicpu exception].[FUNC:PreCheckTaskErr][FILE:task.cc][LINE:1068]
            Aicpu kernel execute failed, device_id=0, stream_id=78, task_id=742.[FUNC:PrintAicpuErrorInfo][FILE:task.cc][LINE:774]
            Aicpu kernel execute failed, device_id=0, stream_id=78, task_id=742, fault op_name=ScatterElements[FUNC:GetError][FILE:stream.cc][LINE:1044]
            rtStreamSynchronize execute failed, reason=[aicpu exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:49]
            op[Minimum], The Minimum op dtype is not same, type1:DT_FLOAT16, type2:DT_FLOAT[FUNC:CheckTwoInputDtypeSame][FILE:util.cc][LINE:116]
            Verifying Minimum failed.[FUNC:InferShapeAndType][FILE:infershape_pass.cc][LINE:135]
            Call InferShapeAndType for node:Minimum(Minimum) failed[FUNC:Infer][FILE:infershape_pass.cc][LINE:117]
            process pass InferShapePass on node:Minimum failed, ret:4294967295[FUNC:RunPassesOnNode][FILE:base_pass.cc][LINE:530]
            build graph failed, graph id:894, ret:1343242270[FUNC:BuildModel][FILE:ge_generator.cc][LINE:1484]
            [Build][SingleOpModel]call ge interface generator.BuildSingleOpModel failed. ge result = 1343242270[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
            [Build][Op]Fail to build op model[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
            build op model failed, result = 500002[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
    
    
    DEVICE[0] PID[189368]: 
    EXCEPTION TASK:
      Exception info:TGID=3114744, model id=65535, stream id=78, stream phase=SCHEDULE, task id=742, task type=aicpu kernel, recently received task id=742, recently send task id=741, task phase=RUN
      Message info[0]:aicpu=0,slot_id=0,report_mailbox_flag=0x5a5a5a5a,state=0x5210
        Other info[0]:time=2023-10-12-12:12:22.763.259, function=proc_aicpu_task_done, line=972, error code=0x2a 
    EXCEPTION TASK:
      Exception info:TGID=3114744, model id=65535, stream id=78, stream phase=3, task id=4347, task type=aicpu kernel, recently received task id=4354, recently send task id=4346, task phase=RUN
      Message info[0]:aicpu=0,slot_id=0,report_mailbox_flag=0x5a5a5a5a,state=0x5210
        Other info[0]:time=2023-10-12-12:13:57.997.757, function=proc_aicpu_task_done, line=972, error code=0x2a
    Aborted (core dumped)
    (py38) [ma-user SPADE-master]$Process ForkServerProcess-2:
    Traceback (most recent call last):
      File "/home/ma-user/anaconda3/envs/py38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
        self.run()
      File "/home/ma-user/anaconda3/envs/py38/lib/python3.8/multiprocessing/process.py", line 108, in run
        self._target(*self._args, **self._kwargs)
      File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 61, in wrapper
        raise exp
      File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 58, in wrapper
        func(*args, **kwargs)
      File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 268, in task_distribute
        key, func_name, detail = resource_proxy[TASK_QUEUE].get()
      File "", line 2, in get
      File "/home/ma-user/anaconda3/envs/py38/lib/python3.8/multiprocessing/managers.py", line 835, in _callmethod
        kind, result = conn.recv()
      File "/home/ma-user/anaconda3/envs/py38/lib/python3.8/multiprocessing/connection.py", line 250, in recv
        buf = self._recv_bytes()
      File "/home/ma-user/anaconda3/envs/py38/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
        buf = self._recv(4)
      File "/home/ma-user/anaconda3/envs/py38/lib/python3.8/multiprocessing/connection.py", line 383, in _recv
        raise EOFError
    EOFError
    /home/ma-user/anaconda3/envs/py38/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 91 leaked semaphore objects to clean up at shutdown
      warnings.warn('resource_tracker: There appear to be %d '
    
    Despite the "unknown" label, the log itself gives a hint: the Minimum op received mismatched dtypes (DT_FLOAT16 vs DT_FLOAT), which can happen when AMP half-precision tensors meet float32 ones.

    VII. Dataset Loading Error

    Traceback (most recent call last):
      File "train.py", line 126, in <module>
        content_images = next(content_iter).to(device)
      File "/home/ma-user/anaconda3/envs/py38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
        data = self._next_data()
      File "/home/ma-user/anaconda3/envs/py38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 569, in _next_data
        index = self._next_index()  # may raise StopIteration
      File "/home/ma-user/anaconda3/envs/py38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in _next_index
        return next(self._sampler_iter)  # may raise StopIteration
      File "/home/ma-user/anaconda3/envs/py38/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 226, in __iter__
        for idx in self.sampler:
      File "/home/ma-user/work/CCPL-main/sampler.py", line 10, in InfiniteSampler
        yield order[i]
    IndexError: index -1 is out of bounds for axis 0 with size 0
    THPModule_npu_shutdown success.
    

    The dataset path was wrong and needs to be changed to

    ../iRay/train/nightA
    
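Since a wrong path silently yields an empty dataset (and the IndexError above), a quick sanity check on the directory helps. `count_images` is a hypothetical helper, not part of the original code:

```python
import os

def count_images(root, exts=(".png", ".jpg", ".jpeg")):
    """Count image files under root; 0 strongly suggests a wrong dataset path."""
    total = 0
    for _, _, files in os.walk(root):
        total += sum(1 for f in files if f.lower().endswith(exts))
    return total
```

Calling `count_images('../iRay/train/nightA')` before training makes a bad path fail fast instead of deep inside the sampler.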

    VIII. Gradient Error

    Traceback (most recent call last):
      File "train.py", line 129, in <module>
        loss_c, loss_s, loss_ccp = network(content_images, style_images, args.tau, args.num_s, args.num_l)
      File "/home/ma-user/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/ma-user/work/CCPL-main/net.py", line 280, in forward
        loss_s = self.calc_style_loss(g_t_feats[0], style_feats[0]) 
      File "/home/ma-user/work/CCPL-main/net.py", line 265, in calc_style_loss
        input_mean, input_std = calc_mean_std(input)
      File "/home/ma-user/work/CCPL-main/function.py", line 8, in calc_mean_std
        feat_var = feat.view(N, C, -1).var(dim=2) + eps
    RuntimeError: cannot resize variables that require grad
    

    The message "RuntimeError: cannot resize variables that require grad" means that a tensor requiring gradients cannot be resized. It is usually caused by an operation that ends up resizing such a tensor in place; here it is triggered through the .view() call.

    Change it to

    feat_var = feat.data.view(N, C, -1).var(dim=2) + eps
    
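Note that `.data` detaches the tensor from autograd, so no gradient flows through this statistic. If the style loss should stay differentiable, an alternative worth trying (a sketch, not verified on the NPU) is `reshape`, which tolerates non-contiguous tensors without an in-place resize:

```python
import torch

def calc_mean_std_grad_safe(feat, eps=1e-5):
    """Per-channel mean/std over spatial dims, keeping the autograd graph."""
    N, C = feat.size()[:2]
    feat_var = feat.reshape(N, C, -1).var(dim=2) + eps
    feat_std = feat_var.sqrt().view(N, C, 1, 1)
    feat_mean = feat.reshape(N, C, -1).mean(dim=2).view(N, C, 1, 1)
    return feat_mean, feat_std
```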

    IX. Tensor Index Type Error

    Traceback (most recent call last):
      File "train.py", line 129, in <module>
        loss_c, loss_s, loss_ccp = network(content_images, style_images, args.tau, args.num_s, args.num_l)
      File "/home/ma-user/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/ma-user/work/CCPL-main/net.py", line 287, in forward
        loss_ccp = self.CCPL(g_t_feats, content_feats, num_s, start_layer, end_layer)
      File "/home/ma-user/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/ma-user/work/CCPL-main/net.py", line 211, in forward
        f_q, sample_ids = self.NeighborSample(feats_q[i], i, num_s, [])
      File "/home/ma-user/work/CCPL-main/net.py", line 179, in NeighborSample
        print(feat_r[:, c_ids, :].shape, feat_r[:, n_ids, :].shape)
    IndexError: tensors used as indices must be long, byte or bool tensors
    

    The message says that tensors used as indices must be long, byte, or bool tensors. Make sure c_ids and n_ids have an appropriate dtype; if they are float (or other) tensors, convert them to long with the .long() method.

    c_ids = c_ids.long()
    n_ids = n_ids.long()
    
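A minimal reproduction of the rule in plain PyTorch (not the CCPL code):

```python
import torch

feat_r = torch.arange(12.0).reshape(1, 6, 2)
ids = torch.tensor([0.0, 2.0, 4.0])   # float tensor: using it directly as an index raises IndexError
rows = feat_r[:, ids.long(), :]       # casting to long makes the indexing valid
```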

    X. Tensor Device Error

    Traceback (most recent call last):
      File "train.py", line 59, in <module>
        visualizer.display_current_results(visuals, epoch, iter_counter.total_steps_so_far)
      File "/home/ma-user/work/SPADE-master/util/visualizer.py", line 45, in display_current_results
        visuals = self.convert_visuals_to_numpy(visuals)
      File "/home/ma-user/work/SPADE-master/util/visualizer.py", line 134, in convert_visuals_to_numpy
        t = util.tensor2im(t, tile=tile)
      File "/home/ma-user/work/SPADE-master/util/util.py", line 77, in tensor2im
        one_image_np = tensor2im(one_image)
      File "/home/ma-user/work/SPADE-master/util/util.py", line 88, in tensor2im
        image_numpy = image_tensor.detach().npu().float().numpy()
    TypeError: can't convert xla:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
    THPModule_npu_shutdown success.
    

    This error occurs while converting a tensor to a NumPy array: a tensor living on the xla:0 device cannot be converted directly, and the message suggests copying it to host memory with Tensor.cpu() first.
    To fix it, move the tensor to the CPU before the NumPy conversion, i.e. replace the .npu() call in tensor2im() with .cpu():

    image_numpy = image_tensor.detach().cpu().float().numpy()
    

    This moves the tensor to the CPU before converting it to a NumPy array, and the xla:0 device error no longer occurs.

    XI. Dataset Loading Error

    Traceback (most recent call last):
      File "main.py", line 51, in <module>
        main(config)
      File "main.py", line 33, in main
        solver = Solver(config, get_loader(config))
      File "/home/ma-user/work/SCFT_with_gradient_cosine_logging/data_loader.py", line 107, in get_loader
        drop_last=True)
      File "/home/ma-user/anaconda3/envs/PyTorch-1.11/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 277, in __init__
        sampler = RandomSampler(dataset, generator=generator)  # type: ignore[arg-type]
      File "/home/ma-user/anaconda3/envs/PyTorch-1.11/lib/python3.7/site-packages/torch/utils/data/sampler.py", line 98, in __init__
        "value, but got num_samples={}".format(self.num_samples))
    ValueError: num_samples should be a positive integer value, but got num_samples=0
    

    Cause: a wrong shuffle setting.
    Fix: since batch_size is already given, shuffle is not needed for randomization; setting shuffle to False makes it work. (That is what online sources say, and the change did fix it.) Note that num_samples=0 also means the dataset reports zero length, so the data path is worth double-checking too.
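Since num_samples is just len(dataset), the root cause can also be an empty dataset. A fail-fast guard before building the DataLoader (hypothetical helper, not from the original code) makes that explicit:

```python
def assert_nonempty(dataset, name="dataset"):
    """Raise a clear error when a dataset has zero samples."""
    n = len(dataset)
    if n == 0:
        raise ValueError(f"{name} is empty; check the data root path")
    return n
```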

    XII. Python Error

    main.py:49: ResourceWarning: unclosed file <_io.TextIOWrapper name='config.yml' mode='r' encoding='UTF-8'>
      config = yaml.load(open(params.config, 'r'), Loader=yaml.FullLoader)
    ResourceWarning: Enable tracemalloc to get the object allocation traceback
    

    Cause: the file is opened but never close()d. For example:

    import csv

    def get_data(file_name):
        rows = []
        testReportDir = "../test/"
        testReportDir_FileName = testReportDir + file_name
        data_file = open(testReportDir_FileName, mode="r", encoding="utf-8")  # never closed
        reader = csv.reader(data_file)
        next(reader, None)
        for row in reader:
            rows.append(row)
        return rows
    
    
    The fix is to use a with statement:

    with open(testReportDir_FileName, mode="r", encoding="utf-8") as f:
        data_file = f.read()
    

    with open is Python's way of opening local files; it closes the file automatically when the block finishes, so no manual close() is needed. (Per online sources; since the warning is harmless, this was not changed.)
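Applied to the get_data example above, the rewrite looks like this (same behavior, file closed automatically; the test_report_dir default parameter is added here for testability):

```python
import csv

def get_data(file_name, test_report_dir="../test/"):
    """Read a CSV test report, skipping the header; the file closes on block exit."""
    path = test_report_dir + file_name
    with open(path, mode="r", encoding="utf-8", newline="") as data_file:
        reader = csv.reader(data_file)
        next(reader, None)  # skip the header row
        return [row for row in reader]
```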

    Reference link 1
    Reference link 2: Ascend official site

  • Original post: https://blog.csdn.net/qq_40721108/article/details/133783070