This post is the bug report I filed with the Ascend / pytorch community. In it I tested the NPU's actual compute performance and found a sizable gap between the Huawei NPU's real device memory and what was advertised at sale (see Issue 1 for the compute problem and Issue 2 for the memory problem).
Is it really "far in the lead", or...?
The NPU in this machine is an Atlas 300I Pro.
This issue reports two problems in total, with my environment information attached at the end; I hope to get an explanation soon.
Issue 1: Inference is slow and eventually errors out, at only 1.20 it/s
Issue 2: NPU memory accounting seems to differ from GPU's; why?
Running fp16 on the NPU:
import torch
import torch_npu
from accelerate import Accelerator
accelerator = Accelerator()
from accelerate import dispatch_model
device = accelerator.device
# source '/home/HwHiAiUser/Ascend/ascend-toolkit/set_env.sh'
x = torch.randn(2, 2).npu()
y = torch.randn(2, 2).npu()
z = x.mm(y)
print(z)
print(device)
# from modelscope import Model, AutoTokenizer
# model = Model.from_pretrained("modelscope/Llama-2-7b-ms", revision='v1.0.1', device_map=device, torch_dtype=torch.float16)
# tokenizer = AutoTokenizer.from_pretrained("modelscope/Llama-2-7b-ms", revision='v1.0.1')
# prompt = "Hey, are you conscious? Can you talk to me?"
# inputs = tokenizer(prompt, return_tensors="pt")
# # Generate
# generate_ids = model.generate(inputs.input_ids.to(model.device), max_length=30)
# print(tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])
from modelscope import AutoModelForCausalLM, AutoTokenizer, snapshot_download
from modelscope import GenerationConfig
# Note: The default behavior now has injection attack prevention off.
model_dir = snapshot_download("qwen/Qwen-7B-Chat", revision = 'v1.1.4')
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
# use bf16
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map=device, trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="cpu", trust_remote_code=True).eval()
# use auto mode, automatically select precision based on the device.
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map=device, trust_remote_code=True).eval()
# Specify hyperparameters for generation
model.generation_config = GenerationConfig.from_pretrained(model_dir, trust_remote_code=True) # different generation lengths, top_p, and other hyperparameters can be specified here
# 1st dialogue turn
response, history = model.chat(tokenizer, "你好", history=None)
print(response)
# 你好!很高兴为你提供帮助。
# 2nd dialogue turn
response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
print(response)
# 这是一个关于一个年轻人奋斗创业最终取得成功的故事。
# 故事的主人公叫李明,他来自一个普通的家庭,父母都是普通的工人。从小,李明就立下了一个目标:要成为一名成功的企业家。
# 为了实现这个目标,李明勤奋学习,考上了大学。在大学期间,他积极参加各种创业比赛,获得了不少奖项。他还利用课余时间去实习,积累了宝贵的经验。
# 毕业后,李明决定开始自己的创业之路。他开始寻找投资机会,但多次都被拒绝了。然而,他并没有放弃。他继续努力,不断改进自己的创业计划,并寻找新的投资机会。
# 最终,李明成功地获得了一笔投资,开始了自己的创业之路。他成立了一家科技公司,专注于开发新型软件。在他的领导下,公司迅速发展起来,成为了一家成功的科技企业。
# 李明的成功并不是偶然的。他勤奋、坚韧、勇于冒险,不断学习和改进自己。他的成功也证明了,只要努力奋斗,任何人都有可能取得成功。
# 3rd dialogue turn
response, history = model.chat(tokenizer, "给这个故事起一个标题", history=history)
print(response)
# 《奋斗创业:一个年轻人的成功之路》
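To put a number on the generation speed itself (separate from the 1.20 it/s shard-loading bar), a timing wrapper around model.chat helps. A minimal sketch, assuming model and tokenizer are loaded as above; timed_chat and the approximate token count are my own illustrative additions:
import time

def timed_chat(model, tokenizer, prompt, history=None):
    # Time one chat turn and estimate throughput by re-tokenizing the reply.
    start = time.perf_counter()
    response, history = model.chat(tokenizer, prompt, history=history)
    elapsed = time.perf_counter() - start
    n_tokens = len(tokenizer(response)["input_ids"])  # rough output-token count
    print(f"turn took {elapsed:.1f}s, ~{n_tokens / elapsed:.2f} tokens/s")
    return response, history

response, history = timed_chat(model, tokenizer, "你好")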
Addendum: I modified the code so that it is possible to tell at which step the error occurs; the full console output is included in the error content below.
import torch
import torch_npu
from accelerate import Accelerator
accelerator = Accelerator()
from accelerate import dispatch_model
device = accelerator.device
# source '/home/HwHiAiUser/Ascend/ascend-toolkit/set_env.sh'
x = torch.randn(2, 2).npu()
y = torch.randn(2, 2).npu()
z = x.mm(y)
print(z)
print(device)
# from modelscope import Model, AutoTokenizer
# model = Model.from_pretrained("modelscope/Llama-2-7b-ms", revision='v1.0.1', device_map=device, torch_dtype=torch.float16)
# tokenizer = AutoTokenizer.from_pretrained("modelscope/Llama-2-7b-ms", revision='v1.0.1')
# prompt = "Hey, are you conscious? Can you talk to me?"
# inputs = tokenizer(prompt, return_tensors="pt")
# # Generate
# generate_ids = model.generate(inputs.input_ids.to(model.device), max_length=30)
# print(tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])
from modelscope import AutoModelForCausalLM, AutoTokenizer, snapshot_download
from modelscope import GenerationConfig
# Note: The default behavior now has injection attack prevention off.
model_dir = snapshot_download("qwen/Qwen-7B-Chat", revision = 'v1.1.4')
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
# use bf16
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map=device, trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="cpu", trust_remote_code=True).eval()
# use auto mode, automatically select precision based on the device.
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map=device, trust_remote_code=True).eval()
# Specify hyperparameters for generation
model.generation_config = GenerationConfig.from_pretrained(model_dir, trust_remote_code=True) # different generation lengths, top_p, and other hyperparameters can be specified here
response, history = model.chat(tokenizer, input("请输入问题:"), history=None)
print(response)
while True:
    # subsequent dialogue turns
    response, history = model.chat(tokenizer, input("请输入问题:"), history=history)
    print(response)
Inference with the code above takes about as long as pure CPU inference, and the third dialogue turn errors out; the console output is as follows:
(NPU) (base) [HwHiAiUser@bogon Code]$ /home/HwHiAiUser/下载/yes/envs/NPU/bin/python /home/HwHiAiUser/Code/main.py
Warning: Device do not support double dtype now, dtype cast repalce with float.
tensor([[-1.8972, -0.0742],
[-0.6470, -0.0174]], device='npu:0')
npu
2023-10-23 00:27:21,121 - modelscope - INFO - PyTorch version 2.1.0+cpu Found.
2023-10-23 00:27:21,122 - modelscope - INFO - Loading ast index from /home/HwHiAiUser/.cache/modelscope/ast_indexer
2023-10-23 00:27:21,141 - modelscope - INFO - Loading done! Current index file version is 1.9.3, with md5 068f7e60e6f05d224ec8ad9a969f5922 and a total number of 943 components indexed
2023-10-23 00:27:21,675 - modelscope - INFO - Use user-specified model revision: v1.1.4
/home/HwHiAiUser/下载/yes/envs/NPU/lib/python3.9/site-packages/tiktoken/core.py:50: ResourceWarning: unclosed
self._core_bpe = _tiktoken.CoreBPE(mergeable_ranks, special_tokens, pat_str)
Warning: please make sure that you are using the latest codes and checkpoints, especially if you used Qwen-7B before 09.25.2023.请使用最新模型和代码,尤其如果你在9月25日前已经开始使用Qwen-7B,千万注意不要使用错误代码和模型。
Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
Loading checkpoint shards: 100%|██████████████████████| 8/8 [00:06<00:00, 1.20it/s]
[W OpCommand.cpp:75] Warning: [Check][offset] Check input storage_offset[%ld] = 0 failed, result is untrustworthy4096 (function operator())
/home/HwHiAiUser/下载/yes/envs/NPU/lib/python3.9/site-packages/transformers/generation/logits_process.py:407: UserWarning: AutoNonVariableTypeMode is deprecated and will be removed in 1.10 release. For kernel implementations please use AutoDispatchBelowADInplaceOrView instead, If you are looking for a user facing API to enable running your inference-only workload, please use c10::InferenceMode. Using AutoDispatchBelowADInplaceOrView in user code is under risk of producing silent wrong result in some edge cases. See Note [AutoDispatchBelowAutograd] for more details. (Triggered internally at /opt/_internal/cpython-3.9.0/lib/python3.9/site-packages/torch/include/ATen/core/LegacyTypeDispatch.h:74.)
sorted_indices_to_remove[..., -self.min_tokens_to_keep :] = 0
[W AddKernelNpu.cpp:86] Warning: The oprator of add is executed, Currently High Accuracy but Low Performance OP with 64-bit has been used, Please Do Some Cast at Python Functions with 32-bit for Better Performance! (function operator())
[W NeKernelNpu.cpp:28] Warning: The oprator of ne is executed, Currently High Accuracy but Low Performance OP with 64-bit has been used, Please Do Some Cast at Python Functions with 32-bit for Better Performance! (function operator())
你好!有什么我能为你做的吗?
好的,我给你讲一个年轻人奋斗创业最终取得成功的故事。这个故事叫做《奋斗》。
故事的主人公是一个叫做李明的年轻人,他出生在一个普通的家庭,但他有一个梦想,那就是成为一名企业家。他从小就对创业有着浓厚的兴趣,经常参加各种创业比赛,也曾经在大学期间创办过一家小型的创业公司。
然而,创业的道路并不容易,李明经历了许多挫折和困难。他的公司一度面临破产的危险,但他并没有放弃,而是更加努力地工作,寻找新的机会和资源。
最终,李明的努力得到了回报,他的公司开始慢慢发展起来,他也因此获得了许多荣誉和奖励。他的故事告诉我们,只要有梦想,有勇气,有毅力,就一定能够实现自己的创业梦想。
EZ9999: Inner Error!
EZ9999 Kernel task happen error, retCode=0x28, [aicpu timeout].[FUNC:PreCheckTaskErr][FILE:task_info.cc][LINE:1574]
TraceBack (most recent call last):
rtStreamSynchronizeWithTimeout execute failed, reason=[aicpu timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:50]
synchronize stream failed, runtime result = 507017[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
DEVICE[0] PID[18400]:
EXCEPTION STREAM:
Exception info:TGID=18400, model id=65535, stream id=3, stream phase=3
Message info[0]:RTS_HWTS: Aicpu timeout, slot_id=12, stream_id=3, task_id=6200
Other info[0]:time=2023-10-23-00:50:30.892.993, function=process_hwts_timeout_exception, line=3745, error code=0x28
Traceback (most recent call last):
File "/home/HwHiAiUser/Code/main.py", line 66, in
response, history = model.chat(tokenizer, "给这个故事起一个标题", history=history)
File "/home/HwHiAiUser/.cache/huggingface/modules/transformers_modules/Qwen-7B-Chat/modeling_qwen.py", line 1199, in chat
outputs = self.generate(
File "/home/HwHiAiUser/.cache/huggingface/modules/transformers_modules/Qwen-7B-Chat/modeling_qwen.py", line 1318, in generate
return super().generate(
File "/home/HwHiAiUser/下载/yes/envs/NPU/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/HwHiAiUser/下载/yes/envs/NPU/lib/python3.9/site-packages/transformers/generation/utils.py", line 1652, in generate
return self.sample(
File "/home/HwHiAiUser/下载/yes/envs/NPU/lib/python3.9/site-packages/transformers/generation/utils.py", line 2793, in sample
if unfinished_sequences.max() == 0:
RuntimeError: ACL stream synchronize failed.
[W NPUStream.cpp:372] Warning: NPU warning, error code is 507017[Error]:
[Error]: The aicpu execution times out.
Rectify the fault based on the error information in the log, or you can ask us at follwing gitee link by issues: https://gitee.com/ascend/pytorch/issue
EH9999: Inner Error!
rtDeviceSynchronize execute failed, reason=[aicpu timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:50]
EH9999 wait for compute device to finish failed, runtime result = 507017.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
TraceBack (most recent call last):
(function npuSynchronizeDevice)
[W NPUStream.cpp:372] Warning: NPU warning, error code is 507017[Error]:
[Error]: The aicpu execution times out.
Rectify the fault based on the error information in the log, or you can ask us at follwing gitee link by issues: https://gitee.com/ascend/pytorch/issue
EH9999: Inner Error!
rtDeviceSynchronize execute failed, reason=[aicpu timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:50]
EH9999 wait for compute device to finish failed, runtime result = 507017.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
TraceBack (most recent call last):
(function npuSynchronizeDevice)
/home/HwHiAiUser/下载/yes/envs/NPU/lib/python3.9/tempfile.py:821: ResourceWarning: Implicitly cleaning up
_warnings.warn(warn_message, ResourceWarning)
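For aicpu-timeout errors like the one above, CANN can emit far more detailed device-side logs. The two environment variables below are standard CANN logging switches; they must be set before the NPU is initialized, so exporting them in the shell before launching Python is safest. Setting them at the very top of the script, as sketched here, is my assumption of an equivalent route:
import os

# CANN global log level: 0 = debug, 1 = info, 2 = warning, 3 = error
os.environ["ASCEND_GLOBAL_LOG_LEVEL"] = "1"
# Print device-side logs to stdout instead of the default log files
os.environ["ASCEND_SLOG_PRINT_TO_STDOUT"] = "1"

# Import torch/torch_npu only after the logging switches are in place
import torch
import torch_npu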
Console output of the modified code:
(NPU) (base) [HwHiAiUser@bogon Code]$ /home/HwHiAiUser/下载/yes/envs/NPU/bin/python /home/HwHiAiUser/Code/main.py
Warning: Device do not support double dtype now, dtype cast repalce with float.
tensor([[-0.3798, 0.5290],
[-0.7580, -0.6727]], device='npu:0')
npu
2023-10-23 08:28:35,064 - modelscope - INFO - PyTorch version 2.1.0+cpu Found.
2023-10-23 08:28:35,065 - modelscope - INFO - Loading ast index from /home/HwHiAiUser/.cache/modelscope/ast_indexer
2023-10-23 08:28:35,084 - modelscope - INFO - Loading done! Current index file version is 1.9.3, with md5 068f7e60e6f05d224ec8ad9a969f5922 and a total number of 943 components indexed
2023-10-23 08:28:35,538 - modelscope - INFO - Use user-specified model revision: v1.1.4
/home/HwHiAiUser/下载/yes/envs/NPU/lib/python3.9/site-packages/tiktoken/core.py:50: ResourceWarning: unclosed
self._core_bpe = _tiktoken.CoreBPE(mergeable_ranks, special_tokens, pat_str)
Warning: please make sure that you are using the latest codes and checkpoints, especially if you used Qwen-7B before 09.25.2023.请使用最新模型和代码,尤其如果你在9月25日前已经开始使用Qwen-7B,千万注意不要使用错误代码和模型。
Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
Loading checkpoint shards: 100%|██████████████████████| 8/8 [00:05<00:00, 1.37it/s]
请输入问题:你好!你是谁
[W OpCommand.cpp:75] Warning: [Check][offset] Check input storage_offset[%ld] = 0 failed, result is untrustworthy4096 (function operator())
/home/HwHiAiUser/下载/yes/envs/NPU/lib/python3.9/site-packages/transformers/generation/logits_process.py:407: UserWarning: AutoNonVariableTypeMode is deprecated and will be removed in 1.10 release. For kernel implementations please use AutoDispatchBelowADInplaceOrView instead, If you are looking for a user facing API to enable running your inference-only workload, please use c10::InferenceMode. Using AutoDispatchBelowADInplaceOrView in user code is under risk of producing silent wrong result in some edge cases. See Note [AutoDispatchBelowAutograd] for more details. (Triggered internally at /opt/_internal/cpython-3.9.0/lib/python3.9/site-packages/torch/include/ATen/core/LegacyTypeDispatch.h:74.)
sorted_indices_to_remove[..., -self.min_tokens_to_keep :] = 0
[W AddKernelNpu.cpp:86] Warning: The oprator of add is executed, Currently High Accuracy but Low Performance OP with 64-bit has been used, Please Do Some Cast at Python Functions with 32-bit for Better Performance! (function operator())
[W NeKernelNpu.cpp:28] Warning: The oprator of ne is executed, Currently High Accuracy but Low Performance OP with 64-bit has been used, Please Do Some Cast at Python Functions with 32-bit for Better Performance! (function operator())
我是通义千问,由阿里云开发的AI助手。我被设计用来回答各种问题、提供信息和与用户进行对话。有什么我可以帮助你的吗?
请输入问题:
The code for Issue 2, using auto precision mode:
import torch
import torch_npu
from accelerate import Accelerator
accelerator = Accelerator()
from accelerate import dispatch_model
device = accelerator.device
# source '/home/HwHiAiUser/Ascend/ascend-toolkit/set_env.sh'
x = torch.randn(2, 2).npu()
y = torch.randn(2, 2).npu()
z = x.mm(y)
print(z)
print(device)
# from modelscope import Model, AutoTokenizer
# model = Model.from_pretrained("modelscope/Llama-2-7b-ms", revision='v1.0.1', device_map=device, torch_dtype=torch.float16)
# tokenizer = AutoTokenizer.from_pretrained("modelscope/Llama-2-7b-ms", revision='v1.0.1')
# prompt = "Hey, are you conscious? Can you talk to me?"
# inputs = tokenizer(prompt, return_tensors="pt")
# # Generate
# generate_ids = model.generate(inputs.input_ids.to(model.device), max_length=30)
# print(tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])
from modelscope import AutoModelForCausalLM, AutoTokenizer, snapshot_download
from modelscope import GenerationConfig
# Note: The default behavior now has injection attack prevention off.
model_dir = snapshot_download("qwen/Qwen-7B-Chat", revision = 'v1.1.4')
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
# use bf16
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map=device, trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="cpu", trust_remote_code=True).eval()
# use auto mode, automatically select precision based on the device.
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map=device, trust_remote_code=True).eval()
# Specify hyperparameters for generation
model.generation_config = GenerationConfig.from_pretrained(model_dir, trust_remote_code=True) # different generation lengths, top_p, and other hyperparameters can be specified here
# 1st dialogue turn
response, history = model.chat(tokenizer, "你好", history=None)
print(response)
# 你好!很高兴为你提供帮助。
# 2nd dialogue turn
response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
print(response)
# 这是一个关于一个年轻人奋斗创业最终取得成功的故事。
# 故事的主人公叫李明,他来自一个普通的家庭,父母都是普通的工人。从小,李明就立下了一个目标:要成为一名成功的企业家。
# 为了实现这个目标,李明勤奋学习,考上了大学。在大学期间,他积极参加各种创业比赛,获得了不少奖项。他还利用课余时间去实习,积累了宝贵的经验。
# 毕业后,李明决定开始自己的创业之路。他开始寻找投资机会,但多次都被拒绝了。然而,他并没有放弃。他继续努力,不断改进自己的创业计划,并寻找新的投资机会。
# 最终,李明成功地获得了一笔投资,开始了自己的创业之路。他成立了一家科技公司,专注于开发新型软件。在他的领导下,公司迅速发展起来,成为了一家成功的科技企业。
# 李明的成功并不是偶然的。他勤奋、坚韧、勇于冒险,不断学习和改进自己。他的成功也证明了,只要努力奋斗,任何人都有可能取得成功。
# 3rd dialogue turn
response, history = model.chat(tokenizer, "给这个故事起一个标题", history=history)
print(response)
# 《奋斗创业:一个年轻人的成功之路》
Loading an unquantized full-precision 7B model for inference never uses more than 20 GB of memory on a GPU, yet on the NPU it ran out of memory.
Moreover, the Atlas 300I Pro was advertised with 24 GB of memory when I bought it, but the card I actually received has only about 20 GB (npu-smi reports 21527 MB total). I want a reasonable explanation.
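To read the usable capacity from inside PyTorch rather than off the spec sheet, the allocator statistics can be queried directly. A minimal sketch, assuming torch_npu mirrors the torch.cuda memory API (get_device_properties, memory_allocated, memory_reserved), which the OOM message below suggests it does:
import torch
import torch_npu

torch.npu.set_device(0)
props = torch.npu.get_device_properties(0)
print(f"total:     {props.total_memory / 1024**3:.2f} GiB")
print(f"allocated: {torch.npu.memory_allocated(0) / 1024**3:.2f} GiB")
print(f"reserved:  {torch.npu.memory_reserved(0) / 1024**3:.2f} GiB")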
Below are the execution output and the error for the first code segment:
(NPU) (base) [HwHiAiUser@bogon Code]$ /home/HwHiAiUser/下载/yes/envs/NPU/bin/python /home/HwHiAiUser/Code/main.py
Warning: Device do not support double dtype now, dtype cast repalce with float.
tensor([[ 0.0766, 0.2028],
[-2.3419, -1.6132]], device='npu:0')
npu
2023-10-23 07:56:42,901 - modelscope - INFO - PyTorch version 2.1.0+cpu Found.
2023-10-23 07:56:42,901 - modelscope - INFO - Loading ast index from /home/HwHiAiUser/.cache/modelscope/ast_indexer
2023-10-23 07:56:42,919 - modelscope - INFO - Loading done! Current index file version is 1.9.3, with md5 068f7e60e6f05d224ec8ad9a969f5922 and a total number of 943 components indexed
2023-10-23 07:56:43,931 - modelscope - INFO - Use user-specified model revision: v1.1.4
/home/HwHiAiUser/下载/yes/envs/NPU/lib/python3.9/site-packages/tiktoken/core.py:50: ResourceWarning: unclosed
self._core_bpe = _tiktoken.CoreBPE(mergeable_ranks, special_tokens, pat_str)
Warning: please make sure that you are using the latest codes and checkpoints, especially if you used Qwen-7B before 09.25.2023.请使用最新模型和代码,尤其如果你在9月25日前已经开始使用Qwen-7B,千万注意不要使用错误代码和模型。
Flash attention will be disabled because it does NOT support fp32.
Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
Loading checkpoint shards: 62%|█████████████▊ | 5/8 [00:06<00:04, 1.38s/it]
Traceback (most recent call last):
File "/home/HwHiAiUser/Code/main.py", line 45, in
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map=device, trust_remote_code=True).eval()
File "/home/HwHiAiUser/下载/yes/envs/NPU/lib/python3.9/site-packages/modelscope/utils/hf_util.py", line 181, in from_pretrained
module_obj = module_class.from_pretrained(model_dir, *model_args,
File "/home/HwHiAiUser/下载/yes/envs/NPU/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 560, in from_pretrained
return model_class.from_pretrained(
File "/home/HwHiAiUser/下载/yes/envs/NPU/lib/python3.9/site-packages/modelscope/utils/hf_util.py", line 78, in from_pretrained
return ori_from_pretrained(cls, model_dir, *model_args, **kwargs)
File "/home/HwHiAiUser/下载/yes/envs/NPU/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3307, in from_pretrained
) = cls._load_pretrained_model(
File "/home/HwHiAiUser/下载/yes/envs/NPU/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3695, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File "/home/HwHiAiUser/下载/yes/envs/NPU/lib/python3.9/site-packages/transformers/modeling_utils.py", line 741, in _load_state_dict_into_meta_model
set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
File "/home/HwHiAiUser/下载/yes/envs/NPU/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 317, in set_module_tensor_to_device
new_value = value.to(device)
RuntimeError: NPU out of memory. Tried to allocate 66.00 MiB (NPU 0; 0 bytes total capacity; 19.09 GiB already allocated; 19.09 GiB current active; 0 bytes free; 19.31 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
/home/HwHiAiUser/下载/yes/envs/NPU/lib/python3.9/tempfile.py:821: ResourceWarning: Implicitly cleaning up
_warnings.warn(warn_message, ResourceWarning)
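The OOM message suggests max_split_size_mb against fragmentation. On CUDA that option is passed via PYTORCH_CUDA_ALLOC_CONF; it is my assumption, not something I have verified, that torch_npu honors the analogous PYTORCH_NPU_ALLOC_CONF:
import os

# Assumed NPU analogue of PYTORCH_CUDA_ALLOC_CONF; must be set before
# torch_npu initializes the device
os.environ["PYTORCH_NPU_ALLOC_CONF"] = "max_split_size_mb:128"

import torch
import torch_npu
That said, reserved (19.31 GiB) barely exceeds allocated (19.09 GiB), so fragmentation is unlikely to be the real cause here; for reference, 7B parameters in fp32 already take about 7e9 × 4 B ≈ 26 GiB of weights alone.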
Firmware version check
(NPU) [HwHiAiUser@localhost ~]$ sudo /usr/local/Ascend/driver/tools/upgrade-tool --device_index -1 --component -1 --version
{
Get component version(6.4.12.1.241) succeed for deviceId(0), componentType(11).
{"device_id":0, "component":hboot1a, "version":6.4.12.1.241}
Get component version(6.4.12.1.241) succeed for deviceId(0), componentType(12).
{"device_id":0, "component":hboot1b, "version":6.4.12.1.241}
Get component version(6.4.12.1.241) succeed for deviceId(0), componentType(18).
{"device_id":0, "component":hlink, "version":6.4.12.1.241}
}
npu-smi info
(NPU) [HwHiAiUser@localhost ~]$ npu-smi info
+--------------------------------------------------------------------------------------------------------+
| npu-smi 23.0.rc2 Version: 23.0.rc2 |
+-------------------------------+-----------------+------------------------------------------------------+
| NPU Name | Health | Power(W) Temp(C) Hugepages-Usage(page) |
| Chip Device | Bus-Id | AICore(%) Memory-Usage(MB) |
+===============================+=================+======================================================+
| 8 310P3 | OK | NA 37 0 / 0 |
| 0 0 | 0000:01:00.0 | 0 1700 / 21527 |
+===============================+=================+======================================================+
+-------------------------------+-----------------+------------------------------------------------------+
| NPU Chip | Process id | Process name | Process memory(MB) |
+===============================+=================+======================================================+
| No running processes found in NPU 8 |
+===============================+=================+======================================================+
CANN installation
CANN 7.0.RC1.alpha003, which matches pytorch 2.1.0, has been installed and the environment is configured correctly; the following code runs normally:
import torch
import torch_npu
# source '/home/HwHiAiUser/Ascend/ascend-toolkit/set_env.sh'
x = torch.randn(2, 2).npu()
y = torch.randn(2, 2).npu()
z = x.mm(y)
print(z)
Output:
(NPU) (base) [HwHiAiUser@bogon Code]$ /home/HwHiAiUser/下载/yes/envs/NPU/bin/python /home/HwHiAiUser/Code/main.py
Warning: Device do not support double dtype now, dtype cast repalce with float.
tensor([[ 0.0766, 0.2028],
[-2.3419, -1.6132]], device='npu:0')
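Finally, since this issue promises environment information, here is a small script to collect the software side in one place. A sketch under the assumption that torch_npu exposes __version__ and that torch.npu mirrors the torch.cuda device-query API:
import torch
import torch_npu

print("torch:       ", torch.__version__)
print("torch_npu:   ", torch_npu.__version__)
print("device count:", torch.npu.device_count())
print("device name: ", torch.npu.get_device_name(0))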