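I recently came across an excellent Zhihu article, 《PyTorch显存分配原理——以BERT为例》 ("How PyTorch Allocates GPU Memory, with BERT as an Example"). These notes record what I took from it: how CUDA memory is allocated, how to inspect a model's parameter count with torchinfo, and how to check GPU memory usage and process information with gpustat.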
Memory usage shown by nvidia-smi = CUDA context + PyTorch cache (allocated + unallocated)
A quick experiment:
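import torch
a = torch.zeros(size=(1024, 1024)).cuda()  # float32: 1024*1024*4 bytes = 4 MiB
torch.cuda.memory_allocated() / 1024 / 1024  # memory held by live tensors in the cache: 4 MiB
torch.cuda.memory_reserved() / 1024 / 1024  # total memory held by the cache: 20 MiB
# The PyTorch cache holds 20 MiB in total (unused cache + used cache)
# Tensors take 4 MiB (used cache)
# It follows that the unused cache is 20 - 4 = 16 MiB
""" nvidia-smi: muyao(1251M) """
# Usage shown by nvidia-smi = CUDA context + cache = 1251 MiB
# It follows that the CUDA context takes 1251 - 20 = 1231 MiB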
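Why does a 4 MiB tensor make the cache reserve 20 MiB? The sketch below mimics how the caching allocator picks the size of a new segment when no cached block fits. It assumes the default allocator settings and the size constants found in PyTorch's CUDACachingAllocator source at the time of writing; they are implementation details and may change between versions. The name segment_size is made up for illustration.

```python
MiB = 1024 * 1024

# Constants mirrored from PyTorch's caching allocator (assumed; may change between versions).
SMALL_SIZE = 1 * MiB        # requests up to this size count as "small"
SMALL_BUFFER = 2 * MiB      # segment size used for small requests
MIN_LARGE_ALLOC = 10 * MiB  # requests below this size get a 20 MiB segment
LARGE_BUFFER = 20 * MiB
ROUND_LARGE = 2 * MiB       # larger requests are rounded up to a multiple of this


def segment_size(nbytes: int) -> int:
    """Hypothetical helper: bytes requested from CUDA for a new segment."""
    if nbytes <= SMALL_SIZE:
        return SMALL_BUFFER
    if nbytes < MIN_LARGE_ALLOC:
        return LARGE_BUFFER
    return ROUND_LARGE * ((nbytes + ROUND_LARGE - 1) // ROUND_LARGE)


tensor_bytes = 1024 * 1024 * 4  # the float32 tensor from the experiment
print(segment_size(tensor_bytes) // MiB)  # 20
```

Under these assumptions the 4 MiB request falls into the "between 1 MiB and 10 MiB" bucket, which matches the 20 MiB reported by torch.cuda.memory_reserved().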
After deleting the temporary variable a:
del a
torch.cuda.memory_allocated() / 1024 / 1024  # memory held by live tensors in the cache: 0 MiB
torch.cuda.memory_reserved() / 1024 / 1024  # total memory held by the cache: still 20 MiB
After clearing the cache:
torch.cuda.empty_cache()
print(torch.cuda.memory_allocated() /1024/1024) # 0M
print(torch.cuda.memory_reserved() /1024/1024) # 0M
print(torch.cuda.memory_summary())
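The bookkeeping from the experiment can be written down as a small helper. This is only a sketch: memory_breakdown is a hypothetical name, it takes the three readings as plain numbers in MiB instead of querying the GPU, and it assumes a single PyTorch process accounts for the whole nvidia-smi figure.

```python
def memory_breakdown(nvidia_smi_mib, reserved_mib, allocated_mib):
    """Split an nvidia-smi reading into CUDA context, used cache and unused cache (MiB)."""
    return {
        "cuda_context": nvidia_smi_mib - reserved_mib,
        "cache_used": allocated_mib,
        "cache_unused": reserved_mib - allocated_mib,
    }


# Readings from the experiment: 1251 MiB in nvidia-smi, 20 MiB reserved, 4 MiB allocated.
print(memory_breakdown(1251, 20, 4))
# {'cuda_context': 1231, 'cache_used': 4, 'cache_unused': 16}
```

In a live session the last two arguments would come from torch.cuda.memory_reserved() and torch.cuda.memory_allocated(), each divided by 1024 * 1024.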
Install: pip install torchinfo
Usage, taking T5 as an example:
from transformers import T5Config, T5ForConditionalGeneration
from torchinfo import summary
model_name_or_path = "ptms/checkpoint-xxxxx"
config = T5Config.from_pretrained(model_name_or_path)
model = T5ForConditionalGeneration.from_pretrained(model_name_or_path, config=config)
model.to("cuda:1")
summary(model)
The summary output:
=====================================================================================
Layer (type:depth-idx) Param #
=====================================================================================
T5ForConditionalGeneration --
├─Embedding: 1-1 24,674,304
├─T5Stack: 1-2 24,674,304
│ └─Embedding: 2-1 (recursive)
│ └─ModuleList: 2-2 --
│ │ └─T5Block: 3-1 7,079,808
│ │ └─T5Block: 3-2 7,079,424
│ │ └─T5Block: 3-3 7,079,424
│ │ └─T5Block: 3-4 7,079,424
│ │ └─T5Block: 3-5 7,079,424
│ │ └─T5Block: 3-6 7,079,424
│ │ └─T5Block: 3-7 7,079,424
│ │ └─T5Block: 3-8 7,079,424
│ │ └─T5Block: 3-9 7,079,424
│ │ └─T5Block: 3-10 7,079,424
│ │ └─T5Block: 3-11 7,079,424
│ │ └─T5Block: 3-12 7,079,424
│ └─T5LayerNorm: 2-3 768
│ └─Dropout: 2-4 --
├─T5Stack: 1-3 24,674,304
│ └─Embedding: 2-5 (recursive)
│ └─ModuleList: 2-6 --
│ │ └─T5Block: 3-13 9,439,872
│ │ └─T5Block: 3-14 9,439,488
│ │ └─T5Block: 3-15 9,439,488
│ │ └─T5Block: 3-16 9,439,488
│ │ └─T5Block: 3-17 9,439,488
│ │ └─T5Block: 3-18 9,439,488
│ │ └─T5Block: 3-19 9,439,488
│ │ └─T5Block: 3-20 9,439,488
│ │ └─T5Block: 3-21 9,439,488
│ │ └─T5Block: 3-22 9,439,488
│ │ └─T5Block: 3-23 9,439,488
│ │ └─T5Block: 3-24 9,439,488
│ └─T5LayerNorm: 2-7 768
│ └─Dropout: 2-8 --
├─Linear: 1-4 24,674,304
=====================================================================================
Total params: 247,577,856
Trainable params: 247,577,856
Non-trainable params: 0
=====================================================================================
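As a sanity check, the leaf rows of the table can be added up by hand and turned into a rough weight-memory estimate. The figures below are copied from the summary; the rows marked (recursive) reuse the shared embedding, so it is counted once. The 4 bytes per parameter is an assumption that the weights are stored in float32, and the estimate covers weights only, not gradients, optimizer state, or activations.

```python
EMBEDDING = 24_674_304          # shared token embedding, counted once
ENCODER_BLOCKS = [7_079_808] + [7_079_424] * 11
DECODER_BLOCKS = [9_439_872] + [9_439_488] * 11
FINAL_LAYER_NORMS = [768, 768]  # one T5LayerNorm per stack
LM_HEAD = 24_674_304            # the "Linear: 1-4" row

total = (
    EMBEDDING
    + sum(ENCODER_BLOCKS)
    + sum(DECODER_BLOCKS)
    + sum(FINAL_LAYER_NORMS)
    + LM_HEAD
)
print(total)  # 247577856, matching "Total params"

# Assumption: float32 weights, 4 bytes per parameter.
fp32_mib = total * 4 / 1024 / 1024
print(round(fp32_mib, 1))  # 944.4
```

So loading this checkpoint in float32 should add roughly 944 MiB of tensors to the cache on top of the CUDA context.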
gpustat options:
usage: gpustat [-h] [--force-color | --no-color] [-c] [-u] [-p] [-F] [--json] [-v] [-P [{,draw,limit,draw,limit,limit,draw}]] [-i [INTERVAL]] [--no-header] [--gpuname-width GPUNAME_WIDTH] [--debug]
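For example:
Show memory usage and process PIDs, refreshing continuously: gpustat -i -p
Find where that PID was launched from (pwdx prints the process's current working directory): pwdx [PID]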
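The same lookup can be done from Python. A minimal sketch, assuming Linux: it reads the /proc/<pid>/cwd symlink, which holds the working directory that pwdx reports. The function name process_cwd is hypothetical, and reading another user's process usually requires matching permissions.

```python
import os


def process_cwd(pid):
    """Return the working directory of a process, or None if it cannot be read.

    Linux only: relies on the /proc/<pid>/cwd symlink.
    """
    try:
        return os.readlink(f"/proc/{pid}/cwd")
    except OSError:
        return None


# Demonstrated on the current process; in practice pass a PID shown by gpustat -p.
print(process_cwd(os.getpid()))
```

Returning None instead of raising keeps the helper easy to use in a loop over several PIDs, some of which may have exited already.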