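The nvidia-smi command (also known as NVSMI) is short for NVIDIA System Management Interface. It is used to monitor and manage GPU devices.
Running nvidia-smi in a terminal with no arguments shows every GPU device along with its status information: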
root@container-14dc11ad52-9e0fd82d:~# nvidia-smi
Sun Sep 18 10:21:55 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-DGXS... Off | 00000000:07:00.0 Off | 0 |
| N/A 48C P0 175W / 300W | 5955MiB / 32508MiB | 6% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-DGXS... Off | 00000000:08:00.0 Off | 0 |
| N/A 58C P0 257W / 300W | 27128MiB / 32508MiB | 93% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-DGXS... Off | 00000000:0E:00.0 Off | 0 |
| N/A 48C P0 52W / 300W | 2768MiB / 32508MiB | 32% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-DGXS... Off | 00000000:0F:00.0 Off | 0 |
| N/A 46C P0 40W / 300W | 13MiB / 32508MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2151 G /usr/lib/xorg/Xorg 58MiB |
| 0 N/A N/A 2255 G /usr/bin/gnome-shell 83MiB |
| 0 N/A N/A 7145 C python 2839MiB |
| 0 N/A N/A 7364 C python 2755MiB |
| 0 N/A N/A 20935 G /usr/lib/xorg/Xorg 24MiB |
| 0 N/A N/A 21079 G /usr/bin/gnome-shell 189MiB |
| 1 N/A N/A 2151 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 20935 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 34676 C python 27115MiB |
| 2 N/A N/A 2151 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 20565 C python 2755MiB |
| 2 N/A N/A 20935 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 2151 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 20935 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+
For an explanation of how to read this panel, see this article.
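As a rough illustration of what the status rows contain, the sketch below pulls the temperature, power draw, memory usage, and utilization out of the panel text with a regular expression. The names `ROW` and `parse_status_rows` are made up for this example, and the pattern assumes the row layout shown above (driver 495.29.05); other driver versions, or GPUs that report N/A for power, may need a different pattern. The GPU index is taken from the order of the rows, which is also an assumption.

```python
import re

# Matches a status row such as:
# | N/A 48C P0 175W / 300W | 5955MiB / 32508MiB | 6% Default |
# Assumption: the layout printed by the driver version shown in this article.
ROW = re.compile(
    r"(?P<temp>\d+)C\s+\S+\s+(?P<power>\d+)W\s*/\s*(?P<cap>\d+)W\s*\|"
    r"\s*(?P<used>\d+)MiB\s*/\s*(?P<total>\d+)MiB\s*\|"
    r"\s*(?P<util>\d+)%"
)


def parse_status_rows(panel: str) -> list[dict[str, int]]:
    """Return one dict of integer readings per GPU status row found in the text."""
    return [
        {key: int(value) for key, value in match.groupdict().items()}
        for match in ROW.finditer(panel)
    ]


# Two GPUs copied from the panel above.
sample = """
| 0 Tesla V100-DGXS... Off | 00000000:07:00.0 Off | 0 |
| N/A 48C P0 175W / 300W | 5955MiB / 32508MiB | 6% Default |
| 1 Tesla V100-DGXS... Off | 00000000:08:00.0 Off | 0 |
| N/A 58C P0 257W / 300W | 27128MiB / 32508MiB | 93% Default |
"""

# Assumption: rows appear in GPU-index order, so enumerate() gives the index.
for gpu, row in enumerate(parse_status_rows(sample)):
    share = row["used"] / row["total"]
    print(f"GPU {gpu}: {row['util']}% busy, {share:.0%} of memory used, {row['temp']}C")
```

Scraping the human-readable panel is fragile, so this is best treated as a way to understand the fields rather than as a monitoring tool.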
Run nvidia-smi -h to view the command's help text.
Run nvidia-smi -L to list all GPU devices together with their UUIDs:
root@container-14dc11ad52-9e0fd82d:~# nvidia-smi -L
GPU 0: Tesla V100-DGXS-32GB (UUID: GPU-8e82d306-7c7b-b020-2847-afe95fd09f33)
GPU 1: Tesla V100-DGXS-32GB (UUID: GPU-8c4978ad-c5d1-e4d0-19ac-c659644fdb02)
GPU 2: Tesla V100-DGXS-32GB (UUID: GPU-8aec1981-46ca-fd72-376d-51d9eeaf166b)
GPU 3: Tesla V100-DGXS-32GB (UUID: GPU-b0a24c4f-6928-3ac2-7fba-a2969bbad8ba)
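If the device list needs to be consumed by a script, the output of nvidia-smi -L is regular enough to parse line by line. The following is a minimal sketch, assuming the line format shown above; `LINE` and `parse_gpu_list` are hypothetical names chosen for this example, and the sample text is copied from the listing rather than read from a live system.

```python
import re

# One line of `nvidia-smi -L` output, e.g.
# GPU 0: Tesla V100-DGXS-32GB (UUID: GPU-8e82d306-...)
LINE = re.compile(r"^GPU (?P<index>\d+): (?P<name>.+?) \(UUID: (?P<uuid>[^)]+)\)\s*$")


def parse_gpu_list(text: str) -> list[tuple[int, str, str]]:
    """Return (index, name, uuid) for every GPU line; other lines are ignored."""
    gpus = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            gpus.append((int(match["index"]), match["name"], match["uuid"]))
    return gpus


# Sample copied from the listing above.
sample = """\
GPU 0: Tesla V100-DGXS-32GB (UUID: GPU-8e82d306-7c7b-b020-2847-afe95fd09f33)
GPU 1: Tesla V100-DGXS-32GB (UUID: GPU-8c4978ad-c5d1-e4d0-19ac-c659644fdb02)
"""

for index, name, uuid in parse_gpu_list(sample):
    print(index, name, uuid)
```

On a real machine the text could come from `subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout` instead of the hard-coded sample.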
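Run nvidia-smi -q to print detailed information for every GPU device. To print the details of a single GPU only, add the -i option.
Run nvidia-smi -i [GPU index] to show information for one GPU device only. This host has just 4 GPUs, so [GPU index] must be one of {0, 1, 2, 3}.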
root@container-14dc11ad52-9e0fd82d:~# nvidia-smi -i 1
Sun Sep 18 10:18:52 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 1 Tesla V100-DGXS... Off | 00000000:08:00.0 Off | 0 |
| N/A 57C P0 229W / 300W | 27128MiB / 32508MiB | 99% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 1 N/A N/A 2151 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 20935 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 34676 C python 27115MiB |
+-----------------------------------------------------------------------------+
The -i option can also be combined with other options. For example,
root@container-14dc11ad52-9e0fd82d:~# nvidia-smi -q -i 0
prints the detailed information for GPU 0.
Run nvidia-smi -l [second] to refresh the panel every second seconds. When monitoring GPU utilization, a 1-second interval is the usual choice:
root@container-14dc11ad52-9e0fd82d:~# nvidia-smi -l 1
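The same kind of periodic sampling can be scripted. The sketch below is one possible approach, not something the original workflow uses: it relies on nvidia-smi's --query-gpu and --format=csv,noheader,nounits options, which print one comma-separated line per GPU, and polls at a fixed interval. The function names (`parse_csv_line`, `sample_once`, `watch`) and the chosen fields are assumptions for illustration; only the parser is exercised here, on a line built from the GPU 1 values in the panel above.

```python
import subprocess
import time

# Fields requested from nvidia-smi; chosen for this example only.
QUERY = "index,utilization.gpu,memory.used,memory.total"


def parse_csv_line(line: str) -> dict[str, int]:
    """Parse one line such as '1, 93, 27128, 32508' into named integer fields.

    Assumption: every field is numeric. Some systems report values like
    '[N/A]', which would raise ValueError here.
    """
    index, util, used, total = (int(field) for field in line.split(","))
    return {"index": index, "util": util, "used_mib": used, "total_mib": total}


def sample_once() -> list[dict[str, int]]:
    """Query all GPUs once. Requires nvidia-smi on PATH."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_csv_line(line) for line in out.splitlines() if line.strip()]


def watch(interval_s: float = 1.0, rounds: int = 5) -> None:
    """Print utilization and memory usage every interval_s seconds, rounds times."""
    for _ in range(rounds):
        for gpu in sample_once():
            print(f"GPU {gpu['index']}: {gpu['util']}% "
                  f"{gpu['used_mib']}/{gpu['total_mib']} MiB")
        time.sleep(interval_s)


# Parser demo on a line assembled from the GPU 1 readings shown earlier.
print(parse_csv_line("1, 93, 27128, 32508"))
```

Calling `watch()` on a machine with an NVIDIA driver installed would print a fresh reading for each GPU once per second, five times.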
For more details, see the official documentation.