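As technology advances, AI workloads process images more and more often, image-processing scenarios keep multiplying, and GPU usage grows with them. When the systems under performance test, such as image comparison or 3D processing, run on GPUs, we need to understand and monitor the GPU in detail.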
nvidia_gpu_exporter is an Nvidia GPU exporter for Prometheus that collects metrics using the nvidia-smi binary.
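To make the collection step concrete, here is a minimal sketch of the same idea in Python: call nvidia-smi with `--query-gpu` and turn its CSV output into one record per GPU. This is not the exporter's own code; the function names and the choice of fields are assumptions made for illustration, and `query_gpus` only works on a host where nvidia-smi is installed.

```python
import csv
import io
import subprocess

# Illustrative selection of standard `nvidia-smi --query-gpu` field names.
FIELDS = ["uuid", "name", "temperature.gpu", "utilization.gpu"]


def parse_query_output(text, fields=FIELDS):
    """Turn header-less CSV output into one dict per GPU (values stay strings)."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [
        dict(zip(fields, (cell.strip() for cell in row)))
        for row in rows
        if row  # skip blank lines
    ]


def query_gpus(fields=FIELDS):
    """Run nvidia-smi and return the parsed records (hypothetical helper)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(fields),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_query_output(out, fields)
```

Keeping the parsing separate from the subprocess call means the parser can be exercised with captured output on a machine that has no GPU.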
docker run -d \
  --name nvidia_smi_exporter \
  --restart unless-stopped \
  --device /dev/nvidiactl:/dev/nvidiactl \
  --device /dev/nvidia0:/dev/nvidia0 \
  -v /usr/lib64/libnvidia-ml.so:/usr/lib/aarch64-linux-gnu/libnvidia-ml.so \
  -v /usr/lib64/libnvidia-ml.so.1:/usr/lib/aarch64-linux-gnu/libnvidia-ml.so.1 \
  -v /usr/bin/nvidia-smi:/usr/bin/nvidia-smi \
  -p 9835:9835 \
  utkuozdemir/nvidia_gpu_exporter:1.1.0
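Once the container is up, a quick check is to fetch the metrics page on the published port and look at nvidia_smi_command_exit_code, which should be 0 when the last nvidia-smi call succeeded. The sketch below assumes the exporter is reachable on localhost at the default /metrics path; the helper names are hypothetical.

```python
import urllib.request


def exporter_url(host="localhost", port=9835):
    """Build the metrics URL; host and path are assumptions for a local setup."""
    return f"http://{host}:{port}/metrics"


def last_exit_code(metrics_text):
    """Return the value of nvidia_smi_command_exit_code, or None if absent."""
    for line in metrics_text.splitlines():
        # Comment lines start with '#', so only the sample line matches here.
        if line.startswith("nvidia_smi_command_exit_code "):
            return int(float(line.split()[1]))
    return None


def check_exporter(host="localhost", port=9835, timeout=5):
    """Fetch the metrics page and report whether the last scrape succeeded."""
    with urllib.request.urlopen(exporter_url(host, port), timeout=timeout) as resp:
        body = resp.read().decode("utf-8")
    return last_exit_code(body) == 0
```

A non-zero exit code, or a missing metric, usually points at the device or library mounts rather than at Prometheus itself.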
Project: GitHub - utkuozdemir/nvidia_gpu_exporter: Nvidia GPU exporter for prometheus using nvidia-smi binary
Dashboard: Nvidia GPU Metrics | Grafana Labs
# TYPE nvidia_smi_clocks_throttle_reasons_sync_boost gauge
nvidia_smi_clocks_throttle_reasons_sync_boost{uuid="cb792315-81df-27cd-cc86-78f7ff647b43"} 0
# HELP nvidia_smi_command_exit_code Exit code of the last scrape command
# TYPE nvidia_smi_command_exit_code gauge
nvidia_smi_command_exit_code 0
# HELP nvidia_smi_compute_mode compute_mode
# TYPE nvidia_smi_compute_mode gauge
nvidia_smi_compute_mode{uuid="cb792315-81df-27cd-cc86-78f7ff647b43"} 0
# HELP nvidia_smi_count count
# TYPE nvidia_smi_count gauge
nvidia_smi_count{uuid="cb792315-81df-27cd-cc86-78f7ff647b43"} 1
# HELP nvidia_smi_display_active display_active
# TYPE nvidia_smi_display_active gauge
nvidia_smi_display_active{uuid="cb792315-81df-27cd-cc86-78f7ff647b43"} 0
# HELP nvidia_smi_display_mode display_mode
# TYPE nvidia_smi_display_mode gauge
nvidia_smi_display_mode{uuid="cb792315-81df-27cd-cc86-78f7ff647b43"} 1
# HELP nvidia_smi_ecc_errors_corrected_aggregate_device_memory ecc.errors.corrected.aggregate.device_memory
# TYPE nvidia_smi_ecc_errors_corrected_aggregate_device_memory gauge
nvidia_smi_ecc_errors_corrected_aggregate_device_memory{uuid="cb792315-81df-27cd-cc86-78f7ff647b43"} 0
# HELP nvidia_smi_ecc_errors_corrected_aggregate_dram ecc.errors.corrected.aggregate.dram
# TYPE nvidia_smi_ecc_errors_corrected_aggregate_dram gauge
nvidia_smi_ecc_errors_corrected_aggregate_dram{uuid="cb792315-81df-27cd-cc86-78f7ff647b43"} 0
# HELP nvidia_smi_ecc_errors_corrected_aggregate_sram ecc.errors.corrected.aggregate.sram
# TYPE nvidia_smi_ecc_errors_corrected_aggregate_sram gauge
nvidia_smi_ecc_errors_corrected_aggregate_sram{uuid="cb792315-81df-27cd-cc86-78f7ff647b43"} 0
# HELP nvidia_smi_ecc_errors_corrected_aggregate_total ecc.errors.corrected.aggregate.total
# TYPE nvidia_smi_ecc_errors_corrected_aggregate_total gauge
nvidia_smi_ecc_errors_corrected_aggregate_total{uuid="cb792315-81df-27cd-cc86-78f7ff647b43"} 0
# HELP nvidia_smi_ecc_errors_corrected_volatile_device_memory ecc.errors.corrected.volatile.device_memory
# TYPE nvidia_smi_ecc_errors_corrected_volatile_device_memory gauge
nvidia_smi_ecc_errors_corrected_volatile_device_memory{uuid="cb792315-81df-27cd-cc86-78f7ff647b43"} 0
# HELP nvidia_smi_ecc_errors_corrected_volatile_dram ecc.errors.corrected.volatile.dram
# TYPE nvidia_smi_ecc_errors_corrected_volatile_dram gauge
nvidia_smi_ecc_errors_corrected_volatile_dram{uuid="cb792315-81df-27cd-cc86-78f7ff647b43"} 0
# HELP nvidia_smi_ecc_errors_corrected_volatile_sram ecc.errors.corrected.volatile.sram
# TYPE nvidia_smi_ecc_errors_corrected_volatile_sram gauge
nvidia_smi_ecc_errors_corrected_volatile_sram{uuid="cb792315-81df-27cd-cc86-78f7ff647b43"} 0
# HELP nvidia_smi_ecc_errors_corrected_volatile_total ecc.errors.corrected.volatile.total
# TYPE nvidia_smi_ecc_errors_corrected_volatile_total gauge
nvidia_smi_ecc_errors_corrected_volatile_total{uuid="cb792315-81df-27cd-cc86-78f7ff647b43"} 0
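
The output above is in the Prometheus text exposition format: `# HELP` and `# TYPE` comment lines, followed by samples of the form `name{labels} value`. As a rough sketch of how such text can be read in a test script, the Python below parses samples into tuples and picks out the aggregate corrected ECC count per GPU. The function names are hypothetical, and the label parsing is deliberately simple (it does not handle escaped quotes or braces inside label values).

```python
import re

# name, optional {labels}, then the value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, float_value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def corrected_ecc_totals(samples):
    """Map GPU uuid to its aggregate corrected ECC error count."""
    wanted = "nvidia_smi_ecc_errors_corrected_aggregate_total"
    return {
        labels["uuid"]: value
        for name, labels, value in samples
        if name == wanted and "uuid" in labels
    }
```

For anything beyond a quick check, querying Prometheus or the Grafana dashboard is the more robust route; this kind of parsing is mainly handy for asserting on a single scrape during a test run.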