• Monitoring GPUs on aarch64 Kylin V10 by deploying nvidia_gpu_exporter with Docker


    nvidia_gpu_exporter

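     AI workloads process images more often and in more scenarios than before, and GPU usage has grown with them. When performance-testing systems that rely on a GPU, such as image comparison or 3D processing, we need a detailed view of the GPU and a way to monitor it.

     nvidia_gpu_exporter is an Nvidia GPU exporter for Prometheus that collects metrics using the nvidia-smi binary.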
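
     To make the link between nvidia-smi and the exported metrics concrete: judging only from the HELP lines in the collected data further down, each metric name looks like the nvidia-smi query field with dots replaced by underscores and an `nvidia_smi_` prefix added. The helper below is a hypothetical sketch of that naming pattern (`field_to_metric` is not part of the exporter, and fields that carry units may be named differently):

```shell
# Hypothetical helper: derive the metric name that the sample output suggests
# for a given nvidia-smi query field (dots -> underscores, plus a prefix).
field_to_metric() {
  printf 'nvidia_smi_%s\n' "$(printf '%s' "$1" | tr '.' '_')"
}

field_to_metric ecc.errors.corrected.volatile.total
```

     For the field shown, this prints the same name that appears in the last TYPE line of the collected data.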

    Container start command

    docker run -d \
      --name nvidia_smi_exporter \
      --restart unless-stopped \
      --device /dev/nvidiactl:/dev/nvidiactl \
      --device /dev/nvidia0:/dev/nvidia0 \
      -v /usr/lib64/libnvidia-ml.so:/usr/lib/aarch64-linux-gnu/libnvidia-ml.so \
      -v /usr/lib64/libnvidia-ml.so.1:/usr/lib/aarch64-linux-gnu/libnvidia-ml.so.1 \
      -v /usr/bin/nvidia-smi:/usr/bin/nvidia-smi \
      -p 9835:9835 \
      utkuozdemir/nvidia_gpu_exporter:1.1.0
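
    Before starting the container, it can help to confirm that every host path passed to `--device` and `-v` actually exists, since the driver libraries are not installed under `/usr/lib64` on every system. A minimal sketch, where `missing_paths` is a hypothetical helper and the path list simply repeats the host side of the command above:

```shell
# Hypothetical helper: print every argument that does not exist on this host.
missing_paths() {
  for p in "$@"; do
    [ -e "$p" ] || printf '%s\n' "$p"
  done
}

# Host-side paths taken from the docker run command above.
missing_paths \
  /dev/nvidiactl \
  /dev/nvidia0 \
  /usr/lib64/libnvidia-ml.so \
  /usr/lib64/libnvidia-ml.so.1 \
  /usr/bin/nvidia-smi
```

    Empty output means all five paths are present; any path that is printed needs to be located on the host and the matching `--device` or `-v` argument adjusted.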

     GitHub

    GitHub - utkuozdemir/nvidia_gpu_exporter: Nvidia GPU exporter for prometheus using nvidia-smi binary

    Grafana dashboard

    Dashboard link: Nvidia GPU Metrics | Grafana Labs

     

     Collected data

    # TYPE nvidia_smi_clocks_throttle_reasons_sync_boost gauge
    nvidia_smi_clocks_throttle_reasons_sync_boost{uuid="cb792315-81df-27cd-cc86-78f7ff647b43"} 0
    # HELP nvidia_smi_command_exit_code Exit code of the last scrape command
    # TYPE nvidia_smi_command_exit_code gauge
    nvidia_smi_command_exit_code 0
    # HELP nvidia_smi_compute_mode compute_mode
    # TYPE nvidia_smi_compute_mode gauge
    nvidia_smi_compute_mode{uuid="cb792315-81df-27cd-cc86-78f7ff647b43"} 0
    # HELP nvidia_smi_count count
    # TYPE nvidia_smi_count gauge
    nvidia_smi_count{uuid="cb792315-81df-27cd-cc86-78f7ff647b43"} 1
    # HELP nvidia_smi_display_active display_active
    # TYPE nvidia_smi_display_active gauge
    nvidia_smi_display_active{uuid="cb792315-81df-27cd-cc86-78f7ff647b43"} 0
    # HELP nvidia_smi_display_mode display_mode
    # TYPE nvidia_smi_display_mode gauge
    nvidia_smi_display_mode{uuid="cb792315-81df-27cd-cc86-78f7ff647b43"} 1
    # HELP nvidia_smi_ecc_errors_corrected_aggregate_device_memory ecc.errors.corrected.aggregate.device_memory
    # TYPE nvidia_smi_ecc_errors_corrected_aggregate_device_memory gauge
    nvidia_smi_ecc_errors_corrected_aggregate_device_memory{uuid="cb792315-81df-27cd-cc86-78f7ff647b43"} 0
    # HELP nvidia_smi_ecc_errors_corrected_aggregate_dram ecc.errors.corrected.aggregate.dram
    # TYPE nvidia_smi_ecc_errors_corrected_aggregate_dram gauge
    nvidia_smi_ecc_errors_corrected_aggregate_dram{uuid="cb792315-81df-27cd-cc86-78f7ff647b43"} 0
    # HELP nvidia_smi_ecc_errors_corrected_aggregate_sram ecc.errors.corrected.aggregate.sram
    # TYPE nvidia_smi_ecc_errors_corrected_aggregate_sram gauge
    nvidia_smi_ecc_errors_corrected_aggregate_sram{uuid="cb792315-81df-27cd-cc86-78f7ff647b43"} 0
    # HELP nvidia_smi_ecc_errors_corrected_aggregate_total ecc.errors.corrected.aggregate.total
    # TYPE nvidia_smi_ecc_errors_corrected_aggregate_total gauge
    nvidia_smi_ecc_errors_corrected_aggregate_total{uuid="cb792315-81df-27cd-cc86-78f7ff647b43"} 0
    # HELP nvidia_smi_ecc_errors_corrected_volatile_device_memory ecc.errors.corrected.volatile.device_memory
    # TYPE nvidia_smi_ecc_errors_corrected_volatile_device_memory gauge
    nvidia_smi_ecc_errors_corrected_volatile_device_memory{uuid="cb792315-81df-27cd-cc86-78f7ff647b43"} 0
    # HELP nvidia_smi_ecc_errors_corrected_volatile_dram ecc.errors.corrected.volatile.dram
    # TYPE nvidia_smi_ecc_errors_corrected_volatile_dram gauge
    nvidia_smi_ecc_errors_corrected_volatile_dram{uuid="cb792315-81df-27cd-cc86-78f7ff647b43"} 0
    # HELP nvidia_smi_ecc_errors_corrected_volatile_sram ecc.errors.corrected.volatile.sram
    # TYPE nvidia_smi_ecc_errors_corrected_volatile_sram gauge
    nvidia_smi_ecc_errors_corrected_volatile_sram{uuid="cb792315-81df-27cd-cc86-78f7ff647b43"} 0
    # HELP nvidia_smi_ecc_errors_corrected_volatile_total ecc.errors.corrected.volatile.total
    # TYPE nvidia_smi_ecc_errors_corrected_volatile_total gauge
    nvidia_smi_ecc_errors_corrected_volatile_total{uuid="cb792315-81df-27cd-cc86-78f7ff647b43"} 0
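
    A quick way to check the exporter from a script is to pull a single value out of the text output shown above. The function below is a minimal sketch and not part of the exporter: `metric_value` is a hypothetical name, it assumes samples carry no trailing timestamp, and the URL in the comment assumes the exporter serves a `/metrics` path on the port published in the start command.

```shell
# Hypothetical helper: read Prometheus text format on stdin and print the
# value of every sample whose metric name equals $1.
metric_value() {
  awk -v name="$1" '
    /^#/ { next }                      # skip HELP and TYPE comment lines
    {
      key = $0
      sub(/[{ ].*/, "", key)           # drop labels and value, keep the name
      if (key == name) print $NF       # last field is the sample value
    }
  '
}

# Example (assumed URL):
#   curl -s http://localhost:9835/metrics | metric_value nvidia_smi_command_exit_code
```

    In the sample above, `nvidia_smi_command_exit_code` is 0, which according to its HELP text is the exit code of the last scrape command.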

  • Original article: https://blog.csdn.net/qq_43159578/article/details/126387734