• Summary of NVIDIA problems after a server reboot and their causes


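    Problem 1: nvidia-smi fails with "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver": cause and fix

            Scenario: the training server was lagging, so I rebooted it. When I ran the model again, CUDA was unavailable, and running nvidia-smi printed this error:

    NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver

            The cause is that the Linux kernel was upgraded across the reboot. The NVIDIA kernel module was built for the old kernel, so it no longer loads under the new one. The driver files themselves are still on disk. I tried to confirm the installation with nvcc -V (note that nvcc reports the CUDA toolkit version, not the driver version, so this is only an indirect check).

            nvcc -V reported that the command was not found, meaning the toolkit was not installed, so I installed it with: sudo apt install nvidia-cuda-toolkit

    After installation, nvcc -V runs and prints the toolkit version.

    After searching online, the fix that worked is as follows: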

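    Step 1: install dkms:

    sudo apt-get install dkms

    Step 2: find the version of the driver that can no longer connect

    ls -l /usr/src/

    You should see an nvidia directory; here it is nvidia-470.94. If there is no such directory, download the matching driver first. Download link: xxxxx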
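The directory name under /usr/src carries the version that the later dkms command needs. Here is a minimal sketch for pulling it out of a listing. The function name nvidia_src_versions is made up for illustration, and it assumes the directories are named nvidia-<version>, as in this post.

```shell
# Hypothetical helper: read directory names (one per line) on stdin and
# print the version part of every entry named nvidia-<version>.
nvidia_src_versions() {
    sed -n 's/^nvidia-\([0-9][0-9.]*\)$/\1/p'
}

# Typical use on the affected machine (plain ls, not ls -l):
#   ls /usr/src | nvidia_src_versions
```

If more than one version is printed, several driver source trees are present and you need to decide which one to rebuild.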

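    Step 3: install the matching driver:

    sudo /data/disk-2T/xxxx/softwares/cuda/NVIDIA-Linux-x86_64-470.94.run

    Replace this path with the location where you downloaded the 470.94 .run file.

    Alternatively, use this command:

    sudo dkms install -m nvidia -v 470.94

    The value after -v must be the NVIDIA driver version on your machine, as found in step 2. If this installation fails, see Problem 3 below.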
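To check that dkms really built the module for the kernel that is running now, you can look at the output of dkms status. The sketch below is only an assumption about how to automate that check: the helper name nvidia_built_for_kernel is hypothetical, and it assumes status lines of the form "nvidia, 470.94, <kernel>, x86_64: installed" or "nvidia/470.94, <kernel>, x86_64: installed". The exact format differs between dkms versions.

```shell
# Hypothetical helper: succeed when `dkms status` output (on stdin) contains
# an "installed" nvidia entry for the kernel release given as $1.
nvidia_built_for_kernel() {
    awk -v k="$1" '
        index($0, "nvidia") == 1 && index($0, ", " k ",") > 0 && /installed/ { found = 1 }
        END { exit !found }
    '
}

# Typical use on the affected machine:
#   dkms status | nvidia_built_for_kernel "$(uname -r)" \
#       && echo "nvidia module is built for the running kernel"
```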

    If the installation succeeded, congratulations! When nvidia-smi prints its usual GPU status table, the fix has worked.

    If it failed, you are probably facing Problem 2 below.

    Problem 2: fixing the NVIDIA GPU error "Failed to initialize NVML: Driver/library version mismatch"

    Reproducing the problem:

    $ nvidia-smi
    -->
    Failed to initialize NVML: Driver/library version mismatch

    Analysis:

            The version of the loaded NVIDIA kernel module does not match the version of the driver libraries installed on the system.

    Locating the problem:

    1. Check the version of the loaded NVIDIA kernel module:

    cat /proc/driver/nvidia/version

    Result:

    NVRM version: NVIDIA UNIX x86_64 Kernel Module 470.94 Mon Dec 6 22:42:02 UTC 2021
    GCC version: gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)

             The kernel module version is 470.94, built with gcc 7.5.0 on Ubuntu 18.04.
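If you want just the version number from that file, for example to compare it against something else in a script, a small sed filter is enough. This is a sketch that assumes the "Kernel Module <version>" layout shown above; the function name nvidia_kmod_version is hypothetical, and other driver builds may word the line differently.

```shell
# Hypothetical helper: read the contents of /proc/driver/nvidia/version on
# stdin and print the version that follows "Kernel Module" on the NVRM line.
nvidia_kmod_version() {
    sed -n 's/^NVRM version:.*Kernel Module  *\([0-9][0-9.]*\).*/\1/p'
}

# Typical use on the affected machine:
#   nvidia_kmod_version < /proc/driver/nvidia/version
```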

    2. Check the package installation log for NVIDIA entries:

    cat /var/log/dpkg.log | grep nvidia

    Result:

    2022-06-24 11:45:00 install libnvidia-compute-515:amd64 <none> 515.48.07-0ubuntu0.18.04.1
    2022-06-24 11:45:00 status half-installed libnvidia-compute-515:amd64 515.48.07-0ubuntu0.18.04.1
    2022-06-24 11:45:04 status unpacked libnvidia-compute-515:amd64 515.48.07-0ubuntu0.18.04.1
    2022-06-24 11:45:04 status unpacked libnvidia-compute-515:amd64 515.48.07-0ubuntu0.18.04.1
    2022-06-24 11:45:18 install nvidia-cuda-dev:amd64 <none> 9.1.85-3ubuntu1
    2022-06-24 11:45:18 status half-installed nvidia-cuda-dev:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:34 status unpacked nvidia-cuda-dev:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:34 status unpacked nvidia-cuda-dev:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:34 install nvidia-cuda-doc:all <none> 9.1.85-3ubuntu1
    2022-06-24 11:45:34 status half-installed nvidia-cuda-doc:all 9.1.85-3ubuntu1
    2022-06-24 11:45:38 status unpacked nvidia-cuda-doc:all 9.1.85-3ubuntu1
    2022-06-24 11:45:38 status unpacked nvidia-cuda-doc:all 9.1.85-3ubuntu1
    2022-06-24 11:45:38 install nvidia-cuda-gdb:amd64 <none> 9.1.85-3ubuntu1
    2022-06-24 11:45:38 status half-installed nvidia-cuda-gdb:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:38 status unpacked nvidia-cuda-gdb:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:38 status unpacked nvidia-cuda-gdb:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:38 install nvidia-profiler:amd64 <none> 9.1.85-3ubuntu1
    2022-06-24 11:45:38 status half-installed nvidia-profiler:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:39 status unpacked nvidia-profiler:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:39 status unpacked nvidia-profiler:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:39 install nvidia-opencl-dev:amd64 <none> 9.1.85-3ubuntu1
    2022-06-24 11:45:39 status half-installed nvidia-opencl-dev:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:39 status unpacked nvidia-opencl-dev:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:39 status unpacked nvidia-opencl-dev:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:39 install nvidia-cuda-toolkit:amd64 <none> 9.1.85-3ubuntu1
    2022-06-24 11:45:39 status half-installed nvidia-cuda-toolkit:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:40 status unpacked nvidia-cuda-toolkit:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:40 status unpacked nvidia-cuda-toolkit:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:40 install nvidia-visual-profiler:amd64 <none> 9.1.85-3ubuntu1
    2022-06-24 11:45:40 status half-installed nvidia-visual-profiler:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:46 status unpacked nvidia-visual-profiler:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:46 status unpacked nvidia-visual-profiler:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:46 configure nvidia-cuda-doc:all 9.1.85-3ubuntu1 <none>
    2022-06-24 11:45:46 status unpacked nvidia-cuda-doc:all 9.1.85-3ubuntu1
    2022-06-24 11:45:46 status half-configured nvidia-cuda-doc:all 9.1.85-3ubuntu1
    2022-06-24 11:45:46 status installed nvidia-cuda-doc:all 9.1.85-3ubuntu1
    2022-06-24 11:45:47 configure libnvidia-compute-515:amd64 515.48.07-0ubuntu0.18.04.1 <none>
    2022-06-24 11:45:47 status unpacked libnvidia-compute-515:amd64 515.48.07-0ubuntu0.18.04.1
    2022-06-24 11:45:47 status unpacked libnvidia-compute-515:amd64 515.48.07-0ubuntu0.18.04.1
    2022-06-24 11:45:47 status half-configured libnvidia-compute-515:amd64 515.48.07-0ubuntu0.18.04.1
    2022-06-24 11:45:47 status installed libnvidia-compute-515:amd64 515.48.07-0ubuntu0.18.04.1
    2022-06-24 11:45:47 configure nvidia-opencl-dev:amd64 9.1.85-3ubuntu1 <none>
    2022-06-24 11:45:47 status unpacked nvidia-opencl-dev:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:47 status half-configured nvidia-opencl-dev:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:47 status installed nvidia-opencl-dev:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:47 configure nvidia-cuda-gdb:amd64 9.1.85-3ubuntu1 <none>
    2022-06-24 11:45:47 status unpacked nvidia-cuda-gdb:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:47 status half-configured nvidia-cuda-gdb:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:47 status installed nvidia-cuda-gdb:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:47 configure nvidia-profiler:amd64 9.1.85-3ubuntu1 <none>
    2022-06-24 11:45:47 status unpacked nvidia-profiler:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:47 status half-configured nvidia-profiler:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:47 status installed nvidia-profiler:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:47 configure nvidia-visual-profiler:amd64 9.1.85-3ubuntu1 <none>
    2022-06-24 11:45:47 status unpacked nvidia-visual-profiler:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:47 status half-configured nvidia-visual-profiler:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:48 status installed nvidia-visual-profiler:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:48 configure nvidia-cuda-dev:amd64 9.1.85-3ubuntu1 <none>
    2022-06-24 11:45:48 status unpacked nvidia-cuda-dev:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:48 status half-configured nvidia-cuda-dev:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:48 status installed nvidia-cuda-dev:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:48 configure nvidia-cuda-toolkit:amd64 9.1.85-3ubuntu1 <none>
    2022-06-24 11:45:48 status unpacked nvidia-cuda-toolkit:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:48 status unpacked nvidia-cuda-toolkit:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:48 status half-configured nvidia-cuda-toolkit:amd64 9.1.85-3ubuntu1
    2022-06-24 11:45:48 status installed nvidia-cuda-toolkit:amd64 9.1.85-3ubuntu1

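            The log shows that libnvidia-compute-515, version 515.48.07 (the Ubuntu 18.04 build), was installed on 2022-06-24, apparently in the same apt run as nvidia-cuda-toolkit.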
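The raw log is noisy because every package goes through several status lines. To see only which NVIDIA packages were installed and at what version, you can keep just the "install" records. A minimal sketch, assuming the dpkg.log field layout shown above (date, time, action, package, old version, new version); the function name nvidia_installs is made up for illustration.

```shell
# Hypothetical helper: read dpkg.log lines on stdin and print
# "<package> <new version>" for every install action on an nvidia package.
nvidia_installs() {
    awk '$3 == "install" && $4 ~ /nvidia/ { print $4, $6 }'
}

# Typical use on the affected machine:
#   nvidia_installs < /var/log/dpkg.log
```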

    3. List the installed NVIDIA packages:

    dpkg --list | grep -i nvidia

    Result:

    ii libnvidia-compute-460-server:amd64 515.48.07-0ubuntu0.18.04.1 amd64 NVIDIA libcompute package
    ii libnvidia-container-tools 1.0.5-1 amd64 NVIDIA container runtime library (command-line tools)
    ii libnvidia-container1:amd64 1.0.5-1 amd64 NVIDIA container runtime library
    ii nvidia-container-runtime 3.1.4-1 amd64 NVIDIA container runtime
    ii nvidia-container-toolkit 1.0.5-1 amd64 NVIDIA container runtime hook
    ii nvidia-cuda-dev 9.1.85-3ubuntu1 amd64 NVIDIA CUDA development files
    ii nvidia-cuda-doc 9.1.85-3ubuntu1 all NVIDIA CUDA and OpenCL documentation
    • The system has the NVIDIA 515 driver libraries (the Ubuntu 18.04 build) installed.
    • The root cause is this mismatch: the loaded kernel module is 470.94, while the installed driver libraries are 515.48.07.
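The comparison above can be sketched as a small script. This is only an illustration under assumptions: the helper names nvidia_lib_version and check_nvidia_versions are hypothetical, and it assumes the library version can be read from the first installed libnvidia-compute* package in the dpkg listing, which is where the NVML library lives on the Ubuntu packages I have seen.

```shell
# Hypothetical helper: read `dpkg --list` output on stdin and print the
# upstream version (text before the first "-") of the first installed
# libnvidia-compute* package.
nvidia_lib_version() {
    awk '$1 == "ii" && $2 ~ /^libnvidia-compute/ { sub(/-.*/, "", $3); print $3; exit }'
}

# Hypothetical helper: $1 is the kernel module version, $2 the library version.
check_nvidia_versions() {
    if [ "$1" = "$2" ]; then
        echo "match: $1"
    else
        echo "mismatch: kernel module $1, user-space library $2"
    fi
}

# Typical use, with 470.94 taken from /proc/driver/nvidia/version:
#   check_nvidia_versions 470.94 "$(dpkg --list | nvidia_lib_version)"
```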

    Solution:

    • Uninstall the existing driver and reinstall it

    Uninstall the driver:

    sudo /usr/bin/nvidia-uninstall
    sudo apt-get --purge remove 'nvidia-*'
    sudo apt-get purge 'nvidia*'
    sudo apt-get purge 'libnvidia*'

    Afterwards, the following command should print nothing:

    dpkg --list | grep -i nvidia

    Reinstall:

    sudo /data/disk-2T/xxxx/softwares/cuda/NVIDIA-Linux-x86_64-470.94.run

    Once the installation finishes, run nvidia-smi to check the result.

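    Problem 3: installation fails because of a gcc version mismatch

    If your gcc is too old (ideally it should be newer than 7.3), that explains why the earlier command sudo dkms install -m nvidia -v 470.94 failed. Check the current gcc version: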

    gcc --version
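To turn the "newer than 7.3" rule of thumb into a scripted check, you can compare dotted version strings field by field. A minimal sketch; the function name version_ge is hypothetical, and the 7.3 threshold is simply the value suggested in this post, not an official requirement.

```shell
# Hypothetical helper: succeed when dotted version $1 >= dotted version $2,
# comparing numerically field by field (missing fields count as 0).
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, "."); m = split(b, y, ".")
        len = (n > m) ? n : m
        for (i = 1; i <= len; i++) {
            xi = (i <= n) ? x[i] + 0 : 0
            yi = (i <= m) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# Typical use on the affected machine:
#   cur=$(gcc -dumpfullversion 2>/dev/null || gcc -dumpversion)
#   version_ge "$cur" 7.3 || echo "gcc $cur is older than 7.3"
```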

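    The gcc binaries live in /usr/bin. List all installed versions with:

    ls /usr/bin/gcc*
    ls /usr/bin/g++*

    Register the versions you found as gcc alternatives. The trailing number is the priority:

    sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-7 20 --slave /usr/bin/g++ g++ /usr/bin/g++-7
    sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 10 --slave /usr/bin/g++ g++ /usr/bin/g++-9

    After that, you can switch between gcc and g++ versions with:

    sudo update-alternatives --config gcc

    Here I chose gcc-7. Entering either 0 or 1 works: 0 is auto mode, which picks the highest priority (gcc-7 at 20), and 1 is presumably the manual entry for gcc-7. Done!

    Reboot the machine and run nvidia-smi; the driver now connects successfully.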

    Those are the three problems I ran into. If this helped you solve yours, a like would be appreciated!

  • Original article: https://blog.csdn.net/junjunzai123/article/details/125445430