• Issue log (unresolved) | A disaster triggered by apt install nvidia-cuda-toolkit


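    While tinkering with the environment, I followed advice found online and ran sudo apt install nvidia-cuda-toolkit. Afterwards, the nvidia-smi command on machine 28 stopped working entirely: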

    # nvidia-smi
    Failed to initialize NVML: Driver/library version mismatch
    

    CUDA is no longer detected correctly either:

    # python 
    Python 3.8.5 (default, Sep  4 2020, 07:30:14) 
    [GCC 7.3.0] :: Anaconda, Inc. on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import torch
    >>> torch.cuda.is_available()
    /root/anaconda3/lib/python3.8/site-packages/torch/cuda/__init__.py:83: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:109.)
      return torch._C._cuda_getDeviceCount() > 0
    False
    

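    Following the blog post listed under References, I checked the installed NVIDIA packages and the loaded kernel module: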

    # dpkg --list | grep nvidia
    iU  libnvidia-cfg1-525:amd64         525.147.05-0ubuntu0~gpu18.04.1      amd64        NVIDIA binary OpenGL/GLX configuration library
    iU  libnvidia-common-525             525.147.05-0ubuntu0~gpu18.04.1      all          Shared files used by the NVIDIA libraries
    iU  libnvidia-compute-525:amd64      525.147.05-0ubuntu0~gpu18.04.1      amd64        NVIDIA libcompute package
    iU  libnvidia-decode-525:amd64       525.147.05-0ubuntu0~gpu18.04.1      amd64        NVIDIA Video Decoding runtime libraries
    iU  libnvidia-encode-525:amd64       525.147.05-0ubuntu0~gpu18.04.1      amd64        NVENC Video Encoding runtime library
    iU  libnvidia-extra-525:amd64        525.147.05-0ubuntu0~gpu18.04.1      amd64        Extra libraries for the NVIDIA driver
    iU  libnvidia-fbc1-525:amd64         525.147.05-0ubuntu0~gpu18.04.1      amd64        NVIDIA OpenGL-based Framebuffer Capture runtime library
    iU  libnvidia-gl-525:amd64           525.147.05-0ubuntu0~gpu18.04.1      amd64        NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
    iU  nvidia-dkms-525                  525.147.05-0ubuntu0~gpu18.04.1      amd64        NVIDIA DKMS package
    iU  nvidia-driver-510                525.147.05-0ubuntu0~gpu18.04.1      amd64        Transitional package for nvidia-driver-525
    iU  nvidia-driver-525                525.147.05-0ubuntu0~gpu18.04.1      amd64        NVIDIA driver metapackage
    iU  nvidia-kernel-common-525         525.147.05-0ubuntu0~gpu18.04.1      amd64        Shared files used with the kernel module
    iU  nvidia-kernel-source-525         525.147.05-0ubuntu0~gpu18.04.1      amd64        NVIDIA kernel source package
    iU  nvidia-prime                     0.8.16~0.18.04.1                    all          Tools to enable NVIDIA's Prime
    iU  nvidia-settings                  470.57.01-0ubuntu0.18.04.1          amd64        Tool for configuring the NVIDIA graphics driver
    iU  xserver-xorg-video-nvidia-525    525.147.05-0ubuntu0~gpu18.04.1      amd64        NVIDIA binary Xorg driver
    # cat /proc/driver/nvidia/version
    NVRM version: NVIDIA UNIX x86_64 Kernel Module  510.47.03  Mon Jan 24 22:58:54 UTC 2022
    GCC version:  gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
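
    The two outputs above can also be compared programmatically. The following is a minimal sketch, not part of the original troubleshooting: the function names are hypothetical, the sample strings are copied from the output above, and it assumes that only packages whose names end in a driver branch number (such as -525) are expected to match the kernel module version.

```python
import re

# Sample text copied from the command output shown in this post.
PROC_SAMPLE = (
    "NVRM version: NVIDIA UNIX x86_64 Kernel Module  510.47.03  "
    "Mon Jan 24 22:58:54 UTC 2022"
)
DPKG_SAMPLE = """\
iU  libnvidia-compute-525:amd64      525.147.05-0ubuntu0~gpu18.04.1      amd64        NVIDIA libcompute package
iU  nvidia-driver-510                525.147.05-0ubuntu0~gpu18.04.1      amd64        Transitional package for nvidia-driver-525
iU  nvidia-prime                     0.8.16~0.18.04.1                    all          Tools to enable NVIDIA's Prime
"""


def kernel_module_version(proc_text):
    """Return the version found in /proc/driver/nvidia/version text, or None."""
    match = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", proc_text)
    return match.group(1) if match else None


def package_versions(dpkg_text):
    """Map package name -> upstream version for rows that mention nvidia."""
    versions = {}
    for line in dpkg_text.splitlines():
        fields = line.split()
        # Rows look like: status, name, version, architecture, description...
        if len(fields) < 3 or "nvidia" not in fields[1]:
            continue
        name = fields[1].split(":")[0]      # drop a suffix such as ":amd64"
        upstream = fields[2].split("-")[0]  # drop the packaging revision
        versions[name] = upstream
    return versions


def mismatched(proc_text, dpkg_text):
    """Return branch-numbered packages whose version differs from the kernel module."""
    kernel = kernel_module_version(proc_text)
    return {
        name: version
        for name, version in package_versions(dpkg_text).items()
        if re.search(r"-\d+$", name) and version != kernel
    }


if __name__ == "__main__":
    print("kernel module:", kernel_module_version(PROC_SAMPLE))
    for name, version in mismatched(PROC_SAMPLE, DPKG_SAMPLE).items():
        print("mismatch:", name, version)
```

    On a real machine the two inputs would come from reading /proc/driver/nvidia/version and capturing the output of dpkg --list, rather than from the hard-coded samples.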
    

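    So the problem is a version mismatch between CUDA and the GPU driver: the installed NVIDIA packages are at 525.147.05, while the loaded kernel module is 510.47.03. The versions need to be unified at 510.47.03, using the following steps: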

    1. Uninstall the driver:
    sudo apt-get purge 'nvidia*'

    2. Add the graphics drivers PPA (Personal Package Archive, Ubuntu only):
    sudo add-apt-repository ppa:graphics-drivers
    sudo apt-get update

    3. Reinstall the driver:
    sudo apt-get install nvidia-driver-510 nvidia-settings nvidia-prime

    However, the installation keeps failing with the following error:

    Errors were encountered while processing:
     /tmp/apt-dpkg-install-T8KJGT/08-nvidia-compute-utils-525_525.147.05-0ubuntu0~gpu18.04.1_amd64.deb
     /tmp/apt-dpkg-install-T8KJGT/12-nvidia-utils-525_525.147.05-0ubuntu0~gpu18.04.1_amd64.deb
    E: Sub-process /usr/bin/dpkg returned an error code (1)
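
    Both archives named in the error belong to the 525 branch, even though nvidia-driver-510 was requested. The package listing earlier describes nvidia-driver-510 as a "Transitional package for nvidia-driver-525", which may be why 525 packages are still being pulled in. The sketch below does not fix anything; it only extracts the failing package names and versions from the apt output so they can be inspected one by one. The function name and the regular expression are illustrative assumptions, and the sample text is copied from the error above.

```python
import re

# Sample text copied from the apt error shown in this post.
APT_ERROR_SAMPLE = """\
Errors were encountered while processing:
 /tmp/apt-dpkg-install-T8KJGT/08-nvidia-compute-utils-525_525.147.05-0ubuntu0~gpu18.04.1_amd64.deb
 /tmp/apt-dpkg-install-T8KJGT/12-nvidia-utils-525_525.147.05-0ubuntu0~gpu18.04.1_amd64.deb
E: Sub-process /usr/bin/dpkg returned an error code (1)
"""

# Assumed archive naming: <optional NN- prefix><name>_<version>_<arch>.deb
DEB_PATH = re.compile(
    r"/(?:\d+-)?(?P<name>[^/_]+)_(?P<version>[^_]+)_(?P<arch>[^_.]+)\.deb$"
)


def failed_packages(apt_output):
    """Return (package name, version) pairs for each .deb path in the output."""
    failures = []
    for line in apt_output.splitlines():
        match = DEB_PATH.search(line.strip())
        if match:
            failures.append((match.group("name"), match.group("version")))
    return failures


if __name__ == "__main__":
    for name, version in failed_packages(APT_ERROR_SAMPLE):
        print(name, version)
```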
    

    I will update this post once the problem is solved.

    PS: If you know how to fix this, please share it in the comments!


    References

    [nvidia-smi error] Failed to initialize NVML: Driver/library version mismatch (CSDN blog)

  • Original post: https://blog.csdn.net/qq_36332660/article/details/134286026