• Installing the GPU driver, CUDA, cuDNN, and TensorFlow on a GPU server


    Compatible system and version combinations

    1. CentOS 7.2, CUDA 9.0, cuDNN 7.4
    2. CentOS 7.5, CUDA 9.2, cuDNN 7.4

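    Install gcc

    yum -y install gcc gcc-c++ kernel-devel

    The kernel-devel version must match the running kernel (check with uname -r), because the driver installer builds a kernel module against it. For details, see the "Package Manager Overview" section of the NVIDIA CUDA Installation Guide for Linux:
    https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#package-manager-overview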

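    1. Install the GPU driver

    Check the NVIDIA GPU information. nvidia-smi is available only once a driver is installed; before that, lspci | grep -i nvidia lists the cards.

    # nvidia-smi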

    2. Install the NVIDIA detection tool

    2.1 Add the ELRepo repository

    # rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
    # rpm -Uvh https://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm

    2.2 Install the driver detection tool

    yum install nvidia-detect

    2.3 Run it

    # nvidia-detect -v
    Probing for supported NVIDIA devices...
    [10de:15f8] NVIDIA Corporation Device 15f8
    This device requires the current 410.78 NVIDIA driver kmod-nvidia
    [10de:15f8] NVIDIA Corporation Device 15f8
    This device requires the current 410.78 NVIDIA driver kmod-nvidia
    [102b:0538] Matrox Electronics Systems Ltd. Device 0538
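
    If you want to pick up the required driver version in a script rather than by eye, the output above can be parsed. The following is only a sketch: the function name required_drivers is made up for illustration, and it assumes the output keeps the two-line shape shown above (a PCI ID line followed by a "requires the current ... NVIDIA driver" line).

```python
import re

def required_drivers(detect_output):
    """Return (pci_id, driver_version) pairs found in `nvidia-detect -v` output.

    Assumes each supported device is printed as an ID line such as
    "[10de:15f8] ..." followed by a "requires the current X NVIDIA driver" line.
    """
    results = []
    current_id = None
    for line in detect_output.splitlines():
        line = line.strip()
        id_match = re.match(r"\[([0-9a-f]{4}:[0-9a-f]{4})\]", line, re.I)
        if id_match:
            current_id = id_match.group(1)
            continue
        ver_match = re.search(r"requires the current (\S+) NVIDIA driver", line)
        if ver_match and current_id is not None:
            results.append((current_id, ver_match.group(1)))
            current_id = None
    return results

# The sample text is the output shown in step 2.3.
sample = """\
Probing for supported NVIDIA devices...
[10de:15f8] NVIDIA Corporation Device 15f8
This device requires the current 410.78 NVIDIA driver kmod-nvidia
[10de:15f8] NVIDIA Corporation Device 15f8
This device requires the current 410.78 NVIDIA driver kmod-nvidia
[102b:0538] Matrox Electronics Systems Ltd. Device 0538
"""
print(required_drivers(sample))
```

    Devices without a driver line, such as the Matrox adapter here, are skipped.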


    2.4 Edit the grub file

    vim /etc/default/grub

    Add the following to GRUB_CMDLINE_LINUX:

    rd.driver.blacklist=nouveau nouveau.modeset=0

    After the change, the file looks like this:

    GRUB_TIMEOUT=5
    GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
    GRUB_DEFAULT=saved
    GRUB_DISABLE_SUBMENU=true
    GRUB_TERMINAL_OUTPUT="console"
    GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rd.driver.blacklist=nouveau nouveau.modeset=0 rhgb quiet"
    GRUB_DISABLE_RECOVERY="true"
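
    When several servers need the same change, editing the file by hand gets tedious. Below is a rough sketch of doing the edit in code. The helper name add_kernel_args is hypothetical, and it assumes GRUB_CMDLINE_LINUX sits on a single double-quoted line as in the file above. It appends the arguments at the end of the line, which differs from the placement shown above but is equivalent to the kernel.

```python
import re

# Arguments from step 2.4 that keep the nouveau driver from loading.
NOUVEAU_ARGS = ("rd.driver.blacklist=nouveau", "nouveau.modeset=0")

def add_kernel_args(grub_text, new_args=NOUVEAU_ARGS):
    """Return grub_text with new_args appended to GRUB_CMDLINE_LINUX.

    Arguments that are already present are not added again, so the
    function can be applied repeatedly without changing the result.
    """
    out = []
    for line in grub_text.splitlines():
        match = re.match(r'^(GRUB_CMDLINE_LINUX=")(.*)(")\s*$', line)
        if match:
            args = match.group(2).split()
            args += [arg for arg in new_args if arg not in args]
            line = match.group(1) + " ".join(args) + match.group(3)
        out.append(line)
    return "\n".join(out) + "\n"

# A shortened, made-up example of /etc/default/grub before the edit.
before = 'GRUB_TIMEOUT=5\nGRUB_CMDLINE_LINUX="crashkernel=auto rhgb quiet"\n'
after = add_kernel_args(before)
print(after)
```

    To use it for real, read /etc/default/grub, pass its contents through the function, and write the result back after keeping a backup copy.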

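    Then regenerate the grub configuration (on UEFI systems the output file is /boot/efi/EFI/centos/grub.cfg instead):

    grub2-mkconfig -o /boot/grub2/grub.cfg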

    2.5 Create the blacklist

    vim /etc/modprobe.d/blacklist.conf

    Add:

    blacklist nouveau

    2.6 Rebuild the initramfs

    Back up the current image, then build a new one that no longer includes nouveau:

    mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r)-nouveau.img
    dracut /boot/initramfs-$(uname -r).img $(uname -r)

    2.7 Reboot

    reboot

    2.8 Confirm that nouveau is disabled

    lsmod | grep nouveau

    No output means nouveau was disabled successfully.
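
    The same check can be done from a provisioning script that needs a yes/no answer. This is a minimal sketch: module_loaded is an illustrative name, and the sample text is a made-up lsmod listing, not output from the server in this article.

```python
def module_loaded(lsmod_output, name="nouveau"):
    """Return True if `name` appears as a loaded module in lsmod-style text.

    Only the first column (the module name) is compared, so the header
    line and the "Used by" column are ignored.
    """
    for line in lsmod_output.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False

# Made-up listings for illustration only.
with_nouveau = (
    "Module                  Size  Used by\n"
    "nouveau              1622010  0\n"
    "drm_kms_helper        176920  1 nouveau\n"
)
without_nouveau = (
    "Module                  Size  Used by\n"
    "nvidia              16576512  0\n"
)
print(module_loaded(with_nouveau))
print(module_loaded(without_nouveau))
```

    /proc/modules also starts each line with the module name, so its contents can be passed to the same function.
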

    3. Install CUDA

    Download CUDA from:

    https://developer.nvidia.com/cuda-toolkit

    Then run the installer:

    # sh cuda_9.0.176_384.81_linux.run

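    If the installer reports "you appear to be running an x server please exit x before installing", run init 3 to switch to text mode, which stops the X server, and then run the install command again.

    When the installation finishes, the installer prints a summary: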

    ===========
    = Summary =
    ===========
    Driver:   Installed
    Toolkit:  Installed in /usr/local/cuda-9.0
    Samples:  Installed in /root, but missing recommended libraries
    Please make sure that
     -   PATH includes /usr/local/cuda-9.0/bin
     -   LD_LIBRARY_PATH includes /usr/local/cuda-9.0/lib64, or, add /usr/local/cuda-9.0/lib64 to /etc/ld.so.conf and run ldconfig as root
    To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-9.0/bin
    To uninstall the NVIDIA Driver, run nvidia-uninstall
    Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-9.0/doc/pdf for detailed information on setting up CUDA.
    Logfile is /tmp/cuda_install_7874.log
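
    The summary asks for two environment variables to be set. As a sketch of how one might check them, the snippet below reports which of the two entries are missing and prints the export lines that would add them. The function names and the default cuda_home are assumptions chosen to match the CUDA 9.0 paths in the summary.

```python
import os

def missing_cuda_entries(environ, cuda_home="/usr/local/cuda-9.0"):
    """Return {variable: directory} for CUDA directories absent from environ."""
    wanted = {
        "PATH": cuda_home + "/bin",
        "LD_LIBRARY_PATH": cuda_home + "/lib64",
    }
    missing = {}
    for var, directory in wanted.items():
        # Split on ':' and ignore empty entries and trailing slashes.
        entries = [e.rstrip("/") for e in environ.get(var, "").split(":") if e]
        if directory not in entries:
            missing[var] = directory
    return missing

def export_lines(missing):
    """Build shell export lines that prepend each missing directory.

    ${VAR:+:$VAR} expands to ":$VAR" only when VAR is already set, which
    avoids leaving a stray trailing colon in the new value.
    """
    lines = []
    for var, directory in sorted(missing.items()):
        lines.append("export " + var + "=" + directory + "${" + var + ":+:$" + var + "}")
    return lines

for line in export_lines(missing_cuda_entries(os.environ)):
    print(line)
```

    The printed lines can be appended to a shell profile such as ~/.bashrc so that nvcc and the CUDA libraries are found in new sessions.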

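    Verify that CUDA 9.0 was installed successfully. With /usr/local/cuda-9.0/bin on PATH, enter the following in a terminal:

    nvcc -V

    This prints the CUDA version information.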

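    Next, try running one of the samples that ship with CUDA:

    cd /usr/local/cuda-9.0/samples/1_Utilities/deviceQuery
    make
    ./deviceQuery

    A successful run ends with output like the following. The version numbers reflect the installed toolkit; this capture reads 10.0, which matches the CUDA 10.0 installation recorded in the appendix rather than the 9.0 installation above.

    deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 2
    Result = PASS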

    Uninstalling

    To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-9.0/bin
    To uninstall the NVIDIA Driver, run nvidia-uninstall

    4. Install cuDNN v7

    https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html

    After the download finishes, extract the archive and copy its files into the CUDA directory by running the following commands in order:

    tar -xzvf cudnn-9.0-linux-x64-v7.4.1.5.tgz
    sudo cp cuda/include/cudnn.h /usr/local/cuda/include
    sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
    sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
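
    A quick way to confirm which cuDNN version was copied is to read the version macros from cudnn.h. The sketch below assumes the cuDNN 7.x header layout, where CUDNN_MAJOR, CUDNN_MINOR, and CUDNN_PATCHLEVEL are defined in cudnn.h itself. The function name cudnn_version is illustrative, and the sample text is a hand-written excerpt in the shape of that header, not a copy of the real file.

```python
import re

def cudnn_version(header_text):
    """Return (major, minor, patch) parsed from cudnn.h text, or None."""
    parts = []
    for key in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(r"^\s*#\s*define\s+" + key + r"\s+(\d+)", header_text, re.M)
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)

# Hand-written excerpt for illustration, matching the 7.4.1 archive above.
sample_header = """\
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 4
#define CUDNN_PATCHLEVEL 1
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
"""
print(cudnn_version(sample_header))
```

    On the server, pass in the contents of /usr/local/cuda/include/cudnn.h instead of the sample text.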


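    Run a small demo to verify the installation.

    If you also installed the "code samples and user guide" package, the mnistCUDNN example is located under /usr/src/cudnn_samples_v7.
    Copy the samples to any directory you can write to, for example your home directory:

    $ cp -r /usr/src/cudnn_samples_v7/ $HOME

    Enter mnistCUDNN:

    $ cd $HOME/cudnn_samples_v7/mnistCUDNN

    Compile:

    $ make clean && make

    Run:

    $ ./mnistCUDNN

    If the installation succeeded, you will see this result:

    Test passed!

    Running cmake in your caffe/build directory is another quick way to check whether the installation succeeded.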

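    5. Install the GPU build of TensorFlow (configure a pip mirror first)

    As root, create a .pip directory in root's home directory and, inside it, the file pip.conf (/root/.pip/pip.conf) with the following content. This uses the Tsinghua mirror, which is quite fast:

    [global]
    index-url=https://pypi.tuna.tsinghua.edu.cn/simple

    No further setup is needed after this; pip install now fetches packages from the mirror.

    $ sudo pip install tensorflow-gpu

    An unpinned install pulls the newest release, which may target a different CUDA version. The prebuilt TensorFlow 1.5 through 1.12 packages were built against CUDA 9.0, so with the CUDA 9.0 setup above you can pin a matching release, for example tensorflow-gpu==1.12.0.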
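
    The pip.conf from this section can also be written from a script, which helps when setting up more than one machine. This is a sketch only: write_pip_conf and read_index_url are made-up helper names, and the demo writes into a temporary directory rather than /root/.pip so that it is safe to run anywhere.

```python
import os
import tempfile

try:
    import configparser                   # Python 3
except ImportError:
    import ConfigParser as configparser   # Python 2, as used in this article

MIRROR = "https://pypi.tuna.tsinghua.edu.cn/simple"

def write_pip_conf(path, index_url):
    """Write a pip.conf at `path` whose [global] section sets index-url."""
    parser = configparser.RawConfigParser()
    parser.add_section("global")
    parser.set("global", "index-url", index_url)
    directory = os.path.dirname(path)
    if directory and not os.path.isdir(directory):
        os.makedirs(directory)
    with open(path, "w") as handle:
        parser.write(handle)

def read_index_url(path):
    """Read the index-url back from a pip.conf file."""
    parser = configparser.RawConfigParser()
    parser.read(path)
    return parser.get("global", "index-url")

# Demo in a temporary directory; on the server the path would be /root/.pip/pip.conf.
demo_path = os.path.join(tempfile.mkdtemp(), ".pip", "pip.conf")
write_pip_conf(demo_path, MIRROR)
print(read_index_url(demo_path))
```

    The parser writes the option as "index-url = ..." with spaces around the equals sign, which pip reads the same way as the form shown above.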

    6. Test TensorFlow

    After all the bumps along the way, we have finally reached the testing step.

    [root@gpuserver ~]# python
    Python 2.7.5 (default, Nov 20 2015, 02:00:19)
    [GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import tensorflow as tf
    >>> hello = tf.constant('Hello, TensorFlow!')
    >>> sess = tf.Session()
    2018-12-12 17:10:51.572488: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
    >>> sess = tf.Session()
    >>> print(sess.run(hello))
    Hello, TensorFlow!
    >>>
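
    The hello-world session above would also run on a CPU-only build, so it does not show on its own that TensorFlow can see the GPUs. One way to check is to list the devices TensorFlow reports. The sketch below uses device_lib.list_local_devices from tensorflow.python.client; the wrapper name visible_gpus is made up, and it returns None when TensorFlow cannot be imported so that the snippet can run on any machine.

```python
def visible_gpus():
    """Return the names of GPU devices TensorFlow reports, or None.

    None means TensorFlow could not be imported at all; an empty list
    means it imported but found no GPU device.
    """
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return None
    devices = device_lib.list_local_devices()
    return [d.name for d in devices if d.device_type == "GPU"]

gpus = visible_gpus()
if gpus is None:
    print("TensorFlow is not installed in this Python environment")
elif not gpus:
    print("TensorFlow imported, but no GPU device is visible")
else:
    print("GPU devices: " + ", ".join(gpus))
```

    An empty list usually points back to the driver, CUDA, or cuDNN steps earlier in this article, or to a TensorFlow build that does not match the installed CUDA version.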


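    If this small example runs correctly, the GPU build of TensorFlow is installed and you can start building with it. To confirm that the GPU is actually in use, look for lines that name the GPU device in the log printed when the session is created.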

    Installing pip on CentOS 7.2

    yum install -y epel-release
    yum install -y python-pip

    Installing kernel-devel

    yum -y install kernel-devel

    Configuring CentOS 7.2 to boot into the graphical interface

    # systemctl get-default
    multi-user.target
    # systemctl set-default graphical.target


    Appendix:
    1. Log of a CUDA installation

    Installing the NVIDIA display driver...
    Installing the CUDA Toolkit in /usr/local/cuda-10.0 ...
    Missing recommended library: libGLU.so
    Missing recommended library: libX11.so
    Missing recommended library: libXi.so
    Missing recommended library: libXmu.so
    Installing the CUDA Samples in /root ...
    Copying samples to /root/NVIDIA_CUDA-10.0_Samples now...
    Finished copying samples.
    ===========
    = Summary =
    ===========
    Driver:   Installed
    Toolkit:  Installed in /usr/local/cuda-10.0
    Samples:  Installed in /root, but missing recommended libraries
    Please make sure that
     -   PATH includes /usr/local/cuda-10.0/bin
     -   LD_LIBRARY_PATH includes /usr/local/cuda-10.0/lib64, or, add /usr/local/cuda-10.0/lib64 to /etc/ld.so.conf and run ldconfig as root
    To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-10.0/bin
    To uninstall the NVIDIA Driver, run nvidia-uninstall
    Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-10.0/doc/pdf for detailed information on setting up CUDA.
    Logfile is /tmp/cuda_install_16878.log


     

  • Original article: https://blog.csdn.net/LG_15011399296/article/details/133737916