• NodeManager fails to restart because of corrupted shuffle files


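    Symptom

    On a CDH cluster in production, the NodeManager on one data node lost contact and could not be restarted. An attempt to restart the entire YARN service then left every NodeManager disconnected and unavailable.

    Root cause

    The first NodeManager (NM) lost contact because it ran out of heap memory, which killed the NM process: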

    JAVA_HOME=/usr/java/jdk1.8.0_181
    using /usr/java/jdk1.8.0_181 as JAVA_HOME
    using 6 as CDH_VERSION
    using /opt/cloudera/parcels/CDH-6.2.1-1.cdh6.2.1.p0.1425774/lib/hadoop-yarn as CDH_YARN_HOME
    using /opt/cloudera/parcels/CDH-6.2.1-1.cdh6.2.1.p0.1425774/lib/hadoop-mapreduce as CDH_MR2_HOME
    using /var/run/cloudera-scm-agent/process/5742-yarn-NODEMANAGER as CONF_DIR
    CONF_DIR=/var/run/cloudera-scm-agent/process/5742-yarn-NODEMANAGER
    CMF_CONF_DIR=
    java.lang.OutOfMemoryError: Java heap space
    Dumping heap to /tmp/yarn_yarn-NODEMANAGER-cb40d8036dd59b6cc2f0d7748f7561ce_pid440050.hprof ...
    Heap dump file created [3141289255 bytes in 18.481 secs]
    #
    # java.lang.OutOfMemoryError: Java heap space
    # -XX:OnOutOfMemoryError="/opt/cloudera/cm-agent/service/common/killparent.sh"
    #   Executing /bin/sh -c "/opt/cloudera/cm-agent/service/common/killparent.sh"...
    2022年 08月 10日 星期三 08:56:18 CST
    JAVA_HOME=/usr/java/jdk1.8.0_181
    using /usr/java/jdk1.8.0_181 as JAVA_HOME
    using 6 as CDH_VERSION
    using /opt/cloudera/parcels/CDH-6.2.1-1.cdh6.2.1.p0.1425774/lib/hadoop-yarn as CDH_YARN_HOME
    using /opt/cloudera/parcels/CDH-6.2.1-1.cdh6.2.1.p0.1425774/lib/hadoop-mapreduce as CDH_MR2_HOME
    using /var/run/cloudera-scm-agent/process/5742-yarn-NODEMANAGER as CONF_DIR
    CONF_DIR=/var/run/cloudera-scm-agent/process/5742-yarn-NODEMANAGER
    CMF_CONF_DIR=
    java.lang.OutOfMemoryError: Java heap space
    Dumping heap to /tmp/yarn_yarn-NODEMANAGER-cb40d8036dd59b6cc2f0d7748f7561ce_pid450025.hprof ...
    Heap dump file created [3141287223 bytes in 17.360 secs]
    
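    Each failed start in the log above wrote a heap dump of about 3 GB to /tmp, so repeated restart attempts can use up disk space quickly. A minimal sketch for listing those dumps follows; the function name list_nm_heap_dumps is hypothetical, and the file-name pattern is inferred from the two dump paths in the log.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: list NodeManager heap dumps left in a directory.
# The default directory (/tmp) and the name pattern are taken from the log above.
list_nm_heap_dumps() (
    dump_dir="${1:-/tmp}"
    find "$dump_dir" -maxdepth 1 -type f -name '*NODEMANAGER*.hprof' | sort
)
```

    Calling list_nm_heap_dumps with no argument scans /tmp. Check that a dump is no longer needed for analysis before removing it.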

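    Normally, restarting the NM at this point is enough. So why did the restart fail this time?
    After consulting Cloudera support, we set the NM log level to DEBUG and found that the NodeManager was failing during the recovery step at startup. The likely culprit was the shuffle files.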

    Solution

    Delete all shuffle state files, together with the NM recovery state, on the data node:

    rm -rf /var/lib/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/
    rm -rf /var/lib/hadoop-yarn/yarn-nm-recovery/nm-aux-services/spark_shuffle/registeredExecutors.ldb/
    rm -rf /var/lib/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/*
    

    The restart succeeded!
    Repeating the same steps on every other NM node brought all NodeManagers back to normal.
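    A more cautious variant of the cleanup above is to move the recovery state aside rather than delete it, so it can still be inspected later. The following is only a sketch: the function name quarantine_nm_recovery and the backup directory naming are hypothetical, it assumes the NodeManager on the host is already stopped, and it assumes the MapReduce shuffle state lives under nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: move NM recovery state into a timestamped backup
# directory instead of deleting it. Assumes the NodeManager is stopped.
quarantine_nm_recovery() (
    recovery_root="${1:-/var/lib/hadoop-yarn/yarn-nm-recovery}"
    backup_dir="${recovery_root}.bak.$(date +%Y%m%d%H%M%S)"

    # Shuffle state databases: move each directory as a whole.
    for rel in \
        nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state \
        nm-aux-services/spark_shuffle/registeredExecutors.ldb; do
        [ -e "$recovery_root/$rel" ] || continue
        mkdir -p "$backup_dir/$(dirname "$rel")"
        mv "$recovery_root/$rel" "$backup_dir/$rel"
    done

    # NM state store: keep the directory, move only its contents
    # (mirrors "rm -rf .../yarn-nm-state/*", which also skips dotfiles).
    mkdir -p "$backup_dir/yarn-nm-state"
    for entry in "$recovery_root"/yarn-nm-state/*; do
        [ -e "$entry" ] || continue
        mv "$entry" "$backup_dir/yarn-nm-state/"
    done

    # Print the backup location so the caller can inspect or remove it later.
    echo "$backup_dir"
)
```

    Run without arguments, quarantine_nm_recovery uses the recovery directory from this article and prints the backup path. Once the NM has restarted and is running normally, the backup directory can be deleted.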

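    Analysis and summary

    Recovery runs only when the NM starts, so damaged recovery files go unnoticed while the NM keeps running normally. As soon as the NM is restarted, recovery hits the problem and the restart fails.

    PS: When you run into NM problems on YARN, these locations are worth checking for logs: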
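    To confirm whether NM recovery is enabled on a node and where its state is kept, the Hadoop properties yarn.nodemanager.recovery.enabled and yarn.nodemanager.recovery.dir can be read from the NM's yarn-site.xml. The helper below is a sketch with a hypothetical name, get_hadoop_property; it assumes each value element sits on the same line as its name element or on the next line, and it is not a general XML parser.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: print the value of one property from a Hadoop XML
# config file. Assumes <value> is on the same line as <name> or on the next.
get_hadoop_property() (
    conf_file="$1"
    prop="$2"
    grep -F -A1 "<name>$prop</name>" "$conf_file" \
        | sed -n 's:.*<value>\(.*\)</value>.*:\1:p' \
        | head -n 1
)
```

    For example, assuming yarn-site.xml sits in the CONF_DIR shown in the log: get_hadoop_property /var/run/cloudera-scm-agent/process/5742-yarn-NODEMANAGER/yarn-site.xml yarn.nodemanager.recovery.dir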

    • /var/run/cloudera-scm-agent/process/*-yarn-NODEMANAGER/logs
    • /var/log/hadoop-yarn
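    Both locations can be searched in one pass. The sketch below uses a hypothetical helper, scan_nm_logs, that lists the files containing a given pattern; the pattern in the usage note is the OutOfMemoryError string from the log earlier in this article.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: list files under the given directories whose content
# matches an extended regular expression. Missing directories are skipped.
scan_nm_logs() (
    pattern="$1"
    shift
    for dir in "$@"; do
        [ -d "$dir" ] || continue
        # grep exits non-zero when nothing matches; that is not an error here.
        grep -rlE -- "$pattern" "$dir" || true
    done
)
```

    Usage: scan_nm_logs 'OutOfMemoryError' /var/log/hadoop-yarn /var/run/cloudera-scm-agent/process/*-yarn-NODEMANAGER/logs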
  • Original article: https://blog.csdn.net/magicchu/article/details/126598734