• Full GC (Ergonomics)排查分析


    问题描述

    有一个线上Spark 任务频繁GC最后导致挂掉,下面是对它的排查分析

    GC日志

    {Heap before GC invocations=4312 (full 121):
     PSYoungGen      total 673792K, used 648704K [0x00000000d5580000, 0x0000000100000000, 0x0000000100000000)
      eden space 648704K, 100% used [0x00000000d5580000,0x00000000fcf00000,0x00000000fcf00000)
      from space 25088K, 0% used [0x00000000fcf00000,0x00000000fcf00000,0x00000000fe780000)
      to   space 25088K, 0% used [0x00000000fe780000,0x00000000fe780000,0x0000000100000000)
     ParOldGen       total 1398272K, used 1398167K [0x0000000080000000, 0x00000000d5580000, 0x00000000d5580000)
      object space 1398272K, 99% used [0x0000000080000000,0x00000000d5565cf0,0x00000000d5580000)
     Metaspace       used 133520K, capacity 143098K, committed 143508K, reserved 1175552K
      class space    used 16519K, capacity 17415K, committed 17596K, reserved 1048576K
    2022-07-27T00:10:42.519+0800: 18668.196: [Full GC (Ergonomics) [PSYoungGen: 648704K->57867K(673792K)] [ParOldGen: 1398167K->1397781K(1398272K)] 2046871K->1455648K(2072064K), [Metaspace: 133520K->133520K(1175552K)], 0.6053255 secs] [Times: user=14.25 sys=0.02, real=0.61 secs] 
    Heap after GC invocations=4312 (full 121):
     PSYoungGen      total 673792K, used 57867K [0x00000000d5580000, 0x0000000100000000, 0x0000000100000000)
      eden space 648704K, 8% used [0x00000000d5580000,0x00000000d8e02da0,0x00000000fcf00000)
      from space 25088K, 0% used [0x00000000fcf00000,0x00000000fcf00000,0x00000000fe780000)
      to   space 25088K, 0% used [0x00000000fe780000,0x00000000fe780000,0x0000000100000000)
     ParOldGen       total 1398272K, used 1397781K [0x0000000080000000, 0x00000000d5580000, 0x00000000d5580000)
      object space 1398272K, 99% used [0x0000000080000000,0x00000000d55054d0,0x00000000d5580000)
     Metaspace       used 133520K, capacity 143098K, committed 143508K, reserved 1175552K
      class space    used 16519K, capacity 17415K, committed 17596K, reserved 1048576K
    }
    {Heap before GC invocations=4313 (full 122):
     PSYoungGen      total 673792K, used 648704K [0x00000000d5580000, 0x0000000100000000, 0x0000000100000000)
      eden space 648704K, 100% used [0x00000000d5580000,0x00000000fcf00000,0x00000000fcf00000)
      from space 25088K, 0% used [0x00000000fcf00000,0x00000000fcf00000,0x00000000fe780000)
      to   space 25088K, 0% used [0x00000000fe780000,0x00000000fe780000,0x0000000100000000)
     ParOldGen       total 1398272K, used 1397781K [0x0000000080000000, 0x00000000d5580000, 0x00000000d5580000)
      object space 1398272K, 99% used [0x0000000080000000,0x00000000d55054d0,0x00000000d5580000)
     Metaspace       used 133520K, capacity 143098K, committed 143508K, reserved 1175552K
      class space    used 16519K, capacity 17415K, committed 17596K, reserved 1048576K
    2022-07-27T00:10:48.605+0800: 18674.282: [Full GC (Ergonomics) [PSYoungGen: 648704K->56957K(673792K)] [ParOldGen: 1397781K->1398167K(1398272K)] 2046485K->1455125K(2072064K), [Metaspace: 133520K->133520K(1175552K)], 0.7382079 secs] [Times: user=18.29 sys=0.05, real=0.74 secs] 
    Heap after GC invocations=4313 (full 122):
     PSYoungGen      total 673792K, used 56957K [0x00000000d5580000, 0x0000000100000000, 0x0000000100000000)
      eden space 648704K, 8% used [0x00000000d5580000,0x00000000d8d1f710,0x00000000fcf00000)
      from space 25088K, 0% used [0x00000000fcf00000,0x00000000fcf00000,0x00000000fe780000)
      to   space 25088K, 0% used [0x00000000fe780000,0x00000000fe780000,0x0000000100000000)
     ParOldGen       total 1398272K, used 1398167K [0x0000000080000000, 0x00000000d5580000, 0x00000000d5580000)
      object space 1398272K, 99% used [0x0000000080000000,0x00000000d5565df0,0x00000000d5580000)
     Metaspace       used 133520K, capacity 143098K, committed 143508K, reserved 1175552K
      class space    used 16519K, capacity 17415K, committed 17596K, reserved 1048576K
    }
    {Heap before GC invocations=4314 (full 123):
     PSYoungGen      total 673792K, used 648704K [0x00000000d5580000, 0x0000000100000000, 0x0000000100000000)
      eden space 648704K, 100% used [0x00000000d5580000,0x00000000fcf00000,0x00000000fcf00000)
      from space 25088K, 0% used [0x00000000fcf00000,0x00000000fcf00000,0x00000000fe780000)
      to   space 25088K, 0% used [0x00000000fe780000,0x00000000fe780000,0x0000000100000000)
     ParOldGen       total 1398272K, used 1398167K [0x0000000080000000, 0x00000000d5580000, 0x00000000d5580000)
      object space 1398272K, 99% used [0x0000000080000000,0x00000000d5565df0,0x00000000d5580000)
     Metaspace       used 133520K, capacity 143098K, committed 143508K, reserved 1175552K
      class space    used 16519K, capacity 17415K, committed 17596K, reserved 1048576K
    2022-07-27T00:10:54.991+0800: 18680.667: [Full GC (Ergonomics) [PSYoungGen: 648704K->57325K(673792K)] [ParOldGen: 1398167K->1398090K(1398272K)] 2046871K->1455415K(2072064K), [Metaspace: 133520K->133520K(1175552K)], 0.6338603 secs] [Times: user=14.77 sys=0.02, real=0.64 secs] 
    Heap after GC invocations=4314 (full 123):
     PSYoungGen      total 673792K, used 57325K [0x00000000d5580000, 0x0000000100000000, 0x0000000100000000)
      eden space 648704K, 8% used [0x00000000d5580000,0x00000000d8d7b600,0x00000000fcf00000)
      from space 25088K, 0% used [0x00000000fcf00000,0x00000000fcf00000,0x00000000fe780000)
      to   space 25088K, 0% used [0x00000000fe780000,0x00000000fe780000,0x0000000100000000)
     ParOldGen       total 1398272K, used 1398090K [0x0000000080000000, 0x00000000d5580000, 0x00000000d5580000)
      object space 1398272K, 99% used [0x0000000080000000,0x00000000d55528a0,0x00000000d5580000)
     Metaspace       used 133520K, capacity 143098K, committed 143508K, reserved 1175552K
      class space    used 16519K, capacity 17415K, committed 17596K, reserved 1048576K
    }
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37
    • 38
    • 39
    • 40
    • 41
    • 42
    • 43
    • 44
    • 45
    • 46
    • 47
    • 48
    • 49
    • 50
    • 51
    • 52
    • 53
    • 54
    • 55
    • 56
    • 57
    • 58
    • 59
    • 60

    在页面上搜Full GC 可以看到非常的频繁
    在这里插入图片描述


    原因分析:

    jstat -gcutil < pid >

    可以看到 youngGC 基本不变,Full GC一致飙升
    在这里插入图片描述

    S0: 新生代中Survivor space 0区已使用空间的百分比
    S1: 新生代中Survivor space 1区已使用空间的百分比
    E: 新生代已使用空间的百分比
    O: 老年代已使用空间的百分比
    P: 永久带已使用空间的百分比

    YGC: 从应用程序启动到当前,发生Yang GC 的次数
    YGCT: 从应用程序启动到当前,Yang GC所用的时间【单位秒】
    FGC: 从应用程序启动到当前,发生Full GC的次数
    FGCT: 从应用程序启动到当前,Full GC所用的时间
    GCT: 从应用程序启动到当前,用于垃圾回收的总时间【单位秒】

    GC日志

    可以看到Full GC原因基本是 (Ergonomics)而且GC对老年代已经起不了作用了

    2022-07-27T00:10:48.605+0800: 18674.282: [Full GC (Ergonomics) [PSYoungGen: 648704K->56957K(673792K)] [ParOldGen: 1397781K->1398167K(1398272K)] 2046485K->1455125K(2072064K), [Metaspace: 133520K->133520K(1175552K)], 0.7382079 secs] [Times: user=18.29 sys=0.05, real=0.74 secs] 
    
    • 1

    查找资料,发现当使用 Server 模式下的ParallelGC 收集器组合(Parallel Scavenge + Serial Old)时,会在 Minor GC前进行一次判断,也就是 内存空间分配担保机制:
    Eden 空间不足发生 Minor GC 之前,虚拟机先检查老年代最大可用的连续空间是否大于新生代所有对象总空间,如果条件成立的话,那么 Minor GC 可以确认是安全的。如果不成立的话虚拟机查看 HandlePromotionFailure 的值是否允许担保失败,如果HandlePromotionFailure 的值不允许冒险,那么就要进行一次 Full GC。如果允许则检查老年代最大可用的连续空间是否大于历次晋升到老年代对象的平均大小,如果大于将继续尝试进行 Minor GC,小于的话就要进行 Full GC以保证 Minor GC 的空间担保成功
    这个 Java 服务没有通过启动参数配置垃圾收集器,查看堆配置发现是使用的默认 Parallel GC 。显然,合理的推测是空间分配担保失败导致了Full GC

    jmap -heap < pid>

    看到发现老年代的使用已经达到了 99.97697796737938%,只剩下了 0.03M,是可以和之前的推测相佐证的

    PS Old Generation
    capacity = 179306496 (171.0MB)
    used     = 179265216 (170.96063232421875MB)
    free     = 41280 (0.03936767578125MB)
    99.97697796737938% used
    
    • 1
    • 2
    • 3
    • 4
    • 5

    jmap -histo < pid> | head -n 30

    查看堆内存中占用内存最多的前 30 个类实例,发现可疑点 ExectionListenerBus一直增加,导致堆内存一直被占用得不到释放


    解决方案:

    查了spark 的github看到有人提了PR,得知是spark框架的bug,本次bug在spark3.0.3已经修复

  • 相关阅读:
    保姆级k8s集群安装教程
    ArrayList与顺序表
    TCP协议相关实验
    RNA-seq 保姆教程:差异表达分析(二)
    【鸿蒙 HarmonyOS 4.0】多设备响应式布局
    【upload靶场12-16】截断、图片马
    csp 202212-2 训练计划
    算法-栈操作
    用一个 Python 脚本实现依次运行其他多个带 argparse 命令行参数的 .py 文件
    VUE3中的setup
  • 原文地址:https://blog.csdn.net/Lzx116/article/details/126076696