有一个线上Spark 任务频繁GC最后导致挂掉,下面是对它的排查分析
{Heap before GC invocations=4312 (full 121):
PSYoungGen total 673792K, used 648704K [0x00000000d5580000, 0x0000000100000000, 0x0000000100000000)
eden space 648704K, 100% used [0x00000000d5580000,0x00000000fcf00000,0x00000000fcf00000)
from space 25088K, 0% used [0x00000000fcf00000,0x00000000fcf00000,0x00000000fe780000)
to space 25088K, 0% used [0x00000000fe780000,0x00000000fe780000,0x0000000100000000)
ParOldGen total 1398272K, used 1398167K [0x0000000080000000, 0x00000000d5580000, 0x00000000d5580000)
object space 1398272K, 99% used [0x0000000080000000,0x00000000d5565cf0,0x00000000d5580000)
Metaspace used 133520K, capacity 143098K, committed 143508K, reserved 1175552K
class space used 16519K, capacity 17415K, committed 17596K, reserved 1048576K
2022-07-27T00:10:42.519+0800: 18668.196: [Full GC (Ergonomics) [PSYoungGen: 648704K->57867K(673792K)] [ParOldGen: 1398167K->1397781K(1398272K)] 2046871K->1455648K(2072064K), [Metaspace: 133520K->133520K(1175552K)], 0.6053255 secs] [Times: user=14.25 sys=0.02, real=0.61 secs]
Heap after GC invocations=4312 (full 121):
PSYoungGen total 673792K, used 57867K [0x00000000d5580000, 0x0000000100000000, 0x0000000100000000)
eden space 648704K, 8% used [0x00000000d5580000,0x00000000d8e02da0,0x00000000fcf00000)
from space 25088K, 0% used [0x00000000fcf00000,0x00000000fcf00000,0x00000000fe780000)
to space 25088K, 0% used [0x00000000fe780000,0x00000000fe780000,0x0000000100000000)
ParOldGen total 1398272K, used 1397781K [0x0000000080000000, 0x00000000d5580000, 0x00000000d5580000)
object space 1398272K, 99% used [0x0000000080000000,0x00000000d55054d0,0x00000000d5580000)
Metaspace used 133520K, capacity 143098K, committed 143508K, reserved 1175552K
class space used 16519K, capacity 17415K, committed 17596K, reserved 1048576K
}
{Heap before GC invocations=4313 (full 122):
PSYoungGen total 673792K, used 648704K [0x00000000d5580000, 0x0000000100000000, 0x0000000100000000)
eden space 648704K, 100% used [0x00000000d5580000,0x00000000fcf00000,0x00000000fcf00000)
from space 25088K, 0% used [0x00000000fcf00000,0x00000000fcf00000,0x00000000fe780000)
to space 25088K, 0% used [0x00000000fe780000,0x00000000fe780000,0x0000000100000000)
ParOldGen total 1398272K, used 1397781K [0x0000000080000000, 0x00000000d5580000, 0x00000000d5580000)
object space 1398272K, 99% used [0x0000000080000000,0x00000000d55054d0,0x00000000d5580000)
Metaspace used 133520K, capacity 143098K, committed 143508K, reserved 1175552K
class space used 16519K, capacity 17415K, committed 17596K, reserved 1048576K
2022-07-27T00:10:48.605+0800: 18674.282: [Full GC (Ergonomics) [PSYoungGen: 648704K->56957K(673792K)] [ParOldGen: 1397781K->1398167K(1398272K)] 2046485K->1455125K(2072064K), [Metaspace: 133520K->133520K(1175552K)], 0.7382079 secs] [Times: user=18.29 sys=0.05, real=0.74 secs]
Heap after GC invocations=4313 (full 122):
PSYoungGen total 673792K, used 56957K [0x00000000d5580000, 0x0000000100000000, 0x0000000100000000)
eden space 648704K, 8% used [0x00000000d5580000,0x00000000d8d1f710,0x00000000fcf00000)
from space 25088K, 0% used [0x00000000fcf00000,0x00000000fcf00000,0x00000000fe780000)
to space 25088K, 0% used [0x00000000fe780000,0x00000000fe780000,0x0000000100000000)
ParOldGen total 1398272K, used 1398167K [0x0000000080000000, 0x00000000d5580000, 0x00000000d5580000)
object space 1398272K, 99% used [0x0000000080000000,0x00000000d5565df0,0x00000000d5580000)
Metaspace used 133520K, capacity 143098K, committed 143508K, reserved 1175552K
class space used 16519K, capacity 17415K, committed 17596K, reserved 1048576K
}
{Heap before GC invocations=4314 (full 123):
PSYoungGen total 673792K, used 648704K [0x00000000d5580000, 0x0000000100000000, 0x0000000100000000)
eden space 648704K, 100% used [0x00000000d5580000,0x00000000fcf00000,0x00000000fcf00000)
from space 25088K, 0% used [0x00000000fcf00000,0x00000000fcf00000,0x00000000fe780000)
to space 25088K, 0% used [0x00000000fe780000,0x00000000fe780000,0x0000000100000000)
ParOldGen total 1398272K, used 1398167K [0x0000000080000000, 0x00000000d5580000, 0x00000000d5580000)
object space 1398272K, 99% used [0x0000000080000000,0x00000000d5565df0,0x00000000d5580000)
Metaspace used 133520K, capacity 143098K, committed 143508K, reserved 1175552K
class space used 16519K, capacity 17415K, committed 17596K, reserved 1048576K
2022-07-27T00:10:54.991+0800: 18680.667: [Full GC (Ergonomics) [PSYoungGen: 648704K->57325K(673792K)] [ParOldGen: 1398167K->1398090K(1398272K)] 2046871K->1455415K(2072064K), [Metaspace: 133520K->133520K(1175552K)], 0.6338603 secs] [Times: user=14.77 sys=0.02, real=0.64 secs]
Heap after GC invocations=4314 (full 123):
PSYoungGen total 673792K, used 57325K [0x00000000d5580000, 0x0000000100000000, 0x0000000100000000)
eden space 648704K, 8% used [0x00000000d5580000,0x00000000d8d7b600,0x00000000fcf00000)
from space 25088K, 0% used [0x00000000fcf00000,0x00000000fcf00000,0x00000000fe780000)
to space 25088K, 0% used [0x00000000fe780000,0x00000000fe780000,0x0000000100000000)
ParOldGen total 1398272K, used 1398090K [0x0000000080000000, 0x00000000d5580000, 0x00000000d5580000)
object space 1398272K, 99% used [0x0000000080000000,0x00000000d55528a0,0x00000000d5580000)
Metaspace used 133520K, capacity 143098K, committed 143508K, reserved 1175552K
class space used 16519K, capacity 17415K, committed 17596K, reserved 1048576K
}
在页面上搜Full GC 可以看到非常的频繁
可以看到 youngGC 基本不变,Full GC一致飙升
S0: 新生代中Survivor space 0区已使用空间的百分比
S1: 新生代中Survivor space 1区已使用空间的百分比
E: 新生代已使用空间的百分比
O: 老年代已使用空间的百分比
P: 永久带已使用空间的百分比
YGC: 从应用程序启动到当前,发生Yang GC 的次数
YGCT: 从应用程序启动到当前,Yang GC所用的时间【单位秒】
FGC: 从应用程序启动到当前,发生Full GC的次数
FGCT: 从应用程序启动到当前,Full GC所用的时间
GCT: 从应用程序启动到当前,用于垃圾回收的总时间【单位秒】
可以看到Full GC原因基本是 (Ergonomics)而且GC对老年代已经起不了作用了
2022-07-27T00:10:48.605+0800: 18674.282: [Full GC (Ergonomics) [PSYoungGen: 648704K->56957K(673792K)] [ParOldGen: 1397781K->1398167K(1398272K)] 2046485K->1455125K(2072064K), [Metaspace: 133520K->133520K(1175552K)], 0.7382079 secs] [Times: user=18.29 sys=0.05, real=0.74 secs]
查找资料,发现当使用 Server 模式下的ParallelGC 收集器组合(Parallel Scavenge + Serial Old)时,会在 Minor GC前进行一次判断,也就是 内存空间分配担保机制:
Eden 空间不足发生 Minor GC 之前,虚拟机先检查老年代最大可用的连续空间是否大于新生代所有对象总空间,如果条件成立的话,那么 Minor GC 可以确认是安全的。如果不成立的话虚拟机查看 HandlePromotionFailure 的值是否允许担保失败,如果HandlePromotionFailure 的值不允许冒险,那么就要进行一次 Full GC。如果允许则检查老年代最大可用的连续空间是否大于历次晋升到老年代对象的平均大小,如果大于将继续尝试进行 Minor GC,小于的话就要进行 Full GC以保证 Minor GC 的空间担保成功
这个 Java 服务没有通过启动参数配置垃圾收集器,查看堆配置发现是使用的默认 Parallel GC 。显然,合理的推测是空间分配担保失败导致了Full GC
看到发现老年代的使用已经达到了 99.97697796737938%,只剩下了 0.03M,是可以和之前的推测相佐证的
PS Old Generation
capacity = 179306496 (171.0MB)
used = 179265216 (170.96063232421875MB)
free = 41280 (0.03936767578125MB)
99.97697796737938% used
查看堆内存中占用内存最多的前 30 个类实例,发现可疑点
ExectionListenerBus
一直增加,导致堆内存一直被占用得不到释放
查了spark 的github看到有人提了PR,得知是spark框架的bug,本次bug在spark3.0.3已经修复