最近项目上线,隔几天就出现一次OutOfDirectMemoryError,项目使用netty进行网络数据处理,笔者输出这篇文章记录分析与处理的过程,分享给所有读者,希望能帮助到遇到同样问题的人。(备注:本篇文章为笔者模拟当时的情况,还原问题处理过程。)
10:14:02.857 [AlarmFile-7-2] WARN netty.channel.DefaultChannelPipeline[1152]An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 1073741824, max: 1073741824)
at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:802) ~[netty-common-4.1.72.Final.jar:4.1.72.Final]
at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:731) ~[netty-common-4.1.72.Final.jar:4.1.72.Final]
at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:648) ~[netty-buffer-4.1.72.Final.jar:4.1.72.Final]
at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:623) ~[netty-buffer-4.1.72.Final.jar:4.1.72.Final]
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:202) ~[netty-buffer-4.1.72.Final.jar:4.1.72.Final]
at io.netty.buffer.PoolArena.tcacheAllocateSmall(PoolArena.java:172) ~[netty-buffer-4.1.72.Final.jar:4.1.72.Final]
at io.netty.buffer.PoolArena.allocate(PoolArena.java:134) ~[netty-buffer-4.1.72.Final.jar:4.1.72.Final]
at io.netty.buffer.PoolArena.allocate(PoolArena.java:126) ~[netty-buffer-4.1.72.Final.jar:4.1.72.Final]
at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:395) ~[netty-buffer-4.1.72.Final.jar:4.1.72.Final]
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:188) ~[netty-buffer-4.1.72.Final.jar:4.1.72.Final]
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:179) ~[netty-buffer-4.1.72.Final.jar:4.1.72.Final]
at io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:116) ~[netty-buffer-4.1.72.Final.jar:4.1.72.Final]
...
日志显示OutOfDirectMemoryError,来自程序中netty分配堆外内存的时候。
先跟踪一下堆外内存使用情况,netty里提供了相应的api查询堆外内存使用情况,在程序中添加如下代码:
LOG.info("directmemory use:{}, max:{}",
PlatformDependent.usedDirectMemory(), PlatformDependent.maxDirectMemory());
然后跟踪一下日志:
10:17:48.178 [*******-4-4] INFO ****************[66]directmemory use:234881024, max:1073741824
10:17:53.373 [*******-4-1] INFO ****************[66]directmemory use:234881024, max:1073741824
10:17:57.467 [*******-4-4] INFO ****************[66]directmemory use:234881024, max:1073741824
10:17:57.599 [*******-4-4] INFO ****************[66]directmemory use:234881024, max:1073741824
10:17:57.617 [*******-4-2] INFO ****************[66]directmemory use:234881024, max:1073741824
10:17:58.638 [*******-4-1] INFO ****************[66]directmemory use:234881024, max:1073741824
10:17:58.748 [*******-4-4] INFO ****************[66]directmemory use:234881024, max:1073741824
10:17:59.540 [*******-4-2] INFO ****************[66]directmemory use:234881024, max:1073741824
...
10:50:51.873 [*******-4-2] INFO ****************[66]directmemory use:268435456, max:1073741824
10:50:58.220 [*******-4-4] INFO ****************[66]directmemory use:268435456, max:1073741824
10:50:58.973 [*******-4-5] INFO ****************[66]directmemory use:268435456, max:1073741824
...
11:08:49.983 [*******-4-5] INFO ****************[66]directmemory use:301989888, max:1073741824
11:08:52.401 [*******-4-1] INFO ****************[66]directmemory use:301989888, max:1073741824
11:08:53.459 [*******-4-2] INFO ****************[66]directmemory use:301989888, max:1073741824
观察到的现象是:堆外内存一直增长不降,没有回收,这种情况大概率是代码中出现了内存泄露。
先用arthas抓几个dump看看情况
java -jar arthas-boot.jar
...
heapdump /tmp/*****-08291045-dump.hprof
抓取了2个时间点的dump文件,发现堆外内存中netty的PooledUnsafeDirectByteBuf数量增长很多。
到这里只能知道是代码中ByteBuf引起的,但具体是哪里的ByteBuf引起的呢?
netty的官方文档,有两种方式可以 打印出泄漏的详细信息
第一种是在netty定义时添加如下代码:
ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.ADVANCED);
第二种是在jvm中添加如下参数:
-Dio.netty.leakDetectionLevel=advanced
这里笔者选用第二种方式,添加参数后重新启动项目,观察错误日志:
16:48:33.249 [808-TCP-4-1] ERROR netty.util.ResourceLeakDetector[319]LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records:
#1:
io.netty.buffer.AdvancedLeakAwareByteBuf.readUnsignedInt(AdvancedLeakAwareByteBuf.java:443)
com.*****.***********.******.protostar.schema.NumberPSchema$DWORD2Long.readAndSet(NumberPSchema.java:177)
com.*****.***********.******.protostar.schema.RuntimeSchema.mergeFrom(RuntimeSchema.java:63)
com.*****.***********.******.protocol.codec.JTMessageDecoder.decryptMessage(JTMessageDecoder.java:160)
...
#2:
io.netty.buffer.AdvancedLeakAwareByteBuf.readBytes(AdvancedLeakAwareByteBuf.java:497)
com.*****.***********.******.protostar.schema.StringSchema$STR.readFrom(StringSchema.java:58)
com.*****.***********.******.protostar.schema.StringSchema$STR.readFrom(StringSchema.java:38)
com.*****.***********.******.protostar.Schema.readFrom(Schema.java:17)
...
#3:
io.netty.buffer.AdvancedLeakAwareByteBuf.getUnsignedByte(AdvancedLeakAwareByteBuf.java:161)
io.netty.buffer.ByteBufUtil$HexUtil.hexDump(ByteBufUtil.java:1496)
io.netty.buffer.ByteBufUtil$HexUtil.access$000(ByteBufUtil.java:1420)
io.netty.buffer.ByteBufUtil.hexDump(ByteBufUtil.java:143)
io.netty.buffer.ByteBufUtil.hexDump(ByteBufUtil.java:135)
com.*****.***********.******.protocol.codec.JTMessageDecoder.decrypt(JTMessageDecoder.java:238)
com.*****.***********.******.protocol.codec.JTMessageDecoder.decryptMessage(JTMessageDecoder.java:145)
com.*****.***********.******.protocol.codec.JTMessageDecoder.decode(JTMessageDecoder.java:128)
...
#4:
io.netty.buffer.AdvancedLeakAwareByteBuf.writeBytes(AdvancedLeakAwareByteBuf.java:611)
com.*****.***********.******.protocol.codec.JTMessageDecoder.decrypt(JTMessageDecoder.java:237)
com.*****.***********.******.protocol.codec.JTMessageDecoder.decryptMessage(JTMessageDecoder.java:145)
com.*****.***********.******.protocol.codec.JTMessageDecoder.decode(JTMessageDecoder.java:128)
...
Created at:
io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:402)
io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:188)
io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:179)
io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:116)
com.*****.***********.******.protocol.codec.JTMessageDecoder.decrypt(JTMessageDecoder.java:229)
com.*****.***********.******.protocol.codec.JTMessageDecoder.decryptMessage(JTMessageDecoder.java:145)
com.*****.***********.******.protocol.codec.JTMessageDecoder.decode(JTMessageDecoder.java:128)
...
打印的错误堆栈里就可以看到具体泄露的地方了。
找到这些泄露的地方,我们做release释放就可以解决问题了。
对泄露的bytebuf确保释放即可:
private void decryptMessage(...) {
// ...
try {
// ...
} catch (Exception e) {
throw e;
} finally {
// 释放 release 或者 ReferenceCountUtil.safeRelease(buf);
buf.release();
}
}
更新程序后上线,观察运行情况,堆外内存没有再出现异常情况。