背景
线上发现elasticsearch集群状态red,并且有个es节点jvm内存使用不断升高,直到gc后依然内存不够使用,服务停止。查看日志,elasticsearch出现OOM报错。
[2023-12-06T08:21:26,706][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [node-10.136.5.85] fatal error in thread [Thread-1243], exiting java.lang.OutOfMemoryError: Java heap space at io.netty.util.internal.PlatformDependent.allocateUninitializedArray(PlatformDependent.java:281) ~[?:?] at io.netty.buffer.PoolArena$HeapArena.newByteArray(PoolArena.java:662) ~[?:?] at io.netty.buffer.PoolArena$HeapArena.newChunk(PoolArena.java:672) ~[?:?] at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:247) ~[?:?] at io.netty.buffer.PoolArena.allocate(PoolArena.java:227) ~[?:?] at io.netty.buffer.PoolArena.allocate(PoolArena.java:147) ~[?:?] at io.netty.buffer.PooledByteBufAllocator.newHeapBuffer(PooledByteBufAllocator.java:339) ~[?:?] at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:168) ~[?:?] at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:159) ~[?:?] at org.elasticsearch.transport.NettyAllocator$NoDirectBuffers.heapBuffer(NettyAllocator.java:137) ~[?:?] at org.elasticsearch.transport.NettyAllocator$NoDirectBuffers.ioBuffer(NettyAllocator.java:122) ~[?:?] at io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:114) ~[?:?] at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:147) ~[?:?] at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714) ~[?:?] at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:615) ~[?:?] at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:578) ~[?:?] at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) ~[?:?] at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[?:?] at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?] at java.lang.Thread.run(Thread.java:832) [?:?] [2023-12-06T08:21:26,707][WARN ][o.e.h.AbstractHttpServerTransport] [node-10.136.5.85] caught exception while handling client http traffic, closing connection Netty4HttpChannel{localAddress=/10.136.5.85:9200, remoteAddress=/10.136.5.71:49648} java.lang.Exception: java.lang.OutOfMemoryError: Java heap space at org.elasticsearch.http.netty4.Netty4HttpRequestHandler.exceptionCaught(Netty4HttpRequestHandler.java:69) [transport-netty4-client-7.8.1.jar:7.8.1] at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302) [netty-transport-4.1.49.Final.jar:4.1.49.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281) [netty-transport-4.1.49.Final.jar:4.1.49.Final] at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:273) [netty-transport-4.1.49.Final.jar:4.1.49.Final] at io.netty.channel.DefaultChannelPipeline$HeadContext.exceptionCaught(DefaultChannelPipeline.java:1377) [netty-transport-4.1.49.Final.jar:4.1.49.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302) [netty-transport-4.1.49.Final.jar:4.1.49.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281) [netty-transport-4.1.49.Final.jar:4.1.49.Final] at io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:907) [netty-transport-4.1.49.Final.jar:4.1.49.Final] at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.handleReadException(AbstractNioByteChannel.java:125) [netty-transport-4.1.49.Final.jar:4.1.49.Final] at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:174) [netty-transport-4.1.49.Final.jar:4.1.49.Final] at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714) [netty-transport-4.1.49.Final.jar:4.1.49.Final] at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:615) [netty-transport-4.1.49.Final.jar:4.1.49.Final] at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:578) [netty-transport-4.1.49.Final.jar:4.1.49.Final] at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) [netty-transport-4.1.49.Final.jar:4.1.49.Final] at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) [netty-common-4.1.49.Final.jar:4.1.49.Final] at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.49.Final.jar:4.1.49.Final] at java.lang.Thread.run(Thread.java:832) [?:?] Caused by: java.lang.OutOfMemoryError: Java heap space at io.netty.util.internal.PlatformDependent.allocateUninitializedArray(PlatformDependent.java:281) ~[?:?] at io.netty.buffer.PoolArena$HeapArena.newByteArray(PoolArena.java:662) ~[?:?] at io.netty.buffer.PoolArena$HeapArena.newChunk(PoolArena.java:672) ~[?:?] at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:247) ~[?:?] at io.netty.buffer.PoolArena.allocate(PoolArena.java:227) ~[?:?] at io.netty.buffer.PoolArena.allocate(PoolArena.java:147) ~[?:?] at io.netty.buffer.PooledByteBufAllocator.newHeapBuffer(PooledByteBufAllocator.java:339) ~[?:?] at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:168) ~[?:?] at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:159) ~[?:?] at org.elasticsearch.transport.NettyAllocator$NoDirectBuffers.heapBuffer(NettyAllocator.java:137) ~[?:?] at org.elasticsearch.transport.NettyAllocator$NoDirectBuffers.ioBuffer(NettyAllocator.java:122) ~[?:?] at io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:114) ~[?:?] at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:147) ~[?:?] ... 7 more
搜索之前的广大网友的经验,discuss论坛有和我这一模一样的报错,但是没有回答。相似的报错,GitHub上有一个issue,但是已经在7.4版本解决了。我的版本号是7.8.1,说明问题不是一个问题。
先排查机器内存是否不够用,发现不是,是jvm.options配置的-Xms和-Xmx内存不够用,尝试配置到机器内存一半,配置30G,观察,依然出现同样的问题。
排查报错日志,发现是netty的报错,怀疑是写入出现问题。
Dump一下内存,分析下是什么占用了这么多的内存。
并发写入,用到了ScheduledThreadPool ,创建了大量线程 gc无法回收线程中内存导致内存不够用了。
既然是写入问题,那测试下硬盘问题,使用fio测试硬盘随机读写
使用iostat -xdm 10
查看实时的硬盘读写,大概3M/s 磁盘随机读写都有10M/s了,说明不是磁盘的问题。
查看网络问题,ping有问题的机器,发现了端倪,只有他又慢,延迟又高,其它都比较低。
使用iperf检测网络问题。还没检查,得知客户给这台机器插得网口是百兆网口,其它机器都是千兆网口。
结合elasticsearch 集群Transport业务逻辑,分片平均分配到所有机器上,每台机器都接收写入的请求。整个集群的网络分发数据有木桶效应,一个网口慢会让整个集群都慢。(如果没有特别指定分片所处的节点)
结论
出现上述报错,优先排查集群网络问题,查看数据的写入量是否超过了网口速率的上限。改成千兆网口即可
__EOF__
本文作者:赛博朋克V
本文链接:https://www.cnblogs.com/pengliblogs/p/17949246.html
关于博主:评论和私信会在第一时间回复。或者直接私信我。
版权声明:本博客所有文章除特别声明外,均采用 BY-NC-SA 许可协议。转载请注明出处!
声援博主:如果您觉得文章对您有帮助,可以点击文章右下角【推荐】一下。您的鼓励是博主的最大动力!
本文链接:https://www.cnblogs.com/pengliblogs/p/17949246.html
关于博主:评论和私信会在第一时间回复。或者直接私信我。
版权声明:本博客所有文章除特别声明外,均采用 BY-NC-SA 许可协议。转载请注明出处!
声援博主:如果您觉得文章对您有帮助,可以点击文章右下角【推荐】一下。您的鼓励是博主的最大动力!