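Background: dutifully trying to save the boss some money, I bought a server at the half-price new-user discount. After the distributed project was migrated to the new server, services started dropping offline with timeouts. In fact they were only shown as offline in Spring Boot Admin; they were still registered in Nacos.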
- java.util.concurrent.TimeoutException: Did not observe any item or terminal signal within 60000ms in 'map' (and no fallback has been configured)
- at reactor.core.publisher.FluxTimeout$TimeoutMainSubscriber.handleTimeout(FluxTimeout.java:295)
- Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException:
- Error has been observed at the following site(s):
- |_ checkpoint ⇢ Request to GET health [DefaultWebClient]
- Stack trace:
- at reactor.core.publisher.FluxTimeout$TimeoutMainSubscriber.handleTimeout(FluxTimeout.java:295)
- at reactor.core.publisher.FluxTimeout$TimeoutMainSubscriber.doTimeout(FluxTimeout.java:280)
- at reactor.core.publisher.FluxTimeout$TimeoutTimeoutSubscriber.onNext(FluxTimeout.java:419)
- at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onNext(FluxOnErrorResume.java:79)
- at reactor.core.publisher.MonoDelay$MonoDelayRunnable.propagateDelay(MonoDelay.java:271)
- at reactor.core.publisher.MonoDelay$MonoDelayRunnable.run(MonoDelay.java:286)
- at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
- at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
- at java.util.concurrent.FutureTask.run(FutureTask.java:266)
- at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
- at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
- at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
- at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
- at java.lang.Thread.run(Thread.java:750)
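After half a month of trial and error (working overtime hunting for a fix, to the point where the overtime nearly finished me off), I finally found the answer: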
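Spring Boot Admin polls each consumer service's health endpoint at a regular interval, and if the response times out, the service is shown as offline. The health endpoint checks mail, Redis, and other dependencies, and any of them can add latency. The mail check is a special case: it tests the connection to the mail server, so it cannot always return immediately, and now and then it shows extremely high latency. That blocks the request, and the health endpoint returns no data. Spring Boot Admin, getting no response, marks the consumer as offline. My problem was caused by exactly this mail check. Just switching servers made the mail check trigger frequent drops. Unbelievable.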
Solution: add the following configuration to the consumer service:
management:
  health:
    mail:
      enabled: false
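To make the mechanism concrete, here is a minimal plain-Java sketch of what I believe is going on. It is not Spring Boot Admin or Actuator code: the class `HealthTimeoutSketch`, its methods, and the delays are all made up for illustration. It assumes the health result is only ready once every enabled check has finished, and that the monitor gives up after a fixed timeout.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Simplified stand-in for the monitor/health interaction (hypothetical, not library code).
class HealthTimeoutSketch {

    // Runs every check in order, like an aggregated health endpoint:
    // the answer is only ready after the slowest check has finished.
    static String aggregateHealth(Map<String, Callable<Boolean>> checks) throws Exception {
        for (Callable<Boolean> check : checks.values()) {
            if (!check.call()) {
                return "DOWN";
            }
        }
        return "UP";
    }

    // Plays the monitor: waits at most timeoutMs for the aggregated result.
    static String probe(Map<String, Callable<Boolean>> checks, long timeoutMs) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Callable<String> task = () -> aggregateHealth(checks);
            Future<String> result = executor.submit(task);
            return result.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return "OFFLINE"; // no answer in time, so the service is treated as offline
        } finally {
            executor.shutdownNow(); // interrupts a check that is still stalled
        }
    }

    // Hypothetical checks: redis answers at once, mail stalls for mailDelayMs.
    static Map<String, Callable<Boolean>> checks(boolean mailEnabled, long mailDelayMs) {
        Map<String, Callable<Boolean>> checks = new LinkedHashMap<>();
        checks.put("redis", () -> true);
        if (mailEnabled) {
            checks.put("mail", () -> {
                Thread.sleep(mailDelayMs); // simulated slow mail server
                return true;
            });
        }
        return checks;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(probe(checks(true, 2000), 300));  // mail check enabled
        System.out.println(probe(checks(false, 2000), 300)); // mail check disabled
    }
}
```

With the simulated mail check enabled, the probe gives up and reports OFFLINE even though nothing is actually broken; with it disabled, the same probe returns UP. That is the effect the `enabled: false` setting had in my case.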
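Below are the other approaches I tried. They did not solve my problem, but I am leaving them here for reference so that others can avoid digging through piles of useless material.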
Cause 1: a Spring Boot Admin version issue. Version 2.6.1 reportedly has this bug, which was fixed in later releases. I switched to 2.6.6.
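Cause 2: the health-check timeout, which as far as I know defaults to 5 s, is too short. Raise it by adding the following to the configuration file of the project that runs Spring Boot Admin (values are in milliseconds):
spring:
  boot:
    admin:
      monitor:
        default-timeout: 30000
        status-interval: 15000
        status-lifetime: 15000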
Cause 3: response timeouts caused by the CPU going into a sleep state. Search online for the specific fix; I did not try this one.
Cause 4: incorrect use of a thread pool blocking the process. Check the log files and track down the code responsible.
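Cause 5: the service's memory usage exceeds the limit, so the server needs more memory. There will be a corresponding out-of-memory error, so this one is easy to diagnose.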
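For cause 4, here is one way thread-pool misuse can block things, as a hedged sketch. This is a made-up example, not code from my project: the class `PoolStarvationSketch` and its method are hypothetical. A task running in a fixed-size pool submits a second task to the same pool and waits for it. If every worker is busy waiting like this, the inner task never gets a thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration of thread-pool starvation.
class PoolStarvationSketch {

    // Returns true if the outer task finished within timeoutMs.
    static boolean completes(int poolSize, long timeoutMs) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            Callable<String> outer = () -> {
                Callable<String> innerTask = () -> "done";
                Future<String> inner = pool.submit(innerTask);
                // Blocks this worker until the inner task runs on the SAME pool.
                return inner.get();
            };
            Future<String> result = pool.submit(outer);
            result.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false; // the outer task is stuck waiting for a free worker
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(completes(1, 500));  // single worker: outer task starves the inner one
        System.out.println(completes(2, 2000)); // a second worker lets the inner task run
    }
}
```

With one worker the call never finishes on its own and only the timeout ends it; with two workers it completes. If a request handler gets stuck this way, a health request waiting behind it would time out just like in the stack trace above, so thread dumps and logs are the place to look.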