Today we hit a RabbitMQ outage. The symptom: the Java publisher sent messages but the consumer never received them, and the publisher reported no error (no confirm/ack mechanism was in use). RabbitMQ runs in Docker, so I restarted the container; the restart failed, and the logs showed:
2022-11-10 08:11:22.831 [info] <0.44.0> Application rabbit exited with reason: {{could_not_write_file,"/var/lib/rabbitmq/mnesia/rabbit@5d259b8e06a1/cluster_nodes.config",enospc},{rabbit,start,[normal,[]]}}
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{{could_not_write_file,\"/var/lib/rabbitmq/mnesia/rabbit@5d259b8e06a1/cluster_nodes.config\",enospc},{rabbit,start,[normal,[]]}}}"}
08:17:58.111 [warning] Failed to write PID file "/var/lib/rabbitmq/mnesia/rabbit@5d259b8e06a1.pid": no space left on device
To export the container's logs:
docker logs containerId > log.file
Or to follow just the most recent entries:
docker logs -f --tail=100 containerId
Run docker info and find the Docker root directory:
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Check the usage of the disk that holds that root directory. Use% is already 100%, so the disk is out of space:
[root@xxx ~]# df -h /var/lib/docker
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 40G 40G 0 100% /
Deleting unneeded logs freed up space, and RabbitMQ then restarted successfully.
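If the culprit is container logs (assuming the default json-file logging driver and its default paths, which may differ on your host), these commands can locate and safely empty them; truncating rather than deleting keeps the file handle Docker holds open valid:
du -sh /var/lib/docker/containers/*/*-json.log
truncate -s 0 /var/lib/docker/containers/<containerId>/<containerId>-json.log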
Tracing back through the log, the warnings had actually started the day before: disk space was running low, and if not fixed, publishers would be blocked from sending. By default, RabbitMQ raises an alarm and blocks publishers when memory use reaches 40% of physical RAM (vm_memory_high_watermark.relative, default 0.4) or free disk space drops below 50MB (disk_free_limit, default 50MB):
**********************************************************
*** Publishers will be blocked until this alarm clears ***
**********************************************************
2022-11-09 19:13:11.675 [info] <0.437.0> Free disk space is sufficient. Free bytes: 51777536. Limit: 50000000
2022-11-09 19:13:11.675 [warning] <0.433.0> disk resource limit alarm cleared on node rabbit@5d259b8e06a1
2022-11-09 19:13:11.675 [warning] <0.433.0> disk resource limit alarm cleared across the cluster
2022-11-09 20:40:10.980 [info] <0.437.0> Free disk space is insufficient. Free bytes: 49848320. Limit: 50000000
2022-11-09 20:40:10.980 [warning] <0.433.0> disk resource limit alarm set on node rabbit@5d259b8e06a1.
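This also explains why the publisher failed silently: without publisher confirms there is no channel through which the application can notice the block, so the connection just stalls. Below is a minimal sketch with the plain Java client (host and queue name are placeholders, not from the original setup) that would have surfaced the problem as a timeout:

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class ConfirmingPublisher {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // assumption: broker address
        try (Connection conn = factory.newConnection();
             Channel channel = conn.createChannel()) {
            channel.confirmSelect(); // enable publisher confirms on this channel
            channel.queueDeclare("demo-queue", true, false, false, null);
            channel.basicPublish("", "demo-queue", null, "hello".getBytes());
            // Throws TimeoutException if the broker has not confirmed within 5s,
            // e.g. while a resource alarm is blocking publishing connections.
            channel.waitForConfirmsOrDie(5_000);
        }
    }
}

Note that when an alarm fires the broker stops reading from publishing connections, so the publish itself may block instead of the confirm timing out; either way the failure stops being silent.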
du -sh reports how much space the files under a path actually use, for example:
[root@xxx lib]# du -sh /var/lib/docker
4.4G /var/lib/docker
df -h reports usage of the filesystem that a path lives on:
[root@xxx lib]# df -h /var/lib/docker
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 40G 12G 27G 31% /
Running df -h with no argument shows usage for every mounted filesystem:
[root@xxx lib]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 16G 0 16G 0% /dev
tmpfs 16G 0 16G 0% /dev/shm
tmpfs 16G 1.1M 16G 1% /run
tmpfs 16G 0 16G 0% /sys/fs/cgroup
/dev/vda1 40G 12G 27G 31% /
tmpfs 3.1G 0 3.1G 0% /run/user/0
overlay 40G 12G 27G 31% /var/lib/docker/overlay2/584601df5d3aa8130909fd7f333bcfbff75c774a3ec363bff1be43e1f89e4508/merged
overlay 40G 12G 27G 31% /var/lib/docker/overlay2/87076bd5e7d8d907952d91a01582ef8197135cae69868d17df4932a7e2abe75d/merged
overlay 40G 12G 27G 31% /var/lib/docker/overlay2/e09ecced1347db602ad05103abf50b3fb305e62ee86729dd6dff52bec88dbaba/merged
overlay 40G 12G 27G 31% /var/lib/docker/overlay2/3854314a6cff111d497813c01517e5c6be4429dca9d1ed6a3deca4e3f477b318/merged
Four containers are running, hence the four overlay entries. The 12G on each line is the total used on the underlying disk (they all sit on /dev/vda1), not 12G per container.
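For a breakdown of how much of that disk Docker itself is using (images, containers, local volumes, build cache), there is also a built-in summary:
docker system df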
Another time, one queue was consuming messages slowly. To limit the business impact, the first move was to restart RabbitMQ. After the restart, the log filled with:
2024-03-08 14:57:45.152 [error] <0.556.0> Discarding message {'$gen_cast',{confirm,"+",<0.556.0>}} from <0.556.0> to <0.996.0> in an old incarnation (1709878733) of this node (1709880166)
2024-03-08 14:57:45.152 [error] emulator Discarding message {'$gen_cast',{confirm,"+",<0.556.0>}} from <0.556.0> to <0.996.0> in an old incarnation (1709878733) of this node (1709880166)
2024-03-08 14:58:43.224 [info] <0.60.0> SIGTERM received - shutting down
Soon after, RabbitMQ crashed, then kept crash-looping on every restart. The cause turned out to be failed messages being retried endlessly, due to this configuration:
spring:
  rabbitmq:
    listener:
      simple:
        acknowledge-mode: NONE
        # On exception, do not requeue the message
        default-requeue-rejected: false
        retry:
          # Retry failed messages; by default this is a blocking retry
          enabled: true
Setting retry to false fixed it. For a non-blocking alternative, see the article "springboot rabbitmq 延时消息、延迟消息、非阻塞重试机制实现":
spring:
  rabbitmq:
    listener:
      simple:
        acknowledge-mode: NONE
        # On exception, do not requeue the message
        default-requeue-rejected: false
        retry:
          # Changed to false here
          enabled: false
As an aside, RabbitMQ retries come in two kinds: publish retries (disabled by default) and consumer retries; the change above affects consumer retries. Also, both applications were using the same vhost (the root vhost), so splitting them across vhosts might have contained the problem; vhosts provide only logical isolation, though, so severe blocking on one can still affect the others. This was not verified.
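For completeness, the non-blocking retry in the referenced article is built on dead-lettering: the consumer rejects a failed message into a TTL'd retry queue, which dead-letters it back to the work queue after the delay, so no consumer thread sits blocked between attempts. A rough sketch of that topology with the plain Java client (queue names and the 10s delay are illustrative only, and it presumes an ack mode where rejects actually happen, unlike the NONE mode above):

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import java.util.HashMap;
import java.util.Map;

public class NonBlockingRetryTopology {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // assumption: broker address
        try (Connection conn = factory.newConnection();
             Channel ch = conn.createChannel()) {
            // Work queue: rejected messages dead-letter into the retry queue.
            Map<String, Object> workArgs = new HashMap<>();
            workArgs.put("x-dead-letter-exchange", ""); // default exchange
            workArgs.put("x-dead-letter-routing-key", "demo.retry");
            ch.queueDeclare("demo.work", true, false, false, workArgs);

            // Retry queue: after 10s the TTL expires and the message
            // dead-letters back into the work queue for another attempt.
            Map<String, Object> retryArgs = new HashMap<>();
            retryArgs.put("x-message-ttl", 10_000);
            retryArgs.put("x-dead-letter-exchange", "");
            retryArgs.put("x-dead-letter-routing-key", "demo.work");
            ch.queueDeclare("demo.retry", true, false, false, retryArgs);
        }
    }
}

On failure the consumer calls channel.basicReject(deliveryTag, false); requeue=false matches default-requeue-rejected: false above and lets the dead-letter routing take over.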