• Manually fixing the RabbitMQ error "Crash dump is being written to"


    RabbitMQ fails to boot and reports the following error:

    2023-11-07 16:38:52.682 [error] emulator Error in process <0.368.0> on node 'rabbit@rabbitmq-0.rabbitmq-discovery.openstack.svc.cluster.local' with exit value:
    {shutdown,[{mnesia_loader,handle_exit,2,[{file,"mnesia_loader.erl"},{line,963}]},{mnesia_loader,tab_receiver,5,[{file,"mnesia_loader.erl"},{line,440}]},{mnesia_loader,spawned_receiver,8,[{file,"mnesia_loader.erl"},{line,343}]}]}
    2023-11-07 16:38:52.683 [error] emulator Error in process <0.367.0> on node 'rabbit@rabbitmq-0.rabbitmq-discovery.openstack.svc.cluster.local' with exit value:
    {badarg,[{ets,insert,[mnesia_gvar,{last_error,{{shutdown,[{mnesia_loader,handle_exit,2,[{file,"mnesia_loader.erl"},{line,963}]},{mnesia_loader,tab_receiver,5,[{file,"mnesia_loader.erl"},{line,440}]},{mnesia_loader,spawned_receiver,8,[{file,"mnesia_loader.erl"},{line,343}]}]},[{mnesia_loader,wait_on_load_complete,1,[{file,"mnesia_loader.erl"},{line,359}]},{mnesia_tm,apply_fun,3,[{file,"mnesia_tm.erl"},{line,840}]},{mnesia_tm,execute_transaction,5,[{file,"mnesia_tm.erl"},{line,816}]},{mnesia_loader,init_receiver,5,[{file,"mnesia_loader.erl"},{line,285}]},{mnesia_loader,do_get_network_copy,5,[{file,"mnesia_loader.erl"},{line,221}]},{mnesia_controller,'-load_table_fun/1-fun-4-',5,[{file,"mnesia_controller.erl"},{line,2186}]},{mnesia_controller,'-load_and_reply/2-fun-0-',2,[{file,"mnesia_controller.erl"},{line,2133}]}]}}],[]},{mnesia_lib,set,2,[{file,"mnesia_lib.erl"},{line,443}]},{mnesia_lib,fix_error,1,[{file,"mnesia_lib.erl"},{line,906}]},{mnesia_tm,return_abort,3,[{file,"mnesia_tm.erl"},{line,962}]},{mnesia_loader,init_receiver,5,[{file,"mnesia_loader.erl"},{line,285}]},{mnesia_loader,do_get_network_copy,5,[{file,"mnesia_loader.erl"},{line,221}]},{mnesia_controller,'-load_table_fun/1-fun-4-',5,[{file,"mnesia_controller.erl"},{line,2186}]},{mnesia_controller,'-load_and_reply/2-fun-0-',2,[{file,"mnesia_controller.erl"},{line,2133}]}]}
    2023-11-07 16:38:52.685 [info] <0.43.0> Application mnesia exited with reason: stopped
    2023-11-07 16:38:52.685 [info] <0.43.0> Application tools exited with reason: stopped
    2023-11-07 16:38:52.685 [error] <0.8.0> 
    Error description:
        init:do_boot/3
        init:start_em/1
        rabbit:start_it/1 line 465
        rabbit:broker_start/1 line 341
        rabbit:start_loaded_apps/2 line 586
        app_utils:manage_applications/6 line 126
        lists:foldl/3 line 1263
        rabbit:'-handle_app_error/1-fun-0-'/3 line 709
    throw:{could_not_start,ra,
           {ra,
            {{shutdown,
              {failed_to_start_child,ra_system_sup,
               {shutdown,
                {failed_to_start_child,ra_log_sup,
                 {shutdown,
                  {failed_to_start_child,ra_log_wal_sup,
                   {shutdown,
                    {failed_to_start_child,ra_log_wal,
                     {{case_clause,{ok,<<>>}},
                      [{ra_log_wal,open_existing,1,
                        [{file,"src/ra_log_wal.erl"},{line,556}]},
                       {ra_log_wal,'-recover_wal/2-lc$^0/1-0-',1,
                        [{file,"src/ra_log_wal.erl"},{line,240}]},
                       {ra_log_wal,recover_wal,2,
                        [{file,"src/ra_log_wal.erl"},{line,243}]},
                       {ra_log_wal,init,1,
                        [{file,"src/ra_log_wal.erl"},{line,186}]},
                       {gen_batch_server,init_it,6,
                        [{file,"src/gen_batch_server.erl"},{line,125}]},
                       {proc_lib,init_p_do_apply,3,
                        [{file,"proc_lib.erl"},{line,249}]}]}}}}}}}}},
             {ra_app,start,[normal,[]]}}}}
    Log file(s) (may contain more information):
       
    
    BOOT FAILED
    ===========
    
    Error description:
        init:do_boot/3
        init:start_em/1
        rabbit:start_it/1 line 465
        rabbit:broker_start/1 line 341
        rabbit:start_loaded_apps/2 line 586
        app_utils:manage_applications/6 line 126
        lists:foldl/3 line 1263
        rabbit:'-handle_app_error/1-fun-0-'/3 line 709
    throw:{could_not_start,ra,
           {ra,
            {{shutdown,
              {failed_to_start_child,ra_system_sup,
               {shutdown,
                {failed_to_start_child,ra_log_sup,
                 {shutdown,
                  {failed_to_start_child,ra_log_wal_sup,
                   {shutdown,
                    {failed_to_start_child,ra_log_wal,
                     {{case_clause,{ok,<<>>}},
                      [{ra_log_wal,open_existing,1,
                        [{file,"src/ra_log_wal.erl"},{line,556}]},
                       {ra_log_wal,'-recover_wal/2-lc$^0/1-0-',1,
                        [{file,"src/ra_log_wal.erl"},{line,240}]},
                       {ra_log_wal,recover_wal,2,
                        [{file,"src/ra_log_wal.erl"},{line,243}]},
                       {ra_log_wal,init,1,
                        [{file,"src/ra_log_wal.erl"},{line,186}]},
                       {gen_batch_server,init_it,6,
                        [{file,"src/gen_batch_server.erl"},{line,125}]},
                       {proc_lib,init_p_do_apply,3,
                        [{file,"proc_lib.erl"},{line,249}]}]}}}}}}}}},
             {ra_app,start,[normal,[]]}}}}
    Log file(s) (may contain more information):
       
    
    {"init terminating in do_boot",{could_not_start,ra,{ra,{{shutdown,{failed_to_start_child,ra_system_sup,{shutdown,{failed_to_start_child,ra_log_sup,{shutdown,{failed_to_start_child,ra_log_wal_sup,{shutdown,{failed_to_start_child,ra_log_wal,{{case_clause,{ok,<<>>}},[{ra_log_wal,open_existing,1,[{file,"src/ra_log_wal.erl"},{line,556}]},{ra_log_wal,'-recover_wal/2-lc$^0/1-0-',1,[{file,"src/ra_log_wal.erl"},{line,240}]},{ra_log_wal,recover_wal,2,[{file,"src/ra_log_wal.erl"},{line,243}]},{ra_log_wal,init,1,[{file,"src/ra_log_wal.erl"},{line,186}]},{gen_batch_server,init_it,6,[{file,"src/gen_batch_server.erl"},{line,125}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}}}}}}}}},{ra_app,start,[normal,[]]}}}}}
    init terminating in do_boot ({could_not_start,ra,{ra,{{shutdown,{_}},{ra_app,start,[_]}}}})
    
    Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done
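
To check whether another crashing pod shows the same failure, the two distinctive fragments of the trace above can be searched for in its log. This is a minimal sketch: the helper name `has_wal_boot_failure` is hypothetical, and the two patterns are simply copied from the log shown here, so other RabbitMQ versions may word the error differently.

```shell
# Succeeds (exit status 0) if the log text on stdin contains both
# fragments of the boot failure shown above: ra could not start, and the
# failure happened in ra_log_wal:open_existing.
has_wal_boot_failure() {
    log=$(cat)
    printf '%s\n' "$log" | grep -q 'could_not_start,ra' &&
    printf '%s\n' "$log" | grep -q 'ra_log_wal,open_existing'
}

# Example (not run here; pod name and namespace are the ones from this post):
#   kubectl logs rabbitmq-0 -n openstack --previous | has_wal_boot_failure && echo "same failure"
```

`--previous` asks kubectl for the log of the previous, crashed container instance, which is usually the one that contains the trace.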
    
    

    Fix:
    (1) Find the PV used by RabbitMQ, for example the one belonging to the rabbitmq-0 pod:

    # kubectl get pv | grep rabbitmq-0
    pvc-70ed48bf-bef8-4658-b530-1fd3a6ef5937   200Gi      RWO            Delete           Bound    openstack/rabbitmq-data-rabbitmq-0                                    ceph-ssd                6d17h
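
Instead of grepping the PV list, the PV name can also be read from the PVC itself. A minimal sketch, assuming the PVC name and namespace are known (here `rabbitmq-data-rabbitmq-0` in `openstack`, as in the output above); the helper name `pv_for_pvc` is hypothetical.

```shell
# Print the name of the PV bound to a PVC.
# $1: PVC name, $2: namespace
pv_for_pvc() {
    kubectl get pvc "$1" -n "$2" -o jsonpath='{.spec.volumeName}'
}

# Example (not run here):
#   pv_for_pvc rabbitmq-data-rabbitmq-0 openstack
```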
    

    (2) Look up the PV's details:

    # kubectl get pv pvc-70ed48bf-bef8-4658-b530-1fd3a6ef5937 -o yaml
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      annotations:
        kubernetes.io/createdby: rbd-dynamic-provisioner
        pv.kubernetes.io/bound-by-controller: "yes"
        pv.kubernetes.io/provisioned-by: kubernetes.io/rbd
      creationTimestamp: "2023-10-31T15:40:59Z"
      finalizers:
      - kubernetes.io/pv-protection
      name: pvc-70ed48bf-bef8-4658-b530-1fd3a6ef5937
      resourceVersion: "7552"
      uid: 6848417a-dd4f-430c-85e5-f3234a1ac6bf
    spec:
      accessModes:
      - ReadWriteOnce
      capacity:
        storage: 200Gi
      claimRef:
        apiVersion: v1
        kind: PersistentVolumeClaim
        name: rabbitmq-data-rabbitmq-0
        namespace: openstack
        resourceVersion: "4704"
        uid: 70ed48bf-bef8-4658-b530-1fd3a6ef5937
      persistentVolumeReclaimPolicy: Delete
      rbd:
        image: kubernetes-dynamic-pvc-c8a3585f-dc7b-438c-a22e-cca9d84c341f
        keyring: /etc/ceph/keyring
        monitors:
        - ceph-mon.ceph.svc.cluster.local:6789
        pool: ssdpool
        secretRef:
          name: pvc-ceph-client-key
        user: admin
      storageClassName: ceph-ssd
      volumeMode: Filesystem
    status:
      phase: Bound
    

    The piece of information needed is the RBD image name:

        image: kubernetes-dynamic-pvc-c8a3585f-dc7b-438c-a22e-cca9d84c341f
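
The image name can be pulled out directly rather than read from the full YAML. This sketch assumes the PV uses the in-tree `rbd` volume source, as in the YAML above; a CSI-provisioned volume keeps this information under a different field. The helper name `rbd_image_for_pv` is hypothetical.

```shell
# Print the RBD image name of a PV that uses the in-tree rbd volume source.
# $1: PV name
rbd_image_for_pv() {
    kubectl get pv "$1" -o jsonpath='{.spec.rbd.image}'
}

# Example (not run here):
#   rbd_image_for_pv pvc-70ed48bf-bef8-4658-b530-1fd3a6ef5937
```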
    

    (3) On the node running the pod, find the block device that the image is mapped to:

    # ssh node-2 rbd showmapped | grep kubernetes-dynamic-pvc-c8a3585f-dc7b-438c-a22e-cca9d84c341f
    0  ssdpool           kubernetes-dynamic-pvc-c8a3585f-dc7b-438c-a22e-cca9d84c341f -    /dev/rbd0  
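
When scripting this step, the device path can be extracted from the `rbd showmapped` output instead of being read off by eye. A minimal sketch, assuming the device is the last column as in the output above; the helper name `device_for_image` is hypothetical.

```shell
# Read `rbd showmapped` output on stdin and print the device of an image.
# Assumes the device path is the last column of the matching row.
# $1: RBD image name
device_for_image() {
    awk -v img="$1" '{ for (i = 1; i < NF; i++) if ($i == img) { print $NF; exit } }'
}

# Example (not run here; $image holds the image name from step (2)):
#   ssh node-2 rbd showmapped | device_for_image "$image"
```

Matching on a whole field rather than with `grep` avoids accidental hits on images whose names merely contain the one being searched for.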
    

    (4) Find where the device is mounted:

    # ssh node-2 mount | grep rbd0
    /dev/rbd0 on /var/lib/kubelet/plugins/kubernetes.io/rbd/mounts/ssdpool-image-kubernetes-dynamic-pvc-c8a3585f-dc7b-438c-a22e-cca9d84c341f type ext4 (rw,relatime,stripe=1024)
    /dev/rbd0 on /var/lib/kubelet/pods/3a37e264-4fd5-4cb8-844b-6b6cd4a6859c/volumes/kubernetes.io~rbd/pvc-70ed48bf-bef8-4658-b530-1fd3a6ef5937 type ext4 (rw,relatime,stripe=1024)
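
Of the two mount points, the one under `/var/lib/kubelet/pods/` is the path used in the next step. A small sketch that picks it out of the `mount` output, assuming the `<device> on <path> type ...` layout shown above and a path without spaces; the helper name `pod_mount_for_device` is hypothetical.

```shell
# Read `mount` output on stdin and print the pod-volume mount point
# (the one under .../pods/) of the given device.
# $1: device path, e.g. /dev/rbd0
pod_mount_for_device() {
    awk -v dev="$1" '$1 == dev && $2 == "on" && $3 ~ /\/pods\// { print $3; exit }'
}

# Example (not run here):
#   ssh node-2 mount | pod_mount_for_device /dev/rbd0
```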
    

    (5) Locate the WAL file, searching under the pod volume path from step (4):

    # ssh node-2 find /var/lib/kubelet/pods/3a37e264-4fd5-4cb8-844b-6b6cd4a6859c/volumes/kubernetes.io~rbd/pvc-70ed48bf-bef8-4658-b530-1fd3a6ef5937 -name "*.wal"
    /var/lib/kubelet/pods/3a37e264-4fd5-4cb8-844b-6b6cd4a6859c/volumes/kubernetes.io~rbd/pvc-70ed48bf-bef8-4658-b530-1fd3a6ef5937/mnesia/rabbit@rabbitmq-0.rabbitmq-discovery.openstack.svc.cluster.local/quorum/rabbit@rabbitmq-0.rabbitmq-discovery.openstack.svc.cluster.local/00000025.wal
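
The `{case_clause,{ok,<<>>}}` term in the trace suggests that `ra_log_wal:open_existing` read an existing WAL file back as an empty binary, meaning the file is probably zero bytes long. That is an inference from the trace, not something the log states outright, so it is worth confirming before deleting anything. A minimal sketch, to be run on the node that holds the mount; the helper name `find_empty_wal` is hypothetical.

```shell
# List *.wal files under a directory that are exactly zero bytes long.
# $1: directory to search, e.g. the pod volume path from step (4)
find_empty_wal() {
    find "$1" -type f -name '*.wal' -size 0
}
```

If the file found in this step shows up in the output, deleting it discards no log entries, since it holds none.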
    

    (6) Delete the WAL file.
    Proceed with caution at this step; back up the file before deleting it.

    # ssh node-2 rm -rf /var/lib/kubelet/pods/3a37e264-4fd5-4cb8-844b-6b6cd4a6859c/volumes/kubernetes.io~rbd/pvc-70ed48bf-bef8-4658-b530-1fd3a6ef5937/mnesia/rabbit@rabbitmq-0.rabbitmq-discovery.openstack.svc.cluster.local/quorum/rabbit@rabbitmq-0.rabbitmq-discovery.openstack.svc.cluster.local/00000025.wal
    Warning: Permanently added 'node-2' (ED25519) to the list of known hosts.
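
The backup recommended for this step can be folded into the deletion itself, so the original is only removed once a copy exists. A minimal sketch, to be run on the node that holds the mount; the helper name `backup_and_remove_wal` and the choice of backup directory are assumptions, not part of the original procedure.

```shell
# Copy a WAL file into a backup directory, then delete the original.
# The original is removed only if the copy succeeded.
# $1: path to the .wal file
# $2: backup directory (created if missing)
backup_and_remove_wal() {
    mkdir -p "$2" &&
    cp -p "$1" "$2/" &&
    rm -f "$1"
}
```

Keeping the backup directory outside the RabbitMQ data volume avoids leaving a stray copy inside the broker's own directory tree.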
    

    (7) Delete the pod so that it is started again:

    # kubectl delete pods rabbitmq-0 -n openstack 
    pod "rabbitmq-0" deleted
    

    Wait for the pod to start again; after a while the data is resynchronized and the node recovers.
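
Rather than polling by hand, the wait and a follow-up health check can be scripted. A minimal sketch, assuming `rabbitmqctl` is available inside the container; the helper name `wait_and_check` is hypothetical. If the replacement pod object has not been created yet, `kubectl wait` may fail with a not-found error, in which case simply run it again.

```shell
# Block until the pod reports Ready, then print the RabbitMQ cluster status.
# $1: pod name, $2: namespace, $3: timeout, e.g. 300s
wait_and_check() {
    kubectl wait --for=condition=Ready "pod/$1" -n "$2" --timeout="$3" &&
    kubectl exec "$1" -n "$2" -- rabbitmqctl cluster_status
}

# Example (not run here):
#   wait_and_check rabbitmq-0 openstack 300s
```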

  • Original article: https://blog.csdn.net/Hello_NB1/article/details/134271012