• KingbaseES V8R3集群管理和维护案例之---failover切换wal日志变化分析


    案例说明:
    本案例通过对KingbaseES V8R3集群failover切换过程进行观察,分析了主备库切换后wal日志的变化,对应用者了解KingbaseES V8R3(R6) failover切换过程有一定的帮助。

    以下为现场案例:
    failover切换后主备库的wal日志信息:

    新主库数据库服务启动故障:(sys_log)

    =如下所示,在sys_log中,新主库启动startup后,建立流复制,流复制的起始wal日志是:“ 00000004000000050000002A”,导致复制失败。=

    适用版本: KingbaseES V8R3/R6

    节点信息:

    集群节点状态信息:

    1. [kingbase@node102 bin]$ ./ksql -U SYSTEM -W 123456 TEST -p 9999
    2. ksql (V008R003C002B0290)
    3. Type "help" for help.
    4. TEST=# show pool_nodes;
    5. node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
    6. ---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
    7. 0 | 192.168.1.101 | 54321 | up | 0.500000 | standby | 0 | true | 0
    8. 1 | 192.168.1.102 | 54321 | up | 0.500000 | primary | 0 | false | 0
    9. (2 rows)

    一、查看failover切换前节点信息

    1、原主库wal日志

    1. [kingbase@node102 sys_xlog]$ ls -lh
    2. .......
    3. -rw------- 1 kingbase kingbase 16M Jul 7 15:25 00000009000000000000002D.partial
    4. -rw------- 1 kingbase kingbase 339 Jul 7 15:09 00000009.history
    5. -rw------- 1 kingbase kingbase 16M Jul 29 10:56 0000000A000000000000002D
    6. -rw------- 1 kingbase kingbase 16M Jul 29 16:32 0000000A000000000000002E
    7. -rw------- 1 kingbase kingbase 16M Aug 3 10:22 0000000A000000000000002F
    8. -rw------- 1 kingbase kingbase 382 Jul 7 15:25 0000000A.history

    2、原主库控制文件信息

    1. [kingbase@node102 bin]$ ./sys_controldata -D ../data
    2. sys_control version number: 830
    3. Catalog version number: 201608131
    4. Database system identifier: 7080416207291699599
    5. Database cluster state: in production
    6. sys_control last modified: Wed 03 Aug 2022 10:26:57 AM CST
    7. Latest checkpoint location: 0/2F000108
    8. Prior checkpoint location: 0/2F000028
    9. Latest checkpoint's REDO location: 0/2F0000D0
    10. Latest checkpoint's REDO WAL file: 0000000A000000000000002F
    11. Latest checkpoint's TimeLineID: 10
    12. Latest checkpoint's PrevTimeLineID: 10

    3、备库wal日志

    1. [kingbase@node101 bin]$ ls -lh ../data/sys_xlog
    2. .......
    3. -rw------- 1 kingbase kingbase 16M Jul 7 15:25 00000009000000000000002D
    4. -rw------- 1 kingbase kingbase 339 Jun 22 16:15 00000009.history
    5. -rw------- 1 kingbase kingbase 16M Jul 29 16:14 0000000A000000000000002D
    6. -rw------- 1 kingbase kingbase 16M Aug 3 10:22 0000000A000000000000002E
    7. -rw------- 1 kingbase kingbase 16M Aug 3 10:27 0000000A000000000000002F
    8. -rw------- 1 kingbase kingbase 382 Jul 29 10:33 0000000A.history

    4、备库控制文件信息

    1. [kingbase@node101 bin]$ ./sys_controldata -D ../data
    2. sys_control version number: 830
    3. Catalog version number: 201608131
    4. Database system identifier: 7080416207291699599
    5. Database cluster state: in archive recovery
    6. sys_control last modified: Wed 03 Aug 2022 10:26:55 AM CST
    7. Latest checkpoint location: 0/2F000028
    8. Prior checkpoint location: 0/2E0002C8
    9. Latest checkpoint's REDO location: 0/2F000028
    10. Latest checkpoint's REDO WAL file: 0000000A000000000000002F
    11. Latest checkpoint's TimeLineID: 10
    12. Latest checkpoint's PrevTimeLineID: 10

    二、执行failover切换(关闭主库数据库服务)

    1、关闭主库数据库服务

    1. [kingbase@node102 bin]$ ./sys_ctl stop -D ../data
    2. waiting for server to shut down....... done
    3. server stopped

    三、failover切换完成主备状态信息1、新主库wal日志

    1. [kingbase@node101 bin]$ ls -lh ../data/sys_xlog
    2. .......
    3. -rw------- 1 kingbase kingbase 339 Jun 22 16:15 00000009.history
    4. -rw------- 1 kingbase kingbase 16M Jul 29 16:14 0000000A000000000000002D
    5. -rw------- 1 kingbase kingbase 16M Aug 3 10:22 0000000A000000000000002E
    6. -rw------- 1 kingbase kingbase 16M Aug 3 10:30 0000000A000000000000002F
    7. -rw------- 1 kingbase kingbase 16M Aug 3 10:30 0000000A0000000000000030.partial
    8. -rw------- 1 kingbase kingbase 382 Jul 29 10:33 0000000A.history
    9. -rw------- 1 kingbase kingbase 16M Aug 3 10:31 0000000B0000000000000030
    10. -rw------- 1 kingbase kingbase 426 Aug 3 10:30 0000000B.history

    切换完成后timeline发生切换:

    查看timeline history文件信息:

    2、新主库控制文件信息

    1. [kingbase@node101 bin]$ ./sys_controldata -D ../data
    2. sys_control version number: 830
    3. Catalog version number: 201608131
    4. Database system identifier: 7080416207291699599
    5. Database cluster state: in production
    6. sys_control last modified: Wed 03 Aug 2022 10:35:48 AM CST
    7. Latest checkpoint location: 0/3005E110
    8. Prior checkpoint location: 0/30004BD8
    9. Latest checkpoint's REDO location: 0/3005B370
    10. Latest checkpoint's REDO WAL file: 0000000B0000000000000030
    11. Latest checkpoint's TimeLineID: 11
    12. Latest checkpoint's PrevTimeLineID: 11

    3、新备库wal日志

    1. [kingbase@node102 bin]$ ls -lh ../data/sys_xlog
    2. .......
    3. -rw------- 1 kingbase kingbase 16M Jul 29 10:56 0000000A000000000000002D
    4. -rw------- 1 kingbase kingbase 16M Jul 29 16:32 0000000A000000000000002E
    5. -rw------- 1 kingbase kingbase 16M Aug 3 10:34 0000000A000000000000002F
    6. -rw------- 1 kingbase kingbase 16M Aug 3 10:34 0000000A0000000000000030.partial
    7. -rw------- 1 kingbase kingbase 382 Aug 3 10:34 0000000A.history
    8. -rw------- 1 kingbase kingbase 16M Aug 3 10:34 0000000B0000000000000030
    9. -rw------- 1 kingbase kingbase 426 Aug 3 10:34 0000000B.history

    4、新备库控制文件信息

    1. [kingbase@node102 bin]$ ./sys_controldata -D ../data
    2. sys_control version number: 830
    3. Catalog version number: 201608131
    4. Database system identifier: 7080416207291699599
    5. Database cluster state: in archive recovery
    6. sys_control last modified: Wed 03 Aug 2022 10:35:42 AM CST
    7. Latest checkpoint location: 0/30004BD8
    8. Prior checkpoint location: 0/30004BD8
    9. Latest checkpoint's REDO location: 0/30004BA0
    10. Latest checkpoint's REDO WAL file: 0000000B0000000000000030
    11. Latest checkpoint's TimeLineID: 11
    12. Latest checkpoint's PrevTimeLineID: 11

    四、将原主库作为备库恢复到集群

    1、在原主库data下创建recovery.conf[kingbase@node102 data]$ cp ../etc/recovery.done ./recovery.conf

    2、查看recovery.log信息

    1. primary node/Im node status is changed, primary ip[192.168.1.101], recovery.conf NEED_CHANGE [0] (0 is need ), I,m status is [1] (1 is down), I will be in recovery.
    2. node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
    3. ---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
    4. 0 | 192.168.1.101 | 54321 | up | 0.500000 | primary | 0 | true | 0
    5. 1 | 192.168.1.102 | 54321 | down | 0.500000 | standby | 0 | false | 0
    6. (2 rows)
    7. if recover node up, let it down , for rewind
    8. 2022-08-03 10:34:35 sys_rewind...
    9. sys_rewind --target-data=/home/kingbase/cluster/R3HA/db/data --source-server="host=192.168.1.101 port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST"
    10. datadir_source = /home/kingbase/cluster/R3HA/db/data
    11. rewinding from last common checkpoint at 0/2F000108 on timeline 10
    12. find last common checkpoint start time from 2022-08-03 10:34:35.349563 CST to 2022-08-03 10:34:35.405349 CST, in "0.055786" seconds.
    13. reading source file list
    14. reading target file list
    15. reading WAL in target
    16. Rewind datadir file from source
    17. update the control file: minRecoveryPoint is '0/3004D0C8', minRecoveryPointTLI is '11', and database state is 'in archive recovery'
    18. rewind start wal location 0/2F0000D0 (file 0000000A000000000000002F), end wal location 0/3004D0C8 (file 0000000B0000000000000030). time from 2022-08-03 10:34:37.349563 CST to 2022-08-03 10:34:37.872586 CST, in "2.523023" seconds.
    19. Done!
    20. sed conf change #synchronous_standby_names
    21. 2022-08-03 10:34:39 file operate
    22. cp recovery.conf...
    23. change recovery.conf ip -> primary.ip
    24. 2022-08-03 10:34:39 change recovery.conf
    25. delete pid file if exist
    26. del the replication_slots if exis
    27. drop the slot [slot_node101].
    28. drop the slot [slot_node102].
    29. 2022-08-03 10:34:40 start up the kingbase...
    30. waiting for server to start....LOG: redirecting log output to logging collector process
    31. HINT: Future log output will appear in directory "/home/kingbase/cluster/R3HA/db/data/sys_log".
    32. done
    33. server started
    34. ksql "port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10" -c "select 33333;"
    35. SYS_CREATE_PHYSICAL_REPLICATION_SLOT
    36. --------------------------------------
    37. (slot_node101,)
    38. (1 row)
    39. 2022-08-03 10:34:42 create the slot [slot_node101] success.
    40. SYS_CREATE_PHYSICAL_REPLICATION_SLOT
    41. --------------------------------------
    42. (slot_node102,)
    43. (1 row)
    44. 2022-08-03 10:34:42 create the slot [slot_node102] success.
    45. 2022-08-03 10:34:42 start up standby successful!
    46. can not get the replication of myself

    如下所示:recovery过程:

    五、总结
    在集群执行failover切换时,可以结合wal日志和recovery.log和控制文件的变化,可以详细了解failover切换中wal日志的变化,及通过sys_rewind工具对原主库的恢复过程。

  • 相关阅读:
    聊聊ChatGLM-6B源码分析(二)
    链表OJ题+牛客题
    ArcGIS Pro SDK 002 对数据文件的读取和操作
    Git:Git中的分支管理
    威联通NAS安装Openwrt旁路由教程
    高频更新使用sse好还是WebSocket
    MQ~消息队列能力、AMQP协议、现有选择(Kafka、RabbitMQ、RocketMQ 、Pulsar)
    STM32——红外遥控器实验
    海康Visionmaster-环境配置:VB.Net 二次开发环境配 置方法
    从0到一开发微信小程序(1)——申请账号并安装开发环境
  • 原文地址:https://blog.csdn.net/lyu1026/article/details/126193050