案例说明:
本案例通过对KingbaseES V8R3集群failover切换过程进行观察,分析了主备库切换后wal日志的变化,对应用者了解KingbaseES V8R3(R6) failover切换过程有一定的帮助。
以下为现场案例:
failover切换后主备库的wal日志信息:
新主库数据库服务启动故障:(sys_log)
=如下所示,在sys_log中,新主库启动startup后,建立流复制,流复制的起始wal日志是:“ 00000004000000050000002A”,导致复制失败。=
适用版本: KingbaseES V8R3/R6
节点信息:
集群节点状态信息:
- [kingbase@node102 bin]$ ./ksql -U SYSTEM -W 123456 TEST -p 9999
- ksql (V008R003C002B0290)
- Type "help" for help.
-
- TEST=# show pool_nodes;
- node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
- ---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
- 0 | 192.168.1.101 | 54321 | up | 0.500000 | standby | 0 | true | 0
- 1 | 192.168.1.102 | 54321 | up | 0.500000 | primary | 0 | false | 0
- (2 rows)
一、查看failover切换前节点信息
1、原主库wal日志
- [kingbase@node102 sys_xlog]$ ls -lh
- .......
- -rw------- 1 kingbase kingbase 16M Jul 7 15:25 00000009000000000000002D.partial
- -rw------- 1 kingbase kingbase 339 Jul 7 15:09 00000009.history
- -rw------- 1 kingbase kingbase 16M Jul 29 10:56 0000000A000000000000002D
- -rw------- 1 kingbase kingbase 16M Jul 29 16:32 0000000A000000000000002E
- -rw------- 1 kingbase kingbase 16M Aug 3 10:22 0000000A000000000000002F
- -rw------- 1 kingbase kingbase 382 Jul 7 15:25 0000000A.history
2、原主库控制文件信息
- [kingbase@node102 bin]$ ./sys_controldata -D ../data
- sys_control version number: 830
- Catalog version number: 201608131
- Database system identifier: 7080416207291699599
- Database cluster state: in production
- sys_control last modified: Wed 03 Aug 2022 10:26:57 AM CST
- Latest checkpoint location: 0/2F000108
- Prior checkpoint location: 0/2F000028
- Latest checkpoint's REDO location: 0/2F0000D0
- Latest checkpoint's REDO WAL file: 0000000A000000000000002F
- Latest checkpoint's TimeLineID: 10
- Latest checkpoint's PrevTimeLineID: 10
3、备库wal日志
- [kingbase@node101 bin]$ ls -lh ../data/sys_xlog
- .......
- -rw------- 1 kingbase kingbase 16M Jul 7 15:25 00000009000000000000002D
- -rw------- 1 kingbase kingbase 339 Jun 22 16:15 00000009.history
- -rw------- 1 kingbase kingbase 16M Jul 29 16:14 0000000A000000000000002D
- -rw------- 1 kingbase kingbase 16M Aug 3 10:22 0000000A000000000000002E
- -rw------- 1 kingbase kingbase 16M Aug 3 10:27 0000000A000000000000002F
- -rw------- 1 kingbase kingbase 382 Jul 29 10:33 0000000A.history
4、备库控制文件信息
- [kingbase@node101 bin]$ ./sys_controldata -D ../data
- sys_control version number: 830
- Catalog version number: 201608131
- Database system identifier: 7080416207291699599
- Database cluster state: in archive recovery
- sys_control last modified: Wed 03 Aug 2022 10:26:55 AM CST
- Latest checkpoint location: 0/2F000028
- Prior checkpoint location: 0/2E0002C8
- Latest checkpoint's REDO location: 0/2F000028
- Latest checkpoint's REDO WAL file: 0000000A000000000000002F
- Latest checkpoint's TimeLineID: 10
- Latest checkpoint's PrevTimeLineID: 10
二、执行failover切换(关闭主库数据库服务)
1、关闭主库数据库服务
- [kingbase@node102 bin]$ ./sys_ctl stop -D ../data
- waiting for server to shut down....... done
- server stopped
三、failover切换完成主备状态信息1、新主库wal日志
- [kingbase@node101 bin]$ ls -lh ../data/sys_xlog
- .......
- -rw------- 1 kingbase kingbase 339 Jun 22 16:15 00000009.history
- -rw------- 1 kingbase kingbase 16M Jul 29 16:14 0000000A000000000000002D
- -rw------- 1 kingbase kingbase 16M Aug 3 10:22 0000000A000000000000002E
- -rw------- 1 kingbase kingbase 16M Aug 3 10:30 0000000A000000000000002F
- -rw------- 1 kingbase kingbase 16M Aug 3 10:30 0000000A0000000000000030.partial
- -rw------- 1 kingbase kingbase 382 Jul 29 10:33 0000000A.history
- -rw------- 1 kingbase kingbase 16M Aug 3 10:31 0000000B0000000000000030
- -rw------- 1 kingbase kingbase 426 Aug 3 10:30 0000000B.history
切换完成后timeline发生切换:
查看timeline history文件信息:
2、新主库控制文件信息
- [kingbase@node101 bin]$ ./sys_controldata -D ../data
- sys_control version number: 830
- Catalog version number: 201608131
- Database system identifier: 7080416207291699599
- Database cluster state: in production
- sys_control last modified: Wed 03 Aug 2022 10:35:48 AM CST
- Latest checkpoint location: 0/3005E110
- Prior checkpoint location: 0/30004BD8
- Latest checkpoint's REDO location: 0/3005B370
- Latest checkpoint's REDO WAL file: 0000000B0000000000000030
- Latest checkpoint's TimeLineID: 11
- Latest checkpoint's PrevTimeLineID: 11

3、新备库wal日志
- [kingbase@node102 bin]$ ls -lh ../data/sys_xlog
- .......
- -rw------- 1 kingbase kingbase 16M Jul 29 10:56 0000000A000000000000002D
- -rw------- 1 kingbase kingbase 16M Jul 29 16:32 0000000A000000000000002E
- -rw------- 1 kingbase kingbase 16M Aug 3 10:34 0000000A000000000000002F
- -rw------- 1 kingbase kingbase 16M Aug 3 10:34 0000000A0000000000000030.partial
- -rw------- 1 kingbase kingbase 382 Aug 3 10:34 0000000A.history
- -rw------- 1 kingbase kingbase 16M Aug 3 10:34 0000000B0000000000000030
- -rw------- 1 kingbase kingbase 426 Aug 3 10:34 0000000B.history
4、新备库控制文件信息
- [kingbase@node102 bin]$ ./sys_controldata -D ../data
- sys_control version number: 830
- Catalog version number: 201608131
- Database system identifier: 7080416207291699599
- Database cluster state: in archive recovery
- sys_control last modified: Wed 03 Aug 2022 10:35:42 AM CST
- Latest checkpoint location: 0/30004BD8
- Prior checkpoint location: 0/30004BD8
- Latest checkpoint's REDO location: 0/30004BA0
- Latest checkpoint's REDO WAL file: 0000000B0000000000000030
- Latest checkpoint's TimeLineID: 11
- Latest checkpoint's PrevTimeLineID: 11
四、将原主库作为备库恢复到集群
1、在原主库data下创建recovery.conf[kingbase@node102 data]$ cp ../etc/recovery.done ./recovery.conf
2、查看recovery.log信息
- primary node/Im node status is changed, primary ip[192.168.1.101], recovery.conf NEED_CHANGE [0] (0 is need ), I,m status is [1] (1 is down), I will be in recovery.
- node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
- ---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
- 0 | 192.168.1.101 | 54321 | up | 0.500000 | primary | 0 | true | 0
- 1 | 192.168.1.102 | 54321 | down | 0.500000 | standby | 0 | false | 0
- (2 rows)
-
- if recover node up, let it down , for rewind
- 2022-08-03 10:34:35 sys_rewind...
- sys_rewind --target-data=/home/kingbase/cluster/R3HA/db/data --source-server="host=192.168.1.101 port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST"
- datadir_source = /home/kingbase/cluster/R3HA/db/data
- rewinding from last common checkpoint at 0/2F000108 on timeline 10
- find last common checkpoint start time from 2022-08-03 10:34:35.349563 CST to 2022-08-03 10:34:35.405349 CST, in "0.055786" seconds.
- reading source file list
- reading target file list
- reading WAL in target
- Rewind datadir file from source
- update the control file: minRecoveryPoint is '0/3004D0C8', minRecoveryPointTLI is '11', and database state is 'in archive recovery'
- rewind start wal location 0/2F0000D0 (file 0000000A000000000000002F), end wal location 0/3004D0C8 (file 0000000B0000000000000030). time from 2022-08-03 10:34:37.349563 CST to 2022-08-03 10:34:37.872586 CST, in "2.523023" seconds.
- Done!
- sed conf change #synchronous_standby_names
-
- 2022-08-03 10:34:39 file operate
- cp recovery.conf...
- change recovery.conf ip -> primary.ip
- 2022-08-03 10:34:39 change recovery.conf
- delete pid file if exist
- del the replication_slots if exis
- drop the slot [slot_node101].
- drop the slot [slot_node102].
- 2022-08-03 10:34:40 start up the kingbase...
- waiting for server to start....LOG: redirecting log output to logging collector process
- HINT: Future log output will appear in directory "/home/kingbase/cluster/R3HA/db/data/sys_log".
- done
- server started
- ksql "port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10" -c "select 33333;"
- SYS_CREATE_PHYSICAL_REPLICATION_SLOT
- --------------------------------------
- (slot_node101,)
- (1 row)
-
- 2022-08-03 10:34:42 create the slot [slot_node101] success.
- SYS_CREATE_PHYSICAL_REPLICATION_SLOT
- --------------------------------------
- (slot_node102,)
- (1 row)
-
- 2022-08-03 10:34:42 create the slot [slot_node102] success.
- 2022-08-03 10:34:42 start up standby successful!
- can not get the replication of myself
如下所示:recovery过程:
五、总结
在集群执行failover切换时,可以结合wal日志和recovery.log和控制文件的变化,可以详细了解failover切换中wal日志的变化,及通过sys_rewind工具对原主库的恢复过程。
