Name | Value |
---|---|
CPU | Intel® Core™ i5-1035G1 CPU @ 1.00GHz |
OS | CentOS Linux release 7.9.2009 (Core) |
Memory | 4 GB |
Logical cores | 3 |
Original node 1 IP | 192.168.142.10 |
Node 2 IP (to be replaced) | 192.168.142.11 |
Database version | 8.6.2.43-R33.132743 |
Node 192.168.142.11 has a damaged disk, so we need to replace it in the cluster. First complete steps 1 through 3 of the procedure (set the node to the unavailable state and delete its events); then, once the user has replaced the disk, we run the replacement script.
For a node replacement in production, it is recommended to stop all business traffic against the database first.
[root@localhost gcinstall]# gcadmin
CLUSTER STATE: ACTIVE
CLUSTER MODE: NORMAL
=====================================================================
| GBASE COORDINATOR CLUSTER INFORMATION |
=====================================================================
| NodeName | IpAddress |gcware |gcluster |DataState |
---------------------------------------------------------------------
| coordinator1 | 192.168.142.10 | OPEN | OPEN | 0 |
---------------------------------------------------------------------
=================================================================
| GBASE DATA CLUSTER INFORMATION |
=================================================================
|NodeName | IpAddress |gnode |syncserver |DataState |
-----------------------------------------------------------------
| node1 | 192.168.142.10 | OPEN | OPEN | 0 |
-----------------------------------------------------------------
| node2 | 192.168.142.11 | OPEN | OPEN | 0 |
-----------------------------------------------------------------
[root@localhost gcinstall]# gcadmin showdistribution
Distribution ID: 5 | State: new | Total segment num: 2
| Primary Segment Node IP | Segment ID | Duplicate Segment node IP |
========================================================================================================================
| 192.168.142.11 | 1 | 192.168.142.10 |
------------------------------------------------------------------------------------------------------------------------
| 192.168.142.10 | 2 | 192.168.142.11 |
========================================================================================================================
Command syntax:
gcadmin setnodestate IP [failure | unavailable | normal]
Name | Description |
---|---|
normal | Normal state. |
failure | Faulty. The cluster keeps recording subsequent events for the node. If the server can be restored, the state may be set back to normal and the cluster will synchronize automatically; if the node's data is confirmed lost, the state can then be set to unavailable. |
unavailable | Unavailable. The node has been confirmed unrecoverable and its data is definitely lost, so the cluster stops recording events for it; repair it afterwards via the node replacement procedure. If you are certain that no DDL, DML, or other event-generating operations have occurred, the state may also be forced back to normal. |
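For the recoverable case, the failure state is the right intermediate step; a minimal sketch, using only the syntax documented above:

```bash
# Mark the node as failed; the cluster keeps recording events for it
gcadmin setnodestate 192.168.142.11 failure
# If the server comes back intact, return it to normal and let the
# cluster replay the recorded events automatically
gcadmin setnodestate 192.168.142.11 normal
```

Since in our scenario the disk is confirmed damaged, we set the node to unavailable instead: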
[gbase@localhost gcinstall]$ gcadmin setnodestate 192.168.142.11 unavailable
load gbase client dll start ......
load gbase client dll end ......
after set node state into unavailable,can not set the state into normal,
must run gcadmin replacenodes to replace this node ,after that command node state can return into normal.
you realy want to set node state into unavailable(yes or no)?
yes
get node data state by ddl fevent log start ......
get node data state by ddl fevent log end ......
get node data state by dml fevent log start ......
get node data state by dml fevent log end ......
get node data state by dml storage fevent log start ......
get node data state by dml storage fevent log end ......
check data server node data state by fevent log start ......
check data server node data state by fevent log end ......
Node 192.168.142.11 is now in the UNAVAILABLE state.
[gbase@localhost gcinstall]$ gcadmin
CLUSTER STATE: ACTIVE
CLUSTER MODE: NORMAL
=====================================================================
| GBASE COORDINATOR CLUSTER INFORMATION |
=====================================================================
| NodeName | IpAddress |gcware |gcluster |DataState |
---------------------------------------------------------------------
| coordinator1 | 192.168.142.10 | OPEN | OPEN | 0 |
---------------------------------------------------------------------
========================================================================
| GBASE DATA CLUSTER INFORMATION |
========================================================================
|NodeName | IpAddress | gnode |syncserver |DataState |
------------------------------------------------------------------------
| node1 | 192.168.142.10 | OPEN | OPEN | 0 |
------------------------------------------------------------------------
| node2 | 192.168.142.11 | UNAVAILABLE | | |
------------------------------------------------------------------------
[gbase@localhost gcinstall]$ gcadmin rmdmlstorageevent 2 192.168.142.11
[gbase@localhost gcinstall]$ gcadmin rmddlevent 2 192.168.142.11
[gbase@localhost gcinstall]$ gcadmin rmdmlevent 2 192.168.142.11
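With the stale events removed, it is worth confirming that none remain before continuing; the same show commands used for the final check at the end of this article work here:

```bash
# All three counts should report "Event count:0"
gcadmin showddlevent
gcadmin showdmlevent
gcadmin showdmlstorageevent
```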
Stop the service on 192.168.142.11 and check whether any processes remain:
[root@localhost opt]# service gcware stop
[root@localhost opt]# ps -ef|grep gbase
root 8271 3575 0 12:32 pts/0 00:00:00 grep --color=auto gbase
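An equivalent check with pgrep (a sketch; the pattern covers the process names that appear later in the replace log, gcware, gclusterd and gbased):

```bash
# Prints the fallback message once no GBase processes remain on the node
pgrep -fl 'gcware|gclusterd|gbased' || echo "no GBase processes running"
```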
The next step is not needed in a production environment.
Delete the related files on 192.168.142.11:
[root@localhost opt]# rm -rf gcluster/
[root@localhost opt]# rm -rf gnode/
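If you would rather keep the old data around until the replacement succeeds, a safer variant is to move the directories aside instead of deleting them (a sketch, assuming /opt is the install prefix as in this lab):

```bash
# Rename rather than delete; remove the .bak trees after verifying the cluster
mv /opt/gcluster /opt/gcluster.bak.$(date +%F)
mv /opt/gnode    /opt/gnode.bak.$(date +%F)
```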
Note: the replacement operation checks the cluster DDL lock. If a DDL statement is running, the replacement waits for it to finish; if you confirm the statement can be discarded, you can kill it to speed up the node replacement.
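A hypothetical sketch of clearing a blocking DDL (assumptions: the MySQL-compatible gccli client is on the PATH, the dbaUser credentials from the replace command below are valid, and 1234 is a placeholder thread id):

```bash
# List running statements and find the long-running DDL
gccli -ugbase -pgbase -e "SHOW PROCESSLIST"
# Kill it only after confirming it is safe to discard
gccli -ugbase -pgbase -e "KILL 1234"
```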
[gbase@localhost gcinstall]$ ./replace.py --host=192.168.142.11 --rootPwd=qwer1234 --dbaUser=gbase --dbaUserPwd=gbase
check os password ...
check os password successful
check database password ...
check database password successful
192.168.142.11
Are you sure to replace install these nodes ([Y,y]/[N,n])? y
Starting all gcluster nodes...
get table id and set dmlstorageevent on node [::ffff:192.168.142.10], please wait a moment
load gbase client dll start ......
load gbase client dll end ......
check node data map and cluster state start ......
check node data map and cluster state end ......
get distribution information start ......
get distribution information end ......
check ip start ......
check ip end ......
switch cluster mode into READONLY start ......
wait all ddl statement stop ......
all ddl statement stoped
switch cluster mode into READONLY end ......
delete all fevent log on replace nodes start ......
delete ddl event log on node 192.168.142.11 start
delete ddl event log on node 192.168.142.11 end
delete dml event log on node 192.168.142.11 start
delete dml event log on node 192.168.142.11 end
delete dml storage event log on node 192.168.142.11 start
delete dml storage event log on node 192.168.142.11 end
delete all fevent log on replace nodes end ......
sync metedata start ......
sync coordinator metedata start ......
sync coordinator metedata end,spend time 0 ms ......
sync data server metedata start ......
copy script to data node begin
copy script to data node end
build data packet begin
build data packet end
copy data packet to target node begin
copy data packet to target node end
extract data packet begin
extract data packet end
sync dataserver metedata end,spend time 20169 ms ......
sync metedata end ......
set sync data flag start ......
create database start ......
create database end ......
create database and set table dml storage event spend time 6191 ms ......
set sync data flag end ......
restore cluster mode start ......
restore cluster mode end ......
restore node state start ......
restore node state end ......
all nodes replace success end
replace nodes spend time: 89500 ms
Replace gcluster nodes successfully.
Viewing the replacement log:
[root@localhost gcinstall]# tail -f replace.log
restore cluster mode end ......
restore node state start ......
restore node state end ......
all nodes replace success end
replace nodes spend time: 89500 ms
2022-08-14 12:35:37,566-root-INFO gcadmin replacenodes ended.
2022-08-14 12:35:37,567-root-DEBUG kill gclusterd or gbased to started for udf(or other) in replace.
2022-08-14 12:35:39,546-root-INFO success to kill gclusterd or gbased on 192.168.142.11
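Had the run failed, the same log would be the first place to look, e.g.:

```bash
# Scan the replacement log for errors or failures (sketch)
grep -inE 'error|fail' replace.log
```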
[root@localhost gcinstall]# gcadmin
CLUSTER STATE: ACTIVE
CLUSTER MODE: NORMAL
=====================================================================
| GBASE COORDINATOR CLUSTER INFORMATION |
=====================================================================
| NodeName | IpAddress |gcware |gcluster |DataState |
---------------------------------------------------------------------
| coordinator1 | 192.168.142.10 | OPEN | OPEN | 0 |
---------------------------------------------------------------------
=================================================================
| GBASE DATA CLUSTER INFORMATION |
=================================================================
|NodeName | IpAddress |gnode |syncserver |DataState |
-----------------------------------------------------------------
| node1 | 192.168.142.10 | OPEN | OPEN | 0 |
-----------------------------------------------------------------
| node2 | 192.168.142.11 | OPEN | OPEN | 0 |
-----------------------------------------------------------------
My cluster holds hardly any data, so the cluster state returned to normal almost immediately.
In practice you are more likely to see:
(1) The cluster stays in READONLY mode for a while, with the faulty node in the REPLACE state.
(2) The cluster recovers, with the faulty node in the OFFLINE state.
(3) The node comes back online and the synchronization state is set.
Note:
After the replacement, keep checking that the DML and DDL event counts are steadily decreasing (see the checks below).
[root@localhost gcinstall]# gcadmin showddlevent
Event count:0
[root@localhost gcinstall]# gcadmin showdmlevent
Event count:0
[root@xdw0 ~]# gcadmin showdmlstorageevent
Event count:0
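On a busy cluster these counts drain gradually rather than instantly; a simple way to keep an eye on them is a sketch using the standard watch utility:

```bash
# Refresh the three event counts every 10 seconds until they reach 0
watch -n 10 'gcadmin showddlevent; gcadmin showdmlevent; gcadmin showdmlstorageevent'
```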