• GBASE (南大通用) Database, GBase 8a Study Notes 09: Cluster Node Replacement


    I. Test environment

    Item                   Value
    CPU                    Intel® Core™ i5-1035G1 CPU @ 1.00GHz
    Operating system       CentOS Linux release 7.9.2009 (Core)
    Memory                 4 GB
    Logical cores          3
    Existing node 1 IP     192.168.142.10
    Replaced node 2 IP     192.168.142.11
    Database version       8.6.2.43-R33.132743

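    II. Simulated scenario

    The disk on node 192.168.142.11 has failed, so the node must be replaced in the cluster. We first complete steps 1 through 4 of the procedure, that is, up to marking the node unavailable and deleting its events. Once the user has replaced the disk, we run the replacement script.

    For a node replacement in production, it is advisable to stop business operations against the database.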

    III. Procedure

    1. Check the cluster status

    [root@localhost gcinstall]# gcadmin
    CLUSTER STATE:  ACTIVE
    CLUSTER MODE:   NORMAL
    
    =====================================================================
    |               GBASE COORDINATOR CLUSTER INFORMATION               |
    =====================================================================
    |   NodeName   |       IpAddress       |gcware |gcluster |DataState |
    ---------------------------------------------------------------------
    | coordinator1 |    192.168.142.10     | OPEN  |  OPEN   |    0     |
    ---------------------------------------------------------------------
    =================================================================
    |                GBASE DATA CLUSTER INFORMATION                 |
    =================================================================
    |NodeName |       IpAddress       |gnode |syncserver |DataState |
    -----------------------------------------------------------------
    |  node1  |    192.168.142.10     | OPEN |   OPEN    |    0     |
    -----------------------------------------------------------------
    |  node2  |    192.168.142.11     | OPEN |   OPEN    |    0     |
    -----------------------------------------------------------------
    [root@localhost gcinstall]# gcadmin showdistribution
    
                  Distribution ID: 5 | State: new | Total segment num: 2
    
         Primary Segment Node IP                           Segment ID         Duplicate Segment node IP
    ========================================================================================================================
    |    192.168.142.11                              |       1          |    192.168.142.10                                |
    ------------------------------------------------------------------------------------------------------------------------
    |    192.168.142.10                              |       2          |    192.168.142.11                                |
    ========================================================================================================================
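
    Before taking 192.168.142.11 out of service, it is worth confirming that every segment still has a copy on a surviving node. The sketch below is not part of GBase; it assumes the `gcadmin showdistribution` table has exactly the three-column layout printed above, and the names `parse_distribution` and `surviving_copies` are made up for illustration.

```python
import re

# Assumed row layout, taken from the output above:
# | <primary ip> | <segment id> | <duplicate ip> |
ROW = re.compile(
    r"^\s*\|\s*(\d+\.\d+\.\d+\.\d+)\s*\|\s*(\d+)\s*\|\s*(\d+\.\d+\.\d+\.\d+)\s*\|\s*$"
)

SAMPLE = """
     Primary Segment Node IP                           Segment ID         Duplicate Segment node IP
====================================================================
|    192.168.142.11                              |       1          |    192.168.142.10                                |
--------------------------------------------------------------------
|    192.168.142.10                              |       2          |    192.168.142.11                                |
====================================================================
"""

def parse_distribution(text):
    """Return a list of (segment_id, primary_ip, duplicate_ip) tuples."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            primary, seg, dup = m.groups()
            rows.append((int(seg), primary, dup))
    return rows

def surviving_copies(rows, failed_ip):
    """Map each segment ID to the IPs, other than failed_ip, that hold a copy."""
    return {
        seg: [ip for ip in (primary, dup) if ip != failed_ip]
        for seg, primary, dup in rows
    }

if __name__ == "__main__":
    copies = surviving_copies(parse_distribution(SAMPLE), "192.168.142.11")
    for seg, ips in copies.items():
        print(seg, ips or "NO SURVIVING COPY")
```

    A segment that maps to an empty list would have no copy left once the node is removed, which is the case to investigate before continuing.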
    

    2. Mark the node as unavailable

    Syntax:

    gcadmin setnodestate IP [failure| unavailable |normal]
    
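    State          Description
    normal         Normal.
    failure        Failed; subsequent events are recorded. If the server can be recovered, the state can be set back to normal and the cluster synchronizes automatically. If the node's data is confirmed lost, the state can then be set to unavailable.
    unavailable    Unavailable; the node has been definitively judged unrecoverable and its data is lost. The cluster no longer records events for it, and it must be repaired afterwards through the node replacement procedure. If you are certain that no DDL, DML, or other event-generating operations have occurred, it can also be forced back to normal.
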
    [gbase@localhost gcinstall]$ gcadmin setnodestate 192.168.142.11 unavailable
    load gbase client dll start ......
    load gbase client dll end ......
    
    after set node state into unavailable,can not set the state into normal,
    must run gcadmin replacenodes to replace this node ,after that command node state can return into normal.
    you realy want to set node state into unavailable(yes or no)?
    yes
    get node data state by ddl fevent log start ......
    get node data state by ddl fevent log end ......
    get node data state by dml fevent log start ......
    get node data state by dml fevent log end ......
    get node data state by dml storage fevent log start ......
    get node data state by dml storage fevent log end ......
    
    check data server node data state by fevent log start ......
    check data server node data state by fevent log end ......
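
    Because `unavailable` is easy to misspell and, per the prompt above, cannot simply be reverted, a small guard around the command line can help. This is a minimal sketch that only builds the argument list from the syntax shown earlier; the function name `setnodestate_cmd` is hypothetical, and the sketch does not run `gcadmin` or answer its yes/no prompt.

```python
import ipaddress

# The three states listed in the syntax line above.
VALID_STATES = ("failure", "unavailable", "normal")

def setnodestate_cmd(ip, state):
    """Build the argv for `gcadmin setnodestate IP STATE` after validating both."""
    ipaddress.ip_address(ip)  # raises ValueError for a malformed address
    if state not in VALID_STATES:
        raise ValueError(f"unknown state {state!r}; expected one of {VALID_STATES}")
    return ["gcadmin", "setnodestate", ip, state]

if __name__ == "__main__":
    print(" ".join(setnodestate_cmd("192.168.142.11", "unavailable")))
```

    Returning a list rather than a string keeps the result usable with `subprocess.run` without shell quoting, should you choose to execute it.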
    

    3. Check the cluster status

    Node 192.168.142.11 is now in the UNAVAILABLE state.

    [gbase@localhost gcinstall]$ gcadmin
    CLUSTER STATE:  ACTIVE
    CLUSTER MODE:   NORMAL
    
    =====================================================================
    |               GBASE COORDINATOR CLUSTER INFORMATION               |
    =====================================================================
    |   NodeName   |       IpAddress       |gcware |gcluster |DataState |
    ---------------------------------------------------------------------
    | coordinator1 |    192.168.142.10     | OPEN  |  OPEN   |    0     |
    ---------------------------------------------------------------------
    ========================================================================
    |                    GBASE DATA CLUSTER INFORMATION                    |
    ========================================================================
    |NodeName |       IpAddress       |    gnode    |syncserver |DataState |
    ------------------------------------------------------------------------
    |  node1  |    192.168.142.10     |    OPEN     |   OPEN    |    0     |
    ------------------------------------------------------------------------
    |  node2  |    192.168.142.11     | UNAVAILABLE |           |          |
    ------------------------------------------------------------------------
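
    Reading these tables by eye gets tedious when the check is repeated at several steps. The following sketch extracts the node rows from `gcadmin` output and lists the nodes whose first service column is not OPEN. It assumes the five-column layout shown above; `parse_gcadmin_rows` and `nodes_not_open` are illustrative names, not GBase tooling.

```python
SAMPLE = """
|NodeName |       IpAddress       |    gnode    |syncserver |DataState |
------------------------------------------------------------------------
|  node1  |    192.168.142.10     |    OPEN     |   OPEN    |    0     |
------------------------------------------------------------------------
|  node2  |    192.168.142.11     | UNAVAILABLE |           |          |
------------------------------------------------------------------------
"""

def parse_gcadmin_rows(text):
    """Return {node_name: {"ip", "states", "data_state"}} for each table row."""
    rows = {}
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # separator lines and plain text
        cells = [c.strip() for c in line[1:-1].split("|")]
        if len(cells) != 5 or cells[0] == "NodeName":
            continue  # table titles and header rows
        name, ip, first, second, data_state = cells
        rows[name] = {"ip": ip, "states": (first, second), "data_state": data_state}
    return rows

def nodes_not_open(rows):
    """Names of nodes whose first service column (gcware or gnode) is not OPEN."""
    return [name for name, row in rows.items() if row["states"][0] != "OPEN"]

if __name__ == "__main__":
    print(nodes_not_open(parse_gcadmin_rows(SAMPLE)))
```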
    

    4. Clear the failed node's events

    [gbase@localhost gcinstall]$ gcadmin rmdmlstorageevent 2 192.168.142.11
    
    [gbase@localhost gcinstall]$ gcadmin rmddlevent 2 192.168.142.11
    
    [gbase@localhost gcinstall]$ gcadmin rmdmlevent 2 192.168.142.11
    

    5. Simulate an on-site disk replacement

    On 192.168.142.11, stop the service and check whether any processes remain.

    [root@localhost opt]# service gcware stop
    
    [root@localhost opt]# ps -ef|grep gbase
    root       8271   3575  0 12:32 pts/0    00:00:00 grep --color=auto gbase
    

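    The following deletion is not needed in production; here it only simulates the data loss.
    On 192.168.142.11, delete the related files.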

    [root@localhost opt]# rm -rf gcluster/
    
    [root@localhost opt]# rm -rf gnode/
    

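    6. Run the replacement

    Note: the replacement checks the cluster's DDL lock. If a DDL statement is running, the replacement waits for it to finish. If you have confirmed that the statement can be discarded, you can kill it to speed up the node replacement.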

    [gbase@localhost gcinstall]$ ./replace.py --host=192.168.142.11 --rootPwd=qwer1234 --dbaUser=gbase --dbaUserPwd=gbase
    check os password ...
    check os password successful
    check database password ...
    check database password successful
    192.168.142.11
    Are you sure to replace install these nodes ([Y,y]/[N,n])? y
    Starting all gcluster nodes...
    get table id and set dmlstorageevent on node [::ffff:192.168.142.10], please wait a moment
    load gbase client dll start ......
    load gbase client dll end ......
    
    check node data map and cluster state start ......
    check node data map and cluster state end ......
    
    get distribution information start ......
    get distribution information end ......
    
    check ip start ......
    check ip end ......
    
    switch cluster mode into READONLY start ......
    wait all ddl statement stop ......
    
    all ddl statement stoped
    switch cluster mode into READONLY end ......
    
    delete all fevent log on replace nodes start ......
    delete ddl event log on node 192.168.142.11 start
    delete ddl event log on node 192.168.142.11 end
    delete dml event log on node 192.168.142.11 start
    delete dml event log on node 192.168.142.11 end
    delete dml storage event log on node 192.168.142.11 start
    delete dml storage event log on node 192.168.142.11 end
    delete all fevent log on replace nodes end ......
    
    sync metedata start ......
    sync coordinator metedata start ......
    sync coordinator metedata end,spend time 0 ms ......
    sync data server metedata start ......
    copy script to data node begin
    copy script to data node end
    build data packet begin
    build data packet end
    copy data packet to target node begin
    copy data packet to target node end
    extract data packet begin
    extract data packet end
    sync dataserver metedata end,spend time 20169 ms ......
    sync metedata end ......
    
    set sync data flag start ......
    create database start ......
    create database end ......
    
    create database and set table dml storage event spend time 6191 ms ......
    set sync data flag end ......
    
    restore cluster mode start ......
    restore cluster mode end ......
    
    restore node state start ......
    restore node state end ......
    
    all nodes replace success end
    replace nodes spend time: 89500 ms
    Replace gcluster nodes successfully.
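
    The output above reports how long each phase took. To compare runs, the timings can be pulled out with a short script. This is a sketch that assumes the `spend time N ms` wording and the final success line appear exactly as in this transcript; `summarize_replace_output` is a hypothetical helper name.

```python
import re

SUCCESS_MARKER = "Replace gcluster nodes successfully."
# Captures the phase label and the millisecond value from lines such as
# "sync dataserver metedata end,spend time 20169 ms ......".
TIMING = re.compile(r"^(.*?)[,\s]*spend time:?\s*(\d+)\s*ms")

SAMPLE = """\
sync coordinator metedata end,spend time 0 ms ......
sync dataserver metedata end,spend time 20169 ms ......
create database and set table dml storage event spend time 6191 ms ......
all nodes replace success end
replace nodes spend time: 89500 ms
Replace gcluster nodes successfully.
"""

def summarize_replace_output(text):
    """Return (succeeded, {phase_label: milliseconds})."""
    timings = {}
    for line in text.splitlines():
        m = TIMING.match(line.strip())
        if m:
            timings[m.group(1)] = int(m.group(2))
    return SUCCESS_MARKER in text, timings

if __name__ == "__main__":
    ok, timings = summarize_replace_output(SAMPLE)
    print("success:", ok)
    for phase, ms in timings.items():
        print(f"{phase}: {ms} ms")
```

    In this run, metadata synchronization to the data server (about 20 s) accounts for the largest reported share of the roughly 90 s total.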
    

    Viewing the replacement log

    [root@localhost gcinstall]# tail -f replace.log 
    restore cluster mode end ......
    
    restore node state start ......
    restore node state end ......
    
    all nodes replace success end
    replace nodes spend time: 89500 ms
    2022-08-14 12:35:37,566-root-INFO gcadmin replacenodes ended.
    2022-08-14 12:35:37,567-root-DEBUG kill gclusterd or gbased to started for udf(or other) in replace.
    2022-08-14 12:35:39,546-root-INFO success to kill gclusterd or gbased on 192.168.142.11
    

    7. Check the cluster status

    [root@localhost gcinstall]# gcadmin
    CLUSTER STATE:  ACTIVE
    CLUSTER MODE:   NORMAL
    
    =====================================================================
    |               GBASE COORDINATOR CLUSTER INFORMATION               |
    =====================================================================
    |   NodeName   |       IpAddress       |gcware |gcluster |DataState |
    ---------------------------------------------------------------------
    | coordinator1 |    192.168.142.10     | OPEN  |  OPEN   |    0     |
    ---------------------------------------------------------------------
    =================================================================
    |                GBASE DATA CLUSTER INFORMATION                 |
    =================================================================
    |NodeName |       IpAddress       |gnode |syncserver |DataState |
    -----------------------------------------------------------------
    |  node1  |    192.168.142.10     | OPEN |   OPEN    |    0     |
    -----------------------------------------------------------------
    |  node2  |    192.168.142.11     | OPEN |   OPEN    |    0     |
    -----------------------------------------------------------------
    

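    My cluster holds very little data, so its status returned to normal quickly.
    In practice the sequence may be:
    (1) The cluster stays in READONLY mode for a while, and the failed node is in the REPLACE state.
    (2) The cluster recovers, and the failed node is in the OFFLINE state.
    (3) The node comes online and its synchronization state is set.

    Note:
    After the replacement, remember to check that the DML and DDL event counts are steadily decreasing.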

    [root@localhost gcinstall]# gcadmin showddlevent
    Event count:0
    [root@localhost gcinstall]# gcadmin showdmlevent
    Event count:0
    [root@xdw0 ~]# gcadmin showdmlstorageevent
    Event count:0
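
    "Steadily decreasing" can be checked mechanically by sampling the three commands at intervals and comparing the counts. The sketch below assumes the `Event count:N` line format shown above; the helper names and the sample numbers in the usage lines are invented for illustration.

```python
import re

def event_count(output):
    """Extract N from a line of the form 'Event count:N'."""
    m = re.search(r"Event count:\s*(\d+)", output)
    if m is None:
        raise ValueError("no 'Event count' line found")
    return int(m.group(1))

def is_draining(samples):
    """True if the count never rises from one sample to the next."""
    return all(later <= earlier for earlier, later in zip(samples, samples[1:]))

if __name__ == "__main__":
    # Hypothetical counts collected from repeated `gcadmin showdmlevent` calls.
    history = [120, 80, 80, 0]
    print(is_draining(history), history[-1] == 0)
```

    A count that rises between samples, or that stalls above zero for a long time, suggests that synchronization to the replaced node is not keeping up.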
    
  • Original article: https://blog.csdn.net/qq_45111959/article/details/126307582