• [Experience Sharing] Building an openGauss Disaster Recovery Cluster


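    A code walkthrough of the gs_sdr command

    Background

    openGauss has introduced a disaster recovery architecture. Whereas the earlier architecture was primary/standby replication within a single cluster, the disaster recovery architecture synchronizes data between two clusters. To understand how it works in more depth, this article reads through the code behind the gs_sdr command and follows the operations it performs.

    1. The disaster recovery setup procedure is described at: https://www.modb.pro/db/628767

    2. The VS Code debugging configuration is described at: https://www.modb.pro/db/658344

    3. These are personal study notes, so my understanding may not be entirely correct. If you spot a mistake, please point it out so we can discuss it.

    Environment preparation

    Install the clusters

    Install two clusters, each with 2 nodes. Their details are as follows:

    Cluster 1 information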
    omm@pghost2 ~$ cm_ctl query -Cvid
    [ CMServer State ]

    node node_ip instance state
    ---------------------------------------------------------------------
    1 pghost2 192.168.56.20 1 /app/ogdata/data/cm/cm_server Primary
    2 pghost3 192.168.56.30 2 /app/ogdata/data/cm/cm_server Standby

    [ Cluster State ]

    cluster_state : Normal
    redistributing : No
    balanced : Yes
    current_az : AZ_ALL

    [ Datanode State ]

    node node_ip instance state | node node_ip instance state
    ------------------------------------------------------------------------------------------------------------------------------------------------
    1 pghost2 192.168.56.20 6001 /app/ogdata/data/dn1 P Primary Normal | 2 pghost3 192.168.56.30 6002 /app/ogdata/data/dn1 S Standby Normal
    Cluster 2 information
    omm@pghost5 ~$ cm_ctl query -Cvid
    [ CMServer State ]

    node node_ip instance state
    ---------------------------------------------------------------------
    1 pghost5 192.168.56.50 1 /app/ogdata/data/cm/cm_server Primary
    2 pghost6 192.168.56.60 2 /app/ogdata/data/cm/cm_server Standby

    [ Cluster State ]

    cluster_state : Normal
    redistributing : No
    balanced : Yes
    current_az : AZ_ALL

    [ Datanode State ]

    node node_ip instance state | node node_ip instance state
    ------------------------------------------------------------------------------------------------------------------------------------------------
    1 pghost5 192.168.56.50 6001 /app/ogdata/data/dn1 P Primary Normal | 2 pghost6 192.168.56.60 6002 /app/ogdata/data/dn1 S Standby Normal
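Both clusters should report `cluster_state : Normal` before disaster recovery is configured. As a minimal sketch, assuming the `key : value` layout of the `[ Cluster State ]` block shown above, that check could be scripted as follows. `parse_cluster_state` is a hypothetical helper written for this article, not part of the openGauss tooling.

```python
def parse_cluster_state(output: str) -> dict:
    """Collect the `key : value` pairs that follow the [ Cluster State ] header."""
    state = {}
    in_section = False
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # Any bracketed header either opens or closes the section we want.
            in_section = line == "[ Cluster State ]"
            continue
        if in_section and ":" in line:
            key, _, value = line.partition(":")
            state[key.strip()] = value.strip()
    return state


# Sample text copied from the cm_ctl output above (hypothetical variable name).
sample = """
[ Cluster State ]

cluster_state : Normal
redistributing : No
balanced : Yes
current_az : AZ_ALL

[ Datanode State ]
"""

state = parse_cluster_state(sample)
print(state["cluster_state"])
```

In practice the text would come from running `cm_ctl query -Cvid` on each cluster; here it is inlined so the sketch runs on its own.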

    Create the disaster recovery user

    Create the disaster recovery user on cluster 1:

    gsql -d postgres -p 26000 -c "create user dr_user with replication password 'oracle_4U';"

    Modify the XML configuration

    Modify cluster 1

    The modified XML configuration is as follows:

    [XML configuration not preserved]
    Modify cluster 2

    The modified XML configuration is as follows:

    [XML configuration not preserved]

    Configure disaster recovery

    Start cluster 1 as the primary cluster

    The command used is:

    # gs_sdr -t start -m primary -X XMLFILE [-U DR_USERNAME [-W DR_PASSWORD]] [--time-out=SECS]
    gs_sdr -t start -m primary -X /home/omm/single.xml -U dr_user -W oracle_4U --time-out=86400
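The usage line above can be turned into an argument list programmatically, which is handy when the same values have to be pasted into a debugger configuration. The following is a minimal sketch: `build_start_command` is a hypothetical helper, and the two accepted mode values are simply the ones used in this article.

```python
import shlex


def build_start_command(mode, xml_file, user=None, password=None, timeout=None):
    """Assemble the argv for `gs_sdr -t start` following the usage line above."""
    if mode not in ("primary", "disaster_standby"):
        raise ValueError(f"unexpected mode: {mode}")
    argv = ["gs_sdr", "-t", "start", "-m", mode, "-X", xml_file]
    if user is not None:
        argv += ["-U", user]
        # The usage line nests -W inside the -U bracket, so a password
        # is only passed together with a user name.
        if password is not None:
            argv += ["-W", password]
    if timeout is not None:
        argv.append(f"--time-out={timeout}")
    return argv


argv = build_start_command("primary", "/home/omm/single.xml",
                           user="dr_user", password="oracle_4U", timeout=86400)
print(shlex.join(argv))
```

Everything after the program name (`argv[1:]`) is the list that a debugger launch configuration would take as its arguments.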

    VS Code debug configuration

    {
        "version": "0.2.0",
        "configurations": [
            {
                "name": "Python: Current File",
                "type": "python",
                "request": "launch",
                "program": "${file}",
                "console": "integratedTerminal",
                "justMyCode": true,
                "args": ["-t","start","-m","primary","-X","/home/omm/single.xml","-U","dr_user","-W","oracle_4U","--time-out=86400"]
            }
        ]
    }

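    Set a breakpoint in the main function of the gs_sdr script.

    Reading the code

    Check whether the operation is run as root
    if os.getuid() == 0:
        # Running as root: report an error and exit immediately
        GaussLog.exitWithError(ErrorCode.GAUSS_501["GAUSS_50105"])
    Initialize StreamingDisasterRecoveryBase
    base = StreamingDisasterRecoveryBase() # loads the relevant information from the cluster XML configuration file

    The information stored in base is shown in the figure below:

    [Figure: 86ccfc766e49c4ae277aa1df484da647.jpeg]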

    Determine which operation to perform
    handler = HANDLER_MAPPING[base.params.task](base.params, base.user, base.logger, base.trace_id, base.log_file)
    # HANDLER_MAPPING covers 5 operations:
    HANDLER_MAPPING = {
        "start": StreamingStartHandler,  # these keys appear to correspond to the moduleName value in the figure above
        "stop": StreamingStopHandler,
        "switchover": StreamingSwitchoverHandler,
        "failover": StreamingFailoverHandler,
        "query": StreamingQueryHandler
    }
    # Here base.params.task is "start", which maps to the class StreamingStartHandler, defined in streaming_diaster_recovery_start.py
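The lookup above is a plain dictionary dispatch: the task name selects a class, and the class is then instantiated with the shared parameters. A minimal sketch of the same pattern follows. `StartHandler`, `QueryHandler`, `HANDLERS` and `make_handler` are hypothetical stand-ins with a simplified constructor, not the real openGauss classes.

```python
class StartHandler:
    """Hypothetical stand-in for StreamingStartHandler."""

    def __init__(self, params):
        self.params = params

    def run(self):
        return f"start as {self.params['mode']}"


class QueryHandler:
    """Hypothetical stand-in for StreamingQueryHandler."""

    def __init__(self, params):
        self.params = params

    def run(self):
        return "query"


# Task name -> handler class; the class is looked up first, then instantiated.
HANDLERS = {"start": StartHandler, "query": QueryHandler}


def make_handler(params):
    task = params["task"]
    if task not in HANDLERS:
        raise ValueError(f"unsupported task: {task}")
    return HANDLERS[task](params)


handler = make_handler({"task": "start", "mode": "primary"})
print(handler.run())
```

Adding a new operation in this style only requires a new class and one more dictionary entry; the calling code does not change.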
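    Create the lock file

    Because setting up disaster recovery involves data synchronization and takes a long time, the lock file is presumably there to prevent the same operation from being started more than once.

    handler.handle_lock_file(handler.trace_id, 'create') # this method is defined in streaming_base.py
    # This creates a file: '/app/opengauss/tmp/streaming_lock_cd7eef1a2c1f11ee92b208002716c96f'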
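To make the lock-file idea concrete, here is a minimal local sketch. The file name pattern `streaming_lock_<trace id>` comes from the path above; everything else is an assumption: the function below only borrows the name `handle_lock_file`, takes an extra directory argument, and is not the implementation in streaming_base.py. Using `uuid.uuid1().hex` for the trace id is also a guess based on the 32-character hex suffix.

```python
import os
import tempfile
import uuid

LOCK_PREFIX = "streaming_lock_"


def handle_lock_file(tmp_dir, trace_id, action):
    """Create or remove <tmp_dir>/streaming_lock_<trace_id> and return its path."""
    path = os.path.join(tmp_dir, LOCK_PREFIX + trace_id)
    if action == "create":
        # O_EXCL makes the call fail if this exact lock file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
        os.close(fd)
    elif action == "remove":
        if os.path.exists(path):
            os.remove(path)
    else:
        raise ValueError(f"unknown action: {action}")
    return path


trace_id = uuid.uuid1().hex  # assumption: 32 hex characters, like the id above
with tempfile.TemporaryDirectory() as tmp_dir:
    lock_path = handle_lock_file(tmp_dir, trace_id, "create")
    created = os.path.exists(lock_path)
    handle_lock_file(tmp_dir, trace_id, "remove")
    removed = not os.path.exists(lock_path)

print(os.path.basename(lock_path), created, removed)
```

A temporary directory is used so the sketch does not touch a real openGauss installation.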
    Check whether another gs_sdr operation is running
    if base.params.task in StreamingConstants.TASK_EXIST_CHECK:
        handler.check_streaming_process_is_running()  # if another one is running, abort this operation
    # The check mainly relies on this command: 'source /home/omm/.bashrc && pssh -t 10 -H pghost2 -H pghost3 "ls /app/opengauss/tmp/streaming_lock_*"'
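The real check lists `streaming_lock_*` on every node through pssh. A single-node sketch of the same idea, using a glob over a local directory, might look like this. `other_lock_files` is a hypothetical helper, and ignoring the lock that belongs to the current run is an assumption about the intended behavior.

```python
import glob
import os
import tempfile

LOCK_PREFIX = "streaming_lock_"


def other_lock_files(tmp_dir, own_trace_id):
    """Return lock files in tmp_dir that belong to a different trace id."""
    own_name = LOCK_PREFIX + own_trace_id
    pattern = os.path.join(tmp_dir, LOCK_PREFIX + "*")
    return sorted(p for p in glob.glob(pattern)
                  if os.path.basename(p) != own_name)


with tempfile.TemporaryDirectory() as tmp_dir:
    # Two fake lock files: "aaaa" is the current run, "bbbb" is a competing one.
    for fake_id in ("aaaa", "bbbb"):
        open(os.path.join(tmp_dir, LOCK_PREFIX + fake_id), "w").close()
    conflicts = [os.path.basename(p) for p in other_lock_files(tmp_dir, "aaaa")]

print(conflicts)
```

A non-empty result would mean another operation holds a lock, which is the condition under which the current operation is aborted.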
    Execute the operation
    Progress-recording operations
    handler.run()
    self.logger.log("Start create streaming disaster relationship.")
    # Creates the progress directory /app/opengauss/tmp/streaming_cabin (on every node)
    # Progress file: '.streaming_switchover_primary.step'
    ## All progress file names:
    STREAMING_STEP_FILES = {
        "start_primary": ".streaming_start_primary.step",
        "start_standby": ".streaming_start_standby.step",
        "stop": ".streaming_stop.step",
        "switchover_primary": ".streaming_switchover_primary.step",
        "switchover_standby": ".streaming_switchover_standby.step",
        "failover": ".streaming_failover.step",
        "query": ".streaming_query.step",
    }
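A step file of this kind is just a small text file that records how far an operation has progressed. The sketch below shows one way to write and read such a file. `write_step` and `read_step` are hypothetical helpers, the step label is a sample value, and whether the real tool resumes from this file in the same way is not something this sketch claims.

```python
import os
import tempfile

STEP_FILE = ".streaming_start_primary.step"  # name taken from the table above


def write_step(cabin_dir, step):
    """Record the current step label, creating the directory if needed."""
    os.makedirs(cabin_dir, exist_ok=True)
    with open(os.path.join(cabin_dir, STEP_FILE), "w") as fh:
        fh.write(step + "\n")


def read_step(cabin_dir):
    """Return the recorded step label, or None if nothing was recorded yet."""
    path = os.path.join(cabin_dir, STEP_FILE)
    if not os.path.exists(path):
        return None
    with open(path) as fh:
        return fh.read().strip()


with tempfile.TemporaryDirectory() as tmp_dir:
    cabin = os.path.join(tmp_dir, "streaming_cabin")
    before = read_step(cabin)
    write_step(cabin, "2_check_cluster_step")  # sample step label
    after = read_step(cabin)

print(before, after)
```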
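    Check the cluster state
    # Command used to check the cluster state
    'source /home/omm/.bashrc ; gs_om -t status --all > /app/opengauss/tmp/streaming_cabin/cluster_state_tmp'
    Check whether the executing node is the primary node

    The operation must be run on the primary node.

    Generate the key_name.key.cipher & key_name.key.rand files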
    export LD_LIBRARY_PATH=/app/opengauss/tool/script/gspylib/clib && source /home/omm/.bashrc && gs_guc generate -S default -o hadr -D '/app/opengauss/app/2.0.1_46134f73/bin' && /bin/chmod 600 /app/opengauss/app/2.0.1_46134f73/bin/hadr.key.cipher && /bin/chmod 600 /app/opengauss/app/2.0.1_46134f73/bin/hadr.key.rand
    # The generated files are then distributed to the other nodes in the cluster.
    Save the hadr information to the database
    ALTER GLOBAL CONFIGURATION with(hadr_user_info ='O1hnmUERtm2hfiXGjKjgaCfKq89IgdSzUqCoMGw/yzdaYki1LYTfhHlILmz10IvDTX9fqGNZrcmdX5NmkK+6bw==');
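    Check whether a main standby node already exists

    This determines whether the environment is already a disaster recovery environment.

    Check whether CM is present

    A disaster recovery environment requires the CM component.

    Check whether an upgrade is in progress
    # Checks whether /app/opengauss/tmp/binary_upgrade exists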
    Write the progress file
    $ more /app/opengauss/tmp/streaming_cabin/.streaming_start_primary.step
    2_check_cluster_step
    common_step_for_streaming_start
    # Generates the disaster recovery relationship JSON file and distributes it to the other nodes in the cluster.
    more /app/opengauss/tmp/streaming_cabin/cluster_conf_record
    {"remoteClusterConf": {"port": 26500, "shards": [[{"ip": "192.168.56.50", "dataIp": "192.168.56.50"}, {"ip": "192.168.56.60", "dataIp": "192.168.56.60"}]]}, "localClusterConf": {"port": 26000, "shards": [[{"ip": "192.168.56.20", "dataIp": "192.168.56.20"}, {"ip": "192.168.56.30", "dataIp": "192.168.56.30"}]]}}
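The structure of cluster_conf_record is regular enough to rebuild from a port and a list of node IPs per cluster. The sketch below does that with the values from this environment. `build_cluster_conf` is a hypothetical helper, and it assumes a single shard per cluster with `dataIp` equal to `ip`, as in the record shown above.

```python
import json


def shard(ips):
    """One shard is a list of nodes; here dataIp is assumed to equal ip."""
    return [{"ip": ip, "dataIp": ip} for ip in ips]


def build_cluster_conf(local_port, local_ips, remote_port, remote_ips):
    """Build the local/remote cluster description as a plain dictionary."""
    return {
        "remoteClusterConf": {"port": remote_port, "shards": [shard(remote_ips)]},
        "localClusterConf": {"port": local_port, "shards": [shard(local_ips)]},
    }


record = build_cluster_conf(
    local_port=26000, local_ips=["192.168.56.20", "192.168.56.30"],
    remote_port=26500, remote_ips=["192.168.56.50", "192.168.56.60"],
)
text = json.dumps(record)
print(text)
```

Reading the file back with `json.loads` gives the same dictionary, so the remote node list can be taken from `record["remoteClusterConf"]["shards"][0]`.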
    Modify the pg_hba configuration
    # Copies /home/omm/single.xml to /app/opengauss/tmp/streaming_cabin/streaming_config.xml
    source /home/omm/.bashrc; python3 '/app/opengauss/tool/script/local/ConfigHba.py' -U omm -X '/app/opengauss/tmp/streaming_cabin/streaming_config.xml' --try-reload
    # The following entries are added to pg_hba.conf:
    host all omm 192.168.56.50/32 trust
    host all omm 192.168.56.60/32 trust
    host replication all 192.168.0.0/16 sha256
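The two trust entries follow one pattern per remote node, so they can be derived from the remote cluster's IP list. The following is a minimal sketch, not what ConfigHba.py does internally. `hba_trust_entries` is a hypothetical helper and handles IPv4 addresses only.

```python
import ipaddress


def hba_trust_entries(user, remote_ips):
    """Build one `host all <user> <ip>/32 trust` line per remote node."""
    entries = []
    for ip in remote_ips:
        addr = ipaddress.IPv4Address(ip)  # raises ValueError on a malformed address
        entries.append(f"host all {user} {addr}/32 trust")
    return entries


remote_ips = ["192.168.56.50", "192.168.56.60"]
entries = hba_trust_entries("omm", remote_ips)

# The replication rule above uses 192.168.0.0/16; confirm the remote nodes fall inside it.
replication_net = ipaddress.ip_network("192.168.0.0/16")
covered = all(ipaddress.ip_address(ip) in replication_net for ip in remote_ips)

print("\n".join(entries))
print(covered)
```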
    Set the replconninfo replication parameters
    'source /home/omm/.bashrc; pssh -H pghost3 \'source /home/omm/.bashrc; gs_guc check -Z datanode -D /app/ogdata/data/dn1 -c "replconninfo1"\''

    'source /home/omm/.bashrc; pssh -H pghost3 \'source /home/omm/.bashrc; gs_guc check -Z datanode -D /app/ogdata/data/dn1 -c "replconninfo2"\''

    'source /home/omm/.bashrc; pssh -H pghost3 "source /home/omm/.bashrc ; gs_guc reload -Z datanode -D /app/ogdata/data/dn1 -c \\"replconninfo1 = \'localhost=192.168.56.30 localport=26001 localheartbeatport=26005 localservice=26004 remotehost=192.168.56.20 remoteport=26001 remoteheartbeatport=26005 remoteservice=26004 iscascade=true iscrossregion=false\'\\""'
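The replconninfo value is a space-separated list of `key=value` fields. A minimal sketch of assembling it is shown below. `build_replconninfo` is a hypothetical helper, and it assumes the local and remote sides use the same port numbers, which is the case in the command above.

```python
def build_replconninfo(local_host, remote_host, ha_port, heartbeat_port,
                       service_port, iscascade=False, iscrossregion=False):
    """Join the replication connection fields in the order used above."""
    fields = [
        ("localhost", local_host),
        ("localport", ha_port),
        ("localheartbeatport", heartbeat_port),
        ("localservice", service_port),
        ("remotehost", remote_host),
        ("remoteport", ha_port),
        ("remoteheartbeatport", heartbeat_port),
        ("remoteservice", service_port),
        # Booleans are written in lowercase, as in the command above.
        ("iscascade", str(iscascade).lower()),
        ("iscrossregion", str(iscrossregion).lower()),
    ]
    return " ".join(f"{key}={value}" for key, value in fields)


conninfo = build_replconninfo("192.168.56.30", "192.168.56.20",
                              ha_port=26001, heartbeat_port=26005,
                              service_port=26004, iscascade=True)
print(conninfo)
```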
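    Wait for the main standby to connect

    Waiting for the main standby connection.

    At this point, the following command needs to be run on the standby cluster:

    gs_sdr -t start -m disaster_standby -U dr_user -W oracle_4U -X /home/omm/single.xml --time-out=86400 # For convenience this was run directly in a terminal rather than under the debugger.

    Start cluster 2 as the standby cluster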

    gs_sdr -t start -m disaster_standby -U dr_user -W oracle_4U -X /home/omm/single.xml --time-out=86400

    VS Code debug configuration

    {
        "version": "0.2.0",
        "configurations": [
            {
                "name": "gs_sdr",
                "type": "python",
                "request": "launch",
                "program": "${file}",
                "console": "integratedTerminal",
                "justMyCode": true,
                "args": ["-t","start","-m","disaster_standby","-X","/home/omm/single.xml","-U","dr_user","-W","oracle_4U","--time-out=86400"]
            }
        ]
    }

    Handler module executed: streaming_diaster_recovery_start

    Reading the code

    Start build key files from remote cluster

    The standby cluster performs a build, which is fairly slow (it depends heavily on the network environment and the database size).

    source /home/omm/.bashrc; /app/opengauss/app/2.0.1/bin/gs_ctl build -D /app/ogdata/data/dn1 -M standby -b copy_secure_files -U dr_user -P *** -C "localhost=192.168.56.50 localport=26001 remotehost=192.168.56.20 remoteport=26501"

    source /home/omm/.bashrc; /app/opengauss/app/2.0.1/bin/gs_ctl build -D /app/ogdata/data/dn1 -M standby -b copy_secure_files -U dr_user -P *** -C "localhost=192.168.56.50 localport=26001 remotehost=192.168.56.30 remoteport=26501"

    echo *** /home/omm/.bashrc; /app/opengauss/app/2.0.1/bin/gs_ctl build -D /app/ogdata/data/dn1 -M standby -b copy_secure_files -U dr_user -P *** -C 'localhost=192.168.56.60 localport=26001 remotehost=192.168.56.20 remoteport=26501'" | pssh -s -H pghost6'
    copy file from data dir to streaming dir
    # Node 1
    echo "if [ -d \'/app/ogdata/data/dn1/gs_secure_files\' ];then source /home/omm/.bashrc && pscp --trace-id 9f2c898e2c5a11ee850c080027fd3332 -H pghost5 \'/app/ogdata/data/dn1/gs_secure_files\' \'/app/opengauss/tmp/streaming_cabin\' && rm -rf \'/app/ogdata/data/dn1/gs_secure_files\';fi" | pssh -s -H pghost5
    # Node 2
    echo "if [ -d \'/app/ogdata/data/dn1/gs_secure_files\' ];then source /home/omm/.bashrc && pscp --trace-id 9f2c898e2c5a11ee850c080027fd3332 -H pghost5 \'/app/ogdata/data/dn1/gs_secure_files\' \'/app/opengauss/tmp/streaming_cabin\' && rm -rf \'/app/ogdata/data/dn1/gs_secure_files\';fi" | pssh -s -H pghost6
    check cluster user consistency

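    This mainly checks whether the version and the version commit number are consistent.

    Check whether the installation user is consistent

    Set the cluster run mode stream_cluster_run_mode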
    source /home/omm/.bashrc && gs_guc set -Z datanode -N all -I all -c "stream_cluster_run_mode = \'cluster_standby\'"

    source /home/omm/.bashrc && gs_guc set -Z coordinator -N all -I all -c "stream_cluster_run_mode = \'cluster_standby\'"
    Stop the standby cluster
    '/app/opengauss/app/2.0.1_46134f73/bin/cluster_static_config'
    Build the cluster again
    source /home/omm/.bashrc; /app/opengauss/app/2.0.1_46134f73/bin/gs_ctl start -D /app/ogdata/data/dn1 -M hadr_main_standby

    echo *** /home/omm/.bashrc; /app/opengauss/app/2.0.1_46134f73/bin/gs_ctl build -D /app/ogdata/data/dn1 -M cascade_standby -b standby_full -r 7200 -t 1209600" | pssh -s -t 1209610 -H pghost6
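    Start the cluster
    source /home/omm/.bashrc ; cm_ctl start -t 604800 # At this point the cluster nodes are already in the main standby and cascade standby states.

    Query the disaster recovery status

    gs_sdr -t query

    Primary cluster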

    $ gs_sdr -t query
    --------------------------------------------------------------------------------
    Streaming disaster recovery query 9f658f3a2d0511eebbb208002716c96f
    --------------------------------------------------------------------------------
    Start streaming disaster query.
    Start check archive.
    Start check recovery.
    Start check RPO & RTO.
    Successfully executed streaming disaster recovery query, result:
    {'hadr_cluster_stat': 'archive', 'hadr_failover_stat': '', 'hadr_switchover_stat': '', 'RPO': '0', 'RTO': '0'}

    Standby cluster

    $ gs_sdr -t query
    --------------------------------------------------------------------------------
    Streaming disaster recovery query ad8afd5c2d0511ee88cf080027fd3332
    --------------------------------------------------------------------------------
    Start streaming disaster query.
    Start check archive.
    Start check recovery.
    Start check RPO & RTO.
    Successfully executed streaming disaster recovery query, result:
    {'hadr_cluster_stat': 'restore', 'hadr_failover_stat': '', 'hadr_switchover_stat': '', 'RPO': '', 'RTO': ''}
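The last line of the query output is printed as a Python dictionary literal, so it can be read back safely with `ast.literal_eval` for monitoring purposes. The sketch below assumes the output layout shown above. `parse_query_result` is a hypothetical helper, and the sample strings are shortened copies of the two outputs.

```python
import ast


def parse_query_result(output):
    """Return the result dictionary printed on the last `{...}` line."""
    for line in reversed(output.splitlines()):
        line = line.strip()
        if line.startswith("{") and line.endswith("}"):
            # literal_eval only accepts Python literals, so no code is executed.
            return ast.literal_eval(line)
    raise ValueError("no result dictionary found")


primary_output = """Start check RPO & RTO.
Successfully executed streaming disaster recovery query, result:
{'hadr_cluster_stat': 'archive', 'hadr_failover_stat': '', 'hadr_switchover_stat': '', 'RPO': '0', 'RTO': '0'}
"""
standby_output = """Start check RPO & RTO.
Successfully executed streaming disaster recovery query, result:
{'hadr_cluster_stat': 'restore', 'hadr_failover_stat': '', 'hadr_switchover_stat': '', 'RPO': '', 'RTO': ''}
"""

primary = parse_query_result(primary_output)
standby = parse_query_result(standby_output)
print(primary["hadr_cluster_stat"], standby["hadr_cluster_stat"])
```

In the outputs above, the primary cluster reports `archive` with RPO and RTO values, while the standby cluster reports `restore` with both fields empty.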


  • Original article: https://blog.csdn.net/renxyz/article/details/134057117