• Orchestrator - server_id相同导致graceful-master-takeout-auto失败


    server_id相同导致graceful-master-takeover-auto失败

    背景

    同事要做一个数据库迁移, 方法就是加从库, 然后主从切换, 再下掉老主库, 完成迁移
    原本是一主两从集群, 增加三个从库变为一主五从

    使用命令切换时报错(以下是本地测试复现的命令)

    resources/bin/orchestrator-client -c graceful-master-takeover-auto -alias orc_prod_infra_broker -d 172.20.0.14:3306 -b 'superuser:superpass'
    Desginated instance 172.20.0.14:3306 cannot take over all of its siblings. Error: 2022-09-02 15:33:27 ERROR Relocating 2 replicas of 172.20.0.10:3306 below 172.20.0.14:3306 turns to be too complex; please do it manually
    
    • 1
    • 2

    问题复现

    其实还没来得及翻代码, 同事就已经发现新加的三个从库server_id与原集群提供备份节点的server_id是一样的, 所以基本猜测就是server_id导致

    构建本地测试环境

    这里我部署1主4从, 虽然比线上实际少了一个从库, 但不影响复现问题.
    这个集群中, 我们将以下三个节点server_id都设置成101:

    • 172.20.0.11
    • 172.20.0.13
    • 172.20.0.14
    root@localhost 16:07:38 [orchestrator]> select instance_alias, hostname, port, server_id from database_instance;
    +---------------------------+-------------+------+-----------+
    | instance_alias            | hostname    | port | server_id |
    +---------------------------+-------------+------+-----------+
    | local_docker_master_3306  | 172.20.0.10 | 3306 |         1 |
    | local_docker_slave_1_3306 | 172.20.0.11 | 3306 |       101 |
    | local_docker_slave_2_3306 | 172.20.0.12 | 3306 |       102 |
    | local_docker_slave_3_3306 | 172.20.0.13 | 3306 |       101 |
    | local_docker_slave_4_3306 | 172.20.0.14 | 3306 |       101 |
    +---------------------------+-------------+------+-----------+
    5 rows in set (0.01 sec)
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11

    使用命令切换

    resources/bin/orchestrator-client -c graceful-master-takeover-auto -alias orc_prod_infra_broker -d 172.20.0.14:3306 -b 'superuser:superpass'
    Desginated instance 172.20.0.14:3306 cannot take over all of its siblings. Error: 2022-09-02 15:33:27 ERROR Relocating 2 replicas of 172.20.0.10:3306 below 172.20.0.14:3306 turns to be too complex; please do it manually
    
    • 1
    • 2

    日志

    [martini] Started GET /api/graceful-master-takeover-auto/orc_prod_infra_broker/172.20.0.14/3306 for 127.0.0.1:51897
    2022-09-02 15:33:26 INFO GracefulMasterTakeover: designated master instructed to be 172.20.0.14:3306
    2022-09-02 15:33:26 INFO GracefulMasterTakeover: Will let 172.20.0.14:3306 take over its siblings
    2022-09-02 15:33:26 INFO moveReplicasViaGTID: Will move 3 replicas below 172.20.0.14:3306 via GTID, max concurrency: 5
    2022-09-02 15:33:26 INFO Will move 172.20.0.12:3306 below 172.20.0.14:3306 via GTID
    2022-09-02 15:33:26 INFO Stopped replication on 172.20.0.12:3306, Self:mysql-bin.000001:2997164, Exec:mysql-bin.000003:4032
    2022-09-02 15:33:26 DEBUG ChangeMasterTo: will attempt changing master on 172.20.0.12:3306 to 172.20.0.14:3306, mysql-bin.000001:2997164
    2022-09-02 15:33:27 INFO ChangeMasterTo: Changed master on 172.20.0.12:3306 to: 172.20.0.14:3306, mysql-bin.000001:2997164. GTID: true
    2022-09-02 15:33:27 DEBUG semi-sync: analysis results for recovery of cluster 172.20.0.10:3306:
    2022-09-02 15:33:27 DEBUG semi-sync: master = 172.20.0.14:3306, master semi-sync wait count = 0, master semi-sync replica count = 0
    2022-09-02 15:33:27 DEBUG semi-sync: possible semi-sync replicas (in priority order): (none)
    2022-09-02 15:33:27 DEBUG semi-sync: always-async replicas: (none)
    2022-09-02 15:33:27 DEBUG semi-sync: excluded replicas (defunct): (none)
    2022-09-02 15:33:27 DEBUG semi-sync: suggested actions: (none)
    2022-09-02 15:33:27 INFO semi-sync: 172.20.0.12:3306: no action taken; this may lead to future recoveries
    2022-09-02 15:33:27 INFO Started replication on 172.20.0.12:3306
    2022-09-02 15:33:27 INFO moveReplicasViaGTID: Will move 2 replicas below 172.20.0.14:3306 via GTID, max concurrency: 5
    2022-09-02 15:33:27 ERROR Relocating 2 replicas of 172.20.0.10:3306 below 172.20.0.14:3306 turns to be too complex; please do it manually
    [martini] Completed 500 Internal Server Error in 581.39425ms
    [martini] Started GET /web/cluster/alias/orc_prod_infra_broker for 127.0.0.1:51672
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20

    问题定位

    通过对日志中关键字的搜索, 发现

    Relocating 2 replicas of 172.20.0.10:3306 below 172.20.0.14:3306 turns to be too complex; please do it manually
    
    • 1

    以下日志只会在relocateReplicasInternal函数中中输出


    那么是什么导致movereplicasViaGTID报错呢? 继续梳理代码可以发现
    moveReplicasViaGTID 会调用 moveInstanceBelowViaGTID,
    后者会调用 CanReplicateFrom 检查 A是否可以做B的从库

    而这里就会判断server_id是否相等

    原生orchestrator的问题

    原生的问题是, 没有输出这段错误信息

    想解决也很简单, 只修改以下两处代码


    [martini] Started GET /api/graceful-master-takeover-auto/orc_prod_infra_broker/172.20.0.14/3306 for 127.0.0.1:60996
    2022-09-02 17:21:00 INFO GracefulMasterTakeover: designated master instructed to be 172.20.0.14:3306
    2022-09-02 17:21:00 INFO GracefulMasterTakeover: Will let 172.20.0.14:3306 take over its siblings
    2022-09-02 17:21:00 INFO moveReplicasViaGTID: Will move 3 replicas below 172.20.0.14:3306 via GTID, max concurrency: 5
    2022-09-02 17:21:00 INFO Will move 172.20.0.12:3306 below 172.20.0.14:3306 via GTID
    2022-09-02 17:21:00 INFO Stopped replication on 172.20.0.12:3306, Self:mysql-bin.000001:2997164, Exec:mysql-bin.000003:4032
    2022-09-02 17:21:00 INFO ChangeMasterTo: Changed master on 172.20.0.12:3306 to: 172.20.0.14:3306, mysql-bin.000001:2997164. GTID: true
    2022-09-02 17:21:00 INFO semi-sync: 172.20.0.12:3306: no action taken; this may lead to future recoveries
    2022-09-02 17:21:00 INFO Started replication on 172.20.0.12:3306
    2022-09-02 17:21:00 INFO moveReplicasViaGTID: Will move 2 replicas below 172.20.0.14:3306 via GTID, max concurrency: 5
    2022-09-02 17:21:00 ERROR Relocating 2 replicas of 172.20.0.10:3306 below 172.20.0.14:3306 turns to be too complex; please do it manually
    2022-09-02 17:21:00 ERROR Identical server id: 172.20.0.14:3306, 172.20.0.11:3306 both have 101
    2022-09-02 17:21:00 ERROR Identical server id: 172.20.0.14:3306, 172.20.0.13:3306 both have 101
    [martini] Completed 500 Internal Server Error in 402.777083ms
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
  • 相关阅读:
    【云原生】Docker的初步认识,安装与基本操作
    高通导航器软件开发包使用指南(11)
    给你 2 万条数据,怎么快速导入到 MySQL?
    分布式集群——搭建Hadoop环境以及相关的Hadoop介绍
    mysql8多主多从复制搭建
    python带你对北京二手房进行数据分析,看看大概都什么价位
    winform中也可以这样做数据展示✨
    【PyTorch 攻略(5/7)】训练和模型
    Java递归生成树形菜单结构的json
    【Spring Boot】web开发相关源码分析
  • 原文地址:https://blog.csdn.net/ashic/article/details/126672274