Four virtual machines installed under VMware, with IPs 192.168.217.19/20/21/22, run a highly available kubernetes cluster deployed with kubeadm. The cluster uses an external etcd cluster: etcd runs on 19, 20 and 21, the masters are also 19, 20 and 21, and 22 is the worker node.
The detailed configuration and installation steps are covered in the previous article: 云原生|kubernetes|kubeadm部署高可用集群(二)---kube-apiserver高可用+etcd外部集群+haproxy+keepalived_晚风_END的博客-CSDN博客
Due to a mistake (well, not exactly a mistake, just ionice used badly), server 19 died completely. Snapshots of the VM exist, but rather than restoring one, let's see how to bring this lost master node back.
Checking on server 20:
- [root@master2 ~]# kubectl get po -A -owide
- NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
- kube-system calico-kube-controllers-796cc7f49d-5fz47 1/1 Running 3 (88m ago) 2d5h 10.244.166.146 node1
- kube-system calico-node-49qd6 1/1 Running 4 (29h ago) 2d5h 192.168.217.19 master1
- kube-system calico-node-l9kbj 1/1 Running 3 (88m ago) 2d5h 192.168.217.21 master3
- kube-system calico-node-nsknc 1/1 Running 3 (88m ago) 2d5h 192.168.217.20 master2
- kube-system calico-node-pd8v2 1/1 Running 6 (88m ago) 2d5h 192.168.217.22 node1
- kube-system coredns-7f6cbbb7b8-7c85v 1/1 Running 15 (88m ago) 5d16h 10.244.166.143 node1
- kube-system coredns-7f6cbbb7b8-h9wtb 1/1 Running 15 (88m ago) 5d16h 10.244.166.144 node1
- kube-system kube-apiserver-master1 1/1 Running 19 (29h ago) 5d18h 192.168.217.19 master1
- kube-system kube-apiserver-master2 1/1 Running 1 (88m ago) 16h 192.168.217.20 master2
- kube-system kube-apiserver-master3 1/1 Running 1 (88m ago) 16h 192.168.217.21 master3
- kube-system kube-controller-manager-master1 1/1 Running 14 4d2h 192.168.217.19 master1
- kube-system kube-controller-manager-master2 1/1 Running 12 (88m ago) 4d4h 192.168.217.20 master2
- kube-system kube-controller-manager-master3 1/1 Running 13 (88m ago) 4d4h 192.168.217.21 master3
- kube-system kube-proxy-69w6c 1/1 Running 2 (29h ago) 2d5h 192.168.217.19 master1
- kube-system kube-proxy-vtz99 1/1 Running 4 (88m ago) 2d6h 192.168.217.22 node1
- kube-system kube-proxy-wldcc 1/1 Running 4 (88m ago) 2d6h 192.168.217.21 master3
- kube-system kube-proxy-x6w6l 1/1 Running 4 (88m ago) 2d6h 192.168.217.20 master2
- kube-system kube-scheduler-master1 1/1 Running 11 (29h ago) 4d2h 192.168.217.19 master1
- kube-system kube-scheduler-master2 1/1 Running 11 (88m ago) 4d4h 192.168.217.20 master2
- kube-system kube-scheduler-master3 1/1 Running 10 (88m ago) 4d4h 192.168.217.21 master3
- kube-system metrics-server-55b9b69769-j9c7j 1/1 Running 1 (88m ago) 16h 10.244.166.147 node1
Check the etcd status:
Here an alias wraps etcdctl with all of its certificates (on server 20):
- alias etct_search="ETCDCTL_API=3 \
- /opt/etcd/bin/etcdctl \
- --endpoints=https://192.168.217.19:2379,https://192.168.217.20:2379,https://192.168.217.21:2379 \
- --cacert=/opt/etcd/ssl/ca.pem \
- --cert=/opt/etcd/ssl/server.pem \
- --key=/opt/etcd/ssl/server-key.pem"
The dead etcd node 19 has already been kicked out of the member list; querying the cluster status still shows an error, because the alias lists https://192.168.217.19:2379 among its --endpoints:
- [root@master2 ~]# etct_search member list -w table
- +------------------+---------+--------+-----------------------------+-----------------------------+------------+
- | ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
- +------------------+---------+--------+-----------------------------+-----------------------------+------------+
- | ef2fee107aafca91 | started | etcd-2 | https://192.168.217.20:2380 | https://192.168.217.20:2379 | false |
- | f5b8cb45a0dcf520 | started | etcd-3 | https://192.168.217.21:2380 | https://192.168.217.21:2379 | false |
- +------------------+---------+--------+-----------------------------+-----------------------------+------------+
- [root@master2 ~]# etct_search endpoint status -w table
- {"level":"warn","ts":"2022-11-01T17:13:39.004+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://192.168.217.19:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 192.168.217.19:2379: connect: no route to host\""}
- Failed to get the status of endpoint https://192.168.217.19:2379 (context deadline exceeded)
- +-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
- | ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
- +-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
- | https://192.168.217.20:2379 | ef2fee107aafca91 | 3.4.9 | 4.0 MB | true | false | 229 | 453336 | 453336 | |
- | https://192.168.217.21:2379 | f5b8cb45a0dcf520 | 3.4.9 | 4.0 MB | false | false | 229 | 453336 | 453336 | |
- +-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
The etcd config file (the working config on server 20):
- [root@master2 ~]# cat /opt/etcd/cfg/etcd.conf
- #[Member]
- ETCD_NAME="etcd-2"
- ETCD_DATA_DIR="/var/lib/etcd/default.etcd"
- ETCD_LISTEN_PEER_URLS="https://192.168.217.20:2380"
- ETCD_LISTEN_CLIENT_URLS="https://192.168.217.20:2379"
-
- #[Clustering]
- ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.217.20:2380"
- ETCD_ADVERTISE_CLIENT_URLS="https://192.168.217.20:2379"
- ETCD_INITIAL_CLUSTER="etcd-1=https://192.168.217.19:2380,etcd-2=https://192.168.217.20:2380,etcd-3=https://192.168.217.21:2380"
- ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster"
- ETCD_INITIAL_CLUSTER_STATE="new"
The etcd systemd unit (the working one on server 20):
- [root@master2 ~]# cat /usr/lib/systemd/system/etcd.service
- [Unit]
- Description=Etcd Server
- After=network.target
- After=network-online.target
- Wants=network-online.target
-
- [Service]
- Type=notify
- EnvironmentFile=/opt/etcd/cfg/etcd.conf
- ExecStart=/opt/etcd/bin/etcd \
- --cert-file=/opt/etcd/ssl/server.pem \
- --key-file=/opt/etcd/ssl/server-key.pem \
- --peer-cert-file=/opt/etcd/ssl/server.pem \
- --peer-key-file=/opt/etcd/ssl/server-key.pem \
- --trusted-ca-file=/opt/etcd/ssl/ca.pem \
- --peer-trusted-ca-file=/opt/etcd/ssl/ca.pem \
- --wal-dir=/var/lib/etcd \
- --snapshot-count=50000 \
- --auto-compaction-retention=1 \
- --auto-compaction-mode=periodic \
- --max-request-bytes=10485760 \
- --quota-backend-bytes=8589934592 \
- --heartbeat-interval=500 \
- --election-timeout=1000
- Restart=on-failure
- LimitNOFILE=65536
-
- [Install]
- WantedBy=multi-user.target

Two things to watch in this unit: every continued line of ExecStart must end with a backslash and must not carry an inline comment (systemd would pass the comment text to etcd as extra arguments), and systemd does not evaluate shell arithmetic, so the size flags have to be literal byte counts. The tuning flags, briefly:

--wal-dir: directory for the write-ahead log.
--snapshot-count: number of committed transactions that triggers a snapshot to disk, releasing WAL entries; the default is 100000.
--auto-compaction-retention=1 with --auto-compaction-mode=periodic: the first compaction runs after 1 hour, then every 10% of that period, i.e. roughly every 6 minutes.
--max-request-bytes: maximum request size (10485760 bytes = 10 MB here); a single key defaults to 1.5 MB, and the official recommendation is at most 10 MB.
--quota-backend-bytes: backend database size quota (8589934592 bytes = 8 GB here).
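As a quick sanity check on those byte counts: the values for --max-request-bytes and --quota-backend-bytes are plain powers-of-two arithmetic, which a throwaway shell snippet can confirm (nothing cluster-specific assumed here):

```shell
# 10 MiB and 8 GiB expressed as byte counts, as used by
# --max-request-bytes and --quota-backend-bytes.
max_request_bytes=$((10 * 1024 * 1024))
quota_backend_bytes=$((8 * 1024 * 1024 * 1024))
echo "$max_request_bytes $quota_backend_bytes"   # prints: 10485760 8589934592
```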
Part 1: The recovery plan
Based on the information above, the etcd cluster needs to be brought back to three members. The plan is to rebuild the VM for node 19 with its original IP, so the etcd certificates do not have to be regenerated; the node only needs to join the existing etcd cluster.
Because the IPs are unchanged and etcd is an external cluster, the second step, once etcd is healthy again, is simply to kubeadm reset every node and have all nodes join the cluster once more; the cluster is then recovered.
1. Remove the damaged member
The removal command (the number is the ID reported by etcdctl member list):
etct_search member remove 97c1c1003e0d4bf
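To avoid copying the ID by hand, it can be pulled out of the member list programmatically. A small sketch (member_id_by_name is a name made up here; it assumes the default comma-separated output of etcdctl member list):

```shell
# member_id_by_name: read `etcdctl member list` output on stdin and print the
# ID whose NAME column (3rd comma-separated field) matches the first argument.
member_id_by_name() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Hypothetical usage with the alias defined earlier:
#   etct_search member list | member_id_by_name etcd-1 | xargs etct_search member remove
```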
2. Copy the etcd files to node 19
Copy the config file and program from a healthy etcd node to node 19 (the /opt/etcd directory holds the etcd cluster certificates, the two executables and the main config file; the config file gets edited in a moment):
- scp -r /opt/etcd 192.168.217.19:/opt/
- scp /usr/lib/systemd/system/etcd.service 192.168.217.19:/usr/lib/systemd/system/
3. Edit the etcd config file on node 19
The changes are the member-local IPs, the ETCD_NAME, and ETCD_INITIAL_CLUSTER_STATE, which becomes "existing"; compare with the config file shown above.
- [root@master ~]# cat !$
- cat /opt/etcd/cfg/etcd.conf
-
-
- #[Member]
- ETCD_NAME="etcd-1"
- ETCD_DATA_DIR="/var/lib/etcd/default.etcd"
- ETCD_LISTEN_PEER_URLS="https://192.168.217.19:2380"
- ETCD_LISTEN_CLIENT_URLS="https://192.168.217.19:2379"
-
- #[Clustering]
- ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.217.19:2380"
- ETCD_ADVERTISE_CLIENT_URLS="https://192.168.217.19:2379"
- ETCD_INITIAL_CLUSTER="etcd-1=https://192.168.217.19:2380,etcd-2=https://192.168.217.20:2380,etcd-3=https://192.168.217.21:2380"
- ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster"
- ETCD_INITIAL_CLUSTER_STATE="existing"
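The three edits (name, member-local IPs, cluster state) are mechanical, so they can be scripted. A sketch (convert_conf is a made-up name) that reads etcd-2's file on stdin and writes etcd-1's on stdout; note the guard that leaves ETCD_INITIAL_CLUSTER untouched, since that line must keep listing all three peers:

```shell
# convert_conf: turn etcd-2's config (stdin) into etcd-1's config (stdout).
convert_conf() {
  sed -e 's/^ETCD_NAME=.*/ETCD_NAME="etcd-1"/' \
      -e '/^ETCD_INITIAL_CLUSTER=/!s/192\.168\.217\.20/192.168.217.19/g' \
      -e 's/^ETCD_INITIAL_CLUSTER_STATE=.*/ETCD_INITIAL_CLUSTER_STATE="existing"/'
}

# On node 19, roughly:
#   convert_conf < /tmp/etcd.conf.from-20 > /opt/etcd/cfg/etcd.conf
```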
4. Register the new member
On server 20, run the member add command. This step is essential; without it the rebuilt node cannot join the existing cluster:
etct_search member add etcd-1 --peer-urls=https://192.168.217.19:2380
5. Restart the etcd service on all three nodes:
systemctl daemon-reload && systemctl restart etcd
6. Logs of the etcd cluster recovery
The system log on server 20 shows the new member joining and the leader election that follows. The relevant entries are excerpted below (the ID beginning with 3d is the new node, 192.168.217.19):
- Nov 1 22:41:47 master2 etcd: raft2022/11/01 22:41:47 INFO: ef2fee107aafca91 switched to configuration voters=(4427268366965300623 17235256053515405969 17706125434919122208)
- Nov 1 22:41:47 master2 etcd: added member 3d70d11f824a5d8f [https://192.168.217.19:2380] to cluster b459890bbabfc99f
- Nov 1 22:41:47 master2 etcd: starting peer 3d70d11f824a5d8f...
- Nov 1 22:41:47 master2 etcd: started HTTP pipelining with peer 3d70d11f824a5d8f
- Nov 1 22:41:47 master2 etcd: started streaming with peer 3d70d11f824a5d8f (writer)
- Nov 1 22:41:47 master2 etcd: started streaming with peer 3d70d11f824a5d8f (writer)
- Nov 1 22:41:47 master2 etcd: started peer 3d70d11f824a5d8f
- Nov 1 22:41:47 master2 etcd: added peer 3d70d11f824a5d8f
- Nov 1 22:41:47 master2 etcd: started streaming with peer 3d70d11f824a5d8f (stream Message reader)
- Nov 1 22:41:47 master2 etcd: started streaming with peer 3d70d11f824a5d8f (stream MsgApp v2 reader)
- Nov 1 22:41:47 master2 etcd: peer 3d70d11f824a5d8f became active
- Nov 1 22:41:47 master2 etcd: established a TCP streaming connection with peer 3d70d11f824a5d8f (stream Message reader)
- Nov 1 22:41:47 master2 etcd: established a TCP streaming connection with peer 3d70d11f824a5d8f (stream MsgApp v2 reader)
- Nov 1 22:41:47 master2 etcd: ef2fee107aafca91 initialized peer connection; fast-forwarding 8 ticks (election ticks 10) with 2 active peer(s)
- Nov 1 22:41:47 master2 etcd: established a TCP streaming connection with peer 3d70d11f824a5d8f (stream Message writer)
- Nov 1 22:41:47 master2 etcd: established a TCP streaming connection with peer 3d70d11f824a5d8f (stream MsgApp v2 writer)
- Nov 1 22:41:47 master2 etcd: published {Name:etcd-2 ClientURLs:[https://192.168.217.20:2379]} to cluster b459890bbabfc99f
- Nov 1 22:41:47 master2 etcd: ready to serve client requests
- Nov 1 22:41:47 master2 systemd: Started Etcd Server.
-
- Nov 1 22:42:26 master2 etcd: raft2022/11/01 22:42:26 INFO: ef2fee107aafca91 [term 639] received MsgTimeoutNow from 3d70d11f824a5d8f and starts an election to get leadership.
- Nov 1 22:42:26 master2 etcd: raft2022/11/01 22:42:26 INFO: ef2fee107aafca91 became candidate at term 640
- Nov 1 22:42:26 master2 etcd: raft2022/11/01 22:42:26 INFO: ef2fee107aafca91 received MsgVoteResp from ef2fee107aafca91 at term 640
- Nov 1 22:42:26 master2 etcd: raft2022/11/01 22:42:26 INFO: ef2fee107aafca91 [logterm: 639, index: 492331] sent MsgVote request to 3d70d11f824a5d8f at term 640
- Nov 1 22:42:26 master2 etcd: raft2022/11/01 22:42:26 INFO: ef2fee107aafca91 [logterm: 639, index: 492331] sent MsgVote request to f5b8cb45a0dcf520 at term 640
- Nov 1 22:42:26 master2 etcd: raft2022/11/01 22:42:26 INFO: raft.node: ef2fee107aafca91 lost leader 3d70d11f824a5d8f at term 640
- Nov 1 22:42:26 master2 etcd: raft2022/11/01 22:42:26 INFO: ef2fee107aafca91 received MsgVoteResp from 3d70d11f824a5d8f at term 640
- Nov 1 22:42:26 master2 etcd: raft2022/11/01 22:42:26 INFO: ef2fee107aafca91 has received 2 MsgVoteResp votes and 0 vote rejections
- Nov 1 22:42:26 master2 etcd: raft2022/11/01 22:42:26 INFO: ef2fee107aafca91 became leader at term 640
- Nov 1 22:42:26 master2 etcd: raft2022/11/01 22:42:26 INFO: raft.node: ef2fee107aafca91 elected leader ef2fee107aafca91 at term 640
- Nov 1 22:42:26 master2 etcd: lost the TCP streaming connection with peer 3d70d11f824a5d8f (stream MsgApp v2 reader)
- Nov 1 22:42:26 master2 etcd: lost the TCP streaming connection with peer 3d70d11f824a5d8f (stream Message reader)
7. Testing the etcd cluster
Server 20 is now the etcd leader, which also matches the election seen in the logs above.
- [root@master2 ~]# etct_search member list -w table
- +------------------+---------+--------+-----------------------------+-----------------------------+------------+
- | ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
- +------------------+---------+--------+-----------------------------+-----------------------------+------------+
- | 3d70d11f824a5d8f | started | etcd-1 | https://192.168.217.19:2380 | https://192.168.217.19:2379 | false |
- | ef2fee107aafca91 | started | etcd-2 | https://192.168.217.20:2380 | https://192.168.217.20:2379 | false |
- | f5b8cb45a0dcf520 | started | etcd-3 | https://192.168.217.21:2380 | https://192.168.217.21:2379 | false |
- +------------------+---------+--------+-----------------------------+-----------------------------+------------+
- [root@master2 ~]# etct_search endpoint status -w table
- +-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
- | ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
- +-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
- | https://192.168.217.19:2379 | 3d70d11f824a5d8f | 3.4.9 | 4.0 MB | false | false | 640 | 494999 | 494999 | |
- | https://192.168.217.20:2379 | ef2fee107aafca91 | 3.4.9 | 4.1 MB | true | false | 640 | 495000 | 495000 | |
- | https://192.168.217.21:2379 | f5b8cb45a0dcf520 | 3.4.9 | 4.0 MB | false | false | 640 | 495000 | 495000 | |
- +-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
- [root@master2 ~]# etct_search endpoint health -w table
- +-----------------------------+--------+-------------+-------+
- | ENDPOINT | HEALTH | TOOK | ERROR |
- +-----------------------------+--------+-------------+-------+
- | https://192.168.217.19:2379 | true | 26.210272ms | |
- | https://192.168.217.20:2379 | true | 26.710558ms | |
- | https://192.168.217.21:2379 | true | 27.903774ms | |
- +-----------------------------+--------+-------------+-------+
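Since "which member is the leader" is the usual question, the leader endpoint can also be scraped out of the status table. A sketch (leader_endpoint is a made-up name) that reads the -w table output on stdin and prints the ENDPOINT whose IS LEADER column is true:

```shell
# leader_endpoint: read `etcdctl endpoint status -w table` output on stdin
# and print the endpoint of the current leader. With '|' as the separator,
# field 2 is ENDPOINT and field 6 is IS LEADER; border rows contain no '|'
# and the header row never contains "true", so both fall through.
leader_endpoint() {
  awk -F'|' '$6 ~ /true/ { gsub(/ /, "", $2); print $2 }'
}

# Hypothetical usage with the alias defined earlier:
#   etct_search endpoint status -w table | leader_endpoint
```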
Since the failed node was master1 of the three masters, haproxy and keepalived also need to be reinstalled on it. There is nothing special here: copy the config files over from the healthy node 20 and adjust them.
Likewise, reinstall kubelet, kubeadm and kubectl.
Then reset all four nodes, that is, kubeadm reset -f, and re-initialize. The init file is shown below (the deployment tutorial linked at the beginning of this article covers node setup in detail, so it is not repeated here):
- [root@master ~]# cat kubeadm-init-ha.yaml
- apiVersion: kubeadm.k8s.io/v1beta3
- bootstrapTokens:
- - groups:
-   - system:bootstrappers:kubeadm:default-node-token
-   token: abcdef.0123456789abcdef
-   ttl: "0"
-   usages:
-   - signing
-   - authentication
- kind: InitConfiguration
- localAPIEndpoint:
-   advertiseAddress: 192.168.217.19
-   bindPort: 6443
- nodeRegistration:
-   criSocket: /var/run/dockershim.sock
-   imagePullPolicy: IfNotPresent
-   name: master1
-   taints: null
- ---
- controlPlaneEndpoint: "192.168.217.100"
- apiServer:
-   timeoutForControlPlane: 4m0s
- apiVersion: kubeadm.k8s.io/v1beta3
- certificatesDir: /etc/kubernetes/pki
- clusterName: kubernetes
- controllerManager: {}
- dns: {}
- etcd:
-   external:
-     endpoints:   # the external etcd cluster addresses
-     - https://192.168.217.19:2379
-     - https://192.168.217.20:2379
-     - https://192.168.217.21:2379
-     caFile: /etc/kubernetes/pki/etcd/ca.pem
-     certFile: /etc/kubernetes/pki/etcd/apiserver-etcd-client.pem
-     keyFile: /etc/kubernetes/pki/etcd/apiserver-etcd-client-key.pem
- imageRepository: registry.aliyuncs.com/google_containers
- kind: ClusterConfiguration
- kubernetesVersion: 1.22.2
- networking:
-   dnsDomain: cluster.local
-   podSubnet: "10.244.0.0/16"
-   serviceSubnet: "10.96.0.0/12"
- scheduler: {}
The init command (run on server 19, since that is the node the init config file names as advertiseAddress):
kubeadm init --config=kubeadm-init-ha.yaml --upload-certs
Because the cluster uses an external etcd cluster, node recovery is fairly simple: reinstall the network plugin and the rest, and the result looks like this:
Note that some pods, such as kube-apiserver, show refreshed ages, while pods like kube-proxy kept their original ages; this is because the cluster state lives in the external etcd cluster and survived the reset.
- [root@master ~]# kubectl get po -A -owide
- NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
- kube-system calico-kube-controllers-796cc7f49d-k586w 1/1 Running 0 153m 10.244.166.131 node1
- kube-system calico-node-7x86d 1/1 Running 0 153m 192.168.217.21 master3
- kube-system calico-node-dhxcq 1/1 Running 0 153m 192.168.217.19 master1
- kube-system calico-node-jcq6p 1/1 Running 0 153m 192.168.217.20 master2
- kube-system calico-node-vjtv6 1/1 Running 0 153m 192.168.217.22 node1
- kube-system coredns-7f6cbbb7b8-7c85v 1/1 Running 16 5d23h 10.244.166.129 node1
- kube-system coredns-7f6cbbb7b8-7xm62 1/1 Running 0 152m 10.244.166.132 node1
- kube-system kube-apiserver-master1 1/1 Running 0 107m 192.168.217.19 master1
- kube-system kube-apiserver-master2 1/1 Running 0 108m 192.168.217.20 master2
- kube-system kube-apiserver-master3 1/1 Running 1 (107m ago) 107m 192.168.217.21 master3
- kube-system kube-controller-manager-master1 1/1 Running 5 (108m ago) 3h26m 192.168.217.19 master1
- kube-system kube-controller-manager-master2 1/1 Running 15 (108m ago) 4d10h 192.168.217.20 master2
- kube-system kube-controller-manager-master3 1/1 Running 15 4d10h 192.168.217.21 master3
- kube-system kube-proxy-69w6c 1/1 Running 2 (131m ago) 2d12h 192.168.217.19 master1
- kube-system kube-proxy-vtz99 1/1 Running 5 2d12h 192.168.217.22 node1
- kube-system kube-proxy-wldcc 1/1 Running 5 2d12h 192.168.217.21 master3
- kube-system kube-proxy-x6w6l 1/1 Running 5 2d12h 192.168.217.20 master2
- kube-system kube-scheduler-master1 1/1 Running 4 (108m ago) 3h26m 192.168.217.19 master1
- kube-system kube-scheduler-master2 1/1 Running 14 (108m ago) 4d10h 192.168.217.20 master2
- kube-system kube-scheduler-master3 1/1 Running 13 (46m ago) 4d10h 192.168.217.21 master3
- kube-system metrics-server-55b9b69769-j9c7j 1/1 Running 13 (127m ago) 23h 10.244.166.130 node1