How to Recover a Failed etcd Member on a Master Node in a Highly Available K8s Cluster
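Preface
- A very common cluster operations scenario, written up to share
- This post walks through recovering a failed master node in a highly available K8s cluster
- If I have misunderstood anything, corrections are welcome
There is no need to dwell too much on the present or worry too much about the future. Once you have lived through certain things, the scenery in front of you is no longer what it used to be. (Haruki Murakami)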
What went wrong
While running an experiment today, I found that etcd and the apiserver on one of the cluster's master nodes had both gone down.
┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get nodes
NAME STATUS ROLES AGE VERSION
vms100.liruilongs.github.io Ready control-plane 415d v1.25.1
vms101.liruilongs.github.io Ready control-plane 415d v1.25.1
vms102.liruilongs.github.io Ready control-plane 415d v1.25.1
vms103.liruilongs.github.io Ready <none> 415d v1.25.1
vms105.liruilongs.github.io Ready <none> 415d v1.25.1
vms106.liruilongs.github.io Ready <none> 415d v1.25.1
┌──[root@vms100.liruilongs.github.io]-[~]
└─$
The apiserver and etcd on node vms100.liruilongs.github.io are the ones in CrashLoopBackOff:
┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get pod -A -o wide | grep apiserver
kube-system kube-apiserver-vms100.liruilongs.github.io 0/1 CrashLoopBackOff 1448 (3m23s ago) 415d 192.168.26.100 vms100.liruilongs.github.io <none> <none>
kube-system kube-apiserver-vms101.liruilongs.github.io 1/1 Running 272 (3h18m ago) 415d 192.168.26.101 vms101.liruilongs.github.io <none> <none>
kube-system kube-apiserver-vms102.liruilongs.github.io 1/1 Running 246 (3h18m ago) 415d 192.168.26.102 vms102.liruilongs.github.io <none> <none>
┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get pod -A -o wide | grep etcd
kube-system etcd-vms100.liruilongs.github.io 0/1 CrashLoopBackOff 1244 (3m6s ago) 415d 192.168.26.100 vms100.liruilongs.github.io <none> <none>
kube-system etcd-vms101.liruilongs.github.io 1/1 Running 167 (3h18m ago) 415d 192.168.26.101 vms101.liruilongs.github.io <none> <none>
kube-system etcd-vms102.liruilongs.github.io 1/1 Running 173 (3h18m ago) 415d 192.168.26.102 vms102.liruilongs.github.io <none> <none>
The static Pods for keepalived are running normally:
┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get pod -A -o wide | grep keep
kube-system keepalived-vms100.liruilongs.github.io 1/1 Running 63 (3h50m ago) 415d 192.168.26.100 vms100.liruilongs.github.io <none> <none>
kube-system keepalived-vms101.liruilongs.github.io 1/1 Running 54 (3h51m ago) 415d 192.168.26.101 vms101.liruilongs.github.io <none> <none>
kube-system keepalived-vms102.liruilongs.github.io 1/1 Running 60 (3h51m ago) 415d 192.168.26.102 vms102.liruilongs.github.io <none> <none>
┌──[root@vms100.liruilongs.github.io]-[~]
└─$
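So etcd probably went down because its data fell out of sync, or for some other reason. Each master's apiserver talks only to the etcd on its own node (each etcd member forwards write requests to the etcd leader), so once the local etcd is down, the apiserver cannot serve requests and crashes as well.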
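To see which etcd endpoint an apiserver is wired to, one can read the `--etcd-servers` flag from its static Pod manifest. This is a minimal sketch, assuming the kubeadm default manifest path; the function name `apiserver_etcd_servers` is hypothetical.

```shell
# Print the etcd endpoints configured for kube-apiserver.
# Assumption: kubeadm layout; pass another manifest path as $1 if yours differs.
apiserver_etcd_servers() {
    manifest="${1:-/etc/kubernetes/manifests/kube-apiserver.yaml}"
    # Keep only the value that follows --etcd-servers=
    sed -n 's/.*--etcd-servers=\([^ ]*\).*/\1/p' "$manifest"
}
```

If the output names only the local address (for example `https://127.0.0.1:2379`), that apiserver has no other etcd member to fall back to.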
etcdctl shows that etcd on vms100.liruilongs.github.io is completely dead:
┌──[root@vms100.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl --endpoints https://127.0.0.1:2379 --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key" --cacert="/etc/kubernetes/pki/etcd/ca.crt" member list -w table
Error: dial tcp 127.0.0.1:2379: connect: connection refused
How to troubleshoot
Here we switch to another etcd node to run the commands.
List the etcd cluster members:
┌──[root@vms100.liruilongs.github.io]-[~]
└─$ssh vms101.liruilongs.github.io
Last login: Sat Mar 2 09:52:01 2024 from 192.168.26.100
┌──[root@vms101.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl --endpoints https://127.0.0.1:2379 --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key" --cacert="/etc/kubernetes/pki/etcd/ca.crt" member list -w table
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
| ee392e5273e89e2 | started | vms100.liruilongs.github.io | https://192.168.26.100:2380 | https://192.168.26.100:2379 |
| 70059e836d19883d | started | vms101.liruilongs.github.io | https://192.168.26.101:2380 | https://192.168.26.101:2379 |
| b8cb9f66c2e63b91 | started | vms102.liruilongs.github.io | https://192.168.26.102:2380 | https://192.168.26.102:2379 |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
Check the endpoint status:
┌──[root@vms101.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl --endpoints https://127.0.0.1:2379 --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key" --cacert="/etc/kubernetes/pki/etcd/ca.crt" endpoint status --cluster -w table
Failed to get the status of endpoint https://192.168.26.100:2379 (context deadline exceeded)
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://192.168.26.101:2379 | 70059e836d19883d | 3.5.4 | 88 MB | false | 603 | 22208417 |
| https://192.168.26.102:2379 | b8cb9f66c2e63b91 | 3.5.4 | 88 MB | true | 603 | 22208417 |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
Confirm that the etcd member has failed:
┌──[root@vms101.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl --endpoints https://127.0.0.1:2379 --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key" --cacert="/etc/kubernetes/pki/etcd/ca.crt" endpoint health --cluster -w table
https://192.168.26.101:2379 is healthy: successfully committed proposal: took = 3.753357ms
https://192.168.26.102:2379 is healthy: successfully committed proposal: took = 2.989943ms
https://192.168.26.100:2379 is unhealthy: failed to connect: dial tcp 192.168.26.100:2379: connect: connection refused
Error: unhealthy cluster
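Every etcdctl call in this post repeats the same TLS flags, so a small wrapper can keep the commands readable. This is only a convenience sketch: the function name `ek` and the `ETCD_ENDPOINT` variable are hypothetical, and the certificate paths are the ones used in the commands above.

```shell
# Wrapper around etcdctl with the TLS flags used throughout this post.
# ETCD_ENDPOINT is a made-up override variable, not an etcdctl setting.
ek() {
    ETCDCTL_API=3 etcdctl \
        --endpoints "${ETCD_ENDPOINT:-https://127.0.0.1:2379}" \
        --cert="/etc/kubernetes/pki/etcd/server.crt" \
        --key="/etc/kubernetes/pki/etcd/server.key" \
        --cacert="/etc/kubernetes/pki/etcd/ca.crt" \
        "$@"
}
# Usage:
#   ek member list -w table
#   ETCD_ENDPOINT=https://192.168.26.101:2379 ek endpoint health --cluster
```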
Check the etcd container logs:
┌──[root@vms100.liruilongs.github.io]-[~]
└─$docker ps -a | grep etcd
0f2f98ebf8c3 a8a176a5d5d6 "etcd --advertise-cl…" 4 minutes ago Exited (2) 4 minutes ago k8s_etcd_etcd-vms100.liruilongs.github.io_kube-system_e8c17bb99f9bd8119cdd769556041e18_1252
a4b39d16a753 registry.aliyuncs.com/google_containers/pause:3.8 "/pause" 4 hours ago Up 4 hours k8s_POD_etcd-vms100.liruilongs.github.io_kube-system_e8c17bb99f9bd8119cdd769556041e18_54
┌──[root@vms100.liruilongs.github.io]-[~]
└─$docker logs 0f2f98ebf8c3
{"level":"info","ts":"2024-03-16T14:46:54.644Z","caller":"etcdmain/etcd.go:73","msg":"Running: ","args":["etcd","--advertise-client-urls=https://192.168.26.100:2379","--cert-file=/etc/kubernetes/pki/etcd/server.crt","--client-cert-auth=true","--data-dir=/var/lib/etcd","--experimental-initial-corrupt-check=true","--experimental-watch-progress-notify-interval=5s","--initial-advertise-peer-urls=https://192.168.26.100:2380","--initial-cluster=vms100.liruilongs.github.io=https://192.168.26.100:2380","--key-file=/etc/kubernetes/pki/etcd/server.key","--listen-client-urls=https://127.0.0.1:2379,https://192.168.26.100:2379","--listen-metrics-urls=http://127.0.0.1:2381","--listen-peer-urls=https://192.168.26.100:2380","--name=vms100.liruilongs.github.io","--peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt","--peer-client-cert-auth=true","--peer-key-file=/etc/kubernetes/pki/etcd/peer.key","--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt","--snapshot-count=10000","--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt"]}
{"level":"info","ts":"2024-03-16T14:46:54.645Z","caller":"etcdmain/etcd.go:116","msg":"server has been already initialized","data-dir":"/var/lib/etcd","dir-type":"member"}
{"level":"info","ts":"2024-03-16T14:46:54.645Z","caller":"embed/etcd.go:131","msg":"configuring peer listeners","listen-peer-urls":["https://192.168.26.100:2380"]}
{"level":"info","ts":"2024-03-16T14:46:54.645Z","caller":"embed/etcd.go:479","msg":"starting with peer TLS","tls-info":"cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, client-cert=, client-key=, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file = ","cipher-suites":[]}
{"level":"info","ts":"2024-03-16T14:46:54.645Z","caller":"embed/etcd.go:139","msg":"configuring client listeners","listen-client-urls":["https://127.0.0.1:2379","https://192.168.26.100:2379"]}
{"level":"info","ts":"2024-03-16T14:46:54.645Z","caller":"embed/etcd.go:308","msg":"starting an etcd server","etcd-version":"3.5.4","git-sha":"08407ff76","go-version":"go1.16.15","go-os":"linux","go-arch":"amd64","max-cpu-set":4,"max-cpu-available":4,"member-initialized":true,"name":"vms100.liruilongs.github.io","data-dir":"/var/lib/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/etcd/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":10000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["https://192.168.26.100:2380"],"listen-peer-urls":["https://192.168.26.100:2380"],"advertise-client-urls":["https://192.168.26.100:2379"],"listen-client-urls":["https://127.0.0.1:2379","https://192.168.26.100:2379"],"listen-metrics-urls":["http://127.0.0.1:2381"],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-size-bytes":2147483648,"pre-vote":true,"initial-corrupt-check":true,"corrupt-check-time-interval":"0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
panic: freepages: failed to get all reachable pages (page 7744: multiple references)
goroutine 109 [running]:
go.etcd.io/bbolt.(*DB).freepages.func2(0xc00009c480)
    /go/pkg/mod/go.etcd.io/bbolt@v1.3.6/db.go:1056 +0xe9
created by go.etcd.io/bbolt.(*DB).freepages
    /go/pkg/mod/go.etcd.io/bbolt@v1.3.6/db.go:1054 +0x1cd
┌──[root@vms100.liruilongs.github.io]-[~]
└─$
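The decisive line in that output is the plain-text panic, which is easy to miss among the long JSON log lines. The stack trace points into go.etcd.io/bbolt, which suggests the on-disk database of this member is damaged. A small filter, with the hypothetical name `first_panic`, pulls out just that line:

```shell
# Print the first panic line from an etcd log stream on stdin (nothing if none).
first_panic() {
    sed -n '/^panic:/{p;q;}'
}
# Usage (container ID taken from `docker ps -a | grep etcd`):
#   docker logs 0f2f98ebf8c3 2>&1 | first_panic
```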
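How to fix it
The quickest fix here is to resync this node's data: remove the failed member from the cluster, clear its old data, and then add it back. The steps are:
- Clear the data directory and move the static Pod YAML files: stop the services on the failed node, then delete the etcd data directory.
- Remove the failed member: use the member remove command to evict the bad member. This can be run from a healthy node.
- Add the member: use the member add command to add the failed node back.
- Restart: move the YAML files back on the failed node so the Pods start again.
Note: static Pods are created from the YAML files in a designated directory, which kubelet scans periodically. Deleting or moving a YAML file stops the corresponding static Pod automatically; likewise, adding a YAML file creates the static Pod automatically.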
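The directory kubelet scans is set by the `staticPodPath` field of the kubelet configuration. The sketch below reads it, assuming the kubeadm default config location `/var/lib/kubelet/config.yaml`; the function name `static_pod_path` is hypothetical.

```shell
# Print kubelet's static Pod directory.
# Assumption: kubeadm wrote the kubelet config to /var/lib/kubelet/config.yaml;
# pass another path as $1 if your setup differs.
static_pod_path() {
    config="${1:-/var/lib/kubelet/config.yaml}"
    sed -n 's/^staticPodPath:[[:space:]]*//p' "$config"
}
```

On this cluster the manifests live in `/etc/kubernetes/manifests`, which is the directory the commands in this post move files in and out of.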
Move the static Pod YAML files:
┌──[root@vms100.liruilongs.github.io]-[~]
└─$mv /etc/kubernetes/manifests/{etcd.yaml,kube-apiserver.yaml} /tmp/
Delete the etcd data directory:
┌──[root@vms100.liruilongs.github.io]-[~]
└─$rm -rf /var/lib/etcd/*
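If you would rather keep a copy of the damaged data for later inspection, moving it aside is a more cautious alternative to `rm -rf`. This is a sketch under my own assumptions, not part of the original procedure: the function name `backup_etcd_data` and the backup location next to the data directory are hypothetical. Run it only after the manifests have been moved away, so that etcd is stopped.

```shell
# Move the etcd member data aside instead of deleting it.
# $1 = data dir (default /var/lib/etcd); prints the backup location.
backup_etcd_data() {
    data_dir="${1:-/var/lib/etcd}"
    backup="${data_dir}.bak.$(date +%Y%m%d%H%M%S)"
    mkdir -p "$backup" || return 1
    # The etcd log above reports its member directory as <data-dir>/member
    if [ -d "$data_dir/member" ]; then
        mv "$data_dir/member" "$backup/" || return 1
    fi
    printf '%s\n' "$backup"
}
```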
Confirm that both etcd and the apiserver on the node have stopped:
┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get pod -A -o wide | grep apiserver
kube-system kube-apiserver-vms101.liruilongs.github.io 1/1 Running 272 (4h15m ago) 415d 192.168.26.101 vms101.liruilongs.github.io <none> <none>
kube-system kube-apiserver-vms102.liruilongs.github.io 1/1 Running 246 (4h15m ago) 415d 192.168.26.102 vms102.liruilongs.github.io <none> <none>
┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get pod -A -o wide | grep etcd
kube-system etcd-vms101.liruilongs.github.io 1/1 Running 167 (4h15m ago) 415d 192.168.26.101 vms101.liruilongs.github.io <none> <none>
kube-system etcd-vms102.liruilongs.github.io 1/1 Running 173 (4h15m ago) 415d 192.168.26.102 vms102.liruilongs.github.io <none> <none>
┌──[root@vms100.liruilongs.github.io]-[~]
└─$
Get the failed member's ID. The following operations are run on a healthy etcd node; alternatively, point --endpoints at a healthy member.
┌──[root@vms101.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl --endpoints https://192.168.26.101:2379 --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key" --cacert="/etc/kubernetes/pki/etcd/ca.crt" member list -w table
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
| ee392e5273e89e2 | started | vms100.liruilongs.github.io | https://192.168.26.100:2380 | https://192.168.26.100:2379 |
| 70059e836d19883d | started | vms101.liruilongs.github.io | https://192.168.26.101:2380 | https://192.168.26.101:2379 |
| b8cb9f66c2e63b91 | started | vms102.liruilongs.github.io | https://192.168.26.102:2380 | https://192.168.26.102:2379 |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
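When scripting this step, the member ID can be picked out of the table by name rather than copied by hand. This is a minimal sketch that parses the `-w table` layout shown above; the function name `member_id_by_name` is hypothetical.

```shell
# Read `etcdctl member list -w table` output on stdin and print the ID
# of the member whose NAME column equals $1.
member_id_by_name() {
    awk -F'|' -v name="$1" '
        NF >= 6 {
            id = $2; n = $4
            gsub(/[ \t]/, "", id)
            gsub(/[ \t]/, "", n)
            if (n == name) print id
        }'
}
# Usage: <etcdctl ... member list -w table> | member_id_by_name vms100.liruilongs.github.io
```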
Remove the failed member:
┌──[root@vms101.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl --endpoints https://127.0.0.1:2379 --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key" --cacert="/etc/kubernetes/pki/etcd/ca.crt" member remove ee392e5273e89e2
Member ee392e5273e89e2 removed from cluster 4816f346663d82a7
Add it back:
┌──[root@vms101.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl --endpoints https://127.0.0.1:2379 --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key" --cacert="/etc/kubernetes/pki/etcd/ca.crt" member add vms100.liruilongs.github.io --peer-urls=https://192.168.26.100:2380
Member 456f71fdc1ad9917 added to cluster 4816f346663d82a7
ETCD_NAME="vms100.liruilongs.github.io"
ETCD_INITIAL_CLUSTER="vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.26.100:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
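The `ETCD_*` lines printed by `member add` are the settings the rejoining member is expected to start with. Each etcd environment variable corresponds to a command-line flag, so they can be turned into the flags an etcd static Pod manifest uses. This is a sketch with the hypothetical function name `member_add_to_flags`; it only prints text and edits nothing.

```shell
# Read `etcdctl member add` output on stdin and print the matching
# --initial-cluster and --initial-cluster-state flags.
member_add_to_flags() {
    sed -n \
        -e 's/^ETCD_INITIAL_CLUSTER="\(.*\)"$/--initial-cluster=\1/p' \
        -e 's/^ETCD_INITIAL_CLUSTER_STATE="\(.*\)"$/--initial-cluster-state=\1/p'
}
```

Comparing these two flags with the `command` list in `/etc/kubernetes/manifests/etcd.yaml` on the rejoining node, before moving the manifest back, shows whether the node will start with the membership the cluster expects.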
Back on the vms100 machine, move the YAML files back to restore the node:
┌──[root@vms100.liruilongs.github.io]-[~]
└─$mv /tmp/{etcd.yaml,kube-apiserver.yaml} /etc/kubernetes/manifests/
Confirm the Pod status:
┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get pod -A -o wide | grep etcd
kube-system etcd-vms100.liruilongs.github.io 1/1 Running 0 16s 192.168.26.100 vms100.liruilongs.github.io <none> <none>
kube-system etcd-vms101.liruilongs.github.io 1/1 Running 167 (4h32m ago) 415d 192.168.26.101 vms101.liruilongs.github.io <none> <none>
kube-system etcd-vms102.liruilongs.github.io 1/1 Running 173 (4h32m ago) 415d 192.168.26.102 vms102.liruilongs.github.io <none> <none>
┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get pod -A -o wide | grep apiserver
kube-system kube-apiserver-vms100.liruilongs.github.io 1/1 Running 0 24s 192.168.26.100 vms100.liruilongs.github.io <none> <none>
kube-system kube-apiserver-vms101.liruilongs.github.io 1/1 Running 272 (4h32m ago) 415d 192.168.26.101 vms101.liruilongs.github.io <none> <none>
kube-system kube-apiserver-vms102.liruilongs.github.io 1/1 Running 246 (4h32m ago) 415d 192.168.26.102 vms102.liruilongs.github.io <none> <none>
┌──[root@vms100.liruilongs.github.io]-[~]
└─$
Check the etcd cluster state:
┌──[root@vms101.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl --endpoints https://127.0.0.1:2379 --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key" --cacert="/etc/kubernetes/pki/etcd/ca.crt" member list -w table
+------------------+-----------+-----------------------------+-----------------------------+-----------------------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS |
+------------------+-----------+-----------------------------+-----------------------------+-----------------------------+
| 54952f3b494c0286 | unstarted | | https://192.168.26.100:2380 | |
| 70059e836d19883d | started | vms101.liruilongs.github.io | https://192.168.26.101:2380 | https://192.168.26.101:2379 |
| b8cb9f66c2e63b91 | started | vms102.liruilongs.github.io | https://192.168.26.102:2380 | https://192.168.26.102:2379 |
+------------------+-----------+-----------------------------+-----------------------------+-----------------------------+
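Here we find that the newly added member is not healthy: it stays in the unstarted state.
Running the etcdctl commands on the failed node shows that it has not joined the cluster at all and is running as a single-node cluster instead.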
┌──[root@vms100.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl --endpoints https://127.0.0.1:2379 --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key" --cacert="/etc/kubernetes/pki/etcd/ca.crt" member list -w table
+-----------------+---------+-----------------------------+-----------------------------+-----------------------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS |
+-----------------+---------+-----------------------------+-----------------------------+-----------------------------+
| ee392e5273e89e2 | started | vms100.liruilongs.github.io | https://192.168.26.100:2380 | https://192.168.26.100:2379 |
+-----------------+---------+-----------------------------+-----------------------------+-----------------------------+
┌──[root@vms100.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl --endpoints https://127.0.0.1:2379 --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key" --cacert="/etc/kubernetes/pki/etcd/ca.crt" endpoint status --cluster -w table
+-----------------------------+-----------------+---------+---------+-----------+-----------+------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------+-----------------+---------+---------+-----------+-----------+------------+
| https://192.168.26.100:2379 | ee392e5273e89e2 | 3.5.4 | 815 kB | true | 2 | 2261 |
+-----------------------------+-----------------+---------+---------+-----------+-----------+------------+
┌──[root@vms100.liruilongs.github.io]-[~]
└─$
It has not synced the current cluster's data either:
┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get pod -A -o wide --server=https://vms100.liruilongs.github.io:6443
No resources found
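A quick way to spot this split is to compare how many members each node reports. The sketch below counts the data rows of a `member list -w table` output; the function name `member_count` is hypothetical.

```shell
# Read `etcdctl member list -w table` output on stdin and print the
# number of members (data rows, excluding the header).
member_count() {
    awk -F'|' '
        NF >= 6 {
            id = $2
            gsub(/[ \t]/, "", id)
            if (id != "ID") n++
        }
        END { print n + 0 }'
}
```

Run against the tables above, the failed node reports 1 member while a healthy node reports 3, which is a clear sign that the two are not looking at the same cluster.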
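When this happens, the cause is usually the etcd manifest on one of the nodes. In my case the failed node's etcd manifest carried no configuration for the rest of the cluster, so that information has to be written into it.
The original manifest: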
┌──[root@vms100.liruilongs.github.io]-[~]
└─$cat /etc/kubernetes/manifests/etcd.yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubeadm.kubernetes.io/etcd.advertise-client-urls: https://192.168.26.100:2379
  creationTimestamp: null
  labels:
    component: etcd
    tier: control-plane
  name: etcd
  namespace: kube-system
spec:
  containers:
  - command:
    - etcd
    - --advertise-client-urls=https://192.168.26.100:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --experimental-initial-corrupt-check=true
    - --experimental-watch-progress-notify-interval=5s
    - --initial-advertise-peer-urls=https://192.168.26.100:2380
    - --initial-cluster=vms100.liruilongs.github.io=https://192.168.26.100:2380
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://192.168.26.100:2379
    - --listen-metrics-urls=http://127.0.0.1:2381
    - --listen-peer-urls=https://192.168.26.100:2380
    - --name=vms100.liruilongs.github.io
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    image: registry.aliyuncs.com/google_containers/etcd:3.5.4-0
................
The cluster information was incomplete. The manifest after adding it:
┌──[root@vms100.liruilongs.github.io]-[~]
└─$cat /etc/kubernetes/manifests/etcd.yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubeadm.kubernetes.io/etcd.advertise-client-urls: https://192.168.26.100:2379
  creationTimestamp: null
  labels:
    component: etcd
    tier: control-plane
  name: etcd
  namespace: kube-system
spec:
  containers:
  - command:
    - etcd
    - --advertise-client-urls=https://192.168.26.100:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --experimental-initial-corrupt-check=true
    - --experimental-watch-progress-notify-interval=5s
    - --initial-advertise-peer-urls=https://192.168.26.100:2380
    - --initial-cluster=vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380
    - --initial-cluster-state=existing
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://192.168.26.100:2379
    - --listen-metrics-urls=http://127.0.0.1:2381
    - --listen-peer-urls=https://192.168.26.100:2380
    - --name=vms100.liruilongs.github.io
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
We then repeat the recovery in the same way as above, and this time the node does not come up at all:
┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get pod -A -o wide | grep apiserver
kube-system kube-apiserver-vms100.liruilongs.github.io 0/1 CrashLoopBackOff 1 (18s ago) 39s 192.168.26.100 vms100.liruilongs.github.io <none> <none>
kube-system kube-apiserver-vms101.liruilongs.github.io 1/1 Running 272 (5h29m ago) 415d 192.168.26.101 vms101.liruilongs.github.io <none> <none>
kube-system kube-apiserver-vms102.liruilongs.github.io 1/1 Running 246 (5h29m ago) 415d 192.168.26.102 vms102.liruilongs.github.io <none> <none>
┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get pod -A -o wide | grep etcd
kube-system etcd-vms100.liruilongs.github.io 0/1 CrashLoopBackOff 3 (21s ago) 53s 192.168.26.100 vms100.liruilongs.github.io <none> <none>
kube-system etcd-vms101.liruilongs.github.io 1/1 Running 167 (5h29m ago) 415d 192.168.26.101 vms101.liruilongs.github.io <none> <none>
kube-system etcd-vms102.liruilongs.github.io 1/1 Running 173 (5h29m ago) 415d 192.168.26.102 vms102.liruilongs.github.io <none> <none>
Check the logs:
┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl logs etcd-vms100.liruilongs.github.io -n kube-system
.............................
{"level":"fatal","ts":"2024-03-16T16:25:19.981Z","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"error validating peerURLs {ClusterID:4816f346663d82a7 Members:[&{ID:b8cb9f66c2e63b91 RaftAttributes:{PeerURLs:[https://192.168.26.102:2380] IsLearner:false} Attributes:{Name:vms102.liruilongs.github.io ClientURLs:[https://192.168.26.102:2379]}} &{ID:3fbbbed942c51f7b RaftAttributes:{PeerURLs:[https://192.168.26.100:2380] IsLearner:false} Attributes:{Name: ClientURLs:[]}} &{ID:70059e836d19883d RaftAttributes:{PeerURLs:[https://192.168.26.101:2380] IsLearner:false} Attributes:{Name:vms101.liruilongs.github.io ClientURLs:[https://192.168.26.101:2379]}}] RemovedMemberIDs:[]}: member count is unequal","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}
The useful part of the log is RemovedMemberIDs:[]}: member count is unequal, that is, the member counts do not match. Looking at the log in more detail:
{"level": "info","ts": "2024-03-16T16:25:19.961Z","caller": "etcdmain/etcd.go:73","msg": "Running: ","args": ["etcd","--advertise-client-urls=https://192.168.26.100:2379","--cert-file=/etc/kubernetes/pki/etcd/server.crt","--client-cert-auth=true","--data-dir=/var/lib/etcd","--experimental-initial-corrupt-check=true","--experimental-watch-progress-notify-interval=5s","--initial-advertise-peer-urls=https://192.168.26.100:2380","--initial-cluster=vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380","--initial-cluster-state=existing","--key-file=/etc/kubernetes/pki/etcd/server.key","--listen-client-urls=https://127.0.0.1:2379,https://192.168.26.100:2379","--listen-metrics-urls=http://127.0.0.1:2381","--listen-peer-urls=https://192.168.26.100:2380","--name=vms100.liruilongs.github.io","--peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt","--peer-client-cert-auth=true","--peer-key-file=/etc/kubernetes/pki/etcd/peer.key","--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt","--snapshot-count=10000","--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt"]
}
..............................................................................
{"level": "warn","ts": "2024-03-16T16:25:19.981Z","caller": "etcdmain/etcd.go:146","msg": "failed to start etcd","error": "error validating peerURLs {ClusterID:4816f346663d82a7 Members:[&{ID:b8cb9f66c2e63b91 RaftAttributes:{PeerURLs:[https://192.168.26.102:2380] IsLearner:false} Attributes:{Name:vms102.liruilongs.github.io ClientURLs:[https://192.168.26.102:2379]}} &{ID:3fbbbed942c51f7b RaftAttributes:{PeerURLs:[https://192.168.26.100:2380] IsLearner:false} Attributes:{Name: ClientURLs:[]}} &{ID:70059e836d19883d RaftAttributes:{PeerURLs:[https://192.168.26.101:2380] IsLearner:false} Attributes:{Name:vms101.liruilongs.github.io ClientURLs:[https://192.168.26.101:2379]}}] RemovedMemberIDs:[]}: member count is unequal"
}
{"level": "fatal","ts": "2024-03-16T16:25:19.981Z","caller": "etcdmain/etcd.go:204","msg": "discovery failed","error": "error validating peerURLs {ClusterID:4816f346663d82a7 Members:[&{ID:b8cb9f66c2e63b91 RaftAttributes:{PeerURLs:[https://192.168.26.102:2380] IsLearner:false} Attributes:{Name:vms102.liruilongs.github.io ClientURLs:[https://192.168.26.102:2379]}} &{ID:3fbbbed942c51f7b RaftAttributes:{PeerURLs:[https://192.168.26.100:2380] IsLearner:false} Attributes:{Name: ClientURLs:[]}} &{ID:70059e836d19883d RaftAttributes:{PeerURLs:[https://192.168.26.101:2380] IsLearner:false} Attributes:{Name:vms101.liruilongs.github.io ClientURLs:[https://192.168.26.101:2379]}}] RemovedMemberIDs:[]}: member count is unequal","stacktrace": "go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"
}
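The error lists three members, including vms102.liruilongs.github.io, while the --initial-cluster flag the failed node started with names only two, so the problem appears to involve the vms102.liruilongs.github.io entry.
Let's look at the manifest on vms102.liruilongs.github.io: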
┌──[root@vms102.liruilongs.github.io]-[~]
└─$cat /etc/kubernetes/manifests/etcd.yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubeadm.kubernetes.io/etcd.advertise-client-urls: https://192.168.26.102:2379
  creationTimestamp: null
  labels:
    component: etcd
    tier: control-plane
  name: etcd
  namespace: kube-system
spec:
  containers:
  - command:
    - etcd
    - --advertise-client-urls=https://192.168.26.102:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --experimental-initial-corrupt-check=true
    - --experimental-watch-progress-notify-interval=5s
    - --initial-advertise-peer-urls=https://192.168.26.102:2380
    - --initial-cluster=vms100.liruilongs.github.io=https://192.168.26.100:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380
    - --initial-cluster-state=existing
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://192.168.26.102:2379
    - --listen-metrics-urls=http://127.0.0.1:2381
    - --listen-peer-urls=https://192.168.26.102:2380
    - --name=vms102.liruilongs.github.io
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
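Comparing the two manifests shows that the configuration I wrote for the failed node was still wrong: it was missing the vms102.liruilongs.github.io=https://192.168.26.102:2380 entry. The failed node's flag comes first, the one from vms102 second: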
"--initial-cluster=vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380",
"--initial-cluster=vms100.liruilongs.github.io=https://192.168.26.100:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380"
After fixing the configuration and repeating the same recovery flow as above, the node recovers.
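A mismatch like this can be caught before restarting by counting the entries in `--initial-cluster` and comparing the number with the member count the cluster reports. This is a minimal sketch; the function name `initial_cluster_count` is hypothetical and the default path is the manifest location used in this post.

```shell
# Print how many name=peerURL entries the --initial-cluster flag
# in an etcd static Pod manifest contains.
initial_cluster_count() {
    manifest="${1:-/etc/kubernetes/manifests/etcd.yaml}"
    grep -- '--initial-cluster=' "$manifest" \
        | head -n 1 \
        | sed 's/.*--initial-cluster=//' \
        | tr ',' '\n' \
        | grep -c '='
}
```

For the first corrected manifest this prints 2, while `member list` on a healthy node shows 3 members, which matches the "member count is unequal" error.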
Verify with etcdctl:
┌──[root@vms100.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl --endpoints https://127.0.0.1:2379 --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key" --cacert="/etc/kubernetes/pki/etcd/ca.crt" member list -w table
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
| 70059e836d19883d | started | vms101.liruilongs.github.io | https://192.168.26.101:2380 | https://192.168.26.101:2379 |
| ac5f6045dbe477b3 | started | vms100.liruilongs.github.io | https://192.168.26.100:2380 | https://192.168.26.100:2379 |
| b8cb9f66c2e63b91 | started | vms102.liruilongs.github.io | https://192.168.26.102:2380 | https://192.168.26.102:2379 |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
┌──[root@vms100.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl --endpoints https://127.0.0.1:2379 --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key" --cacert="/etc/kubernetes/pki/etcd/ca.crt" endpoint status --cluster -w table
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://192.168.26.101:2379 | 70059e836d19883d | 3.5.4 | 88 MB | false | 603 | 22227327 |
| https://192.168.26.100:2379 | ac5f6045dbe477b3 | 3.5.4 | 88 MB | false | 603 | 22227327 |
| https://192.168.26.102:2379 | b8cb9f66c2e63b91 | 3.5.4 | 88 MB | true | 603 | 22227327 |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
┌──[root@vms100.liruilongs.github.io]-[~]
└─$
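The same check can be automated by looking at the STATUS column of the member list. The sketch below succeeds only when every member is reported as started; the function name `all_members_started` is hypothetical.

```shell
# Read `etcdctl member list -w table` output on stdin.
# Exit status 0 if every member is "started", 1 otherwise.
all_members_started() {
    awk -F'|' '
        NF >= 6 {
            status = $3
            gsub(/[ \t]/, "", status)
            if (status != "STATUS" && status != "started") bad = 1
        }
        END { exit (bad ? 1 : 0) }'
}
# Usage: <etcdctl ... member list -w table> | all_members_started && echo "all members started"
```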
The failed node is recovered. In practice, after adding the member back, we need to confirm that the failed node's manifest is actually correct.
© 2018-2024 liruilonger@gmail.com, All rights reserved. Attribution-NonCommercial-ShareAlike (CC BY-NC-SA 4.0)