K8s cluster: fixing a failed etcd node
Posted by 烟拢寒水
1 Environment
k8s version: v1.20
Failed etcd node: 192.168.0.12
Error details:
4月 24 22:47:13 k8s-node2 etcd[9543]: {"level":"warn","ts":"2023-04-24T22:47:13.571+0800","caller":"etcdserver/server.go:2065","msg":"failed to publish local member to cluster through raft","local-member-id":"b8fffb7f5b2f26e","local-member-attributes":"{Name:etcd-3 ClientURLs:[https://192.168.0.12:2379]}","request-path":"/0/members/b8fffb7f5b2f26e/attributes","publish-timeout":"7s","error":"etcdserver: request timed out"}
2 Inspect the etcd cluster
/opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://192.168.0.5:2379,https://192.168.0.11:2379,https://192.168.0.12:2379" member list
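Every etcdctl command in this article repeats the same TLS and endpoint flags. As a minimal sketch, they can be wrapped once in a shell function. The names etcdctl_tls and member_id_by_name are hypothetical, and the ID lookup assumes the default comma-separated output of "member list" (ID, status, name, peer URLs, ...).

```shell
#!/usr/bin/env bash
# Paths and endpoints are the ones used in this article.
ETCDCTL_BIN="${ETCDCTL_BIN:-/opt/etcd/bin/etcdctl}"
ETCD_ENDPOINTS="https://192.168.0.5:2379,https://192.168.0.11:2379,https://192.168.0.12:2379"

# Hypothetical wrapper: run etcdctl with the shared TLS and endpoint flags.
etcdctl_tls() {
  "$ETCDCTL_BIN" \
    --cacert=/opt/etcd/ssl/ca.pem \
    --cert=/opt/etcd/ssl/server.pem \
    --key=/opt/etcd/ssl/server-key.pem \
    --endpoints="$ETCD_ENDPOINTS" \
    "$@"
}

# Hypothetical helper: read "member list" output on stdin and print the ID
# of the member whose name (third field) equals $1.
# Assumes the default simple format: ID, status, name, peer URLs, ...
member_id_by_name() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Usage:
#   etcdctl_tls member list
#   etcdctl_tls member list | member_id_by_name etcd-3
```

The wrapper only forwards its arguments, so any subcommand used later (member remove, member add) can be passed to it unchanged.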
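3 Remove the failed member
Use the member ID reported by member list (here b8fffb7f5b2f26e, which also appears as local-member-id in the log above).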
/opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://192.168.0.5:2379,https://192.168.0.11:2379,https://192.168.0.12:2379" member remove b8fffb7f5b2f26e
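4 Delete the failed node's data
Run this on the failed node (192.168.0.12). Stop etcd there first if it is still running.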
rm -rf /var/lib/etcd/default.etcd/member/
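Deleting the directory cannot be undone. A more cautious variant, offered as an alternative to the rm -rf above rather than as the original author's method, moves the member directory aside so it can still be inspected later. The function name set_aside_member_dir and the backup naming scheme are hypothetical; the data directory defaults to the path used in this article.

```shell
#!/usr/bin/env bash
# Hypothetical helper: move <data-dir>/member out of the data directory
# instead of deleting it. Run on the failed node with etcd stopped.
set_aside_member_dir() {
  local data_dir="${1:-/var/lib/etcd/default.etcd}"
  data_dir="${data_dir%/}"                      # drop a trailing slash
  local src="$data_dir/member"
  # The backup is placed next to the data directory, not inside it.
  local dst="$data_dir.member.bak.$(date +%Y%m%d%H%M%S)"
  if [ ! -d "$src" ]; then
    echo "nothing to move: $src does not exist" >&2
    return 1
  fi
  mv -- "$src" "$dst" && echo "$dst"            # print where the data went
}

# Usage:
#   set_aside_member_dir /var/lib/etcd/default.etcd
```

Once the node has rejoined and the cluster is healthy, the backup directory can be removed.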
5 Edit the etcd configuration file on the failed node
Change ETCD_INITIAL_CLUSTER_STATE from "new" to "existing":
#[Member]
ETCD_NAME="etcd-3"
ETCD_DATA_DIR="/var/lib/etcd/default.etcd"
ETCD_LISTEN_PEER_URLS="https://192.168.0.12:2380"
ETCD_LISTEN_CLIENT_URLS="https://192.168.0.12:2379"
#[Clustering]
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.0.12:2380"
ETCD_ADVERTISE_CLIENT_URLS="https://192.168.0.12:2379"
ETCD_INITIAL_CLUSTER="etcd-1=https://192.168.0.5:2380,etcd-2=https://192.168.0.11:2380,etcd-3=https://192.168.0.12:2380"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster"
ETCD_INITIAL_CLUSTER_STATE="existing"
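The same edit can be scripted. The sketch below assumes the environment-file format shown above. The default path /opt/etcd/cfg/etcd.conf is an assumption, since the article does not name the file, so pass the real path as the first argument. set_cluster_state_existing is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: set ETCD_INITIAL_CLUSTER_STATE to "existing".
set_cluster_state_existing() {
  local conf="${1:-/opt/etcd/cfg/etcd.conf}"   # assumed path; adjust to your install
  # Keep a .bak copy, rewrite only the ETCD_INITIAL_CLUSTER_STATE line,
  # then print the resulting line so the change can be verified.
  sed -i.bak 's/^ETCD_INITIAL_CLUSTER_STATE=.*/ETCD_INITIAL_CLUSTER_STATE="existing"/' "$conf" &&
    grep '^ETCD_INITIAL_CLUSTER_STATE=' "$conf"
}

# Usage:
#   set_cluster_state_existing /path/to/etcd.conf
```

Anchoring the pattern on the variable name leaves ETCD_INITIAL_CLUSTER and ETCD_INITIAL_CLUSTER_TOKEN untouched.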
6 Re-add the node to the cluster
/opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://192.168.0.5:2379,https://192.168.0.11:2379,https://192.168.0.12:2379" member add etcd-3 --peer-urls=https://192.168.0.12:2380
7 Restart etcd on the failed node
systemctl restart etcd
Then check the etcd service status.
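The article gives no command for the status check. One possible check, sketched here with the paths and endpoints used earlier, looks at the systemd unit first and then asks every endpoint for its health. The function name check_etcd is hypothetical.

```shell
#!/usr/bin/env bash
ETCDCTL_BIN="${ETCDCTL_BIN:-/opt/etcd/bin/etcdctl}"

# Hypothetical helper: verify the etcd unit is active, then query endpoint health.
check_etcd() {
  if ! systemctl is-active --quiet etcd; then
    echo "etcd unit is not active; see: journalctl -u etcd" >&2
    return 1
  fi
  "$ETCDCTL_BIN" \
    --cacert=/opt/etcd/ssl/ca.pem \
    --cert=/opt/etcd/ssl/server.pem \
    --key=/opt/etcd/ssl/server-key.pem \
    --endpoints="https://192.168.0.5:2379,https://192.168.0.11:2379,https://192.168.0.12:2379" \
    endpoint health
}

# Usage (on the repaired node):
#   check_etcd
```

All three endpoints should be reported as healthy once the repaired member has caught up with the cluster.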
8 Check the health of the k8s cluster
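The article does not show the commands for this step. A minimal sketch, assuming kubectl is configured with admin access on the machine where it runs: list the nodes, then read the API server's readiness report, which should include an etcd check. The function name check_k8s is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: basic cluster health checks after the etcd repair.
check_k8s() {
  # Every node should report STATUS Ready.
  kubectl get nodes || return 1
  # Verbose readiness report from the API server, one line per check.
  kubectl get --raw='/readyz?verbose'
}

# Usage:
#   check_k8s
```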
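Repairing a failed etcd node
Symptom: one etcd node fails to start with the error "etcd failed to get all reachable pages". Fix: remove the node from the cluster and add it back, as follows:
I. Remove the failed node from the cluster (run on a healthy node)
# List all etcd members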
etcdctl member list
# Remove the failed member
etcdctl member remove c13845537406e22f
II. Repair the failed node (run on the failed node)
# Update the configuration
sed -i "s#initial-cluster-state: 'new'#initial-cluster-state: 'existing'#" /etc/etcd/etcd.config.yml
# Clear the node's data (default path shown; adjust it to your setup)
rm -rf /var/lib/etcd/member
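III. Re-add the node (run on a healthy node)
# With the etcdctl v3 API, pass the peer URL as --peer-urls=https://192.168.216.242:2380 instead
etcdctl member add K8s-2 https://192.168.216.242:2380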
IV. Restart the failed node (run on the failed node)
systemctl restart etcd