K8s cluster: recovering from an etcd node failure

Posted by 烟拢寒水

1 Environment

Kubernetes version: v1.20

The etcd node at 192.168.0.12 has failed.

Error details:

 4月 24 22:47:13 k8s-node2 etcd[9543]: "level":"warn","ts":"2023-04-24T22:47:13.571+0800","caller":"etcdserver/server.go:2065","msg":"failed to publish local member to cluster through raft","local-member-id":"b8fffb7f5b2f26e","local-member-attributes":"Name:etcd-3 ClientURLs:[https://192.168.0.12:2379]","request-path":"/0/members/b8fffb7f5b2f26e/attributes","publish-timeout":"7s","error":"etcdserver: request timed out"

2 Inspect the etcd cluster

/opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://192.168.0.5:2379,https://192.168.0.11:2379,https://192.168.0.12:2379" member list
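
The TLS and endpoint flags repeat in every etcdctl command in this walkthrough. Below is a minimal sketch of a wrapper that supplies them once, assuming the certificate paths and endpoints shown above. The function name ectl and the ECTL_BIN, ECTL_SSL_DIR and ECTL_ENDPOINTS variables are made up for illustration and are not part of etcd.

```shell
# Hypothetical settings; override them before sourcing if your layout differs.
ECTL_BIN="${ECTL_BIN:-/opt/etcd/bin/etcdctl}"
ECTL_SSL_DIR="${ECTL_SSL_DIR:-/opt/etcd/ssl}"
ECTL_ENDPOINTS="${ECTL_ENDPOINTS:-https://192.168.0.5:2379,https://192.168.0.11:2379,https://192.168.0.12:2379}"

# ectl: run etcdctl with the TLS and endpoint flags prepended,
# followed by whatever subcommand and arguments the caller passes.
ectl() {
  "$ECTL_BIN" \
    --cacert="$ECTL_SSL_DIR/ca.pem" \
    --cert="$ECTL_SSL_DIR/server.pem" \
    --key="$ECTL_SSL_DIR/server-key.pem" \
    --endpoints="$ECTL_ENDPOINTS" \
    "$@"
}
```

With this sourced, ectl member list --write-out=table prints the member table, and ectl endpoint health reports each endpoint separately, so an unreachable member such as 192.168.0.12 should stand out.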

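3 Remove the failed member

The member ID passed to member remove comes from the member list output; here it matches the local-member-id in the error log above.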

/opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://192.168.0.5:2379,https://192.168.0.11:2379,https://192.168.0.12:2379" member remove b8fffb7f5b2f26e
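
Copying the member ID by hand is easy to get wrong. As a rough sketch, the ID can be looked up by member name from the member list output. The helper name member_id_by_name is hypothetical, and the field order it relies on (ID, status, name, peer URLs, client URLs) is an assumption about etcdctl's default "simple" output format, so check it against your own output first.

```shell
# member_id_by_name NAME
# Reads `etcdctl member list` output (default "simple" format) on stdin and
# prints the ID of the member whose name field equals NAME.
# Assumed field order: ID, status, name, peer URLs, client URLs[, learner flag].
member_id_by_name() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}
```

Pipe the member list command from step 2 into member_id_by_name etcd-3 and pass the printed ID to member remove. If nothing is printed, no member with that name was found and nothing should be removed.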

4 Delete the failed node's data (run on the failed node)

rm -rf /var/lib/etcd/default.etcd/member/
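
A more cautious variant, sketched here as an alternative to deleting outright, is to move the member directory aside so the old data can still be inspected later. The helper name set_aside_member_dir and the .bak.<timestamp> suffix are illustrative choices, not an etcd convention.

```shell
# set_aside_member_dir DATA_DIR
# Renames DATA_DIR/member to DATA_DIR/member.bak.<timestamp> and prints the
# new path. Fails without touching anything if DATA_DIR/member is missing.
set_aside_member_dir() {
  data_dir="$1"
  src="$data_dir/member"
  if [ ! -d "$src" ]; then
    echo "no member directory under $data_dir" >&2
    return 1
  fi
  dst="$data_dir/member.bak.$(date +%Y%m%d%H%M%S)"
  mv "$src" "$dst" && printf '%s\n' "$dst"
}
```

On the failed node this would be run as set_aside_member_dir /var/lib/etcd/default.etcd, ideally after systemctl stop etcd so the process is not still writing to the directory. The renamed copy can be deleted once the member has rejoined.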

5 Edit the etcd configuration file on the failed node

Change ETCD_INITIAL_CLUSTER_STATE from "new" to "existing":

#[Member]
ETCD_NAME="etcd-3"
ETCD_DATA_DIR="/var/lib/etcd/default.etcd"
ETCD_LISTEN_PEER_URLS="https://192.168.0.12:2380"
ETCD_LISTEN_CLIENT_URLS="https://192.168.0.12:2379"

#[Clustering]
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.0.12:2380"
ETCD_ADVERTISE_CLIENT_URLS="https://192.168.0.12:2379"
ETCD_INITIAL_CLUSTER="etcd-1=https://192.168.0.5:2380,etcd-2=https://192.168.0.11:2380,etcd-3=https://192.168.0.12:2380"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster"
ETCD_INITIAL_CLUSTER_STATE="existing"
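
The same edit can be scripted. This is a sketch only: the helper name set_cluster_state_existing is hypothetical, and the path of the configuration file is not given in this article, so it is passed in as an argument. The original file is kept with a .bak suffix.

```shell
# set_cluster_state_existing CONF_FILE
# Rewrites the ETCD_INITIAL_CLUSTER_STATE line to "existing", keeps the
# original as CONF_FILE.bak, then prints the resulting line. The final grep
# makes the function fail if the file has no such line at all.
set_cluster_state_existing() {
  conf="$1"
  sed -i.bak \
    's/^ETCD_INITIAL_CLUSTER_STATE=.*/ETCD_INITIAL_CLUSTER_STATE="existing"/' \
    "$conf" &&
    grep '^ETCD_INITIAL_CLUSTER_STATE=' "$conf"
}
```

Only the one line is rewritten; ETCD_NAME, the URLs and ETCD_INITIAL_CLUSTER stay as they are.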

6 Re-add the member to the cluster

/opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://192.168.0.5:2379,https://192.168.0.11:2379,https://192.168.0.12:2379" member add etcd-3 --peer-urls=https://192.168.0.12:2380

 

7 Restart etcd on the failed node

systemctl restart etcd

Then check the etcd service status.
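
A rejoining member may need some time to catch up, so a single status check right after the restart can be misleading. The following is a minimal retry sketch; wait_until is a hypothetical helper, and the attempt count and delay are arbitrary.

```shell
# wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs are used up, sleeping
# DELAY_SECONDS between tries. The command's own output is discarded.
wait_until() {
  attempts="$1"
  delay="$2"
  shift 2
  while [ "$attempts" -gt 0 ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    attempts=$((attempts - 1))
    if [ "$attempts" -gt 0 ]; then
      sleep "$delay"
    fi
  done
  return 1
}
```

For example, wait_until 30 2 systemctl is-active --quiet etcd waits up to about a minute for the unit to become active. The same helper can wrap the etcdctl command from step 2 with endpoint health in place of member list. If it times out, systemctl status etcd and journalctl -u etcd are the places to look.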

8 Check the Kubernetes cluster health
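
The original output for this step is not available. As one possible way to do the check, the sketch below assumes kubectl is configured with cluster-admin access; the function name check_k8s_health and the KUBECTL variable are illustrative.

```shell
# Hypothetical override so another kubectl binary or wrapper can be used.
KUBECTL="${KUBECTL:-kubectl}"

# check_k8s_health: list the nodes, ask the API server for its verbose
# readiness report (which includes an etcd check), and list the
# control-plane pods. Stops at the first command that fails.
check_k8s_health() {
  "$KUBECTL" get nodes -o wide &&
    "$KUBECTL" get --raw='/readyz?verbose' &&
    "$KUBECTL" get pods -n kube-system
}
```

After a successful recovery, the nodes should be Ready and the readiness report should list the etcd check as ok.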

 

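Repairing a failed etcd node

Symptom: an etcd node fails to start with the error: etcd failed to get all reachable pages

Fix: remove the member from the cluster and add it back, as follows: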

1. Remove the failed member from the cluster (run on a healthy node)

# List all etcd members
etcdctl member list

# Remove the failed member
etcdctl member remove c13845537406e22f

2. Repair the failed node (run on the failed node)

# Update the configuration
sed -i "s#initial-cluster-state: 'new'#initial-cluster-state: 'existing'#" /etc/etcd/etcd.config.yml

# Remove the node's data (default path shown; adjust to your setup)
rm -rf /var/lib/etcd/member

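3. Re-add the member (run on a healthy node)

# With the etcdctl v3 API, the peer URL is passed via --peer-urls
etcdctl member add K8s-2 --peer-urls=https://192.168.216.242:2380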

4. Restart the failed node (run on the failed node)

systemctl restart etcd
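
To confirm that the node really rejoined, the member list can be checked once more from a healthy node. The sketch below assumes etcdctl's default "simple" output, where the second comma-separated field is the member status; the helper name all_members_started is hypothetical.

```shell
# all_members_started
# Reads `etcdctl member list` output (default "simple" format) on stdin.
# Succeeds only if at least one member is listed and every status field
# (assumed to be field 2) is "started".
all_members_started() {
  awk -F', ' '
    NF > 0 { total++; if ($2 != "started") bad++ }
    END { exit (total > 0 && bad == 0) ? 0 : 1 }
  '
}
```

Running etcdctl member list | all_members_started && echo "all members started" gives a quick yes/no answer. A member that was added but has not come up yet is listed as unstarted, in which case the check fails and the etcd logs on that node are the next place to look.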
