Notes on a Forced VM Power-Off: K8s Cluster etcd Pod Down with Lost Snapshot (No Backup)

Posted by 山河已无恙


Preface

  • I accidentally pulled the wrong power cable; the VM was forcibly shut down, and after it came back up the cluster was dead
  • This post records how I dealt with it
  • The power loss destroyed etcd's snapshot data, and there was no backup, so a full recovery is essentially impossible
  • A professional DBA might be able to tell you whether the data can still be salvaged
  • The workaround in this post deletes some of the files in the etcd data directory
  • The cluster starts afterwards, but everything deployed in it is lost, including the CNI; even the cluster's built-in DNS component is gone
  • Corrections are welcome wherever my understanding falls short
  • Production or test, always back up your K8s cluster's etcd. Back up etcd. Back up etcd. Important things three times (a minimal backup sketch follows this list).
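
For reference, this is roughly what a periodic backup could have looked like: a minimal etcdctl snapshot sketch, assuming a kubeadm-style cluster whose certificate paths match the etcd.yaml shown later in this post (the /backup directory is a made-up example):

# hedged sketch: save an etcd v3 snapshot; adjust paths/endpoints to your cluster
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snap-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

Run daily from cron or a CronJob, this would have made the rest of this post unnecessary.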

What I truly wanted was nothing more than to live the life that was struggling to come out of me. Why was that so very difficult? ------ Hermann Hesse, Demian


Current state of the cluster

┌──[root@vms81.liruilongs.github.io]-[~]
└─$kubectl get nodes
The connection to the server 192.168.26.81:6443 was refused - did you specify the right host or port?

Restart docker and kubelet and try bringing things up:

┌──[root@vms81.liruilongs.github.io]-[~]
└─$systemctl restart docker
┌──[root@vms81.liruilongs.github.io]-[~]
└─$systemctl restart kubelet.service

Still no luck. Check the kubelet logs on the master node:

┌──[root@vms81.liruilongs.github.io]-[~]
└─$journalctl  -u kubelet.service -f
1月19 09:32:06 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:06.703418   11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
1月19 09:32:06 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:06.804201   11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
1月19 09:32:06 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:06.905156   11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
1月19 09:32:07 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:07.005487   11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
1月19 09:32:07 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:07.105648   11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"
1月19 09:32:07 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:07.186066   11344 controller.go:144] failed to ensure lease exists, will retry in 7s, error: Get "https://192.168.26.81:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/vms81.liruilongs.github.io?timeout=10s": dial tcp 192.168.26.81:6443: connect: connection refused
1月19 09:32:07 vms81.liruilongs.github.io kubelet[11344]: E0119 09:32:07.205785   11344 kubelet.go:2407] "Error getting node" err="node \"vms81.liruilongs.github.io\" not found"

Use docker to check which pods currently exist:

┌──[root@vms81.liruilongs.github.io]-[~]
└─$docker ps
CONTAINER ID   IMAGE                                               COMMAND                  CREATED          STATUS              PORTS     NAMES
d9d6471ce936   b51ddc1014b0                                        "kube-scheduler --au…"   17 minutes ago   Up 17 minutes                 k8s_kube-scheduler_kube-scheduler-vms81.liruilongs.github.io_kube-system_e1b874bfdef201d69db10b200b8f47d5_14
010c1b8c30c6   5425bcbd23c5                                        "kube-controller-man…"   17 minutes ago   Up 17 minutes                 k8s_kube-controller-manager_kube-controller-manager-vms81.liruilongs.github.io_kube-system_49b7654103f80170bfe29d034f806256_15
7e215924a1dd   registry.aliyuncs.com/google_containers/pause:3.5   "/pause"                 18 minutes ago   Up About a minute             k8s_POD_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_7
f557435d150e   registry.aliyuncs.com/google_containers/pause:3.5   "/pause"                 18 minutes ago   Up 18 minutes                 k8s_POD_kube-scheduler-vms81.liruilongs.github.io_kube-system_e1b874bfdef201d69db10b200b8f47d5_7
5deaffbc555a   registry.aliyuncs.com/google_containers/pause:3.5   "/pause"                 18 minutes ago   Up 18 minutes                 k8s_POD_kube-controller-manager-vms81.liruilongs.github.io_kube-system_49b7654103f80170bfe29d034f806256_7
a418c2ce33f2   registry.aliyuncs.com/google_containers/pause:3.5   "/pause"                 18 minutes ago   Up 18 minutes                 k8s_POD_kube-apiserver-vms81.liruilongs.github.io_kube-system_a35cb37b6c90c72f607936b33161eefe_6

etcd is not running, and neither is the apiserver.

┌──[root@vms81.liruilongs.github.io]-[~]
└─$docker ps -a | grep etcd
b5e18722315b   004811815584                                        "etcd --advertise-cl…"   5 minutes ago    Exited (2) About a minute ago             k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_19
7e215924a1dd   registry.aliyuncs.com/google_containers/pause:3.5   "/pause"                 21 minutes ago   Up 4 minutes                              k8s_POD_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_7

Try restarting etcd:

┌──[root@vms81.liruilongs.github.io]-[~]
└─$docker restart b5e18722315b
b5e18722315b

Check whether it came back up:

┌──[root@vms81.liruilongs.github.io]-[~]
└─$docker ps -a | grep etcd
b5e18722315b   004811815584                                        "etcd --advertise-cl…"   5 minutes ago    Exited (2) About a minute ago             k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_19
7e215924a1dd   registry.aliyuncs.com/google_containers/pause:3.5   "/pause"                 21 minutes ago   Up 4 minutes                              k8s_POD_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_7
┌──[root@vms81.liruilongs.github.io]-[~]
└─$docker logs b5e18722315b

Take a look at etcd's own logs:

┌──[root@vms81.liruilongs.github.io]-[~]
└─$docker logs 8a53cbc545e4
..................................................
"level":"info","ts":"2023-01-19T01:34:24.332Z","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"/var/lib/etcd/member/snap/db","took":"5.557212ms"
"level":"warn","ts":"2023-01-19T01:34:24.332Z","caller":"wal/util.go:90","msg":"ignored file in WAL directory","path":"0000000000000014-0000000000185aba.wal.broken"
"level":"info","ts":"2023-01-19T01:34:24.770Z","caller":"etcdserver/server.go:508","msg":"recovered v2 store from snapshot","snapshot-index":26912747,"snapshot-size":"42 kB"
"level":"warn","ts":"2023-01-19T01:34:24.771Z","caller":"snap/db.go:88","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":26912747,"snapshot-file-path":"/var/lib/etcd/member/snap/00000000019aa7eb.snap.db","error":"snap: snapshot file doesn't exist"
"level":"panic","ts":"2023-01-19T01:43:31.738Z","caller":"etcdserver/server.go:515","msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.NewServer\\n\\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:515\\ngo.etcd.io/etcd/server/v3/embed.StartEtcd\\n\\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/embed/etcd.go:244\\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\\n\\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:227\\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\\n\\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:122\\ngo.etcd.io/etcd/server/v3/etcdmain.Main\\n\\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/main.go:40\\nmain.main\\n\\t/tmp/etcd-release-3.5.0/etcd/release/etcd/server/main.go:32\\nruntime.main\\n\\t/home/remote/sbatsche/.gvm/gos/go1.16.3/src/runtime/proc.go:225"
panic: failed to recover v3 backend from snapshot

goroutine 1 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000114600, 0xc000588240, 0x1, 0x1)
        /home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc000080960, 0x122e2fc, 0x2a, 0xc000588240, 0x1, 0x1)
        /home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/etcdserver.NewServer(0x7ffe54af1e25, 0x1a, 0x0, 0x0, 0x0, 0x0, 0xc0004cf830, 0x1, 0x1, 0xc0004cfa70, ...)
        /tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdserver/server.go:515 +0x1656
go.etcd.io/etcd/server/v3/embed.StartEtcd(0xc0000ee000, 0xc0000ee600, 0x0, 0x0)
        /tmp/etcd-release-3.5.0/etcd/release/etcd/server/embed/etcd.go:244 +0xef8
go.etcd.io/etcd/server/v3/etcdmain.startEtcd(0xc0000ee000, 0x1202a6f, 0x6, 0xc000428401, 0x2)
        /tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:227 +0x32
go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2(0xc00003a120, 0x12, 0x12)
        /tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/etcd.go:122 +0x257a
go.etcd.io/etcd/server/v3/etcdmain.Main(0xc00003a120, 0x12, 0x12)
        /tmp/etcd-release-3.5.0/etcd/release/etcd/server/etcdmain/main.go:40 +0x11f
main.main()
        /tmp/etcd-release-3.5.0/etcd/release/etcd/server/main.go:32 +0x45

"msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","

“msg”: “从快照恢复v3后台失败”, “error”: “未能找到数据库快照文件(snap: 快照文件不存在)”,"

断电照成数据文件损坏了,它希望从快照中恢复,但是没有快照。
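
The panic names the exact file: /var/lib/etcd/member/snap/00000000019aa7eb.snap.db (the snapshot-index 26912747 from the log is 0x19AA7EB in hex). Assuming the default kubeadm data directory, checking for it should fail with "No such file or directory":

# the .snap metadata file with this index exists, but the .snap.db backend file does not
ls -l /var/lib/etcd/member/snap/00000000019aa7eb.snap.db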

With no backup, there is basically no way to repair this properly. The only clean way out is to reset the cluster with kubeadm.
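
For completeness: had a snapshot backup existed, recovery would have been an etcdctl snapshot restore into a fresh data directory. A hedged sketch only, reusing the member name and peer URL from the etcd.yaml shown below (the /backup path is a made-up example):

ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snap.db \
  --name=vms81.liruilongs.github.io \
  --initial-cluster=vms81.liruilongs.github.io=https://192.168.26.81:2380 \
  --initial-advertise-peer-urls=https://192.168.26.81:2380 \
  --data-dir=/var/lib/etcd-restored
# then move the restored directory over /var/lib/etcd (or repoint the static
# pod's --data-dir and hostPath volume) and let kubelet restart the pod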

Some remedial measures

If you still want to bring the cluster up by some other means, to salvage part of its current configuration, you can try the approach below. Be warned: after I used it, all pod data was gone and in the end I had to reset the cluster anyway.

If you go this route, be sure to back up the etcd data files before deleting anything.

On the master, etcd runs as a static pod, so look at its YAML manifest to find where the data directory is configured:

┌──[root@vms81.liruilongs.github.io]-[~]
└─$cd /etc/kubernetes/manifests/
┌──[root@vms81.liruilongs.github.io]-[/etc/kubernetes/manifests]
└─$ls
etcd.yaml  kube-apiserver.yaml  kube-controller-manager.yaml  kube-scheduler.yaml

- --data-dir=/var/lib/etcd

┌──[root@vms81.liruilongs.github.io]-[/etc/kubernetes/manifests]
└─$cat etcd.yaml | grep -e "--"
    - --advertise-client-urls=https://192.168.26.81:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --initial-advertise-peer-urls=https://192.168.26.81:2380
    - --initial-cluster=vms81.liruilongs.github.io=https://192.168.26.81:2380
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://192.168.26.81:2379
    - --listen-metrics-urls=http://127.0.0.1:2381
    - --listen-peer-urls=https://192.168.26.81:2380
    - --name=vms81.liruilongs.github.io
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt

The data directory looks like this. You can try to repair the files; if you just need the cluster to start quickly, you can back them up and then delete the snapshot and WAL files, as done below.

┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd/member]
└─$tree
.
├── snap
│   ├── 0000000000000058-00000000019a0ba7.snap
│   ├── 0000000000000058-00000000019a32b8.snap
│   ├── 0000000000000058-00000000019a59c9.snap
│   ├── 0000000000000058-00000000019a80da.snap
│   ├── 0000000000000058-00000000019aa7eb.snap
│   └── db
└── wal
    ├── 0000000000000014-0000000000185aba.wal.broken
    ├── 0000000000000142-0000000001963c0e.wal
    ├── 0000000000000143-0000000001977bbe.wal
    ├── 0000000000000144-0000000001986aa6.wal
    ├── 0000000000000145-0000000001995ef6.wal
    ├── 0000000000000146-00000000019a544d.wal
    └── 1.tmp

2 directories, 13 files

Note that the .snap metadata files are present, but the 00000000019aa7eb.snap.db file from the panic message is not. Back up the data files before deleting anything:

┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$ls
member
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$tar -cvf member.tar member/
member/
member/snap/
member/snap/db
member/snap/0000000000000058-00000000019a0ba7.snap
member/snap/0000000000000058-00000000019a32b8.snap
member/snap/0000000000000058-00000000019a59c9.snap
member/snap/0000000000000058-00000000019a80da.snap
member/snap/0000000000000058-00000000019aa7eb.snap
member/wal/
member/wal/0000000000000142-0000000001963c0e.wal
member/wal/0000000000000144-0000000001986aa6.wal
member/wal/0000000000000014-0000000000185aba.wal.broken
member/wal/0000000000000145-0000000001995ef6.wal
member/wal/0000000000000146-00000000019a544d.wal
member/wal/1.tmp
member/wal/0000000000000143-0000000001977bbe.wal
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$ls
member  member.tar
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$mv member.tar  /tmp/
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$rm -rf  member/snap/*.snap
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$rm -rf  member/wal/*.wal
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$
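
If deleting these files makes things worse, the tarball moved to /tmp above can be rolled back; a sketch:

# restore the saved (still broken, but intact) data files
cd /var/lib/etcd && rm -rf member && tar -xvf /tmp/member.tar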

Restart the etcd container with docker, or restart the kubelet:

┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$docker ps -a | grep etcd
a3b97cb34d9b   004811815584                                        "etcd --advertise-cl…"   2 minutes ago   Exited (2) 2 minutes ago              k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_45
7e215924a1dd   registry.aliyuncs.com/google_containers/pause:3.5   "/pause"                 3 hours ago     Up 2 hours                            k8s_POD_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_7
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$docker start a3b97cb34d9b
a3b97cb34d9b
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$docker ps -a | grep etcd
e1fc068247af   004811815584                                        "etcd --advertise-cl…"   3 seconds ago   Up 2 seconds                          k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_46
a3b97cb34d9b   004811815584                                        "etcd --advertise-cl…"   3 minutes ago   Exited (2) 3 seconds ago              k8s_etcd_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_45
7e215924a1dd   registry.aliyuncs.com/google_containers/pause:3.5   "/pause"                 3 hours ago     Up 2 hours                            k8s_POD_etcd-vms81.liruilongs.github.io_kube-system_1502584f9ab841720212d4341d723ba2_7
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$

Check the node status:

┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$kubectl get nodes
NAME                          STATUS   ROLES    AGE   VERSION
vms155.liruilongs.github.io   Ready    <none>   76s   v1.22.2
vms81.liruilongs.github.io    Ready    <none>   76s   v1.22.2
vms82.liruilongs.github.io    Ready    <none>   76s   v1.22.2
vms83.liruilongs.github.io    Ready    <none>   76s   v1.22.2
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd]
└─$

List the pods currently running in the cluster:

┌──[root@vms81.liruilongs.github.io]-[~/ansible/kubevirt]
└─$kubectl get pods -A
NAME                                                 READY   STATUS    RESTARTS         AGE
etcd-vms81.liruilongs.github.io                      1/1     Running   48 (3h35m ago)   3h53m
kube-apiserver-vms81.liruilongs.github.io            1/1     Running   48 (3h35m ago)   3h51m
kube-controller-manager-vms81.liruilongs.github.io   1/1     Running   17 (3h35m ago)   3h51m
kube-scheduler-vms81.liruilongs.github.io            1/1     Running   16 (3h35m ago)   3h52m

The network-related pods are all gone, and the cluster's DNS component never came up either, so the network has to be redeployed, which is a hassle. Strangely, when the network components are down all nodes should normally be NotReady, yet here they show Ready, which felt a bit off. Since I needed a working cluster for experiments, I ended up resetting it with kubeadm; before that, reapplying the Calico CNI looked like this:

┌──[root@vms81.liruilongs.github.io]-[~/ansible]
└─$kubectl apply -f calico.yaml
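
And the reset I finally fell back on. A hedged sketch only (this assumes a kubeadm cluster; it wipes /etc/kubernetes and the local etcd data, so it is strictly a last resort):

# on every node: tear the cluster down
kubeadm reset -f
# on the control plane: re-initialize, then rejoin workers and reapply the CNI;
# the exact flags depend on how the cluster was originally created
kubeadm init --apiserver-advertise-address=192.168.26.81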

References


https://github.com/etcd-io/etcd/issues/11949
