集群备库宕机后,检查与恢复的步骤
Posted 瀚高PG实验室
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了集群备库宕机后,检查与恢复的步骤相关的知识,希望对你有一定的参考价值。
目录
环境
症状
问题原因
解决方案
环境
系统平台:Linux x86-64 Red Hat Enterprise Linux 7,中标麒麟(CPU龙芯)7
版本:4.3.4.8,4.3.4
症状
现有集群环境下,怀疑备库宕机,可通过如下方法检查:
备库执行命令ps -ef |grep post,如无如下红色字样,则表示备库宕机。
[root@node2 ~]# ps -ef |grep post
root 11796 11682 0 16:45 pts/0 00:00:00 grep --color=auto post
root 19592 1 0 Jul28 ? 00:00:14 /opt/HighGo4.3.4.8-see/bin/postgres
root 19593 19592 0 Jul28 ? 00:00:00 postgres: logger process
root 19595 19592 0 Jul28 ? 00:00:00 postgres: checkpointer process
root 19596 19592 0 Jul28 ? 00:00:11 postgres: writer process
root 19597 19592 0 Jul28 ? 00:00:11 postgres: wal writer process
root 19598 19592 0 Jul28 ? 00:00:40 postgres: autovacuum launcher process
root 19599 19592 0 Jul28 ? 00:00:03 postgres: archiver process
root 19600 19592 0 Jul28 ? 00:00:36 postgres: stats collector process
root 19601 19592 0 Jul28 ? 00:05:58 postgres: bgworker: highgo cron
root 19602 19592 0 Jul28 ? 00:00:00 postgres: bgworker: logical replication launcher
问题原因
备库宕机的可能原因有很多,例如系统资源争用,内存不足,物理硬件故障,人为误操作,软件本身bug等等。需结合数据库日志和操作系统日志进行进一步判断。
数据库日志一般存在于数据库data目录下的hgdb_log或log中。
可通过psql命令连接数据库后执行如下操作进行确认:
```cpp
highgo=# show log_directory ;
log_directory
---------------
hgdb_log
(1 row)
解决方案
首先执行如下命令
[root@node2 data]# pkill repmgrd
尝试直接启动数据库备库。
[root@node1 ~]# pg_ctl start
waiting for server to start....2020-08-03 16:57:53.978 CST [12154] LOG: listening on IPv4 address "0.0.0.0", port 5866
2020-08-03 16:57:53.978 CST [12154] LOG: listening on IPv6 address "::", port 5866
2020-08-03 16:57:53.987 CST [12154] LOG: listening on Unix socket "/tmp/.s.PGSQL.5866"
2020-08-03 16:57:54.017 CST [12154] LOG: redirecting log output to logging collector process
2020-08-03 16:57:54.017 CST [12154] HINT: Future log output will appear in directory "hgdb_log".
done
server started
此处看到有提示server started,即代表启动成功。
接下来查询集群状态,在任意一节点执行如下系统命令:
[root@node1 ~]# repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Replication lag | Last replayed LSN
----+-------+---------+-----------+----------+----------+----------+-----------------+-------------------
1 | node1 | primary | * running | | default | 100 | n/a | none
2 | node2 | standby | running | node1 | default | 100 | 0 bytes | 0/EE001A78
标红处为0字节即为正常,备库恢复完毕,直接执行本文底部第5节启动守护进程即可。如果多次查询该值超过50M且无变化或增大,则说明主库上的wal日志被移除,备库断档,需要通过如下方式进行恢复。
1先停掉选中的备库,并移动原数据目录。
```cpp
[root@node2 ~]$ pg_ctl stop
[root@node2 HighGo4.3.4.8-see]$ mv data data1
2使用pg_basebackup或repmgr standby clone进行重做,node1是主库主机名,也可以写主库ip
[root@node2 HighGo4.3.4.8-see]$ pg_basebackup -h node1 -U sysdba -D $PGDATA -Fp -Xs -v -R
或
[root@node2 HighGo4.3.4.8-see]$ repmgr standby clone -h node1 -U sysdba -d highgo
Password:
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/A5000028 on timeline 27
pg_basebackup: starting background WAL receiver
pg_basebackup: write-ahead log end point: 0/A50000F8
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: base backup completed
3启动数据库
[root@node2 data]$ pg_ctl start
waiting for server to start....2020-04-29 14:16:50.403 HKT [8856] LOG: listening on IPv4 address "0.0.0.0", port 5866
2020-04-29 14:16:50.403 HKT [8856] LOG: listening on IPv6 address "::", port 5866
2020-04-29 14:16:50.405 HKT [8856] LOG: listening on Unix socket "/tmp/.s.PGSQL.5866"
2020-04-29 14:16:50.491 HKT [8856] LOG: redirecting log output to logging collector process
2020-04-29 14:16:50.491 HKT [8856] HINT: Future log output will appear in directory "hgdb_log".
done
server started
4检查集群状态,恢复正常
[root@node2 ~]# repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Replication lag | Last replayed LSN
----+-------+---------+-----------+----------+----------+----------+-----------------+-------------------
1 | node1 | primary | * running | | default | 100 | n/a | none
2 | node2 | standby | running | node1 | default | 100 | 0 bytes | 0/EE001A77
5启动守护进程
[root@node2 data]# repmgrd -d
更多详细信息请登录【瀚高技术支持平台】查看https://support.highgo.com/#/index/docContent/4e62e0a902d9bc01
以上是关于集群备库宕机后,检查与恢复的步骤的主要内容,如果未能解决你的问题,请参考以下文章
《叶问》34期,延迟从库加上MASTER_DELAY,主库宕机后如何快速恢复服务