repmgr 集群双主问题处理

Posted 2022-02-02 瀚高PG实验室

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了repmgr 集群双主问题处理相关的知识，希望对你有一定的参考价值。

目录
环境
症状
问题原因
解决方案

环境
系统平台：N/A
版本：5.6.5
症状
本文档旨在介绍出现集群出现双主问题时如何处理。

问题原因
repmgr 出现切换后，原主库停止服务。手动启动原主库会导致数据库集群状态异常，变为双主。

解决方案
原主库（192.168.80.221）查询集群状态，发现备库1（192.168.80.222）已经成为新的主库，但是原主库（192.168.80.221）仍然以主库的模式启动，这时候集群就出现了双主的问题。检查后发现，备库2（192.168.80.236）已经连接到了新主库（192.168.80.222），此时可将原主库（192.168.80.221）重新加入集群作为备库运行。

[highgo@localhost HighGo5.6.5-cluster]$ repmgr cluster show
ID | Name           | Role    | Status               | Upstream       | Location | Priority | Replication lag | Last replayed LSN
----+----------------+---------+----------------------+----------------+----------+----------+-----------------+-------------------
1  | 192.168.80.221 | primary | * running            |                | default  | 100      | n/a             | none            
2  | 192.168.80.222 | standby | ! running as primary | 192.168.80.221 | default  | 100      | 96 MB           | 2/D2013F28      
3  | 192.168.80.236 | standby |   running            | 192.168.80.221 | default  | 100      | 96 MB           | 2/D2013F28      


WARNING: following issues were detected
  - node "192.168.80.222" (ID: 2) is registered as standby but running as primary
[highgo@localhost HighGo5.6.5-cluster]$ pg_ctl stop -m f
等待服务器进程关闭 ..... 完成
服务器进程已经关闭

测试是否可以将本节点重新加入集群：

[highgo@localhost HighGo5.6.5-cluster]$ repmgr node rejoin  -d 'host=192.168.80.222 dbname=highgo user=highgo' --force-rewind --config-files=postgresql.conf --verbose --dry-run  
INFO: looking for configuration file in "/opt/HighGo5.6.5-cluster"
INFO: configuration file found at: "/opt/HighGo5.6.5-cluster/conf/hg_repmgr.conf"
DEBUG: set_config():
  SET synchronous_commit TO 'local'
DEBUG: get_primary_node_id():
SELECT node_id                   FROM repmgr.nodes     WHERE type = 'primary'    AND active IS TRUE 
DEBUG: get_node_record():
  SELECT n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name   FROM repmgr.nodes n  WHERE n.node_id = 2
DEBUG: connecting to: "user=highgo connect_timeout=2 dbname=highgo host=192.168.80.222 port=5866 fallback_application_name=repmgr"
DEBUG: set_config():
  SET synchronous_commit TO 'local'
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
INFO: replication connection to the rejoin target node was successful
INFO: local and rejoin target system identifiers match
DETAIL: system identifier is 6823598896860049498
DEBUG: local timeline: 25; rejoin target timeline: 26
DEBUG: get_timeline_history():
TIMELINE_HISTORY 26
DEBUG: local tli: 25; local_xlogpos: 2/DE000028; follow_target_history->tli: 25; follow_target_history->end: 2/D2013F28
NOTICE: pg_rewind execution required for this node to attach to rejoin target node 2
DETAIL: rejoin target server's timeline 26 forked off current database system timeline 25 before current recovery point 2/DE000028
DEBUG: guc_set():
SELECT true FROM pg_catalog.pg_settings  WHERE name = 'full_page_writes' AND setting = 'off'
DEBUG: guc_set():
SELECT true FROM pg_catalog.pg_settings  WHERE name = 'wal_log_hints' AND setting = 'on'
INFO: prerequisites for using pg_rewind are met
DEBUG: using archive directory "/tmp/repmgr-config-archive-192.168.80.221"
INFO: temporary archive directory "/tmp/repmgr-config-archive-192.168.80.221" created
INFO: file "postgresql.conf" would be copied to "/tmp/repmgr-config-archive-192.168.80.221/postgresql.conf"
INFO: 1 files would have been copied to "/tmp/repmgr-config-archive-192.168.80.221"
INFO: temporary archive directory "/tmp/repmgr-config-archive-192.168.80.221" deleted
INFO: pg_rewind would now be executed
DETAIL: pg_rewind command is:
  /opt/HighGo5.6.5-cluster/bin/pg_rewind -D '/opt/HighGo5.6.5-cluster/data' --source-server='host=192.168.80.222 user=highgo dbname=highgo port=5866 connect_timeout=2'
INFO: prerequisites for executing NODE REJOIN are met

正式将节点重新加入集群：

[highgo@localhost HighGo5.6.5-cluster]$ repmgr node rejoin  -d 'host=192.168.80.222 dbname=highgo user=highgo' --force-rewind --config-files=postgresql.conf --verbose
INFO: looking for configuration file in "/opt/HighGo5.6.5-cluster"
INFO: configuration file found at: "/opt/HighGo5.6.5-cluster/conf/hg_repmgr.conf"
DEBUG: set_config():
  SET synchronous_commit TO 'local'
DEBUG: get_primary_node_id():
SELECT node_id                   FROM repmgr.nodes     WHERE type = 'primary'    AND active IS TRUE 
DEBUG: get_node_record():
  SELECT n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name   FROM repmgr.nodes n  WHERE n.node_id = 2
DEBUG: connecting to: "user=highgo connect_timeout=2 dbname=highgo host=192.168.80.222 port=5866 fallback_application_name=repmgr"
DEBUG: set_config():
  SET synchronous_commit TO 'local'
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
DEBUG: local timeline: 25; rejoin target timeline: 26
DEBUG: get_timeline_history():
TIMELINE_HISTORY 26
DEBUG: local tli: 25; local_xlogpos: 2/DE000028; follow_target_history->tli: 25; follow_target_history->end: 2/D2013F28
NOTICE: pg_rewind execution required for this node to attach to rejoin target node 2
DETAIL: rejoin target server's timeline 26 forked off current database system timeline 25 before current recovery point 2/DE000028
DEBUG: guc_set():
SELECT true FROM pg_catalog.pg_settings  WHERE name = 'full_page_writes' AND setting = 'off'
DEBUG: guc_set():
SELECT true FROM pg_catalog.pg_settings  WHERE name = 'wal_log_hints' AND setting = 'on'
INFO: prerequisites for using pg_rewind are met
DEBUG: using archive directory "/tmp/repmgr-config-archive-192.168.80.221"
DEBUG: copying "postgresql.conf" to "/tmp/repmgr-config-archive-192.168.80.221/postgresql.conf"
INFO: 1 files copied to "/tmp/repmgr-config-archive-192.168.80.221"
NOTICE: executing pg_rewind
DETAIL: pg_rewind command is "/opt/HighGo5.6.5-cluster/bin/pg_rewind -D '/opt/HighGo5.6.5-cluster/data' --source-server='host=192.168.80.222 user=highgo dbname=highgo port=5866 connect_timeout=2'"
DEBUG: executing:
  /opt/HighGo5.6.5-cluster/bin/pg_rewind -D '/opt/HighGo5.6.5-cluster/data' --source-server='host=192.168.80.222 user=highgo dbname=highgo port=5866 connect_timeout=2'
DEBUG: result of command was 0 (13)
DEBUG: local_command(): output returned was:
servers diverged at WAL location 2/D2013F28 on timeline 25

DEBUG: using archive directory "/tmp/repmgr-config-archive-192.168.80.221"
DEBUG: copying "/tmp/repmgr-config-archive-192.168.80.221/postgresql.conf" to "/opt/HighGo5.6.5-cluster/data/postgresql.conf"
NOTICE: 1 files copied to /opt/HighGo5.6.5-cluster/data
INFO: directory "/tmp/repmgr-config-archive-192.168.80.221" deleted
INFO: deleting "recovery.done"
DEBUG: get_node_record():
  SELECT n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name   FROM repmgr.nodes n  WHERE n.node_id = 1
NOTICE: setting node 1's upstream to node 2
DEBUG: create_recovery_file(): creating "/opt/HighGo5.6.5-cluster/data/recovery.conf"...
DEBUG: recovery file is:
standby_mode = 'on'
primary_conninfo = 'user=highgo connect_timeout=2 host=192.168.80.222 port=5866 application_name=192.168.80.221'
recovery_target_timeline = 'latest'


DEBUG: is_server_available(): ping status for "host=192.168.80.221 user=highgo dbname=highgo password=highgo@123 port=5866 connect_timeout=2" is PQPING_NO_RESPONSE
WARNING: unable to ping "host=192.168.80.221 user=highgo dbname=highgo password=highgo@123 port=5866 connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "/opt/HighGo5.6.5-cluster/bin/pg_ctl  -w -D '/opt/HighGo5.6.5-cluster/data' start"
DEBUG: executing:
  /opt/HighGo5.6.5-cluster/bin/pg_ctl  -w -D '/opt/HighGo5.6.5-cluster/data' start
DEBUG: result of command was 0 (13)
DEBUG: local_command(): output returned was:
等待服务器进程启动 ....2020-06-25 09:41:54.523 CST [5308] WARNING:  01000: gdb version should large than 7.10


DEBUG: update_node_record_status():
    UPDATE repmgr.nodes      SET type = 'standby',          upstream_node_id = 2,          active = TRUE    WHERE node_id = 1
DEBUG: is_server_available(): ping status for "host=192.168.80.221 user=highgo dbname=highgo password=highgo@123 port=5866 connect_timeout=2" is PQPING_REJECT
WARNING: unable to ping "host=192.168.80.221 user=highgo dbname=highgo password=highgo@123 port=5866 connect_timeout=2"
DETAIL: PQping() returned "PQPING_REJECT"
INFO: waiting for node 1 to respond to pings; 1 of max 60 attempts
DEBUG: is_server_available(): ping status for "host=192.168.80.221 user=highgo dbname=highgo password=highgo@123 port=5866 connect_timeout=2" is PQPING_REJECT
WARNING: unable to ping "host=192.168.80.221 user=highgo dbname=highgo password=highgo@123 port=5866 connect_timeout=2"
DETAIL: PQping() returned "PQPING_REJECT"
DEBUG: sleeping 1 second waiting for node 1 to respond to pings; 2 of max 60 attempts
DEBUG: is_server_available(): ping status for "host=192.168.80.221 user=highgo dbname=highgo password=highgo@123 port=5866 connect_timeout=2" is PQPING_REJECT
WARNING: unable to ping "host=192.168.80.221 user=highgo dbname=highgo password=highgo@123 port=5866 connect_timeout=2"
DETAIL: PQping() returned "PQPING_REJECT"
DEBUG: sleeping 1 second waiting for node 1 to respond to pings; 3 of max 60 attempts
DEBUG: is_server_available(): ping status for "host=192.168.80.221 user=highgo dbname=highgo password=highgo@123 port=5866 connect_timeout=2" is PQPING_OK
INFO: demoted primary is pingable
INFO: node 1 has attached to its upstream node
DEBUG: _create_event(): event is "node_rejoin" for node 1
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
DEBUG: _create_event():
   INSERT INTO repmgr.events (              node_id,              event,              successful,              details             )       VALUES ($1, $2, $3, $4)    RETURNING event_timestamp
DEBUG: _create_event(): Event timestamp is "2020-06-30 10:15:47.522104+08"
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2

重新查询集群状态，集群状态正常。

[highgo@localhost HighGo5.6.5-cluster]$ repmgr cluster show
ID | Name           | Role    | Status               | Upstream       | Location | Priority | Replication lag | Last replayed LSN
----+----------------+---------+----------------------+----------------+----------+----------+-----------------+-------------------
1  | 192.168.80.221 | standby |   running            |  192.168.80.221 | default  | 100      | 0 MB           | 2/D4013F28            
2  | 192.168.80.222 | primary |   running            |                 | default  | 100      | n/a            | none      
3  | 192.168.80.236 | standby |   running            | 192.168.80.221  | default  | 100      | 0 MB           | 2/D4013F28

以上是关于repmgr 集群双主问题处理的主要内容，如果未能解决你的问题，请参考以下文章

HG_REPMGR启动失败排查和状态检查步骤

repmgr+pg12构建高可用集群

MySQL集群：双主模式

PostgreSQL集群管理—repmgr