Pgpool-II: Delegated IP is not available when disconnected Primary or Standby Node Failed

Posted 2023-03-10

【Title】: Pgpool-II: Delegated IP is not available when disconnected Primary or Standby Node Failed 【Published】: 2020-08-27 13:14:12 【Question】:

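I am trying to set up a two-node PostgreSQL cluster (one primary, one standby). To enable automatic failover, I am using Pgpool-II.

I followed this article: https://www.pgpool.net/docs/41/en/html/example-cluster.html. The only difference is that I installed PostgreSQL 12 instead of version 11.

Note that I am testing with two CentOS 7 images on Proxmox. I ran into the following problem:

When I run systemctl status pgpool.service on both nodes, it reports success. I can also reach PostgreSQL through the watchdog delegate IP.

But when I test failover, everything goes wrong.

As soon as I stop one of the servers, the delegate IP stops responding, so the database becomes unavailable. As soon as I start that node again, the delegate IP is available again.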
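One way to observe the symptom described above from a client machine is to probe the delegate IP on the Pgpool-II port while stopping and starting each node. The following is a minimal sketch, not part of the original setup; the helper name `tcp_port_open` is made up for illustration, and the address 172.16.0.151 and port 9999 are taken from the logs below.

```python
import socket


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `tcp_port_open("172.16.0.151", 9999)` would be expected to return False while the delegate IP is not held by any node. A successful TCP connection only shows that something is listening on that address; it does not prove that PostgreSQL queries succeed.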

############## Log node 1

Stop

db0 pgpool[44615]: [1-1] 2020-05-11 23:31:55: pid 44615: LOG:  stop request sent to pgpool. waiting for termination...
db0 pgpool[44104]: [27-1] 2020-05-11 23:31:55: pid 44104: LOG:  Watchdog is shutting down
db0 pgpool[44616]: [28-1] 2020-05-11 23:31:55: pid 44616: LOG:  watchdog: de-escalation started
db0 pgpool[44616]: [29-1] 2020-05-11 23:31:55: pid 44616: LOG:  successfully released the delegate IP:"172.16.0.151"
db0 pgpool[44616]: [29-2] 2020-05-11 23:31:55: pid 44616: DETAIL:  'if_down_cmd' returned with success

############### Log node 2

Stop node 1

db0 pgpool[44615]: [1-1] 2020-05-11 23:31:55: pid 44615: LOG:  stop request sent to pgpool. waiting for termination...
db0 pgpool[44104]: [27-1] 2020-05-11 23:31:55: pid 44104: LOG:  Watchdog is shutting down
db0 pgpool[44616]: [28-1] 2020-05-11 23:31:55: pid 44616: LOG:  watchdog: de-escalation started
db0 pgpool[44616]: [29-1] 2020-05-11 23:31:55: pid 44616: LOG:  successfully released the delegate IP:"172.16.0.151"
db0 pgpool[44616]: [29-2] 2020-05-11 23:31:55: pid 44616: DETAIL:  'if_down_cmd' returned with success

############## Log node 1

Start

db0 pgpool[44687]: [1-1] 2020-05-11 23:36:17: pid 44687: LOG:  memory cache initialized
db0 pgpool[44687]: [1-2] 2020-05-11 23:36:17: pid 44687: DETAIL:  memcache blocks :64
db0 pgpool[44687]: [2-1] 2020-05-11 23:36:17: pid 44687: LOG:  pool_discard_oid_maps: discarded memqcache oid maps
db0 pgpool[44687]: [3-1] 2020-05-11 23:36:17: pid 44687: LOG:  waiting for watchdog to initialize
db0 pgpool[44689]: [3-1] 2020-05-11 23:36:17: pid 44689: LOG:  setting the local watchdog node name to "db0:9999 Linux db0"
db0 pgpool[44689]: [4-1] 2020-05-11 23:36:17: pid 44689: LOG:  watchdog cluster is configured with 1 remote nodes
db0 pgpool[44689]: [5-1] 2020-05-11 23:36:17: pid 44689: LOG:  watchdog remote node:0 on db1:9000
db0 pgpool[44689]: [6-1] 2020-05-11 23:36:17: pid 44689: LOG:  interface monitoring is disabled in watchdog
db0 pgpool[44689]: [7-1] 2020-05-11 23:36:17: pid 44689: LOG:  watchdog node state changed from [DEAD] to [LOADING]
db0 pgpool[44689]: [8-1] 2020-05-11 23:36:17: pid 44689: LOG:  new outbound connection to db1:9000
db0 pgpool[44689]: [9-1] 2020-05-11 23:36:17: pid 44689: LOG:  setting the remote node "db1:9999 Linux db1" as watchdog cluster master
db0 pgpool[44689]: [10-1] 2020-05-11 23:36:17: pid 44689: LOG:  watchdog node state changed from [LOADING] to [INITIALIZING]
db0 pgpool[44689]: [11-1] 2020-05-11 23:36:17: pid 44689: LOG:  new watchdog node connection is received from "172.16.0.152:30404"
db0 pgpool[44689]: [12-1] 2020-05-11 23:36:17: pid 44689: LOG:  new node joined the cluster hostname:"db1" port:9000 pgpool_port:9999
db0 pgpool[44689]: [12-2] 2020-05-11 23:36:17: pid 44689: DETAIL:  Pgpool-II version:"4.1.1" watchdog messaging version: 1.1
db0 pgpool[44689]: [13-1] 2020-05-11 23:36:18: pid 44689: LOG:  watchdog node state changed from [INITIALIZING] to [STANDBY]
db0 pgpool[44689]: [14-1] 2020-05-11 23:36:18: pid 44689: LOG:  successfully joined the watchdog cluster as standby node
db0 pgpool[44689]: [14-2] 2020-05-11 23:36:18: pid 44689: DETAIL:  our join coordinator request is accepted by cluster leader node "db1:9999 Linux db1"
db0 pgpool[44687]: [4-1] 2020-05-11 23:36:18: pid 44687: LOG:  watchdog process is initialized
db0 pgpool[44687]: [4-2] 2020-05-11 23:36:18: pid 44687: DETAIL:  watchdog messaging data version: 1.1
db0 pgpool[44689]: [15-1] 2020-05-11 23:36:18: pid 44689: LOG:  new IPC connection received
db0 pgpool[44689]: [16-1] 2020-05-11 23:36:18: pid 44689: LOG:  new IPC connection received
db0 pgpool[44687]: [5-1] 2020-05-11 23:36:18: pid 44687: LOG:  we have joined the watchdog cluster as STANDBY node
db0 pgpool[44687]: [5-2] 2020-05-11 23:36:18: pid 44687: DETAIL:  syncing the backend states from the MASTER watchdog node
db0 pgpool[44690]: [5-1] 2020-05-11 23:36:18: pid 44690: LOG:  2 watchdog nodes are configured for lifecheck
db0 pgpool[44689]: [17-1] 2020-05-11 23:36:18: pid 44689: LOG:  new IPC connection received
db0 pgpool[44690]: [6-1] 2020-05-11 23:36:18: pid 44690: LOG:  watchdog nodes ID:0 Name:"db0:9999 Linux db0"
db0 pgpool[44690]: [6-2] 2020-05-11 23:36:18: pid 44690: DETAIL:  Host:"db0" WD Port:9000 pgpool-II port:9999
db0 pgpool[44690]: [7-1] 2020-05-11 23:36:18: pid 44690: LOG:  watchdog nodes ID:1 Name:"db1:9999 Linux db1"
db0 pgpool[44690]: [7-2] 2020-05-11 23:36:18: pid 44690: DETAIL:  Host:"db1" WD Port:9000 pgpool-II port:9999
db0 pgpool[44689]: [18-1] 2020-05-11 23:36:18: pid 44689: LOG:  received the get data request from local pgpool-II on IPC interface
db0 pgpool[44689]: [19-1] 2020-05-11 23:36:18: pid 44689: LOG:  get data request from local pgpool-II node received on IPC interface is forwarded to master watchdog node "db1:9999 Linux db1"
db0 pgpool[44689]: [19-2] 2020-05-11 23:36:18: pid 44689: DETAIL:  waiting for the reply...
db0 pgpool[44687]: [6-1] 2020-05-11 23:36:18: pid 44687: LOG:  master watchdog node "db1:9999 Linux db1" returned status for 2 backend nodes
db0 pgpool[44687]: [7-1] 2020-05-11 23:36:18: pid 44687: LOG:  backend:0 is set to UP status
db0 pgpool[44687]: [7-2] 2020-05-11 23:36:18: pid 44687: DETAIL:  backend:0 is UP on cluster master "db1:9999 Linux db1"
db0 pgpool[44687]: [8-1] 2020-05-11 23:36:18: pid 44687: LOG:  backend:1 is set to UP status
db0 pgpool[44687]: [8-2] 2020-05-11 23:36:18: pid 44687: DETAIL:  backend:1 is UP on cluster master "db1:9999 Linux db1"
db0 pgpool[44687]: [9-1] 2020-05-11 23:36:18: pid 44687: LOG:  Setting up socket for 0.0.0.0:9999
db0 pgpool[44687]: [10-1] 2020-05-11 23:36:18: pid 44687: LOG:  Setting up socket for :::9999
db0 pgpool[44725]: [11-1] 2020-05-11 23:36:18: pid 44725: LOG:  PCP process: 44725 started
db0 pgpool[44687]: [11-1] 2020-05-11 23:36:18: pid 44687: LOG:  pgpool-II successfully started. version 4.1.1 (karasukiboshi)

############### Log node 2

Start node 1

db1 pgpool[30154]: [39-1] 2020-05-11 23:36:17: pid 30154: LOG:  new watchdog node connection is received from "172.16.0.153:61085"
db1 pgpool[30154]: [40-1] 2020-05-11 23:36:17: pid 30154: LOG:  new node joined the cluster hostname:"db0" port:9000 pgpool_port:9999
db1 pgpool[30154]: [40-2] 2020-05-11 23:36:17: pid 30154: DETAIL:  Pgpool-II version:"4.1.1" watchdog messaging version: 1.1
db1 pgpool[30154]: [41-1] 2020-05-11 23:36:17: pid 30154: LOG:  The newly joined node:"db0:9999 Linux db0" had left the cluster because it was shutdown
db1 pgpool[30154]: [42-1] 2020-05-11 23:36:17: pid 30154: LOG:  new outbound connection to db0:9000
db1 pgpool[30154]: [43-1] 2020-05-11 23:36:18: pid 30154: LOG:  adding watchdog node "db0:9999 Linux db0" to the standby list
db1 pgpool[30154]: [44-1] 2020-05-11 23:36:18: pid 30154: LOG:  quorum found
db1 pgpool[30154]: [44-2] 2020-05-11 23:36:18: pid 30154: DETAIL:  starting escalation process
db1 pgpool[30154]: [45-1] 2020-05-11 23:36:18: pid 30154: LOG:  escalation process started with PID:30601
db1 pgpool[30601]: [45-1] 2020-05-11 23:36:18: pid 30601: LOG:  watchdog: escalation started
db1 pgpool[30152]: [14-1] 2020-05-11 23:36:18: pid 30152: LOG:  Pgpool-II parent process received watchdog quorum change signal from watchdog
db1 pgpool[30154]: [46-1] 2020-05-11 23:36:18: pid 30154: LOG:  new IPC connection received
db1 pgpool[30152]: [15-1] 2020-05-11 23:36:18: pid 30152: LOG:  watchdog cluster now holds the quorum
db1 pgpool[30152]: [15-2] 2020-05-11 23:36:18: pid 30152: DETAIL:  updating the state of quarantine backend nodes
db1 pgpool[30154]: [47-1] 2020-05-11 23:36:18: pid 30154: LOG:  new IPC connection received
db1 pgpool[30601]: [46-1] 2020-05-11 23:36:20: pid 30601: WARNING:  watchdog failed to ping host"172.16.0.151"
db1 pgpool[30601]: [46-2] 2020-05-11 23:36:20: pid 30601: DETAIL:  ping process exits with code: 2
db1 pgpool[30601]: [47-1] 2020-05-11 23:36:20: pid 30601: LOG:  waiting for the delegate IP address to become active
db1 pgpool[30601]: [47-2] 2020-05-11 23:36:20: pid 30601: DETAIL:  waiting... count: 1
db1 pgpool[30601]: [48-1] 2020-05-11 23:36:20: pid 30601: WARNING:  watchdog failed to ping host"172.16.0.151"
db1 pgpool[30601]: [48-2] 2020-05-11 23:36:20: pid 30601: DETAIL:  ping process exits with code: 2
db1 pgpool[30601]: [49-1] 2020-05-11 23:36:20: pid 30601: LOG:  waiting for the delegate IP address to become active
db1 pgpool[30601]: [49-2] 2020-05-11 23:36:20: pid 30601: DETAIL:  waiting... count: 2
db1 pgpool[30601]: [50-1] 2020-05-11 23:36:20: pid 30601: WARNING:  watchdog failed to ping host"172.16.0.151"
db1 pgpool[30601]: [50-2] 2020-05-11 23:36:20: pid 30601: DETAIL:  ping process exits with code: 2
db1 pgpool[30601]: [51-1] 2020-05-11 23:36:20: pid 30601: LOG:  waiting for the delegate IP address to become active
db1 pgpool[30601]: [51-2] 2020-05-11 23:36:20: pid 30601: DETAIL:  waiting... count: 3
db1 pgpool[30601]: [52-1] 2020-05-11 23:36:20: pid 30601: LOG:  failed to acquire the delegate IP address
db1 pgpool[30601]: [52-2] 2020-05-11 23:36:20: pid 30601: DETAIL:  'if_up_cmd' failed
db1 pgpool[30601]: [53-1] 2020-05-11 23:36:20: pid 30601: WARNING:  watchdog escalation failed to acquire delegate IP
db1 pgpool[30154]: [48-1] 2020-05-11 23:36:20: pid 30154: LOG:  watchdog escalation process with pid: 30601 exit with SUCCESS.
db1 pgpool[30157]: [11-1] 2020-05-11 23:36:29: pid 30157: LOG:  informing the node status change to watchdog
db1 pgpool[30157]: [11-2] 2020-05-11 23:36:29: pid 30157: DETAIL:  node id :1 status = "NODE ALIVE" message:"Heartbeat signal found"
db1 pgpool[30154]: [49-1] 2020-05-11 23:36:29: pid 30154: LOG:  new IPC connection received
db1 pgpool[30154]: [50-1] 2020-05-11 23:36:29: pid 30154: LOG:  received node status change ipc message
db1 pgpool[30154]: [50-2] 2020-05-11 23:36:29: pid 30154: DETAIL:  Heartbeat signal found
db1 pgpool[30154]: [51-1] 2020-05-11 23:36:29: pid 30154: LOG:  remote node "db0:9999 Linux db0" became reachable again
db1 pgpool[30154]: [51-2] 2020-05-11 23:36:29: pid 30154: DETAIL:  requesting the node info
db1 pgpool[30154]: [52-1] 2020-05-11 23:36:29: pid 30154: LOG:  remote node "db0:9999 Linux db0" is reachable again
db1 pgpool[30154]: [52-2] 2020-05-11 23:36:29: pid 30154: DETAIL:  trying to add it back as a standby
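The relevant entries in logs like these (quorum changes, escalation, and the 'if_up_cmd' / 'if_down_cmd' results) are easy to miss among the IPC messages. The sketch below filters them out; it assumes the syslog-style line format shown above, and the names `LINE_RE`, `KEYWORDS`, and `watchdog_events` are hypothetical helpers, not part of Pgpool-II.

```python
import re

# Matches lines such as:
# db1 pgpool[30601]: [52-1] 2020-05-11 23:36:20: pid 30601: LOG:  failed to acquire the delegate IP address
LINE_RE = re.compile(
    r"^(?P<host>\S+) pgpool\[(?P<pid>\d+)\]: \[\d+-\d+\] "
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): pid \d+: "
    r"(?P<level>[A-Z]+):\s+(?P<msg>.*)$"
)

# Substrings that mark watchdog and delegate IP activity in the logs above.
KEYWORDS = ("delegate IP", "if_up_cmd", "if_down_cmd", "quorum", "escalation")


def watchdog_events(lines):
    """Return (timestamp, host, level, message) for warnings and watchdog events."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # not a pgpool log line in the expected format
        msg = m.group("msg")
        if m.group("level") == "WARNING" or any(k in msg for k in KEYWORDS):
            events.append((m.group("ts"), m.group("host"), m.group("level"), msg))
    return events
```

Applied to the node 2 log above, this keeps the "quorum found", "escalation started", "failed to acquire the delegate IP address", and "'if_up_cmd' failed" entries and drops lines such as "new IPC connection received".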

【Answer 1】:

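I guess you need to enable this flag in pgpool.conf: enable_consensus_with_half_votes = on. By default the watchdog needs more than half of the votes for a quorum, so in a two-node cluster a single surviving node does not hold the quorum. See here for more details: https://www.pgpool.net/docs/latest/en/html/runtime-watchdog-config.html#GUC-ENABLE-CONSENSUS-WITH-HALF-VOTES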
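The effect of this setting can be illustrated with a little arithmetic. The sketch below is a simplified model of the majority rule as described in the Pgpool-II documentation, not Pgpool-II's actual code; the function name `has_quorum` and its parameters are made up for illustration.

```python
def has_quorum(total_nodes, alive_nodes, half_votes=False):
    """Simplified quorum rule for a watchdog cluster.

    half_votes=False: strictly more than half of the nodes must be alive.
    half_votes=True:  exactly half of the nodes is also enough
                      (models enable_consensus_with_half_votes = on).
    """
    if half_votes:
        return alive_nodes * 2 >= total_nodes
    return alive_nodes * 2 > total_nodes


# Two watchdog nodes, one of them stopped:
print(has_quorum(2, 1))                   # default rule
print(has_quorum(2, 1, half_votes=True))  # half of the votes is accepted
```

Under this model, one surviving node out of two lacks the quorum with the default rule and holds it once half of the votes is accepted. For an odd number of nodes the two rules give the same result, which is why a third watchdog node is the other common way to keep a quorum after a single failure.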
