How do you use the 5 Whys method to solve problems?

Posted 2023-04-13


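Asking "why" over and over is no child's game. It is in fact one of the most commonly used methods in root cause analysis (RCA). Its strength is that it needs no hypothesis testing, no data segmentation, and no advanced statistical tools.
The premise of the tool is to ask "why" five times. Doing so peels away the layers of symptoms observed in a given problem, and once those layers are gone the root cause is exposed. Besides locating the root cause, it also helps establish how the identified causes relate to one another. The technique is especially helpful for problems that involve human factors.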
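How do you apply the 5 Whys?
As one of the simplest risk-analysis tools, the 5 Whys is very easy to carry out. The steps are as follows:
Write down the problem: write the problem on a clean sheet of paper. This step is needed so that you become familiar with the problem at hand. Below the problem statement, describe it in full to add more detail.
Ask: start by asking why the problem occurred, and write the answer below the problem.
Dig deeper: if the answer you wrote does not give the root cause, ask "why" again and write that answer below.
Repeat: repeat the process until you reach the root cause. Five questions is a sensible target, but the process may reach the root cause with fewer or more.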
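When should you use the 5 Whys?
The 5 Whys is one of the simplest forms of problem solving. You can use it to improve the quality of organizational processes, and also to solve simple problems. For thorny problems, however, this risk-analysis tool stops being effective, especially when the problem has multiple root causes.
Even so, the 5 Whys is a good tool for finding the root cause of a problem. If one of your processes is running poorly, try it and see whether it helps you learn more about the problem.
Reference answer A
The 5 Whys is a widely used problem-solving method that helps us dig down to the root cause of a problem. The steps for using it are: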
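1. Define the problem: first state the problem clearly and make sure everyone has the same understanding of it.
2. Find the cause: ask yourself why the problem occurred, and write down the first cause.
3. Ask why again: for the first cause, ask why it happened, and write down the second cause. The answer should be the specific reason behind the first answer.
4. Keep asking why: for the second cause, ask why again, and write down the third cause. The answer should be the specific reason behind the second answer.
5. Repeat: carry on in the same way until you reach the root cause.
6. Solve the problem: once you have found the root cause, you can start fixing it. Draw up a solution and implement it so that the problem does not recur.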
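The steps above can be written down as a simple chain in which each answer becomes the subject of the next question. The following is only a minimal sketch: the `WhyStep` type and both helper functions are hypothetical names made up for illustration, and the "root cause" it returns is simply the last answer recorded, which still has to be judged by a person.

```c
#include <stddef.h>
#include <stdio.h>

/* One link in a why-chain: the question asked and the answer written down.
 * Hypothetical type, for illustration only. */
typedef struct {
    const char *question;
    const char *answer;
} WhyStep;

/* Returns the answer to the last "why" (the candidate root cause),
 * or NULL if nothing has been recorded yet. */
const char *root_cause(const WhyStep *chain, size_t len) {
    if (chain == NULL || len == 0) return NULL;
    return chain[len - 1].answer;
}

/* Prints the problem followed by every question/answer pair in order. */
void print_chain(const char *problem, const WhyStep *chain, size_t len) {
    printf("Problem: %s\n", problem);
    for (size_t i = 0; i < len; i++) {
        printf("  Why #%zu: %s\n    -> %s\n",
               i + 1, chain[i].question, chain[i].answer);
    }
}
```

The chain has no fixed length of five, matching the point that fewer or more questions may be needed.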

Redis source analysis: how Redis Sentinel actually solves the distributed consensus problem

Introduction

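The previous few posts drifted off course, which taught me that a series needs a guiding idea before it starts.
Those posts dwelt too much on the "what" and the "why", whereas my original goal for this series was the "how": how things are actually done!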

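So this post focuses on the following questions:
1. How do sentinels monitor nodes?
2. How do sentinels hold an election?
3. How does a replica get promoted?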

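If you are not familiar with the Raft distributed consensus algorithm, read up on it first: "Distributed consistency: the Raft algorithm". That article also links to material on distributed transactions, which is worth a look too. I will not explain the specialist terms again when they appear below.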

Sentinel structures

Each Sentinel node maintains its own view of the current state of the Sentinel cluster. That state is stored in the sentinelState struct:

/* Main state. */
struct sentinelState {
    char myid[CONFIG_RUN_ID_SIZE+1]; /* This sentinel ID. */
    uint64_t current_epoch;         /* Current epoch of the cluster, used for the
                                       Raft-style leader election during failover. */
    dict *masters;      /* Dictionary of master sentinelRedisInstances.
                           Key is the instance name, value is the
                           sentinelRedisInstance structure pointer. */
    ......
} sentinel;

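One point to be clear about: cluster nodes can fail, and sentinels can fail too.

The sentinelRedisInstance struct stores the instance data for the master and replica nodes that the Sentinel cluster monitors, as well as for the other Sentinel nodes.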


typedef struct sentinelRedisInstance {
    int flags;      /* Instance flags, see the SRI_* defines below. */
    char *name;     /* Master name from the point of view of this sentinel. */
    char *runid;    /* Run ID of this instance, or unique ID if is a Sentinel.*/
    uint64_t config_epoch;  /* Configuration epoch. */
    sentinelAddr *addr; /* Master host. */
    instanceLink *link; /* Link to the instance, may be shared for Sentinels. */
    mstime_t last_pub_time;   /* Last time we sent hello via Pub/Sub. */
    mstime_t last_hello_time; /* Only used if SRI_SENTINEL is set. Last time
                                 we received a hello from this Sentinel
                                 via Pub/Sub. */
    mstime_t last_master_down_reply_time; /* Time of last reply to
                                             SENTINEL is-master-down command. */
    mstime_t s_down_since_time; /* Subjectively down since time. */
    mstime_t o_down_since_time; /* Objectively down since time. */
    mstime_t down_after_period; /* Consider it down after that period. */
    ......
    /* Role and the first time we observed it.
     * This is useful in order to delay replacing what the instance reports
     * with our own configuration. We need to always wait some time in order
     * to give a chance to the leader to report the new configuration before
     * we do silly things. */
    int role_reported;
    mstime_t role_reported_time;
    mstime_t slave_conf_change_time; /* Last time slave master addr changed. */

    /* Master specific. */
    dict *sentinels;    /* Other sentinels monitoring the same master. */
    dict *slaves;       /* Slaves for this master instance. */
    unsigned int quorum;/* Number of sentinels that need to agree on failure. */
    int parallel_syncs; /* How many slaves to reconfigure at same time. */
    char *auth_pass;    /* Password to use for AUTH against master & replica. */
    char *auth_user;    /* Username for ACLs AUTH against master & replica. */

    /* Slave specific. */
    mstime_t master_link_down_time; /* Slave replication link down time. */
    int slave_priority; /* Slave priority according to its INFO output. */
    mstime_t slave_reconf_sent_time; /* Time at which we sent SLAVE OF <new> */
    struct sentinelRedisInstance *master; /* Master instance if it's slave. */
    char *slave_master_host;    /* Master host as reported by INFO */
    int slave_master_port;      /* Master port as reported by INFO */
    int slave_master_link_status; /* Master link status as reported by INFO */
    unsigned long long slave_repl_offset; /* Slave replication offset. */
    /* Failover */
    char *leader;       /* If this is a master instance, this is the runid of
                           the Sentinel that should perform the failover. If
                           this is a Sentinel, this is the runid of the Sentinel
                           that this Sentinel voted as leader. */
    uint64_t leader_epoch; /* Epoch of the 'leader' field. */
    uint64_t failover_epoch; /* Epoch of the currently started failover. */
    int failover_state; /* See SENTINEL_FAILOVER_STATE_* defines. */
    mstime_t failover_state_change_time;
    mstime_t failover_start_time;   /* Last failover attempt start time. */
    mstime_t failover_timeout;      /* Max time to refresh failover state. */
    mstime_t failover_delay_logged; /* For what failover_start_time value we
                                       logged the failover delay. */
    struct sentinelRedisInstance *promoted_slave; /* Promoted slave instance. */
    ......
} sentinelRedisInstance;

/* A Sentinel Redis Instance object is monitoring. */
#define SRI_MASTER  (1<<0)
#define SRI_SLAVE   (1<<1)
#define SRI_SENTINEL (1<<2)
#define SRI_S_DOWN (1<<3)   /* The instance is subjectively down. */
#define SRI_O_DOWN (1<<4)   /* The instance is objectively down. */
#define SRI_MASTER_DOWN (1<<5) /* A Sentinel with this flag set thinks that
                                   its master is down. */
#define SRI_FAILOVER_IN_PROGRESS (1<<6) /* A failover is in progress for this master. */
#define SRI_PROMOTED (1<<7)            /* The instance was selected for promotion. */
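
Because `flags` is a bit field, several of these states can be held at once and are tested with bitwise operators, which is what the listings later in this post do inline. A minimal sketch of the idiom follows; the four helper functions are hypothetical names of my own and are not part of sentinel.c, and only the flag values are copied from the defines above.

```c
#include <stdbool.h>

/* Flag values copied from the defines above. */
#define SRI_MASTER   (1<<0)
#define SRI_SLAVE    (1<<1)
#define SRI_SENTINEL (1<<2)
#define SRI_S_DOWN   (1<<3)
#define SRI_O_DOWN   (1<<4)

/* Hypothetical helpers (not in sentinel.c) wrapping the usual idioms. */

/* Turn on every bit in mask: flags |= mask. */
int flags_set(int flags, int mask) { return flags | mask; }

/* Turn off every bit in mask: flags &= ~mask. */
int flags_clear(int flags, int mask) { return flags & ~mask; }

/* True if at least one bit of mask is set, as in
 * `ri->flags & (SRI_MASTER|SRI_SLAVE)`. */
bool flags_has_any(int flags, int mask) { return (flags & mask) != 0; }

/* True only if every bit of mask is set. */
bool flags_has_all(int flags, int mask) { return (flags & mask) == mask; }
```

A master that has just been marked subjectively down would carry `SRI_MASTER | SRI_S_DOWN`, and clearing `SRI_S_DOWN` leaves the role bit untouched.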

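When no failover is in progress, all sentinel nodes in the Sentinel cluster are equal. When a failover is carried out, one leader is elected and that leader performs the failover.

Sentinel makes use of Pub/Sub: every Sentinel node subscribes to a dedicated channel on the master and replica nodes and publishes its own node information to that channel, so each Sentinel's information is broadcast to the other Sentinels in the cluster.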

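Sentinel faults and safe mode: TILT mode

Sentinel depends heavily on the system clock. For example, it decides whether a node is down from the difference between the time the node last replied to a PING and the current system time. If the system time is changed, or the process blocks because it is busy, Sentinel may misbehave.

To handle this, Sentinel defines TILT mode. Every call to sentinelTimer checks the difference between the time of the previous call and the current system time; if the difference is negative or unusually large, the Sentinel enters TILT mode: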

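1. It stops acting on what it observes, so it starts no failover, for example. The monitoring half still runs.
2. When another Sentinel asks for its subjective-down verdict on a master, it answers that the node is not down.
3. Once the Sentinel has run normally for 30 seconds in TILT mode, it leaves TILT mode.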
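
The enter and exit rules above can be sketched as two small functions. This is a simplified model under stated assumptions, not the code in sentinel.c: the `TiltState` type and both function names are hypothetical, the 2000 ms trigger is my assumption for what counts as "unusually large", and the 30000 ms period is the 30 seconds from rule 3.

```c
#include <stdbool.h>
#include <stdint.h>

#define TILT_TRIGGER_MS 2000   /* assumed threshold for a suspicious gap */
#define TILT_PERIOD_MS  30000  /* the 30-second period from rule 3 */

/* Hypothetical, trimmed-down stand-in for the relevant sentinel state. */
typedef struct {
    bool tilt;             /* are we currently in TILT mode? */
    int64_t tilt_start_ms; /* when TILT mode was (re)entered */
    int64_t previous_ms;   /* clock reading at the previous timer tick */
} TiltState;

/* Run at every timer tick: a negative or oversized gap since the previous
 * tick means the clock jumped or the process stalled, so enter TILT.
 * A new anomaly while already in TILT restarts the 30-second period. */
void tilt_check(TiltState *s, int64_t now_ms) {
    int64_t delta = now_ms - s->previous_ms;
    if (delta < 0 || delta > TILT_TRIGGER_MS) {
        s->tilt = true;
        s->tilt_start_ms = now_ms;
    }
    s->previous_ms = now_ms;
}

/* Returns true if the acting half may run. While in TILT it returns false
 * until TILT_PERIOD_MS has passed, then clears the mode. */
bool tilt_may_act(TiltState *s, int64_t now_ms) {
    if (s->tilt) {
        if (now_ms - s->tilt_start_ms < TILT_PERIOD_MS) return false;
        s->tilt = false;
    }
    return true;
}
```

Passing the clock reading in as a parameter, rather than reading it inside the functions, keeps the sketch deterministic and easy to test.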

Main failover logic

/* Perform scheduled operations for all the instances in the dictionary.
 * Recursively call the function against dictionaries of slaves. */
void sentinelHandleDictOfRedisInstances(dict *instances) {
    dictIterator *di;
    dictEntry *de;
    sentinelRedisInstance *switch_to_promoted = NULL;

    /* There are a number of things we need to perform against every master. */
    di = dictGetIterator(instances);
    while((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *ri = dictGetVal(de);

        sentinelHandleRedisInstance(ri);    /* Run the main per-instance logic. */
        if (ri->flags & SRI_MASTER) {
            /* For a master, also recurse into its slaves and sentinels. */
            sentinelHandleDictOfRedisInstances(ri->slaves);
            sentinelHandleDictOfRedisInstances(ri->sentinels);
            if (ri->failover_state == SENTINEL_FAILOVER_STATE_UPDATE_CONFIG) {
                switch_to_promoted = ri;
            }
        }
    }
    /* Final step of the failover. */
    if (switch_to_promoted)
        sentinelFailoverSwitchToPromotedSlave(switch_to_promoted);
    dictReleaseIterator(di);
}

/* ======================== SENTINEL timer handler ==========================
 * This is the "main" our Sentinel, being sentinel completely non blocking
 * in design. The function is called every second.
 * -------------------------------------------------------------------------- */

/* Perform scheduled operations for the specified Redis instance. */
void sentinelHandleRedisInstance(sentinelRedisInstance *ri) {
    /* ========== MONITORING HALF ============ */
    /* Every kind of instance */
    sentinelReconnectInstance(ri);      /* (Re)establish the network connections. */
    sentinelSendPeriodicCommands(ri);   /* Send the periodic PING, INFO and hello. */

    /* ============== ACTING HALF ============= */
    /* We don't proceed with the acting half if we are in TILT mode.
     * TILT happens when we find something odd with the time, like a
     * sudden change in the clock. */
    if (sentinel.tilt) {
        if (mstime()-sentinel.tilt_start_time < SENTINEL_TILT_PERIOD) return;
        sentinel.tilt = 0;
        sentinelEvent(LL_WARNING,"-tilt",NULL,"#tilt mode exited");
    }

    /* Every kind of instance */
    sentinelCheckSubjectivelyDown(ri);  /* Check whether this instance is subjectively down. */

    /* Masters and slaves */
    if (ri->flags & (SRI_MASTER|SRI_SLAVE)) {
        /* Nothing so far. */
    }

    /* Only masters */
    if (ri->flags & SRI_MASTER) {
        sentinelCheckObjectivelyDown(ri);   /* Check whether the master is objectively down. */
        if (sentinelStartFailoverIfNeeded(ri))  /* Decide whether a failover can start. */
            sentinelAskMasterStateToOtherSentinels(ri,SENTINEL_ASK_FORCED); /* Send vote requests. */
        sentinelFailoverStateMachine(ri);   /* Drive the failover state machine. */
        /* Ask the other Sentinels for their subjective-down verdict on this master. */
        sentinelAskMasterStateToOtherSentinels(ri,SENTINEL_NO_FLAGS);
    }
}

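Subjectively down (SDOWN): I, on my own, believe you are down.
Objectively down (ODOWN): enough of us believe you are down, meaning at least as many Sentinels as the master's configured quorum (the quorum field above), counting me.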
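
The step from subjective to objective can be sketched as a vote count against the `quorum` field. This is a minimal model under assumptions: `PeerSentinel` and `is_objectively_down` are hypothetical names, the peer struct is cut down to the single flag that matters here, and only the flag values come from the defines earlier in this post.

```c
#include <stdbool.h>
#include <stddef.h>

/* Flag values copied from the defines earlier in this post. */
#define SRI_S_DOWN      (1<<3)
#define SRI_MASTER_DOWN (1<<5)

/* Hypothetical trimmed-down view of another Sentinel: only its flags. */
typedef struct {
    int flags;
} PeerSentinel;

/* A master we already consider subjectively down becomes objectively down
 * once our own vote plus the peers reporting SRI_MASTER_DOWN reach quorum. */
bool is_objectively_down(int master_flags, const PeerSentinel *peers,
                         size_t npeers, unsigned int quorum) {
    if ((master_flags & SRI_S_DOWN) == 0) return false; /* we must agree first */

    unsigned int votes = 1; /* our own subjective verdict */
    for (size_t i = 0; i < npeers; i++) {
        if (peers[i].flags & SRI_MASTER_DOWN) votes++;
    }
    return votes >= quorum;
}
```

With a quorum of 2 and three Sentinels in total, one agreeing peer is enough; the quorum is a configured number, so it does not have to be a majority.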

Heartbeat monitoring

Sentinel periodically sends messages to masters, replicas, and the other Sentinel nodes to see whether they are still alive:

/* Send periodic PING, INFO, and PUBLISH to the Hello channel to
 * the specified master or slave instance. */
void sentinelSendPeriodicCommands(sentinelRedisInstance *ri) {
    mstime_t now = mstime();
    mstime_t info_period, ping_period;
    int retval;

    /* Return ASAP if we have already a PING or INFO already pending, or
     * in the case the instance is not properly connected. */
    if (ri->link->disconnected) return;

    /* For INFO, PING, PUBLISH that are not critical commands to send we
     * also have a limit of SENTINEL_MAX_PENDING_COMMANDS. We don't
     * want to use a lot of memory just because a link is not working
     * properly (note that anyway there is a redundant protection about this,
     * that is, the link will be disconnected and reconnected if a long
     * timeout condition is detected. */
    if (ri->link->pending_commands >=
        SENTINEL_MAX_PENDING_COMMANDS * ri->link->refcount) return;

    /* If this is a slave of a master in O_DOWN condition we start sending
     * it INFO every second, instead of the usual SENTINEL_INFO_PERIOD
     * period. In this state we want to closely monitor slaves in case they
     * are turned into masters by another Sentinel, or by the sysadmin.
     *
     * Similarly we monitor the INFO output more often if the slave reports
     * to be disconnected from the master, so that we can have a fresh
     * disconnection time figure. */
    if ((ri->flags & SRI_SLAVE) &&
        ((ri->master->flags & (SRI_O_DOWN|SRI_FAILOVER_IN_PROGRESS)) ||
         (ri->master_link_down_time != 0)))
    {
        info_period = 1000;
    } else {
        info_period = SENTINEL_INFO_PERIOD;
    }

    /* We ping instances every time the last received pong is older than
     * the configured 'down-after-milliseconds' time, but every second
     * anyway if 'down-after-milliseconds' is greater than 1 second. */
    ping_period = ri->down_after_period;
    if (ping_period > SENTINEL_PING_PERIOD) ping_period = SENTINEL_PING_PERIOD;

    /* Send INFO to masters and slaves, not sentinels. */
    if ((ri->flags & SRI_SENTINEL) == 0 &&
        (ri->info_refresh == 0 ||
        (now - ri->info_refresh) > info_period))
    {
        retval = redisAsyncCommand(ri->link->cc,
            sentinelInfoReplyCallback, ri, "%s",
            sentinelInstanceMapCommand(ri,"INFO"));
        if (retval == C_OK) ri->link->pending_commands++;
    }

    /* Send PING to all the three kinds of instances. */
    if ((now - ri->link->last_pong_time) > ping_period &&
               (now - ri->link->last_ping_time) > ping_period/2) {
        sentinelSendPing(ri);
    }

    /* PUBLISH hello messages to all the three kinds of instances. */
    if ((now - ri->last_pub_time) > SENTINEL_PUBLISH_PERIOD) {
        sentinelSendHello(ri);
    }
}
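
The timing rules in this function are easier to see once pulled out of the I/O code. Below is a minimal sketch of just the arithmetic. The three function names are hypothetical, and the constant values (1000 ms for the ping period, 10000 ms for the INFO period) are my assumptions for `SENTINEL_PING_PERIOD` and `SENTINEL_INFO_PERIOD`, since the listing above does not show their definitions.

```c
#include <stdbool.h>
#include <stdint.h>

#define PING_PERIOD_MS 1000   /* assumed value of SENTINEL_PING_PERIOD */
#define INFO_PERIOD_MS 10000  /* assumed value of SENTINEL_INFO_PERIOD */

/* PING at least once per PING_PERIOD_MS, or more often when
 * down-after-milliseconds is configured below that. */
int64_t effective_ping_period(int64_t down_after_ms) {
    return down_after_ms > PING_PERIOD_MS ? PING_PERIOD_MS : down_after_ms;
}

/* A replica whose master is failing, or whose replication link is down,
 * is polled with INFO every second instead of the usual period. */
int64_t effective_info_period(bool is_replica, bool master_failing,
                              bool master_link_down) {
    if (is_replica && (master_failing || master_link_down)) return 1000;
    return INFO_PERIOD_MS;
}

/* A PING is due when the last PONG is older than the period and we have
 * not sent a PING within the last half period. */
bool ping_due(int64_t now_ms, int64_t last_pong_ms, int64_t last_ping_ms,
              int64_t ping_period_ms) {
    return (now_ms - last_pong_ms) > ping_period_ms &&
           (now_ms - last_ping_ms) > ping_period_ms / 2;
}
```

The half-period condition is what stops a silent peer from being flooded with a PING on every timer tick.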

Deciding a node is down and putting it to a vote

First, I have to convince myself that it really is dead, so:

/* ===================== SENTINEL availability checks ======================= */

/* Is this instance down from our point of view? */
void sentinelCheckSubjectivelyDown(sentinelRedisInstance *ri) {
    mstime_t elapsed = 0;

    /* Work out how long it has been since the instance last responded. */
    if (ri->link->act_ping_time)
        elapsed = mstime() - ri->link->act_ping_time;
    else if (ri->link->disconnected)
        elapsed = mstime() - ri->link->last_avail_time;

    /* Check if we are in need for a reconnection of one of the
     * links, because we are detecting low activity.
     *
     * 1) Check if the command link seems connected, was connected not less
     *    than SENTINEL_MIN_LINK_RECONNECT_PERIOD, but still we have a
     *    pending ping for more than half the timeout. */
    if (ri->link->cc &&
        (mstime() - ri->link->cc_conn_time) >
        SENTINEL_MIN_LINK_RECONNECT_PERIOD &&
        ri->link->act_ping_time != 0 && /* There is a pending ping... */
        /* The pending ping is delayed, and we did not receive
         * error replies as well. */
        (mstime() - ri->link->act_ping_time) > (ri->down_after_period/2) &&
        (mstime() - ri->link->last_pong_time) > (ri->down_after_period/2))
    {
        instanceLinkCloseConnection(ri->link,ri->link->cc);
    }

    /* 2) Check if the pubsub link seems connected, was connected not less
     *    than SENTINEL_MIN_LINK_RECONNECT_PERIOD, but still we have no
     *    activity in the Pub/Sub channel for more than
     *    SENTINEL_PUBLISH_PERIOD * 3.
     */
    if (ri->link->pc &&
        (mstime() - ri->link->pc_conn_time) >
         SENTINEL_MIN_LINK_RECONNECT_PERIOD &&
        (mstime() - ri->link->pc_last_activity) > (SENTINEL_PUBLISH_PERIOD*3))
    {
        instanceLinkCloseConnection(ri->link,ri->link->pc);
    }

    /* Update the SDOWN flag. We believe the instance is SDOWN if:
     *
     * 1) It is not replying.
     * 2) We believe it is a master, it reports to be a slave for enough time
     *    to meet the down_after_period, plus enough time to get two times
     *    INFO report from the instance. */
    if (elapsed > ri->down_after_period ||
        (ri->flags & SRI_MASTER &&
         ri->role_reported == SRI_SLAVE &&
         mstime() - ri->role_reported_time >
          (ri->down_after_period+SENTINEL_INFO_PERIOD*2)))
    {
        /* Is subjectively down */
        if ((ri->flags & SRI_S_DOWN) == 0) {
            sentinelEvent(LL_WARNING,"+sdown",ri,"%@");
            ri->s_down_since_time = mstime();
            ri->flags |= SRI_S_DOWN;
        }
    } else {
        /* Is subjectively up */
        if (ri->flags & SRI_S_DOWN) {
            sentinelEvent(LL_WARNING,"-sdown",ri,"%@");
            ri->flags &= ~(SRI_S_DOWN|SRI_SCRIPT_KILL_SENT);
        }
    }
}
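
Stripped of the reconnection handling, the verdict itself comes down to one predicate with two branches. Here is a minimal sketch of that predicate alone. `SdownInput` and `is_subjectively_down` are hypothetical names, the ages are passed in precomputed instead of being derived from `mstime()`, and the 10000 ms INFO period is my assumption for `SENTINEL_INFO_PERIOD`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values copied from the defines earlier in this post. */
#define SRI_MASTER (1<<0)
#define SRI_SLAVE  (1<<1)

#define INFO_PERIOD_MS 10000 /* assumed value of SENTINEL_INFO_PERIOD */

/* Hypothetical bundle of the inputs the SDOWN decision looks at. */
typedef struct {
    int flags;                    /* what we believe the instance is */
    int role_reported;            /* what the instance says it is */
    int64_t role_reported_age_ms; /* how long it has reported that role */
    int64_t elapsed_ms;           /* time since the unanswered ping / last availability */
    int64_t down_after_ms;        /* down-after-milliseconds */
} SdownInput;

/* SDOWN if the instance is not replying, or if we think it is a master
 * but it has claimed to be a slave for down_after plus two INFO periods. */
bool is_subjectively_down(const SdownInput *in) {
    bool not_replying = in->elapsed_ms > in->down_after_ms;
    bool master_claims_slave =
        (in->flags & SRI_MASTER) &&
        in->role_reported == SRI_SLAVE &&
        in->role_reported_age_ms > in->down_after_ms + INFO_PERIOD_MS * 2;
    return not_replying || master_claims_slave;
}
```

The second branch covers a master that still answers PING but has been demoted behind our back, which a pure timeout check would never catch.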

Having convinced myself, and to avoid a bad call, I then ask the like-minded friends around me for their opinion:

/* This function also carries the election logic.
 * Each other Sentinel replies with a flag; if it is true, that Sentinel
 * also considers the node down. */
void sentinelAskMasterStateToOtherSentinels(sentinelRedisInstance *master, int flags) {
    dictIterator *di;
    dictEntry *de;

    di = dictGetIterator(master->sentinels);
    while((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *ri = dictGetVal(de);
        mstime_t elapsed = mstime() - ri->last_master_down_reply_time;
        char port[32];
        int retval;

        /* If the master state from other sentinel is too old, we clear it. */
        if (elapsed > SENTINEL_ASK_PERIOD*5) {
            ri->flags &= ~SRI_MASTER_DOWN;
            sdsfree(ri->leader);
            ri->leader = NULL;
        }

        /* Only ask if master is down to other sentinels if:
         *
         * 1) We believe it is down, or there is a failover in progress.
         * 2) Sentinel is connected.
         * 3) We did not receive the info within SENTINEL_ASK_PERIOD ms. */
        if ((master->flags & SRI_S_DOWN) == 0) continue;
        if (ri->link->disconnected) continue;
        if (!(flags & SENTINEL_ASK_FORCED) &&
            mstime() - ri->last_master_down_reply_time < SENTINEL_ASK_PERIOD)
            continue;

        /* Ask */
        ll2string(port,sizeof(port),master->addr->port);
        retval = redisAsyncCommand(ri->link->cc,
                    sentinelReceiveIsMasterDownReply, ri,
                    "%s is-master-down-by-addr %s %s %llu %s",
                    sentinelInstanceMapCommand(ri,"SENTINEL"),
                    master->addr->ip, port,
                    sentinel.current_epoch,
                    (master->failover_state > SENTINEL_FAILOVER_STATE_NONE) ?
                    sentinel.myid : "*");
        if (retval == C_OK) ri->link->pending_commands++;
    }
    dictReleaseIterator(di);
}

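Electing the leader sentinel

Now that he is deemed dead, those of us who were watching him need to pick someone in charge to handle his affairs. Because I was the first to notice something was wrong, and the first to confirm he was finally gone, I get my request to handle his affairs in first. The other sentinels then vote for me first, and only if I fail to get elected do they get a chance to start an election of their own:

1. Canvassing for votes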

void sentinelAskMasterStateToOtherSentinels(sentinelRedisInstance *master, int flags) {
    dictIterator *di;
    dictEntry *de;

    di = dictGetIterator(master->sentinels);
    while((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *ri = dictGetVal(de);
        mstime_t elapsed = mstime() - ri->last_master_down_reply_time;
        char port[32];
        int retval;

        /* If the master state from other sentinel is too old, we clear it. */
        if (elapsed > SENTINEL_ASK_PERIOD*5) {
            ri->flags &= ~SRI_MASTER_DOWN;
            sdsfree(ri->leader);
            ri->leader = NULL;
        }

        /* Only ask if master is down to other sentinels if:
         *
         * 1) We believe it is down, or there is a failover in progress.
         * 2) Sentinel is connected.
         * 3) We did not receive the info within SENTINEL_ASK_PERIOD ms. */
        if ((master->flags & SRI_S_DOWN) == 0) continue;
        if (ri->link->disconnected) continue;
        if (!(flags & SENTINEL_ASK_FORCED) &&
            mstime() - ri->last_master_down_reply_time < SENTINEL_ASK_PERIOD)
            continue;

        /* Ask. The last argument is our own run ID once a failover has
         * started (a vote request), and "*" otherwise (a plain query). */
        ll2string(port,sizeof(port),master->addr->port);
        retval = redisAsyncCommand(ri->link->cc,
                    sentinelReceiveIsMasterDownReply, ri,
                    "%s is-master-down-by-addr %s %s %llu %s",
                    sentinelInstanceMapCommand(ri,"SENTINEL"),
                    master->addr->ip, port,
                    sentinel.current_epoch,
                    (master->failover_state > SENTINEL_FAILOVER_STATE_NONE) ?
                    sentinel.myid : "*");
        if (retval == C_OK) ri->link->pending_commands++;
    }
    dictReleaseIterator(di);
}
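
The listing only shows the asking side. How a Sentinel answers such a vote request, and how the candidate decides it has won, can be sketched roughly as below. This is my simplified model of a Raft-style, one-vote-per-epoch rule and not the code in sentinel.c: the `Voter` type, `vote_leader`, `wins_election` and the 40-character run ID length are all assumptions made for illustration.

```c
#include <stdint.h>
#include <stdio.h>

#define RUN_ID_LEN 40 /* assumed run ID length */

/* Hypothetical, trimmed-down voter state. */
typedef struct {
    uint64_t current_epoch;      /* newest epoch this voter has seen */
    uint64_t leader_epoch;       /* epoch in which the vote below was cast */
    char leader[RUN_ID_LEN + 1]; /* run ID voted for; empty means no vote yet */
} Voter;

/* Handles a vote request for req_epoch from candidate req_runid.
 * The first request seen in a new epoch gets the vote; later requests in
 * the same epoch are answered with the existing vote.
 * Returns the run ID this voter supports, or NULL if it has none. */
const char *vote_leader(Voter *v, uint64_t req_epoch, const char *req_runid) {
    if (req_epoch > v->current_epoch) v->current_epoch = req_epoch;

    if (v->leader_epoch < req_epoch && v->current_epoch <= req_epoch) {
        snprintf(v->leader, sizeof(v->leader), "%s", req_runid);
        v->leader_epoch = v->current_epoch;
    }
    return v->leader[0] ? v->leader : NULL;
}

/* A candidate wins only with a majority of all Sentinels that is also
 * at least the master's configured quorum. */
int wins_election(unsigned int votes, unsigned int total_sentinels,
                  unsigned int quorum) {
    unsigned int majority = total_sentinels / 2 + 1;
    return votes >= majority && votes >= quorum;
}
```

This is why being first matters: within one epoch a voter cannot change its mind, so a later candidate has to wait for a new epoch before it can collect votes.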