Redis cluster 故障转移

在节点间交互中我们已经知道了,cluster集群是如何做到节点间通信和故障发现的.这里总结下集群是如何做故障转移(Failover)的.

故障转移

故障转移的逻辑也是在clusterCron()方法中定时触发执行的.具体流程都在clusterHandleSlaveFailover(void)方法中.

1. 基本概念

为了更好理解源码,先同步下变量的含义.

server.cluster->failover_auth_time: 表示slave节点开始进行故障转移的时刻;

auth_age: 从发起 failover开始时间到现在过去的时间。

needed_quorum: 故障转移需要的选票数量;

server.cluster: 是主节点数量;

auth_timeout: 当前slave发起投票后,等待回应的超时时间,至少为 2s.如果超过该时间还没有获得足够的选票,那么表示本次failover失败;

auth_retry_time: 发起下一次故障转移的时间间隔;

代码语言：javascript复制

mstime_t data_age;
    mstime_t auth_age = mstime() - server.cluster->failover_auth_time;
    int needed_quorum = (server.cluster->size / 2)   1;
    int manual_failover = server.cluster->mf_end != 0 &&
                          server.cluster->mf_can_start;
    mstime_t auth_timeout, auth_retry_time;
    server.cluster->todo_before_sleep &= ~CLUSTER_TODO_HANDLE_FAILOVER;
    /* Compute the failover timeout (the max time we have to send votes
     * and wait for replies), and the failover retry time (the time to wait
     * before trying to get voted again).
     *
     * Timeout is MAX(NODE_TIMEOUT*2,2000) milliseconds.
     * Retry is two times the Timeout.
     */
    auth_timeout = server.cluster_node_timeout*2;
    if (auth_timeout < 2000) auth_timeout = 2000;
    auth_retry_time = auth_timeout*2;

2. 判断节点是否能发起故障转移

能发起failover的节点必须满足以下条件:

a. slave 节点

b. master 不为空

c. master 负责的 slot 数量不为空

d. master 被标记成了 FAIL,或者是一个主动 failover(manual_failover为真)

代码语言：javascript复制

    /* Pre conditions to run the function, that must be met both in case
     * of an automatic or manual failover:
     * 1) We are a slave.
     * 2) Our master is flagged as FAIL, or this is a manual failover.
     * 3) We don't have the no failover configuration set, and this is
     *    not a manual failover.
     * 4) It is serving slots. */
    if (nodeIsMaster(myself) ||
        myself->slaveof == NULL ||
        (!nodeFailed(myself->slaveof) && !manual_failover) ||
        (server.cluster_slave_no_failover && !manual_failover) ||
        myself->slaveof->numslots == 0)
    {
        /* There are no reasons to failover, so we set the reason why we
         * are returning without failing over to NONE. */
        server.cluster->cant_failover_reason = CLUSTER_CANT_FAILOVER_NONE;
        return;
    }

3. 判断节点数据是否太旧

data_age:表示当前slave节点多长时间没有与master节点交互过了.如果slave节点的数据太旧就不能替换掉下线 master 节点,因此只能人工处理。

代码语言：javascript复制

  /* Set data_age to the number of seconds we are disconnected fromthe master. */
    if (server.repl_state == REPL_STATE_CONNECTED) {
        data_age = (mstime_t)(server.unixtime - server.master->lastinteraction)
                   * 1000;
    } else {
        data_age = (mstime_t)(server.unixtime - server.repl_down_since) * 1000;
    }
    /* Remove the node timeout from the data age as it is fine that we are
     * disconnected from our master at least for the time it was down to be
     * flagged as FAIL, that's the baseline. */
    if (data_age > server.cluster_node_timeout)
        data_age -= server.cluster_node_timeout;

    /* Check if our data is recent enough according to the slave validity
     * factor configured by the user.
     * Check bypassed for manual failovers. */
    if (server.cluster_slave_validity_factor &&
        data_age >
        (((mstime_t)server.repl_ping_slave_period * 1000)  
         (server.cluster_node_timeout * server.cluster_slave_validity_factor)))
    {
        if (!manual_failover) {
            clusterLogCantFailover(CLUSTER_CANT_FAILOVER_DATA_AGE);
            return;
        }
    }

4. 启动故障转移流程

满足条件(auth_age > auth_retry_time)后,发起故障转移流程,将自己的数据和节点等信息广播出去

ailover_auth_rank:根据clusterGetSlaveRank()可以看出,排名根据数据复制位置来定,复制数据量越多,排名越靠前,越早进行故障转移;

代码语言：javascript复制

  /* If the previous failover attempt timedout and the retry time has
     * elapsed, we can setup a new one. */
    if (auth_age > auth_retry_time) {
        server.cluster->failover_auth_time = mstime()  
            500   /* Fixed delay of 500 milliseconds, let FAIL msg propagate. */
            random() % 500; /* Random delay between 0 and 500 milliseconds. */
        server.cluster->failover_auth_count = 0;
        server.cluster->failover_auth_sent = 0;
        server.cluster->failover_auth_rank = clusterGetSlaveRank();
        /* We add another delay that is proportional to the slave rank.
         * Specifically 1 second * rank. This way slaves that have a probably
         * less updated replication offset, are penalized. */
        server.cluster->failover_auth_time  =
            server.cluster->failover_auth_rank * 1000;
        /* However if this is a manual failover, no delay is needed. */
        if (server.cluster->mf_end) {
            server.cluster->failover_auth_time = mstime();
            server.cluster->failover_auth_rank = 0;
        }
        serverLog(LL_WARNING,
            "Start of election delayed for %lld milliseconds "
            "(rank #%d, offset %lld).",
            server.cluster->failover_auth_time - mstime(),
            server.cluster->failover_auth_rank,
            replicationGetSlaveOffset());
        /* Now that we have a scheduled election, broadcast our offset
         * to all the other slaves so that they'll updated their offsets
         * if our offset is better. */
        clusterBroadcastPong(CLUSTER_BROADCAST_LOCAL_SLAVES);
        return;
    }

5. 集群内拉选票

向集群内广播票选信息

type=CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST

代码语言：javascript复制

  /* Ask for votes if needed. */
    if (server.cluster->failover_auth_sent == 0) {
        server.cluster->currentEpoch  ;
        server.cluster->failover_auth_epoch = server.cluster->currentEpoch;
        ...
        clusterRequestFailoverAuth();
...
    server.cluster->failover_auth_sent = 1;
        return; /* Wait for replies. */
    }

6. 故障转移,从主切换

节点切换为master

主要流程在clusterFailoverReplaceYourMaster(void)方法中

1.将节点相关信息修改为主节点:例如主从节点等信息

2.接受旧master的hash槽信息

3.广播通知其他所有节点

7. 节点投票

各节点在接收到CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST报文时,会进行投票,具体方法参考clusterSendFailoverAuthIfNeeded();

a. 首先会判断节点必须为主节点,而且负责一部分hash槽

代码语言：javascript复制

if (nodeIsSlave(myself) || myself->numslots == 0) return;

b. 保证一次failover只做一次投票

代码语言：javascript复制

/* I already voted for this epoch? Return ASAP. */
if (server.cluster->lastVoteEpoch == server.cluster->currentEpoch) {
    serverLog(LL_WARNING,
            "Failover auth denied to %.40s: already voted for epoch %llu",
            node->name,
            (unsigned long long) server.cluster->currentEpoch);
    return;
}

c. 组装报文发回给sender节点

代码语言：javascript复制

type=CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK
clusterSendFailoverAuth(node);

8. 延迟处理

在上述流程中,部分操作都是延后一段时间执行的,这样做的目的是让信息在各节点充分转发.

hash server time void

0 人点赞