在节点间交互中我们已经知道了,cluster集群是如何做到节点间通信和故障发现的.这里总结下集群是如何做故障转移(Failover)的.
故障转移
故障转移的逻辑也是在clusterCron()方法中定时触发执行的.具体流程都在clusterHandleSlaveFailover(void)方法中.
1. 基本概念
为了更好理解源码,先同步下变量的含义.
server.cluster->failover_auth_time: 表示slave节点开始进行故障转移的时刻;
auth_age: 从发起 failover开始时间到现在过去的时间。
needed_quorum: 故障转移需要的选票数量;
server.cluster: 是主节点数量;
auth_timeout: 当前slave发起投票后,等待回应的超时时间,至少为 2s.如果超过该时间还没有获得足够的选票,那么表示本次failover失败;
auth_retry_time: 发起下一次故障转移的时间间隔;
代码语言:javascript复制mstime_t data_age;
mstime_t auth_age = mstime() - server.cluster->failover_auth_time;
int needed_quorum = (server.cluster->size / 2) 1;
int manual_failover = server.cluster->mf_end != 0 &&
server.cluster->mf_can_start;
mstime_t auth_timeout, auth_retry_time;
server.cluster->todo_before_sleep &= ~CLUSTER_TODO_HANDLE_FAILOVER;
/* Compute the failover timeout (the max time we have to send votes
* and wait for replies), and the failover retry time (the time to wait
* before trying to get voted again).
*
* Timeout is MAX(NODE_TIMEOUT*2,2000) milliseconds.
* Retry is two times the Timeout.
*/
auth_timeout = server.cluster_node_timeout*2;
if (auth_timeout < 2000) auth_timeout = 2000;
auth_retry_time = auth_timeout*2;
2. 判断节点是否能发起故障转移
能发起failover的节点必须满足以下条件:
a. slave 节点
b. master 不为空
c. master 负责的 slot 数量不为空
d. master 被标记成了 FAIL,或者是一个主动 failover(manual_failover为真)
代码语言:javascript复制 /* Pre conditions to run the function, that must be met both in case
* of an automatic or manual failover:
* 1) We are a slave.
* 2) Our master is flagged as FAIL, or this is a manual failover.
* 3) We don't have the no failover configuration set, and this is
* not a manual failover.
* 4) It is serving slots. */
if (nodeIsMaster(myself) ||
myself->slaveof == NULL ||
(!nodeFailed(myself->slaveof) && !manual_failover) ||
(server.cluster_slave_no_failover && !manual_failover) ||
myself->slaveof->numslots == 0)
{
/* There are no reasons to failover, so we set the reason why we
* are returning without failing over to NONE. */
server.cluster->cant_failover_reason = CLUSTER_CANT_FAILOVER_NONE;
return;
}
3. 判断节点数据是否太旧
data_age:表示当前slave节点多长时间没有与master节点交互过了.如果slave节点的数据太旧就不能替换掉下线 master 节点,因此只能人工处理。
代码语言:javascript复制 /* Set data_age to the number of seconds we are disconnected fromthe master. */
if (server.repl_state == REPL_STATE_CONNECTED) {
data_age = (mstime_t)(server.unixtime - server.master->lastinteraction)
* 1000;
} else {
data_age = (mstime_t)(server.unixtime - server.repl_down_since) * 1000;
}
/* Remove the node timeout from the data age as it is fine that we are
* disconnected from our master at least for the time it was down to be
* flagged as FAIL, that's the baseline. */
if (data_age > server.cluster_node_timeout)
data_age -= server.cluster_node_timeout;
/* Check if our data is recent enough according to the slave validity
* factor configured by the user.
* Check bypassed for manual failovers. */
if (server.cluster_slave_validity_factor &&
data_age >
(((mstime_t)server.repl_ping_slave_period * 1000)
(server.cluster_node_timeout * server.cluster_slave_validity_factor)))
{
if (!manual_failover) {
clusterLogCantFailover(CLUSTER_CANT_FAILOVER_DATA_AGE);
return;
}
}
4. 启动故障转移流程
满足条件(auth_age > auth_retry_time)后,发起故障转移流程,将自己的数据和节点等信息广播出去
ailover_auth_rank:根据clusterGetSlaveRank()可以看出,排名根据数据复制位置来定,复制数据量越多,排名越靠前,越早进行故障转移;
代码语言:javascript复制 /* If the previous failover attempt timedout and the retry time has
* elapsed, we can setup a new one. */
if (auth_age > auth_retry_time) {
server.cluster->failover_auth_time = mstime()
500 /* Fixed delay of 500 milliseconds, let FAIL msg propagate. */
random() % 500; /* Random delay between 0 and 500 milliseconds. */
server.cluster->failover_auth_count = 0;
server.cluster->failover_auth_sent = 0;
server.cluster->failover_auth_rank = clusterGetSlaveRank();
/* We add another delay that is proportional to the slave rank.
* Specifically 1 second * rank. This way slaves that have a probably
* less updated replication offset, are penalized. */
server.cluster->failover_auth_time =
server.cluster->failover_auth_rank * 1000;
/* However if this is a manual failover, no delay is needed. */
if (server.cluster->mf_end) {
server.cluster->failover_auth_time = mstime();
server.cluster->failover_auth_rank = 0;
}
serverLog(LL_WARNING,
"Start of election delayed for %lld milliseconds "
"(rank #%d, offset %lld).",
server.cluster->failover_auth_time - mstime(),
server.cluster->failover_auth_rank,
replicationGetSlaveOffset());
/* Now that we have a scheduled election, broadcast our offset
* to all the other slaves so that they'll updated their offsets
* if our offset is better. */
clusterBroadcastPong(CLUSTER_BROADCAST_LOCAL_SLAVES);
return;
}
5. 集群内拉选票
向集群内广播票选信息
type=CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST
代码语言:javascript复制 /* Ask for votes if needed. */
if (server.cluster->failover_auth_sent == 0) {
server.cluster->currentEpoch ;
server.cluster->failover_auth_epoch = server.cluster->currentEpoch;
...
clusterRequestFailoverAuth();
...
server.cluster->failover_auth_sent = 1;
return; /* Wait for replies. */
}
6. 故障转移,从主切换
节点切换为master
主要流程在clusterFailoverReplaceYourMaster(void)方法中
1.将节点相关信息修改为主节点:例如主从节点等信息
2.接受旧master的hash槽信息
3.广播通知其他所有节点
7. 节点投票
各节点在接收到CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST报文时,会进行投票,具体方法参考clusterSendFailoverAuthIfNeeded();
a. 首先会判断节点必须为主节点,而且负责一部分hash槽
代码语言:javascript复制if (nodeIsSlave(myself) || myself->numslots == 0) return;
b. 保证一次failover只做一次投票
代码语言:javascript复制/* I already voted for this epoch? Return ASAP. */
if (server.cluster->lastVoteEpoch == server.cluster->currentEpoch) {
serverLog(LL_WARNING,
"Failover auth denied to %.40s: already voted for epoch %llu",
node->name,
(unsigned long long) server.cluster->currentEpoch);
return;
}
c. 组装报文发回给sender节点
代码语言:javascript复制type=CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK
clusterSendFailoverAuth(node);
8. 延迟处理
在上述流程中,部分操作都是延后一段时间执行的,这样做的目的是让信息在各节点充分转发.