Raft: 寻找可理解的共识算法(3)

2022-07-06 15:26:27 浏览数 (2)

5.3 Log replication

Figure 6: Logs are composed of entries, which are numbered sequentially. Each entry contains the term in which it was created (the number in each box) and a command for the state machine. An entry is considered committed if it is safe for that entry to be applied to state machines.

图6:日志是由条目组成的,这些条目按顺序编号。每个条目都包含创建它的任期(每个方框中的数字)和状态机的命令。如果一个条目可以安全地应用于状态机,那么该条目就被认为是承诺的。

Once a leader has been elected, it begins servicing client requests. Each client request contains a command to be executed by the replicated state machines. The leader appends the command to its log as a new entry, then it uses AppendEntries RPCs in parallel to each of the other servers to replicate the entry. When the entry has been safely replicated (as described below), the leader applies the entry to its state machine and returns the result of that execution to the client. If followers crash or run slowly, or if network packets are lost, the leader retries Append Entries RPCs indefinitely (even after it has responded to the client) until all followers eventually store all log entries.

一旦一个领导者被选出,它就开始为客户端请求提供服务。每个客户端请求都包含一个要由复制的状态机执行的命令。领导者将命令作为一个新的条目附加到它的日志中,然后它使用AppendEntries RPCs平行于其他每个服务器来复制该条目。当条目被安全复制后(如下所述),领导者将条目应用于其状态机,并将执行结果返回给客户端。如果跟随者崩溃或运行缓慢,或者网络数据包丢失,领导者会无限期地重试Append Entries RPCs(甚至在它响应了客户端之后),直到所有跟随者最终存储所有的日志条目。

Logs are organized as shown in Figure 6. Each log entry stores a state machine command along with the term number when the entry was received by the leader. The term numbers in log entries are used to detect inconsistencies between logs and to ensure some of the properties in Figure 3. Each log entry also has an integer index identifying its position in the log.

日志的组织方式如图6所示。每个日志条目都存储了一个状态机命令,以及领导者收到该条目时的任期编号。日志条目中的任期编号被用来检测日志之间的不一致,并确保图3中的一些属性。每个日志条目也有一个整数索引,用于识别它在日志中的位置。

The leader decides when it is safe to apply a log entry to the state machines; such an entry is called committed. Raft guarantees that committed entries are durable and will eventually be executed by all of the available state machines. A log entry is committed once the leader that created the entry has replicated it on a majority of the servers (e.g., entry 7 in Figure 6). This also commits all preceding entries in the leader’s log, including entries created by previous leaders. Section 5.4 discusses some subtleties when applying this rule after leader changes, and it also shows that this definition of commitment is safe. The leader keeps track of the highest index it knows to be committed, and it includes that index in future AppendEntries RPCs (including heartbeats) so that the other servers eventually find out. Once a follower learns that a log entry is committed, it applies the entry to its local state machine (in log order).

领导决定何时将日志条目应用于状态机是安全的;这样的条目被称为承诺。Raft保证所提交的条目是持久的,最终会被所有可用的状态机执行。一旦创建该条目的领导者将其复制到大多数服务器上,该日志条目就会被提交(例如,图6中的条目7)。这也会提交领导者日志中所有之前的条目,包括之前领导者创建的条目。第5.4节讨论了在领导者变更后应用这一规则时的一些微妙之处,它还表明这种承诺的定义是安全的。领导者会跟踪它所知道的已承诺的最高索引,并且它在未来的AppendEntries RPC(包括心跳)中包括该索引,以便其他服务器最终获知。一旦跟随者得知一个日志条目被提交,它就会将该条目应用到它的本地状态机(按日志顺序)。

We designed the Raft log mechanism to maintain a high level of coherency between the logs on different servers. Not only does this simplify the system’s behavior and make it more predictable, but it is an important component of ensuring safety. Raft maintains the following properties, which together constitute the Log Matching Property in Figure 3:

我们设计的Raft日志机制在不同服务器上的日志之间保持高度的一致性。这不仅简化了系统的行为,使其更具可预测性,而且是确保安全的重要组成部分。Raft维护以下属性,它们共同构成了图3中的日志匹配属性:

• If two entries in different logs have the same index and term, then they store the same command.

如果不同日志中的两个条目具有相同的索引和任期,那么它们存储的是同一个命令。

• If two entries in different logs have the same index and term, then the logs are identical in all preceding entries.

如果不同日志中的两个条目具有相同的索引和任期,那么日志中的所有前面的条目都是相同的。

The first property follows from the fact that a leader creates at most one entry with a given log index in a given term, and log entries never change their position in the log. The second property is guaranteed by a simple consistency check performed by AppendEntries. When sending an AppendEntries RPC, the leader includes the index and term of the entry in its log that immediately precedes the new entries. If the follower does not find an entry in its log with the same index and term, then it refuses the new entries. The consistency check acts as an induction step: the initial empty state of the logs satisfies the Log Matching Property, and the consistency check preserves the Log Matching Property whenever logs are extended. As a result, whenever AppendEntries returns successfully, the leader knows that the follower’s log is identical to its own log up through the new entries.

第一个属性来自于这样一个事实,即一个领导者在一个给定的任期中最多创建一个具有给定日志索引的条目,并且日志条目永远不会改变它们在日志中的位置。第二个属性由AppendEntries执行的简单一致性检查来保证。当发送AppendEntries RPC时,领导者包括其日志中紧接新条目之前的条目的索引和任期。如果跟随者在其日志中没有找到具有相同索引和任期的条目,那么它将拒绝新条目。一致性检查作为一个归纳步骤:日志的初始空状态满足了日志匹配属性,并且每当日志被扩展时,一致性检查都会保留日志匹配属性。因此,每当AppendEntries成功返回时,领导者知道跟随者的日志与自己的日志在新条目之前是相同的。

During normal operation, the logs of the leader and followers stay consistent, so the AppendEntries consistency check never fails. However, leader crashes can leave the logs inconsistent (the old leader may not have fully replicated all of the entries in its log). These inconsistencies can compound over a series of leader and follower crashes. Figure 7 illustrates the ways in which followers’ logs may differ from that of a new leader. A follower may be missing entries that are present on the leader, it may have extra entries that are not present on the leader, or both. Missing and extraneous entries in a log may span multiple terms.

在正常运行期间,领导者和追随者的日志保持一致,所以AppendEntries一致性检查从未失败。然而,领导者崩溃会使日志不一致(老领导者可能没有完全复制其日志中的所有条目)。这些不一致会在一系列领导者和追随者的崩溃中加剧。图7说明了追随者的日志可能与新领导者的日志不同的方式。跟随者可能会丢失领导者的条目,可能会有领导者没有的额外条目,或者两者都有。日志中的缺失和不相干的条目可能跨越多个任期。

In Raft, the leader handles inconsistencies by forcing the followers’ logs to duplicate its own. This means that conflicting entries in follower logs will be overwritten with entries from the leader’s log. Section 5.4 will show that this is safe when coupled with one more restriction.

在Raft中,领导者通过强迫追随者的日志重复自己的日志来处理不一致的情况。这意味着追随者日志中的冲突条目将被领导者日志中的条目覆盖。第5.4节将表明,如果再加上一个限制,这就是安全的。

Figure 7: When the leader at the top comes to power, it is possible that any of scenarios (a–f) could occur in follower logs. Each box represents one log entry; the number in the box is its term. A follower may be missing entries (a–b), may have extra uncommitted entries (c–d), or both (e–f). For example, scenario (f) could occur if that server was the leader for term 2, added several entries to its log, then crashed before committing any of them; it restarted quickly, became leader for term 3, and added a few more entries to its log; before any of the entries in either term 2 or term 3 were committed, the server crashed again and remained down for several terms.

图7:当最上方的领导者掌权时,在追随者的日志中可能会出现(a-f)的任何一种情况。每个盒子代表一个日志条目;盒子里的数字是它的任期。一个追随者可能缺少条目(a-b),可能有额外的未承诺的条目(c-d),或者两者都有(e-f)。例如,场景f会发生在如果该服务器是第2期的领导者,在其日志中增加了几个条目,然后在提交任何条目之前就崩溃了;它很快重新启动,成为第3期的领导者,并在其日志中增加了几个条目;在第2期或第3期的任何条目被提交之前,该服务器再次崩溃,并持续了几个任期。

To bring a follower’s log into consistency with its own, the leader must find the latest log entry where the two logs agree, delete any entries in the follower’s log after that point, and send the follower all of the leader’s entries after that point. All of these actions happen in response to the consistency check performed by AppendEntries RPCs. The leader maintains a nextIndex for each follower, which is the index of the next log entry the leader will send to that follower. When a leader first comes to power, it initializes all nextIndex values to the index just after the last one in its log (11 in Figure 7). If a follower’s log is inconsistent with the leader’s, the AppendEntries consistency check will fail in the next AppendEntries RPC. After a rejection, the leader decrements nextIndex and retries the AppendEntries RPC. Eventually nextIndex will reach a point where the leader and follower logs match. When this happens, AppendEntries will succeed, which removes any conflicting entries in the follower’s log and appends entries from the leader’s log (if any). Once AppendEntries succeeds, the follower’s log is consistent with the leader’s, and it will remain that way for the rest of the term.

为了使追随者的日志与自己的日志保持一致,领导者必须找到两个日志一致的最新日志条目,删除追随者日志中该点之后的任何条目,并将该点之后领导者的所有条目发送给追随者。所有这些动作都是为了响应AppendEntries RPC所进行的一致性检查而发生的。领导为每个跟随者维护一个nextIndex,它是领导将发送给该跟随者的下一个日志条目的索引。当领导者第一次上台时,它将所有的nextIndex值初始化为其日志中最后一条的索引(图7中的11)。如果跟随者的日志与领导者的日志不一致,在下一个AppendEntries RPC中,AppendEntries一致性检查将失败。在拒绝之后,领导者会递减nextIndex并重试AppendEntries RPC。最终,nextIndex将达到一个领导者和追随者日志匹配的点。当这种情况发生时,AppendEntries就会成功,它将删除跟随者日志中任何冲突的条目,并追加领导者日志中的条目(如果有的话)。一旦AppendEntries成功,追随者的日志就与领导者的日志一致了,并且在剩下的任期里,它将保持这种状态。

If desired, the protocol can be optimized to reduce the number of rejected AppendEntries RPCs. For example, when rejecting an AppendEntries request, the follower can include the term of the conflicting entry and the first index it stores for that term. With this information, the leader can decrement nextIndex to bypass all of the conflicting entries in that term; one AppendEntries RPC will be required for each term with conflicting entries, rather than one RPC per entry. In practice, we doubt this optimization is necessary, since failures happen infrequently and it is unlikely that there will be many inconsistent entries.

如果需要,该协议可以被优化以减少被拒绝的AppendEntries RPC的数量。例如,当拒绝一个AppendEntries请求时,跟随者可以包括冲突条目的任期和它为该任期存储的第一个索引。有了这些信息,领导者可以递减nextIndex以绕过该任期中的所有冲突条目;每个有冲突条目的任期将需要一个AppendEntries RPC,而不是每个条目一个RPC。在实践中,我们怀疑这种优化是否有必要,因为故障不常发生,而且不太可能有很多不一致的条目。

With this mechanism, a leader does not need to take any special actions to restore log consistency when it comes to power. It just begins normal operation, and the logs automatically converge in response to failures of the Append Entries consistency check. A leader never overwrites or deletes entries in its own log (the Leader Append-Only Property in Figure 3).

有了这种机制,领导者在掌权时不需要采取任何特别的行动来恢复日志的一致性。它只是开始正常的操作,而日志会自动收敛以应对Append Entries一致性检查的失败。一个领导者从不覆盖或删除自己日志中的条目(图3中的 "Leader Append-Only"属性)。

This log replication mechanism exhibits the desirable consensus properties described in Section 2: Raft can accept, replicate, and apply new log entries as long as a majority of the servers are up; in the normal case a new entry can be replicated with a single round of RPCs to a majority of the cluster; and a single slow follower will not impact performance.

这种日志复制机制表现出第2节中所描述的理想的共识属性:只要大多数服务器是正常的,Raft就可以接受、复制和应用新的日志条目;在正常情况下,一个新的条目可以通过单轮RPC复制到集群的大多数;而且一个缓慢的跟随者不会影响性能。

5.4 Safety

The previous sections described how Raft elects leaders and replicates log entries. However, the mechanisms described so far are not quite sufficient to ensure that each state machine executes exactly the same commands in the same order. For example, a follower might be unavailable while the leader commits several log entries, then it could be elected leader and overwrite these entries with new ones; as a result, different state machines might execute different command sequences.

前面的章节描述了Raft如何选举领导和复制日志条目。然而,到目前为止所描述的机制还不足以确保每个状态机以相同的顺序执行完全相同的命令。例如,当领导者提交几个日志条目时,一个跟随者可能无法使用,然后它可能被选为领导者,并用新的条目覆盖这些条目;结果,不同的状态机可能执行不同的命令序列。

This section completes the Raft algorithm by adding a restriction on which servers may be elected leader. The restriction ensures that the leader for any given term contains all of the entries committed in previous terms (the Leader Completeness Property from Figure 3). Given the election restriction, we then make the rules for commitment more precise. Finally, we present a proof sketch for the Leader Completeness Property and show how it leads to correct behavior of the replicated state machine.

本节对Raft算法进行了完善,增加了对哪些服务器可以被选为领导者的限制条件。该限制确保了任何给定任期的领导者都包含了之前任期中承诺的所有条目(图3中的Leader Completeness属性)。考虑到选举限制,我们会使承诺的规则更加精确。最后,我们提出一个Leader Completeness属性的证明草图,并说明它如何导致副本状态机的正确行为。

5.4.1 Election restriction

In any leader-based consensus algorithm, the leader must eventually store all of the committed log entries. In some consensus algorithms, such as Viewstamped Replication [22], a leader can be elected even if it doesn’t initially contain all of the committed entries. These algorithms contain additional mechanisms to identify the missing entries and transmit them to the new leader, either during the election process or shortly afterwards. Unfortunately, this results in considerable additional mechanism and complexity. Raft uses a simpler approach where it guarantees that all the committed entries from previous terms are present on each new leader from the moment of its election, without the need to transfer those entries to the leader. This means that log entries only flow in one direction, from leaders to followers, and leaders never overwrite existing entries in their logs.

在任何基于领导者的共识算法中,领导者最终必须存储所有承诺的日志条目。在一些共识算法中,例如Viewstamped Replication[22],即使最初不包含所有已承诺的条目,也可以选出一个领导者。这些算法包含额外的机制来识别缺失的条目,并在选举过程中或之后不久将它们传送给新的领导者。不幸的是,这导致了相当多的额外机制和复杂性。Raft使用了一种更简单的方法,它保证从每一个新的领导者当选的那一刻起,以前的所有承诺条目都存在于其身上,而不需要将这些条目传输给领导者。这意味着日志条目只在一个方向流动,即从领导者到追随者,而且领导者永远不会覆盖他们日志中的现有条目。

Raft uses the voting process to prevent a candidate from winning an election unless its log contains all committed entries. A candidate must contact a majority of the cluster in order to be elected, which means that every committed entry must be present in at least one of those servers. If the candidate’s log is at least as up-to-date as any other log in that majority (where “up-to-date” is defined precisely below), then it will hold all the committed entries. The RequestVote RPC implements this restriction: the RPC includes information about the candidate’s log, and the voter denies its vote if its own log is more up-to-date than that of the candidate.

Raft使用投票程序来防止候选人赢得选举,除非其日志包含所有承诺的条目。候选人必须与集群中的大多数人联系才能当选,这意味着每个已承诺的条目必须至少存在于其中一个服务器中。如果候选人的日志至少和该多数人中的任何其他日志一样是最新的(这里的 "最新 "在下面有精确的定义),那么它将包含所有承诺的条目。RequestVote RPC实现了这一限制:RPC包括关于候选人日志的信息,如果投票人自己的日志比候选人的日志更及时,则拒绝投票。

Raft determines which of two logs is more up-to-date by comparing the index and term of the last entries in the logs. If the logs have last entries with different terms, then the log with the later term is more up-to-date. If the logs end with the same term, then whichever log is longer is more up-to-date.

Raft通过比较日志中最后条目的索引和任期来确定两个日志中哪一个是最新的。如果日志的最后条目有不同的任期,那么任期较晚的日志是最新的。如果日志以相同的任期结束,那么哪一个日志的任期更长,哪一个就更新。

5.4.2 Committing entries from previous terms

Figure 8: A time sequence showing why a leader cannot determine commitment using log entries from older terms. In (a) S1 is leader and partially replicates the log entry at index 2. In (b) S1 crashes; S5 is elected leader for term 3 with votes from S3, S4, and itself, and accepts a different entry at log index 2. In (c) S5 crashes; S1 restarts, is elected leader, and continues replication. At this point, the log entry from term 2 has been replicated on a majority of the servers, but it is not committed. If S1 crashes as in (d), S5 could be elected leader (with votes from S2, S3, and S4) and overwrite the entry with its own entry from term 3. However, if S1 replicates an entry from its current term on a majority of the servers before crashing, as in (e), then this entry is committed (S5 cannot win an election). At this point all preceding entries in the log are committed as well.

图8:一个时间序列显示了为什么领导者不能使用较早的任期的日志条目来确定承诺。在(a)中,S1是领导者,部分复制了索引2的日志条目。在(b)中,S1崩溃了;S5以S3、S4和它自己的投票当选为第三任期的领导者,并接受了日志索引2的不同条目。在(c)中,S5崩溃了;S1重新启动,被选为领导者,并继续复制。在这一点上,第2项的日志条目已经在大多数服务器上复制,但它没有被提交。如果S1像(d)那样崩溃,S5可以被选为领导者(有S2、S3和S4的投票),并用它自己的第3任期的条目覆盖该条目。然而,如果S1在崩溃前在大多数服务器上复制了其当前任期的条目,如(e),那么这个条目就被承诺了(S5不能赢得选举)。在这一点上,日志中所有前面的条目也被提交。

Figure 9: If S1 (leader for term T) commits a new log entry from its term, and S5 is elected leader for a later term U, then there must be at least one server (S3) that accepted the log entry and also voted for S5.

图9:如果S1(任期T的领导者)在其任期内提交了一个新的日志条目,而S5被选为后来任期U的领导者,那么至少有一个服务器(S3)接受了该日志条目,并且也为S5投票。

As described in Section 5.3, a leader knows that an entry from its current term is committed once that entry is stored on a majority of the servers. If a leader crashes before committing an entry, future leaders will attempt to finish replicating the entry. However, a leader cannot immediately conclude that an entry from a previous term is committed once it is stored on a majority of servers. Figure 8 illustrates a situation where an old log entry is stored on a majority of servers, yet can still be overwritten by a future leader.

如第5.3节所述,一旦一个条目被存储在大多数服务器上,领导者就知道其当前任期的条目被提交。如果一个领导者在提交条目之前崩溃了,未来的领导者将试图完成对该条目的复制。然而,领导者不能立即得出结论,一旦前一届的条目被存储在大多数服务器上,它就被提交了。图8说明了这样一种情况:一个旧的日志条目被存储在大多数服务器上,但仍然可以被未来的领导者所覆盖。

To eliminate problems like the one in Figure 8, Raft never commits log entries from previous terms by counting replicas. Only log entries from the leader’s current term are committed by counting replicas; once an entry from the current term has been committed in this way, then all prior entries are committed indirectly because of the Log Matching Property. There are some situations where a leader could safely conclude that an older log entry is committed (for example, if that entry is stored on every server), but Raft takes a more conservative approach for simplicity.

为了消除类似图8中的问题,Raft从不通过计算副本数量来提交以前的日志条目。只有领导者当前任期的日志条目是通过计算副本数量来提交的;一旦当前任期的条目以这种方式被提交,那么由于日志匹配特性,所有之前的条目都被间接提交。在某些情况下,领导者可以安全地断定一个较早的日志条目已被提交(例如,如果该条目存储在每个服务器上),但Raft为了简单起见,采取了更保守的方法。

Raft incurs this extra complexity in the commitment rules because log entries retain their original term numbers when a leader replicates entries from previous terms. In other consensus algorithms, if a new leader re-replicates entries from prior “terms,” it must do so with its new “term number.” Raft’s approach makes it easier to reason about log entries, since they maintain the same term number over time and across logs. In addition, new leaders in Raft send fewer log entries from previous terms than in other algorithms (other algorithms must send redundant log entries to renumber them before they can be committed).

Raft在承诺规则中产生了这种额外的复杂性,因为当领导者复制以前任期的条目时,日志条目会保留其原始任期编号。在其他共识算法中,如果一个新的领导者重新复制之前 "任期"中的条目,它必须用新的 "任期号 "来做。Raft的方法使得对日志条目的推理更加容易,因为它们在不同的时间和不同的日志中保持着相同的任期编号。此外,与其他算法相比,Raft的新领导从以前的任期中发送较少的日志条目(其他算法必须发送多余的日志条目,在它们被提交之前对其重新编号)。

5.4.3 Safety argument

Given the complete Raft algorithm, we can now argue more precisely that the Leader Completeness Property holds (this argument is based on the safety proof; see Section 9.2). We assume that the Leader Completeness Property does not hold, then we prove a contradiction. Suppose the leader for term T (leaderT) commits a log entry from its term, but that log entry is not stored by the leader of some future term. Consider the smallest term U > T whose leader (leaderU) does not store the entry.

基于完整的Raft算法,我们现在可以更精确地论证领导者完备性属性是否成立(这个论证是基于安全证明的,见第9.2节)。我们假设领导者完备性属性不成立,然后我们证明一个矛盾。假设任期T的领导者(leaderT)从其任期中提交了一个日志条目,但是这个日志条目没有被未来某个任期的领导者所存储。考虑最小的任期U > T,其领导者(leaderU)不存储该条目。

1. The committed entry must have been absent from leaderU’s log at the time of its election (leaders never delete or overwrite entries).

承诺的条目在leaderU当选时必须它不在的日志中(leader从不删除或改写条目)。

2. leaderT replicated the entry on a majority of the cluster, and leaderU received votes from a majority of the cluster. Thus, at least one server (“the voter”) both accepted the entry from leaderT and voted for leaderU, as shown in Figure 9. The voter is key to reaching a contradiction.

leaderT在集群的大多数服务器上复制了该条目,而领导者U从集群的大多数服务器上获得了投票。因此,至少有一台服务器("投票者")既接受了leaderT的条目,又投票给leaderU,如图9所示。这个投票者是达成矛盾的关键。

3. The voter must have accepted the committed entry from leaderT before voting for leaderU; otherwise it would have rejected the AppendEntries request from leaderT (its current term would have been higher than T).

投票者在投票给leaderU之前,必须接受leaderT的承诺条目;否则它就会拒绝leaderT的AppendEntries请求(它的当前任期会比T高)。

4. The voter still stored the entry when it voted for leaderU, since every intervening leader contained the entry (by assumption), leaders never remove entries, and followers only remove entries if they conflict with the leader.

投票者在投票给leaderU时仍然储存了该条目,因为每个介入的领导者都包含该条目(根据假设),领导者从不删除条目,而追随者只有在与领导者冲突时才会删除条目。

5. The voter granted its vote to leaderU, so leaderU’s log must have been as up-to-date as the voter’s. This leads to one of two contradictions.

投票者将自己的选票投给leaderU,所以leaderU的日志肯定和选民的一样是最新的。这导致了两个矛盾中的一个。

6. First, if the voter and leaderU shared the same last log term, then leaderU’s log must have been at least as long as the voter’s, so its log contained every entry in the voter’s log. This is a contradiction, since the voter contained the committed entry and leaderU was assumed not to.

首先,如果投票者和leaderU共享最后一个日志项,那么leaderU的日志至少要和投票者的一样长,所以它的日志包含了投票者日志中的每一个条目。这是一个矛盾,因为投票者包含了已承诺的条目,而leaderU被认为不包含。

7. Otherwise, leaderU’s last log term must have been larger than the voter’s. Moreover, it was larger than T, since the voter’s last log term was at least T (it contains the committed entry from term T). The earlier leader that created leaderU’s last log entry must have contained the committed entry in its log (by assumption). Then, by the Log Matching Property, leaderU’s log must also contain the committed entry, which is a contradiction.

否则,leaderU的最后一个日志的任期一定比投票者的大。而且,该任期比T大,因为投票人的最后一个日志任期至少是T(它包含T项的承诺条目)。创建leaderU的最后一个日志的早期leader,其日志中一定包含了已承诺的条目(根据假设)。那么,根据日志匹配属性,leaderU的日志也必须包含已承诺的条目,这是个矛盾。

8. This completes the contradiction. Thus, the leaders of all terms greater than T must contain all entries from term T that are committed in term T.

这就完成了矛盾。因此,所有大于T的任期的领导必须包含任期T的所有条目,这些条目在任期T中被承诺。

9. The Log Matching Property guarantees that future leaders will also contain entries that are committed indirectly, such as index 2 in Figure 8(d).

日志匹配属性保证了未来的领导也将包含间接承诺的条目,如图8(d)中的索引2。

Given the Leader Completeness Property, we can prove the State Machine Safety Property from Figure 3, which states that if a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index. At the time a server applies a log entry to its state machine, its log must be identical to the leader’s log up through that entry and the entry must be committed. Now consider the lowest term in which any server applies a given log index; the Log Completeness Property guarantees that the leaders for all higher terms will store that same log entry, so servers that apply the index in later terms will apply the same value. Thus, the State Machine Safety Property holds.

基于领导者完备性属性,我们可以证明图3中的状态机安全属性,即如果一个服务器在其状态机上应用了一个给定索引的日志条目,那么没有其他服务器会在同一索引上应用一个不同的日志条目。当一个服务器在其状态机上应用一个日志条目时,直到该条目的日志必须与领导者的日志相同,并且该条目必须被提交。现在考虑任何服务器应用一个给定的日志索引的最低任期;日志完整性属性保证所有更高任期的领导者将存储相同的日志条目,因此在以后的任期中应用索引的服务器将应用相同的值。因此,状态机安全属性成立。

Finally, Raft requires servers to apply entries in log index order. Combined with the State Machine Safety Property, this means that all servers will apply exactly the same set of log entries to their state machines, in the same order.

最后,Raft要求服务器按照日志索引顺序应用条目。结合状态机安全属性,这意味着所有服务器将以相同的顺序将相同的日志条目集应用于其状态机。

5.5 Follower and candidate crashes

Until this point we have focused on leader failures. Follower and candidate crashes are much simpler to handle than leader crashes, and they are both handled in the same way. If a follower or candidate crashes, then future RequestVote and AppendEntries RPCs sent to it will fail. Raft handles these failures by retrying indefinitely; if the crashed server restarts, then the RPC will complete successfully. If a server crashes after completing an RPC but before responding, then it will receive the same RPC again after it restarts. Raft RPCs are idempotent, so this causes no harm. For example, if a follower receives an AppendEntries request that includes log entries already present in its log, it ignores those entries in the new request.

在这之前,我们一直专注于领导者的失败。跟随者和候选人的崩溃比领导者的崩溃要简单得多,而且它们的处理方式都是一样的。如果一个追随者或候选人崩溃了,那么未来发送给它的RequestVote和AppendEntries RPC将会失败。Raft通过无限期地重试来处理这些失败;如果崩溃的服务器重新启动,那么RPC将成功完成。如果服务器在完成RPC后但在响应前崩溃,那么它将在重新启动后再次收到相同的RPC。Raft RPC是幂等的,所以这不会造成任何伤害。例如,如果一个跟随者收到一个AppendEntries请求,其中包括已经存在于其日志中的日志条目,那么它将忽略新请求中的这些条目。

5.6 Timing and availability

One of our requirements for Raft is that safety must not depend on timing: the system must not produce incorrect results just because some event happens more quickly or slowly than expected. However, availability (the ability of the system to respond to clients in a timely manner) must inevitably depend on timing. For example, if message exchanges take longer than the typical time between server crashes, candidates will not stay up long enough to win an election; without a steady leader, Raft cannot make progress.

我们对Raft的要求之一是安全性不能依赖于时间:系统不能因为某些事件发生得比预期快或慢而产生错误的结果。然而,可用性(系统及时响应客户的能力)必须不可避免地取决于时间。例如,如果消息交换的时间超过了服务器崩溃之间的典型时间,那么候选人就不会保持足够长的时间来赢得选举;没有一个稳定的领导者,Raft就无法取得进展。

Leader election is the aspect of Raft where timing is most critical. Raft will be able to elect and maintain a steady leader as long as the system satisfies the following timing requirement:

领袖选举是Raft中时间最关键的方面。只要系统满足以下时间要求,Raft就能选出并维持一个稳定的领导者:

broadcastTime electionTimeout MTBF

In this inequality broadcastTime is the average time it takes a server to send RPCs in parallel to every server in the cluster and receive their responses; electionTimeout is the election timeout described in Section 5.2; and MTBF is the average time between failures for a single server. The broadcast time should be an order of magnitude less than the election timeout so that leaders can reliably send the heartbeat messages required to keep followers from starting elections; given the randomized approach used for election timeouts, this inequality also makes split votes unlikely. The election timeout should be a few orders of magnitude less than MTBF so that the system makes steady progress. When the leader crashes, the system will be unavailable for roughly the election time out; we would like this to represent only a small fraction of overall time.

在这个不等式中,broadcastTime是一台服务器向集群中的每台服务器并行发送RPC并接收其响应所需的平均时间;electionTimeout是第5.2节中描述的选举超时;MTBF是单台服务器的平均故障间隔时间。广播时间应该比选举超时少一个数量级,这样领导者就可以可靠地发送心跳信息,以防止追随者开始选举;考虑到选举超时使用的随机方法,这种不平等也使得分裂投票不太可能发生。选举超时应该比MTBF小几个数量级,这样系统才能稳步前进。当领导者崩溃时,系统将在大约选举超时的时间内不可用;我们希望这只占总体时间的一小部分。

The broadcast time and MTBF are properties of the underlying system, while the election timeout is something we must choose. Raft’s RPCs typically require the recipient to persist information to stable storage, so the broadcast time may range from 0.5ms to 20ms, depending on storage technology. As a result, the election timeout is likely to be somewhere between 10ms and 500ms. Typical server MTBFs are several months or more, which easily satisfies the timing requirement.

广播时间和MTBF是底层系统的属性,而选举超时是我们必须选择的。Raft的RPC通常要求接收者将信息持久化到稳定的存储中,所以广播时间可能在0.5ms到20ms之间,这取决于存储技术。因此,选举超时可能是在10ms到500ms之间。典型的服务器MTBF是几个月或更长时间,这很容易满足时间要求。

0 人点赞