Raft: 寻找可理解的共识算法(2)

2022-07-06 15:17:08 浏览数 (1)

5 The Raft consensus algorithm

Figure 2: A condensed summary of the Raft consensus algorithm (excluding membership changes and log compaction). The server behavior in the upper-left box is described as a set of rules that trigger independently and repeatedly. Section numbers such as §5.2 indicate where particular features are discussed. A formal specification [31] describes the algorithm more precisely.

图2:Raft共识算法的浓缩摘要(不包括成员变化和日志压实)。左上角方框中的服务器行为被描述为一组独立和重复触发的规则。诸如§5.2的章节编号表示讨论特定功能的地方。一个正式的规范[31]更精确地描述了该算法。

Election Safety: at most one leader can be elected in a given term. §5.2 选举安全:在一个特定的任期内最多可以选出一名领导人。§5.2

Leader Append-Only: a leader never overwrites or deletes entries in its log; it only appends new entries. §5.3 Leader Append-Only:领导者从不覆盖或删除其日志中的条目;它只附加新条目。§5.3

Log Matching: if two logs contain an entry with the same index and term, then the logs are identical in all entries up through the given index. §5.3 日志匹配:如果两个日志包含一个具有相同索引和任期的条目,那么这两个日志的所有条目到给定索引都是相同的。§5.3

Leader Completeness: if a log entry is committed in a given term, then that entry will be present in the logs of the leaders for all higher-numbered terms. §5.4 领导者的完整性:如果一个日志条目在某一任期中被承诺,那么该条目将出现在所有更高编号任期的领导者的日志中。§5.4

State Machine Safety: if a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index. §5.4.3 状态机安全:如果一个服务器在其状态机上应用了一个给定索引的日志条目,那么其他服务器将永远不会为同一索引应用不同的日志条目。§5.4.3

Figure 3: Raft guarantees that each of these properties is true at all times. The section numbers indicate where each property is discussed.

图3:Raft保证这些属性中的每一个在任何时候都是真的。节号表示每个属性的讨论位置。

Raft is an algorithm for managing a replicated log of the form described in Section 2. Figure 2 summarizes the algorithm in condensed form for reference, and Figure 3 lists key properties of the algorithm; the elements of these figures are discussed piecewise over the rest of this section.

Raft是一种用于管理第2节所述形式的复制日志的算法。图2概括了该算法的浓缩形式以供参考,图3列出了该算法的关键属性;这些数字元素将在本节的其余部分逐一讨论。

Raft implements consensus by first electing a distinguished leader, then giving the leader complete responsibility for managing the replicated log. The leader accepts log entries from clients, replicates them on other servers, and tells servers when it is safe to apply log entries to their state machines. Having a leader simplifies the management of the replicated log. For example, the leader can decide where to place new entries in the log without consulting other servers, and data flows in a simple fashion from the leader to other servers. A leader can fail or become disconnected from the other servers, in which case a new leader is elected.

Raft通过首先选举一个不同的领导者来实现共识,然后让领导者完全负责管理复制的日志。领导接受来自客户端的日志条目,将其复制到其他服务器上,并告诉服务器何时可以安全地将日志条目应用到他们的状态机。有一个领导者可以简化对复制日志的管理。例如,领导者可以决定在日志中放置新条目的位置,而不需要咨询其他服务器,并且数据以一种简单的方式从领导者流向其他服务器。一个领导者可能会失败或与其他服务器断开连接,在这种情况下,会选出一个新的领导者。

Given the leader approach, Raft decomposes the consensus problem into three relatively independent subproblems, which are discussed in the subsections that follow:

考虑到领导方法,Raft将共识问题分解为三个相对独立的子问题,这将在后面的小节中讨论:

• Leader election: a new leader must be chosen when an existing leader fails (Section 5.2).

领袖选举:当现有领袖失败时,必须选择一个新的领袖(第5.2节)。

• Log replication: the leader must accept log entries from clients and replicate them across the cluster, forcing the other logs to agree with its own (Section 5.3).

日志复制:领导者必须接受来自客户端的日志条目,并在整个集群中进行复制,迫使其他日志与自己的日志一致(第5.3节)。

• Safety: the key safety property for Raft is the State Machine Safety Property in Figure 3: if any server has applied a particular log entry to its state machine, then no other server may apply a different command for the same log index. Section 5.4 describes how Raft ensures this property; the solution involves an additional restriction on the election mechanism described in Section 5.2.

安全性:Raft的关键安全属性是图3中的状态机安全属性:如果任何服务器对其状态机应用了一个特定的日志条目,那么其他服务器不得对同一日志索引应用不同的命令。第5.4节描述了Raft如何确保这一属性;该解决方案涉及对第5.2节中描述的选举机制的额外限制。

After presenting the consensus algorithm, this section discusses the issue of availability and the role of timing in the system.

在介绍了共识算法之后,本节讨论了系统中的可用性问题和计时的作用。

5.1 Raft basics

Figure 4: Server states. Followers only respond to requests from other servers. If a follower receives no communication, it becomes a candidate and initiates an election. A candidate that receives votes from a majority of the full cluster becomes the new leader. Leaders typically operate until they fail.

图4:服务器状态。Follower只对来自其他服务器的请求做出回应。如果一个Follower没有收到任何通信,它就会成为一个Candidate并发起选举。获得整个集群中大多数人投票的Candidate成为新的Leader。Leader通常会一直运行,直到他们失效。

Figure 5: Time is divided into terms, and each term begins with an election. After a successful election, a single leader manages the cluster until the end of the term. Some elections fail, in which case the term ends without choosing a leader. The transitions between terms may be observed at different times on different servers.

图5:时间被划分为几个任期,每个任期以选举开始。选举成功后,由一个领导者管理集群,直到任期结束。有些选举失败,在这种情况下,任期结束时没有选择领导者。在不同的服务器上,可以在不同的时间观察到任期之间的转换。

A Raft cluster contains several servers; five is a typical number, which allows the system to tolerate two failures. At any given time each server is in one of three states: leader, follower, or candidate. In normal operation there is exactly one leader and all of the other servers are followers. Followers are passive: they issue no requests on their own but simply respond to requests from leaders and candidates. The leader handles all client requests (if a client contacts a follower, the follower redirects it to the leader). The third state, candidate, is used to elect a new leader as described in Section 5.2. Figure 4 shows the states and their transitions; the transitions are discussed below.

一个Raft集群包含几个服务器;五个是典型的数字,这使得系统可以容忍两个故障服务器。在任何时候,每个服务器都处于三种状态之一:领导者、追随者或候选者。在正常操作中,有且仅有一个领导者,其他所有的服务器都是追随者。追随者是被动的:他们自己不发出任何请求,只是对领导者和候选人的请求做出回应。领导者处理所有客户端的请求(如果客户端联系追随者,追随者将其重定向到领导者)。第三种状态,候选人,被用来选举一个新的领导者,如第5.2节所述。图4显示了这些状态和它们的转换;下面将讨论这些转换。

Raft divides time into terms of arbitrary length, as shown in Figure 5. Terms are numbered with consecutive integers. Each term begins with an election, in which one or more candidates attempt to become leader as described in Section 5.2. If a candidate wins the election, then it serves as leader for the rest of the term. In some situations an election will result in a split vote. In this case the term will end with no leader; a new term (with a new election) will begin shortly. Raft ensures that there is at most one leader in a given term.

如图5所示,Raft将时间划分为任意长度的任期。任期用连续的整数来编号。每个任期以选举开始,其中一个或多个候选人试图成为领导者,如第5.2节所述。如果一个候选人在选举中获胜,那么他将在剩下的任期内担任领导者。在某些情况下,选举的结果是分裂票。在这种情况下,任期结束时将没有领导者;新的任期(新的选举)将很快开始。Raft确保在一个给定的任期内最多有一个领导者。

Different servers may observe the transitions between terms at different times, and in some situations a server may not observe an election or even entire terms. Terms act as a logical clock [14] in Raft, and they allow servers to detect obsolete information such as stale leaders. Each server stores a current term number, which increases monotonically over time. Current terms are exchanged whenever servers communicate; if one server’s current term is smaller than the other’s, then it updates its current term to the larger value. If a candidate or leader discovers that its term is out of date, it immediately reverts to follower state. If a server receives a request with a stale term number, it rejects the request.

不同的服务器可能会在不同的时间观察任期之间的转换,在某些情况下,一个服务器可能不会观察一个选举甚至整个任期。任期在Raft中充当了逻辑时钟[14],它们允许服务器检测过时的信息,如过时的领导者。每个服务器都存储一个当前的任期编号,该编号随时间单调地增加。每当服务器进行通信时,就会交换当前任期;如果一个服务器的当前任期比另一个服务器的小,那么它就会将其当前任期更新为较大的值。如果一个候选者或领导者发现它的任期已经过时,它将立即恢复到跟随者状态。如果一个服务器收到的请求是一个过时的任期编号,它将拒绝该请求。

Raft servers communicate using remote procedure calls (RPCs), and the basic consensus algorithm requires only two types of RPCs. RequestVote RPCs are initiated by candidates during elections (Section 5.2), and Append Entries RPCs are initiated by leaders to replicate log entries and to provide a form of heartbeat (Section 5.3). Section 7 adds a third RPC for transferring snapshots between servers. Servers retry RPCs if they do not receive a response in a timely manner, and they issue RPCs in parallel for best performance.

Raft服务器使用远程过程调用(RPCs)进行通信,基本的共识算法只需要两种类型的RPCs。RequestVote RPCs由候选人在选举期间发起(第5.2节),Append Entries RPCs由领导者发起,用于复制日志条目并提供一种心跳形式(第5.3节)。第7节增加了第三个RPC,用于在服务器之间传输快照。如果服务器没有及时收到响应,它们会重试RPC,并且为了获得最佳性能,它们会并行地发出RPC。

5.2 Leader election

Raft uses a heartbeat mechanism to trigger leader election. When servers start up, they begin as followers. A server remains in follower state as long as it receives valid RPCs from a leader or candidate. Leaders send periodic heartbeats (AppendEntries RPCs that carry no log entries) to all followers in order to maintain their authority. If a follower receives no communication over a period of time called the election timeout, then it assumes there is no viable leader and begins an election to choose a new leader.

Raft使用心跳机制来触发领导者选举。当服务器启动时,它们开始是追随者。只要服务器收到来自领导者或候选人的有效RPC,它就一直处于跟随者状态。领导者定期向所有追随者发送心跳(AppendEntries RPCs,不携带任何日志条目),以保持他们的权威。如果追随者在一段被称为选举超时的时间内没有收到任何通信,那么它就认为没有可行的领导者,并开始选举以选择一个新的领导者。

To begin an election, a follower increments its current term and transitions to candidate state. It then votes for itself and issues RequestVote RPCs in parallel to each of the other servers in the cluster. A candidate continues in this state until one of three things happens: (a) it wins the election, (b) another server establishes itself as leader, or (c) a period of time goes by with no winner. These outcomes are discussed separately in the paragraphs below.

为了开始选举,追随者增加它的当前任期并过渡到候选状态。然后,它为自己投票,并向集群中的每个其他服务器发出RequestVote RPCs。候选者继续处于这种状态,直到发生三种情况之一。(a)它赢得了选举,(b)另一个服务器确立了自己的领导地位,或者(c)一段时间内没有赢家。这些结果将在下面的段落中分别讨论。

A candidate wins an election if it receives votes from a majority of the servers in the full cluster for the same term. Each server will vote for at most one candidate in a given term, on a first-come-first-served basis (note: Section 5.4 adds an additional restriction on votes). The majority rule ensures that at most one candidate can win the election for a particular term (the Election Safety Property in Figure 3). Once a candidate wins an election, it becomes leader. It then sends heartbeat messages to all of the other servers to establish its authority and prevent new elections.

如果一个候选人在同一任期内获得了整个集群中大多数服务器的投票,那么它就赢得了选举。每台服务器在给定的任期内最多为一名候选人投票,以先来后到为原则(注:第5.4节对投票增加了一个额外的限制)。少数服从多数的原则保证了最多只有一名候选人能够在某一任期内赢得选举(图3中的选举安全属性)。一旦一个候选人在选举中获胜,它就成为领导者。然后,它向所有其他服务器发送心跳信息,以建立其权威并防止新的选举。

While waiting for votes, a candidate may receive an AppendEntries RPC from another server claiming to be leader. If the leader’s term (included in its RPC) is at least as large as the candidate’s current term, then the candidate recognizes the leader as legitimate and returns to follower state. If the term in the RPC is smaller than the candidate’s current term, then the candidate rejects the RPC and continues in candidate state.

在等待投票的过程中,候选人可能会收到另一个服务器的AppendEntries RPC,声称自己是领导者。如果领导者的任期(包括在其RPC中)至少与候选人的当前任期一样大,那么候选人就会承认领导者是合法的,并返回到跟随者状态。如果RPC中的任期比候选者当前的任期小,那么候选者拒绝RPC,继续处于候选状态。

The third possible outcome is that a candidate neither wins nor loses the election: if many followers become candidates at the same time, votes could be split so that no candidate obtains a majority. When this happens, each candidate will time out and start a new election by incrementing its term and initiating another round of Request Vote RPCs. However, without extra measures split votes could repeat indefinitely.

第三个可能的结果是,一个候选人既没有赢得选举,也没有失去选举:如果许多追随者同时成为候选人,选票可能被分割,因此没有候选人获得多数。当这种情况发生时,每个候选人都会超时,并通过增加其任期和启动新一轮的请求投票RPC来开始新的选举。然而,如果没有额外的措施,分裂的投票可能会无限期地重复。

Raft uses randomized election timeouts to ensure that split votes are rare and that they are resolved quickly. To prevent split votes in the first place, election timeouts are chosen randomly from a fixed interval (e.g., 150–300ms). This spreads out the servers so that in most cases only a single server will time out; it wins the election and sends heartbeats before any other servers time out. The same mechanism is used to handle split votes. Each candidate restarts its randomized election timeout at the start of an election, and it waits for that timeout to elapse before starting the next election; this reduces the likelihood of another split vote in the new election. Section 9.3 shows that this approach elects a leader rapidly.

Raft使用随机的选举超时,以确保分裂投票很少发生,并能迅速解决。为了从一开始就防止分裂投票,选举超时是从一个固定的时间间隔中随机选择的(例如,150-300ms)。这就分散了服务器,所以在大多数情况下,只有一个服务器会超时;它赢得了选举,并在任何其他服务器超时之前发送心跳信号。同样的机制被用来处理分裂投票。每个候选人在选举开始时重新启动其随机的选举超时,并等待超时过后再开始下一次选举;这减少了在新的选举中再次出现分裂票的可能性。第9.3节显示,这种方法可以迅速选出一个领导者。

Elections are an example of how understandability guided our choice between design alternatives. Initially we planned to use a ranking system: each candidate was assigned a unique rank, which was used to select between competing candidates. If a candidate discovered another candidate with higher rank, it would return to follower state so that the higher ranking candidate could more easily win the next election. We found that this approach created subtle issues around availability (a lower-ranked server might need to time out and become a candidate again if a higher-ranked server fails, but if it does so too soon, it can reset progress towards electing a leader). We made adjustments to the algorithm several times, but after each adjustment new corner cases appeared. Eventually we concluded that the randomized retry approach is more obvious and understandable.

选举是一个例子,说明可理解性是如何指导我们在设计方案之间进行选择的。最初我们计划使用一个排名系统:每个候选人被分配一个独特的排名,用来在竞争的候选人之间进行选择。如果一个候选人发现了另一个排名更高的候选人,它就会回到追随者的状态,这样排名更高的候选人就能更容易地赢得下一次选举。我们发现这种方法在可用性方面产生了一些微妙的问题(如果一个排名较高的服务器失败了,一个排名较低的服务器可能需要超时并再次成为候选人,但如果它过早地这样做,它可能会重置选举领导者的进展)。我们对算法进行了多次调整,但每次调整后都会出现新的边界案例。最终我们得出结论,随机重试的方法更加明显和容易理解。

0 人点赞