6 Cluster membership changes
Figure 10: Switching directly from one configuration to another is unsafe because different servers will switch at different times. In this example, the cluster grows from three servers to five. Unfortunately, there is a point in time where two different leaders can be elected for the same term, one with a majority of the old configuration (C_old) and another with a majority of the new configuration (C_new).
Figure 11: Timeline for a configuration change. Dashed lines show configuration entries that have been created but not committed, and solid lines show the latest committed configuration entry. The leader first creates the C_old,new configuration entry in its log and commits it to C_old,new (a majority of C_old and a majority of C_new). Then it creates the C_new entry and commits it to a majority of C_new. There is no point in time in which C_old and C_new can both make decisions independently.
Up until now we have assumed that the cluster configuration (the set of servers participating in the consensus algorithm) is fixed. In practice, it will occasionally be necessary to change the configuration, for example to replace servers when they fail or to change the degree of replication. Although this can be done by taking the entire cluster off-line, updating configuration files, and then restarting the cluster, this would leave the cluster unavailable during the changeover. In addition, if there are any manual steps, they risk operator error. In order to avoid these issues, we decided to automate configuration changes and incorporate them into the Raft consensus algorithm.
For the configuration change mechanism to be safe, there must be no point during the transition where it is possible for two leaders to be elected for the same term. Unfortunately, any approach where servers switch directly from the old configuration to the new configuration is unsafe. It isn’t possible to atomically switch all of the servers at once, so the cluster can potentially split into two independent majorities during the transition (see Figure 10).
In order to ensure safety, configuration changes must use a two-phase approach. There are a variety of ways to implement the two phases. For example, some systems (e.g., [22]) use the first phase to disable the old configuration so it cannot process client requests; then the second phase enables the new configuration. In Raft the cluster first switches to a transitional configuration we call joint consensus; once the joint consensus has been committed, the system then transitions to the new configuration. The joint consensus combines both the old and new configurations:
• Log entries are replicated to all servers in both configurations.
• Any server from either configuration may serve as leader.
• Agreement (for elections and entry commitment) requires separate majorities from both the old and new configurations.
The joint consensus allows individual servers to transition between configurations at different times without compromising safety. Furthermore, joint consensus allows the cluster to continue servicing client requests throughout the configuration change.
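To make the last of these rules concrete, the following Go sketch (the types and names are illustrative, not part of Raft's specification) evaluates agreement during joint consensus: a candidate counting votes, or a leader counting replication acknowledgments, needs a separate majority of C_old and of C_new rather than a majority of their union.

```go
package raft

// Config lists the voting members of one configuration (illustrative type).
type Config struct {
	Servers []string
}

// quorum reports whether acks covers a majority of cfg.
func quorum(cfg Config, acks map[string]bool) bool {
	n := 0
	for _, s := range cfg.Servers {
		if acks[s] {
			n++
		}
	}
	return n > len(cfg.Servers)/2
}

// jointQuorum is the agreement rule while C_old,new is in effect: separate
// majorities from both the old and the new configuration.
func jointQuorum(oldCfg, newCfg Config, acks map[string]bool) bool {
	return quorum(oldCfg, acks) && quorum(newCfg, acks)
}
```

Note that a bare majority of the union would not be safe: for a change from {1,2,3} to {1,2,3,4,5}, the servers {3,4,5} form a majority of the union and of C_new, but not of C_old.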
Cluster configurations are stored and communicated using special entries in the replicated log; Figure 11 illustrates the configuration change process. When the leader receives a request to change the configuration from C_old to C_new, it stores the configuration for joint consensus (C_old,new in the figure) as a log entry and replicates that entry using the mechanisms described previously. Once a given server adds the new configuration entry to its log, it uses that configuration for all future decisions (a server always uses the latest configuration in its log, regardless of whether the entry is committed). This means that the leader will use the rules of C_old,new to determine when the log entry for C_old,new is committed. If the leader crashes, a new leader may be chosen under either C_old or C_old,new, depending on whether the winning candidate has received C_old,new. In any case, C_new cannot make unilateral decisions during this period.
Once C_old,new has been committed, neither C_old nor C_new can make decisions without approval of the other, and the Leader Completeness Property ensures that only servers with the C_old,new log entry can be elected as leader. It is now safe for the leader to create a log entry describing C_new and replicate it to the cluster. Again, this configuration will take effect on each server as soon as it is seen. When the new configuration has been committed under the rules of C_new, the old configuration is irrelevant and servers not in the new configuration can be shut down. As shown in Figure 11, there is no time when C_old and C_new can both make unilateral decisions; this guarantees safety.
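Putting the two phases together, a leader-side outline might look like the following sketch (the helper interface and names are assumptions made for illustration; the paper does not prescribe an API):

```go
package raft

// ConfigEntry is a log entry carrying a cluster configuration. While Old is
// non-empty the entry denotes the joint configuration C_old,new.
type ConfigEntry struct {
	Old, New []string // member addresses
}

// configLog abstracts the pieces of the leader's log this sketch needs.
type configLog interface {
	Append(e ConfigEntry) (index uint64)
	WaitCommitted(index uint64) error
}

// changeConfig drives C_old -> C_old,new -> C_new. A configuration entry takes
// effect on a server as soon as it reaches that server's log, committed or
// not; commitment only gates moving on to the next phase.
func changeConfig(log configLog, oldCfg, newCfg []string) error {
	// Phase 1: replicate C_old,new; it commits under joint-majority rules.
	idx := log.Append(ConfigEntry{Old: oldCfg, New: newCfg})
	if err := log.WaitCommitted(idx); err != nil {
		return err
	}
	// Phase 2: replicate C_new; it commits under C_new's majority alone. After
	// this, servers outside newCfg may be shut down, and a leader that is not
	// itself in newCfg steps down.
	idx = log.Append(ConfigEntry{New: newCfg})
	return log.WaitCommitted(idx)
}
```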
There are three more issues to address for reconfiguration. The first issue is that new servers may not initially store any log entries. If they are added to the cluster in this state, it could take quite a while for them to catch up, during which time it might not be possible to commit new log entries. In order to avoid availability gaps, Raft introduces an additional phase before the configuration change, in which the new servers join the cluster as non-voting members (the leader replicates log entries to them, but they are not considered for majorities). Once the new servers have caught up with the rest of the cluster, the reconfiguration can proceed as described above.
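A minimal sketch of this catch-up phase, assuming the leader tracks per-peer replication progress (field names are illustrative): learners receive entries like any other follower but are skipped when computing the commit index.

```go
package raft

import "sort"

// Peer describes one server the leader replicates to. A learner is a new
// server still catching up: it receives entries but does not count toward
// majorities.
type Peer struct {
	ID         string
	Learner    bool
	MatchIndex uint64 // highest log index known to be replicated on this peer
}

// commitIndex returns the largest index replicated on a majority of voting
// members; leaderMatch is the index of the leader's own last log entry.
func commitIndex(peers []Peer, leaderMatch uint64) uint64 {
	matches := []uint64{leaderMatch} // the leader counts itself
	for _, p := range peers {
		if !p.Learner {
			matches = append(matches, p.MatchIndex)
		}
	}
	sort.Slice(matches, func(i, j int) bool { return matches[i] > matches[j] })
	return matches[len(matches)/2] // held by at least a majority of voters
}
```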
The second issue is that the cluster leader may not be part of the new configuration. In this case, the leader steps down (returns to follower state) once it has committed the C_new log entry. This means that there will be a period of time (while it is committing C_new) when the leader is managing a cluster that does not include itself; it replicates log entries but does not count itself in majorities. The leader transition occurs when C_new is committed because this is the first point when the new configuration can operate independently (it will always be possible to choose a leader from C_new). Before this point, it may be the case that only a server from C_old can be elected leader.
The third issue is that removed servers (those not in C_new) can disrupt the cluster. These servers will not receive heartbeats, so they will time out and start new elections. They will then send RequestVote RPCs with new term numbers, and this will cause the current leader to revert to follower state. A new leader will eventually be elected, but the removed servers will time out again and the process will repeat, resulting in poor availability.
To prevent this problem, servers disregard RequestVote RPCs when they believe a current leader exists. Specifically, if a server receives a RequestVote RPC within the minimum election timeout of hearing from a current leader, it does not update its term or grant its vote. This does not affect normal elections, where each server waits at least a minimum election timeout before starting an election. However, it helps avoid disruptions from removed servers: if a leader is able to get heartbeats to its cluster, then it will not be deposed by larger term numbers.
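A sketch of that guard (the structure and field names are illustrative; the paper states the rule but not a concrete implementation):

```go
package raft

import (
	"sync"
	"time"
)

type RequestVoteArgs struct {
	Term        uint64
	CandidateID string
}

type RequestVoteReply struct {
	Term        uint64
	VoteGranted bool
}

type Server struct {
	mu                 sync.Mutex
	currentTerm        uint64
	lastLeaderContact  time.Time // updated on every valid AppendEntries from the leader
	minElectionTimeout time.Duration
}

// HandleRequestVote shows only the Section 6 guard: if this server heard from
// a current leader within the minimum election timeout, it ignores the RPC,
// neither updating its term nor granting its vote. The normal vote-granting
// rules (term comparison, log up-to-date check) would follow the guard.
func (s *Server) HandleRequestVote(args RequestVoteArgs) RequestVoteReply {
	s.mu.Lock()
	defer s.mu.Unlock()

	if time.Since(s.lastLeaderContact) < s.minElectionTimeout {
		// A removed server with a larger term cannot depose a leader that can
		// still reach a majority of its cluster.
		return RequestVoteReply{Term: s.currentTerm, VoteGranted: false}
	}

	// ... normal RequestVote handling omitted ...
	return RequestVoteReply{Term: s.currentTerm, VoteGranted: false}
}
```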
7 Log compaction
Figure 12: A server replaces the committed entries in its log (indexes 1 through 5) with a new snapshot, which stores just the current state (variables x and y in this example). The snapshot’s last included index and term serve to position the snapshot in the log preceding entry 6.
Raft’s log grows during normal operation to incorporate more client requests, but in a practical system, it cannot grow without bound. As the log grows longer, it occupies more space and takes more time to replay. This will eventually cause availability problems without some mechanism to discard obsolete information that has accumulated in the log.
Snapshotting is the simplest approach to compaction. In snapshotting, the entire current system state is written to a snapshot on stable storage, then the entire log up to that point is discarded. Snapshotting is used in Chubby and ZooKeeper, and the remainder of this section describes snapshotting in Raft.
Incremental approaches to compaction, such as log cleaning [36] and log-structured merge trees [30, 5], are also possible. These operate on a fraction of the data at once, so they spread the load of compaction more evenly over time. They first select a region of data that has accumulated many deleted and overwritten objects, then they rewrite the live objects from that region more compactly and free the region. This requires significant additional mechanism and complexity compared to snapshotting, which simplifies the problem by always operating on the entire data set. While log cleaning would require modifications to Raft, state machines can implement LSM trees using the same interface as snapshotting.
Figure 12 shows the basic idea of snapshotting in Raft. Each server takes snapshots independently, covering just the committed entries in its log. Most of the work consists of the state machine writing its current state to the snapshot. Raft also includes a small amount of metadata in the snapshot: the last included index is the index of the last entry in the log that the snapshot replaces (the last entry the state machine had applied), and the last included term is the term of this entry. These are preserved to support the AppendEntries consistency check for the first log entry following the snapshot, since that entry needs a previous log index and term. To enable cluster membership changes (Section 6), the snapshot also includes the latest configuration in the log as of last included index. Once a server completes writing a snapshot, it may delete all log entries up through the last included index, as well as any prior snapshot.
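A minimal sketch of this metadata and of the truncation it permits, assuming the log is held as a flat slice whose first element sits at index firstIndex (an assumption made for illustration, not the paper's layout):

```go
package raft

// Entry is a single log entry.
type Entry struct {
	Term    uint64
	Command []byte
}

// Snapshot carries the state-machine image plus the metadata described above:
// the index and term of the last entry it replaces, and the latest cluster
// configuration as of that index.
type Snapshot struct {
	LastIncludedIndex uint64
	LastIncludedTerm  uint64
	Config            []string // latest configuration as of LastIncludedIndex
	Data              []byte   // serialized state-machine state
}

// compactLog discards every entry up through snap.LastIncludedIndex and
// returns the surviving suffix together with its new first index.
func compactLog(entries []Entry, firstIndex uint64, snap Snapshot) ([]Entry, uint64) {
	if snap.LastIncludedIndex < firstIndex {
		return entries, firstIndex // nothing new to discard
	}
	keep := snap.LastIncludedIndex - firstIndex + 1
	if keep >= uint64(len(entries)) {
		return nil, snap.LastIncludedIndex + 1 // the snapshot covers the whole log
	}
	return entries[keep:], snap.LastIncludedIndex + 1
}
```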
Although servers normally take snapshots independently, the leader must occasionally send snapshots to followers that lag behind. This happens when the leader has already discarded the next log entry that it needs to send to a follower. Fortunately, this situation is unlikely in normal operation: a follower that has kept up with the leader would already have this entry. However, an exceptionally slow follower or a new server joining the cluster (Section 6) would not. The way to bring such a follower up-to-date is for the leader to send it a snapshot over the network.
Figure 13: A summary of the InstallSnapshot RPC. Snapshots are split into chunks for transmission; this gives the follower a sign of life with each chunk, so it can reset its election timer.
The leader uses a new RPC called InstallSnapshot to send snapshots to followers that are too far behind; see Figure 13. When a follower receives a snapshot with this RPC, it must decide what to do with its existing log entries. Usually the snapshot will contain new information not already in the recipient’s log. In this case, the follower discards its entire log; it is all superseded by the snapshot and may possibly have uncommitted entries that conflict with the snapshot. If instead the follower receives a snapshot that describes a prefix of its log (due to retransmission or by mistake), then log entries covered by the snapshot are deleted but entries following the snapshot are still valid and must be retained.
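Reusing the Entry and Snapshot types from the compaction sketch above, the follower's decision could be expressed as follows (again a hedged sketch, not the paper's code):

```go
// applyInstallSnapshot returns the follower's log after installing a snapshot.
// If the snapshot matches an entry the follower already holds, it only covers
// a prefix of the log and the suffix is retained; otherwise the entire log is
// discarded.
func applyInstallSnapshot(entries []Entry, firstIndex uint64, snap Snapshot) ([]Entry, uint64) {
	offset := int64(snap.LastIncludedIndex) - int64(firstIndex)
	if offset >= 0 && offset < int64(len(entries)) &&
		entries[offset].Term == snap.LastIncludedTerm {
		// Retransmission or stale snapshot: keep the entries that follow it.
		return entries[offset+1:], snap.LastIncludedIndex + 1
	}
	// The snapshot contains information not in the log: discard the log.
	return nil, snap.LastIncludedIndex + 1
}
```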
This snapshotting approach departs from Raft’s strong leader principle, since followers can take snapshots without the knowledge of the leader. However, we think this departure is justified. While having a leader helps avoid conflicting decisions in reaching consensus, consensus has already been reached when snapshotting, so no decisions conflict. Data still only flows from leaders to followers, just followers can now reorganize their data.
We considered an alternative leader-based approach in which only the leader would create a snapshot, then it would send this snapshot to each of its followers. However, this has two disadvantages. First, sending the snapshot to each follower would waste network bandwidth and slow the snapshotting process. Each follower already has the information needed to produce its own snapshots, and it is typically much cheaper for a server to produce a snapshot from its local state than it is to send and receive one over the network. Second, the leader’s implementation would be more complex. For example, the leader would need to send snapshots to followers in parallel with replicating new log entries to them, so as not to block new client requests.
There are two more issues that impact snapshotting performance. First, servers must decide when to snapshot. If a server snapshots too often, it wastes disk bandwidth and energy; if it snapshots too infrequently, it risks exhausting its storage capacity, and it increases the time required to replay the log during restarts. One simple strategy is to take a snapshot when the log reaches a fixed size in bytes. If this size is set to be significantly larger than the expected size of a snapshot, then the disk bandwidth overhead for snapshotting will be small.
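In sketch form, with an illustrative threshold (the paper does not prescribe a value):

```go
package raft

// snapshotThresholdBytes should be set well above the expected snapshot size
// so that snapshotting adds only a small fraction of the disk bandwidth.
const snapshotThresholdBytes = 64 << 20 // 64 MiB; an illustrative value

// maybeSnapshot triggers a snapshot once the log passes the size threshold.
func maybeSnapshot(logSizeBytes int64, takeSnapshot func() error) error {
	if logSizeBytes < snapshotThresholdBytes {
		return nil
	}
	return takeSnapshot()
}
```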
The second performance issue is that writing a snapshot can take a significant amount of time, and we do not want this to delay normal operations. The solution is to use copy-on-write techniques so that new updates can be accepted without impacting the snapshot being written. For example, state machines built with functional data structures naturally support this. Alternatively, the operating system’s copy-on-write support (e.g., fork on Linux) can be used to create an in-memory snapshot of the entire state machine (our implementation uses this approach).
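A minimal sketch of the functional-data-structure variant (types and names are illustrative): the state machine keeps its data in an immutable tree, so capturing a snapshot view is an O(1) pointer copy, and serialization can run in the background while new commands continue to be applied.

```go
package kvsm

import "sync"

// node is an immutable tree node; set copies only the path from the root to
// the changed key, so every old root remains a consistent point-in-time view.
type node struct {
	key, val    string
	left, right *node
}

func set(n *node, key, val string) *node {
	if n == nil {
		return &node{key: key, val: val}
	}
	c := *n // copy this node; untouched subtrees are shared
	switch {
	case key < n.key:
		c.left = set(n.left, key, val)
	case key > n.key:
		c.right = set(n.right, key, val)
	default:
		c.val = val
	}
	return &c
}

// StateMachine applies committed commands under a short lock; SnapshotView
// just captures the current root, so writing the snapshot can proceed in the
// background while Apply keeps producing new roots.
type StateMachine struct {
	mu   sync.Mutex
	root *node
}

func (sm *StateMachine) Apply(key, val string) {
	sm.mu.Lock()
	sm.root = set(sm.root, key, val)
	sm.mu.Unlock()
}

func (sm *StateMachine) SnapshotView() *node {
	sm.mu.Lock()
	defer sm.mu.Unlock()
	return sm.root // O(1); the returned tree never changes
}
```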
8 Client interaction
This section describes how clients interact with Raft, including how clients find the cluster leader and how Raft supports linearizable semantics [10]. These issues apply to all consensus-based systems, and Raft’s solutions are similar to other systems.
Clients of Raft send all of their requests to the leader. When a client first starts up, it connects to a randomly chosen server. If the client’s first choice is not the leader, that server will reject the client’s request and supply information about the most recent leader it has heard from (AppendEntries requests include the network address of the leader). If the leader crashes, client requests will time out; clients then try again with randomly-chosen servers.
Our goal for Raft is to implement linearizable semantics (each operation appears to execute instantaneously, exactly once, at some point between its invocation and its response). However, as described so far Raft can execute a command multiple times: for example, if the leader crashes after committing the log entry but before responding to the client, the client will retry the command with a new leader, causing it to be executed a second time. The solution is for clients to assign unique serial numbers to every command. Then, the state machine tracks the latest serial number processed for each client, along with the associated response. If it receives a command whose serial number has already been executed, it responds immediately without re-executing the request.
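A sketch of such session tracking in a key-value state machine (the command shape and field names are assumptions; the paper only requires a unique serial number per client and a remembered response):

```go
package kvsm

// session records, per client, the last serial number applied and the reply
// it produced, so a retried command is answered without re-executing it.
type session struct {
	lastSerial uint64
	lastReply  string
}

type KVStore struct {
	data     map[string]string
	sessions map[string]session // keyed by client ID
}

func NewKVStore() *KVStore {
	return &KVStore{data: map[string]string{}, sessions: map[string]session{}}
}

// Apply is invoked once for each committed log entry, in log order.
func (kv *KVStore) Apply(clientID string, serial uint64, key, value string) string {
	if s, ok := kv.sessions[clientID]; ok && serial == s.lastSerial {
		return s.lastReply // already executed: reply immediately, do not re-apply
	}
	kv.data[key] = value
	reply := "OK"
	kv.sessions[clientID] = session{lastSerial: serial, lastReply: reply}
	return reply
}
```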
Read-only operations can be handled without writing anything into the log. However, with no additional measures, this would run the risk of returning stale data, since the leader responding to the request might have been superseded by a newer leader of which it is unaware. Linearizable reads must not return stale data, and Raft needs two extra precautions to guarantee this without using the log. First, a leader must have the latest information on which entries are committed. The Leader Completeness Property guarantees that a leader has all committed entries, but at the start of its term, it may not know which those are. To find out, it needs to commit an entry from its term. Raft handles this by having each leader commit a blank no-op entry into the log at the start of its term. Second, a leader must check whether it has been deposed before processing a read-only request (its information may be stale if a more recent leader has been elected). Raft handles this by having the leader exchange heartbeat messages with a majority of the cluster before responding to read-only requests. Alternatively, the leader could rely on the heartbeat mechanism to provide a form of lease [9], but this would rely on timing for safety (it assumes bounded clock skew).
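The read path can be summarized in a short sketch (the interface is an assumption made for illustration): the leader first confirms it has committed an entry from its current term, records its commit index as the read index, confirms its leadership with a heartbeat round, and then serves the read once the state machine has applied up to that index.

```go
package raft

import "errors"

var (
	errNotReady  = errors.New("no entry from the current term committed yet")
	errNotLeader = errors.New("leadership not confirmed by a majority")
)

// leaderState abstracts the leader-side information the read path needs.
type leaderState interface {
	HasCommittedEntryFromCurrentTerm() bool // true once the start-of-term no-op commits
	CommitIndex() uint64
	ConfirmLeadershipWithHeartbeats() bool // exchanges heartbeats with a majority
}

// readIndex returns the log index whose application makes a read linearizable.
func readIndex(l leaderState) (uint64, error) {
	if !l.HasCommittedEntryFromCurrentTerm() {
		return 0, errNotReady
	}
	idx := l.CommitIndex()
	if !l.ConfirmLeadershipWithHeartbeats() {
		return 0, errNotLeader // a newer leader may exist; do not serve stale data
	}
	return idx, nil // serve the read once the state machine has applied idx
}
```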