翻译自MIT 6.824《Students' Guide to Raft》
For the past few months, I have been a Teaching Assistant for MIT’s 6.824 Distributed Systems class. The class has traditionally had a number of labs building on the Paxos consensus algorithm, but this year, we decided to make the move to Raft. Raft was “designed to be easy to understand”, and our hope was that the change might make the students’ lives easier.
在过去的几个月里,我一直是麻省理工学院6.824分布式系统课的教学助理。这门课一直有一些建立在Paxos共识算法上的实验,但今年,我们决定转而使用Raft。Raft是 "设计成易于理解的",我们希望这一改变能使学生的生活更轻松。
This post, and the accompanying Instructors’ Guide to Raft post, chronicles our journey with Raft, and will hopefully be useful to implementers of the Raft protocol and students trying to get a better understanding of Raft’s internals. If you are looking for a Paxos vs Raft comparison, or for a more pedagogical analysis of Raft, you should go read the Instructors’ Guide. The bottom of this post contains a list of questions commonly asked by 6.824 students, as well as answers to those questions. If you run into an issue that is not listed in the main content of this post, check out the Q&A. The post is quite long, but all the points it makes are real problems that a lot of 6.824 students (and TAs) ran into. It is a worthwhile read.
这篇文章,以及随之而来的《Raft讲师指南》,记录了我们使用Raft的历程,希望对Raft协议的实施者和试图更好地了解Raft内部的学生有所帮助。如果你正在寻找Paxos与Raft的比较,或者想对Raft进行更多的教学分析,你应该去阅读《讲师指南》。这篇文章的底部包含了6.824学生常问的问题列表,以及这些问题的答案。如果你遇到的问题没有在本帖的主要内容中列出,请查看问答。这篇文章相当长,但它提出的所有观点都是很多6.824学生(和助教)遇到的真实问题。这是一篇值得一读的文章。
Background
Before we dive into Raft, some context may be useful. 6.824 used to have a set of Paxos-based labs that were built in Go; Go was chosen both because it is easy to learn for students, and because is pretty well-suited for writing concurrent, distributed applications (goroutines come in particularly handy). Over the course of four labs, students build a fault-tolerant, sharded key-value store. The first lab had them build a consensus-based log library, the second added a key value store on top of that, and the third sharded the key space among multiple fault-tolerant clusters, with a fault-tolerant shard master handling configuration changes. We also had a fourth lab in which the students had to handle the failure and recovery of machines, both with and without their disks intact. This lab was available as a default final project for students.
在我们深入研究Raft之前,先说一些有用的背景知识。6.824曾经有一套以Paxos为基础的实验,是用Go来构建的;选择Go是因为它对学生来说很容易学习,也因为它很适合编写并发的分布式应用(goroutines特别方便)。在四个实验的过程中,学生们建立了一个容错的、分片的键值存储。第一个实验室让他们建立了一个基于共识的日志库,第二个实验室在此基础上增加了一个键值存储,第三个实验室在多个容错集群中分片存储键值空间,由一个容错分片主站处理配置变化。我们还有第四个实验室,学生们必须处理机器的故障和恢复问题,包括磁盘完好无损的情况。这个实验室可以作为学生的默认期末项目。
This year, we decided to rewrite all these labs using Raft. The first three labs were all the same, but the fourth lab was dropped as persistence and failure recovery is already built into Raft. This article will mainly discuss our experiences with the first lab, as it is the one most directly related to Raft, though I will also touch on building applications on top of Raft (as in the second lab).
今年,我们决定使用Raft重写所有这些实验室。前三个实验室都是一样的,但第四个实验室被放弃了,因为持久性和故障恢复已经内置于Raft中。本文将主要讨论我们在第一个实验中的经验,因为它是与Raft最直接相关的实验,尽管我也会涉及到在Raft之上构建应用程序(如第二个实验)。
Raft, for those of you who are just getting to know it, is best described by the text on the protocol’s web site:
对于那些刚刚了解Raft的人来说,该协议网站上的文字对它的描述是最好的:
Raft is a consensus algorithm that is designed to be easy to understand. It’s equivalent to Paxos in fault-tolerance and performance. The difference is that it’s decomposed into relatively independent subproblems, and it cleanly addresses all major pieces needed for practical systems. We hope Raft will make consensus available to a wider audience, and that this wider audience will be able to develop a variety of higher quality consensus-based systems than are available today.Raft是一种共识算法,设计得很容易理解。它在容错和性能方面与Paxos相当。不同的是,它被分解成相对独立的子问题,并干净地解决了实际系统所需的所有主要部分。我们希望Raft能让更多的人了解共识,而这些人将能够开发出比现在更高质量的基于共识的系统。 |
---|
Visualizations like this one give a good overview of the principal components of the protocol, and the paper gives good intuition for why the various pieces are needed. If you haven’t already read the extended Raft paper, you should go read that before continuing this article, as I will assume a decent familiarity with Raft.
像这样的可视化,对协议的主要组成部分有一个很好的概述,而该论文对为什么需要这些不同的部分给出了很好的直觉。如果你还没有读过扩展的Raft论文,你应该在继续这篇文章之前先去读一下,因为我将假设你对Raft有一定的熟悉程度。
As with all distributed consensus protocols, the devil is very much in the details. In the steady state where there are no failures, Raft’s behavior is easy to understand, and can be explained in an intuitive manner. For example, it is simple to see from the visualizations that, assuming no failures, a leader will eventually be elected, and eventually all operations sent to the leader will be applied by the followers in the right order. However, when delayed messages, network partitions, and failed servers are introduced, each and every if, but, and and, become crucial. In particular, there are a number of bugs that we see repeated over and over again, simply due to misunderstandings or oversights when reading the paper. This problem is not unique to Raft, and is one that comes up in all complex distributed systems that provide correctness.
就像所有的分布式共识协议一样,魔鬼在细节上。在没有故障的稳定状态下,Raft的行为很容易理解,并且可以用直观的方式解释。例如,从可视化中可以简单地看到,假设没有故障,最终会选出一个领导者,最终所有发给领导者的操作都会被跟随者按照正确的顺序应用。然而,当延迟信息、网络分区和失败的服务器被引入时,每一个if、but、and,都变得至关重要。特别是,我们看到有一些bug反复出现,仅仅是因为阅读论文时的误解或疏忽。这个问题不是Raft独有的,是所有提供正确性的复杂分布式系统都会出现的问题。
Implementing Raft
The ultimate guide to Raft is in Figure 2 of the Raft paper. This figure specifies the behavior of every RPC exchanged between Raft servers, gives various invariants that servers must maintain, and specifies when certain actions should occur. We will be talking about Figure 2 a lot in the rest of this article. It needs to be followed to the letter.
Raft的最终指南在Raft论文的图2中。该图规定了Raft服务器之间交换的每个RPC的行为,给出了服务器必须保持的各种不变性,并规定了某些动作应该在什么时候发生。在本文的其余部分,我们将经常讨论图2。它需要被严格遵守。
Figure 2 defines what every server should do, in ever state, for every incoming RPC, as well as when certain other things should happen (such as when it is safe to apply an entry in the log). At first, you might be tempted to treat Figure 2 as sort of an informal guide; you read it once, and then start coding up an implementation that follows roughly what it says to do. Doing this, you will quickly get up and running with a mostly working Raft implementation. And then the problems start.
图2定义了每个服务器在任何状态下对每个传入的RPC应该做什么,以及某些其他事情应该在什么时候发生(比如什么时候在日志中应用一个条目是安全的)。一开始,你可能会想把图2当作一种非正式的指南;你读一遍,然后开始编写一个实现,大致按照它说的去做。这样做,你很快就会有一个基本可行的Raft实现。然后,问题就开始了。
In fact, Figure 2 is extremely precise, and every single statement it makes should be treated, in specification terms, as MUST, not as SHOULD. For example, you might reasonably reset a peer’s election timer whenever you receive an AppendEntries or RequestVote RPC, as both indicate that some other peer either thinks it’s the leader, or is trying to become the leader. Intuitively, this means that we shouldn’t be interfering. However, if you read Figure 2 carefully, it says:
事实上,图2是非常精确的,在规范方面,它的每一条语句都应该被视为必须的,而不是应该的。例如,每当你收到AppendEntries或RequestVote RPC时,你可能会合理地重置一个对等体的选举计时器,因为这两者都表明其他对等体要么认为自己是领导者,要么正试图成为领导者。直观地说,这意味着我们不应该干预。然而,如果你仔细阅读图2,它说:
If election timeout elapses without receiving AppendEntries RPC from current leader or granting vote to candidate: convert to candidate.如果选举超时,没有收到现任领导的AppendEntries RPC,也没有向候选人授予投票权:转换为候选人。 |
---|
The distinction turns out to matter a lot, as the former implementation can result in significantly reduced liveness in certain situations.
这个区别原来是很重要的,因为前者的实现在某些情况下会导致显著降低的有效性。
The importance of details
To make the discussion more concrete, let us consider an example that tripped up a number of 6.824 students. The Raft paper mentions heartbeat RPCs in a number of places. Specifically, a leader will occasionally (at least once per heartbeat interval) send out an AppendEntries RPC to all peers to prevent them from starting a new election. If the leader has no new entries to send to a particular peer, the AppendEntries RPC contains no entries, and is considered a heartbeat.
为了使讨论更加具体,让我们考虑一个绊住了一些6.824学生的例子。Raft的论文在很多地方提到了心跳RPC。具体来说,领导者会偶尔(每个心跳间隔至少一次)向所有对等体发出AppendEntries RPC,以防止它们开始新的选举。如果领导者没有新的条目要发送给某个特定的对等体,那么AppendEntries RPC就不包含任何条目,并被认为是一次心跳。
Many of our students assumed that heartbeats were somehow “special”; that when a peer receives a heartbeat, it should treat it differently from a non-heartbeat AppendEntries RPC. In particular, many would simply reset their election timer when they received a heartbeat, and then return success, without performing any of the checks specified in Figure 2. This is extremely dangerous. By accepting the RPC, the follower is implicitly telling the leader that their log matches the leader’s log up to and including the prevLogIndex included in the AppendEntries arguments. Upon receiving the reply, the leader might then decide (incorrectly) that some entry has been replicated to a majority of servers, and start committing it.
我们的许多学生认为,心跳在某种程度上是 "特殊 "的;当一个对等体收到心跳时,它应该与非心跳的AppendEntries RPC区别对待。特别是,许多人在收到心跳时,会简单地重置他们的选举计时器,然后返回成功,而不执行图2中规定的任何检查。这是很危险的。通过接受RPC,跟随者隐含地告诉领导者,他们的日志与领导者的日志相匹配,包括AppendEntries参数中的prevLogIndex。在收到回复后,领导者可能会(错误地)认为某些条目已经被复制到大多数服务器上,并开始提交它。
Another issue many had (often immediately after fixing the issue above), was that, upon receiving a heartbeat, they would truncate the follower’s log following prevLogIndex, and then append any entries included in the AppendEntries arguments. This is also not correct. We can once again turn to Figure 2:
许多人遇到的另一个问题(通常是在修复了上述问题之后)是,在收到心跳时,他们会在prevLogIndex之后截断跟随者的日志,然后追加AppendEntries参数中包括的任何条目。这也是不正确的。我们可以再一次转向图2:
If an existing entry conflicts with a new one (same index but different terms), delete the existing entry and all that follow it.如果一个现有的条目与一个新的条目相冲突(相同的索引,但不同的任期),请删除现有的条目和后面所有的条目。 |
---|
The if here is crucial. If the follower has all the entries the leader sent, the follower MUST NOT truncate its log. Any elements following the entries sent by the leader MUST be kept. This is because we could be receiving an outdated AppendEntries RPC from the leader, and truncating the log would mean “taking back” entries that we may have already told the leader that we have in our log.
这里的if很关键。如果跟随者拥有领导者发送的所有条目,跟随者必须不截断其日志。任何跟随领导者发送的条目的元素都必须被保留。这是因为我们可能从领导者那里收到了一个过时的AppendEntries RPC,而截断日志将意味着 "收回 "我们可能已经告诉领导者我们的日志中的条目。
Debugging Raft
Inevitably, the first iteration of your Raft implementation will be buggy. So will the second. And third. And fourth. In general, each one will be less buggy than the previous one, and, from experience, most of your bugs will be a result of not faithfully following Figure 2.
不可避免的是,你的Raft实现的第一次迭代将是错误的。第二次也会如此。第三次。还有第四次。一般来说,每一次的错误都会比前一次少,而且,根据经验,你的大多数错误都是没有忠实于图2的结果。
When debugging, Raft, there are generally four main sources of bugs: livelocks, incorrect or incomplete RPC handlers, failure to follow The Rules, and term confusion. Deadlocks are also a common problem, but they can generally be debugged by logging all your locks and unlocks, and figuring out which locks you are taking, but not releasing. Let us consider each of these in turn:
在调试时,Raft通常有四个主要的bug来源:死锁、不正确或不完整的RPC处理程序、没有遵循规则以及任期混乱。死锁也是一个常见的问题,但它们通常可以通过记录你所有的锁和解锁来调试,并找出你正在使用但未释放的锁。让我们依次考虑这些问题:
Livelocks
When your system livelocks, every node in your system is doing something, but collectively your nodes are in such a state that no progress is being made. This can happen fairly easily in Raft, especially if you do not follow Figure 2 religiously. One livelock scenario comes up especially often; no leader is being elected, or once a leader is elected, some other node starts an election, forcing the recently elected leader to abdicate immediately.
当你的系统活锁时,系统中的每个节点都在做一些事情,但整体上节点处于这样一种没有任何进展状态。这种情况在Raft中很容易发生,尤其是当你没有虔诚地遵循图2时。有一种活锁情况出现得特别频繁;没有选出领袖,或者一旦选出了领袖,其他节点又开始选举,迫使最近选出的领袖立即退位。
There are many reasons why this scenario may come up, but there is a handful of mistakes that we have seen numerous students make:
这种情况出现的原因有很多,但有几个错误是我们看到许多学生犯的:
- Make sure you reset your election timer exactly when Figure 2 says you should. Specifically, you should only restart your election timer if a) you get an AppendEntries RPC from the current leader (i.e., if the term in the AppendEntries arguments is outdated, you should not reset your timer); b) you are starting an election; or c) you grant a vote to another peer.
确保你在图2描述的地方准确地重置你的选举计时器。具体来说,你只应该在以下情况下重置你的选举计时器:a)你从当前的领导者那里得到一个AppendEntries RPC(即,如果AppendEntries参数中的任期已经过时,你不应该重启你的计时器);b)你正在开始一个选举;或者c)你授予另一个对等体一个投票。
This last case is especially important in unreliable networks where it is likely that followers have different logs; in those situations, you will often end up with only a small number of servers that a majority of servers are willing to vote for. If you reset the election timer whenever someone asks you to vote for them, this makes it equally likely for a server with an outdated log to step forward as for a server with a longer log.
最后一种情况在不可靠的网络中特别重要,因为追随者很可能有不同的日志;在这些情况下,你最终往往只有少数服务器,而大多数服务器都愿意为其投票。如果你每当有人要求你为他投票时就重置选举计时器,这就使得一个有过时日志的服务器和一个有较长日志的服务器同样有可能站出来。
In fact, because there are so few servers with sufficiently up-to-date logs, those servers are quite unlikely to be able to hold an election in sufficient peace to be elected. If you follow the rule from Figure 2, the servers with the more up-to-date logs won’t be interrupted by outdated servers’ elections, and so are more likely to complete the election and become the leader.
事实上,由于具有足够最新的日志的服务器太少,这些服务器很不可能在足够平静的情况下举行选举而当选。如果按照图2的规则,拥有较多最新日志的服务器不会被过时的服务器的选举打断,所以更有可能完成选举,成为领导者。
- Follow Figure 2’s directions as to when you should start an election. In particular, note that if you are a candidate (i.e., you are currently running an election), but the election timer fires, you should start another election. This is important to avoid the system stalling due to delayed or dropped RPCs.
你应该按照图2的指示开始选举。特别要注意的是,如果你是一个候选人(即,你目前正在进行选举),但选举计时器启动了,你应该开始另一个选举。这一点很重要,可以避免系统因RPC的延迟或丢包而停滞不前。
- Ensure that you follow the second rule in “Rules for Servers” before handling an incoming RPC. The second rule states:
在处理传入的RPC之前,请确保你遵循 "服务器规则 "中的第二条规则。第二条规则指出:
If RPC request or response contains term T > currentTerm: set currentTerm = T, convert to follower (§5.1)如果RPC请求或响应包含任期 T > currentTerm:设置currentTerm = T,转换为follower(§5.1)。 |
---|
For example, if you have already voted in the current term, and an incoming RequestVote RPC has a higher term that you, you should first step down and adopt their term (thereby resetting votedFor), and then handle the RPC, which will result in you granting the vote!
例如,如果你已经在当前任期内投票,而传入的RequestVote RPC的任期比你高,你应该首先下台,采用他们的任期(从而重新设置 votedFor),然后处理RPC,这将导致你授予投票权。
Incorrect RPC handlers
Even though Figure 2 spells out exactly what each RPC handler should do, some subtleties are still easy to miss. Here are a handful that we kept seeing over and over again, and that you should keep an eye out for in your implementation:
尽管图2清楚地说明了每个RPC处理程序应该做什么,但一些细微之处仍然容易被忽略。以下是我们反复看到的一些情况,你应该在你的实现中注意这些情况:
- If a step says “reply false”, this means you should reply immediately, and not perform any of the subsequent steps.
如果一个步骤说 "回复错误",这意味着你应该立即回复,而不是执行任何后续的步骤。
- If you get an AppendEntries RPC with a prevLogIndex that points beyond the end of your log, you should handle it the same as if you did have that entry but the term did not match (i.e., reply false).
如果你收到一个AppendEntries RPC,其prevLogIndex指向你的日志的末尾,你应该像你确实有那个条目但任期不匹配那样处理它(即,回复false)。
- Check 2 for the AppendEntries RPC handler should be executed even if the leader didn’t send any entries.
AppendEntries RPC处理程序的检查2应该被执行,即使领导者没有发送任何条目。
- The min in the final step (#5) of AppendEntries is necessary, and it needs to be computed with the index of the last new entry. It is not sufficient to simply have the function that applies things from your log between lastApplied and commitIndex stop when it reaches the end of your log. This is because you may have entries in your log that differ from the leader’s log after the entries that the leader sent you (which all match the ones in your log). Because #3 dictates that you only truncate your log if you have conflicting entries, those won’t be removed, and if leaderCommit is beyond the entries the leader sent you, you may apply incorrect entries.
AppendEntries的最后一步(#5)中的min是必要的,它需要用最后一个新条目的索引来计算。仅仅让应用lastApplied和commitIndex之间的日志内容的函数在到达日志的末尾时停止是不够的。这是因为你的日志中可能会有与领导者的日志不同的条目,在领导者发给你的条目之后(这些条目都与你的日志中的条目一致)。因为#3决定了你只有在有冲突的条目时才会截断你的日志,那些条目不会被删除,如果leaderCommit超出了领导发给你的条目,你就可能应用不正确的条目。
- It is important to implement the “up-to-date log” check exactly as described in section 5.4. No cheating and just checking the length!
重要的是,要完全按照第5.4节中的描述来实现 "最新日志 "的检查。不要作弊,只检查长度!。
Failure to follow The Rules
While the Raft paper is very explicit about how to implement each RPC handler, it also leaves the implementation of a number of rules and invariants unspecified. These are listed in the “Rules for Servers” block on the right hand side of Figure 2. While some of them are fairly self-explanatory, the are also some that require designing your application very carefully so that it does not violate The Rules:
虽然Raft论文非常明确地说明了如何实现每个RPC处理程序,但它也没有明确规定一些规则和不变因素的实现。这些规则列在图2右侧的 "服务器规则 "部分。虽然其中有些是不言自明的,但也有一些需要非常仔细地设计你的应用程序,使其不违反规则:
- If commitIndex > lastApplied at any point during execution, you should apply a particular log entry. It is not crucial that you do it straight away (for example, in the AppendEntries RPC handler), but it is important that you ensure that this application is only done by one entity. Specifically, you will need to either have a dedicated “applier”, or to lock around these applies, so that some other routine doesn’t also detect that entries need to be applied and also tries to apply.
如果在执行过程中的任何时候commitIndex > lastApplied,你应该应用一个特定的日志条目。你是否直接这样做并不关键(例如,在AppendEntries RPC处理程序中),但重要的是你要确保这种应用只由一个实体完成。具体来说,你需要有一个专门的 "applier",或者围绕这些应用进行锁定,这样其他的例程就不会检测到需要应用的条目,并且也试图进行应用。
- Make sure that you check for commitIndex > lastApplied either periodically, or after commitIndex is updated (i.e., after matchIndex is updated). For example, if you check commitIndex at the same time as sending out AppendEntries to peers, you may have to wait until the next entry is appended to the log before applying the entry you just sent out and got acknowledged.
确保定期检查 commitIndex > lastApplied,或者在 commitIndex 更新后(即 matchIndex 更新后)检查。例如,如果你在检查commitIndex的同时向对等体发送AppendEntries,你可能不得不等到下一个条目被追加到日志中,然后再应用你刚刚发送并得到确认的条目。
- If a leader sends out an AppendEntries RPC, and it is rejected, but not because of log inconsistency (this can only happen if our term has passed), then you should immediately step down, and not update nextIndex. If you do, you could race with the resetting of nextIndex if you are re-elected immediately.
如果一个领导者发出了AppendEntries RPC,并且它被拒绝了,但不是因为日志不一致(这只能发生在我们的任期已过的情况下),那么你应该立即下台,而不是更新nextIndex。如果你这样做了,那么如果你立即重新当选,你可能会与nextIndex的重设发生冲突。
- A leader is not allowed to update commitIndex to somewhere in a previous term (or, for that matter, a future term). Thus, as the rule says, you specifically need to check that log[N].term == currentTerm. This is because Raft leaders cannot be sure an entry is actually committed (and will not ever be changed in the future) if it’s not from their current term. This is illustrated by Figure 8 in the paper.
领导者不允许将commitIndex更新到前一个任期的某个地方(或者,未来的任期)。因此,正如规则所说,你需要特别检查log[N].term == currentTerm。这是因为,如果一个条目不是来自他们当前的任期,Raft领导就不能确定它真的被提交了(并且在未来不会被改变)。论文中的图8说明了这一点。
One common source of confusion is the difference between nextIndex and matchIndex. In particular, you may observe that matchIndex = nextIndex - 1, and simply not implement matchIndex. This is not safe. While nextIndex and matchIndex are generally updated at the same time to a similar value (specifically, nextIndex = matchIndex 1), the two serve quite different purposes. nextIndex is a guess as to what prefix the leader shares with a given follower. It is generally quite optimistic (we share everything), and is moved backwards only on negative responses. For example, when a leader has just been elected, nextIndex is set to be index index at the end of the log. In a way, nextIndex is used for performance – you only need to send these things to this peer.
一个常见的问题来源是nextIndex和matchIndex之间的区别。特别是,你可能会观察到matchIndex = nextIndex - 1,而干脆不实现matchIndex。这是不安全的。虽然nextIndex和matchIndex通常在同一时间被更新为类似的值(具体来说,nextIndex = matchIndex 1),但两者的作用完全不同。它通常是相当乐观的(我们分享一切),并且只在消极的反应中向后移动。例如,当一个领导者刚刚当选时,nextIndex被设置为日志末尾的索引指数。在某种程度上,nextIndex是用于性能的--你只需要将这些东西发送给这个对等体。
matchIndex is used for safety. It is a conservative measurement of what prefix of the log the leader shares with a given follower. matchIndex cannot ever be set to a value that is too high, as this may cause the commitIndex to be moved too far forward. This is why matchIndex is initialized to -1 (i.e., we agree on no prefix), and only updated when a follower positively acknowledges an AppendEntries RPC.
matchIndex是用于安全的。MatchIndex不能被设置为一个太高的值,因为这可能会导致commitIndex被向前移动得太远。这就是为什么matchIndex被初始化为-1(也就是说,我们不同意任何前缀),并且只在跟随者肯定地确认AppendEntries RPC时才更新。
Term confusion
Term confusion refers to servers getting confused by RPCs that come from old terms. In general, this is not a problem when receiving an RPC, since the rules in Figure 2 say exactly what you should do when you see an old term. However, Figure 2 generally doesn’t discuss what you should do when you get old RPC replies. From experience, we have found that by far the simplest thing to do is to first record the term in the reply (it may be higher than your current term), and then to compare the current term with the term you sent in your original RPC. If the two are different, drop the reply and return. Only if the two terms are the same should you continue processing the reply. There may be further optimizations you can do here with some clever protocol reasoning, but this approach seems to work well. And not doing it leads down a long, winding path of blood, sweat, tears and despair.
任期混淆是指服务器被来自旧任期的RPC所迷惑。一般来说,在收到RPC时,这不是一个问题,因为图2中的规则确切地说明了当你看到一个旧任期时你应该做什么。然而,图2一般没有讨论当你收到旧的RPC回复时你应该做什么。根据经验,我们发现到目前为止,最简单的做法是首先记录回复中的任期(它可能比你当前的任期高),然后将当前任期与你在原始RPC中发送的任期进行比较。如果两者不同,就放弃回复并返回。只有当这两个任期相同时,你才应该继续处理回复。也许你可以通过一些巧妙的协议推理在这里做进一步的优化,但这种方法似乎很有效。而不这样做会导致一条漫长而曲折的血汗、泪水和绝望的道路。
A related, but not identical problem is that of assuming that your state has not changed between when you sent the RPC, and when you received the reply. A good example of this is setting matchIndex = nextIndex - 1, or matchIndex = len(log) when you receive a response to an RPC. This is not safe, because both of those values could have been updated since when you sent the RPC. Instead, the correct thing to do is update matchIndex to be prevLogIndex len(entries[]) from the arguments you sent in the RPC originally.
一个相关但不完全相同的问题是,预设你的状态在你发送RPC和你收到回复之间没有变化。这方面的一个很好的例子是,当你收到RPC的响应时,设置matchIndex = nextIndex - 1,或者matchIndex = len(log)。这并不安全,因为这两个值都可能在你发送RPC后被更新。相反,正确的做法是将 matchIndex 更新为你最初在 RPC 中发送的参数中 prevLogIndex len( entries[]) 。
An aside on optimizations
The Raft paper includes a couple of optional features of interest. In 6.824, we require the students to implement two of them: log compaction (section 7) and accelerated log backtracking (top left hand side of page 8). The former is necessary to avoid the log growing without bound, and the latter is useful for bringing stale followers up to date quickly.
Raft论文中包括几个可选功能。在6.824中,我们要求学生实现其中的两个:日志压缩(第7节)和加速日志回溯(第8页的左上方)。前者对于避免日志无限制地增长是必要的,而后者对于使陈旧的追随者快速更新是有用的。
These features are not a part of “core Raft”, and so do not receive as much attention in the paper as the main consensus protocol. Log compaction is covered fairly thoroughly (in Figure 13), but leaves out some design details that you might miss if you read it too casually:
这些功能不是 "核心Raft "的一部分,因此在本文中没有得到像主要共识协议那样的关注。日志压缩的内容相当全面(在图13中),但遗漏了一些设计细节,如果你太随意地阅读,可能会错过:
- When snapshotting application state, you need to make sure that the application state corresponds to the state following some known index in the Raft log. This means that the application either needs to communicate to Raft what index the snapshot corresponds to, or that Raft needs to delay applying additional log entries until the snapshot has been completed.
当快照应用程序状态时,你需要确保应用程序状态与Raft日志中某个已知索引之后的状态相对应。这意味着应用程序需要向Raft传达快照所对应的索引,或者Raft需要延迟应用额外的日志条目,直到快照完成。
- The text does not discuss the recovery protocol for when a server crashes and comes back up now that snapshots are involved. In particular, if Raft state and snapshots are committed separately, a server could crash between persisting a snapshot and persisting the updated Raft state. This is a problem, because step 7 in Figure 13 dictates that the Raft log covered by the snapshot must be discarded.
文中没有讨论当服务器崩溃并恢复时的恢复协议,因为现在涉及到快照。特别是,如果Raft状态和快照是分开提交的,服务器可能会在坚持快照和坚持更新的Raft状态之间崩溃。这是一个问题,因为图13中的第7步决定了快照所覆盖的Raft日志必须被丢弃。
If, when the server comes back up, it reads the updated snapshot, but the outdated log, it may end up applying some log entries that are already contained within the snapshot. This happens since the commitIndex and lastApplied are not persisted, and so Raft doesn’t know that those log entries have already been applied. The fix for this is to introduce a piece of persistent state to Raft that records what “real” index the first entry in Raft’s persisted log corresponds to. This can then be compared to the loaded snapshot’s lastIncludedIndex to determine what elements at the head of the log to discard.
如果当服务器重新启动时,它读取的是更新的快照,而不是过时的日志,那么它最终可能会应用一些已经包含在快照中的日志条目。这种情况会发生,因为commitIndex和lastApplied没有被持久化,所以Raft不知道这些日志条目已经被应用。解决这个问题的方法是在Raft中引入一个持久化状态,记录Raft持久化日志中的第一个条目所对应的 "真实 "索引。然后,这可以与加载的快照的lastIncludedIndex进行比较,以确定要丢弃日志头部的哪些元素。
The accelerated log backtracking optimization is very underspecified, probably because the authors do not see it as being necessary for most deployments. It is not clear from the text exactly how the conflicting index and term sent back from the client should be used by the leader to determine what nextIndex to use. We believe the protocol the authors probably want you to follow is:
加速日志回溯的优化是非常不明确的,可能是因为作者认为在大多数部署中没有必要。文本中并没有明确指出,领导者应该如何使用从客户端发回的冲突索引和任期来决定使用哪一个NextIndex。我们认为作者可能希望你遵循的协议是:
- If a follower does not have prevLogIndex in its log, it should return with conflictIndex = len(log) and conflictTerm = None.
如果一个跟随者的日志中没有prevLogIndex,它应该以conflictIndex = len(log)和conflictTerm = None返回。
- If a follower does have prevLogIndex in its log, but the term does not match, it should return conflictTerm = log[prevLogIndex].Term, and then search its log for the first index whose entry has term equal to conflictTerm.
如果一个跟随者在其日志中确实有prevLogIndex,但是任期不匹配,它应该返回conflictTerm = log[prevLogIndex].Term,然后在其日志中搜索其条目中任期等于conflictTerm的第一个索引。
- Upon receiving a conflict response, the leader should first search its log for conflictTerm. If it finds an entry in its log with that term, it should set nextIndex to be the one beyond the index of the last entry in that term in its log.
在收到冲突响应后,领导者应该首先搜索其日志中的conflictTerm。如果它在它的日志中找到一个具有该任期的条目,它应该将nextIndex设置为其日志中该任期的最后一个条目的索引之外的那个索引。
- If it does not find an entry with that term, it should set nextIndex = conflictIndex.
如果它没有找到该任期的条目,它应该设置 nextIndex = conflictIndex。
A half-way solution is to just use conflictIndex (and ignore conflictTerm), which simplifies the implementation, but then the leader will sometimes end up sending more log entries to the follower than is strictly necessary to bring them up to date.
一个半途而废的解决方案是只使用conflictIndex(而忽略conflictTerm),这简化了实现,但这样一来,领导者有时会向跟随者发送更多的日志条目,而不是严格意义上所需要的,以使他们达到最新状态。
Applications on top of Raft
When building a service on top of Raft (such as the key/value store in the second 6.824 Raft lab, the interaction between the service and the Raft log can be tricky to get right. This section details some aspects of the development process that you may find useful when building your application.
在Raft之上构建服务时(如第二个6.824 Raft实验室中的k/v存储),服务与Raft日志之间的互动可能会很棘手,需要正确处理。本节详细介绍了开发过程中的一些方面,你可能会发现在构建你的应用程序时,这些方面是有用的。
Applying client operations
You may be confused about how you would even implement an application in terms of a replicated log. You might start off by having your service, whenever it receives a client request, send that request to the leader, wait for Raft to apply something, do the operation the client asked for, and then return to the client. While this would be fine in a single-client system, it does not work for concurrent clients.
你可能对如何在复制的日志方面实现一个应用程序感到困惑。你可以先让你的服务在收到客户端请求时,将该请求发送给领导者,等待Raft应用一些东西,做客户端要求的操作,然后再返回给客户端。虽然这在单客户系统中很好,但对于并发的客户来说,这并不可行。
Instead, the service should be constructed as a state machine where client operations transition the machine from one state to another. You should have a loop somewhere that takes one client operation at the time (in the same order on all servers – this is where Raft comes in), and applies each one to the state machine in order. This loop should be the only part of your code that touches the application state (the key/value mapping in 6.824). This means that your client-facing RPC methods should simply submit the client’s operation to Raft, and then wait for that operation to be applied by this “applier loop”. Only when the client’s command comes up should it be executed, and any return values read out. Note that this includes read requests!
相反,服务应该被构建为一个状态机,客户端操作将机器从一个状态过渡到另一个状态。你应该在某个地方有一个循环,每次接收一个客户端操作(在所有服务器上的顺序是一样的--这就是Raft的作用),并将每个操作按顺序应用到状态机中。这个循环应该是你的代码中唯一接触到应用程序状态的部分(6.824中的k/v映射)。这意味着你面向客户端的RPC方法应该简单地将客户端的操作提交给Raft,然后等待该操作被这个 "应用者循环 "应用。只有当客户端的命令出现时,它才应该被执行,并读出任何返回值。请注意,这包括读取请求!
This brings up another question: how do you know when a client operation has completed? In the case of no failures, this is simple – you just wait for the thing you put into the log to come back out (i.e., be passed to apply()). When that happens, you return the result to the client. However, what happens if there are failures? For example, you may have been the leader when the client initially contacted you, but someone else has since been elected, and the client request you put in the log has been discarded. Clearly you need to have the client try again, but how do you know when to tell them about the error?
这就带来了另一个问题:你怎么知道客户端操作何时完成?在没有失败的情况下,这很简单--你只需等待你放进日志的东西再出来(即被传递给apply())。当这种情况发生时,你将结果返回给客户端。然而,如果有失败会怎样?例如,当客户端最初与你联系时,你可能是领导者,但后来有其他人当选了,而你放在日志中的客户端请求被丢弃了。很明显,你需要让客户端再试一次,但你怎么知道什么时候告诉他们这个错误?
One simple way to solve this problem is to record where in the Raft log the client’s operation appears when you insert it. Once the operation at that index is sent to apply(), you can tell whether or not the client’s operation succeeded based on whether the operation that came up for that index is in fact the one you put there. If it isn’t, a failure has happened and an error can be returned to the client.
解决这个问题的一个简单方法是,当你插入客户端的操作时,记录它在Raft日志中出现的位置。一旦该索引的操作被发送到apply(),你就可以根据该索引出现的操作是否真的是你放在那里的操作来判断客户端的操作是否成功。如果不是,就发生了失败,可以向客户端返回一个错误。
Duplicate detection
As soon as you have clients retry operations in the face of errors, you need some kind of duplicate detection scheme – if a client sends an APPEND to your server, doesn’t hear back, and re-sends it to the next server, your apply() function needs to ensure that the APPEND isn’t executed twice. To do so, you need some kind of unique identifier for each client request, so that you can recognize if you have seen, and more importantly, applied, a particular operation in the past. Furthermore, this state needs to be a part of your state machine so that all your Raft servers eliminate the same duplicates.
一旦你让客户端在发生错误时重试操作,你就需要某种重复检测方案--如果一个客户端向你的服务器发送了一个APPEND,但没有得到回音,并将其重新发送到下一个服务器,你的apply()函数需要确保APPEND不会被执行两次。要做到这一点,你需要为每个客户端的请求提供某种唯一的标识符,这样你就能识别出你是否在过去看到过,更重要的是,应用过某个特定的操作。此外,这种状态需要成为你的状态机的一部分,以便你的所有Raft服务器都能消除相同的重复内容。
There are many ways of assigning such identifiers. One simple and fairly efficient one is to give each client a unique identifier, and then have them tag each request with a monotonically increasing sequence number. If a client re-sends a request, it re-uses the same sequence number. Your server keeps track of the latest sequence number it has seen for each client, and simply ignores any operation that it has already seen.
分配这种标识符的方法有很多。一个简单且相当有效的方法是给每个客户端一个唯一的标识符,然后让他们用单调增加的序列号来标记每个请求。如果一个客户端重新发送一个请求,它就重新使用同一个序列号。你的服务器会跟踪它为每个客户端看到的最新序列号,并简单地忽略它已经看到的任何操作。
Hairy corner-cases
If your implementation follows the general outline given above, there are at least two subtle issues you are likely to run into that may be hard to identify without some serious debugging. To save you some time, here they are:
如果你的实现遵循上面给出的一般大纲,至少有两个微妙的问题你可能会遇到,如果不进行认真的调试,可能很难识别。为了节省你的时间,这里有两个问题:
Re-appearing indices:
Say that your Raft library has some method Start() that takes a command, and return the index at which that command was placed in the log (so that you know when to return to the client, as discussed above). You might assume that you will never see Start() return the same index twice, or at the very least, that if you see the same index again, the command that first returned that index must have failed. It turns out that neither of these things are true, even if no servers crash.
重新出现的索引。假设你的Raft库有一些Start()方法,它接收一个命令,并返回该命令在日志中的索引(以便你知道何时返回客户端,如上所述)。你可能会认为,你永远不会看到Start()两次返回相同的索引,或者至少,如果你再次看到相同的索引,第一次返回该索引的命令一定是失败了。事实证明,这两件事都不是真的,即使没有服务器崩溃。
Consider the following scenario with five servers, S1 through S5. Initially, S1 is the leader, and its log is empty.
考虑以下有五个服务器的情景,S1到S5。最初,S1是领导者,它的日志是空的。
- Two client operations (C1 and C2) arrive on S1
两个客户端操作(C1和C2)到达S1上
- Start() return 1 for C1, and 2 for C2.
Start()对C1返回1,对C2返回2。
- S1 sends out an AppendEntries to S2 containing C1 and C2, but all its other messages are lost.
S1向S2发送了一个包含C1和C2的AppendEntries,但它的所有其他消息都丢失了。
- S3 steps forward as a candidate.
S3作为候选人站了出来。
- S1 and S2 won’t vote for S3, but S3, S4, and S5 all will, so S3 becomes the leader.
S1和S2不会投票给S3,但S3、S4和S5都会,所以S3成为领导者。
- Another client request, C3 comes in to S3.
另一个客户端请求,C3来到了S3。
- S3 calls Start() (which returns 1)
S3调用Start()(返回1)。
- S3 sends an AppendEntries to S1, who discards C1 and C2 from its log, and adds C3.
S3向S1发送了一个AppendEntries,S1从它的日志中丢弃了C1和C2,并添加了C3。
- S3 fails before sending AppendEntries to any other servers.
S3在向其他服务器发送AppendEntries之前就失败了。
- S1 steps forward, and because its log is up-to-date, it is elected leader.
S1站了出来,由于它的日志是最新的,它被选为领导者。
- Another client request, C4, arrives at S1
另一个客户端请求,C4,到达了S1
- S1 calls Start(), which returns 2 (which was also returned for Start(C2).
S1调用Start(),返回2(Start(C2)也返回了这个值)。
- All of S1’s AppendEntries are dropped, and S2 steps forward.
S1的所有AppendEntries都被丢弃,S2站了出来成为候选者。
- S1 and S3 won’t vote for S2, but S2, S4, and S5 all will, so S2 becomes leader.
S1和S3不会为S2投票,但S2、S4和S5都会,所以S2成为领导者。
- A client request C5 comes in to S2
一个客户端请求C5进入S2
- S2 calls Start(), which returns 3.
S2调用Start(),返回3。
- S2 successfully sends AppendEntries to all the servers, which S2 reports back to the servers by including an updated leaderCommit = 3 in the next heartbeat.
S2成功地向所有服务器发送了AppendEntries,S2通过在下一次心跳中加入更新的leaderCommit=3来向服务器报告。
Since S2’s log is [C1 C2 C5], this means that the entry that committed (and was applied at all servers, including S1) at index 2 is C2. This despite the fact that C4 was the last client operation to have returned index 2 at S1.
由于S2的日志是[C1 C2 C5],这意味着在索引2处提交(并在所有服务器,包括S1处应用)的条目是C2。尽管C4是最后一个在S1返回索引2的客户端操作,但这也是事实。
The four-way deadlock:
All credit for finding this goes to Steven Allen, another 6.824 TA. He found the following nasty four-way deadlock that you can easily get into when building applications on top of Raft.
发现这个问题的所有功劳要归功于Steven Allen,另一个6.824 TA。他发现了以下讨厌的四向死锁,当你在Raft之上构建应用程序时,很容易陷入其中。
Your Raft code, however it is structured, likely has a Start()-like function that allows the application to add new commands to the Raft log. It also likely has a loop that, when commitIndex is updated, calls apply() on the application for every element in the log between lastApplied and commitIndex. These routines probably both take some lock a. In your Raft-based application, you probably call Raft’s Start() function somewhere in your RPC handlers, and you have some code somewhere else that is informed whenever Raft applies a new log entry. Since these two need to communicate (i.e., the RPC method needs to know when the operation it put into the log completes), they both probably take some lock b.
你的Raft代码,无论其结构如何,很可能有一个类似Start()的函数,允许应用程序向Raft日志添加新命令。它还可能有一个循环,当 commitIndex 被更新时,对日志中 lastApplied 和 commitIndex 之间的每个元素调用 apply()。在你基于Raft的应用程序中,你可能在RPC处理程序的某个地方调用Raft的Start()函数,并且在其他地方有一些代码在Raft应用新的日志条目时被通知。由于这两者需要进行通信(即RPC方法需要知道它放入日志的操作何时完成),它们可能都需要一些锁b。
In Go, these four code segments probably look something like this:
在Go中,这四个代码段可能看起来像这样:
func (a *App) RPC(args interface{}, reply interface{}) { // ... a.mutex.Lock() i := a.raft.Start(args) // update some data structure so that apply knows to poke us later a.mutex.Unlock() // wait for apply to poke us return } |
---|
func (r *Raft) Start(cmd interface{}) int { r.mutex.Lock() // do things to start agreement on this new command // store index in the log where cmd was placed r.mutex.Unlock() return index } |
func (a *App) apply(index int, cmd interface{}) { a.mutex.Lock() switch cmd := cmd.(type) { case GetArgs: // do the get // see who was listening for this index // poke them all with the result of the operation // ... } a.mutex.Unlock() } |
func (r *Raft) AppendEntries(...) { // ... r.mutex.Lock() // ... for r.lastApplied < r.commitIndex { r.lastApplied r.app.apply(r.lastApplied, r.log[r.lastApplied]) } // ... r.mutex.Unlock() } |
Consider now if the system is in the following state:
现在考虑一下,如果系统处于以下状态:
- App.RPC has just taken a.mutex and called Raft.Start
App.RPC刚刚获取了a.mutex并调用Raft.Start
- Raft.Start is waiting for r.mutex
Raft.Start正在等待r.mutex
- Raft.AppendEntries is holding r.mutex, and has just called App.apply
Raft.AppendEntries持有r.mutex,并且刚刚调用了App.apply。
We now have a deadlock, because:
我们现在有一个死锁,因为:
- Raft.AppendEntries won’t release the lock until App.apply returns.
Raft.AppendEntries不会释放锁,直到App.apply返回。
- App.apply can’t return until it gets a.mutex.
App.application在获得a.mutex之前不能返回。
- a.mutex won’t be released until App.RPC returns.
a.mutex不会被释放,直到App.RPC返回。
- App.RPC won’t return until Raft.Start returns.
在Raft.Start返回之前,App.RPC不会返回。
- Raft.Start can’t return until it gets r.mutex.
Raft.Start在获得r.mutex之前不能返回。
- Raft.Start has to wait for Raft.AppendEntries.
Raft.Start必须等待Raft.AppendEntries。
There are a couple of ways you can get around this problem. The easiest one is to take a.mutex after calling a.raft.Start in App.RPC. However, this means that App.apply may be called for the operation that App.RPC just called Raft.Start on before App.RPC has a chance to record the fact that it wishes to be notified. Another scheme that may yield a neater design is to have a single, dedicated thread calling r.app.apply from Raft. This thread could be notified every time commitIndex is updated, and would then not need to hold a lock in order to apply, breaking the deadlock.
有几种方法可以解决这个问题。最简单的方法是在App.RPC中调用a.raft.Start后获取a.mutex。然而,这意味着在App.RPC有机会记录它希望被通知的事实之前,App.apply可能会被调用,用于App.RPC刚刚调用Raft.Start的操作。另一个可能产生更整洁的设计的方案是,由一个专门的线程从Raft调用r.app.apply。这个线程可以在每次更新commitIndex时得到通知,这样就不需要为了应用而持有一个锁,从而打破了死锁。