Four modifications for the Raft consensus algorithm (translated)
Posted by snake-fly
I recently spent some spare time translating the paper "Four modifications for the Raft consensus algorithm". The paper improves a Raft implementation in four areas, and the lessons for engineering practice are as follows:

1. Cluster initialization: when a majority of nodes have failed and the cluster would otherwise remain unavailable, this allows the surviving minority to be restored to a normally serving cluster.

2. Universally Unique Database Identifier: in multi-Raft-group and multi-datacenter deployments, this prevents the consensus service from storing inconsistent data as a result of operator mistakes.

3. Prevote algorithm and Leader stickiness: together these keep the cluster running as stably as possible even on an unreliable network with frequent transient disconnections.

The paper follows below.
Four modifications for the Raft consensus algorithm
1. Background
In my work at MongoDB, I've been involved in a team that is adapting our replication protocol to conform to the principles set out in the academic paper that describes the Raft algorithm (Ongaro, 2014). Breaking with academia's long-standing obsession with Paxos, of which I'm yet to hear about a robust real-world and open source implementation, Raft describes a simple, easy to understand leader-based consensus algorithm. (In the vocabulary used outside of academia, Raft describes a single-master, synchronous replication protocol using majority elections.)
Hence, while my original interest in Raft is purely work related, clearly it is meaningful to consider Raft itself as a valuable contribution to the brittle art of making databases highly available. As such, this paper is written purely within the context of Raft itself, without any reference to the replication implementation in MongoDB, and ignoring the multitude of differences that a MongoDB implementation will have compared to Raft. (One of which is MongoDB being pull-based, while Raft is a push-based algorithm.)
This is the second version of this paper and supersedes its predecessor from a month ago, which was called "Three modifications for Raft consensus". This version adds an explicit cluster initialization step, hence making it four modifications. Adding the cluster initialization as an explicit part of the algorithm makes the description of the database id more straightforward. As part of simplifying the creation of the database id, this paper no longer proposes the option to automatically delete old data from a server; this was seen as an unsafe operation by several reviewers of the first version and therefore became cause for much unnecessary confusion. For the benefit of those who already read the previous version, the new and changed parts of this paper are colored dark red.
I would like to particularly thank Oren Eini for his thorough review of the first version of this paper and for proposing to codify the cluster initialization as a more explicit part of the algorithm.
2. Introduction
The major contribution of the Raft authors has clearly been to finally introduce an easy to understand, easy to implement protocol for distributed consensus. While more complicated protocols may offer additional value in terms of feature set, such as supporting a multi-master scheme, it is a valuable resource for the community of distributed databases to at least have a robust alternative for the most basic kind of replication, leader-based consensus. Somewhat surprisingly, the level of elegance achieved by Raft has not been clearly documented previously, let alone provided with proofs of correctness. The closest, and compared to Paxos rather useful, cousin to Raft would be the family of ViewStamped Replication algorithms (Oki & Liskov, 1988); however, Raft simplifies significantly compared to ViewStamped Replication.
It is common for academic papers to focus mostly on the core replication of events themselves, and the closely related mechanisms of electing a leader or determining a quorum. On the other hand, they commonly neglect important surrounding topics, such as cluster maintenance activities or "environmental corner cases", that in the real world are equally important ingredients in creating a complete solution for a highly available database.
The Raft algorithm itself has evolved in a similar fashion. This is evident when following the succession of published papers, from the original Usenix 2014 paper (Ongaro & Ousterhout, Usenix 2014) to the extended version of the same paper (Ongaro & Ousterhout, 2014) to the thesis by Diego Ongaro published later in 2014.
For example, the original Usenix paper did not include any mention of snapshotting the state machine; rather, it simply describes an indefinitely growing log of events, which is clearly not realistic for most real-world systems that might implement Raft, including the authors' own RAMCloud database. The extended version then added a discussion on snapshotting, including an InstallSnapshot RPC for sending the snapshot to followers when needed.
Similarly, the original Usenix paper does include a discussion on cluster membership changes, but it is obvious even to the casual reader that this part of the paper did not receive the same amount of thought that went into the core of the algorithm, and certainly does not achieve the simplicity goal the authors had set themselves. Ultimately the cluster membership change protocol ends up in the curious state where members of the cluster are receiving (and accepting!) events from a leader that's not even part of the current cluster. The Ongaro thesis then replaces that complexity with two very simple RPCs to add and remove servers one at a time. And as is often the case, the simpler algorithm also turns out to be more robust than the complex one!
In the same spirit of evolving from a core protocol to a more complete and realistic implementation, the goal of this paper is to introduce four modifications to Raft that are relevant to real-world distributed databases:
1. Cluster initialization: a cluster initialization step that is the starting point of the lifecycle of the cluster. Having this step explicitly described makes it more straightforward to also describe the database id and its relation to the AddServer RPC.
2. Universally Unique Database Identifier: to identify whether a snapshot or a log on one server is in fact some (predecessor or successor) state of the same state machine as on the other servers of the cluster, or whether it's a snapshot of some completely different state machine that has nothing to do with this cluster.
3. Prevote algorithm: this paper provides a more detailed specification of the idea suggested only in passing in §9.6 of (Ongaro, 2014).
4. Leader stickiness: building on the pre-vote algorithm, we also modify Raft to reject servers that attempt to elect themselves as leader if the current leader appears to still be healthy to the rest of the cluster. This is to avoid flip-flopping between two competing leaders.
The proposed modifications in this paper are written against the most recent publication on Raft, Diego Ongaro's thesis (Ongaro, 2014), which the reader is assumed to be familiar with. The tables summarizing the algorithm have been reproduced on the next few pages. The additions of this paper are highlighted in blue.
3. Cluster Initialization
A real-world implementation of Raft, or of any other consensus protocol, would of course have some kind of bootstrap procedure that executes multiple actions to initialize the state of a new cluster. Arguably, such a bootstrap sequence is clearly implied in (Ongaro, 2014), as the state is set to some default values "at first boot".
However, reviewers of previous versions of this paper pointed out that the descriptions of a databaseId and the closely related AddServer RPC would benefit from having an explicit initialization step as part of the Raft algorithm. In addition, the discussion resulted in a consensus that all real-world implementations we are familiar with actually have an initialization sequence that does exactly the steps we will document here. (Of course a real-world implementation would still do a number of other initialization tasks that are not relevant to describe as part of Raft.)
The cluster initialization step is simple yet profound. When a server starts, it is incapable of becoming a leader and hence incapable of processing any events from clients. The server must either join an existing cluster via an AddServer call, or explicitly become a new cluster with InitializeCluster.
If InitializeCluster is called on a server, it will generate a new databaseId (see next section) and also set any other state variables to their default values. Note that after initialization, the server will itself be a majority of a cluster with one server, so it will proceed to elect itself as leader.
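As a rough illustration of the step just described, here is a minimal sketch in Go; the Server type, its fields, and the helpers newDatabaseID and persistState are assumptions made for this sketch, not definitions from the paper:

    // Hypothetical server state; field names are illustrative only.
    type Server struct {
        id          string   // this server's own identifier
        databaseID  string   // universally unique database identifier
        currentTerm uint64   // persistent Raft state
        votedFor    string   // persistent Raft state
        members     []string // current cluster configuration
    }

    // InitializeCluster marks the start of a new cluster lifecycle. Any
    // existing currentTerm and log are kept, but from now on they belong
    // to the newly generated databaseId.
    func (s *Server) InitializeCluster() error {
        id, err := newDatabaseID() // assumed helper, see the sketch in section 4
        if err != nil {
            return err
        }
        s.databaseID = id
        s.votedFor = ""            // back to the "first boot" default
        s.members = []string{s.id} // a configuration with a single voting member
        // Being its own majority, the server will now elect itself leader
        // on the next election timeout.
        return s.persistState() // assumed helper that writes durable state
    }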
InitializeCluster marks the starting point of the lifecycle of a Raft cluster and state machine. The lifecycle ends when the last server is removed (or rather, removes itself) from the cluster. In practice the lifecycle is more likely to end simply with the servers being shut down, never to be restarted again.
However, it is also allowed to call InitializeCluster on a server that is already part of an initialized cluster (one that has one or more members). This could be necessary for a number of reasons. In the Raft algorithm itself, it is for example possible to get into a state where a majority of servers have been lost permanently. The cluster is now without a leader and will not be able to accept new events from clients, and also will not be able to elect a new leader. Without a leader it is also impossible to correct the situation, since AddServer and RemoveServer must be called on the leader.
In such cases the solution is to call InitializeCluster again, essentially starting a new cluster from the remains of the one that was just lost. As the databaseId is now regenerated, this also helps prevent split brain. Or, in the case that the admin calls InitializeCluster on multiple parts of the lost cluster, they would still split into separately identifiable parts, each with its own distinct databaseId.
Therefore, when calling InitializeCluster, the server may or may not be empty of data. If there already is some data on the server, the currentTerm and log index are preserved, but they should be interpreted as the starting point in the lifecycle of a new cluster, detached from the cluster they previously belonged to; the log now lives irrevocably in a parallel history.
Note that for the other route for a server to become part of a Raft cluster, via the AddServer call, the opposite is true: the server must either already be in a state with a matching databaseId, or it must be in an empty and uninitialized state. It is not allowed to add a server that contains some unrelated data (or state); the administrator would have to first delete such a database/state from the new server.
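A minimal sketch, reusing the hypothetical Server type from the earlier sketch, of the precondition a leader could enforce in AddServer; the RPC argument shape here is an assumption, not the paper's definition:

    import "fmt"

    // AddServerArgs is what a joining server reports about itself
    // (hypothetical shape for this sketch).
    type AddServerArgs struct {
        ServerID   string
        DatabaseID string // empty if the server is uninitialized
        HasData    bool
    }

    // canAddServer rejects servers that carry unrelated state.
    func (s *Server) canAddServer(args AddServerArgs) error {
        switch {
        case args.DatabaseID == s.databaseID:
            // Same state machine lineage: catch up via AppendEntries,
            // or InstallSnapshot if the needed log entries are gone.
            return nil
        case args.DatabaseID == "" && !args.HasData:
            // Empty, uninitialized server: admit it and send a snapshot.
            return nil
        default:
            // Unrelated data or a foreign databaseId: the administrator
            // must wipe the server before it can be added.
            return fmt.Errorf("server %s holds unrelated state (databaseId %q)",
                args.ServerID, args.DatabaseID)
        }
    }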
Note that in the algorithm description above, we have specified InitializeCluster as an operation that can only be called on a single server, which then becomes the sole member of its new cluster. This is mostly an artefact of the way Raft is designed: it is always the Leader (or sometimes a Candidate) that makes Remote Procedure Calls toward followers, but there is no procedure where followers call each other.
For real-world implementations it may be desirable, in particular to avoid unnecessary InstallSnapshot calls, which are expensive for large snapshot files, to allow a minority of servers that are still connected to each other to enter into the new cluster together, rather than initializing a single server first and then re-adding the others with the AddServer RPC. This use case is in fact implicitly supported: after calling InitializeCluster on the first server, it would be possible to copy the newly generated databaseId to any other servers one would want to bring into the new cluster. For example, an implementation may provide some kind of setDatabaseId() call for this purpose. Note that this would definitely be considered an "admin forceful override" type of operation. The admin would only use this in unusual circumstances and must be sure to only call it on nodes that actually share the same log history with the server the databaseId is copied from. After setting the databaseId on a second server, it could then join the first server's cluster normally via an AddServer call. Since the databaseIds of the servers match, the AddServer call will succeed without an InstallSnapshot operation.
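A setDatabaseId() override, if an implementation chooses to offer one, could be as small as the following sketch. The guard shown is an assumption about what a careful implementation might check, since the crucial condition (shared log history) cannot be verified mechanically and remains the admin's responsibility:

    import "errors"

    // SetDatabaseID is an explicit "admin forceful override". It is only
    // safe on a server whose log genuinely descends from the server the
    // databaseId was copied from.
    func (s *Server) SetDatabaseID(newID string) error {
        if len(s.members) > 1 {
            // Assumed guard: refuse on a server that still considers itself
            // part of a multi-member cluster configuration.
            return errors.New("refusing override on a member of an active cluster")
        }
        s.databaseID = newID
        return s.persistState()
    }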
4. Universally Unique Database Identifier
While the Raft algorithm as published in (Ongaro, 2014) does a great job of maintaining the integrity of a single replicated state machine, in the real world database clusters don't live in a vacuum. A sysadmin will be operating multiple servers in one or more datacenters, each server belonging to some cluster. Failing to take this obvious fact into account, implementations will often be left vulnerable to various split-brain conditions, especially due to operational errors such as misconfiguration. While we could blame many of these conditions on the sysadmin, it is actually simple for an implementation to protect against such errors, and one should of course therefore do so.
To illustrate a simple path to a split brain, one out of many possible, consider a two-server cluster and the following steps, which are fully legitimate operations as defined in the (Ongaro, 2014) version of Raft:
Sequence of steps leading to a split brain, which goes undetected:
a) Two servers S1 and S2 form the cluster, with S1 being the leader. They commit x←1 and y←2, after which there is a network split. Since neither server alone is a majority, there is no leader and replication stops. At this point both servers have (term=1, index=2).
b) To get the service back into an available state, the sysadmin has to change the cluster configuration by calling S1.RemoveServer(S2). (Strictly speaking this can only be called on the leader; however, since it is not possible for any server to become leader in this state, a real-world implementation would provide, or the sysadmin would in any case invent, some method of forcing the reconfiguration to get back to a functional state.) Unfortunately, another sysadmin at the same time calls S2.RemoveServer(S1), causing a split brain. S1 commits z←3 and x←4 (term=2, index=2). S2 commits z←9 (term=2, index=1).
c) As the network is restored, the sysadmins take steps to rejoin the servers with each other, calling S1.AddServer(S2). S1 then calls the AppendEntries RPC to compare the term and index of each of the servers' logs, and sends the one log entry (x←4) that appears to be missing, causing S2 to "catch up". Finally, S1 commits, and replicates to S2, y←5 (term=3, index=1). The diverging value at (term=2, index=1) goes undetected, perpetuating a split brain between the state machines until overwritten by a new value.
More generally, the various conditions that the AddServer RPC must be capable of dealing with are the following: (1) the added server is empty and uninitialized; (2) the added server already holds an earlier state (snapshot and/or log) of this same cluster, for example because it was previously a member; (3) the added server holds data that belongs to some completely unrelated state machine; (4) the added server has been deliberately seeded by the sysadmin with a recent copy of this cluster's data, precisely to avoid an expensive InstallSnapshot.
The problem for the implementor of a replicated state machine is that it is not easy to distinguish between the different scenarios 2-4. A simple, robust solution would be to always delete the data and reset the state on the added server, so that one would always commence with the InstallSnapshot RPC. However, this would defeat the sysadmin's attempt at optimizing the process in scenario #4, and may often be suboptimal also for scenario #2.
To be able to distinguish scenario #3 from the others, we will have to introduce a Universally Unique Database Identifier, which we call databaseId in the summary tables of the Raft algorithm above.
Any pseudo-unique identifier is fine, such as a UUID. Implementations should not allow users to provide the identifier themselves; that would just lead to every other databaseId having the value "mycluster".
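For instance, a databaseId could simply be 128 random bits rendered as hex (a UUID library would serve equally well); newDatabaseID is the assumed helper referenced in the earlier sketches:

    import (
        "crypto/rand"
        "encoding/hex"
    )

    // newDatabaseID returns a pseudo-unique identifier. It is generated,
    // never user-supplied, which avoids every cluster being "mycluster".
    func newDatabaseID() (string, error) {
        b := make([]byte, 16) // 128 random bits, the same size as a UUID
        if _, err := rand.Read(b); err != nil {
            return "", err
        }
        return hex.EncodeToString(b), nil
    }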
The databaseId is generated at the creation of a new cluster with InitializeCluster and is thereafter constant and the same value for all servers in the cluster. It should be persisted to durable storage, and should typically be part of backups. Note that the triplet of (databaseId, term, index) will uniquely and globally identify a state of the database or state machine. If two state machines are both in the state (databaseId=x, term=y, index=z), they are guaranteed to be identical. Note how this extends the Log Matching property into a world where there exists more than one distributed state machine (i.e. database cluster), making it a globally valid property.
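The global identity property lends itself to a small value type; a sketch with assumed names:

    // StateID globally identifies one state of a replicated state machine.
    type StateID struct {
        DatabaseID string
        Term       uint64
        Index      uint64
    }

    // SameState is true exactly when two snapshots or logs are guaranteed
    // to describe the identical state machine state.
    func SameState(a, b StateID) bool {
        return a == b // struct equality compares all three fields
    }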
5. Pre-vote algorithm
Section 9.6 in (Ongaro, 2014) introduced the idea of a pre-vote algorithm without actually spelling out the details of such an algorithm. We will define one such algorithm here, so that we can build upon it in the next section.
The usefulness of a pre-vote algorithm is easily explained. The Raft algorithm has a strong property that leads its participants to always adopt the highest term they have observed from another server. This property is key to the robustness of the algorithm: elections become deterministic because of it, and the Log Matching property likewise depends on it.
A drawback in real-world implementations is that this easily leads to unnecessary "term inflation". Assuming servers use 64-bit integer arithmetic, they are unlikely to run out of numbers during the lifetime of a cluster, but clusters do become subject to a behavior where a malfunctioning server will force an election when rejoining the cluster, even if the rest of the cluster has been healthy and continues to have a fully functioning leader.
0) A Raft cluster with three servers, one of which (S2) is temporarily disconnected from the rest.
1) Raft causes the disconnected server to call for an election once every election timeout, causing it to increment currentTerm. As it cannot connect to the other servers, it will lose its election, and keep retrying.
2) Once it is able to reconnect with the rest of the cluster, its higher term will propagate to S1 and S3. This will cause S3 to stop accepting new log entries from S1, and will cause S1 to step down as leader.
3) A new election for term=58 is eventually triggered and will be won by whichever server goes ahead first.
The solution is to introduce a pre-vote algorithm, which is executed before changing to candidate status. Only if the pre-vote is successful should the server switch to candidate; otherwise it should wait for another election timeout.
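On the caller side this changes the follower's election-timeout handling roughly as sketched below, reusing the hypothetical Server from the earlier sketches; sendPreVote, otherMembers and becomeCandidate are assumed helpers, and the pre-vote is sent for currentTerm+1 without persisting the increment:

    // onElectionTimeout sketches the modified transition to candidate.
    func (s *Server) onElectionTimeout() {
        granted := 1 // count our own prospective vote
        for _, peer := range s.otherMembers() {
            if s.sendPreVote(peer, s.currentTerm+1) { // assumed synchronous RPC helper
                granted++
            }
        }
        if granted*2 > len(s.members) {
            // A majority would vote for us: hold the real election, which
            // increments currentTerm and issues RequestVote RPCs.
            s.becomeCandidate()
            return
        }
        // Otherwise stay follower and wait for another election timeout.
    }

A real implementation would issue the pre-vote RPCs concurrently with a deadline rather than sequentially as shown; the sketch only illustrates the decision logic.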
The implementation of the pre-vote algorithm is straightforward: receivers of a PreVote RPC should respond with the same result that they would give if this were an actual vote.
However, it is important to emphasize that the pre-vote response is not binding on a server. While each server must only vote for a single candidate in a given term, this is not true for the pre-vote phase. Multiple would-be candidates could receive a positive indication in the pre-vote phase, yet once they proceed to the actual election, only one of them will receive the actual vote. This behavior is key to avoiding race conditions that could lead to inconclusive elections.
For example, a server could succeed in getting a prospective majority in the pre-vote phase, but then itself become disconnected before it is able to proceed with the actual election. In this case it would be a waste of precious failover time not to vote for another candidate who still can win the election.
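The receiver side then answers a PreVote exactly as it would answer RequestVote, but records nothing, which is what makes the response non-binding; logUpToDate is an assumed helper implementing Raft's usual up-to-date comparison:

    // HandlePreVote grants or denies a prospective vote. Unlike a real
    // vote, neither currentTerm nor votedFor is updated, so several
    // would-be candidates may all receive a positive answer.
    func (s *Server) HandlePreVote(candidateTerm, lastLogTerm, lastLogIndex uint64) bool {
        if candidateTerm < s.currentTerm {
            return false
        }
        return s.logUpToDate(lastLogTerm, lastLogIndex)
    }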
6. Leader stickiness
The last flaw to address in this paper is Raft's vulnerability, in some corner cases, to leader flip-flopping. An example of such a scenario is described in the following steps:
0) In the starting state, S1 is the leader and S2 and S3 are followers. At this specific moment, clients are not sending any new events, so nothing is added to the log on S1, but it is sending empty AppendEntries RPC calls for heartbeat purposes. Note that the below scenario only happens when all logs in the cluster remain equal. Raft would in fact protect against flip-flopping in a case where S1 is able to successfully replicate new events to S3, since that would cause S2 to lose elections (and pre-votes) until it can catch up to the same position in its own log.
1) Network connectivity between S1 and S2 breaks down, causing each of them to conclude that the other server is disconnected from the cluster. However, both of them can still connect to S3.
2) As the election timeout is triggered on S2, it will increase its currentTerm and call the RequestVote RPC. S3 will grant the vote, and S2 will become the new leader.
3) S1 will learn from S3 about the increased term, and step down as leader. However, it will not get any connection from S2, which will cause its election timeout to trigger. S1 will then increase its currentTerm and call for an election. S3 will grant the vote, making S1 leader again.
4) This flip-flopping continues forever until the network connectivity between S1 and S2 is restored.
It is important to understand that, on a superficial level, flip-flopping is not a failure of the Raft protocol. Superficially, all the requirements are met at all times: there is exactly one leader at any time, hence the cluster remains available. Also the Log Matching property remains intact, and the integrity of the data is not corrupted.
In practice, however, such flip-flop behavior is undesired. It would be unwanted overhead for the clients to have to reconnect to a new leader every few seconds. For some types of implementations, such as a database supporting long-lived, multi-statement transactions, it could make the cluster de facto unavailable if the leader's elected term is shorter than the time it takes to commit a transaction.
So instead of such flip-flops, we would expect a (healthy) cluster to keep the same leader for days and weeks, maybe even months.
Raft actually does protect against some cases of flip-flopping. Since it is a requirement for the winning server to have its log at least as up to date as a majority of the cluster, this means that a single server, or a minority of servers, cannot disrupt the majority once they have fallen behind, which they are of course likely to do as soon as they become disconnected from the leader.
The example described above therefore requires that no new log entries are added, or at least not replicated and committed, so it is admittedly a corner case. However, in the real world flip-flopping may often also be caused by flaky networks, where connectivity breaks down for some seconds, then is restored, then breaks down again. Such network conditions would increase the likelihood of flip-flopping compared to our simple example, since the intermittently disconnected servers will be able to catch up with the leader. In such conditions the desired behavior would be for the cluster to keep its current leader as long as possible, rather than descending into a cycle of frequent elections and new leaders. It is with these real-world situations in mind that the following improvement is proposed.
A simple fix to such network flakiness would be to simply increase the election timeout to be longer than the typical network interruption. This would allow short disconnections to heal themselves before servers in the cluster start calling for elections, and it is in fact a common and effective fix for flakiness. But increasing the election timeout also has the effect of increasing the failover time in every case. Hence this proposal to add "leader stickiness" to Raft can be seen as a more advanced solution. While it adds some complexity, it is motivated by minimizing failover time, or maximizing availability.
The change to add "leader stickiness" is intuitive: followers should reject new leaders if, from their point of view, the existing leader is still functioning correctly, meaning that they have received an AppendEntries call less than an election timeout ago.
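Layered on the pre-vote handler sketched earlier, the stickiness rule is one extra check; lastHeardFromLeader (a time.Time updated on every AppendEntries) and electionTimeout (a time.Duration) are assumed fields on the hypothetical Server:

    import "time"

    // leaderLooksHealthy is true while this follower has recently heard
    // from the current leader.
    func (s *Server) leaderLooksHealthy() bool {
        return time.Since(s.lastHeardFromLeader) < s.electionTimeout
    }

    // HandleStickyPreVote rejects would-be candidates while the existing
    // leader still appears healthy from this follower's point of view.
    func (s *Server) HandleStickyPreVote(candidateTerm, lastLogTerm, lastLogIndex uint64) bool {
        if s.leaderLooksHealthy() {
            return false // loyal to the current leader; no election needed
        }
        return s.HandlePreVote(candidateTerm, lastLogTerm, lastLogIndex)
    }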
This will have the effect that a single "problematic" server (or even a minority of servers) with connectivity issues will not be able to win new elections if a majority of the cluster is still connected to the current leader and functioning normally. Or, put in more political wording: followers will be loyal to their existing leader, and only servers that agree that an election is needed (Candidates) will vote for a leader from within themselves.
This intuitive fix does introduce a non-obvious change to how new leaders are chosen, also in the common case. Now, with the stickiness added to the elections, the first servers to have their election timeout lapse after a leader failure will actually be rejected in the PreVote phase, since a majority of servers will still think that the leader is there. Only when at least half of the servers have reached the election timeout will a server be able to gain majority support in the PreVote phase, and then ultimately from the RequestVote itself. In the simple case, for a cluster with three members, the second server to call the PreVote RPC will become the new leader; in a five-member cluster, the third server, and so on.