Compared to Paxos, one of Raft’s distinguishing features is that the algorithm is decomposed into relatively orthogonal parts — leader election, log replication, state persistence, log compaction, and membership changes. You can tell by looking at the course outline that, except for the last part, these modules correspond to what we implement in PartA ~ PartD of the course. The benefit of orthogonally decomposing the algorithm is that each module becomes relatively self-contained, making the whole easier to understand and implement — this is also the original design intent of the Raft algorithm.

Below, I do not intend to use a precise approach to explain each module — that is the job of the paper and the code implementation. Instead, in this chapter I will guide everyone to build an intuitive understanding of Raft’s basic concepts (term, election) and two major flows (leader election, log replication). Armed with this intuitive understanding, you can then carefully study the paper, and I believe you will be able to sort out the massive details of the Raft algorithm with twice the result for half the effort.

Author: 木鸟杂记 https://www.qtmuniao.com/2023/11/15/raft-explain Please indicate the source when reprinting

Term

Term is important in any consensus protocol. All key events in Raft unfold based on terms. The most intuitive understanding of a term is a leader’s term of office, like a “presidential term.”

But fundamentally it is a metaphor about time, which can be understood as an “era” — like the villagers in Peach Blossom Spring who “knew not of Han, let alone Wei and Jin” — that kind of dynastic era. Similar to the villagers of Peach Blossom Spring, in Raft, if a network partition occurs and some peers are isolated, they can easily be unaware of which term other peers have reached. After some time, when the isolated peers in a partition re-establish communication with other peers (like the Wuling native discovering them), the first thing to do is align terms, which is the foundation for all subsequent communication.

From another perspective, the term is also a metaphor for priority or power:

When a peer with a lower term receives any message from a peer with a higher term, it will automatically “follow” the term and become a Follower.
When a peer with a higher term receives any request from a peer with a lower term, it will directly reject it.

When all peers “communicate” (RPC), the term is the first priority. Only after aligning terms is there a basis for discussing anything else.

Leader Election

Raft uses a “strongman mode,” meaning that once a Leader is elected, it has absolute authority over what the log looks like during its term. Raft also adopts a “there can’t be two tigers on one mountain” strategy — there is at most one Leader in any given term (or possibly none elected). But at the same moment, there may be multiple Leaders, yet they must be in different terms.

Because of the strongman strategy, elections must be conducted with great care — in order to select someone who can “take the overall situation into account” (a candidate with all committed logs). To this end, each peer, when voting, must compare whose log is “newer and more complete.” Once a Follower casts its vote, it signifies heartfelt submission to that candidate — a “promise” not to initiate an election for a period of time (resetting the election timer).

After being elected, the first thing a Leader must do is “announce to the world” (heartbeats) to “suppress” others “attempting to challenge authority” — “forcing” each Follower to promise not to initiate an election for a period of time. After receiving a heartbeat, as long as its term is not greater than the Leader’s, the Follower must obediently give this “promise” (resetting the election timer).

After that, the Leader will periodically send “decrees” until it receives a message from a higher “term,” at which point it must obediently “relinquish power” and step down as Leader. From this we can also see that the term is the first priority; this is an attack of the “law of time” that even the Leader cannot be immune to. Of course, there is also an optimization: when the Leader discovers it has become a “loner” (finding that most people no longer respond to its “decrees,” usually occurring when the Leader is network-partitioned from the majority of Followers), it automatically steps down.

Log Replication

After receiving a “request” from the “client,” the Leader will package it as a “decree” (log entry) and “attach” it to periodic broadcasts (heartbeats), “instilling” the decree into each Follower. The simplest and most brute-force method is to attach all local logs to the heartbeat; after receiving them, the Follower simply replaces its local logs. But if the Leader has a very large amount of logs, the communication cost will be very high. Therefore, the Leader adopts an “optimistic + fallback” approach for synchronization:

Optimistic: At first, the heartbeat does not attach any logs, only sending some “code words.” If the Follower discovers through the “code word” that its logs are completely consistent with the Leader’s, it directly replies: consistent, and subsequent heartbeats do not need to attach any logs.
Fallback: If the Follower discovers through the “code word” that its logs are not consistent with the Leader’s, it also tells the Leader — next time you need to attach logs. Then the Leader attaches some trailing logs; if it finds they are still inconsistent, it must continue to fall back, attaching more logs further forward, while updating the “code word,” until it receives an affirmative reply from the Follower, at which point it resumes heartbeats without attaching any logs.

This “code word” is the tuple of the previous log entry information that the Leader attaches with the logs: <index, term>. If the heartbeat does not attach any logs, the code word is the information about the Leader’s last log entry. To ensure that the Leader’s attached logs always have a preceding log entry, when we initialize the logs, we place an “empty log entry” at the beginning, thereby avoiding some boundary checks (this practice is similar to a linked list with a dummy head node).

Then why does matching the “code word” guarantee that the two parties’ log prefixes are consistent?

Simply put, matching the “code word” is a recursive process. According to mathematical induction, every time a new log is appended, the preceding logs must be aligned. And since terms are monotonically increasing and there is at most one Leader per term, the log prefixes of the same term must match. In this way, if the prefixes of the same term match, and before cross-term synchronization the previous term’s logs are aligned, then under conditions of “unimpeded decrees,” all logs will eventually converge to the Leader’s log.

After the “decree” (log) is replicated to the majority of nodes, the Leader will announce that the decree is “in effect” (committed). But the paper specifically emphasizes that the Leader cannot directly announce a predecessor’s “decree” as effective; instead, after issuing a “decree” in its own term, it indirectly “ratifies” the relevant “decrees” of previous terms by “committing” its own term’s “decree.” Why is this? (The following is the famous “Figure 8” from the Raft paper; many students who have implemented Raft have probably been tortured by this figure.)

raft-fig8.png

This is because during our election we decide whether a Leader is elected by comparing the “size” of the last log entry, so a predecessor’s log, if it has not been “settled” by a decree in the current term, could potentially be overwritten by another peer that is elected Leader with a newer log at the same index — this is the situation in c, d, e of the figure above — without log 4 “pressing down” on it, S5 could be elected Leader, after which S1~S3’s log 2 could potentially be overwritten by S5’s log 3.

This article comes from the course “Building a Distributed KV from Scratch” that I co-created with roseduan. This course will hand-hold you through understanding a consensus protocol, as well as all aspects and details of a distributed KV based on a consensus protocol; it will also teach you how to organize and write beautiful engineering code. Distributed systems are the foundational architecture of today’s mainstream internet systems, and consensus protocols are a typical representative and the cornerstone of cornerstones among them. Studying this course will give you a comprehensive and in-depth understanding of the problems faced by distributed systems and the techniques used to solve them.
Interested students are welcome to click “Building a Distributed KV from Scratch” for details.

cover1.jpeg

我是青藤木鸟，一个喜欢摄影、专注大规模数据系统的程序员，欢迎关注我的公众号：“木鸟杂记”，有更多的分布式系统、存储和数据库相关的文章，欢迎关注。关注公众号后，回复“资料”可以获取我总结一份分布式数据库学习资料。回复“优惠券”可以获取我的大规模数据系统付费专栏《系统日知录》的八折优惠券。

我们还有相关的分布式系统和数据库的群，可以添加我的微信号：qtmuniao，我拉你入群。加我时记得备注：“分布式系统群”。另外，如果你不想加群，还有一个分布式系统和数据库的论坛（点这里），欢迎来玩耍。