Overview
Some thoughts and experiences from implementing Raft for 6.824's Lab 2, recorded here for future reference.
Lab Overview
6.824 is a distributed systems course at MIT; I followed the Spring 2018 offering. The second lab asks you to implement a relatively simple distributed consensus protocol: Raft.
Raft was designed specifically to be easy to teach and to implement as an engineering artifact. It decomposes consensus into several relatively independent modules: leader election, log replication, and safety guarantees. Figure 2 of the paper gives essentially all the implementation details of Raft; truly every word is a pearl. But precisely because it is so concise and dense, some state transitions end up scattered across different descriptions, and if you implement purely from that figure it is easy to miss details.
Author: Muniao’s Notes https://www.qtmuniao.com, please indicate the source when reprinting
Leader Election
This is the content of Lab 2A. To my shame, it dragged on for more than two months from initial design to passing the test cases. I only wrote code on evenings and weekends, but progress was still slow. Still, I gained a lot. I privately thought that if undergraduate labs were all like this, it might really be as people on Zhihu say: at most two courses per semester. Looking back at my junior year, with four four-credit courses, the course design practically guaranteed we would slack off.
Rant over, let me talk about the pitfalls I stepped into.
Timers
Raft mainly has two event loops: one in which a Follower or Candidate times out and triggers an election, and one in which the Leader sends periodic heartbeats (sometimes piggybacking log replication). The easiest approach to think of is the loop+sleep pattern we all used mindlessly in undergraduate course projects: an outer while true, with an inner sleep of some small interval t (at least one order of magnitude smaller than electionTimeout and heartbeatInterval, which is sufficient), periodically checking whether the relevant time point (needElection, heartbeat) has arrived.
But my OCD made me feel this wasn't accurate: the timing error is at least that detection interval t. With so many threads running around and such complex state changes, could errors caused by this inaccuracy lead to bugs? Using Go's timer, however, seemed complicated; every time I thought about it I failed to reach a conclusion, and this stalled me for a while.
Later, prompted by a friend who was also working on Raft, I noticed that the course had thoughtfully provided a suggestion: it was still loop+sleep with periodic timeout checks. At that moment everything clicked, and I understood the parenthetical remark I made above: as long as the check interval is one or two orders of magnitude smaller than the timeout interval, it is basically not a problem. The advantage of this implementation is that it is simple, crude, direct, and controllable.
While implementing this, I found it fun enough to write two slightly different versions. The code is attached below; for clarity, both have been somewhat simplified.
One uses only a single loop. A simplified sketch (field and helper names such as rf.lastReset and startElection are illustrative, not the lab's API):

```go
go func() {
	for {
		// Simplified sketch; names are illustrative.
		time.Sleep(checkInterval) // checkInterval << electionTimeout, heartbeatInterval
		rf.mu.Lock()
		switch rf.state {
		case Leader:
			// Heartbeat time point reached?
			if time.Since(rf.lastHeartbeat) >= heartbeatInterval {
				rf.lastHeartbeat = time.Now()
				go rf.broadcastHeartbeat()
			}
		default: // Follower or Candidate
			// Election time point reached?
			if time.Since(rf.lastReset) >= rf.electionTimeout {
				go rf.startElection()
			}
		}
		rf.mu.Unlock()
	}
}()
```
The other uses two nested loops, with the inner loop dedicated to waiting (perhaps I was reminded of CPU busy-waiting). Again a simplified sketch with illustrative names:

```go
for {
	// Simplified sketch; names are illustrative.
	rf.mu.Lock()
	deadline := rf.lastReset.Add(rf.electionTimeout)
	rf.mu.Unlock()

	// Inner loop: do nothing but wait, waking up every
	// checkInterval to see whether the deadline has passed.
	for time.Now().Before(deadline) {
		time.Sleep(checkInterval)
		rf.mu.Lock()
		// The timer may have been reset by a heartbeat meanwhile.
		deadline = rf.lastReset.Add(rf.electionTimeout)
		rf.mu.Unlock()
	}

	rf.startElection()
}
```
Although the former is more concise, the latter has clearer logic; concise code sometimes makes one piece of code serve several different purposes, which can blur the semantics. Considering that the first principle of code is that it is written for humans to read (What? You say it's for the machine? Ahem, I'd say that's merely the precondition for code to count as code at all; otherwise the compiler won't let it pass), I think the latter is better.
Locking
The course also thoughtfully provided a hint, but as someone who had been "playing with locks" since the freshman-year elevator project, I skipped it and charged ahead without hesitation, writing code furiously. Memory is deceptive, however, and the lesson was painful: I kept hitting deadlocks without knowing where the problem was. In the end I had to go back and obediently read the hint several times, and found it was really well written…
The gist, in one sentence: first add locks around every read or write of shared state, then remove the locks around anything that blocks (RPC calls and the like).
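As a tiny illustration of that gist (a hedged sketch; the Raft struct, callRPC, and doRound here are invented for the example, not the lab's API): snapshot the state you need while holding the lock, release the lock across the blocking call, then re-acquire it to apply the result.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Hypothetical mini-Raft state; field names are illustrative.
type Raft struct {
	mu          sync.Mutex
	currentTerm int
}

// callRPC stands in for a blocking network call.
func callRPC(term int) int {
	time.Sleep(10 * time.Millisecond) // simulate network latency
	return term
}

// doRound follows the principle: lock around shared state,
// but never hold the lock across the blocking call.
func (rf *Raft) doRound() int {
	rf.mu.Lock()
	term := rf.currentTerm // snapshot state under the lock
	rf.mu.Unlock()         // drop the lock before blocking

	replyTerm := callRPC(term) // blocking call, lock-free

	rf.mu.Lock()
	defer rf.mu.Unlock()
	if replyTerm > rf.currentTerm { // re-check against current state
		rf.currentTerm = replyTerm
	}
	return rf.currentTerm
}

func main() {
	rf := &Raft{currentTerm: 1}
	fmt.Println(rf.doRound())
}
```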
I’ll set aside this principle for now and talk about two places where I fell into pits:
Holding a lock outside a function call and reacquiring it inside the function call:
The reason for this mistake: looking at such a long stretch of code, experience told me to wrap it up into a function. While wrapping it, I realized, damn, it touches shared state everywhere, better grab a lock right away. But later, when calling that function, I forgot the call site was already inside the critical section; that is, eating from the bowl while still eyeing the pot. Deadlock achievement 1, get. No more nonsense, here is the shape of the code (simplified; names are illustrative):

```go
// Acquires rf.mu itself, because it touches shared state.
func (rf *Raft) becomeFollower(term int) {
	rf.mu.Lock()
	defer rf.mu.Unlock()
	rf.state = Follower
	rf.currentTerm = term
}

func (rf *Raft) handleReply(reply *AppendEntriesReply) {
	rf.mu.Lock() // already holding the lock here...
	defer rf.mu.Unlock()
	if reply.Term > rf.currentTerm {
		// ...and Go's sync.Mutex is not reentrant, so this call
		// blocks forever trying to Lock again: deadlock.
		rf.becomeFollower(reply.Term)
	}
}
```
Forgetting to release locks on branch break/return
break, continue, return: these sneaky ways of ending a branch early. We use them happily, but they break the symmetry people instinctively expect, and that sometimes makes us forget the cleanup. What do I mean by asymmetry? Here's the code:
```go
mu.Lock()
if !check(arg) {
	return // a hidden exit: the lock is never released
}
// ... lots of code ...
mu.Unlock()
```

vs.

```go
mu.Lock()
if !check(arg) {
	// handle the failure case
} else {
	// handle the success case
}
mu.Unlock() // the single exit point is easy to see
```
With the latter, a quick glance at the if/else alignment makes the number of function exits obvious. With the former, if the branch is buried in a mass of code, it's easy to forget that some obscure corner still hides an exit, and naturally you won't unlock there.
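One standard Go idiom worth noting here (not from the lab hint, just common practice): pairing Lock with a deferred Unlock makes every exit path release the lock, hidden early returns included. A minimal sketch; check and handle are invented for the example:

```go
package main

import (
	"fmt"
	"sync"
)

var mu sync.Mutex

func check(arg int) bool { return arg > 0 }

// With defer, the unlock runs on every return path,
// including early returns buried deep in the function.
func handle(arg int) string {
	mu.Lock()
	defer mu.Unlock()
	if !check(arg) {
		return "rejected" // the lock is still released
	}
	return "accepted"
}

func main() {
	fmt.Println(handle(-1)) // prints "rejected"
	fmt.Println(handle(1))  // prints "accepted"
}
```

The trade-off: defer holds the lock until the function returns, so it does not fit functions that must drop the lock mid-way around a blocking RPC, as described above.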
In my specific case: when the candidate asks for votes and gets a majority, it becomes leader immediately; when stragglers grant their votes later, I check whether I am already leader and, if so, return directly without counting them. Quite devious indeed… and, you guessed it, I forgot to release the lock before that return. The code follows, abridged:
```go
// Simplified sketch; names are illustrative.
go func(server int, args RequestVoteArgs) {
	var reply RequestVoteReply
	if !rf.sendRequestVote(server, &args, &reply) {
		return // RPC failed; no lock held yet
	}
	rf.mu.Lock()
	if rf.state == Leader || rf.currentTerm != args.Term {
		rf.mu.Unlock() // the unlock I originally forgot
		return
	}
	if reply.VoteGranted {
		rf.voteCount++
		if rf.voteCount > len(rf.peers)/2 {
			rf.becomeLeader()
		}
	}
	rf.mu.Unlock()
}(i, args)
```
After Gaps, Remember to Self-Check
Because state bounces back and forth across multiple loops (well, okay, just two, but why does it feel like so many?), the role (self) and the election term (external) may have changed drastically between the two sides of an asynchronous or blocking call. Therefore, self-checking is needed.
To my shame, I only realized this after consulting various materials.
- Asynchronous calls: across the goroutine boundary, i.e., inside vs. outside the goroutine.
- Blocking calls: before and after an RPC.
When the Leader sends heartbeats, because the send is an asynchronous call, it must first check that it is still leader and still in its own term (args saved the term as it was at send time); when the RPC returns, it performs the same check again. Only if both checks pass may it continue acting as leader.
```go
// Simplified sketch; names are illustrative.
go func(server int, args *AppendEntriesArgs) {
	rf.mu.Lock()
	// Self-check before acting: still leader, still in the same term?
	if rf.state != Leader || rf.currentTerm != args.Term {
		rf.mu.Unlock()
		return
	}
	rf.mu.Unlock()

	var reply AppendEntriesReply
	if !rf.sendAppendEntries(server, args, &reply) {
		return
	}

	rf.mu.Lock()
	defer rf.mu.Unlock()
	// Self-check again after the RPC: the world may have changed.
	if rf.state != Leader || rf.currentTerm != args.Term {
		return
	}
	if reply.Term > rf.currentTerm {
		// A higher term appeared; step down.
		rf.currentTerm = reply.Term
		rf.state = Follower
		rf.votedFor = -1
		return
	}
	// ...continue handling the reply as leader...
}(i, &args)
```
