Preface
After finishing Lab 2A, i.e., Raft leader election, last time, I was stuck on log replication for a long time. Then last night, after I optimized AppendEntries (skipping the whole conflicting term's log entries when prevLog doesn't match), the long-troubling TestBackup2B magically passed. I ran it twice and still couldn't quite believe it, so I deliberately reverted the change and watched it fail as expected before I felt relieved; apparently the unoptimized version was simply too slow and timed out.
While it’s still fresh, I might as well write down the blood, sweat, and tears of this period tonight.
Author: Muniao's Notes, https://www.qtmuniao.com. Please credit the source when reprinting.
Lab Overview
6.824 is a distributed systems course at MIT; I followed the spring 2018 version. The second lab asks for a simple implementation of a distributed consensus protocol: Raft.
This is a protocol designed specifically for ease of teaching and engineering implementation. It breaks consensus down into several relatively independent modules: leader election, log replication, and safety guarantees. Figure 2 in the paper gives essentially all of Raft's implementation details, and every word is precious. But precisely because it is so concise and dense, some state transitions are scattered across different descriptions, and if you implement strictly by following the figure, it's easy to miss details.
Log Replication
This is the content of Lab 2B. After more than a month of going back and forth, hitting pitfalls, clearing mines, giving up, and picking it back up, I finally completed the log replication part. The final code is just under seven hundred lines (694), and some implementation details were inspired by online resources. My mental state during this period matched a comic I saw online poking fun at programmers: ah, a bug, why? Fix fix fix. Still wrong? Fix fix fix. Ah, finally correct. But wait, why?
I feel the final passing version still needs improvement, and some places vaguely feel a bit redundant. But let’s be happy for now and get these thoughts down on paper, after all, the memory in this old person’s brain is quite limited and volatile.
The previous post on leader election mainly recorded some implementation tips; this post mainly explores the various logical details of the implementation.
General Flow
At first I more or less understood the paper and felt I had grasped all the details, but I still couldn't run the whole process through in my head. Looking back now, after a Leader comes online, log replication goes roughly like this:
- After the Leader is elected, it initializes state, mainly the `nextIndex` and `matchIndex` arrays. There may be many approaches, but this is what I did: all `nextIndex = len(rf.log)`, all `matchIndex = 0`.
- Then it immediately starts heartbeats, i.e., `AppendEntries`. The paper says sending an empty heartbeat is sufficient at the start, but in my implementation I sent a heartbeat carrying one logEntry. This follows from my initialization above and the parameter-construction strategy below.
- I divided the synchronization process into a probing phase, which finds the match position, and a transfer phase after the match: first match some `logEntry` on the Follower by advancing `prevLogIndex` + `prevLogTerm` each time, then have the Leader send all entries after the matched `logEntry` to the Follower in one go.
- When constructing the parameters, the other fields are fairly fixed; the main considerations are three fields: `prevLogIndex`, `prevLogTerm`, and `entries`. Since `prevLogTerm` is determined by `prevLogIndex`, that leaves `prevLogIndex` and `entries`. For `prevLogIndex`, in the probing phase I set it to `nextIndex-1`, which avoids unnecessary `entries` transmission during probing; in the transfer phase, `prevLogIndex` can be set to `matchIndex`, i.e., the last matched entry. For `entries`, take the closed interval `[prevLogIndex+1, min(nextIndex, len(rf.log)-1)]` from the Leader's log.
- Thus no data is transferred during each probe (`entries` is `[]`, because at that point `prevLogIndex = nextIndex-1`); in the transfer phase, all logEntries on the Leader after the match can be sent in one go. If there are no new logs, the variables settle at `nextIndex = len(rf.log)`, `prevLogIndex = matchIndex = len(rf.log)-1`, and `entries` is `[]`.
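The initialization step above can be sketched roughly like this. It's a minimal sketch, not my actual code: `initLeaderState` is a hypothetical helper name, and the `Raft` struct keeps only the fields it touches.

```go
package main

import "fmt"

// LogEntry is the minimal shape of a Raft log entry needed here.
type LogEntry struct {
	Term    int
	Command interface{}
}

// Raft keeps only the fields this sketch touches.
type Raft struct {
	log        []LogEntry
	nextIndex  []int
	matchIndex []int
}

// initLeaderState resets per-follower bookkeeping on becoming Leader:
// every nextIndex starts optimistically at len(rf.log), every
// matchIndex pessimistically at 0, as described above.
func (rf *Raft) initLeaderState(nPeers int) {
	rf.nextIndex = make([]int, nPeers)
	rf.matchIndex = make([]int, nPeers) // zero-valued, i.e., all 0
	for i := range rf.nextIndex {
		rf.nextIndex[i] = len(rf.log)
	}
}

func main() {
	rf := &Raft{log: []LogEntry{{Term: 0}, {Term: 1}}}
	rf.initLeaderState(3)
	fmt.Println(rf.nextIndex, rf.matchIndex) // → [2 2 2] [0 0 0]
}
```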
Other points are some details attached to this process:
- Periodically check the index up to which a majority of logEntries match, to decide whether to commit, i.e., advance the Leader's `commitIndex`, which is then synchronized to each Follower in subsequent `AppendEntries` RPCs.
- Meanwhile, another thread needs to check whether `lastApplied` has caught up with `commitIndex`, ensuring that committed logs are promptly applied to the state machine. I think these two variables are kept separate mainly for logical decoupling: the Leader actively checks and updates `commitIndex`, while a Follower passively updates it from the Leader's messages.
- After the Leader accepts a new command, it needs to update its own slot in the `matchIndex` array, because it also counts as one vote when tallying the majority.
- After a Follower disconnects, the Leader needs to promptly stop executing the callback content.
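The apply side of the second bullet can be sketched as follows. `applyCommitted` is a hypothetical helper name, and `ApplyMsg` only mirrors the lab's message in shape; a background goroutine would call this periodically.

```go
package main

import (
	"fmt"
	"sync"
)

// ApplyMsg mirrors the lab's apply-channel message in shape only.
type ApplyMsg struct {
	CommandValid bool
	Command      interface{}
	CommandIndex int
}

// Raft keeps just the fields the applier thread needs.
type Raft struct {
	mu          sync.Mutex
	log         []interface{} // index 0 unused, like the real log
	commitIndex int
	lastApplied int
	applyCh     chan ApplyMsg
}

// applyCommitted pushes every entry in (lastApplied, commitIndex]
// onto applyCh, so committed logs reach the state machine promptly.
func (rf *Raft) applyCommitted() {
	rf.mu.Lock()
	defer rf.mu.Unlock()
	for rf.lastApplied < rf.commitIndex {
		rf.lastApplied++
		rf.applyCh <- ApplyMsg{
			CommandValid: true,
			Command:      rf.log[rf.lastApplied],
			CommandIndex: rf.lastApplied,
		}
	}
}

func main() {
	rf := &Raft{
		log:         []interface{}{nil, "x", "y"},
		commitIndex: 2,
		applyCh:     make(chan ApplyMsg, 10), // buffered, for the demo
	}
	rf.applyCommitted()
	fmt.Println(rf.lastApplied, len(rf.applyCh)) // → 2 2
}
```

Note the demo uses a buffered channel; sending on an unbuffered channel while holding the lock would deadlock without a concurrent reader.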
AppendEntries Heartbeat & Sync
This mainly corresponds to the two phases mentioned above, the probing phase and the transfer phase; other things to note are the locking and state self-check mentioned in the previous post.
```go
// need handle the reply
```
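Reply handling could look roughly like the following. `ConflictIndex` is my hypothetical name for the fast-backup hint that lets the leader skip a whole conflicting term (the optimization that finally got TestBackup2B to pass); the real field names will differ.

```go
package main

import "fmt"

type AppendEntriesReply struct {
	Term    int
	Success bool
	// ConflictIndex (hypothetical field): the first index of the
	// conflicting term on the follower, so the leader can jump over
	// that whole term instead of probing back one entry at a time.
	ConflictIndex int
}

// Raft keeps only the fields reply handling touches.
type Raft struct {
	currentTerm int
	nextIndex   []int
	matchIndex  []int
}

// handleAppendEntriesReply updates per-follower state after one RPC;
// prevLogIndex and nEntries describe the request that was sent.
func (rf *Raft) handleAppendEntriesReply(peer, prevLogIndex, nEntries int, reply *AppendEntriesReply) {
	if reply.Term > rf.currentTerm {
		rf.currentTerm = reply.Term // newer term seen: a real impl steps down here
		return
	}
	if reply.Success {
		rf.matchIndex[peer] = prevLogIndex + nEntries
		rf.nextIndex[peer] = rf.matchIndex[peer] + 1
		return
	}
	// fast backup: jump over the follower's whole conflicting term
	rf.nextIndex[peer] = reply.ConflictIndex
	if rf.nextIndex[peer] < 1 {
		rf.nextIndex[peer] = 1
	}
}

func main() {
	rf := &Raft{currentTerm: 2, nextIndex: []int{5}, matchIndex: []int{0}}
	rf.handleAppendEntriesReply(0, 4, 3, &AppendEntriesReply{Term: 2, Success: true})
	fmt.Println(rf.matchIndex[0], rf.nextIndex[0]) // → 7 8
}
```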
Then comes parameter construction. `prevLogIndex` is also constructed differently in the two phases; then take the appropriate window of `entries`.
```go
func (rf *Raft) constructAppendEntriesArg(idx int) *AppendEntriesArgs { /* ... */ }
```
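For reference, here is a minimal sketch of what this construction could look like. The `matched` flag is a hypothetical stand-in for however the real code tells the two phases apart, and the entries window is simplified to "everything after the match point".

```go
package main

import "fmt"

type LogEntry struct{ Term int }

type AppendEntriesArgs struct {
	Term, LeaderId, PrevLogIndex, PrevLogTerm, LeaderCommit int
	Entries                                                 []LogEntry
}

// Raft keeps the fields parameter construction reads; matched is a
// hypothetical per-peer flag recording that probing has finished.
type Raft struct {
	currentTerm, me, commitIndex int
	log                          []LogEntry
	nextIndex, matchIndex        []int
	matched                      []bool
}

// constructAppendEntriesArg builds args for peer idx: while probing,
// prevLogIndex = nextIndex-1 and entries stays empty; once matched,
// prevLogIndex = matchIndex and everything after it ships in one go.
func (rf *Raft) constructAppendEntriesArg(idx int) *AppendEntriesArgs {
	prev := rf.nextIndex[idx] - 1
	entries := []LogEntry{}
	if rf.matched[idx] {
		prev = rf.matchIndex[idx]
		entries = append(entries, rf.log[prev+1:]...)
	}
	return &AppendEntriesArgs{
		Term: rf.currentTerm, LeaderId: rf.me,
		PrevLogIndex: prev, PrevLogTerm: rf.log[prev].Term,
		Entries: entries, LeaderCommit: rf.commitIndex,
	}
}

func main() {
	rf := &Raft{
		currentTerm: 1,
		log:         []LogEntry{{Term: 0}, {Term: 1}, {Term: 1}},
		nextIndex:   []int{3}, matchIndex: []int{0}, matched: []bool{false},
	}
	fmt.Println(len(rf.constructAppendEntriesArg(0).Entries)) // probing: 0
	rf.matched[0] = true
	fmt.Println(len(rf.constructAppendEntriesArg(0).Entries)) // transfer: 2
}
```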
Finally, the AppendEntries callback: on seeing a request from a previous term, reject it directly; nothing much to say here, just remember to carry your own term back in the reply. In addition, whether `args.Term` is greater than or equal to `rf.currentTerm`, the peer receiving `AppendEntries` must become a Follower and reset its election timer. Also, when matching the logEntry, pay attention to truncation.
```go
func (rf *Raft) AppendEntries(args *AppendEntriesArgs, reply *AppendEntriesReply) { /* ... */ }
```
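A sketch of the callback logic just described, with locking and the timer reset omitted, so this is illustrative rather than the full handler:

```go
package main

import "fmt"

type LogEntry struct{ Term int }

type AppendEntriesArgs struct {
	Term, PrevLogIndex, PrevLogTerm, LeaderCommit int
	Entries                                       []LogEntry
}

type AppendEntriesReply struct {
	Term    int
	Success bool
}

type Raft struct {
	currentTerm, commitIndex int
	log                      []LogEntry
}

// AppendEntries, follower side: reject stale terms (carrying our term
// back), adopt newer terms, check the prevLog match, truncate on
// conflict, append, and advance commitIndex.
func (rf *Raft) AppendEntries(args *AppendEntriesArgs, reply *AppendEntriesReply) {
	reply.Term = rf.currentTerm
	if args.Term < rf.currentTerm {
		reply.Success = false // stale leader: reject
		return
	}
	rf.currentTerm = args.Term // args.Term >= ours: stay/become Follower
	if args.PrevLogIndex >= len(rf.log) ||
		rf.log[args.PrevLogIndex].Term != args.PrevLogTerm {
		reply.Success = false // prevLog doesn't match: leader keeps probing
		return
	}
	// truncate everything after the match point, then append
	rf.log = append(rf.log[:args.PrevLogIndex+1], args.Entries...)
	if args.LeaderCommit > rf.commitIndex {
		rf.commitIndex = args.LeaderCommit
		if last := len(rf.log) - 1; rf.commitIndex > last {
			rf.commitIndex = last
		}
	}
	reply.Success = true
}

func main() {
	rf := &Raft{currentTerm: 1, log: []LogEntry{{Term: 0}}}
	reply := &AppendEntriesReply{}
	rf.AppendEntries(&AppendEntriesArgs{
		Term: 1, PrevLogIndex: 0, PrevLogTerm: 0,
		Entries: []LogEntry{{Term: 1}}, LeaderCommit: 1,
	}, reply)
	fmt.Println(reply.Success, len(rf.log), rf.commitIndex) // → true 2 1
}
```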
CommitIndex Update
This mainly relies on the Leader's `matchIndex[]`. The specific approach is to sort a copy of it in ascending order and take the value at the median position (I came up with this on the spot at the time, and it felt magical). Another important point emphasized in the paper is that only logEntries from the current term can be committed by counting replicas. This prevents the repeated overwriting caused by Leaders coming and going.
```go
func (rf *Raft) checkCommitIndex() { /* ... */ }
```
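The sort-and-take-the-median idea might look like this sketch; a copy is sorted so `matchIndex` itself stays untouched, and the Leader's own slot is assumed to be kept up to date, as noted earlier.

```go
package main

import (
	"fmt"
	"sort"
)

type LogEntry struct{ Term int }

type Raft struct {
	currentTerm, commitIndex int
	log                      []LogEntry
	matchIndex               []int
}

// checkCommitIndex sorts a copy of matchIndex; in ascending order, the
// value at position (n-1)/2 is matched by a majority (everything at or
// after that position). It is committed only if it is from the current
// term, per the paper's safety argument.
func (rf *Raft) checkCommitIndex() {
	sorted := append([]int(nil), rf.matchIndex...)
	sort.Ints(sorted)
	candidate := sorted[(len(sorted)-1)/2]
	if candidate > rf.commitIndex && rf.log[candidate].Term == rf.currentTerm {
		rf.commitIndex = candidate
	}
}

func main() {
	rf := &Raft{
		currentTerm: 1,
		log:         []LogEntry{{Term: 0}, {Term: 1}, {Term: 1}},
		matchIndex:  []int{2, 1, 2}, // leader's own slot included
	}
	rf.checkCommitIndex()
	fmt.Println(rf.commitIndex) // → 2
}
```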
Responding to Votes
Each peer can cast at most one vote per term, but whenever the term advances, `votedFor` is reset to -1, meaning the peer can vote again. This case comes up when responding to a candidate's RequestVote: if the peer has already voted, it would seem it cannot vote again, but if it finds its term is smaller than the candidate's, it must immediately become a Follower, and with `votedFor` cleared it can then vote for that candidate.
```go
func (rf *Raft) RequestVote(args *RequestVoteArgs, reply *RequestVoteReply) { /* ... */ }
```
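A sketch of that voting logic, locking and timer reset omitted. `upToDate` is a hypothetical helper for the paper's up-to-date check (section 5.4.1); the demo reproduces the exact case above: a peer that has already voted still grants a vote to a higher-term candidate.

```go
package main

import "fmt"

type LogEntry struct{ Term int }

type RequestVoteArgs struct {
	Term, CandidateId, LastLogIndex, LastLogTerm int
}

type RequestVoteReply struct {
	Term        int
	VoteGranted bool
}

type Raft struct {
	currentTerm, votedFor int
	log                   []LogEntry
}

// upToDate: the candidate's log is at least as up-to-date as ours if
// its last term is higher, or equal with an index at least as large.
func (rf *Raft) upToDate(lastLogIndex, lastLogTerm int) bool {
	myLast := len(rf.log) - 1
	myTerm := rf.log[myLast].Term
	return lastLogTerm > myTerm || (lastLogTerm == myTerm && lastLogIndex >= myLast)
}

// RequestVote: a stale candidate is rejected outright; a newer term
// clears votedFor (we become a Follower and may vote again); otherwise
// grant the vote only if we haven't voted and the log is up to date.
func (rf *Raft) RequestVote(args *RequestVoteArgs, reply *RequestVoteReply) {
	if args.Term < rf.currentTerm {
		reply.Term, reply.VoteGranted = rf.currentTerm, false
		return
	}
	if args.Term > rf.currentTerm {
		rf.currentTerm = args.Term
		rf.votedFor = -1 // new term: the old vote no longer counts
	}
	reply.Term = rf.currentTerm
	if (rf.votedFor == -1 || rf.votedFor == args.CandidateId) &&
		rf.upToDate(args.LastLogIndex, args.LastLogTerm) {
		rf.votedFor = args.CandidateId
		reply.VoteGranted = true
	}
}

func main() {
	// already voted in term 1, but a term-2 candidate still gets the vote
	rf := &Raft{currentTerm: 1, votedFor: 0, log: []LogEntry{{Term: 0}}}
	reply := &RequestVoteReply{}
	rf.RequestVote(&RequestVoteArgs{Term: 2, CandidateId: 2}, reply)
	fmt.Println(reply.VoteGranted, rf.votedFor) // → true 2
}
```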
This involves the implementation of becomeFollower: change the state, reset the timer, and, based on whether the term was updated, decide whether the peer may vote again.
```go
func (rf *Raft) becomeFollower(term int) { /* ... */ }
```
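A minimal becomeFollower sketch along those lines; the timer reset is left as a comment since the timer mechanics were covered in the previous post, and the `state` string is a simplification of whatever enum the real code uses.

```go
package main

import "fmt"

type Raft struct {
	currentTerm, votedFor int
	state                 string
}

// becomeFollower drops to Follower; votedFor is cleared only when the
// term actually advances, so the peer may vote again in the new term.
// A real implementation would also reset the election timer here.
func (rf *Raft) becomeFollower(term int) {
	rf.state = "Follower"
	if term > rf.currentTerm {
		rf.currentTerm = term
		rf.votedFor = -1
	}
}

func main() {
	rf := &Raft{currentTerm: 1, votedFor: 0, state: "Candidate"}
	rf.becomeFollower(2)
	fmt.Println(rf.state, rf.currentTerm, rf.votedFor) // → Follower 2 -1
}
```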
