Paxos Made Simple: A Guided Reading

Introduction

Paxos is an algorithm you cannot get around in distributed systems, yet it is notoriously hard to understand. I had been avoiding it for a long time, and eventually that avoidance started to bother me. So over the past week I spent my spare time collecting materials and trying different approaches, and finally got a preliminary grasp of it. While it is still fresh in my mind, I am writing my understanding down as a brief summary for future reference.

The inventor of the Paxos algorithm, Leslie Lamport, is one of the founding figures of distributed systems, and many anecdotes surround him. You can catch a glimpse of this from the name Paxos itself: Paxos is a Greek island, on which Lamport set a fictional ancient part-time parliament in order to introduce the consensus problem in distributed systems. After the original paper, The Part-Time Parliament, was published in 1998, many people said they could not understand it. So in 2001 Lamport restated the core ideas in relatively plain language and logic, producing Paxos Made Simple.

The abstract of Lamport’s Paxos Made Simple paper contains only one sentence:

The Paxos algorithm, when presented in plain English, is very simple.

And yet, I could not understand this kind of "simple".

Author: Muniao’s Notes https://www.qtmuniao.com/2021/06/14/paxos/zhixing. Please cite the source when reposting.

Modeling

I have a long-held theory: if you cannot understand something that is supposedly simple, it must be because your approach (your way of modeling it) is wrong. Since the author says the theory is simple, then as long as I can find an explanation that fits my own context, I should certainly be able to understand what he is doing.

So I searched through a lot of materials, and after watching the video Zhixing Academy — Paxos and Distributed Systems, everything suddenly became clear. This video uses the programmer-familiar Client-Server + lock model to explain Paxos concepts and constraints layer by layer, giving me an intuitive preliminary grasp of the problems Paxos aims to solve and the ideas behind its solution. After that, when I went back to read the paper, many parts made sense.

So how does Zhixing Academy break down this consensus algorithm? Below, based on my understanding of the paper, I will briefly summarize the video content, with some modifications. If you haven’t watched the video, I strongly recommend checking it out first.

Note: This article does not cover Learners, nor does it provide a detailed interpretation of the original paper, nor does it discuss the engineering of Paxos. I originally wanted to include all of these in one article, but later found it would be too long, so I’ll split it into a series.

Basic Concepts

As a programmer trying to understand Paxos, you will certainly have two questions: what problem does Paxos solve, and how is Paxos applied to distributed systems in engineering practice?

The Paxos algorithm is used to determine the value of an immutable variable. A single immutable value does not sound very useful for a distributed system. But there is a bridge, the write-ahead log (WAL, also called the operation log), through which we can build a distributed system that externally behaves like a single machine.

On the one hand, operation logs can be viewed as a sequence of immutable operation records. For an immutable single operation record, having multiple machines reach consensus is exactly the problem Paxos solves. With slight extension, multiple machines can reach consensus on determining the sequence of operations. On the other hand, if we have a globally unique operation sequence, each replica can execute this sequence in the same order to build identical state machines, and state machines are highly expressive, capable of solving a large class of system problems.
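As a tiny illustration of the state-machine argument (the log format and the apply function below are made up for illustration, not from the article): replicas that apply the same agreed-upon operation sequence in the same order always end in the same state.

```python
# Minimal sketch of the replicated-state-machine idea. The log entries and
# apply() are hypothetical; only the "same log, same order, same state"
# property matters here.

log = [("set", "x", 1), ("incr", "x", 2), ("set", "y", 5)]  # agreed-upon sequence

def apply(state, op):
    kind, key, arg = op
    if kind == "set":
        state[key] = arg
    elif kind == "incr":
        state[key] = state.get(key, 0) + arg

replica_a, replica_b = {}, {}
for op in log:
    apply(replica_a, op)
    apply(replica_b, op)

assert replica_a == replica_b == {"x": 3, "y": 5}  # identical state machines
```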

Regarding the role of logs in distributed systems, Confluent’s founder Jay Kreps has written a well-researched article that unifies distributed systems, real-time processing systems, and data integration systems from the perspective of logs (write-ahead logs or commit logs or transaction logs). Highly recommended.

Problem Abstraction

Now, let’s return to understanding the Paxos algorithm itself. The video first abstracts a real engineering problem and progressively presents better solutions to break down the constraints of the Paxos algorithm.

Problem Description: Design a system to store a variable named var.

System Roles:

  1. Inside the system, there are multiple Acceptors responsible for storing and managing the var variable.
  2. Outside the system, there are multiple Proposers that concurrently call the system API to submit different values for var.

System API: propose(var, V) → <OK, f> or <Error> where f is the value of var stored in the Acceptor.

System Requirements:

  1. Once the value of var is determined, it cannot be changed, and the value can always be read afterwards.
  2. Can tolerate failures of any Proposer.
  3. Can tolerate failures of less than half of the Acceptors.
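For reference, the API above might be typed roughly as follows (a sketch; the Ok/Err wrapper names are my own, not part of the problem statement):

```python
# Rough shape of the system API: propose(var, V) either succeeds and returns
# f, the value the system now stores for var, or fails.
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class Ok:
    f: Optional[str]   # the value of var stored by the Acceptor(s)

@dataclass
class Err:
    reason: str

def propose(var: str, V: str) -> Union[Ok, Err]:
    ...  # to be filled in by the solutions below
```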

As with an algorithm problem, we can first simplify the problem to work out the basic idea, then generalize step by step until we arrive at a solution to the original problem.

Solution 1

Assume the system consists of a single Acceptor.

To handle concurrent calls from multiple Proposers, the simplest approach is to use a mutual exclusion lock on the Acceptor and proceed in two phases:

Prepare phase:

A Proposer obtains the Acceptor’s lock and the current value of var, f, via Acceptor::prepare. If the lock is already held, it aborts.

Accept phase:

If f is null, the Proposer submits its value V via Acceptor::accept(var, V).

If f is not null, the Proposer releases the lock via Acceptor::release().
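To make the two phases concrete, here is a minimal in-process sketch of Solution 1. The method names follow the text; the scaffolding and return values are my own:

```python
# Sketch of Solution 1: one Acceptor guarded by a mutual-exclusion lock.

class Acceptor:
    def __init__(self):
        self.locked = False
        self.var = None                  # f: stays None until determined

    def prepare(self):
        if self.locked:
            return None                  # lock held by someone else: abort
        self.locked = True
        return ("OK", self.var)          # grant the lock, return current f

    def accept(self, V):
        self.var = V                     # only called while holding the lock
        self.locked = False

    def release(self):
        self.locked = False              # f already set: just return the lock

def propose(acceptor, V):
    resp = acceptor.prepare()            # prepare phase: lock + read f
    if resp is None:
        return "Error"
    _, f = resp
    if f is None:                        # accept phase: f unset, write our value
        acceptor.accept(V)
        return ("OK", V)
    acceptor.release()                   # f already set: leave it unchanged
    return ("OK", f)
```

Note that if the Proposer crashes between prepare and accept, locked stays True forever, which is exactly the problem described next.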

The problem with this solution: it cannot tolerate Proposer machine failures. When a Proposer calls Acceptor::prepare to acquire the lock and then crashes, the lock will be held indefinitely, rendering the Acceptor unavailable.

Solution 2

Solving the Proposer failure problem.

Proposers are numerous, and occasional failures among them are inevitable. Since we cannot control the life and death of Proposers, we can only work on the lock itself, for example by giving the lock a timeout, or by making it preemptible. For the former, the timeout threshold is hard to tune: too long and performance suffers, too short and there may be frequent retries. The latter is better: the old lock becomes invalid only when a new Proposer actually requests the lock.

Let’s elaborate on the design of preemptible locks.

Preemption inevitably introduces a priority issue: a higher-priority Proposer can preempt the lock of a lower-priority Proposer. We use the simplest priority rule: each Proposer must first apply for a number n (global logical clock, a cornerstone of distributed systems) when requesting the lock; the Proposer with the larger number has higher priority. Here I don’t use the term epoch from the video, but rather the term n from the paper, though they mean the same thing. This number can be assigned by a global number issuer to ensure monotonic increase; or it can simply use timestamps, though there is the issue of clock synchronization across multiple machines.
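If neither a global issuer nor synchronized clocks are at hand, one common trick, which is my addition rather than something from the video, is to interleave a unique Proposer id into a local round counter, so numbers are unique and grow on every retry:

```python
# Unique, ever-increasing numbers without a global issuer: two Proposers can
# never produce the same n, and each retry produces a strictly larger n.
def make_n(round_counter: int, proposer_id: int, num_proposers: int) -> int:
    return round_counter * num_proposers + proposer_id

# Proposer 2 of 3 on its first round, then on a retry after being preempted:
assert make_n(1, 2, 3) == 5
assert make_n(2, 2, 3) == 8
```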

The video also mentions another problem: after Proposer2 preempts Proposer1's lock, it finds that the Acceptor's var has already been set. May Proposer2 modify it? I think this is not a problem in the single-Acceptor case, as long as we always follow the rule that once the Acceptor's var is set, it can never be modified.

Solution 2 is largely the same as Solution 1; the Acceptor only needs to keep one additional piece of state: latest-n, the number of the Proposer currently granted the lock. In both phases, it first compares the Proposer's number proposer-n with the saved latest-n to decide whether to reject or serve the request. In addition, there is no longer any need to explicitly release the lock through an API call; the lock simply lapses when a higher number arrives.
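A sketch of the Solution 2 Acceptor, following the rejection rule just described (still a single Acceptor; the scaffolding is mine):

```python
# Sketch of Solution 2: the lock is preemptible. A request with a higher
# number n implicitly invalidates the previous holder's lock.

class Acceptor:
    def __init__(self):
        self.latest_n = 0                # number of the current lock holder
        self.var = None

    def prepare(self, n):
        if n < self.latest_n:
            return None                  # Error: a higher number holds the lock
        self.latest_n = n                # grant (possibly preempting) the lock
        return ("OK", self.var)

    def accept(self, n, V):
        if n < self.latest_n:
            return None                  # Error: preempted between the phases
        if self.var is None:
            self.var = V                 # once set, var is never modified
        return ("OK", self.var)
```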

The problem with this solution: a single Acceptor failure will cause the system to be unable to provide service.

Solution 3

Introducing multiple Acceptors on top of Solution 2.

Here we assume the Acceptor cluster has a fixed size. When extending Solution 2 to this case, several issues arise:

  1. How does the Acceptor cluster determine a value? A value counts as determined once more than half of the Acceptors have set their var to that same value. This means an individual Acceptor that first accepted a minority value must later be able to re-accept the eventual majority value, so an Acceptor needs the ability to accept values more than once.
  2. During the prepare phase, if a Proposer finds that some Acceptors already hold values, can it simply release the lock and walk away? Not necessarily: with multiple Acceptors, the process only concludes once more than half of them hold the same value, so until some value reaches a majority the Proposer must continue into the accept phase. But which value should it propose: one picked at random from the values returned in phase one, or one picked by some rule? To converge quickly, we choose the value that was accepted with the largest number.
  3. How does a Proposer acquire the "lock" of the whole Acceptor cluster in the prepare phase? By obtaining the lock from more than half of the Acceptors. Since any two majorities must overlap (a pigeonhole argument; see the quick check after this list), at most one Proposer can hold the cluster lock for a given number n.
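The overlap fact behind point 3 is easy to sanity-check in a few lines (a toy check, not part of the protocol):

```python
# Any two majority subsets of a 5-Acceptor cluster share at least one member,
# so for a given n only one Proposer can hold a majority of the locks.
from itertools import combinations

acceptors = range(5)
majorities = [set(c) for c in combinations(acceptors, 3)]
assert all(a & b for a in majorities for b in majorities)
```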

After resolving the above issues, the final solution is within reach:

For the Proposer:

Prepare phase: obtain a number n and send a prepare(n) request to the Acceptor cluster. If it does not receive OK from more than half of the Acceptors, abort. If it does, and some responses carry non-null values, select as v the value with the largest accepted number; if no response carries a value, it may choose any value v. Then initiate the accept request.

Accept phase: send an accept(n, v) request to the Acceptor cluster with the value v selected in the previous phase. If it receives OK from more than half of the Acceptors, the cluster has successfully accepted v. Otherwise, the Proposer may have been preempted by a higher-numbered Proposer, or some Acceptors may have failed.

Note 1: In neither phase is it necessary to send requests to all Acceptors in the cluster; it is sufficient to select a majority subset. Moreover, the subsets of Acceptors chosen in phase one and phase two need not be the same.

Note 2: In the first phase, we emphasized only the role of acquiring the logical “lock” (fast failure). Another important role is to obtain (read) the previously accepted values.
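Here is a sketch of the Proposer logic under the simplifying assumption of synchronous, in-process calls to Acceptor objects shaped like the sketch below (the function name and return conventions are mine; a real system would use RPCs, timeouts, and retries with a larger n):

```python
# Sketch of the Solution 3 Proposer. prepare() returns
# (accepted_n, accepted_v) or None; accept() returns a bool.

def run_proposer(acceptors, n, my_value):
    majority = len(acceptors) // 2 + 1

    # Prepare phase: acquire the cluster "lock" and read accepted values.
    oks = [r for r in (a.prepare(n) for a in acceptors) if r is not None]
    if len(oks) < majority:
        return None                              # abort: lock not acquired

    # Choose the value accepted with the largest number, else our own.
    accepted = [(an, av) for (an, av) in oks if av is not None]
    v = max(accepted)[1] if accepted else my_value

    # Accept phase: the cluster accepts v once a majority says OK.
    acks = sum(1 for a in acceptors if a.accept(n, v))
    return v if acks >= majority else None       # preempted, or failures
```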

For the Acceptor:

States to maintain: the currently accepted number and value, <accepted-n, accepted-v>, and the largest number for which the lock has been granted: latest-n.

Prepare phase: Upon receiving a Proposer’s prepare(n) request, if latest-n > n, return Error. Otherwise, return <OK, accepted-n, accepted-v>, and update latest-n to n.

Accept phase: Upon receiving a Proposer’s accept(n, v) request, if latest-n > n, return Error. Otherwise, accept the request, update latest-n, accepted-n, and accepted-v, and return OK.
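And a matching sketch of the Acceptor, with a toy run of both sketches at the end (again illustrative only, following the rules just stated):

```python
# Sketch of the Solution 3 Acceptor. Compared to Solution 2 it tracks
# <accepted_n, accepted_v>, and accept() may overwrite an earlier value.

class Acceptor:
    def __init__(self):
        self.latest_n = 0        # largest n for which the lock was granted
        self.accepted_n = 0      # number attached to the accepted value
        self.accepted_v = None   # currently accepted value, None if unset

    def prepare(self, n):
        if self.latest_n > n:
            return None                          # Error: preempted
        self.latest_n = n
        return (self.accepted_n, self.accepted_v)

    def accept(self, n, v):
        if self.latest_n > n:
            return False                         # Error: preempted
        self.latest_n = n
        self.accepted_n, self.accepted_v = n, v  # may replace a minority value
        return True

# Toy run: once "A" is chosen, a later Proposer is forced to re-propose it.
cluster = [Acceptor() for _ in range(5)]
assert run_proposer(cluster, n=1, my_value="A") == "A"
assert run_proposer(cluster, n=2, my_value="B") == "A"
```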

Summary

After watching the Zhixing Academy video and combining it with the summary above, you should be able to read the Paxos Made Simple paper with good comprehension. Afterwards I wondered: what makes the paper hard to read directly? On the one hand, the paper provides little setup and starts reasoning right away (to be fair, the author does give some background), while we do not yet have a suitable mental model for the concepts it mentions. On the other hand, the paper is organized in reverse: starting from the conclusion, it gradually derives the conditions that need to be satisfied, and finally combines all the conditions. Both of these make it difficult to pick the paper up and read it cold.

Finally, let me summarize the key points for understanding Paxos once more:

  1. Understand the original purpose of Paxos: multiple Acceptors reaching consensus on a single immutable value.
  2. Use the engineering Client-Server + lock model as an aid to understanding.
  3. Dividing the algorithm into two phases enables fast failure.
  4. The number n is introduced to avoid deadlock (a crashed Proposer holding the lock forever) and to order preemption.
  5. Choosing the value with the largest accepted number lets the accept process converge quickly. Why the largest rather than the smallest? Because, by mathematical induction, once some value may already have been chosen by a majority, every later Proposer with a higher number is guaranteed to see that value in its prepare responses and re-propose it, so a chosen value is never overwritten.

References

  1. Paper: http://www.scs.stanford.edu/20sp-cs244b/sched/readings/paxos_made_simple.pdf
  2. Translation: https://www.cnblogs.com/yaodd/p/6150498.html
  3. Zhixing Academy — Paxos and Distributed Systems: https://www.bilibili.com/video/BV1Lt411m7cW

