木鸟杂记

Large-Scale Data Systems

Overview

Dynamo is a highly available KV storage system. To ensure high availability and high performance, Dynamo adopts an eventual consistency model, providing developers with a novel API that uses versioning and resolves conflicts with client-side assistance. Dynamo aims to provide uninterrupted service while guaranteeing performance and scalability. Since Amazon widely adopts a decentralized, highly decoupled microservices architecture, the availability requirements for the storage system underlying these microservices are particularly high.

S3 (Simple Storage Service) is another well-known storage service from Amazon. Although it can also be understood as a KV store, its target scenario differs from that of Dynamo. S3 is an object storage service for large files, mainly storing binary files and not providing cross-object transactions. Dynamo, on the other hand, is a document storage service for small files, mainly storing structured data (such as JSON). It allows indexing data and supports transactions across data items.

Compared to traditional relational databases, Dynamo can be seen as providing only a primary key index, thereby achieving higher performance and better scalability.

To achieve scalability and high availability while ensuring eventual consistency, Dynamo combines the following techniques:

  1. Consistent hashing for data partitioning and replication.
  2. Versioning mechanism (Vector Clock) to handle data consistency issues.
  3. Quorum reads and writes, plus a decentralized anti-entropy protocol (based on Merkle trees), to keep replicas consistent.
  4. Gossip Protocol for failure detection and replica maintenance.
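To make the versioning mechanism concrete, here is a toy vector-clock sketch in Python. It is purely illustrative (the replica names `Sx`, `Sy`, `Sz` and the function names are made up, not Dynamo's actual code): each replica bumps its own counter on a write, and two versions conflict exactly when neither clock dominates the other, at which point the conflicting versions are handed back to the client to reconcile.

```python
def vc_increment(clock, node):
    """Return a copy of `clock` with `node`'s counter bumped by one."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new

def vc_compare(a, b):
    """Compare two vector clocks: 'before', 'after', 'equal', or 'concurrent'."""
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # conflicting versions: the client must reconcile

v1 = vc_increment({}, "Sx")   # first write, handled by replica Sx
v2 = vc_increment(v1, "Sy")   # update routed through replica Sy
v3 = vc_increment(v1, "Sz")   # concurrent update routed through replica Sz
print(vc_compare(v1, v2))     # 'before' (v2 supersedes v1)
print(vc_compare(v2, v3))     # 'concurrent' (conflict, client-side merge)
```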

In terms of implementation, Dynamo has the following characteristics:

  1. Fully decentralized, with no central node; all nodes are peers.
  2. Adopts eventual consistency, using versioning to resolve conflicts, and sometimes requires clients to take part in conflict resolution.
  3. Uses hashed keys to partition data, organize its distribution, and balance load.
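A minimal sketch of the hash-based partitioning idea, assuming a consistent-hash ring with virtual nodes (the class and node names here are hypothetical, not Dynamo's real implementation): each key is owned by the first node clockwise of its hash, so adding or removing a node only remaps the keys in that node's arcs.

```python
import bisect
import hashlib

def _h(s):
    """Hash a string to a point on the ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=100):
        # Each physical node gets `vnodes` points on the ring to even out load.
        self._points = sorted((_h(f"{n}#{i}"), n)
                              for n in nodes for i in range(vnodes))
        self._hashes = [h for h, _ in self._points]

    def owner(self, key):
        """Return the node owning `key`: the first point clockwise of its hash."""
        i = bisect.bisect(self._hashes, _h(key)) % len(self._points)
        return self._points[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.owner("user:42"))  # same key always maps to the same node
```

Removing a node leaves every other key's owner unchanged, which is exactly the rebalancing property the list above refers to.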
Read more »


When I first learned “Let’s Swing the Oars” as a child, I only found the melody catchy. As I grew older and hummed it now and then, those few simple lines revealed boundless imagery. Later, while studying in Beijing, I visited Beihai Park and saw exactly “the beautiful White Pagoda reflected on the lake, surrounded by green trees and red walls.” Time flies, yet the vitality of the written word remains unchanged.

The lyrics were written by Mr. Qiao Yu, who also wrote many other well-known classics: “My Motherland,” “Unforgettable Tonight,” and “Love My China.” The lyrics fall into three stanzas that build layer by layer. The first stanza describes the rowing scene: in just a few lines it links beginning and end, moves from near to far, and sketches a fourfold panorama. The second stanza voices the joy: childlike exuberance and a light heart, painting full-bodied childish delight. The third stanza rises further still, asking where such beautiful scenery, such a life, such an era came from, then stops abruptly: words exhausted, meaning infinite.

Read more »

[Figure: 6.824 course schedule]

MIT finally released the lecture videos on YouTube this year. I had followed about half of this course before, and this year I plan to watch the videos and take some notes. The course is structured around the fundamentals of distributed systems—fault tolerance, replication, and consistency—and uses carefully selected industrial-strength system papers as the main thread, supplemented with extensive reading materials and well-crafted labs, bridging academic theory and industrial practice. It is truly an excellent course on distributed systems. Course videos: YouTube, Bilibili. Course materials: 6.824 homepage. This post covers the sixth lecture, the first part of the Raft paper lecture, mainly summarizing several types of fault tolerance and leader election in Raft.
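As a small taste of the leader-election material, here is a toy Python sketch (my own illustration, not the course's lab code; all names are made up) of Raft's randomized election timeouts: each follower draws its timeout from a range, so one node usually times out first, becomes a candidate, and can win before anyone else even starts an election.

```python
import random

def election_timeouts(node_ids, lo_ms=150.0, hi_ms=300.0, seed=None):
    """Draw each node's election timeout uniformly from [lo_ms, hi_ms]."""
    rng = random.Random(seed)
    return {n: rng.uniform(lo_ms, hi_ms) for n in node_ids}

def first_candidate(timeouts):
    """The node whose timer fires first starts the next election."""
    return min(timeouts, key=timeouts.get)

timeouts = election_timeouts(["s1", "s2", "s3"], seed=2024)
print(first_candidate(timeouts))  # the node that drew the smallest timeout
```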

Read more »


MIT finally released the lecture videos on YouTube this year. I had followed about half of this course before, and this year I plan to watch the videos and write some lecture notes. The course follows the thread of distributed systems fundamentals: fault tolerance, replication, and consistency, with carefully selected industrial-grade system papers as the main line, supplemented with detailed reading materials and well-crafted labs, bridging academic theory and industrial practice. It is truly a rare gem of a distributed systems course. Course videos: YouTube, Bilibili. Course materials: 6.824 homepage. This post covers the fifth lecture, which has two parts. In the first, a TA walks through the Go primitives, design patterns, and practical techniques needed for lab 2: the memory model, goroutines and closures, the time package, locks, condition variables, channels, signaling, parallelism, and some common tools. In the second, two other TAs survey common bugs in Raft and how to debug them.

Read more »


MIT has finally released the lecture videos on YouTube this year. I followed about half of this course before, and this year I plan to watch the videos and write some lecture notes. The course follows the thread of distributed systems fundamentals: fault tolerance, replication, and consistency, using carefully selected industry-grade system papers as the main line, supplemented with detailed reading materials and well-crafted labs, bridging academic theory and industrial practice. It is truly an excellent distributed systems course. Course videos: YouTube, Bilibili. Course materials: 6.824 homepage. This post is the notes for the fourth lecture, VM-FT.

Replication — Fault Tolerance

Failure

How do we define failure? From the perspective of other machines, a failed node simply stops providing service.
The remedy is backup, i.e. replication.
Replication can handle fail-stop failures, such as a CPU overheating and shutting down, power loss to the host or network, or disk space running out.
It cannot handle correlated failures (where primary and backup fail at the same time), such as software bugs or human configuration errors.

Prerequisite

One assumption required for primary-backup replication to work is that the failure probabilities of the primary and the backup are independent.
Counterexample: machines from the same batch, or machines on the same rack, have strongly positively correlated failure probabilities.

Is It Worth It

It is necessary to consider the business scenario and the required cost to determine whether replication is truly needed. For example, bank data needs multiple backups, while a course website may not.

Read more »

Objective

To fully leverage the performance of modern SSD storage, and under the same API, significantly reduce the read/write amplification of LSM-trees to improve their performance.

Background

On traditional disks, sequential I/O performance is roughly 100x that of random I/O. LSM-trees build on this by implementing massive KV random reads/writes as in-memory random access + sequential flushing to disk + periodic merging (compaction), thereby improving read/write performance. This is especially suitable for scenarios with more writes than reads and strong temporal locality (recent data is accessed most frequently).
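The write path above can be sketched in a few lines of Python. This is a toy model under obvious simplifications ("disk" is just a list of sorted runs kept in memory, and there is no compaction), not real storage-engine code:

```python
class TinyLSM:
    """Toy LSM-tree: buffer writes in memory, flush them as sorted runs."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}   # in-memory random access
        self.runs = []       # "on-disk" sorted runs, oldest first
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self._flush()

    def _flush(self):
        # One sequential write: dump the whole memtable as a sorted run.
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:          # newest data first
            return self.memtable[key]
        for run in reversed(self.runs):   # then runs, newest to oldest
            for k, v in run:
                if k == key:
                    return v
        return None
```

A real engine would binary-search each run, keep per-run filters, and merge runs in the background (compaction), which is exactly where the read/write amplification targeted by the paper comes from.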

[Figure: LSM-tree structure, from the WiscKey paper]

Read more »

My blog was originally hosted on GitHub Pages, but Baidu’s crawler was apparently so aggressive that GitHub banned it. According to MarketMeChina data, Baidu still holds more than half of mainland China’s search-engine market:

  • Baidu: 67.09%
  • Sogou: 18.75%
  • Shenma: 6.84%
  • Google: 2.64%
  • Bing: 2.6%
  • Other: 2.08%

As a Chinese-language blog, I still hope to reach more domestic readers, so I went looking for a way to get Baidu’s crawler to index my blog automatically. While browsing other blogs, I saw someone recommend zeit.co as a hosting platform. After trying it out, I found it an excellent platform for static-site hosting plus CI and serverless functions, and I recommend it here.

Read more »


MIT has finally released the lecture videos on YouTube this year. I followed about half of this course before, and this year I plan to watch the videos and write some lecture notes. The course takes the fundamentals of distributed systems (fault tolerance, replication, and consistency) as its main thread, uses carefully selected industrial-strength system papers as its backbone, and supplements them with detailed reading materials and well-crafted labs, bridging academic theory and industrial practice. It is truly a rare and excellent course on distributed systems. Course videos: YouTube, Bilibili. Course materials: 6.824 homepage. This post is the lecture notes for the third class, GFS.

Overview

Storage is a very critical abstraction with a wide range of uses.

The GFS paper also touches on many issues related to fault tolerance, replication, and consistency.

GFS itself is a very successful production system inside Google. Its key ideas are well organized into a single academic paper that covers many issues from hardware to software, and it is well worth studying.

For a detailed understanding of GFS, you can also check out my earlier GFS paper notes.

Read more »


MIT finally released the lecture videos on YouTube this year. I had followed about half of this course before, and this year I plan to watch through the videos and take some lecture notes. The course is structured around the fundamentals of distributed systems—fault tolerance, replication, and consistency—using carefully selected industrial-grade system papers as the backbone, supplemented with detailed reading materials and well-crafted labs, bridging academic theory and industrial practice. It is truly an excellent distributed systems course. Course videos: YouTube, Bilibili. Course materials: 6.824 homepage. This post is the notes for the second lecture, RPC and threads.

Why Go

  1. Modern syntax, with language-level support for threads (goroutines) and channels, and good support for locking and synchronization between threads.
  2. Type safe and memory safe, making it hard to write bugs like the out-of-bounds memory accesses common in C++.
  3. Garbage collected (GC), so there is no manual memory management. This matters especially in multi-threaded programs, where it is easy to share a piece of memory and lose track of who still references it.
  4. Concise and intuitive, without C++'s many complex language features, and with friendly error messages.
Read more »


MIT finally released lecture videos on YouTube this year. I had followed about half of this course before, and this year I plan to watch the videos and take notes. The course is structured around distributed systems fundamentals—fault tolerance, replication, and consistency—and uses carefully selected industrial-grade system papers as its backbone, supplemented with extensive reading materials and well-designed labs, bridging academic theory and industrial practice. It is truly an exceptional course on distributed systems. Course videos: YouTube, Bilibili. Course materials: 6.824 homepage. This post covers the first lecture: Introduction.

Course Background

Reasons for building distributed systems:

  1. Parallelism: harnessing resources in parallel (improving efficiency).
  2. Fault tolerance.
  3. Physical: inherent physical distribution of the system.
  4. Security: tolerating untrusted participants (e.g. blockchains).

Challenges faced by distributed systems:

  1. Concurrency: many components, complex parallelism, intricate interactions.
  2. Partial failure: parts of the system can fail while the rest keeps running, unlike a single machine, which either works or crashes outright.
  3. Performance: careful design is required to achieve performance linearly proportional to the number of machines.
Read more »

This article originated from Lu Pan’s personal blog post: https://blog.the-pans.com/cap/, a Chinese translation of Martin Kleppmann’s article “Please stop calling databases CP or AP” authorized by Martin himself. However, many sentences in that translation read rather oddly. I compared it with the English original, and re-translated it sentence by sentence according to my own understanding. This article discusses why we should not abuse the concept of the CAP theorem; it is well-documented and penetrating, well worth reading. Even more commendable is that Martin cites sources for all key viewpoints in the text, and recommends some learning materials at the end—all excellent reading. Below is the main text.

In Jeff Hodges’s excellent blog post Notes on Distributed Systems for Young Bloods, he suggests we use the CAP theorem to evaluate systems. Many people have followed this advice, calling their systems “CP” (providing consistency but unavailable during network partitions), “AP” (highly available but inconsistent during network partitions), or simply “CA” (indicating they haven’t read Coda’s article from five years ago).

I agree with all of Jeff’s other points, but I cannot endorse his advice about the CAP theorem. The CAP theorem is too simplistic, and too widely misunderstood, to be of much use when characterizing systems. So I ask everyone to please stop citing the CAP theorem and stop discussing it; instead, we should use more precise terms to express the trade-offs in our system designs.

(Of course, it’s ironic that while I don’t want others to discuss this topic anymore, I am myself writing an article about it. But at least from now on, when someone asks me why I don’t like discussing the CAP theorem, I can send them the link to this article. Also, sorry if this article comes across as a bit of a rant, but at least these rants come with citations.)

Read more »

[Figure: CAP consistency example]

Introduction

I was once asked in an interview to explain my understanding of CAP. Relying on second-hand information I had googled while preparing, I stammered out a few words like a schoolchild reciting a textbook. The interviewer smiled wryly, concisely summarized the points he had been looking for, and left me deeply embarrassed. Naturally, I didn’t get the job. Later, at work, I came across the term now and then. At first I couldn’t grasp its essence, but after studying and verifying it from different sources and angles, I gradually pieced together a clearer picture. Here I want to organize my thoughts and set that picture down on paper for future reference.

The interviewer roughly summarized it this way: in a distributed system, failures are inevitable and partition tolerance (P) is a must, so when designing a system you have to trade off between availability (A) and consistency (C). The lesson left a deep impression on me at the time. Looking back now, that summary only points out the tip of the iceberg.

Read more »