木鸟杂记

Large-Scale Data Systems

Recently I’ve been in the mood to read up on China’s economy, and a friend recommended this book. I raced through it on WeChat Reading in about a week; hasty as the read was, it was thoroughly exhilarating. I had read some chapters of Wu Xiaobo’s Turbulent Thirty Years before, and this book is its sequel, consistent in style, but it resonated with me more deeply, because this was precisely the golden decade in the lives of us first-generation post-90s. As a gradually awakening firsthand witness, the sense of reliving it all is very strong.

Overview

Wu Xiaobo excels at grand narratives and emotional buildup. In this book, he panoramically recreates the turbulent ups and downs of China’s economy and enterprises from 2008 to 2018. The title “Big Water, Big Fish” seems tacky at first glance, but after finishing the book and reflecting on it, it’s actually quite vivid. The relationship between market and enterprise is precisely that of fish and water—the vast water accommodates the growth and struggle of fish, while the big fish makes the water vibrant and magnificent.

This decade was also a stage of my own life, the one that took me from an ignorant teenager to a young adult entering society. As a firsthand witness to this period of economic history, there were phenomena I paid little attention to at the time, and others I could not make sense of. Writing in the manner of a chronicle, from a higher vantage point, the author follows the dual dimensions of time and space, weaves the various threads together organically, and reveals some little-known behind-the-scenes stories of the era. Through these hidden currents beneath the iceberg, re-examining this history, we can faintly trace some of its underlying patterns.

Read more »

Of all the scientists who have studied influence, Cialdini has had the greatest impact on me… This bestseller lays out a handful of methods so that your own cleverness will no longer cost you your best interests.
— Charlie Munger, Warren Buffett’s right-hand man

Humans all carry certain mental sets, or fixed action patterns. Once triggered, we act as if someone pressed the play button, and the behavior that follows may be beyond our control. All of this stems from the human tendency to avoid thinking. In primitive times, avoiding thinking did bring real benefits: it saved time and spared us the life-threatening cost of hesitation. But today, that same instinct has become the easiest lever for compliance practitioners to pull.

In fact, whatever the compliance tactic, the core countermeasure is the same: think. First, detach yourself from the emotion of the moment; second, figure out what you truly want, and distinguish appearance from substance.

Read more »

Overview

The Bw-tree is a data structure proposed in a paper Microsoft published in 2013. Motivated by the increasing prevalence of multi-core machines and SSDs, and combining the characteristics of the two major storage-engine families, the B+-tree and the LSM-tree, the paper proposes a latch-free, delta-updated, log-structured member of the B-tree family: the Bw-tree.

Since the original paper is vague on implementation details, several authors from CMU implemented an in-memory version of the Bw-tree in 2015, then published a follow-up paper to fill in some of those details, open-sourcing the code on GitHub as OpenBwTree.

As usual, here is a summary (TL;DR). The main characteristics of Bw-tree are:

  1. Overall three layers: Bw-tree index layer, cache control layer, and Flash storage layer.
  2. Bw-tree is globally a B+ tree, while borrowing ideas from the B-link tree: each node has a side pointer to its right sibling.
  3. At the individual tree node level, Bw-tree resembles an LSM-tree-like log structure: each logical node consists of a base node + a delta records chain.
  4. The core data structure enabling the Bw-tree’s latch-free operation is the Mapping Table: updates are installed via CAS, and because all references are routed through the table, swinging a single mapping entry atomically redirects every logical pointer to that page.
  5. The Bw-tree’s flash layer likewise uses a log-structured store (append-only) to manage the physical storage of logical pages (base pages and delta records).
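To make the Mapping Table idea concrete, here is a toy, single-threaded sketch of the indirection it provides. All names are illustrative; a real Bw-tree installs deltas with an atomic hardware CAS instruction, whereas the `compare_and_swap` below merely check-then-sets to show the shape of the protocol.

```python
# Illustrative sketch of a Bw-tree mapping table with delta chains.
# NOT a real latch-free implementation: compare_and_swap stands in
# for an atomic hardware CAS on the slot.

class DeltaRecord:
    """A delta describing one update, prepended to a logical page's chain."""
    def __init__(self, key, value, next_node):
        self.key, self.value, self.next = key, value, next_node

class BasePage:
    """The consolidated base node at the tail of the chain."""
    def __init__(self, records):
        self.records = dict(records)

class MappingTable:
    """Maps logical page IDs to the head of each page's delta chain."""
    def __init__(self):
        self.slots = {}

    def compare_and_swap(self, pid, expected, new_head):
        # Stand-in for an atomic CAS on the slot.
        if self.slots.get(pid) is expected:
            self.slots[pid] = new_head
            return True
        return False  # another writer installed first; caller retries

def lookup(table, pid, key):
    # Walk the delta chain; the newest delta for a key wins.
    node = table.slots[pid]
    while isinstance(node, DeltaRecord):
        if node.key == key:
            return node.value
        node = node.next
    return node.records.get(key)

table = MappingTable()
pid = 1
table.slots[pid] = BasePage({"a": 1, "b": 2})

# Install an update as a delta without rewriting the base page:
head = table.slots[pid]
assert table.compare_and_swap(pid, head, DeltaRecord("a", 42, head))
print(lookup(table, pid, "a"))  # the delta shadows the base page
print(lookup(table, pid, "b"))  # falls through to the base page
```

Because every reference goes through the table, installing the delta redirects all readers of page 1 at once, with no in-place modification of the base node.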
Read more »

In the tech world, distributed systems have become an increasingly unavoidable term. The reason is that the scale of data in this era no longer matches the storage and processing capacity of a single machine. Thus, there are two approaches: building larger machines and connecting machines together. The former is costly and inflexible, so the latter is gaining more favor. According to the law of conservation of cost, cost does not simply disappear—when hardware costs go down, software design costs go up. Distributed systems theory is the key to reducing this software cost.

What It Is

Leslie Lamport [1], a pioneer of distributed systems, wrote in one of his most important papers, “Time, Clocks, and the Ordering of Events in a Distributed System” [2]:

A system is distributed if the message transmission delay is not negligible compared to the time between events in a single process.

Lamport explained this with a thought process akin to relativity. Consider two time scales: the message transmission delay between processes, and the interval between events within a single process. If the former is not negligible compared to the latter, then this group of processes constitutes a distributed system.

Understanding this definition requires grasping a few key concepts (formal definitions are always like this, shrug): process, message, and event. To avoid infinite recursion, I won’t go too deep here, but here’s an intuitive analogy: a process is a worker responsible for getting things done; the work can be broken down into multiple steps, each step is an event, and messages are the way workers communicate with each other.

This also aligns with the definition of distributed computing [3] (distributed systems are also called distributed computing) given by Wikipedia:

  1. There are several autonomous computational entities (computers or nodes), each of which has its own local memory.
  2. The entities communicate with each other by message passing.

This involves the most important kinds of resources in computer systems: computation (the autonomous computational entities), storage (their local memory), and the network (the message passing) that connects them.

In summary, we can describe distributed systems from another angle:

Externally, a distributed system appears as a unified whole, providing specific functionality based on aggregate storage and computing power.

Internally, a distributed system appears as a group of individuals that communicate via network messages and collaborate through division of labor.

The design goal of a distributed system is to maximize overall resource utilization while handling local failures and maintaining external availability.

Read more »

Overview

Facebook TAO[1], short for The Associations and Objects, uses nodes (Objects) and edges (Associations) — the most fundamental abstractions in a “graph” — making the name fitting for Facebook’s graph storage.

In summary, TAO is Facebook’s solution for updating ultra-large data and reading associations in social scenarios. Its core characteristics are:

  1. Provides a graph API specialized for Facebook’s social feed scenarios, such as point queries, first-degree association queries, and time-based range queries.
  2. Two-layer architecture: MySQL as the storage layer and memcache as the caching layer; the caching layer is further split into a leader tier and follower tiers.
  3. Can be extended across multiple data centers, highly optimized for read performance, and provides only eventual consistency guarantees.
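The graph API in point 1 can be pictured with a toy in-memory model. The call names loosely follow those described in the paper (`obj_add`, `assoc_add`, `assoc_range`, `assoc_count`), but this is a simplified, hypothetical sketch, not Facebook's actual API or semantics.

```python
# Toy in-memory model of TAO's objects-and-associations interface.
# Illustrative only: association lists are kept newest-first, mirroring
# TAO's time-ordered adjacency lists.

class TinyTAO:
    def __init__(self):
        self.objects = {}   # object id -> data dict
        self.assocs = {}    # (id1, atype) -> [(time, id2, data), ...] newest first

    def obj_add(self, oid, data):
        self.objects[oid] = data

    def assoc_add(self, id1, atype, id2, time, data=None):
        lst = self.assocs.setdefault((id1, atype), [])
        lst.append((time, id2, data))
        lst.sort(key=lambda t: -t[0])   # keep newest-first order

    def assoc_range(self, id1, atype, pos, limit):
        # First-degree association query: a window of the newest edges.
        return self.assocs.get((id1, atype), [])[pos:pos + limit]

    def assoc_count(self, id1, atype):
        return len(self.assocs.get((id1, atype), []))

tao = TinyTAO()
tao.obj_add(1, {"type": "user", "name": "alice"})
tao.assoc_add(1, "FRIEND", 2, time=100)
tao.assoc_add(1, "FRIEND", 3, time=200)
print(tao.assoc_range(1, "FRIEND", 0, 10))  # newest friendships first
print(tao.assoc_count(1, "FRIEND"))
```

The newest-first ordering is what makes feed-style queries ("the latest N comments on this post") cheap point lookups rather than scans.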
Read more »

Having walked many roads, as time passes, memories gradually fade. Fortunately, photography captures fleeting moments, evoking long-sealed sentiments. This afternoon, with some leisure time, I compiled a few, strolling through time, slightly soothing my restless mood.

Beijing

Studying in the capital for more than a decade, this is a place with too many memories, naturally warranting its own album. However, the photos were mostly taken in the last two years and cannot capture one-tenth of the capital’s beauty. The stateliness of the Forbidden City, the quiet accumulation of the National Library’s ancient books collection, the leisure of Beihai, the solemnity of the Confucius Temple, the liveliness of the Tucheng ruins, the desolation of the Great Wall, the development of Haituo Mountain — too many to record.

Overlooking the Forbidden City from Jingshan

Read more »

Distributed systems have many classic patterns, also known as design patterns. Each pattern solves a classic category of problems; with enough accumulated knowledge, one can make variations and trade-offs to design architectures that fit specific requirements. However, there don’t seem to be many shared experiences in this area, so I plan to summarize some lessons from work and study—partly as notes, and partly hoping they may be helpful to others. Given space constraints and my own limitations, it is difficult to be exhaustive or precise. Corrections are welcome for any inaccuracies.

Each post will analyze a topic through background overview, architecture modules, and summary/extensions. This is the first post: the Master-Workers architecture.

Overview

The Master-Workers architecture (loosely translated as primary-secondary architecture) is a common organizational pattern in distributed systems, such as the Master and ChunkServers in GFS, and the Master and Workers in MapReduce. Faced with a cluster of separate machine resources in a distributed system, the primary-secondary architecture is the most natural and straightforward way to organize them—like a group of people with a leader who makes decisions and coordinates, thereby maximizing the group’s collective output.

This is also a manifestation of the common divide-and-conquer philosophy in computer systems. A complex system is broken down into several relatively high-cohesion, low-coupling sub-modules, with clearly defined functional boundaries and interaction interfaces, making the system easier to understand, maintain, and extend. In the primary-secondary architecture, the primary (Master) typically maintains cluster metadata and uses it for scheduling, while the secondaries (Workers) are usually responsible for reading and writing specific data shards (in storage systems) or serving as execution units for sub-tasks (in compute systems).
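The division of labor described above can be sketched minimally. The class names, round-robin scheduling, and task format here are illustrative assumptions; real systems such as GFS and MapReduce layer heartbeats, re-execution, and failure handling on top of this skeleton.

```python
# Minimal master-workers sketch: the master holds cluster metadata and
# schedules; workers execute sub-tasks. Illustrative names and policy only.

class Worker:
    def __init__(self, wid):
        self.wid = wid

    def execute(self, task):
        # A stand-in sub-computation over the assigned shard of work.
        return (self.wid, task, sum(task))

class Master:
    def __init__(self, workers):
        self.workers = workers
        self.assignments = {}   # metadata: task -> worker id

    def schedule(self, tasks):
        results = []
        for i, task in enumerate(tasks):
            w = self.workers[i % len(self.workers)]  # naive round-robin
            self.assignments[tuple(task)] = w.wid
            results.append(w.execute(task))
        return results

master = Master([Worker(0), Worker(1)])
out = master.schedule([[1, 2], [3, 4], [5, 6]])
print(out)
```

Note how only the master touches the assignment metadata; workers stay stateless with respect to scheduling, which is what keeps the two roles high-cohesion and low-coupling.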

Read more »

Introduction

Paxos is an unavoidable algorithm in distributed systems, yet it is notoriously difficult to understand. So I kept avoiding it, but after dodging it for so long I began to feel some regret. Over the past week I spent my spare time collecting materials and trying different angles of attack, and finally got a preliminary grasp of it. While it is still fresh in my mind, I am putting my understanding down on paper as a brief summary for future reference.

The inventor of the Paxos algorithm, Leslie Lamport, is one of the founding figures of distributed systems, with many interesting anecdotes. You can catch a glimpse of this just from the name Paxos: Paxos is a fictional ancient Greek city-state that Lamport invented to introduce the consensus problem in distributed systems. After the initial related paper, The Part-Time Parliament, was published in 1998, many people said they couldn’t understand it. So in 2001, Lamport restated the core ideas using relatively concise language and logic, resulting in Paxos Made Simple.

The abstract of Lamport’s Paxos Made Simple paper contains only one sentence:

The Paxos algorithm, when presented in plain English, is very simple.

Yet I cannot understand this kind of “simple.”

Read more »

Zookeeper

This post introduces a paper published by Patrick Hunt et al. in 2010 that remains widely used today, positioned as a coordination component for distributed systems: ZooKeeper: Wait-free coordination for Internet-scale systems. When programming with multiple threads or processes, we inevitably need synchronization and mutual exclusion; common mechanisms include shared memory, message queues, locks, and semaphores. In distributed systems, different components likewise need coordination, and thus Zookeeper was born. Combined with client libraries, Zookeeper can provide dynamic configuration metadata, distributed locks, shared registers, service discovery, group membership, leader election, and a series of other coordination services.

Overall, Zookeeper has the following characteristics:

  1. Zookeeper is a distributed coordination kernel with relatively cohesive functionality, keeping the API simple and efficient.
  2. Zookeeper provides a set of high-performance, FIFO-guaranteed, event-driven non-blocking APIs.
  3. Zookeeper organizes data using a filesystem-like directory tree, offering powerful expressiveness and making it convenient for clients to build more complex coordination primitives.
  4. Zookeeper is a self-contained fault-tolerant system, using the Zab atomic broadcast protocol to ensure high availability and consistency.
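Point 3 is worth a small illustration: sequential znodes under a directory are enough for clients to build a lock primitive, where the holder is whoever owns the lowest sequence number. The sketch below is a hypothetical in-memory znode tree, not the ZooKeeper client API; real lock recipes also use ephemeral nodes and watches on the predecessor.

```python
# Toy znode tree showing how sequential nodes yield a lock ordering.
# Illustrative only; not the ZooKeeper client library.

class ZNodeTree:
    def __init__(self):
        self.children = {}   # parent path -> {name: data}
        self.seq = {}        # parent path -> next sequence number

    def create_sequential(self, parent, prefix, data=None):
        # ZooKeeper appends a monotonically increasing, zero-padded counter.
        n = self.seq.get(parent, 0)
        self.seq[parent] = n + 1
        name = f"{prefix}{n:010d}"   # e.g. lock-0000000000
        self.children.setdefault(parent, {})[name] = data
        return f"{parent}/{name}"

    def get_children(self, parent):
        return sorted(self.children.get(parent, {}))

def try_lock(tree, my_path):
    # I hold the lock iff my node sorts first among /lock's children.
    parent, name = my_path.rsplit("/", 1)
    return tree.get_children(parent)[0] == name

tree = ZNodeTree()
a = tree.create_sequential("/lock", "lock-")   # client A contends
b = tree.create_sequential("/lock", "lock-")   # client B contends
print(try_lock(tree, a), try_lock(tree, b))    # A wins; B would watch A's node
```

The same tree plus sequential/ephemeral flags underlies the other primitives the post lists, such as leader election and group membership.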

This article follows the order of the paper, briefly introducing Zookeeper’s service interface design and sketching its module-level implementation. For more details, please refer to the paper and the open-source project homepage.

Read more »

Introduction

Nowadays, with the development of communication technology, the proliferation of mobile Internet, and the rise of IoT, connected vehicles, and AI, the amount of data generated daily is growing explosively. Data at this scale cannot be processed independently by traditional single-machine systems; it can only be handled by large-scale distributed systems. As a result, distributed systems have gradually become a prominent field of study. However, as a beginner in distributed systems, it is easy to feel overwhelmed when faced with the vast, unclassified sea of learning materials available online.

But distributed systems have their fundamental research areas and unique evolutionary threads, such as:

  1. Some fundamental research problems: ordering, consistency, fault tolerance, consensus algorithms, concurrency control, etc.
  2. Some fundamental theorems: CAP, PACELC, FLP
  3. Gradually evolving industrial systems: MapReduce, Spark, GFS, Dynamo, Cosmos

Therefore, by grasping distributed systems along the two dimensions of “time” and “space,” one can master the essentials and learn more clearly. “Time” refers to the evolutionary thread of distributed systems, which can be understood by reading papers from different periods in academia and industry. “Space” refers to the decomposition of fundamental problems studied in distributed systems, which can be understood by reading books to build a knowledge system. This article briefly summarizes some materials I collected during my study of distributed systems, categorized for your reference. The materials are listed in no particular order; please adopt them as needed.

Note: Most of the recommended materials are in English. If you have difficulty reading them, I recommend using the Chrome browser with the “Google Translate” extension installed, which allows one-click “Translate this page.”

Read more »

I had heard of LevelDB a while ago, but this time, on a whim, I went through the code with some reference materials, and it really lives up to its reputation. Whether you are interested in storage, want to use C++ elegantly, or want to learn how to architect a project, I highly recommend taking a look. Not to mention that the authors are Sanjay Ghemawat and Jeff Dean.
If I don’t write something after reading through it once, given my memory, I’ll surely forget it soon. So I wanted to write something about the wonders of LevelDB, but I didn’t want to take the usual approach: starting with an architectural overview and ending with module analysis. While reading the code, I kept thinking about where to start writing. When I was just finishing up, what impressed me most were actually all the exquisite data structures in LevelDB: tailored to the scenario, built from scratch, appropriately trimmed, and precisely coded. Why not start the LevelDB series with these little corner pieces?
This series mainly wants to share three classic data structures commonly used in engineering in LevelDB: the Skip List used for fast read/write to the memtable, the Bloom Filter used for fast filtering of sstables, and the LRUCache used for partially caching sstables. This is the third article, on LRUCache.

Introduction

LRU is a data structure commonly seen in engineering, most often in caching scenarios. In recent years it has also become a popular interview question: it is widely used in practice, the amount of code is modest, and the data structures involved are very typical. There is a corresponding problem on LeetCode: lru-cache. Compared with real-world scenarios, the problem is simplified: essentially, it asks you to maintain a key-value collection ordered by access time, where both keys and values are integers. The classic solution pairs a hash table (unordered_map) with a doubly linked list: the hash table solves the indexing problem, and the doubly linked list maintains the access order. A solution I wrote at the time is characterized by two helper functions that return the node itself to support chained calls, which simplifies the code.
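The classic hash-table-plus-doubly-linked-list solution just described can be sketched as follows (in Python rather than C++; the dict plays the role of `unordered_map`, and the helpers return the node to allow chaining):

```python
# Classic LRU: dict for O(1) lookup, doubly linked list for access order,
# with the most recently used entry at the head.

class Node:
    def __init__(self, key=0, value=0):
        self.key, self.value = key, value
        self.prev = self.next = None

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.index = {}                        # key -> Node
        self.head, self.tail = Node(), Node()  # sentinels simplify edge cases
        self.head.next, self.tail.prev = self.tail, self.head

    def _unlink(self, node):
        node.prev.next, node.next.prev = node.next, node.prev
        return node   # returning the node enables chained calls

    def _push_front(self, node):
        node.prev, node.next = self.head, self.head.next
        self.head.next.prev = node
        self.head.next = node
        return node

    def get(self, key):
        if key not in self.index:
            return -1
        # Move the touched node to the front in one chained expression.
        return self._push_front(self._unlink(self.index[key])).value

    def put(self, key, value):
        if key in self.index:
            self._unlink(self.index[key])
        node = self.index[key] = Node(key, value)
        self._push_front(node)
        if len(self.index) > self.capacity:
            lru = self._unlink(self.tail.prev)  # evict least recently used
            del self.index[lru.key]

c = LRUCache(2)
c.put(1, 1); c.put(2, 2)
print(c.get(1))   # 1, and key 1 becomes most recent
c.put(3, 3)       # capacity exceeded: evicts key 2
print(c.get(2))   # -1
```

The two sentinel nodes mean `_unlink` and `_push_front` never have to special-case an empty list or the ends, which is most of what keeps the code short.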

Returning to the LevelDB source code, as an industrial-grade product, what optimizations and changes have been made to its LRUCache? Let’s break down the LRUCache used in LevelDB together and see what the differences are.

This article first clarifies how LRUCache is used, then gives an overview of the implementation ideas of LRUCache, and finally details the implementation of the relevant data structures.

Read more »

The patched-together May Day holiday is coming to an end, and the annual social-media photo-sharing competition is about to kick off. Smartphones’ built-in auto-enhancement keeps getting more powerful, but it can’t satisfy us programmers’ desire for fine-grained control over everything. Below I’ll share some of my humble color-grading experience, using a photo I took at Tianxin Pavilion in Changsha as an example, and teach you a dragon-slaying technique for landscape grading that makes the edited photo pass for the real scene. At the end, I’ll briefly summarize photo-editing principles to explore the essence behind image color grading.

Before & After

To show that photo color grading really works wonders, here’s a side-by-side comparison for an intuitive feel:

[Comparison: before color grading]

[Comparison: after color grading]

This photo was taken at Tianxin Pavilion, the commanding height of old Changsha. That day, dark clouds gathered and a storm was approaching. Looking into the distance from the second floor of Tianxin Pavilion, yellow tiles and blue sky, bustling traffic — motion and stillness intertwined, as if with the momentum of thunder.

Below I’ll explain the color grading process and principles step by step, keeping it as concise as possible, with reference materials for those interested in digging deeper.

Read more »