木鸟杂记

Large-Scale Data Systems

On a clear afternoon in early winter, I set out from Ditan Park, passed through Jiaodaokou Santiao, crossed Yuanensi Hutong, and walked all the way to Shichahai. The weather that day was exceptionally good. Under the penetrating sunlight, the entire hutong district was basking lazily, and my heart felt light and carefree.

Read more »

By the time I graduated from grad school, I had saved up some money from an internship at NetEase Games. I decided to finally buy the programmer’s productivity tool I had been drooling over for so long: a MacBook Pro. The new version had just been released, so in late 2016, as soon as it hit the market, I ordered an entry-level Pro from the official website: 13-inch, two ports, no Touch Bar. Even so, it cost over 10,000 RMB.

mac-buy-info.png

No Touch Bar was acceptable; after all, as a vim user, I preferred a physical Esc key I could actually press. The two ports, though, were a real constraint, so I quietly went on JD.com and ordered a few dongles. At the time, USB-C wasn’t popular yet, so there weren’t many choices.

When it arrived, I opened it with joy and was completely amazed — the intoxicating new Space Gray color, the unprecedented thin and light design, the Unix-like system with a beautiful UI. All my previous little complaints were thrown to the back of my mind. Ah, so good.

Read more »

Introduction

In 2018, I worked through some of the 6.824 labs; my old notes are here. Unfortunately, I got stuck at Part 2C, could never pass the tests, and put it aside. In the days that followed, I often thought of this legendary course with a tinge of sadness. Only now have I finally come back to fulfill that wish.

This time, there were three reasons I could pass all the tests: first, having done it once before, many of the principles were still fresh in my mind; second, over the past year or so I had gained a lot of hands-on experience with distributed systems at work; and third, my Go skills had improved somewhat. But during the process, I still encountered a great many vexing details. To make it easier to review them later, I’ve organized these details and recorded them here. If they happen to be of the slightest help to others working on this course, that would be another happy occasion.

6.824 and Raft

6.824 is an excellent open course on distributed systems. While doing the labs, I was constantly amazed by its exquisite design and the thoroughness of its materials. That the masters at MIT would share such an essential course with the world truly reflects the character of a great university and its scholars, and is a great fortune for us computer scientists.

Raft is a consensus protocol designed for understandability. Distributed consensus is a classic problem in the field, and also one of the hardest parts of distributed systems. Intuitively, it is like laying the foundation for a skyscraper on quicksand: unreliable networks and failure-prone hosts produce state changes so complex that ordinary people truly cannot simulate them in their heads. I am rather dull, and can only reach some understanding through intuitive grasp plus an accumulation of details. Returning to Raft: with Paxos as a predecessor in the same field, how did Raft still manage to stand out? I think it comes down to two key points:

  1. Easy to understand. Paxos is notoriously difficult to understand, and therefore hard to bring into common use. Raft reduces the dimensionality of algorithmic complexity by decoupling it into multiple modules, greatly lowering the difficulty of understanding for ordinary people. In addition, Raft has many elegant designs that avoid introducing complexity as much as possible, further reducing the mental burden.
  2. Easy to implement. Being easy to understand objectively leads to ease of implementation, but that doesn’t automatically mean you can produce an excellent system from it. If understanding remains at the intuitive level, implementation becomes a castle in the air. The brilliance of the Raft paper is that it has both an intuitive grasp and a detailed organization—it is almost a system design document, and a detailed one at that.
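As an illustration of that decomposition, the leader-election module's voting rule (Figure 2 of the Raft paper) is small enough to sketch on its own. The following is a simplified Go sketch in the spirit of the lab's conventions, not the lab's actual code; it omits RPC, locking, and persistence:

```go
package main

import "fmt"

// Raft holds the per-server election state from Figure 2 of the paper.
type Raft struct {
	currentTerm  int
	votedFor     int // candidate ID voted for in currentTerm; -1 if none
	lastLogTerm  int
	lastLogIndex int
}

type RequestVoteArgs struct {
	Term         int
	CandidateID  int
	LastLogIndex int
	LastLogTerm  int
}

// RequestVote applies the paper's voting rule: step down on a newer term,
// grant at most one vote per term, and only to a candidate whose log is
// at least as up-to-date as ours.
func (rf *Raft) RequestVote(args RequestVoteArgs) (term int, granted bool) {
	if args.Term < rf.currentTerm {
		return rf.currentTerm, false // stale candidate
	}
	if args.Term > rf.currentTerm {
		rf.currentTerm = args.Term // newer term: our old vote no longer counts
		rf.votedFor = -1
	}
	upToDate := args.LastLogTerm > rf.lastLogTerm ||
		(args.LastLogTerm == rf.lastLogTerm && args.LastLogIndex >= rf.lastLogIndex)
	if (rf.votedFor == -1 || rf.votedFor == args.CandidateID) && upToDate {
		rf.votedFor = args.CandidateID
		return rf.currentTerm, true
	}
	return rf.currentTerm, false
}

func main() {
	rf := &Raft{currentTerm: 1, votedFor: -1, lastLogTerm: 1, lastLogIndex: 5}
	_, ok := rf.RequestVote(RequestVoteArgs{Term: 2, CandidateID: 7, LastLogIndex: 5, LastLogTerm: 1})
	fmt.Println(ok) // the candidate's log is at least as up-to-date, so the vote is granted
}
```

Because election, log replication, and safety are decoupled, a rule like this can be understood and tested in isolation, which is exactly the point about modularity above.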

To do well on this lab, you need to consult a large amount of material; I’ve summarized both what the lab itself mentions and what I found on my own at the end of this article. There is also the English-language barrier to contend with. Although I eventually passed all the test cases, there are still many points I didn’t implement well and many things I don’t fully understand.

Note: Later, in 2023, I did it again, and finally figured out most of the points.

Read more »

Introduction

Recently, while writing Go code, I needed to customize a string conversion method for a struct:

func (ms MyStruct) String() string

However, I got stuck when deciding whether to use a value method or a pointer method for the implementation.

Go’s syntactic sugar makes these two approaches consistent at the call site, which made it difficult for me to decide which was better. So I decided to dig deeper into the underlying principles so that I could write more idiomatic Go code in the future.

Read more »

Overview

Kafka (the paper was published in June 2011 [1]) is a system that combines the best of log processing and message queuing. With low latency and extremely high capacity and throughput, it can serve both online services and offline workloads. To balance performance and scalability, Kafka made some design choices that seem counter-intuitive but prove very practical. Here is a brief summary of its design characteristics:

  1. Storage-oriented message queue: This means that, in near real-time scenarios, it can increase the storage capacity of traditional message queues by several orders of magnitude. The implementation makes full use of sequential disk writes and the OS’s own cache; furthermore, to improve disk access and transmission efficiency, it employs techniques such as file segmentation, segment-level indexes, zero-copy, and batch fetching.

  2. Flexible production and consumption patterns: Overall, it is a topic-based publish-subscribe architecture that supports both mutually exclusive consumption within a consumer group and duplicate consumption across different consumer groups. This involves two core design choices of message queues: pull-based consumption and client-side storage of consumption progress. Pull-based consumption may cause empty polling and slight latency, but the benefit is flexibility; client-side storage of consumption progress allows brokers to be stateless, enabling flexible scaling and fault tolerance. To simplify the implementation, each partition is consumed by at most one consumer at a time.

  3. Zookeeper for metadata storage: Kafka uses the distributed coordination component Zookeeper as a registry for system metadata, including broker and consumer liveness, consumer-to-partition mappings, and per-partition consumption progress. Zookeeper, a highly available service that organizes its data as a file-system-like tree of nodes and supports watch-based publish-subscribe, can meet Kafka’s needs for consumption coordination and progress persistence.

  4. Partition-level multi-replica design: This was not yet implemented in the paper; it was likely added later during the open-source evolution of the system. This feature enables fault tolerance for brokers.

  5. Simple yet powerful consumer API: Kafka clients generally provide two layers of API abstraction. One is a high-level simple read/write interface that does not require attention to partition and offset information; the other is a low-level interface that allows flexible control over partition assignment and consumption progress. The paper only mentions the former to demonstrate its simplicity.

Read more »

bazel-golang.png

Introduction

Bazel is an excellent open-source build system from Google. Its positioning, in the official words, is:

a fast, scalable, multi-language and extensible build system


To build Golang projects with Bazel, in addition to Bazel’s own features, you also need to understand the Golang-specific ruleset rules_go. Additionally, you can use gazelle to automate much of the boilerplate work of writing BUILD files.
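As a sketch, a small Go binary built with rules_go typically gets a BUILD.bazel along these lines (the package layout and the importpath example.com/hello are made up; gazelle can generate most of this from the Go sources):

```starlark
load("@io_bazel_rules_go//go:def.bzl", "go_binary", "go_library")

# The gazelle:prefix directive tells gazelle how to derive importpaths:
# gazelle:prefix example.com/hello

go_library(
    name = "hello_lib",
    srcs = ["main.go"],
    importpath = "example.com/hello",
    visibility = ["//visibility:private"],
)

go_binary(
    name = "hello",
    embed = [":hello_lib"],
    visibility = ["//visibility:public"],
)
```

Running gazelle again after adding files or imports updates srcs and deps automatically, which is exactly the boilerplate it is meant to take over.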

Read more »

Overview

RDD, short for Resilient Distributed Dataset, is an abstraction over distributed datasets. Based on this abstraction, users can run a series of computations on a cluster without persisting intermediate results to disk. This was exactly a major pain point of the earlier MapReduce abstraction: every step required writing to disk, incurring large unnecessary overhead.

For distributed systems, fault tolerance support is essential. To support fault tolerance, RDD only supports coarse-grained transformations. That is, the input dataset is immutable (or read-only), and each computation produces a new output. Fine-grained update operations on a dataset are not supported. This constraint greatly simplifies fault tolerance support and can satisfy a large class of computation needs.

Read more »

python-generator.png

Introduction

Once during an interview, I asked a candidate: What is a generator in Python? The answer was: A function with the yield keyword. But in my impression, the value returned by such a function is the generator, not the function itself. For example:

In [1]: def get_nums(n):
   ...:     for i in range(n):
   ...:         yield i
   ...:

In [2]: type(get_nums)
Out[2]: function

In [3]: nums = get_nums(10)

In [4]: type(nums)
Out[4]: generator

However, seeing the candidate so confident, I had a vague feeling that something was off, which led to the following exploration.

Read more »

Introduction

The binary tree is a fascinating data structure. The binary search tree (BST for short below; rendered variously as “二叉查找树” or “二叉搜索树” in Chinese) is one of its most commonly used variants, and it has many interesting properties:

  1. All nodes in the left subtree are smaller; all nodes in the right subtree are larger.
  2. In-order traversal yields a sorted sequence.
  3. Its projection onto the horizontal axis is in ascending order.

Of course, adding balance introduces even more properties, but let’s set that aside for now. Today, let’s start with a small problem.

Read more »

In late autumn, all kinds of fruits arrive in abundance. Hawthorn, which we call “shanlihong” in my hometown, seems to be called “hongguo” in Beijing. A famous old Beijing dish is “Chao Hongguo” (stir-fried hawthorn). One day at the vegetable market, I found that this year’s hawthorns were large and of good quality, so I quickly bought some.

Read more »

After building a blog with GitHub Pages + Hexo + the NexT theme and using it for a while, I wanted to further customize and beautify it, so I’m recording my notes here. During the process, I found that the English documentation is much more detailed than the Chinese documentation. Basically every setting in themes/next/_config.yml is explained, so if your English is up to it, just read the English docs: https://theme-next.org/docs/ .

In addition, remember to run hexo s after each modification and preview the changes locally at http://localhost:4000/ to see whether they meet your expectations.

Read more »

The previous post covered how tasks awaiting scheduling are organized. This post continues with a softer topic: node resource abstraction and scheduling policies.

Introduction

Since Ray supports explicit resource constraints for tasks, it needs to abstract the resources of all nodes in a hardware-agnostic way, managing all resources uniformly to enable logical addition and removal of resources. When a node joins, its total resource capacity must be sensed; when a task is scheduled, a node satisfying the constraints must be found; when a task is successfully scheduled, the remaining available resources can be obtained, and so on.

In addition to standard resources like CPU and GPU, Ray also supports scheduling of user-defined labeled resources. When starting a node (ray start --resources <resources>), users specify the total amount of some category of resource that the node has (e.g., memory, bandwidth, a certain model of GPU, etc.). When defining a remote function, users specify how much of that category the task requires. Ray’s scheduler will then dispatch the task to a specific machine according to the user’s custom resource requirements. This is an interesting design for interaction between user code and the scheduler.

As for scheduling policy, Ray’s decentralized scheduling means inconsistent views of cluster state can easily arise. In practice, the simplest approach turns out to be statistically optimal: for each task, find the nodes that satisfy its resource constraints and randomly select one to dispatch the task to.

Read more »