木鸟杂记

大规模数据系统

The originator of the “industrial assembly line,” the motor assembly of the Ford Model T, broke the assembly process into 29 steps, reducing assembly time from an average of twenty minutes to five minutes — a fourfold efficiency gain. The image below is sourced from.

[Image: T-model-car.png]

This assembly-line philosophy is ubiquitous in data processing. Its core concepts are:

  1. Standardized data collections: Corresponding to the object to be assembled, this is a consistent abstraction for the inputs and outputs of every stage in data processing. Consistency means that the output of any processing stage can serve as the input to any other processing stage.
  2. Composable data transformations: Corresponding to a single assembly step, this defines an atomic operation that transforms data. By combining various atomic operations, one can achieve powerful expressiveness.

Thus, the essence of data processing is: for different requirements, read and standardize the data collection, then apply different combinations of transformations.
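The two concepts can be sketched in a few lines of Python. This is an illustrative sketch, not from the original post: the "standardized data collection" is just a list of dicts, and each atomic transformation maps one such list to another, so any output can feed any input.

```python
from functools import reduce

def pipeline(*stages):
    """Compose atomic transformations into one; every stage maps a
    standardized collection (here, a plain list of dicts) to another."""
    def run(records):
        return reduce(lambda data, stage: stage(data), stages, records)
    return run

# Atomic operations: each takes and returns a list of dicts.
def parse(rows):
    return [dict(zip(("name", "qty"), r.split(","))) for r in rows]

def cast_qty(rows):
    return [{**r, "qty": int(r["qty"])} for r in rows]

def only_positive(rows):
    return [r for r in rows if r["qty"] > 0]

etl = pipeline(parse, cast_qty, only_positive)
print(etl(["bolt,3", "nut,0", "gear,5"]))
# → [{'name': 'bolt', 'qty': 3}, {'name': 'gear', 'qty': 5}]
```

Because every stage shares the same input/output shape, reordering or recombining stages is just a matter of changing the argument list to `pipeline` — the assembly-line analogy made literal.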

Read more »

Recently, while going through slides on DuckDB’s execution engine (Push-Based Execution), I came across this paper: Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age. I had seen it referenced several times in articles about execution engines, and since NUMA architecture matters a great deal for modern database design but my understanding of it was still shallow, I decided to read it.

As can be seen from the title, the paper has two main keywords:

  1. NUMA-Aware
  2. Morsel-Driven

Based on this, here is a rough summary of the paper’s central idea:

  1. In the many-core era, due to the binding relationship between some CPUs and some memory, CPU memory access is non-uniform (NUMA). That is, for a given CPU core, some local memory has lower access latency, while other memory has higher latency.
  2. The traditional volcano model uses the Exchange operator for parallelism. Other operators are not aware of multithreading, and therefore cannot schedule computation close to memory (hardware affinity). That is, they are not NUMA-local.
  3. To solve this problem, the paper proposes, on the data dimension: horizontally partitioning the dataset, with one NUMA-node processing one data partition; vertically splitting each partition into segments (Morsel), and performing concurrent scheduling and preemptive execution at the Morsel granularity.
  4. On the computation dimension: pre-allocating one thread per CPU; during scheduling, each thread only accepts tasks whose data chunks (Morsel) are allocated to its local NUMA-node; when the execution progress of sub-tasks among threads is unbalanced, faster threads will “steal” tasks that were supposed to be scheduled to other threads, thereby ensuring that multiple sub-tasks of a Query finish approximately at the same time, without a “long tail” partition.
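The scheduling idea in points 3–4 can be sketched as a toy simulation (my own illustration, not the paper’s implementation): each worker drains morsels from its own NUMA-local queue, and once empty, steals from the busiest peer so no partition becomes a long tail.

```python
from collections import deque

def run_morsel_scheduler(partitions, num_workers):
    """Toy morsel-driven scheduler: each worker prefers morsels from its
    own NUMA-local partition and steals from peers once its queue drains."""
    queues = [deque(p) for p in partitions]
    processed = [[] for _ in range(num_workers)]
    while any(queues):
        for w in range(num_workers):
            q = queues[w]
            if not q:  # local queue empty: steal from the busiest peer
                q = max(queues, key=len)
                if not q:
                    continue
            processed[w].append(q.popleft())
    return processed

# Worker 1's partition is 3x larger; worker 0 steals to even out the tail.
out = run_morsel_scheduler([[1, 2], [3, 4, 5, 6, 7, 8]], 2)
```

With two morsels in one partition and six in the other, both workers end up processing four each — the imbalance is absorbed by stealing rather than by one thread idling while the other finishes a long tail.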
Read more »

This was my first live stream sharing session on Bilibili. The recording is available at https://www.bilibili.com/video/BV1Fz4y1u79v. I mainly talked about key points for landing an infra job from the perspective of my own job search and interview experiences, and answered some live questions from viewers.

🐎 Job Search Preparation

Soft Skills

💁‍♂️ Communication. During interviews, whether describing experiences, doing system design, or even writing code, communication is the top priority—it is the prerequisite for the smooth progress of all interview stages.

Explain something clearly:

  1. Context: First align the context with the interviewer. Don’t assume they know more than you.
  2. Organization: Background → Requirements → Solution → Challenges → Results
  3. Conciseness: Just like writing an article—go through it multiple times.

🧠 Thinking. Mainly abstraction and association.

  1. Abstraction. Also called induction, or knowledge clustering. Using a tree as an analogy, it means traversing several child nodes, abstracting their common traits into a parent node, and then deducing new child nodes from it. For example, scheduling problems (CPU scheduling, distributed task scheduling).
  2. Association. Also called connection, or cross-domain linking. Using a graph as an analogy, it means for several connected subgraphs, establishing pathways between them in a new dimension. For example, text is a serialization of thought.
Read more »

RocksDB is the underlying storage engine for many distributed databases, such as TiKV, CRDB, NebulaGraph, and more. Artem Krylysov, who works at Datadog, wrote an accessible, popular-science introduction to RocksDB. Here is my translation to share with everyone.

Introduction

In recent years, the adoption of RocksDB has risen sharply, making it the go-to choice for embedded key-value stores (hereinafter referred to as KV stores).

Currently, RocksDB runs in production environments at companies such as Meta, Microsoft, Netflix, and Uber. At Meta, RocksDB serves as the storage engine for MySQL deployments and provides storage support for the distributed graph database TAO.

Large tech companies are not the only users of RocksDB. Several startups, including CockroachDB, Yugabyte, PingCAP, and Rockset, are built on top of RocksDB.

I worked at Datadog for 4 years, building and running a series of RocksDB-based services in production. This article provides a high-level overview of how RocksDB works.

Read more »

Many computer science programs at domestic universities tend to focus heavily on drilling fundamentals and theory (based on my own experience back then; things may be better now). There are some course projects to build coding skills, but they are often insufficient, so many students feel less than confident about their coding abilities before entering the workforce. Below are some suggestions based on my own coding journey.

Read more »

Overview

Facebook Velox is a C++ SQL runtime library designed to unify the execution layers of Facebook’s various compute engines, including Spark and Presto. It uses a push-based model and supports vectorized execution.

Velox takes an optimized PlanNode tree and slices it into linear Pipelines. The Task is responsible for this transformation, with each Task targeting one PlanTree segment. Most operators are translated one-to-one, but some special operators usually appear at the split points of multiple Pipelines. Typically, these split points correspond to branching points in the plan tree, such as HashJoinNode, CrossJoinNode, and MergeJoinNode, which are usually translated into XXProbe and XXBuild. However, there are also exceptions, such as LocalPartitionNode and LocalMergeNode.
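The slicing rule described above can be sketched as follows. This is my own simplification for illustration — node names like `ScanT1` are made up, and the real Velox Task/Driver machinery does much more — but it shows the key move: a join’s probe side stays in the current pipeline while its build side seeds a new one.

```python
class PlanNode:
    def __init__(self, name, children=()):
        self.name, self.children = name, list(children)

def slice_pipelines(root):
    """Slice a plan tree into linear pipelines (sketch): at a join, the
    probe side continues the current pipeline and the build side starts
    a fresh one, mirroring the XXProbe/XXBuild split described above."""
    pipelines = []
    def walk(node, current):
        if node.name.endswith("Join"):
            probe, build = node.children
            current.append(node.name + "Probe")
            new_pipe = [node.name + "Build"]
            pipelines.append(new_pipe)
            walk(build, new_pipe)
            walk(probe, current)
        else:
            current.append(node.name)
            for child in node.children:
                walk(child, current)
    main = []
    pipelines.append(main)
    walk(root, main)
    # Walking top-down collects operators root-first; reverse so each
    # pipeline reads source-to-sink.
    return [list(reversed(p)) for p in pipelines]

plan = PlanNode("Project", [
    PlanNode("HashJoin", [PlanNode("ScanT1"), PlanNode("ScanT2")]),
])
```

Slicing this plan yields two linear pipelines: `ScanT1 → HashJoinProbe → Project` and `ScanT2 → HashJoinBuild`, with the join node as the split point.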

Read more »

I don’t know why, but this year an unusually large number of friends shared year-end summaries on Moments. I quite like this format. For one, I enjoy reading others’ year-end reviews — beyond the stories, I get to see different paths people have taken. For another, a regular annual review does help me organize my thoughts and sketch a simple outlook for the year ahead.

In ancient times when knowledge was monopolized, only emperors and generals could have biographies. Today, in this era of information explosion, everyone can record their story and speak for themselves. In the short term, as I grow older and my thoughts deepen, many things are forgotten within months if not written down. Through an annual review, I can later trace the trajectory of my intellectual changes — a way of self-reflection, hoping for new insights. In the long term, we will all pass away someday. If I can leave even a fragment behind via the internet, bringing a smile to future readers, it would be like a goose leaving its call in the sky.

Read more »

There is a question on Zhihu: how to implement a database? I couldn’t resist writing another piece. Using the most common way of analyzing and understanding problems in computer science, we can think about how to implement a database from two dimensions: logical and physical.

Logical Dimension

Data Model (External, User-Facing)

If you want to implement a database, first you need to decide what data model to offer users. In earlier years, this might not have been an issue—back then, databases roughly equaled relational databases, which roughly equaled Oracle/SQL Server/MySQL/PostgreSQL. But as data volumes continue to grow and user demands become increasingly refined, the relational model can no longer be a one-size-fits-all solution.

Read more »

The DDIA reading group shares the book chapter by chapter, supplemented with some details based on my experience with distributed storage and databases in industry. We share roughly every two weeks—welcome to join us. The schedule and all transcripts are here. We also have a corresponding distributed systems & database discussion group; notifications are sent there before each session. If you’d like to join, add me on WeChat: qtmuniao, give a brief self-introduction, and note “distributed systems group.” Also, my public account “Muniao Notes” has more articles on distributed systems, storage, and databases—feel free to follow.
The first part of this book discusses single-machine data systems; the second part discusses multi-machine data systems.

Replication means keeping multiple copies of the same data on different machines connected via a network. Its benefits are:

  1. Reduced latency: can be geographically close to users in different regions at the same time.
  2. Increased availability: can continue to provide service when parts of the system fail.
  3. Increased read throughput: smoothly scale out the machines available for queries.

This chapter assumes the entire dataset fits on a single machine, so each machine can hold a full copy and we only need to consider replicating it across machines. What if the data exceeds single-machine scale? That is what the next chapter addresses.
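A minimal sketch of the single-leader replication this chapter builds on — the class names here are illustrative, not from the book. The leader applies every write locally and pushes it to each follower, so any follower can then serve reads (benefit 3), and extra copies survive the loss of one machine (benefit 2).

```python
class Follower:
    """A replica that holds a full copy of the data and serves reads."""
    def __init__(self):
        self.data = {}
    def read(self, key):
        return self.data.get(key)

class Leader:
    """Accepts all writes and synchronously replicates them to followers."""
    def __init__(self, followers):
        self.data = {}
        self.followers = followers
    def write(self, key, value):
        self.data[key] = value
        for f in self.followers:  # every replica gets the full copy
            f.data[key] = value

f1, f2 = Follower(), Follower()
leader = Leader([f1, f2])
leader.write("user:1", "alice")
```

Real systems replicate asynchronously via a log rather than with a synchronous in-process loop; the trade-offs between those choices are exactly what the chapter explores.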

Read more »

There was a question on Zhihu: how can you tell a programmer’s skill level? Based on my experience reviewing code in recent years, I can’t help but rant a bit about the topic of engineering craftsmanship—just in time for the second post in the “Writing Good Code” series.

Mental Framework

Less skilled programmers often fall short in “abstraction”.

What is abstraction ability? In short, it’s the capacity to categorize and draw analogies. Through extensive practice and reading, problems are broken down orthogonally into atomic knowledge, which is highly reusable; then, by combining and reasoning with this atomic knowledge, new problems can be solved creatively. In other words: induction and deduction.

Read more »

Overview: Why Flow Control Matters

The benefit of moving to the cloud lies in pooling resources, enabling multi-tenant sharing, and allocating on demand, thereby reducing costs. But:

  1. Multi-tenant isolation: Users demand that they can use the capacity they’ve purchased without being affected by other tenants.
  2. Resource sharing: Resources can only be logically separated, not physically separated; otherwise, dynamic allocation (overcommitment) cannot be fully achieved.

These two are a pair of contradictory requirements, and I believe they are the most critical problems that cloud-native databases must solve. If this problem is not solved well, the database will:

  1. Either the platform won’t make money: With static resource reservation, users may be satisfied because they can always use the resource quota sold to them, but this leads to massive resource waste—either the price is high, or users won’t pay.
  2. Or users will be dissatisfied: Multiple tenants share physical resources, but it is very easy for them to affect each other, preventing users from using the quota promised by the platform.

Starting from static allocation, DynamoDB gradually evolved a combined global and local admission control mechanism, thereby achieving physical resource sharing while logically providing quota isolation to users, realizing true cloud-native databases. Below, based on the details disclosed in the paper Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service, I summarize the evolution of its flow control mechanism for your reference.
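The building block underneath this kind of admission control is the token bucket. As a generic sketch (not DynamoDB’s actual code), tokens refill at the provisioned rate and unused capacity accumulates up to a burst cap, which is what lets a tenant briefly exceed its steady-state quota without breaking isolation:

```python
import time

class TokenBucket:
    """Generic token bucket of the kind quota-based admission control
    builds on: tokens refill at the provisioned rate; unused capacity
    accumulates up to `burst`, allowing short spikes above the rate."""
    def __init__(self, rate, burst, now=time.monotonic):
        self.rate, self.burst, self.now = rate, burst, now
        self.tokens, self.last = burst, now()

    def try_acquire(self, cost=1.0):
        t = self.now()
        # Lazy refill: add tokens for the elapsed time, capped at burst.
        self.tokens = min(self.burst, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A per-partition bucket gives local admission control; coordinating the refill rates of many such buckets against a tenant-wide budget is, roughly, the "global plus local" combination the paper describes.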

My expertise is limited, so corrections are welcome at any time.

Read more »

Google LevelDB is a canonical implementation of LSM-Tree. However, after being open-sourced, in order to maintain a lightweight and concise style, besides fixing bugs, it has not seen major updates. To make it capable of meeting diverse workloads in industrial environments, Facebook (Meta) forked LevelDB and made optimizations in many aspects. On the hardware side, it can more effectively utilize modern hardware such as flash memory, fast disks, and multi-core CPUs; on the software side, it has made numerous optimizations for read/write paths and Compaction, such as SST indexing, index sharding, prefix Bloom Filter, column families, etc.

This series of articles, based on the RocksDB blog series, combined with source code and some practical experience, shares some interesting optimization points, hoping to inspire everyone. My expertise is limited, so feel free to point out any inaccuracies in the comments.

This is the first article in the RocksDB optimization series. It looks at how SST files at different levels are indexed to speed up lookups that have to descend deep into the tree.
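To set the stage, here is a conceptual sketch (my own illustration, not RocksDB code) of why per-level indexing matters for a point lookup in an LSM tree: L0 files may overlap and must be scanned newest-first, but deeper levels hold sorted, non-overlapping key ranges, so a binary search over file ranges picks the single candidate SST per level.

```python
import bisect

def lookup(levels, key):
    """Point lookup across LSM levels (sketch). levels[0] is a list of
    L0 files (dicts), newest first; each deeper level is a list of
    (min_key, max_key, data) tuples sorted by min_key, non-overlapping."""
    l0, deeper = levels[0], levels[1:]
    for sst in l0:  # overlapping files: check newest first
        if key in sst:
            return sst[key]
    for level in deeper:
        # Binary search the file ranges: at most one file can hold key.
        idx = bisect.bisect_right([f[0] for f in level], key) - 1
        if idx >= 0:
            lo, hi, data = level[idx]
            if lo <= key <= hi and key in data:
                return data[key]
    return None

levels = [
    [{"b": 2}],                                        # L0
    [("a", "c", {"a": 1, "c": 3}), ("d", "f", {"e": 5})],  # L1
]
```

With in-file index blocks (and Bloom filters) layered on top of this level structure, most levels are ruled out without touching disk — which is where the optimizations in this series come in.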

Read more »