木鸟杂记

Large-Scale Data Systems

I changed jobs a while ago for various reasons and interviewed for several distributed-storage-related positions. I noticed there are relatively few experience-sharing posts on this topic, so I decided to share mine. Due to company privacy policies, I won’t list questions by company; instead, I’ll briefly summarize the general directions and contents of the interviews. Given my limited experience and expertise, please feel free to point out any mistakes in the comments.

Distributed storage positions cover a wide range, and can generally be categorized by direction as follows:

  1. Distributed File Storage
  2. Object Storage
  3. Distributed KV or Cache
  4. Distributed Database (NewSQL)
  5. Table Storage
  6. Block Storage

Their positioning and directions also differ slightly:

Distributed File Storage. Supports POSIX semantics or a trimmed-down subset of POSIX. It can serve as the storage base for disaggregated storage and compute, or be used directly by applications such as deep learning training and intermediate storage for big data processing. Common products include the Pangu File System, PolarFS, and JuiceFS.

Object Storage. Generally stores unstructured data such as images and videos, usually compatible with Amazon’s S3 API. Common products include Amazon S3, Alibaba Cloud OSS, and Tencent Cloud COS.

Distributed KV or Cache. Usually compatible with the Redis interface, or an even simpler KV interface. It generally prioritizes speed and is built on memory, SSD, or even new hardware such as persistent memory. Used as a low-latency business cache or as the base layer of disaggregated storage-and-compute systems. Products include ByteDance’s ABase, Alibaba Cloud’s Tair, and PingCAP’s TiKV.

Distributed Database (or NewSQL). Usually provides a SQL interface and unlimited horizontal scalability. Common products include PingCAP’s TiDB, Alibaba Cloud’s PolarDB, and Tencent Cloud’s TDSQL.

Table Storage. For the classic interface, think of HBase, the wide-column store widely used in the big data field. Products include HBase and ByteDance’s ByteTable.

Block Storage. Provides a block device interface, generally used for the system disk of cloud hosts. Products include SmartX’s hyper-converged solution.

Read more »

boltdb is one of the few pure-Go, single-node KV libraries on the market. boltdb is based on Howard Chu’s LMDB project, with a clean implementation; excluding unit tests and adapter code, the core code is just a little over four thousand lines. A simple API and minimalist implementation—that is precisely the author’s intent. Due to the author’s limited bandwidth, the original boltdb has been archived and is no longer updated. If you want to improve it and submit new PRs, the etcd-maintained fork, bbolt, is recommended.

For convenience, this series of guide articles is still based on the original, unchanging repo. The project is small but complete: with just over four thousand lines of code, it implements a single-node KV engine based on B+ tree indexing and supporting single-writer, multiple-reader transactions. The code itself is simple and unadorned, with adequate comments. If you are a Go enthusiast or interested in KV stores, boltdb is absolutely a repo not to be missed.

This series is planned as three articles, sequentially focusing on data organization, index design, and transaction implementation—the three main aspects of boltdb source code analysis. Since these three aspects are not fully orthogonal and decoupled, the narrative will inevitably interweave them. If something is unclear, just skip it for now and come back to sort it out once you have the full picture. This article is the third: boltdb transaction implementation.

Introduction

Before analyzing boltdb’s transactions, it is necessary to define the concept of a transaction to clarify the scope of our discussion. A database transaction (abbreviated: transaction) is a logical unit of work in the execution process of a database management system, consisting of a finite database operation sequence[^1]. The Wikipedia definition is a bit convoluted; you only need to grasp a few key points when understanding it:

  1. Execution: at the computation level
  2. Logical unit: implies indivisibility
  3. Finite operation sequence: generally not too large in granularity

Why do we need transactions? Transactions emerged in the 1970s to free database users from cognitive burden: transactions help users organize a set of operations and automatically clean up when something goes wrong. Later, NoSQL and some distributed storage systems sacrificed full transaction support for higher performance. However, history spirals upward—the convenience of transactions led NewSQL and other next-generation distributed databases to bring them back.

Speaking of transactions, the most famous concept is the four ACID properties. In fact, ACID is more like an easy-to-remember slogan than a strict description, because the properties are not very symmetric conceptually and depend on contextual interpretation. This article will still analyze boltdb’s transaction support from these four aspects, but at the beginning of each section, I will first draw on Martin Kleppmann’s talk[^2] to try to explain their meaning from different angles; then I will analyze how boltdb implements them.
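Before diving into ACID, a minimal usage sketch helps anchor the discussion: boltdb exposes its transactions through Update (the single read-write transaction) and View (read-only transactions). The example below is illustrative; the file name and keys are made up.

```go
package main

import (
	"fmt"
	"log"

	"github.com/boltdb/bolt"
)

func main() {
	db, err := bolt.Open("my.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Update runs a read-write transaction; boltdb allows only one at a time,
	// so the whole closure either commits or rolls back as a unit.
	err = db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("answers"))
		if err != nil {
			return err
		}
		return b.Put([]byte("life"), []byte("42"))
	})
	if err != nil {
		log.Fatal(err)
	}

	// View runs a read-only transaction; many of these can proceed concurrently
	// with the single writer, each seeing a consistent snapshot.
	err = db.View(func(tx *bolt.Tx) error {
		v := tx.Bucket([]byte("answers")).Get([]byte("life"))
		fmt.Printf("life = %s\n", v)
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```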

Read more »

There are an awful lot of wild peach trees on the mountains around Beijing, scattered densely across the slopes and huddled in the valleys like an invasive species. During other seasons, I didn’t have any particular feelings about them while hiking, but in spring alone, I was greatly delighted. From afar, they looked like wisps of pink smoke, ethereal and faint. However, in the early spring of the capital, just as the pandemic was settling down, the smog returned. I watched the mountains covered in peach blossoms with my own eyes, yet could only imagine the beautiful scenery under a clear sky.

Although the peach blossoms are gorgeous, they cannot compete with the old smog

Read more »

This article is based on the OSDI 2020 “Virtual Consensus in Delos” paper presentation. It explores the control-plane storage system in distributed systems and proposes a distributed architecture based on layered abstraction. The core idea is to introduce a logical protocol layer, allowing the physical layer to be implemented and migrated on demand—somewhat analogous to how virtual memory relates to physical memory in a single-machine system.

Read more »

Introduction

In college, I didn’t study databases very deeply. What I still remember are only a few conceptual ideas on the usage side, such as SQL, ER diagrams, normal forms, and transactions. I never had a systematic understanding of their implementation. But since I’ve embarked on the storage path, I definitely need to fill in my database knowledge. Previously, I saw many people on Zhihu recommending this cmu15445 course, and I had already bookmarked the course homepage, but never found the time to look at it. “Constant thinking leads to a response” — during this holiday, coinciding with a job change, I finally have some large blocks of time to make a start.

Syllabus

A brief introduction to the syllabus of cmu15445. This course uses Database System Concepts as a supplementary textbook, and covers all aspects of the design and implementation of Database Management Systems (DBMS), including:

  1. Data models (relational, document, key-value)
  2. Storage models (n-ary, decomposition, i.e., row-oriented and column-oriented)
  3. Query languages (SQL, stored procedures)
  4. Storage structures (heaps, log-structured)
  5. Index design (sorted trees, hash tables)
  6. Transaction processing (ACID, concurrency control)
  7. Data recovery (logging, snapshots)
  8. Execution engines (joins, sorting, aggregation, optimization)
  9. Concurrency architectures (multi-core, distributed)

As you can see, the content is very comprehensive. The course uses an open-source educational database as a case study to deeply explore the trade-offs made in database design across all these aspects. This course places great emphasis on programming practice, with a series of interconnected yet concise programming assignments.

Read more »

cmu15445 is a classic open course on the design and implementation of Database Management Systems (DBMS). The course uses Database System Concepts as the textbook, providing lecture slides, notes, and videos, and carefully prepares several interconnected projects. The course places great emphasis on system design and programming implementation. In the words of the professor Andy Pavlo, this is a course that you can put on your resume and that can help you land a good offer.

Having some free time this vacation, I dug out this course and was immediately impressed by its rich content and elegant organization. Unfortunately, time is limited, so I can only follow the projects as the main thread, supplemented by lecture slides and notes, and briefly go through them. If I have more time, I will skim the textbook and videos. Starting from project one, after each project’s autograder run, I will write a note for future reference. Professor Andy Pavlo suggests not making the project code repository public, so I will try to minimize code snippets and focus more on the thought process.

This post is about project one: managing the cache of file system pages in memory — the buffer pool manager.

Overview

The target system of this project, BusTub, is a disk-oriented DBMS, but data on disk does not support byte-level access. Therefore, an intermediate layer for managing pages is needed. However, Professor Andy Pavlo insists on not using mmap to delegate page management to the operating system. Thus, the goal of Project 1 is to proactively manage the cache of pages from disk in memory, thereby minimizing the number of disk accesses (in time) and maximizing the contiguity of related data (in space).

This project can be broken down into two relatively independent sub-tasks:

  1. Maintaining the replacement policy: LRU replacement policy
  2. Managing the buffer pool: buffer pool manager

Both components are required to be thread-safe.

This article first analyzes the project content from basic concepts and core data flows, then goes through the two sub-tasks respectively.
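Although the project itself is implemented in C++, the gist of the first sub-task is small enough to sketch in Go. The following LRUReplacer is purely illustrative: the Pin/Unpin/Victim names mirror the spirit of the assignment but are not BusTub's actual interface, and a mutex makes it safe for concurrent use.

```go
package bufferpool

import (
	"container/list"
	"sync"
)

// LRUReplacer tracks frames that are candidates for eviction.
// Names and interface are illustrative, not BusTub's actual API.
type LRUReplacer struct {
	mu    sync.Mutex
	order *list.List            // front = least recently used
	index map[int]*list.Element // frame id -> node in order
}

func NewLRUReplacer() *LRUReplacer {
	return &LRUReplacer{order: list.New(), index: make(map[int]*list.Element)}
}

// Unpin marks a frame as evictable (no one is using it anymore).
func (r *LRUReplacer) Unpin(frameID int) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.index[frameID]; ok {
		return // already evictable
	}
	r.index[frameID] = r.order.PushBack(frameID)
}

// Pin removes a frame from the evictable set (someone is using it).
func (r *LRUReplacer) Pin(frameID int) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if e, ok := r.index[frameID]; ok {
		r.order.Remove(e)
		delete(r.index, frameID)
	}
}

// Victim evicts the least recently used evictable frame, if any.
func (r *LRUReplacer) Victim() (frameID int, ok bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	e := r.order.Front()
	if e == nil {
		return 0, false
	}
	r.order.Remove(e)
	frameID = e.Value.(int)
	delete(r.index, frameID)
	return frameID, true
}
```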

Read more »

Overview

In Golang, a slice is very similar to arrays in other languages, yet differs in many ways. As a result, beginners often develop misunderstandings and unwittingly fall into various pitfalls when using slices. This short article first starts from the official Go blog, laying out the slice-related syntax provided by the official documentation. Next, it presents a model for understanding slices using diagrams. Finally, it summarizes and analyzes some special usage scenarios, aiming to provide a clearer profile of slices from multiple perspectives.

If you do not wish to read the lengthy narrative, you can jump directly to the summary at the end.

[Figure: go-slice-view-derive.png]
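As a small taste of the pitfalls mentioned above, the following self-contained example shows how two slices can share one underlying array until an append outgrows the capacity and forces a reallocation.

```go
package main

import "fmt"

func main() {
	base := []int{1, 2, 3, 4, 5}

	// view shares base's underlying array: len 2, cap 4 (from index 1 to the end).
	view := base[1:3]
	view[0] = 20
	fmt.Println(base) // [1 20 3 4 5]: the write is visible through base

	// Appending within capacity still writes into the shared array...
	view = append(view, 30)
	fmt.Println(base) // [1 20 3 30 5]: base[3] was overwritten

	// ...but appending past capacity allocates a new array, detaching view.
	view = append(view, 40, 50, 60)
	view[0] = 200
	fmt.Println(base) // [1 20 3 30 5]: base is no longer affected
}
```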

Read more »

boltdb is one of the few pure-Go, single-node KV stores out there. It is based on Howard Chu’s LMDB project, with a refreshingly clean implementation: stripping away unit tests and adapter code, the core is only about four thousand lines. Simple APIs and a minimalist implementation are exactly what the author intended. Due to limited maintainer bandwidth, the original boltdb has been archived and is no longer updated. If you want to contribute improvements or open new PRs, the etcd-maintained fork bbolt is the recommended place.

For convenience, this series of guided-reading articles still uses the original, now-frozen repo as its foundation. Though small, the project is fully featured: in just over four thousand lines of code it implements a single-node KV engine with a B+ tree index and single-writer/multi-reader transactions. The code itself is plain and well-commented. If you are a Go enthusiast or interested in KV stores, boltdb is absolutely a repo you shouldn’t miss.

This series is planned as three articles, analyzing the boltdb source code around data organization, index design, and transaction implementation. Because these three aspects are not fully orthogonal, the narrative will inevitably intertwine; when you don’t understand something, just skip it for now and come back to sort it out once you have the full picture. This is the second article: boltdb index design.

Overview

There are two commonly used index designs in databases: the B+ tree and the LSM-tree. The B+ tree is a classic; traditional single-node databases like MySQL use B+ tree indexes, which are friendly to fast reads and range queries. The LSM-tree has become popular in recent years; Bigtable, LevelDB, and RocksDB all bear its imprint. As mentioned in a previous article, LSM-trees use a WAL and multi-level data organization, trading some read performance for powerful random-write performance. This, too, is a classic trade-off.

boltdb logically organizes data into buckets. A bucket can be seen as a namespace: a collection of KV pairs, similar to the bucket concept in object storage. Each bucket corresponds to one B+ tree, and since namespaces can be nested, boltdb allows buckets to be nested as well. In the implementation, the child bucket’s root-node page id is stored on a leaf node of the parent bucket to achieve nesting.

Each db file is a set of B+ trees organized in a tree shape. As we know, in a B+ tree, branch nodes are used for lookup and leaf nodes store data.

  1. The top-level B+ tree is special and is called the root bucket; all of its leaf nodes store the page ids of the roots of child-bucket B+ trees.
  2. The other B+ trees can be called data buckets; their leaf nodes may hold ordinary user data, or the page ids of the roots of child-bucket B+ trees.

[Figure: boltdb-buckets-organised.png]
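To make the nesting concrete, here is a short sketch using boltdb's public API (bucket and key names are made up for illustration): creating a bucket inside another bucket is exactly the situation where the child's root page id ends up on a leaf of the parent.

```go
package main

import (
	"log"

	"github.com/boltdb/bolt"
)

// storeNested shows bucket nesting through boltdb's public API:
// the child bucket's B+ tree root page id is recorded on a leaf of the parent bucket.
func storeNested(db *bolt.DB) error {
	return db.Update(func(tx *bolt.Tx) error {
		parent, err := tx.CreateBucketIfNotExists([]byte("users"))
		if err != nil {
			return err
		}
		child, err := parent.CreateBucketIfNotExists([]byte("alice"))
		if err != nil {
			return err
		}
		return child.Put([]byte("email"), []byte("alice@example.com"))
	})
}

func main() {
	db, err := bolt.Open("nested.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := storeNested(db); err != nil {
		log.Fatal(err)
	}
}
```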

Compared to an ordinary B+ tree, boltdb’s B+ tree has several special characteristics:

  1. The number of branches in a node is not a fixed range; instead, it is limited by the total size of the elements it stores, with the upper bound being the page size.
  2. The key stored in a branch node for each branch is the minimum key of the branch it points to.
  3. Leaf nodes are not linked together in a chain.
  4. It does not guarantee that all leaf nodes are at the same level.

In terms of code organization, the boltdb source files related to indexing are as follows:

  1. bucket.go: High-level encapsulation of bucket operations, including CRUD of KV pairs, CRUD of child buckets, and B+ tree split and merge.
  2. node.go: Operations related to the elements stored in a node and relationships between nodes. Adding and deleting elements within a node, loading and persisting, accessing child/sibling elements, and the detailed logic of split and merge.
  3. cursor.go: Implements iterator-like functionality that can walk freely over the leaf nodes of the B+ tree.

This article is divided into three parts, revealing boltdb’s index design from the local structure to the global picture. First, we dissect the basic unit of the tree; then we analyze the bucket traversal implementation; finally, we analyze the tree’s growth and balancing process.

Read more »

boltdb is one of the few pure-Go, single-node KV libraries on the market. boltdb is based on Howard Chu’s LMDB project, with a clean and concise implementation. Excluding unit tests and adapter code, the core code is only about four thousand lines. Simple APIs and a minimalist implementation are exactly what the author intended. Due to the author’s limited bandwidth, the original boltdb has been archived and is no longer updated. If you want to make improvements or submit new PRs, the fork maintained by etcd, bbolt, is recommended.

For convenience, this series of guide articles will still use the original, no-longer-changing repo as the basis. Though small, the project is fully featured: in just over four thousand lines of code, it implements a single-node KV engine based on B+ tree indexing with single-writer, multi-reader transaction support. The code itself is plain and well-commented. If you are a Go enthusiast or interested in KV stores, boltdb is absolutely a repo you should not miss.

This series is planned as three articles, analyzing boltdb’s source code around data organization, index design, and transaction implementation. Since these three aspects are not completely orthogonal, the narrative will inevitably intertwine; if something is unclear, just skip it for now and come back to sort it out once you have the full picture. This article is the first one: boltdb data organization.

Introduction

At its lowest layer, a storage engine is about how it organizes data on the various physical media (e.g., on disk, in memory). These data organization choices also reflect the storage engine’s trade-off philosophy.

At the file-system level, boltdb adopts a page-based organization, aligning everything to pages; in memory, boltdb organizes data as a B+ tree, whose basic unit is the node. An in-memory tree node corresponds to one or more contiguous pages on the file system. These are the only two core abstractions in boltdb’s data organization, making the design remarkably simple. Of course, such simplicity necessarily comes with a cost, which will be analyzed in detail in later articles.

This article first gives an overall explanation of the relationship between nodes and pages, then analyzes the formats of the four page types and their in-memory representations one by one, and finally walks through the DB file growth process and the memory loading strategy across the DB lifecycle.
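As a rough mental model of these two abstractions, here is a simplified sketch in Go. The field names loosely mirror boltdb's internals but are illustrative only, not the project's exact definitions.

```go
package sketch

// Simplified illustration only; field names and layout are not boltdb's exact code.
type pgid uint64

// page is the on-file unit: everything in the db file is aligned to page-size blocks.
type page struct {
	id       pgid
	flags    uint16 // meta / freelist / branch / leaf
	count    uint16 // number of elements stored in this page
	overflow uint32 // number of extra contiguous pages this block spills into
	// ... element headers and key/value bytes follow the header
}

// node is the in-memory unit: one B+ tree node, materialized from a page
// when accessed and written back out to one or more pages on commit.
type node struct {
	isLeaf   bool
	key      []byte  // first key, used by the parent branch node
	pgid     pgid    // page this node was read from / will be written to
	children []*node // loaded child nodes (branch nodes only)
	inodes   []inode // the elements held by this node
}

// inode is one element inside a node: a key plus either a value (leaf node)
// or the page id of a child (branch node).
type inode struct {
	key   []byte
	value []byte
	pgid  pgid
}
```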

Read more »

I had heard about LevelDB long ago. This time, on a whim, I skimmed through the code together with some reference materials, and it truly lives up to its reputation. If you are interested in storage, if you want to use C++ elegantly, or if you want to learn how to architect a project, I recommend taking a look. Not to mention that the authors are Sanjay Ghemawat and Jeff Dean.

If I don’t produce something after reading it, given my memory, I will surely forget it soon. So I wanted to write something about the wonders of LevelDB, but I didn’t want to take the usual path of starting with an architectural overview and ending with module-by-module analysis. While reading the code these past few days, I kept thinking about where to start. By the time I finished, what had impressed me most was actually the various exquisite data structures in LevelDB: fitting the scenario, built from scratch, appropriately tailored, and concise in code. Why not start the LevelDB series with these small corner components?

This series mainly wants to share three classic data structures commonly used in engineering in LevelDB: the Skip List used for fast read/write of memtable, the Bloom Filter used for fast filtering of sstable, and the LRUCache used for partially caching sstable. This is the second post, Bloom Filter.

Introduction

LevelDB is a single-machine KV storage engine, but it does not use a traditional balanced search tree to balance read/write performance. Instead, it uses the LSM-tree structure to organize data, sacrificing some read performance in exchange for higher write throughput. Let’s use a diagram to introduce how LSM-tree is organized across different storage media.

[Figure: leveldb.png]

LevelDB divides data into two main parts, stored in memory and the file system respectively. The main data modules include WAL log, memtable, immutable memtable, and sstable. In order of data flow:

  1. When LevelDB receives a write request put(k, v), it first appends the operation log to the log file (WAL) for recovery in case of unexpected node crashes.
  2. After writing the WAL, LevelDB inserts this kv pair into the lookup structure in memory: memtable.
  3. After memtable accumulates to a certain degree, it rotates into a read-only memtable, i.e., immutable memtable; at the same time, a new memtable is generated for writing.
  4. When memory is under pressure, the immutable memtable is sequentially written to the file system, generating a level0 sstable (sorted strings table) file. This process is called a minor compaction.
  5. Query operations need to traverse the memtable, immutable memtable, and sstables layer by layer. As more and more sstable files are generated, query performance inevitably degrades, so sstables need to be merged from time to time; this is called a major compaction.
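The sketch below restates this write path in Go with hypothetical types (memTable, walWriter, and scheduleFlush are illustrative stand-ins, not LevelDB's actual C++ classes), just to make the WAL-first ordering and the memtable rotation explicit.

```go
package lsm

// memTable and walWriter are hypothetical stand-ins for LevelDB's memtable and log writer.
type memTable interface {
	Put(key, value []byte)
	ApproximateSize() int
}

type walWriter interface {
	AppendRecord(key, value []byte) error
}

type db struct {
	wal           walWriter
	mem           memTable
	imm           memTable // read-only memtable waiting to be flushed
	memSizeLimit  int
	newMemTable   func() memTable
	scheduleFlush func(imm memTable) // minor compaction: imm -> level-0 sstable
}

// Put follows the order in the list above: WAL first, then memtable, then rotate if full.
func (d *db) Put(key, value []byte) error {
	// 1. Append to the write-ahead log so the update survives a crash.
	if err := d.wal.AppendRecord(key, value); err != nil {
		return err
	}
	// 2. Insert into the in-memory lookup structure.
	d.mem.Put(key, value)
	// 3. Rotate the memtable once it grows large enough; the immutable
	//    memtable is flushed to a level-0 sstable in the background.
	if d.mem.ApproximateSize() >= d.memSizeLimit {
		d.imm = d.mem
		d.mem = d.newMemTable()
		d.scheduleFlush(d.imm)
	}
	return nil
}
```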

LevelDB Hierarchical Organization

All sstable files in the file system are logically organized by LevelDB into multiple levels (usually 7 levels), satisfying the following properties:

  1. The larger the level, the earlier its data was written. That is, data is first “placed” in the upper level (minor compaction), and when the upper level is “full” (reaches capacity limit), it “overflows” to the lower level for merging (major compaction).
  2. Each level has a limit on total file size, growing exponentially. For example, the total size limit of level0 is 10MB, level1 is 100MB, and so on. The highest level (level6) has no limit.
  3. Since each sstable file in level0 is directly flushed from memtable, the key ranges of multiple sstable files may overlap. For other levels, multiple sstable files are guaranteed not to overlap through certain rules.

A read operation in LevelDB first searches the memtable and the immutable memtable, and then searches each level in the file system in turn. Compared with writes, reads are clearly less efficient. The phenomenon in which a single client read request turns into multiple read requests inside the system is called read amplification.

To reduce read amplification, LevelDB takes several measures:

  1. Minimize sstable files through major compaction
  2. Use fast filtering methods to quickly determine whether a key is in a certain sstable file

And to quickly determine whether a key is in a key set, LevelDB uses exactly the Bloom filter. Of course, the Bloom filter can only quickly determine that a key is definitely not in a certain sstable, thereby skipping certain sstables during layer-by-layer lookups. The reason will be detailed later; we will leave it here for now.
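To fix ideas before the detailed discussion, here is a minimal Bloom filter sketch in Go. It is a generic textbook version using double hashing over an FNV hash, not LevelDB's actual implementation: Add may set bits already set by other keys, so MayContain can return false positives but never a false negative.

```go
package bloom

import "hash/fnv"

// Filter is a fixed-size bit array probed by k hashes derived from one
// 64-bit hash (double hashing). Purely illustrative, not LevelDB's code.
type Filter struct {
	bits []byte
	k    uint
}

func New(nBits, k uint) *Filter {
	return &Filter{bits: make([]byte, (nBits+7)/8), k: k}
}

func (f *Filter) size() uint64 { return uint64(len(f.bits)) * 8 }

func (f *Filter) hash(key []byte) (h1, h2 uint64) {
	h := fnv.New64a()
	h.Write(key)
	sum := h.Sum64()
	return sum, (sum >> 33) | 1 // h2 made odd so the probe sequence does not collapse
}

// Add sets k bits chosen by the key's hashes.
func (f *Filter) Add(key []byte) {
	h1, h2 := f.hash(key)
	for i := uint(0); i < f.k; i++ {
		bit := (h1 + uint64(i)*h2) % f.size()
		f.bits[bit/8] |= 1 << (bit % 8)
	}
}

// MayContain returns false only if the key was definitely never added;
// true means "possibly present" and the caller still has to check the sstable.
func (f *Filter) MayContain(key []byte) bool {
	h1, h2 := f.hash(key)
	for i := uint(0); i < f.k; i++ {
		bit := (h1 + uint64(i)*h2) % f.size()
		if f.bits[bit/8]&(1<<(bit%8)) == 0 {
			return false
		}
	}
	return true
}
```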

Read more »

[Figure: go-context-tree-construction.png]

Overview

Context is a somewhat unique yet commonly used concept in Go. When used well, it often yields twice the result with half the effort. But when abused without understanding its internals, it becomes “writing new words for forced sorrow”—at best affecting code structure, at worst burying numerous bugs.

Golang constructs Contexts using a tree-like derivation approach, passing deadline and cancel signals across different goroutines [1] to manage the lifecycle of a group of goroutines involved in processing a task, preventing goroutine leaks. It also allows passing/sharing data across an entire request via Values attached to the Context.

Context is most often used to track the lifecycle of long-running, cross-process IO requests such as RPC/HTTP calls, allowing the outer caller to cancel the request manually or automatically, so that all the goroutines spawned for it can exit promptly and release the resources they hold.

Context is essentially a mechanism for propagating signals during tree-like nested API calls. This article will analyze Context from several aspects: interface, derivation, source code analysis, and usage.
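Here is a small, standard-library-only example of that pattern: the caller derives a context with a timeout, and the worker selects on ctx.Done() so it stops as soon as the caller cancels or the deadline fires.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// worker simulates a long-running task that stops as soon as its context is cancelled.
func worker(ctx context.Context, results chan<- string) {
	select {
	case <-time.After(3 * time.Second): // pretend the real work takes 3s
		results <- "finished"
	case <-ctx.Done(): // cancelled or timed out by the caller
		results <- fmt.Sprintf("aborted: %v", ctx.Err())
	}
}

func main() {
	// The request as a whole is only allowed 1 second.
	ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second)
	defer cancel() // always release the context's resources

	results := make(chan string, 1)
	go worker(ctx, results)

	fmt.Println(<-results) // prints "aborted: context deadline exceeded"
}
```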

Read more »

I had heard of LevelDB long ago. This time, on a whim, I went through the code roughly with some reference materials, and it truly lives up to its reputation. If you are interested in storage, want to use C++ elegantly, or want to learn how to architect a project, I highly recommend studying it. Not to mention that the authors are Sanjay Ghemawat and Jeff Dean.

If I don’t produce something after reading it, given my memory, I will surely forget it soon. So I wanted to write something about the beauty of LevelDB, but I didn’t want to take the usual route, starting with an architectural overview and ending with module analysis. While reading the code, I kept thinking about where to start. Just as I was finishing, what impressed me most was actually the various ingenious data structures in LevelDB: tailored to the scenario, built from scratch, appropriately trimmed, and precisely coded. Why not start the LevelDB series with these small corner components?

This series mainly wants to share three classic data structures commonly used in engineering in LevelDB: the Skip List used for fast read/write of memtable, the Bloom Filter used for fast filtering of sstable, and the LRUCache used for partial caching of sstable. This is the first article, on the Skip List.

Requirements

LevelDB is a single-machine KV storage engine. A KV engine essentially provides only three interfaces over data entries (key, val): Put(key, val), Get(key) val, and Delete(key). In its implementation, when LevelDB receives a delete request it does not actually remove the data; instead it writes a special tombstone mark for that key, so that subsequent reads treat the key as absent. This converts Delete into Put and reduces the three interfaces to two. After this simplification, what remains is a trade-off between Put and Get performance. LevelDB’s choice is to sacrifice some Get performance in exchange for powerful Put performance, and then optimize Get as much as possible.
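A tiny sketch of that trick, with a hypothetical generic engine interface rather than LevelDB's actual code: Delete simply writes a tombstone value, and Get reports a tombstoned key as not found.

```go
package kv

import (
	"bytes"
	"errors"
)

var ErrNotFound = errors.New("key not found")

// tombstone marks a logically deleted key; a real engine would use a flag in
// the record header rather than a magic value.
var tombstone = []byte{0xFF, 'd', 'e', 'l'}

// engine is a hypothetical two-method store that only knows Put and Get.
type engine interface {
	Put(key, value []byte) error
	Get(key []byte) ([]byte, error)
}

type store struct{ e engine }

// Delete is just a Put of the tombstone marker.
func (s *store) Delete(key []byte) error {
	return s.e.Put(key, tombstone)
}

// Get reports a tombstoned key as absent.
func (s *store) Get(key []byte) ([]byte, error) {
	v, err := s.e.Get(key)
	if err != nil {
		return nil, err
	}
	if bytes.Equal(v, tombstone) {
		return nil, ErrNotFound
	}
	return v, nil
}
```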

As we know, in the memory hierarchy, memory access is much faster than disk access, so LevelDB made the following design choices to achieve its goals:

  1. Write (Put): Make all writes happen in memory, and then flush to disk in batches when a certain size is reached.
  2. Read (Get): As data is continuously written, there will be a small portion of data in memory and the remaining majority on disk. When reading, if it’s not found in memory, it needs to be looked up on disk.

To ensure write performance while optimizing read performance, the in-memory storage structure needs to support efficient insertion and lookup simultaneously.

When I first heard of LevelDB, my most natural guess was that the in-memory structure (memtable) would be a self-balancing binary search tree, such as a red-black tree or an AVL tree, which guarantees O(log n) time complexity for both insertion and lookup. Only after reading the source code did I learn that it uses a skip list. Compared with balanced trees, the advantage of a skip list is that it greatly simplifies the implementation while still guaranteeing read/write performance.

In addition, to periodically dump data to disk, the data structure also needs to support efficient sequential traversal. To summarize, the requirements for LevelDB’s in-memory data structure (memtable) are:

  1. Efficient lookup
  2. Efficient insertion
  3. Efficient sequential traversal
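To make the comparison concrete, below is a compact single-threaded skip list sketch in Go. It is illustrative only; LevelDB's real memtable skip list is written in C++ and adds an arena allocator plus lock-free concurrent reads on top of the same idea.

```go
package memtable

import "math/rand"

const maxLevel = 12

// skipNode keeps one forward pointer per level it participates in.
type skipNode struct {
	key, value string
	next       [maxLevel]*skipNode
}

// SkipList is a minimal single-threaded sketch of the data structure.
type SkipList struct {
	head  *skipNode
	level int // highest level currently in use
}

func New() *SkipList {
	return &SkipList{head: &skipNode{}, level: 1}
}

// randomLevel flips coins: each extra level has probability 1/4, as in LevelDB.
func randomLevel() int {
	lvl := 1
	for lvl < maxLevel && rand.Intn(4) == 0 {
		lvl++
	}
	return lvl
}

// findPrevs returns, for every level, the last node whose key is < key.
func (s *SkipList) findPrevs(key string) [maxLevel]*skipNode {
	var prevs [maxLevel]*skipNode
	x := s.head
	for i := s.level - 1; i >= 0; i-- {
		for x.next[i] != nil && x.next[i].key < key {
			x = x.next[i]
		}
		prevs[i] = x
	}
	return prevs
}

// Put inserts or overwrites a key in expected O(log n) time.
func (s *SkipList) Put(key, value string) {
	prevs := s.findPrevs(key)
	if n := prevs[0].next[0]; n != nil && n.key == key {
		n.value = value // overwrite in place
		return
	}
	lvl := randomLevel()
	for i := s.level; i < lvl; i++ {
		prevs[i] = s.head // new levels start from the head
	}
	if lvl > s.level {
		s.level = lvl
	}
	n := &skipNode{key: key, value: value}
	for i := 0; i < lvl; i++ {
		n.next[i] = prevs[i].next[i]
		prevs[i].next[i] = n
	}
}

// Get looks a key up in expected O(log n) time.
func (s *SkipList) Get(key string) (string, bool) {
	prevs := s.findPrevs(key)
	if n := prevs[0].next[0]; n != nil && n.key == key {
		return n.value, true
	}
	return "", false
}

// Scan visits all entries in key order by walking the bottom level.
func (s *SkipList) Scan(visit func(key, value string)) {
	for n := s.head.next[0]; n != nil; n = n.next[0] {
		visit(n.key, n.value)
	}
}
```

All three requirements above are met by this structure: Put and Get run in expected O(log n), and Scan walks the bottom-level linked list in key order for sequential traversal.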
Read more »