It’s been a long time since my last post. After spending a month on transaction-related materials and sharing, I’m using this article to wrap things up today. Hopefully it can provide some inspiration.
When it comes to database transactions, most people's first reaction is ACID. These properties, however, differ in importance and complexity, and the hardest to understand is Isolation (the I). One important reason for the confusion is that, historically, the definitions of isolation levels have been tightly coupled to their implementations, and different vendors' terminology and implementations often diverge from the standard and from each other. This article tries to sort out the common isolation levels from the perspective of locks, using deliberately loose descriptions to build an intuitive, perceptual understanding.
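To make the lock perspective concrete, here is a toy Go sketch (my own construction, not taken from any real engine) of a lock table in which shared locks are compatible with each other while exclusive locks conflict with everything. Holding read locks until commit prevents other writers from sneaking in between two reads; releasing a read lock right after reading degrades to read-committed behavior:

```go
package main

import "fmt"

// mode of a lock
type mode int

const (
	S mode = iota // shared (read) lock
	X             // exclusive (write) lock
)

// lockTable tracks which transactions hold which lock on each key.
type lockTable struct {
	holders map[string]map[int]mode // key -> txn id -> mode
}

func newLockTable() *lockTable {
	return &lockTable{holders: map[string]map[int]mode{}}
}

// acquire returns true if txn can take the lock now: S is compatible
// with S, but X conflicts with anything held by another transaction.
func (lt *lockTable) acquire(txn int, key string, m mode) bool {
	for other, held := range lt.holders[key] {
		if other == txn {
			continue
		}
		if m == X || held == X {
			return false // conflict: the caller would have to wait
		}
	}
	if lt.holders[key] == nil {
		lt.holders[key] = map[int]mode{}
	}
	lt.holders[key][txn] = m
	return true
}

func (lt *lockTable) release(txn int, key string) {
	delete(lt.holders[key], txn)
}

func main() {
	// Long read locks: T1 reads x and holds the S lock until commit,
	// so T2's write (X lock) must wait, and T1's re-read stays stable.
	lt := newLockTable()
	lt.acquire(1, "x", S)
	fmt.Println("T2 can write while T1 holds S:", lt.acquire(2, "x", X)) // false

	// Short read locks: T1 releases S right after reading, so T2 can
	// write x, and T1 may see a different value on re-read.
	lt.release(1, "x")
	fmt.Println("T2 can write after T1 releases S:", lt.acquire(2, "x", X)) // true
}
```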
Dissecting the Storage Layer of AlloyDB, Google's Cloud-Native PostgreSQL-Compatible Database
At the Google I/O 2022 conference, Google Cloud released AlloyDB, a cloud-native database compatible with the PostgreSQL standard (note: “Alloy” means an alloy of metals). It claims to be twice as fast as Amazon’s competing product (Aurora?), a slogan that probably isn’t enough to convince existing users to migrate, but does hold some appeal for new users.
Since the author mainly works on storage, this article will analyze the AlloyDB storage layer architecture based on Google’s blog post introducing it, to see what highlights its design has to offer.
Overall Architecture
Overall, AlloyDB is divided into a Database layer and a Storage layer. The DB layer is responsible for being compatible with the PostgreSQL protocol, parsing SQL statements, and converting them into read/write requests sent to the storage layer. The storage layer can be further subdivided into three layers:
- log storage layer: the DB layer converts writes into operation logs (WAL) and ships them to the storage layer; log storage is responsible for writing and persisting these log records efficiently.
- LPS layer: the Log Processing Service (LPS) consumes WAL from the log storage layer and generates blocks; essentially, this is a materialization process.
- block storage layer: the counterpart of a standalone PostgreSQL instance's block layer; it serves queries, provides parallelism through sharding, and guarantees cross-zone fault tolerance through replication.
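The write path through the three layers can be sketched as a toy pipeline. All names here (`walRecord`, `logStore`, `lps`) are my own placeholders, not AlloyDB internals, and real WAL records are physical/physiological rather than simple key-value pairs:

```go
package main

import "fmt"

// walRecord is a simplified log record produced by the DB layer:
// "set page p's contents to v".
type walRecord struct {
	page  int
	value string
}

// logStore durably appends WAL (here: just an in-memory slice).
type logStore struct{ records []walRecord }

func (l *logStore) append(r walRecord) { l.records = append(l.records, r) }

// lps (Log Processing Service) consumes WAL and materializes blocks.
type lps struct{ blocks map[int]string }

func (s *lps) replay(l *logStore, upto int) {
	for _, r := range l.records[:upto] {
		s.blocks[r.page] = r.value // materialization: apply log to block
	}
}

func main() {
	wal := &logStore{}
	wal.append(walRecord{page: 1, value: "v1"})
	wal.append(walRecord{page: 1, value: "v2"})

	svc := &lps{blocks: map[int]string{}}
	svc.replay(wal, len(wal.records))
	// Block storage now serves reads of the materialized page.
	fmt.Println(svc.blocks[1]) // v2
}
```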
Meta's Chain-Replication Object Store — Delta
Source: https://engineering.fb.com/2022/05/04/data-infrastructure/delta/
TL;DR: Someone in a group chat shared a new post from the Meta Engineering Blog about a highly available, strongly consistent, chain-replicated object storage system. Having worked on object storage myself, and having previously written about Facebook's small-file storage systems Haystack (batching) and F4 (hot/cold separation, erasure coding), I was immediately curious to see what new insights Meta had to offer this time.
After reading it, I found it quite interesting. Below is a brief summary of the key points; interested readers can check out the original blog post and a related video featuring a Chinese engineer’s explanation.
What is Delta?
Delta is a simple, reliable, scalable, and low-dependency object storage system that provides only four basic operations: put, get, delete, and list. In its architectural design, Delta trades off read/write latency and storage efficiency for simplicity and reliability.
Delta is not:
- A general-purpose storage system. Delta only aims for elasticity, reliability, and minimal dependencies.
- A file system. Delta is simply an object store and does not provide POSIX semantics.
- A system optimized for storage efficiency. Delta does not optimize for storage efficiency, latency, or throughput; instead, it focuses on simplicity and elasticity.
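The four-operation surface is small enough to write down as an interface. The in-memory implementation below is purely illustrative; in the real system each write is replicated down a chain of storage nodes before it is acknowledged:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// ObjectStore captures Delta's minimal surface: put, get, delete, list.
type ObjectStore interface {
	Put(key string, value []byte) error
	Get(key string) ([]byte, error)
	Delete(key string) error
	List(prefix string) ([]string, error)
}

// memStore is an in-memory stand-in for the replicated chain.
type memStore struct{ data map[string][]byte }

func newMemStore() *memStore { return &memStore{data: map[string][]byte{}} }

func (m *memStore) Put(key string, value []byte) error {
	m.data[key] = append([]byte(nil), value...) // copy: callers may reuse the slice
	return nil
}

func (m *memStore) Get(key string) ([]byte, error) {
	v, ok := m.data[key]
	if !ok {
		return nil, fmt.Errorf("not found: %s", key)
	}
	return v, nil
}

func (m *memStore) Delete(key string) error {
	delete(m.data, key)
	return nil
}

func (m *memStore) List(prefix string) ([]string, error) {
	var keys []string
	for k := range m.data {
		if strings.HasPrefix(k, prefix) {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	return keys, nil
}

func main() {
	var s ObjectStore = newMemStore()
	s.Put("photos/a", []byte("1"))
	s.Put("photos/b", []byte("2"))
	keys, _ := s.List("photos/")
	fmt.Println(keys) // [photos/a photos/b]
}
```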
Peeling Back the Layers of Database Storage
I’ve been working on databases for some time now. Recently, some students asked me what the storage layer of a distributed database actually does in practice. I gave them a brief answer, but figured others might have the same question, so I decided to summarize it here in the hope of sparking better answers. Given my limited experience, there may be mistakes; feedback is welcome.
Note: To scope the discussion, we will focus on distributed databases with separated storage and compute and a shared-nothing architecture, and only discuss the storage layer.
The storage layer involves a vast and complex set of components. To explain it clearly, we need a suitable angle. The most essential function of a database is to store data and provide external interfaces for querying and writing data. So, let’s first trace the various modules along these two lines, and then supplement with some components that don’t fall neatly into either.
DDIA Reading Notes (Chapter 4): Encoding and Evolution
The DDIA reading group shares chapter by chapter; I supplement the book with details drawn from my industry experience in distributed storage and databases. Sharing happens roughly every two weeks; welcome to join. The schedule and all transcripts are here. We have a corresponding distributed systems & database discussion group, and notifications are sent in the group before each session. To join, add my WeChat: qtmuniao, briefly introduce yourself, and note: Distributed Systems Group.
Chapter 3 discussed storage engines; this chapter continues to dig deeper, exploring encoding-related issues.
Any place involving cross-process communication requires data to be encoded (serialized), because persistent storage and network transmission are both byte-stream-oriented. Serialization is essentially a "dimensionality reduction" operation: it flattens high-dimensional in-memory data structures into a one-dimensional byte stream, so that the underlying hardware and protocols only need to handle one-dimensional information.
DDIA Reading Notes (3): TP, AP, and Column Storage
Transactional or Analytical
The "OL" (online) in OLTP/OLAP mainly refers to interactive queries.
The term "transaction" has historical origins: early database users were mostly handling commercial dealings, such as buying, selling, and paying salaries. As database applications expanded, the noun "transaction" simply stuck.
Transactions don’t necessarily have ACID properties. Transactional processing is mostly random reads and writes with low latency. In contrast, analytical processing is mostly periodic batch processing with higher latency.
DDIA Reading Notes (Chapter 3): B-Trees and LSM-Trees
Chapter 2 discussed the upper-level abstractions: data models and query languages.
This chapter goes a bit deeper, focusing on how databases handle queries and storage at the low level. There is a logical chain here:
Use case → Query type → Storage format.
DDIA Reading Notes (Chapter 2): Data Models and Query Languages
Overview
This section revolves around two main concepts.
How to analyze a data model:
- Basic considerations: the fundamental elements of data, and the relationships between elements (one-to-many, many-to-many).
- Compare several common models: the (most popular) relational model, the (tree-like) document model, and the (highly flexible) graph model.
- Schema patterns: strong schema (write-time constraints); weak schema (read-time parsing).
How to evaluate query languages:
- How they relate to and match the data model.
- Declarative vs. imperative.
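DDIA illustrates the declarative/imperative contrast with a shark-filtering example; here is a rough Go rendition of the imperative side, with the declarative SQL equivalent in a comment. (The data and names are made up for illustration.)

```go
package main

import "fmt"

type animal struct {
	name   string
	family string
}

// Imperative: spell out *how* to walk the data, step by step.
func sharksImperative(animals []animal) []string {
	var out []string
	for _, a := range animals {
		if a.family == "Sharks" {
			out = append(out, a.name)
		}
	}
	return out
}

func main() {
	animals := []animal{{"great white", "Sharks"}, {"dolphin", "Cetacea"}}

	// Declarative (SQL) states *what* is wanted and leaves the access
	// path to the query engine:
	//   SELECT name FROM animals WHERE family = 'Sharks';
	fmt.Println(sharksImperative(animals)) // [great white]
}
```

The declarative form gives the engine freedom to pick indexes, reorder work, or parallelize, which is a big part of why SQL has aged so well.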
Distributed Execution of SQL in CockroachDB and TiDB
Computation pushdown is actually a common idea: push computation to where the data resides. In databases, logically, computation usually sits above the storage layer, so pushing some operators down to the storage layer is called computation pushdown. This is especially important in distributed databases.
Below are the solutions from CockroachDB and TiDB, based on documentation and blog posts, so they may not be consistent with the logic in the latest code.
CockroachDB
Basic Concepts
The corresponding module in CockroachDB is called DistSQL, whose idea originates from Sawzall, somewhat similar to MapReduce. The supported operators are called aggregators, which are essentially a generalization of SQL aggregation operators.
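The flavor of such pushdown can be sketched with a distributed AVG: each shard computes a partial (sum, count) locally, and only those small partials cross the network to be merged. This is a generic MapReduce-style sketch of the idea, not CockroachDB's actual DistSQL code:

```go
package main

import "fmt"

// partial is what a storage node ships back after aggregating its own
// shard: only (sum, count) crosses the network, never the raw rows.
type partial struct {
	sum   int
	count int
}

// localAggregate is the "pushed down" part; it runs where the data lives.
func localAggregate(shard []int) partial {
	p := partial{}
	for _, v := range shard {
		p.sum += v
		p.count++
	}
	return p
}

// mergeAvg runs at the gateway, combining partials into the final AVG.
func mergeAvg(parts []partial) float64 {
	total := partial{}
	for _, p := range parts {
		total.sum += p.sum
		total.count += p.count
	}
	return float64(total.sum) / float64(total.count)
}

func main() {
	shards := [][]int{{1, 2, 3}, {4, 5}} // data split across two nodes
	var parts []partial
	for _, s := range shards {
		parts = append(parts, localAggregate(s)) // conceptually runs per node
	}
	fmt.Println(mergeAvg(parts)) // 3 (average of 1..5)
}
```

Note that AVG needs (sum, count) rather than per-shard averages, which is exactly the kind of decomposition an aggregator framework has to know about.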
DDIA Reading Notes (Chapter 1): Reliability, Scalability, Maintainability
Chapter 1 is easily skipped because it is concept-heavy and tends to speak in generalities. Yet the three concepts it introduces are indeed unavoidable focal points when building systems.
P.S. The open-source Chinese translation is not very idiomatic in places, which can make for uncomfortable reading—though this is unavoidable in any translation.
Writing Good Code: Naming — Scrutinize Every Name
When Jia Dao first went to the capital for the imperial examination, one day on his donkey he composed the line: "Birds roost by the pond-side tree, a monk knocks at the moonlit door." At first he wanted to use "push," then "knock," and after much deliberation could not decide. So he kept chanting the line on his donkey, occasionally reaching out to mime pushing and knocking.
—— Song · Hu Zi, Tiaoxi Yuyin Conghua, Vol. 19, quoting Liu Gong Jiahua
Naming is the most crucial thing in coding; to a program, it is like a face to a person. Good naming clearly conveys the intent of the code and can even carry a rhythmic beauty. Sloppy, careless naming, by contrast, leaves the reader lost in a fog, makes the code a chore to read, and drains the maintainer's energy again and again. Worse, a chaotic naming scheme can effortlessly hide bugs and cause harm far and wide.
Therefore, when we write code, it is necessary to spend a little time scrutinizing key names — for the convenience of others and ourselves. The longer the lifespan of a project, the more obvious the benefits. So how should we scrutinize? Let me share some of my thoughts, drawing on my experience in writing code, reading code, and reviewing code.
I’ve been writing more Go recently, so the examples use Go, but they are all pseudocode; some examples do not strictly follow the syntax. Moreover, most examples are made up on the spot, so they may not be particularly apt.
Cuckoo Hash and Cuckoo Filter
Introduction
The essence of hashing is mapping from a larger space to a smaller one. Therefore, after inserting enough data, by the pigeonhole principle there will inevitably be position conflicts. Common hash tables (hash tables or dictionaries) handle conflicts through chaining, open addressing, and other methods. Cuckoo hashing, in its basic form using one slot per bucket and multiple hash functions, is a hash table that resolves conflicts with open addressing; the difference is that when a conflict occurs, instead of probing linearly for a new position, it uses additional hash functions to find one.
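A minimal Go sketch of the eviction ("kick the egg out of the nest") loop, with one slot per bucket and two FNV-derived hash functions. Real implementations use multi-way buckets and grow or rehash the table when an insertion chain fails; here we simply give up after a bounded number of kicks:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const (
	numBuckets = 16
	maxKicks   = 32 // in practice, resize/rehash after this many evictions
)

// two candidate buckets per key, from two "independent" hash functions
func h1(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32()) % numBuckets
}

func h2(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	h.Write([]byte{0xff}) // salt to derive a second function
	return int(h.Sum32()) % numBuckets
}

type cuckoo struct {
	slots [numBuckets]*string // one slot per bucket
}

// insert places key in one of its two buckets; on conflict it evicts the
// current occupant and relocates it to its alternate bucket, repeatedly.
func (c *cuckoo) insert(key string) bool {
	for _, b := range []int{h1(key), h2(key)} {
		if c.slots[b] == nil {
			k := key
			c.slots[b] = &k
			return true
		}
	}
	// Both candidate buckets are occupied: start the eviction loop.
	cur, b := key, h1(key)
	for i := 0; i < maxKicks; i++ {
		if c.slots[b] == nil {
			k := cur
			c.slots[b] = &k
			return true
		}
		evicted := *c.slots[b]
		k := cur
		c.slots[b] = &k
		cur = evicted
		// Send the evicted key to its *other* candidate bucket.
		if h1(cur) == b {
			b = h2(cur)
		} else {
			b = h1(cur)
		}
	}
	return false // table too full: a real implementation would resize
}

// lookup only ever probes two buckets, so reads are O(1) worst case.
func (c *cuckoo) lookup(key string) bool {
	for _, b := range []int{h1(key), h2(key)} {
		if c.slots[b] != nil && *c.slots[b] == key {
			return true
		}
	}
	return false
}

func main() {
	c := &cuckoo{}
	for _, k := range []string{"a", "b", "c", "d"} {
		if !c.insert(k) {
			fmt.Println("table full, would resize")
		}
	}
	fmt.Println(c.lookup("zzz")) // false: absent keys are never reported
}
```

The worst-case constant-time lookup (exactly two probes) is what makes this structure attractive, and the same two-choice idea underlies the cuckoo filter.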