I’ve been working on databases for some time now. Recently, some students asked me what the storage layer of a distributed database actually does in practice. I gave them a brief answer, but thought others might have the same question, so I decided to summarize it here, in the hope that it draws out better answers from people more knowledgeable than me. Given my limited experience, there may be mistakes; feedback is welcome.
Note: To scope the discussion, we will focus on distributed databases with separated storage and compute in a shared-nothing architecture, and only discuss the storage layer.
The storage layer involves a vast and complex set of components. To explain it clearly, we need a suitable angle. The most essential function of a database is to store data and provide external interfaces for querying and writing data. So, let’s first trace the various modules along these two lines, and then supplement with some components that don’t fall neatly into either.
Author: 木鸟杂记 https://www.qtmuniao.com/2022/05/04/distributed-database-storage-components Please indicate the source when reposting.
Query
Query requests entering the storage layer generally arrive as pushed-down execution plans, which are then translated into point queries and range queries against the underlying storage engine. To accelerate queries, a cache layer is usually placed on top of the storage engine. And for each storage node to handle a large number of concurrent requests, IO optimization is needed.
Execution Plan
This is the entry point of the storage layer, the interface exposed by the storage layer to the query layer.
After a query statement goes through several steps in the query layer—syntax analysis (Parser), semantic checking (Validator), plan generation (Planner), plan optimization (Optimizer), and execution (Executor)—the operators that need to be pushed down to the storage layer are dispatched to the nodes where the corresponding partitions (Partition) reside.
For the volcano model, we can understand the execution plan as a DAG composed of basic operators (Executors), or even more simply, as a tree. Some small subtrees in the lower layers of this tree can be pushed directly down to the corresponding storage-layer nodes for execution. The operators that are commonly pushed down include TableScan, Filter, Project, Limit, TopN, and so on.
After the storage layer receives these execution plans, it deserializes them and organizes them into in-memory execution plans, using the iterator model or vectorized model to perform scan, filter, sort, projection, aggregation, and other operations on the data, then returns the result set to the query layer.
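To make the iterator model concrete, here is a minimal sketch in Go. All names here are illustrative, not any real engine’s API: each operator implements Next(), and a Filter pulls rows from the TableScan below it, one at a time.

```go
package main

import "fmt"

// Row is a simplified tuple; real engines carry typed columns.
type Row []interface{}

// Operator is the volcano-model interface: each call to Next
// pulls one row from the operator tree, or false when exhausted.
type Operator interface {
	Next() (Row, bool)
}

// TableScan iterates over an in-memory "table".
type TableScan struct {
	rows []Row
	pos  int
}

func (s *TableScan) Next() (Row, bool) {
	if s.pos >= len(s.rows) {
		return nil, false
	}
	r := s.rows[s.pos]
	s.pos++
	return r, true
}

// Filter passes through only rows matching its predicate.
type Filter struct {
	input Operator
	pred  func(Row) bool
}

func (f *Filter) Next() (Row, bool) {
	for {
		r, ok := f.input.Next()
		if !ok {
			return nil, false
		}
		if f.pred(r) {
			return r, true
		}
	}
}

func main() {
	scan := &TableScan{rows: []Row{{1, "a"}, {2, "b"}, {3, "c"}}}
	// A plan fragment pushed to a storage node: Filter(id > 1) -> TableScan.
	plan := &Filter{input: scan, pred: func(r Row) bool { return r[0].(int) > 1 }}
	for r, ok := plan.Next(); ok; r, ok = plan.Next() {
		fmt.Println(r)
	}
}
```

The vectorized model has the same shape, except Next() returns a batch of rows (or column vectors) instead of a single row, amortizing the per-call overhead.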
Result sets can be returned in several ways:
- Full return in one go
- Streaming return
- Paged return
Computation pushdown has many benefits:
- It fully utilizes the distributed nodes of the storage layer for pre-computation.
- It reduces bandwidth consumption for data transfer from the storage layer to the query layer.
- It raises the query layer’s processing speed and the ceiling on the data volume it can handle.
Cache
To optimize queries, for read-heavy and write-light scenarios, a cache layer is usually placed on top of the storage engine. If the architecture uses a shared storage layer—for example, the storage layer is in the cloud—then the cache layer is essential.
When designing the cache, two main aspects need to be considered: cache granularity and lifecycle.
- Cache granularity. To maintain consistency between the cache and the backend data, locking is inevitably required, and cache granularity is closely related to locking granularity. Whether caches of different Partitions on the same node should share a single cache pool is also a question that cache granularity needs to address.
- Lifecycle. When to write to the backend and when to invalidate the cache involves cache control policies (synchronous read/write-through versus asynchronous updates), all of which need to be determined by actual requirements; one such policy is sketched after this list.
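As an illustration of one lifecycle policy, here is a minimal read-through, write-through cache sketch in Go. The Backend interface and all names are made up for illustration; a real cache would also need eviction and finer-grained locking.

```go
package cache

import "sync"

// Backend is a stand-in for the storage engine below the cache.
type Backend interface {
	Get(key string) (string, error)
	Put(key, value string) error
}

// ReadThroughCache fills entries from the backend on a miss and
// writes through synchronously, keeping cache and backend consistent.
// A single mutex means coarse lock granularity; per-shard locks
// would reduce contention at the cost of complexity.
type ReadThroughCache struct {
	mu      sync.Mutex
	entries map[string]string
	backend Backend
}

func NewReadThroughCache(b Backend) *ReadThroughCache {
	return &ReadThroughCache{entries: make(map[string]string), backend: b}
}

func (c *ReadThroughCache) Get(key string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.entries[key]; ok {
		return v, nil // cache hit
	}
	v, err := c.backend.Get(key) // miss: read through to the backend
	if err != nil {
		return "", err
	}
	c.entries[key] = v
	return v, nil
}

func (c *ReadThroughCache) Put(key, value string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if err := c.backend.Put(key, value); err != nil { // write through
		return err
	}
	c.entries[key] = value // keep the cached entry consistent
	return nil
}
```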
RPC IO Optimization
Any service faces the same problem: when a large number of requests arrive, various means such as thread pools, asynchronous execution, and coroutines must be used to improve concurrency, thereby increasing throughput and reducing latency.
Some RPC frameworks can solve these problems. For example, some RPC frameworks have built-in coroutine models, supporting M:N models, work stealing, and so on. If the RPC framework doesn’t handle this, additional thread pool libraries, async libraries (promise, future), and coroutine libraries are needed to manually control the concurrent execution flow of requests.
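If the framework leaves concurrency to you, a common fallback in Go is a bounded worker pool; here is a minimal sketch, with the request and handler types as placeholders.

```go
package main

import (
	"fmt"
	"sync"
)

// Request is a placeholder for a deserialized storage-layer RPC.
type Request struct{ ID int }

// handle is a placeholder for the actual request processing.
func handle(r Request) string {
	return fmt.Sprintf("done %d", r.ID)
}

func main() {
	const workers = 4 // bound concurrency instead of one goroutine per request
	reqs := make(chan Request)
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for r := range reqs {
				fmt.Println(handle(r))
			}
		}()
	}

	for i := 0; i < 10; i++ {
		reqs <- Request{ID: i}
	}
	close(reqs)
	wg.Wait()
}
```

Bounding the pool size protects downstream resources (disk, memory) from load spikes; an unbounded goroutine-per-request model trades that safety for simplicity.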
Write
Distributed systems generally use multiple replicas to store data. During writes, to ensure that all replicas see a consistent write order, consensus algorithms are introduced. Consensus algorithms usually maintain a logically endless operation log, and each replica applies this log to its local state machine, that is, the storage engine. When writing, user data must first be encoded into binary strings before going into the storage engine. For scenarios with strict consistency requirements (distinct from consistency among replicas; here it means consistency across concurrently executing statements), the database needs to give users atomicity guarantees for groups of statements, i.e., distributed transactions.
Consensus Algorithm
For shared-nothing architectures, to ensure high availability, multiple replicas (Replication) are used and placed on different machines according to the desired fault-tolerance threshold. Using multiple replicas naturally introduces the problem of data consistency across replicas, which is generally solved with consensus algorithms (Raft, Multi-Paxos).
Using a consensus algorithm, a consistent multi-machine operation log (WAL) can be maintained for each data partition: all write operations are serialized into log records and appended in the same unique order on every replica. With a consistent operation log, each replica can then apply it, keyed by log ID, to its local state machine (i.e., the storage engine) to provide a consistent read-write view externally.
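The "log, then apply" loop can be sketched as follows. This is a toy model, not a real Raft implementation; LogEntry and the state machine interface are invented for illustration.

```go
package consensus

// LogEntry is one record in the replicated operation log.
type LogEntry struct {
	Index uint64 // unique, monotonically increasing log ID
	Op    []byte // serialized write operation
}

// StateMachine is the local storage engine the log is applied to.
type StateMachine interface {
	Apply(index uint64, op []byte) error
}

// applyLoop consumes entries the consensus layer has committed,
// in log order, and applies them to the local state machine.
// Because every replica applies the same entries in the same
// order, all replicas converge to the same state.
func applyLoop(committed <-chan LogEntry, sm StateMachine, lastApplied uint64) error {
	for e := range committed {
		if e.Index <= lastApplied {
			continue // already applied; skip duplicates after restart
		}
		if err := sm.Apply(e.Index, e.Op); err != nil {
			return err
		}
		lastApplied = e.Index
	}
	return nil
}
```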
Storage Engine
This refers to the single-node storage engine, that is, the state machine mentioned above. The problem it solves is how to organize data within a single machine’s storage system, using the least amount of space to efficiently handle writes and reads for specific scenarios. It is generally divided into several sub-modules: data encoding, index organization, concurrency control, and so on.
Storage engines mainly fall into two schools: the in-place-update B-Tree school and the append-only LSM-Tree school. Two recommended projects for learning: for B-Tree, check out BoltDB; for LSM-Tree, check out LevelDB. In practice, though, more complex and powerful variants such as RocksDB are used.
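For a hands-on feel of the B-Tree school, here is a minimal example using bbolt (the maintained fork of BoltDB); the file, bucket, and key names are arbitrary.

```go
package main

import (
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Open (or create) a single-file B+Tree database.
	db, err := bolt.Open("demo.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Writes run in a read-write transaction.
	err = db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("users"))
		if err != nil {
			return err
		}
		return b.Put([]byte("u1"), []byte("alice"))
	})
	if err != nil {
		log.Fatal(err)
	}

	// Reads run in a read-only transaction.
	err = db.View(func(tx *bolt.Tx) error {
		v := tx.Bucket([]byte("users")).Get([]byte("u1"))
		fmt.Printf("u1 = %s\n", v)
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```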
For AP (analytical processing) scenarios, columnar storage is generally used, which makes data compression and vectorized computation more convenient.
Data Encoding
Data encoding/decoding solves the problem of how to efficiently (low latency, small footprint) encode a logical record (such as a Row in a relational database) into a binary string for writing to the storage engine.
During encoding, the correspondence with the Schema (what fields the row has, what the field types are) needs to be considered, as well as how to ensure backward compatibility when reading data after Schema changes (adding fields, deleting fields, changing field types).
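As one illustrative strategy (loosely protobuf-like, not any real database’s format), tagging each value with a field ID instead of relying on field position is what makes adding and removing fields backward compatible: old readers skip IDs they don’t know, and new readers fill in defaults for missing ones. A sketch:

```go
package encoding

import "encoding/binary"

// AppendField encodes one field as (fieldID, length, bytes).
func AppendField(buf []byte, fieldID uint32, value []byte) []byte {
	buf = binary.AppendUvarint(buf, uint64(fieldID))
	buf = binary.AppendUvarint(buf, uint64(len(value)))
	return append(buf, value...)
}

// DecodeRow parses the buffer back into a fieldID -> bytes map.
// Bounds and error handling are omitted for brevity.
func DecodeRow(buf []byte) map[uint32][]byte {
	fields := make(map[uint32][]byte)
	for len(buf) > 0 {
		id, n := binary.Uvarint(buf)
		buf = buf[n:]
		l, n := binary.Uvarint(buf)
		buf = buf[n:]
		fields[uint32(id)] = buf[:l]
		buf = buf[l:]
	}
	return fields
}
```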
Distributed Transaction
One of the most important functions of a database is its guarantee of transactions. By leveraging the various guarantees of the transaction model (ACID), the complexity of using the database on the user side can be greatly reduced. Of course, this usually comes at the cost of performance, which is especially pronounced in distributed databases.
There are many industry solutions for ensuring atomicity and isolation of distributed transactions. The most basic framework is two-phase commit paired with a global clock (physical clocks, logical clocks, hybrid clocks, TSO, and so on; a big topic in its own right). A classic example is Google’s Percolator model.
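A skeleton of the two-phase commit framework, heavily simplified: Participant is a made-up interface, and a real system must also handle coordinator failure, timeouts, and recovery by persisting the commit decision.

```go
package txn

// Participant is one storage node holding part of the transaction's writes.
type Participant interface {
	Prepare(txnID uint64) error // phase 1: lock and persist intents
	Commit(txnID uint64) error  // phase 2: make writes visible
	Abort(txnID uint64) error   // roll back persisted intents
}

// TwoPhaseCommit drives a transaction across all participants.
// If any participant fails to prepare, the whole transaction aborts,
// which is how atomicity across nodes is preserved.
func TwoPhaseCommit(txnID uint64, parts []Participant) error {
	// Phase 1: all participants must vote yes.
	for _, p := range parts {
		if err := p.Prepare(txnID); err != nil {
			for _, q := range parts {
				q.Abort(txnID) // best-effort rollback
			}
			return err
		}
	}
	// Phase 2: the decision is commit; participants must eventually apply it.
	for _, p := range parts {
		if err := p.Commit(txnID); err != nil {
			return err // in practice: retry until success, since the decision is final
		}
	}
	return nil
}
```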
Other Modules
In addition to components that fall naturally into the read or write path, there are other modules that interact closely with the storage layer, along with some resident background processes.
Schema Management
How to divide namespaces and organize different schemas is a matter of logical schema management; a tree structure is a common choice, as sketched below.
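Purely as an illustration (all names hypothetical), such a namespace tree might look like this:

```go
package catalog

// Node is one level of the schema namespace tree,
// e.g. Cluster -> Database -> Table -> Column.
type Node struct {
	Name     string
	Children map[string]*Node
}

// Lookup walks a path like ["mydb", "users", "id"] down the tree.
func (n *Node) Lookup(path []string) (*Node, bool) {
	cur := n
	for _, name := range path {
		next, ok := cur.Children[name]
		if !ok {
			return nil, false
		}
		cur = next
	}
	return cur, true
}
```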
In addition, the correspondence between schema and data needs to be maintained. In a distributed system, though, how to change a schema without blocking concurrent data writes is a genuinely hard problem. A common solution is Google’s F1 online DDL.
Cluster Metadata
Cluster metadata is mainly divided into two major parts:
- Logical. Logical dataset organization and partitioning, such as Database and Table. That is, dividing datasets by namespace at an appropriate granularity, and then applying different configurations to different datasets, along with corresponding multi-tenant isolation and access control.
- Physical. Organization and partitioning of physical nodes, such as Zone and Node. That is, physically organizing different nodes with appropriate fault tolerance thresholds, and then handling node failures and data balancing among different nodes and thresholds.
Managing the mapping from logical data to physical nodes is one of the most important aspects of distributed systems: scheduling.
Scheduling usually happens at two moments: when a dataset is created, and when replicas are rebalanced (for example, after machine failures or when new nodes join).
We schedule the different shards of a dataset based on various node attributes (fault-tolerance threshold, remaining capacity, and so on); a toy placement function is sketched below. Moving data involves adding and removing replicas of a shard, which, to preserve consistency, must also go through the consensus protocol.
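Here is a toy version of the placement decision; the attributes and scoring are made up, and real schedulers weigh many more signals (load, IO pressure, labels).

```go
package scheduler

// NodeInfo carries the attributes placement decisions look at.
type NodeInfo struct {
	ID          string
	FaultDomain string // e.g. zone or rack
	FreeBytes   uint64
}

// PickNode chooses a target for a new replica: it must not share a
// fault domain with existing replicas, and among the remaining
// candidates it prefers the one with the most free capacity.
func PickNode(nodes []NodeInfo, usedDomains map[string]bool) (NodeInfo, bool) {
	var best NodeInfo
	found := false
	for _, n := range nodes {
		if usedDomains[n.FaultDomain] {
			continue // never co-locate replicas in one fault domain
		}
		if !found || n.FreeBytes > best.FreeBytes {
			best, found = n, true
		}
	}
	return best, found
}
```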
Data Import and Export
One of the most important peripheral tools for a database is supporting data import and export in rich formats and at high speeds.
This can be further subdivided into several categories:
- Data backup and recovery. That is, both the data producer and consumer are the database itself. In this case, there is no need to consider supporting different data formats (i.e., custom encoding can be used, as long as the database itself understands it, so efficiency can be prioritized). Instead, the focus is on supporting different data backends: local, cloud, shared file systems, and so on. At the same time, support for both full and incremental backups needs to be considered.
- Import from other systems. It is necessary to consider supporting multiple data sources and different data formats, preferably using distributed import via some computing frameworks (such as Spark, Flink, Kafka). It is also best to support connections to mainstream databases, such as MySQL, Postgres, and so on.
- Data export. Exporting data into various general-purpose formats, such as CSV, JSON, or SQL statements; a minimal CSV sketch follows this list.
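For the export side, a minimal CSV dump using Go’s standard library; the row source here is a placeholder for rows scanned out of the storage engine.

```go
package main

import (
	"encoding/csv"
	"log"
	"os"
)

func main() {
	// Placeholder rows; a real exporter would stream from a table scan.
	rows := [][]string{
		{"id", "name"},
		{"1", "alice"},
		{"2", "bob"},
	}

	f, err := os.Create("dump.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	w := csv.NewWriter(f)
	if err := w.WriteAll(rows); err != nil { // WriteAll flushes internally
		log.Fatal(err)
	}
}
```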
This was written in haste; omissions are inevitable. Feel free to add them in the comments. If you find it helpful, please share it with more people.
