Deep Dive Into the Paper Behind MemGraph: Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems (Part 1)

Memgraph is an in-memory graph database that uses OpenCypher as its query language, focusing on small-data, low-latency graph scenarios. Since Memgraph is open source (repo here, implemented in C++), we can take a peek at its internals. According to this comment, the inspiration for its in-memory structure implementation mainly comes from the paper: Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems.

This series is mainly divided into two parts: paper interpretation and code walkthrough, each split into several posts as needed. This post is the first part of the paper interpretation, covering the paper’s overview and how linked lists are cleverly used to store multiple versions and control visibility. Parts two and three of the paper interpretation will cover how serializability is achieved and how multi-version data is reclaimed.

Overview

As the paper’s title suggests, it aims to implement a high-performance data structure for main-memory databases, based on multi-version concurrency control (MVCC), supporting the serializable isolation level. The core ideas are:

Use columnar storage
Reuse the Undo Buffer data structure
Use a doubly linked list to chain together multiple versions of data
Cleverly design timestamps to control data visibility
Use a predicate tree (PT) to determine whether a transaction’s read set has been modified

Unlike typical multi-version schemes, this paper performs in-place updates, then “pushes” the old version data into a linked list. I use “push” because the list uses head insertion: data near the head is newer, while data near the tail is older. The head of each data item’s linked list is maintained by a data structure called VersionVector; if a row has no old data, the corresponding position is null.

Author: Muniao’s Notes https://www.qtmuniao.com/2024/02/06/memory-mvcc-serial. Please indicate the source when reposting.

MVCC-memory.png

From now on, we will use the example in the figure above to aid understanding. It is an example of Sally continuously transferring money to others. At the start (T0), everyone has ten dollars. Then Sally transfers 1 dollar to someone else each time, for a total of three transfers. At the current moment, the first two have already been committed:

Sally → Wendy, commit timestamp T3
Sally → Henry, commit timestamp T5

The third transfer is in progress:

Sally → Mike, transaction ID is Ty, start timestamp is T6

In between, there are two full-table scan transactions (summing the balances of all accounts), Tx and Tz, with start timestamps T4 and T7 respectively. Both have started but not yet finished.

Version Management

Each transaction obtains two timestamps (uint64) when entering the system:

transactionID: The transaction ID is also a timestamp (auto-incrementing from 2^63), denoted as Tx, Ty, Tz in the figure above.
startTime-stamp: An auto-incrementing timestamp (starting from 0), denoted as T4, T6, T7 in the figure above.

As mentioned earlier, all updates are performed in-place, but old values are saved in the undo buffer. The old version data serves two purposes:

Before-image value, as part of the transaction’s undo log.
As an old value for multi-versioning of that field.

For snapshot isolation and serializable isolation levels, the in-place updated value is not visible to other transactions. In the next subsection, we will explain how visibility is controlled.

When a transaction commits, it obtains another timestamp: commitTime-stamp, which shares the same auto-incrementing counter as startTime-stamp.

During a transaction’s execution, all old values in the Undo Buffer are tagged with the transactionID timestamp (the third transfer in the figure: Ty); when the transaction commits, they are uniformly replaced with the commitTime-stamp (the first two transfers in the figure: T3 and T5).

Version Visibility

When a transaction accesses a field’s value, it first performs an in-place access, then traverses the linked list pointed to by that value’s corresponding VersionVector, stopping when the following condition is met:

// pred denotes the next node in the linked list
// TS denotes the timestamp associated with the corresponding node
// T denotes the current transaction and its transaction ID
v.pred == null || v.pred.TS == T || v.pred.TS < T.startTime

Let’s examine the three sub-conditions one by one:

v.pred == null: Holds when the value has no multi-versions, or when the end of the linked list is reached.
v.pred.TS == T: An active transaction accessing data it has updated itself.
v.pred.TS < T.startTime: Accesses an already-committed older version via the transaction’s start timestamp.

The above conditions are somewhat abstract, so let’s look at them with the example. Sally’s multiple transfers form the following linked list:

1 2	7(in-place) -pred-> (Ty, Bal, 8) -pred-> (T5, Bal, 9) -pred-> (T3, Bal, 10) -pred-> null

Now let’s look at the visibility of Sally’s Bal (Balance) data for different transactions:

Transaction Ty: (Ty is a value > 2^63), so it will stop at the successor node satisfying: pred == (Ty, Bal, 8) (condition 2, Ty == Ty). At this point, the accessed value is 7, which is the value updated by transaction Ty.
Transaction Tx: Start timestamp is T4, so it will stop at the successor node satisfying pred == (T3, Bal, 10) (condition 3, T3 < T4). At this point, the value of Sally’s account is 9, meaning one transfer had just been completed at that time — the one with commit timestamp T3.
Transaction Tz: Start timestamp is T7, so it will stop at the successor node satisfying pred == (T5, Bal, 9) (condition 3, T5 < T7). At this point, the value of Sally’s account is 8, meaning two transfers had been completed, and the third transfer is not yet complete and thus invisible to Tz.

As you can see, the linked list divides the timeline into four segments:

1	Temporary value (7) ----Ty--(8)---T5---(9)---T3---(10)---T0

Comparing the transaction start timestamp with the successor node’s timestamp gives condition 1:

T0 ~ T3: the visible value is 10
T3 ~ T5: the visible value is 9
T5 ~ ∞: the visible value is 8

Among these, Ty (transaction ID) is effectively infinity relative to the start timestamp. This is the clever use of splitting the uint64 space in half that we mentioned in the previous subsection:

Start and commit timestamps: 0 ~ 2^63 - 1
Transaction IDs: 2^63 ~ 2^64 - 1

Additionally, null is equivalent to T0, which is condition 1.

Finally, to allow a transaction to see its own updates, condition 2 is added.

In the next post, we will explain in detail how to achieve serializable isolation level based on the above data structure.

References

Open source project: https://github.com/memgraph/memgraph
Related paper: https://db.in.tum.de/~muehlbau/papers/mvcc.pdf
Interpretation blog post: https://wangziqi2013.github.io/paper/2018/07/10/Fast-Serializable-Multi-Version-Concurrency-Control-for-Main-Memory-Database-Systems.html

This article is from my Xiaobot column. The second interpretation has also been updated in the column. Friends who like my articles are welcome to subscribe and support me, which encourages me to produce more high-quality content. Subscription details can be found here.

我是青藤木鸟，一个喜欢摄影、专注大规模数据系统的程序员，欢迎关注我的公众号：“木鸟杂记”，有更多的分布式系统、存储和数据库相关的文章，欢迎关注。关注公众号后，回复“资料”可以获取我总结一份分布式数据库学习资料。回复“优惠券”可以获取我的大规模数据系统付费专栏《系统日知录》的八折优惠券。

我们还有相关的分布式系统和数据库的群，可以添加我的微信号：qtmuniao，我拉你入群。加我时记得备注：“分布式系统群”。另外，如果你不想加群，还有一个分布式系统和数据库的论坛（点这里），欢迎来玩耍。