Memgraph is an in-memory graph database that uses OpenCypher as its query language, focusing on small-data, low-latency graph scenarios. Since Memgraph is open source (repo here, implemented in C++), we can take a peek at its internals. According to this comment, the inspiration for its in-memory structure implementation mainly comes from the paper: Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems.
This series is mainly divided into two parts: paper interpretation and code walkthrough, each split into several posts as needed. This post is the first part of the paper interpretation, covering the paper’s overview and how linked lists are cleverly used to store multiple versions and control visibility. Parts two and three of the paper interpretation will cover how serializability is achieved and how multi-version data is reclaimed.
Overview
As the paper’s title suggests, it aims to implement a high-performance data structure for main-memory databases, based on multi-version concurrency control (MVCC), supporting the serializable isolation level. The core ideas are:
- Use columnar storage
- Reuse the Undo Buffer data structure
- Use a doubly linked list to chain together multiple versions of data
- Cleverly design timestamps to control data visibility
- Use a predicate tree (PT) to determine whether a transaction’s read set has been modified
Unlike typical multi-version schemes, this paper performs in-place updates, then “pushes” the old version data into a linked list. I use “push” because the list uses head insertion: data near the head is newer, while data near the tail is older. The head of each data item’s linked list is maintained by a data structure called VersionVector; if a row has no old data, the corresponding position is null.
Author: Muniao’s Notes https://www.qtmuniao.com/2024/02/06/memory-mvcc-serial. Please indicate the source when reposting.
MVCC-memory.png
From now on, we will use the example in the figure above to aid understanding. It is an example of Sally continuously transferring money to others. At the start (T0), everyone has ten dollars. Then Sally transfers 1 dollar to someone else each time, for a total of three transfers. At the current moment, the first two have already been committed:
- Sally → Wendy, commit timestamp T3
- Sally → Henry, commit timestamp T5
The third transfer is in progress:
- Sally → Mike, transaction ID is Ty, start timestamp is T6
In between, there are two full-table scan transactions (summing the balances of all accounts), Tx and Tz, with start timestamps T4 and T7 respectively. Both have started but not yet finished.
Version Management
Each transaction obtains two timestamps (uint64) when entering the system:
- transactionID: The transaction ID is also a timestamp (auto-incrementing from 2^63), denoted as Tx, Ty, Tz in the figure above.
- startTime-stamp: An auto-incrementing timestamp (starting from 0), denoted as T4, T6, T7 in the figure above.
As mentioned earlier, all updates are performed in-place, but old values are saved in the undo buffer. The old version data serves two purposes:
- Before-image value, as part of the transaction’s undo log.
- As an old value for multi-versioning of that field.
For snapshot isolation and serializable isolation levels, the in-place updated value is not visible to other transactions. In the next subsection, we will explain how visibility is controlled.
When a transaction commits, it obtains another timestamp: commitTime-stamp, which shares the same auto-incrementing counter as startTime-stamp.
During a transaction’s execution, all old values in the Undo Buffer are tagged with the transactionID timestamp (the third transfer in the figure: Ty); when the transaction commits, they are uniformly replaced with the commitTime-stamp (the first two transfers in the figure: T3 and T5).
Version Visibility
When a transaction accesses a field’s value, it first performs an in-place access, then traverses the linked list pointed to by that value’s corresponding VersionVector, stopping when the following condition is met:
1 | // pred denotes the next node in the linked list |
Let’s examine the three sub-conditions one by one:
v.pred == null: Holds when the value has no multi-versions, or when the end of the linked list is reached.v.pred.TS == T: An active transaction accessing data it has updated itself.v.pred.TS < T.startTime: Accesses an already-committed older version via the transaction’s start timestamp.
The above conditions are somewhat abstract, so let’s look at them with the example. Sally’s multiple transfers form the following linked list:
1 | 7(in-place) -pred-> (Ty, Bal, 8) -pred-> (T5, Bal, 9) |
Now let’s look at the visibility of Sally’s Bal (Balance) data for different transactions:
- Transaction Ty: (Ty is a value > 2^63), so it will stop at the successor node satisfying:
pred == (Ty, Bal, 8)(condition 2, Ty == Ty). At this point, the accessed value is 7, which is the value updated by transaction Ty. - Transaction Tx: Start timestamp is T4, so it will stop at the successor node satisfying
pred == (T3, Bal, 10)(condition 3, T3 < T4). At this point, the value of Sally’s account is 9, meaning one transfer had just been completed at that time — the one with commit timestamp T3. - Transaction Tz: Start timestamp is T7, so it will stop at the successor node satisfying
pred == (T5, Bal, 9)(condition 3, T5 < T7). At this point, the value of Sally’s account is 8, meaning two transfers had been completed, and the third transfer is not yet complete and thus invisible to Tz.
As you can see, the linked list divides the timeline into four segments:
1 | Temporary value (7) ----Ty--(8)---T5---(9)---T3---(10)---T0 |
Comparing the transaction start timestamp with the successor node’s timestamp gives condition 1:
- T0 ~ T3: the visible value is 10
- T3 ~ T5: the visible value is 9
- T5 ~ ∞: the visible value is 8
Among these, Ty (transaction ID) is effectively infinity relative to the start timestamp. This is the clever use of splitting the uint64 space in half that we mentioned in the previous subsection:
- Start and commit timestamps: 0 ~ 2^63 - 1
- Transaction IDs: 2^63 ~ 2^64 - 1
Additionally, null is equivalent to T0, which is condition 1.
Finally, to allow a transaction to see its own updates, condition 2 is added.
In the next post, we will explain in detail how to achieve serializable isolation level based on the above data structure.
References
- Open source project: https://github.com/memgraph/memgraph
- Related paper: https://db.in.tum.de/~muehlbau/papers/mvcc.pdf
- Interpretation blog post: https://wangziqi2013.github.io/paper/2018/07/10/Fast-Serializable-Multi-Version-Concurrency-Control-for-Main-Memory-Database-Systems.html
This article is from my Xiaobot column. The second interpretation has also been updated in the column. Friends who like my articles are welcome to subscribe and support me, which encourages me to produce more high-quality content. Subscription details can be found here.
