Boltdb Source Code Guide (1): Boltdb Data Organization

boltdb is one of the few pure-Go, single-node KV libraries available. Modeled on Howard Chu's LMDB project, its implementation is clean and concise: excluding unit tests and adapter code, the core is only about four thousand lines. A simple API and a minimalist implementation are exactly what the author intended. Because the author's bandwidth is limited, the original boltdb has been archived and is no longer updated; if you want to make improvements or submit new PRs, the fork maintained by etcd, bbolt, is recommended.

For convenience, this series of guide articles will still use the original, no-longer-changing repo as the basis. Though small, the project is fully featured: in just over four thousand lines of code, it implements a single-node KV engine based on B+ tree indexing with single-writer, multi-reader transaction support. The code itself is plain and well-commented. If you are a Go enthusiast or interested in KV stores, boltdb is absolutely a repo you should not miss.

This series is planned as three articles, analyzing boltdb’s source code around data organization, index design, and transaction implementation. Since these three aspects are not completely orthogonal, the narrative will inevitably intertwine; if something is unclear, just skip it for now and come back to sort it out once you have the full picture. This article is the first one: boltdb data organization.

Introduction

At its bottom-most layer, a storage engine is defined by how it organizes data on the various physical media it touches (e.g., on disk, in memory). These data-organization choices also reflect the engine's design trade-offs.

At the file-system level, boltdb adopts a page-based organization, aligning everything to pages; in memory, boltdb organizes data as a B+ tree, whose basic unit is the node. An in-memory tree node corresponds to one or more contiguous pages on the file system. These are the only two core abstractions in boltdb’s data organization, making the design remarkably simple. Of course, such simplicity necessarily comes with a cost, which will be analyzed in detail in later articles.

This article first gives an overall explanation of the relationship between nodes and pages, then analyzes the formats of the four page types and their in-memory representations one by one, and finally walks through the DB file growth process and the memory loading strategy across the DB lifecycle.

Author: 木鸟杂记 https://www.qtmuniao.com/2020/11/29/bolt-data-organised, please indicate the source when reposting

Overview

This article mainly involves the two source files page.go and freelist.go, analyzing the on-disk formats of various boltdb pages and their representations after being loaded into memory.

Top-Level Organization

Boltdb's data organization, from top to bottom, is as follows (a minimal usage sketch follows the list):

  1. Each DB corresponds to one file.
  2. Logically:
    • A DB contains multiple buckets, equivalent to multiple namespaces; buckets can be nested without limit.
    • Each bucket corresponds to one B+ tree.
  3. Physically:
    • A DB file is stored sequentially in page-sized units.
    • The page size matches the OS page size (usually 4 KB).
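
To make the logical picture concrete, here is a minimal usage sketch against the original boltdb API; the file name, bucket names, and key/value are illustrative:

package main

import (
	"log"

	"github.com/boltdb/bolt"
)

func main() {
	// one DB corresponds to one file on disk
	db, err := bolt.Open("my.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = db.Update(func(tx *bolt.Tx) error {
		// a bucket is a namespace backed by its own B+ tree...
		users, err := tx.CreateBucketIfNotExists([]byte("users"))
		if err != nil {
			return err
		}
		// ...and buckets can be nested arbitrarily
		active, err := users.CreateBucketIfNotExists([]byte("active"))
		if err != nil {
			return err
		}
		return active.Put([]byte("alice"), []byte("42"))
	})
	if err != nil {
		log.Fatal(err)
	}
}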

Pages and Nodes

Pages are divided into four types:

  • Meta pages: globally there are exactly two meta pages, saved in the file; they are the key to boltdb’s transaction implementation.
  • Free-list pages: a special kind of page that stores the list of free page IDs; they appear as contiguous runs of pages in the file.
  • Two data page types: all remaining pages are data pages, of two types corresponding to branch nodes and leaf nodes in the B+ tree.

The relationship between pages and nodes is:

  1. A page is the basic storage unit of the DB file; a node is the basic building block of the B+ tree.
  2. One data node corresponds to one or more contiguous data pages.
  3. A contiguous sequence of data pages, when deserialized and loaded into memory, becomes one data node.

To summarize: data pages organized linearly on the file system are logically structured into a two-dimensional B+ tree through in-page pointers. The tree root is saved in the meta page, while the IDs of all other unused pages are saved in the free-list page.

Page Format and In-Memory Representation

Boltdb has four page types: meta page, free-list page, branch node page, and leaf node page. They are marked with constant enumerations:

const (
	branchPageFlag   = 0x01
	leafPageFlag     = 0x02
	metaPageFlag     = 0x04
	freelistPageFlag = 0x10
)

Each page consists of a fixed-length header and a data part:

(figure: boltdb page structure)

Here ptr points to the data part of the page. To avoid serialization and deserialization when loading into memory and writing to the file system, boltdb makes extensive use of pointer operations from Go’s unsafe package.

type pgid uint64

type page struct {
	id       pgid
	flags    uint16  // page type, one of the four flags above
	count    uint16  // number of elements in the corresponding node, e.g., KV pairs
	overflow uint32  // number of overflow pages of the node, i.e., the node occupies overflow+1 pages
	ptr      uintptr // marks the start of the data part; when overflow > 0 the data spans multiple contiguous pages, yet the whole run is represented by a single page struct in memory
}
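
As a concrete example of this unsafe-pointer style, fetching a page from the mmap'd region is a single cast at the right offset; db.page in the original repo is essentially the following (db.data is the mmap'd byte array):

// page retrieves a page reference from the mmap'd data by id,
// without copying or deserializing anything
func (db *DB) page(id pgid) *page {
	pos := id * pgid(db.pageSize)
	return (*page)(unsafe.Pointer(&db.data[pos]))
}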

Meta Page (metaPage)

Boltdb has exactly two meta pages, stored at the beginning of the DB file (pgid = 0 and 1). Unlike in data pages, ptr in a meta page does not point to an element list but to the meta information of the whole DB.

(figure: meta page layout)

The in-memory data structure after loading the meta page is as follows:

type meta struct {
	magic    uint32
	version  uint32
	pageSize uint32 // page size of this DB, obtained via syscall.Getpagesize(), usually 4 KB
	flags    uint32
	root     bucket // root bucket, from which the roots of all sub-buckets hang
	freelist pgid   // starting page id of the stored free list
	pgid     pgid   // high-water mark: the next page id to allocate, i.e., the number of pages in use
	txid     txid   // transaction id, used by the transaction machinery
	checksum uint64 // checksum, used to verify that the meta page was written completely
}
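
The magic, version, and checksum are verified whenever a meta page is read back; validate in the original repo is essentially:

// validate checks the marker bytes and version, then verifies the checksum,
// so that a torn meta write can be detected on restart
func (m *meta) validate() error {
	if m.magic != magic {
		return ErrInvalid
	} else if m.version != version {
		return ErrVersionMismatch
	} else if m.checksum != 0 && m.checksum != m.sum64() {
		return ErrChecksum
	}
	return nil
}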

Free-List Page (freelistPage)

The free-list page is a contiguous run of pages (one or more) in the DB file used to store the list of page IDs freed during DB modifications.

(figure: free-list page layout)

In memory it is represented in two parts: one is the list of allocatable free page IDs (ids), and the other is the list of newly freed pages grouped by transaction ID.

// freelist tracks the pages that are already free,
// plus the pages just freed by write transactions
type freelist struct {
	ids     []pgid          // all free and available page ids
	pending map[txid][]pgid // soon-to-be-free page ids, grouped by the tx that freed them
	cache   map[pgid]bool   // fast lookup of all free and pending page ids
}

The pending part is recorded separately mainly for MVCC transactions:

  1. When a write transaction rolls back, the pending free pages of the corresponding transaction must be removed from the pending entry.
  2. When a write transaction (e.g., txid=7) has already committed, some read transactions (e.g., txid <= 7) may still be using the pages it just freed, so they cannot be allocated immediately.

This part will be explained in detail in the boltdb transaction article; for now, just keep it in mind.
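
To see how pages move between the two states, here are free and release from freelist.go (comments mine): pages freed by a write transaction first land in pending, and only migrate to ids once no reader can still observe them.

// free releases a page and its overflow pages under the given transaction id;
// they go into pending, not directly into ids
func (f *freelist) free(txid txid, p *page) {
	if p.id <= 1 {
		panic(fmt.Sprintf("cannot free page 0 or 1: %d", p.id))
	}
	ids := f.pending[txid]
	for id := p.id; id <= p.id+pgid(p.overflow); id++ {
		if f.cache[id] {
			panic(fmt.Sprintf("page %d already freed", id))
		}
		ids = append(ids, id)
		f.cache[id] = true
	}
	f.pending[txid] = ids
}

// release moves all pages freed by transactions with id <= txid
// into the allocatable ids list
func (f *freelist) release(txid txid) {
	m := make(pgids, 0)
	for tid, ids := range f.pending {
		if tid <= txid {
			m = append(m, ids...)
			delete(f.pending, tid)
		}
	}
	sort.Sort(m)
	f.ids = pgids(f.ids).merge(m)
}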

Converting Free List to Page

At write-transaction commit time, the free list persists itself into a given page via the write function, merging pending and ids before writing, because:

  1. The write function is called at write-transaction commit time. Write transactions are serialized, so the write transactions corresponding to pending have already committed.
  2. Writing to the file is for crash recovery, and upon restart there are no read operations, so there is no need to worry about read transactions still using freshly freed pages.
func (f *freelist) write(p *page) error {
	// mark the page type
	p.flags |= freelistPageFlag

	// page.count is a uint16, which can only express [0, 64k-1]. If the free
	// page id list is longer than that, a trick is used: page.count is set to
	// 0xFFFF and the actual count is stored in the first element of the data
	// part (as a pgid, i.e., a uint64).
	lenids := f.count()
	if lenids == 0 {
		p.count = uint16(lenids)
	} else if lenids < 0xFFFF {
		p.count = uint16(lenids)
		// copyall merges pending and ids and writes them out sorted
		f.copyall(((*[maxAllocSize]pgid)(unsafe.Pointer(&p.ptr)))[:])
	} else {
		p.count = 0xFFFF
		((*[maxAllocSize]pgid)(unsafe.Pointer(&p.ptr)))[0] = pgid(lenids)
		f.copyall(((*[maxAllocSize]pgid)(unsafe.Pointer(&p.ptr)))[1:])
	}

	return nil
}

Note that this step only converts freelist into an in-memory page structure; additional operations such as tx.write() are required to actually persist the corresponding pages to the file.

Loading Free List from Page

When the database restarts, it first recovers valid meta information from the first two pages. Then, following the freelist field in the meta information, it finds the starting page of the free list and restores the list into memory.

func (f *freelist) read(p *page) {
	// count == 0xFFFF means the actual count is stored in the first
	// element of the data part pointed to by ptr
	idx, count := 0, int(p.count)
	if count == 0xFFFF {
		idx = 1
		count = int(((*[maxAllocSize]pgid)(unsafe.Pointer(&p.ptr)))[0])
	}

	// copy the free page ids from the page into the in-memory freelist struct
	if count == 0 {
		f.ids = nil
	} else {
		ids := ((*[maxAllocSize]pgid)(unsafe.Pointer(&p.ptr)))[idx : idx+count]
		f.ids = make([]pgid, len(ids))
		copy(f.ids, ids)

		// ensure the ids are sorted
		sort.Sort(pgids(f.ids))
	}

	// rebuild the freelist.cache map
	f.reindex()
}

Free-List Allocation

The original author's free-list allocation is exceptionally simple. The allocation unit is the page, and the strategy is first-fit: scan the sorted free page list ids for the first run of contiguous free pages of the requested length and return its starting page id.

// allocate returns the starting page id of a run of n contiguous
// free pages if one exists, and 0 otherwise
func (f *freelist) allocate(n int) pgid {
	if len(f.ids) == 0 {
		return 0
	}

	// scan for a run of contiguous free pages of length n
	var initial, previd pgid
	for i, id := range f.ids {
		if id <= 1 {
			panic(fmt.Sprintf("invalid page allocation: %d", id))
		}

		// if the run breaks, restart it at the current id
		if previd == 0 || id-previd != 1 {
			initial = id
		}

		if (id-initial)+1 == pgid(n) {
			// when exactly the first n entries of ids are allocated, simply
			// advance the f.ids slice. This temporarily wastes the space at
			// the head, but the append() in free() will reallocate the slice
			// on demand and reclaim it.
			if (i + 1) == n {
				f.ids = f.ids[i+1:]
			} else {
				copy(f.ids[i-n+1:], f.ids[i+1:])
				f.ids = f.ids[:len(f.ids)-n]
			}

			// drop the allocated page ids from the cache
			for i := pgid(0); i < pgid(n); i++ {
				delete(f.cache, initial+i)
			}

			return initial
		}

		previd = id
	}
	return 0
}

This allocation strategy is simple and direct, but linear in the number of free pages. Alibaba engineers reportedly contributed a patch that groups free pages by the length of their contiguous runs, making exact-size lookups constant-time; a hypothetical sketch of the idea follows.
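
The sketch below only illustrates the grouping idea; the type, names, and layout are invented for this article and are not the actual patch (pgid is the type defined earlier).

// sizedFreelist indexes free runs by their length:
// run length -> set of starting page ids
type sizedFreelist struct {
	runs map[int]map[pgid]struct{}
}

// allocate tries an exact-fit run first (an O(1) map lookup),
// then falls back to splitting a longer run
func (f *sizedFreelist) allocate(n int) pgid {
	if starts := f.runs[n]; len(starts) > 0 {
		for start := range starts {
			delete(starts, start)
			return start
		}
	}
	for size, starts := range f.runs {
		if size <= n {
			continue
		}
		for start := range starts {
			delete(starts, start)
			f.addRun(start+pgid(n), size-n) // return the unused tail to the index
			return start
		}
	}
	return 0 // no run long enough
}

func (f *sizedFreelist) addRun(start pgid, size int) {
	if f.runs[size] == nil {
		f.runs[size] = make(map[pgid]struct{})
	}
	f.runs[size][start] = struct{}{}
}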

Leaf Page (leafPage)

This type of page corresponds to a leaf node in the B+ tree. A leaf node contains two kinds of elements: ordinary KV data and sub-buckets.

For the former, the basic element stored in the page is a piece of user data in some bucket. For the latter, an element in the page is a sub-bucket in the DB.

// header of a single element in the byte array pointed to by page.ptr
type leafPageElement struct {
	flags uint32 // ordinary kv (flags=0) or sub-bucket (flags=bucketLeafFlag)
	pos   uint32 // offset from this element header to the start of its key
	ksize uint32 // key length in bytes
	vsize uint32 // value length in bytes
}

Its detailed structure is as follows:

(figure: leaf page element layout)

As the figure shows, the leaf page stores the element headers (leafPageElement) and the elements themselves (the key/value bytes) separately. The advantage is that the headers are fixed-length and can be located by index; during a binary search for a given key, only the bytes actually compared need to be loaded into memory on demand (pages are accessed via mmap, so data is faulted in from the file system only when touched).

inodes := p.leafPageElements()
index := sort.Search(int(p.count), func(i int) bool {
	return bytes.Compare(inodes[i].key(), key) != -1
})

If each element header were instead stored adjacent to its element, locating the i-th header would require walking past all preceding variable-length elements, so all pages holding the element array would have to be read sequentially and fully loaded into memory before binary search could even start.

Another small optimization is that pos stores the offset from the start address of the element header to the start address of the element, rather than the absolute offset from ptr. The same field width can thus reach farther, and computing a key's address becomes a single addition from the header pointer:

func (n *branchPageElement) key() []byte {
	buf := (*[maxAllocSize]byte)(unsafe.Pointer(n)) // buf starts at the element header
	return (*[maxAllocSize]byte)(unsafe.Pointer(&buf[n.pos]))[:n.ksize]
}
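
The value accessor of a leaf element follows the same pattern, skipping pos bytes to the key and then ksize bytes past it; in the original repo it reads:

// value returns a byte slice of the element's value, which sits
// immediately after the key
func (n *leafPageElement) value() []byte {
	buf := (*[maxAllocSize]byte)(unsafe.Pointer(n))
	return (*[maxAllocSize]byte)(unsafe.Pointer(&buf[n.pos+n.ksize]))[:n.vsize]
}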

Branch Page (branchPage)

The branch page structure is largely the same as the leaf page. The difference is that the value stored with each element is a page id: the child page that this branch node points to for the corresponding key range.

(figure: branch page element layout)

The key in branchPageElement stores the starting key of the page it points to.
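
For comparison with leafPageElement, the branch element header defined in page.go is:

// header of a single element in a branch page; instead of a value
// it carries the id of the child page its key leads to
type branchPageElement struct {
	pos   uint32 // offset from this element header to the start of its key
	ksize uint32 // key length in bytes
	pgid  pgid   // id of the child page this element points to
}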

Conversion Flow

Boltdb uses mmap to map the DB file into memory space. While building the tree and accessing it, corresponding pages are loaded into memory on demand, leveraging the OS page cache policy for replacement.

File Growth

When we open a DB and find the DB file empty, four pages are initialized in memory (4 * 4 KB = 16 KB with the usual page size): two meta pages, one empty free-list page, and one empty leaf node page. They are then written to the DB file, after which the normal open flow proceeds.

func (db *DB) init() error {
	// use the OS page size
	db.pageSize = os.Getpagesize()

	// create two meta pages in a buffer
	buf := make([]byte, db.pageSize*4)
	for i := 0; i < 2; i++ {
		p := db.pageInBuffer(buf[:], pgid(i))
		p.id = pgid(i)
		p.flags = metaPageFlag

		// initialize the meta page
		m := p.meta()
		m.magic = magic
		m.version = version
		m.pageSize = uint32(db.pageSize)
		m.freelist = 2
		m.root = bucket{root: 3}
		m.pgid = 4
		m.txid = txid(i)
		m.checksum = m.sum64()
	}

	// write an empty free list at pgid=2
	p := db.pageInBuffer(buf[:], pgid(2))
	p.id = pgid(2)
	p.flags = freelistPageFlag
	p.count = 0

	// write an empty leaf page at pgid=3
	p = db.pageInBuffer(buf[:], pgid(3))
	p.id = pgid(3)
	p.flags = leafPageFlag
	p.count = 0

	// write the four pages to the data file and flush to disk
	if _, err := db.ops.writeAt(buf, 0); err != nil {
		return err
	}
	if err := fdatasync(db); err != nil {
		return err
	}

	return nil
}

As data is continuously written, new pages need to be allocated. Boltdb first looks in the freelist for reusable pages; only if none are available does it re-mmap (munmap, then mmap) to grow the DB file. Each growth step doubles the mapping (starting from 16 KB * 2 = 32 KB) until it reaches 1 GB, after which each step adds another 1 GB.

func (db *DB) mmapSize(size int) (int, error) {
	// double from 32KB up to 1GB
	for i := uint(15); i <= 30; i++ {
		if size <= 1<<i {
			return 1 << i, nil
		}
	}

	// verify the requested size is not above the maximum allowed
	if size > maxMapSize {
		return 0, fmt.Errorf("mmap too large")
	}

	// beyond 1GB, align up to a multiple of maxMmapStep (1GB)
	sz := int64(size)
	if remainder := sz % int64(maxMmapStep); remainder > 0 {
		sz += int64(maxMmapStep) - remainder
	}

	// align to db.pageSize
	pageSize := int64(db.pageSize)
	if (sz % pageSize) != 0 {
		sz = ((sz / pageSize) + 1) * pageSize
	}

	// never exceed maxMapSize
	if sz > maxMapSize {
		sz = maxMapSize
	}

	return int(sz), nil
}

On 32-bit machines the file must not exceed maxMapSize = 2 GB; on 64-bit machines the upper limit is 256 TB.
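
These limits come from per-platform constants in the repo (bolt_386.go and bolt_amd64.go respectively):

// bolt_386.go (32-bit)
const maxMapSize = 0x7FFFFFFF // 2GB
const maxAllocSize = 0xFFFFFFF

// bolt_amd64.go (64-bit)
const maxMapSize = 0xFFFFFFFFFFFF // 256TB
const maxAllocSize = 0x7FFFFFFF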

Read/Write Flow

When opening an existing DB, the DB file is first mapped into memory space, then the meta pages are parsed, and finally the free list is loaded.
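
Since there are two meta pages, "parsing the meta pages" means picking a usable one: boltdb prefers the copy with the higher txid and falls back to the other if validation fails; db.meta() in the original repo is essentially:

// meta returns the meta page with the highest txid that passes validation;
// if neither copy validates, the file is considered unrecoverable
func (db *DB) meta() *meta {
	metaA, metaB := db.meta0, db.meta1
	if db.meta1.txid > db.meta0.txid {
		metaA, metaB = db.meta1, db.meta0
	}
	if err := metaA.validate(); err == nil {
		return metaA
	} else if err := metaB.validate(); err == nil {
		return metaB
	}
	panic("bolt.DB.meta(): invalid meta pages")
}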

When reading from the DB, pages along the access path are loaded into memory on demand and converted to nodes for caching.

When modifying the DB, the copy-on-write (COW) principle applies: no modification happens in place; a copy is made first and then changed. Modifying a leaf node therefore also means rewriting every node on the path from the root bucket down to that node. All these copies need newly allocated pages and subsequent persistence; this is closely tied to the transaction implementation and will be explained in detail in the transaction article of this series.

Summary

Boltdb uses only two concepts in data organization: page and node. Each database corresponds to one file, and each file contains a series of linearly organized pages. Page sizes are fixed; depending on their nature, there are four types: meta page, free-list page, leaf page, and branch page. When opening the database, the following operations are performed in sequence:

  1. Use mmap to map the database file into memory space.
  2. Parse the meta pages to obtain the free-list page ID and the root bucket page ID.
  3. According to the free-list page ID, load the list of free page IDs into memory.
  4. According to the root bucket starting page address, parse the root bucket root node.
  5. According to read/write needs, starting from the tree root, traverse and load data pages (branch pages and leaf pages) along the access path into memory on demand to become nodes.

As can be seen, nodes come in two types, matching the data pages: branch nodes and leaf nodes.

Another thing to note is that the existence of nested buckets makes this part slightly harder to understand. In the next article on boltdb’s index design, we will analyze in detail how boltdb organizes multiple buckets and the B+ tree index within a single bucket.

References

  1. GitHub: boltdb repo
  2. 我叫尤加利: boltdb source code analysis
