木鸟杂记

大规模数据系统

Cmu15445 Database Systems Project 1: Buffer Pool Manager

cmu15445 is a classic open course on the design and implementation of Database Management Systems (DBMS). The course uses Database System Concepts as the textbook, providing lecture slides, notes, and videos, and carefully prepares several interconnected projects. The course places great emphasis on system design and programming implementation. In the words of the professor Andy Pavlo, this is a course that you can put on your resume and that can help you land a good offer.

Having some free time this vacation, I dug out this course and was immediately impressed by its rich content and elegant organization. Unfortunately, time is limited, so I can only follow the projects as the main thread, supplemented by lecture slides and notes, and briefly go through them. If I have more time, I will skim the textbook and videos. Starting from project one, after each project’s autograder run, I will write a note for future reference. Professor Andy Pavlo suggests not making the project code repository public, so I will try to minimize code snippets and focus more on the thought process.

This post is about project one: managing the cache of file system pages in memory — the buffer pool manager.

Overview

The target system of this project, BusTub, is a disk-oriented DBMS, but data on disk does not support byte-level access. Therefore, an intermediate layer for managing pages is needed. However, Professor Andy Pavlo insists on not using mmap to delegate page management to the operating system. Thus, the goal of Project 1 is to proactively manage the cache of pages from disk in memory, thereby minimizing the number of disk accesses (in time) and maximizing the contiguity of related data (in space).

This project can be broken down into two relatively independent sub-tasks:

  1. Maintaining the replacement policy: LRU replacement policy
  2. Managing the buffer pool: buffer pool manager

Both components are required to be thread-safe.

This article first analyzes the project content from basic concepts and core data flows, then goes through the two sub-tasks respectively.

作者:木鸟杂记 https://www.qtmuniao.com/2021/02/10/cmu15445-project1-buffer-pool/, 转载请注明出处

Project Analysis

When I first started writing the project code, I felt there were many details, and it was easy to miss something during implementation. But as I implemented and thought deeper, I gradually grasped the full picture and realized that as long as a few basic concepts and core data flows are clear, one can grasp the essentials.

Basic Concepts

The basic unit of operation for the buffer pool is a logically contiguous byte array. On disk, it is represented as a page, with a unique identifier page_id; in memory, it is represented as a frame, with a unique identifier frame_id. To keep track of which frame stores which page, a page table is needed.

In the following text, I may use page and frame interchangeably, because both concepts are the basic units of data managed by the buffer pool, generally 4k. The differences are as follows:

  1. The page id is the global identifier for this unit of data, while the frame id is just the index of a page in the memory pool (frame array).
  2. A page in the file system is a logically contiguous byte array; in memory, we attach some metadata to it: pin_count_, is_dirty_

基本概念基本概念

Generally speaking, the memory pool for managing frames is much smaller than the disk. Therefore, after the memory pool is full, when loading a new page from disk into the memory pool, some replacement policy (replacer) is needed to evict some no-longer-used pages from the memory pool to make room.

Core Data Flow

To state the conclusion first, the implementation core of the buffer pool manager lies in managing the state of all frames in the memory pool. Therefore, if we can sort out the state machine of frames, we can grasp the core data flow.

The buffer pool maintains a frame array. Each frame has three states:

  1. free: initial state, not holding any page
  2. pinned: holding a page that is being used by a thread
  3. unpinned: holding a page, but the page is no longer being used by any thread

And the functions to be implemented:

1
2
3
4
FetchPageImpl(page_id)
NewPageImpl(page_id)
UnpinPageImpl(page_id, is_dirty)
DeletePageImpl(page_id)

are the actions that drive the above state changes in the state machine. The state machine is as follows:

frame 状态机frame 状态机

Corresponding to the data structures in the implementation:

  1. The frame array that stores page data is pages_
  2. The indexes (frame_id) of all free frames are stored in free_list_
  3. The indexes of all unpinned frames are stored in replacer_
  4. The indexes of all pinned frames and unpinned frames are stored in page_table_, and the two states are distinguished by the pin_count_ field in the page.

In the figure above, NewPage1 and NewPage2 indicate that in the NewPage function, each time a free frame is obtained, it first tries to take a free frame from the free list (freelist_), and only if that fails will it evict an unpinned frame from the replacer_ and then use it. This reflects one goal of the buffer pool manager implementation: minimizing disk accesses, the reason for which is analyzed later.

Project Components

Having grasped the basic concepts and core data flow of this project, let’s analyze the two sub-tasks.

TASK #1 - LRU REPLACEMENT POLICY

I had written a similar implementation on LeetCode before, so I naturally drew on that experience. But I soon found that these two interfaces are somewhat different.

What LeetCode provides is a kv store interface, which maintains recency order during get/set, and automatically replaces the oldest KV when the memory pool is full.

But what this project provides is a replacer interface, which maintains a list of unpinned frame_ids, adds the frame_id to the list and maintains recency order when Unpin is called, removes the frame_id from the list when Pin is called, and returns the oldest frame_id when Victim is called.

Of course, the essence is the same, so for this project I also used unordered_map and doubly linked list as the data structure. Implementation details will not be elaborated on. One thing to note is that if Unpin finds that the frame_id is already in the replacer, it returns directly without changing the recency order of the list. Because logically, the same frame_id cannot be Unpinned multiple times, so we only need to consider the first Unpin of a frame_id.

In a broader context, essentially, the replacer is a recycle bin that maintains an eviction order; that is, we don’t directly delete pages with pin_count_ = 0 from memory, but put them into the recycle bin. According to the temporal locality of reference, a page that was just accessed is likely to be accessed again. Therefore, when we have to truly delete (Victim) a frame from the recycle bin, we delete the oldest one. When later we want to access a piece of data that was just put into the recycle bin, we only need to fish the page out of the recycle bin, thus saving a disk access. This achieves the goal of minimizing disk accesses.

TASK #2 - BUFFER POOL MANAGER

The core logic was mostly covered in the project analysis section. Here, I simply list some issues I encountered during my implementation.

The scope of page_table_. In the initial implementation, after drawing the frame state machine, I felt it was perfect to only put pinned frame ids in page_table_: it would allow frame ids to be distributed mutually exclusively according to state among free_list_, replacer_, and page_table_. But later I realized that if unpinned frame ids were not saved in page_table_, it would be impossible to reuse pages with pin_count_ = 0, and the replacer would lose its meaning.

Dirty page flush timing. There are two strategies: one is to flush every time Unpin is called, which flushes more frequently but can ensure that content is not lost after an unexpected power failure and restart; the other is to lazily flush when the replacer victimizes a frame, which ensures the minimum number of flushes. This is a trade-off between performance and reliability. Considering only this project, either approach should pass.

NewPage should not read from disk. This was a bug I wrote. After all, when NewPage is called, there is no corresponding page content on disk at all, so the following error is reported:

1
2
2021-02-18 16:53:47 [autograder/bustub/src/storage/disk/disk_manager.cpp:121:ReadPage] DEBUG - Read less than a page
2021-02-18 16:53:47 [autograder/bustub/src/storage/disk/disk_manager.cpp:108:ReadPage] DEBUG - I/O error reading past end of file

Clear metadata when reusing a frame. When reusing a frame evicted from the replacer, special care must be taken to clear fields like pin_count_\is_dirty_ before use. Of course, when DeletePage is called, care must also be taken to set page_id_ to INVALID_PAGE_ID and clear the above fields. Otherwise, when used again, if pin_count_ is not 0 after Unpin, it will cause the page to fail to be deleted during DeletePage.

Lock granularity. The most brute-force approach is to add a lock at the function scope level. If optimization is needed later, the lock granularity can be refined.

Project Code

Taking FetchPageImpl as an example to emphasize some implementation details, note that the project has already provided an implementation framework through comments.

I have added comments to mark some points I think need attention.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
Page *BufferPoolManager::FetchPageImpl(page_id_t page_id) {
// a. 使用自动获取和释放锁
std::scoped_lock<std::mutex> lock(latch_);

// 1. Search the page table for the requested page (P).
// 1.1 If P exists, pin it and return it immediately.
auto target = page_table_.find(page_id); // b. 判断存在与访问数据只用一次查找
if (target != page_table_.end()) {
frame_id_t frame_id = target->second;
// c. 通过指针运算获取 frame_id 处存放的 Page 结构体
Page *p = pages_ + frame_id;
p->pin_count_++;
replacer_->Pin(frame_id); // d. 将对应 page 从“回收站”中捞出
return p;
}

// 1.2 If P does not exist, find a replacement page (R) from either the free list or the replacer.
// Note that pages are always found from the free list first.
frame_id_t frame_id = -1;
Page *p = nullptr;
if (!free_list_.empty()) {
frame_id = free_list_.back(); // e. 在结尾处操作效率高一点
free_list_.pop_back();
assert(frame_id >= 0 && frame_id < static_cast<int>(pool_size_));
p = pages_ + frame_id;

// f. 从 freelist 中获取的 dirty page 已经在 delete 时写回了
} else {
bool victimized = replacer_->Victim(&frame_id);
if (!victimized) {
return nullptr;
}
assert(frame_id >= 0 && frame_id < static_cast<int>(pool_size_));
p = pages_ + frame_id;

// 2. If R is dirty, write it back to the disk.
if (p->IsDirty()) {
disk_manager_->WritePage(p->GetPageId(), p->GetData());
p->is_dirty_ = false;
}
p->pin_count_ = 0; // g. 将元信息 pin_count_ 清空
}

// 3. Delete R from the page table and insert P.
page_table_.erase(p->GetPageId()); // h. 时刻注意区分 p->GetPageId() 与 page_id 是否相等,别混用
page_table_[page_id] = frame_id;

// 4. Update P's metadata, read in the page content from disk, and then return a pointer to P.
p->page_id_ = page_id;
p->ResetMemory();
disk_manager_->ReadPage(page_id, p->GetData());
p->pin_count_++;
return p;
}

The autograder related to the project can be found in FAQ for the registration address and invitation code. When submitting code, it is best not to submit a GitHub repository address, as there will be many formatting issues. You can follow the instructions on the project page each time, and package the relevant files into a zip archive according to the directory structure for submission.

提交事项提交事项

Read the project description carefully. Things to note before submission:

  1. Run make format in the build directory for automatic formatting.
  2. Run make check-lint in the build directory to check for some syntax issues.
  3. Design some tests for each function locally, write them into the relevant file (for this project, buffer_pool_manager_test.cpp), turn on the test switch, compile make buffer_pool_manager_test in the build folder, and run ./test/buffer_pool_manager_test

Here’s a project1 autograder result:

autograder-resultautograder-result

Summary

This is the first project of cmu15445, implementing a buffer pool manager that moves pages between disk and memory on demand. The key to this project is to grasp the basic concepts, sort out the core data flow, and on this basis pay attention to some implementation details.


我是青藤木鸟,一个喜欢摄影、专注大规模数据系统的程序员,欢迎关注我的公众号:“木鸟杂记”,有更多的分布式系统、存储和数据库相关的文章,欢迎关注。 关注公众号后,回复“资料”可以获取我总结一份分布式数据库学习资料。 回复“优惠券”可以获取我的大规模数据系统付费专栏《系统日知录》的八折优惠券。

我们还有相关的分布式系统和数据库的群,可以添加我的微信号:qtmuniao,我拉你入群。加我时记得备注:“分布式系统群”。 另外,如果你不想加群,还有一个分布式系统和数据库的论坛(点这里),欢迎来玩耍。

wx-distributed-system-s.jpg