木鸟杂记

大规模数据系统

Cmu15445 Course Introduction

Introduction

In college, I didn’t study databases very deeply. What I still remember are only a few conceptual ideas on the usage side, such as SQL, ER diagrams, normal forms, and transactions. I never had a systematic understanding of their implementation. But since I’ve embarked on the storage path, I definitely need to fill in my database knowledge. Previously, I saw many people on Zhihu recommending this cmu15445 course, and I had already bookmarked the course homepage, but never found the time to look at it. “Constant thinking leads to a response” — during this holiday, coinciding with a job change, I finally have some large blocks of time to make a start.

Syllabus

A brief introduction to the syllabus of cmu15445. This course uses Database System Concepts as a supplementary textbook, and covers all aspects of the design and implementation of Database Management Systems (DBMS), including:

  1. Data models (relational, document, key-value)
  2. Storage models (n-ary, decomposition, i.e., row-oriented and column-oriented)
  3. Query languages (SQL, stored procedures)
  4. Storage structures (heaps, log-structured)
  5. Index design (sorted trees, hash tables)
  6. Transaction processing (ACID, concurrency control)
  7. Data recovery (logging, snapshots)
  8. Execution engines (joins, sorting, aggregation, optimization)
  9. Concurrency architectures (multi-core, distributed)

As you can see, the content is very comprehensive. The course uses an open-source educational database as a case study to deeply explore the trade-offs made in database design across all these aspects. This course places great emphasis on programming practice, with a series of interconnected yet concise programming assignments.

Author: Muniao’s Notes https://www.qtmuniao.com/2021/02/15/cmu15445-introduction/ , please indicate the source when reposting

Plan

The main goal of this learning endeavor is the projects, while also reading some lecture notes and the textbook. I’ll leave the videos to chance for now, otherwise the timeline will stretch too long and I won’t finish anything in the end. There are five projects in total:

  1. Environment setup: C++ Primer
  2. Buffer pool management: Buffer Pool Manager
  3. B+ tree index: B+Tree Index
  4. Query engine: Query Execution
  5. Concurrency control: Concurrency Control

These five projects together form a simple relational database for teaching purposes — BusTub. The project format basically involves implementing specified interfaces and passing the provided test cases. It should be noted that the test cases given in the code are very simple, mostly only testing the main paths. Therefore, passing the test cases doesn’t necessarily mean your code is correct. This requires you to thoroughly understand the relationships between the various interfaces in the projects and the key points where trade-offs can be made during implementation. To achieve this, after completing and passing the test cases, you can find some reference implementations online to compare and learn from.

The initial plan is to write a summary after completing each project (except the first environment setup one), discussing some of the problems and interesting points encountered during implementation.

Resources

All course-related materials can be found on the course website. I plan to do the Fall 2020 projects, but the videos seem to only be from 2019.

If I come across any good blogs or resources during implementation, I will gradually add them here.

Lecture Notes Interpretation

I have translated and interpreted the course lecture notes chapter by chapter, all published in my large-scale data systems column “System Thinking Daily”. Welcome to subscribe.

  • [Daily Database Learning] Lecture #12: Execution Models
  • [Daily Database Learning] Lecture #11: Join Algorithms
  • [Daily Database Learning] Lecture #10: Sorting and Aggregation Algorithms
  • [Daily Database Learning] Lecture #09: Concurrent Safety of Indexes
  • [Daily Database Learning] Lecture #09: Locks and Latches
  • [Daily Database Learning] Lecture #08: Trade-offs and Optimizations of B+ Trees
  • [Daily Database Learning] Lecture #07: Hash Table Overview
  • [Daily Database Learning] Lecture #07: Hashing Schemes
  • [Daily Database Learning] Lecture #08: Tree Indexes
  • [Daily Database Learning] Lecture #06: Memory Management
  • [Daily Database Learning] Lecture #05: Data Compression
  • [Daily Database Learning] Lecture #05: Workload Types and Storage Models
  • [Daily Database Learning] Lecture #04: Data Encoding
  • [Daily Database Learning] Lecture #04: Log-Structured Storage
  • [Daily Database Learning] Lecture #03: Data Layout
  • [Daily Database Learning] Lecture #03: Database and OS
  • [Daily Database Learning] Lecture #03: Storage Hierarchy
  • [Daily Database Learning] Lecture #01: Relational Algebra
  • [Daily Database Learning] Lecture #01: Relational Model
  • [Daily Database Learning] Lecture #01: Data Models

我是青藤木鸟,一个喜欢摄影、专注大规模数据系统的程序员,欢迎关注我的公众号:“木鸟杂记”,有更多的分布式系统、存储和数据库相关的文章,欢迎关注。 关注公众号后,回复“资料”可以获取我总结一份分布式数据库学习资料。 回复“优惠券”可以获取我的大规模数据系统付费专栏《系统日知录》的八折优惠券。

我们还有相关的分布式系统和数据库的群,可以添加我的微信号:qtmuniao,我拉你入群。加我时记得备注:“分布式系统群”。 另外,如果你不想加群,还有一个分布式系统和数据库的论坛(点这里),欢迎来玩耍。

wx-distributed-system-s.jpg