木鸟杂记

大规模数据系统

I’m talking about my large-scale data systems column, System Daily Notes. Someone once asked: after reading your column’s articles, it might be a long time before the knowledge gets used, or it may never get used at all, so why buy it? Besides, umbrellas are a must-have on rainy days, but this column certainly isn’t a must-have in an interview. So this really isn’t a good “business”: narrow audience, low-frequency use cases. Multiply the two, and you get my dismal sales.

That’s also why successful columns rack up tens of thousands of purchases, while I sold a measly 250 copies and yet still dare to share my experience. If you happen to have similarly dangerous ideas, this might serve as a reference.

Read more »

This article is mainly “compiled” from Chapter 40 of the book Operating Systems: Three Easy Pieces, a very accessible and insightful book that I recommend to anyone feeling lost about operating systems. The chapter builds a filesystem on a very small disk, using its data structures and read/write workflows as the main thread to derive each basic component from scratch, helping you quickly build an intuition for filesystems.

Filesystems are basically built on top of block storage. Of course, some modern distributed filesystems, such as JuiceFS, use object storage underneath instead. But whether the substrate is block storage or object storage, the essence is the same: addressing and exchanging data in units of “blocks”.

We will first examine the data structure of a complete filesystem on disk, i.e., its layout; then we will connect the various sub-modules through open/close and read/write workflows, thereby covering the key points of a filesystem.
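As a taste of that derivation, here is a minimal Python sketch of the kind of tiny on-disk layout the book works through: a superblock, an inode bitmap, a data bitmap, an inode table, and the remaining data blocks. The specific constants (a 64-block disk, 4 KiB blocks, 256-byte inodes) mirror the book’s running example, but the function and variable names are my own illustrative choices, not the book’s code.

```python
BLOCK_SIZE = 4096          # bytes per block
NUM_BLOCKS = 64            # a tiny 256 KiB "disk"
INODE_SIZE = 256           # bytes per on-disk inode
INODE_TABLE_BLOCKS = 5     # blocks reserved for the inode table

# Layout, in block numbers (metadata first, data region last):
SUPERBLOCK   = 0
INODE_BITMAP = 1
DATA_BITMAP  = 2
INODE_TABLE  = range(3, 3 + INODE_TABLE_BLOCKS)           # blocks 3..7
DATA_REGION  = range(3 + INODE_TABLE_BLOCKS, NUM_BLOCKS)  # blocks 8..63

MAX_INODES = INODE_TABLE_BLOCKS * BLOCK_SIZE // INODE_SIZE  # 80 files max

def inode_block_and_offset(inumber: int) -> tuple[int, int]:
    """Translate an inode number into (disk block, byte offset) --
    the first address calculation every filesystem read must do."""
    byte_addr = inumber * INODE_SIZE
    return INODE_TABLE.start + byte_addr // BLOCK_SIZE, byte_addr % BLOCK_SIZE

print(MAX_INODES)                  # 80
print(inode_block_and_offset(20))  # (4, 1024)
```

Even this toy layout already shows the key intuition: reading a file starts with pure arithmetic over fixed regions, long before any data block is touched.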

Read more »

Memgraph is an in-memory graph database that uses OpenCypher as its query language, focusing on small-data, low-latency graph scenarios. Since Memgraph is open source (repo here, implemented in C++), we can take a peek at its internals. According to this comment, the inspiration for its in-memory structure implementation mainly comes from the paper: Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems.

This series is mainly divided into two parts: paper interpretation and code walkthrough, each split into several posts as needed. This post is the first part of the paper interpretation, covering the paper’s overview and how linked lists are cleverly used to store multiple versions and control visibility. Parts two and three of the paper interpretation will cover how serializability is achieved and how multi-version data is reclaimed.

Overview

As the paper’s title suggests, it aims to implement a high-performance data structure for main-memory databases, based on multi-version concurrency control (MVCC), supporting the serializable isolation level. The core ideas are:

  1. Use columnar storage
  2. Reuse the Undo Buffer data structure
  3. Use a doubly linked list to chain together multiple versions of data
  4. Cleverly design timestamps to control data visibility
  5. Use a predicate tree (PT) to determine whether a transaction’s read set has been modified

Unlike typical multi-version schemes, this paper performs in-place updates and then “pushes” the old version into a linked list. I say “push” because the list uses head insertion: data near the head is newer, while data near the tail is older. The head of each data item’s version list is maintained in a data structure called the VersionVector; if a row has no old versions, the corresponding slot is null.
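To make the head-insertion idea concrete, here is a minimal Python sketch. The single write-timestamp visibility rule and class names below are my simplifications for illustration, not the paper’s exact design: each in-place update pushes the overwritten value onto the front of a per-row chain, and a reader walks from the head (newest) toward the tail (oldest) until it finds a version written at or before its timestamp.

```python
class Version:
    """One old version of a row, linked toward older data."""
    def __init__(self, value, write_ts, older):
        self.value = value        # the row's value as of write_ts
        self.write_ts = write_ts  # timestamp when this value was written
        self.older = older        # next (older) version, or None

class Row:
    def __init__(self, value, write_ts):
        self.value = value        # in-place, always the newest version
        self.write_ts = write_ts
        self.head = None          # chain head (the VersionVector slot)

    def update(self, new_value, ts):
        # In-place update: push the current value onto the chain head,
        # then overwrite the row itself (head insertion -> newest first).
        self.head = Version(self.value, self.write_ts, self.head)
        self.value, self.write_ts = new_value, ts

    def read(self, ts):
        # A reader at timestamp ts sees the newest version written <= ts.
        if self.write_ts <= ts:
            return self.value
        v = self.head
        while v is not None and v.write_ts > ts:
            v = v.older
        return v.value if v else None

row = Row("a", write_ts=1)
row.update("b", ts=5)
row.update("c", ts=9)
print(row.read(9), row.read(6), row.read(2))  # c b a
```

Note how a row with no old versions keeps `head` as `None`, matching the null slot in the VersionVector described above.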

Read more »


2023 passed in the blink of an eye. In hindsight, if I had to sum it up in one word, it would be — adversity spurs change.

By “adversity” I don’t mean being literally destitute, unable to afford food; it is closer to the “hardship” in the saying “when in adversity, attend to your own virtue in solitude.” Many beliefs I had long relied on could no longer be sustained, so I went through a painful process of reshaping them. This wasn’t necessarily a bad thing, but the passive mental tug-of-war at the time still feels agonizing in retrospect. Of course, it was precisely these adversities that forced me to break free from “groupthink” and seek new paths (think different). Though painful in the moment, there was something to be gained; after all, everyone must find their own way.

So what exactly changed this year?

Read more »

Y Combinator (YC) is a well-known American startup accelerator that has been dedicated to helping startups succeed since its founding in 2005. As a leader in the startup world, YC is known not only for providing funding but also for offering mentorship, resources, and networks that help startups stand out in a competitive market. YC’s success stories include Airbnb, Dropbox, and Reddit, all of which are now giants in their respective fields.
YC’s “Requests for Startups” (RFS) is a forward-looking call to the global startup community, based on YC’s deep understanding of market trends, technological advances, and global challenges. I believe it can provide plenty of inspiration for entrepreneurs and those looking to join startups. The 2024 RFS covers 20 directions in total; this is the first part, covering the first ten. If enough people read it, I will continue translating the remaining ten. Here is the full text.

Introduction

Admittedly, the best startup ideas we’ve invested in were often not what we were initially looking for, but rather happy accidents.

Nevertheless, we are very excited about several categories of startups. Below is our latest 2024 version of Requests for Startups (RFS), briefly outlining some entrepreneurial directions we are paying attention to.

But this doesn’t mean you can only apply to Y Combinator by choosing these directions. In fact, most of our investments still focus on the internet and mobile sectors we’ve always paid attention to. So if you already have a startup idea in a related direction before reading this, please keep going.
Similarly, just because we’ve listed these directions doesn’t mean you have to start a company based on them. The purpose of RFS is: if you happen to already have a similar idea, you are welcome to apply to us.
Also, if you want to know what types of nonprofits we are looking to invest in, you can read this article.

Read more »

Compared to Paxos, one of Raft’s distinguishing features is that the algorithm is decomposed into relatively orthogonal parts: leader election, log replication, state persistence, log compaction, and membership changes. A glance at the course outline shows that, except for the last, these modules correspond to what we implement in Parts A through D of the course. The benefit of this orthogonal decomposition is that each module becomes relatively self-contained, making the whole easier to understand and implement, which is also the original design intent of the Raft algorithm.

Below, I do not intend to explain each module with full precision; that is the job of the paper and the code. Instead, in this chapter I will guide you toward an intuitive understanding of Raft’s basic concepts (term, election) and its two major flows (leader election, log replication). Armed with that intuition, you can then study the paper carefully, and I believe you will be able to sort out the Raft algorithm’s mass of details with far less effort.
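As a first taste of that intuition, here is a minimal Python sketch of the term-and-vote rule at the heart of leader election: a node grants at most one vote per term, and a higher term always resets its state. This is purely illustrative; real Raft (and the course’s labs) also compares log up-to-dateness before granting a vote, which I omit here.

```python
class Node:
    """A stripped-down Raft voter tracking only current_term and voted_for.
    The log-freshness check from the real protocol is deliberately omitted
    to keep the term/vote intuition front and center."""
    def __init__(self):
        self.current_term = 0
        self.voted_for = None   # candidate id voted for in current_term

    def request_vote(self, candidate_id, candidate_term):
        if candidate_term < self.current_term:
            return False                      # stale candidate: reject
        if candidate_term > self.current_term:
            self.current_term = candidate_term
            self.voted_for = None             # a new term resets the vote
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id     # at most one vote per term
            return True
        return False

n = Node()
print(n.request_vote("A", 1))  # True  (first vote in term 1)
print(n.request_vote("B", 1))  # False (already voted for A in term 1)
print(n.request_vote("B", 2))  # True  (newer term, vote resets)
```

The “at most one vote per term” invariant is what guarantees at most one leader per term: two candidates cannot both collect a majority in the same term.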

Read more »

This article is based on the transcript of Amazon S3 VP Andy Warfield’s keynote speech at FAST ’23, summarizing some of their experiences in architecting and maintaining an object storage system of massive scale: S3. As we know, Amazon S3 is one of the most important storage infrastructures of the cloud era. Nowadays, object storage services from various cloud providers are basically compatible with the S3 API, and cloud-native infrastructure, such as cloud-native databases, ultimately relies on object storage as its final persistence layer.

Read more »

Imagine you are the CTO of a startup wanting to quickly launch a database product viable for the AP (analytical processing) market, and it must have differentiated features (otherwise, who would use a new product?). What would you do?

In 2022, Firebolt published a paper specifically on this topic: Assembling a Query Engine From Spare Parts. The core idea is: assemble a database from open-source components, just like building a desktop PC from off-the-shelf parts.

Read more »

Over the past year, for various reasons, I’ve needed to constantly learn new things. Thus, how to learn efficiently has become a natural question. Recently, I read some books and watched some open courses, including Scott H. Young’s Learn More, Study Less (hereinafter referred to as LMSL) and the Coursera open course Learning How to Learn (hereinafter referred to as LHL). I discovered some interesting ideas and, while they’re still fresh (though I haven’t finished either yet), I’m writing them down to organize my thoughts and hopefully offer some inspiration.

Both resources use analogy when explaining concepts.

LMSL proposes Holistic Learning, whose basic idea is: you cannot learn a concept in isolation; you can only integrate it into your existing conceptual system and understand its connotation and extension by characterizing it from different angles.

Read more »

Common Misconceptions in Database Interviews

Due to business needs, I have recently interviewed many database candidates. I have noticed that many candidates share some common misconceptions when preparing for interviews. I would like to take this opportunity to share some advice from my perspective as an interviewer. This article focuses on four misconceptions: weak coding fundamentals, weak engineering literacy, lack of communication and thinking skills, and fragmented knowledge frameworks.

Read more »

In our engineering practice, there are some small tricks for structuring code whose underlying ideas are also commonly seen in everyday life. This series is a set of such strange associations spanning life and engineering. This is the first one: multi-pass decomposition, that is, many things we are used to doing in one go can sometimes be much simpler and more efficient when broken down into multiple passes.

When I do code reviews, I often see novice developers trying to do too many things in a single for loop. This often leads to deeply nested code or gigantic for loop bodies. At this point, if performance is not significantly impacted, I usually suggest breaking down the tasks into multiple steps, with one for loop per step. You can even make each step a separate function.

Of course, all of this is from a maintainability perspective. Humans can’t keep too many things in mind at once; doing things step by step, rather than mashing them together, makes each step’s logic much clearer. I often call the latter the “pancake-spreading” style of code: it feels natural to write but is painful to maintain, because mixing details together always makes complexity explode. The concept of a minimum viable prototype in software engineering follows a similar philosophy.
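A small, entirely hypothetical example of the contrast: given a list of order records, one loop that filters, converts currency, and totals all at once, versus the same job split into three passes with one concern per pass. The data and conversion rates are made up for illustration.

```python
orders = [
    {"amount": 120, "currency": "USD", "status": "paid"},
    {"amount": 80,  "currency": "EUR", "status": "pending"},
    {"amount": 50,  "currency": "EUR", "status": "paid"},
]
RATES = {"USD": 1.0, "EUR": 1.1}  # made-up conversion rates to USD

# One-go ("pancake-spreading") style: filter, convert, and sum at once.
total = 0.0
for o in orders:
    if o["status"] == "paid":
        total += o["amount"] * RATES[o["currency"]]

# Multi-pass style: one concern per pass, each step trivially checkable.
paid = [o for o in orders if o["status"] == "paid"]          # pass 1: filter
in_usd = [o["amount"] * RATES[o["currency"]] for o in paid]  # pass 2: convert
total2 = sum(in_usd)                                         # pass 3: sum

assert total == total2
```

With only two concerns the one-go loop is still readable, but as conditions and transformations pile up, the multi-pass version is the one you can still test and extend step by step.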

Read more »