木鸟杂记

Large-Scale Data Systems


I previously wrote a translation of the Ray paper. Later I spent some time reading Ray’s source code, and to consolidate what I learned I plan to publish a series of source-code reading articles. To sustain continuous updates, I will try to break the modules into smaller pieces and keep each article short. Since my experience is limited, my reading of the source may contain inaccuracies; feedback and discussion are welcome.

Read more »

Write control and data flow

Introduction

GFS is a distributed large-file storage system custom-built by Google for its business needs, supporting elastic scaling and designed for massive data. It runs on clusters of inexpensive commodity servers, features automatic fault tolerance, and supports concurrent access from a large number of clients.

GFS is designed for large files and read-heavy workloads. Although it supports file modifications, it only optimizes for appends. It does not support POSIX semantics but implements a similar file operation API. It was a groundbreaking industrial-scale storage system developed by Google around the same time as MapReduce to solve large-scale data storage problems such as indexing.
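
To make the non-POSIX interface concrete, here is a hypothetical Python sketch of what a GFS-style client could look like, loosely following the operations the paper describes (the names are invented, not Google’s actual API):

```python
class GFSClient:
    """Hypothetical GFS-style client: the paper lists create, delete, open,
    close, read, write, plus the special snapshot and record-append ops."""

    def read(self, handle, offset, length):
        """Large streaming or random reads: the common, fast path."""
        raise NotImplementedError

    def write(self, handle, offset, data):
        """Writes at a client-chosen offset: supported but not optimized."""
        raise NotImplementedError

    def record_append(self, handle, data):
        """The optimized path: the system picks the offset atomically and
        returns it, so many clients can append concurrently without locks."""
        raise NotImplementedError
```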

Read more »

MapReduce execution flow

Introduction

MapReduce is a concept proposed in a paper Google published in 2004 (the first internal version was written in 2003). Although fifteen years have passed, revisiting the background, principles, and implementation of this founding concept of the big data era still yields many intuitive insights into distributed systems; as the saying goes, one can learn new things by reviewing the old.

In Google’s context, MapReduce is both a programming model and the distributed system implementing that model. Its arrival let developers with no distributed-systems background easily harness large clusters to process massive data at high throughput. Its problem-solving approach is well worth learning from: identify the pain points of the requirements (e.g., how to maintain, update, and rank massive indexes), abstract the key processing workflow at a high level (shard via Map, merge via Reduce on demand), and then implement the system efficiently (tailored to fit). Of these, finding a suitable computational abstraction is the hardest part, demanding both an intuitive grasp of the requirements and very strong computer-science literacy. More realistically, such an abstraction is only the tip of the iceberg, evolved through constant trial and error against real requirements.
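
To make the abstraction concrete (the programming model only, not Google’s distributed runtime), here is a toy word count in plain Python:

```python
from collections import defaultdict

def map_fn(doc):                      # Map: document -> (word, 1) pairs
    for word in doc.split():
        yield word, 1

def reduce_fn(word, counts):          # Reduce: merge all values for one key
    return word, sum(counts)

def run(docs):
    groups = defaultdict(list)        # the "shuffle": group pairs by key
    for doc in docs:
        for k, v in map_fn(doc):
            groups[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

print(run(["the quick fox", "the lazy dog"]))
# {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```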

Read more »

Introduction

Following Spark, UC Berkeley’s RISELab (successor to the AMPLab) has launched another heavyweight high-performance compute engine for AI: Ray, which claims to schedule millions of tasks per second. How does it achieve this? After trying it out, here is a brief summary:

  1. Minimalist Python API: by adding the ray.remote decorator to a function or class definition and making minor changes, single-machine code becomes distributed code (see the sketch after this list). Not only can pure functions be executed remotely; a class can also be registered remotely (the Actor model), holding substantial context (member variables) whose state is changed by remotely calling its member methods.
  2. Efficient Data Storage and Transmission: On each node, a local object store is maintained through shared memory (multi-process access without copying), and data exchange between different nodes is performed using the specially optimized Apache Arrow format.
  3. Dynamic Graph Computation Model: thanks to the first two points, the futures (handles) returned by remote calls can themselves be passed to other remote functions or actor methods. Complex computation topologies are thus built through nested remote calls, and execution is triggered dynamically via the object store’s publish-subscribe mechanism.
  4. Global State Maintenance: the global control state (not the data) is kept in Redis shards, which lets the other components scale smoothly and recover from failures. Each Redis shard in turn avoids a single point of failure through chain replication.
  5. Two-Level Scheduling Architecture: scheduling is split between a local scheduler and a global scheduler. Task requests are first submitted to the local scheduler, which tries to execute tasks locally to reduce network overhead; when resources, data dependencies, or load do not permit local execution, the task is forwarded to the global scheduler.
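
A minimal example of points 1 and 3 using Ray’s public API (ray.remote, .remote() calls, ray.get):

```python
import ray
ray.init()

@ray.remote
def square(x):
    return x * x

@ray.remote
class Counter:                      # an actor: remote state plus methods
    def __init__(self):
        self.n = 0
    def incr(self):
        self.n += 1
        return self.n

futures = [square.remote(i) for i in range(4)]      # returns futures at once
print(ray.get(futures))                             # [0, 1, 4, 9]

c = Counter.remote()
print(ray.get([c.incr.remote(), c.incr.remote()]))  # [1, 2]

# Futures can be fed straight into other remote calls; this nesting is what
# builds the dynamic task graph described in point 3.
print(ray.get(square.remote(square.remote(3))))     # 81
```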

Of course, there are still areas to optimize, such as Job-level encapsulation (for multi-tenant resource allocation), the object store’s garbage-collection algorithm (currently just a crude LRU), and multi-language support (Java was added recently, but how well it works is unclear). Still, these flaws do not outweigh its merits: the architectural design and implementation ideas offer much worth learning from.

Read more »


Introduction

After falling into the “pit” of Python’s default parameters several times, I decided to write a dedicated blog post about it. But then I came across an excellent English article (Default Parameter Values in Python, Fredrik Lundh, July 17, 2008, based on a comp.lang.python post) that is incisive and to the point. Since such a gem already exists, there is no need to show off my own writing; admittedly, this is also a bit of laziness. Below is a simple translation, in the hope that more people will see it.

What follows is a rather free translation with some personal additions, not strictly faithful to the original. Language features are based on Python 3.
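
For context, this is the classic pitfall the article dissects: a default value is evaluated once, when the def statement runs, not on every call.

```python
def append_item(item, bucket=[]):     # the default list is built once, at def time
    bucket.append(item)
    return bucket

print(append_item(1))    # [1]
print(append_item(2))    # [1, 2] -- the same list is reused across calls

def append_item_fixed(item, bucket=None):    # the idiomatic workaround
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket
```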

Read more »

Overall BLOB Storage Architecture

Overview

First, let me explain what BLOB means. The full name is Binary Large OBjects: large objects in arbitrary binary format, which in Facebook’s context are the images, videos, and documents uploaded by users. Such data is created once, read many times, never modified, and occasionally deleted.

Previously, I briefly translated the paper on Facebook’s predecessor system, Haystack. As the business grew and data volume increased, the old approach no longer worked: storing all BLOBs in Haystack, with its three-way replication, would be poor value at this scale, while building cold storage entirely from network mounts, traditional disks, and a Unix-like (POSIX) file system could not keep up with reads. So the most common trick in computer science, divide and conquer, came into play.

They first measured how BLOB access frequency relates to creation time, and from that characterized the hot/cold distribution of BLOB access over time (similar to the long-tail effect). On this basis they proposed a hot/warm separation strategy: Haystack serves as hot storage, absorbing the frequently accessed traffic, while F4 handles the remaining, less frequently accessed BLOBs. Under this assumption (F4 stores only data that rarely changes and sees relatively low access volume), F4’s design can be greatly simplified. Naturally, a dedicated routing layer sits above both to hide the underlying details, make placement decisions, and route requests.

For Haystack, roughly seven years had passed since it first appeared (2007–2014). Compared with the original design, only minor updates were made, such as removing the Flag bit and adding a journal file alongside the data file and index file, specifically to record deleted BLOB entries.

For F4, the main design goal is to minimize the effective replication factor while preserving fault tolerance, in order to meet the growing demand for warm storage. It is also more modular and scalable, meaning capacity can grow smoothly by adding machines to absorb continued data growth.

To summarize, the main highlights of this paper are hot/warm separation, erasure coding, and geo-replication.
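
As a back-of-the-envelope illustration of why erasure coding drives the effective replication factor down (the figures below follow the scheme described in the f4 paper; treat the arithmetic as a sketch):

```python
# Effective replication factor = raw bytes stored / logical bytes.
from fractions import Fraction

n, k = 14, 10                             # Reed-Solomon(10, 4): 10 data + 4 parity
single_dc = Fraction(n, k)                # 1.4x within one datacenter

# Geo-replication: two datacenters each hold a full coded copy, and a third
# holds the XOR of the pair, so two logical volumes cost 3 coded volumes.
geo = (2 * single_dc + single_dc) / 2     # 2.1x overall

haystack = 3 * Fraction(12, 10)           # 3 replicas, each on RAID-6 (1.2x)
print(float(single_dc), float(geo), float(haystack))   # 1.4 2.1 3.6
```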

Read more »

Serving a photo

Overview

The basic idea of Haystack is to keep index information in memory to avoid extra I/O. To achieve this, the design does two main things:

  1. Aggregate small files into large files, reducing the number of files and thus the amount of metadata.
  2. Streamline file metadata, removing all POSIX metadata semantics that are unnecessary in Facebook’s scenario.

This shrinks the metadata to a size that fits in memory, so that essentially every data access completes in a single disk I/O rather than several, as before.
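
A minimal sketch of that idea (hypothetical names, not Haystack’s actual code): an in-memory index maps each photo key to an offset and size inside one big volume file, so a read needs no metadata I/O at all.

```python
from collections import namedtuple

Needle = namedtuple("Needle", ["offset", "size"])   # per-photo metadata

class Volume:
    """Many small photos aggregated into one big append-only file."""

    def __init__(self, path):
        self.data = open(path, "rb")
        self.index = {}            # photo key -> Needle; rebuilt at startup
                                   # from the on-disk index file

    def read(self, key):
        needle = self.index[key]             # pure in-memory lookup
        self.data.seek(needle.offset)
        return self.data.read(needle.size)   # the single disk I/O
```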

Read more »


Introduction

Before using the logging module, let’s first sort out our common requirements for log output. As the saying goes, design that ignores requirements is hooliganism.

  1. Be able to locate the origin (source file & line number) and time of an event, for debugging and tracing.
  2. A single log record can be sent to multiple destinations simultaneously.
  3. Log output can be filtered by level or by finer-grained conditions.
  4. The log output of third-party modules can be conveniently controlled.
  5. While achieving all of the above, configuration stays as simple as possible.

Python’s logging module achieves all five points through a neat modular design organized as a tree, as the sketch below illustrates.
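
A minimal sketch using the standard-library API (names like "myapp" are placeholders):

```python
import logging

# Loggers form a dot-separated tree; "myapp.db" would be a child of "myapp".
logger = logging.getLogger("myapp")
logger.setLevel(logging.DEBUG)

# Point 2: one record fans out to several destinations via multiple handlers.
console = logging.StreamHandler()
console.setLevel(logging.INFO)               # Point 3: per-handler filtering
filelog = logging.FileHandler("myapp.log")
filelog.setLevel(logging.DEBUG)

# Point 1: asctime/pathname/lineno pin down when and where an event happened.
fmt = logging.Formatter(
    "%(asctime)s %(name)s %(levelname)s %(pathname)s:%(lineno)d %(message)s")
console.setFormatter(fmt)
filelog.setFormatter(fmt)
logger.addHandler(console)
logger.addHandler(filelog)

# Point 4: quiet a chatty third-party module by adjusting its own logger.
logging.getLogger("requests").setLevel(logging.WARNING)

logger.debug("written to the file only")
logger.info("written to both console and file")
```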

Read more »


Introduction

Once, while using Python’s socketserver, I came across ForkingMixIn and ThreadingMixIn and was fascinated by this plug-in style of syntactic sugar. Recently, while writing my own code, I wanted to build similar plug-and-play snippets, so I explored Python’s mix-in mechanism.

Simply put, a mix-in is a technique that leverages multiple inheritance to elegantly bolt extra behavior onto an existing class by plugging in small pieces of code, as the example below shows.
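
For instance, socketserver’s own mix-ins plug in like this (EchoHandler is a made-up handler; the mix-in line is the point):

```python
import socketserver

class EchoHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # Echo back whatever the client sends.
        self.request.sendall(self.request.recv(1024))

# Listing ThreadingMixIn before TCPServer yields a server that handles each
# request in a new thread; swap in ForkingMixIn for a process per request.
class ThreadedTCPServer(socketserver.ThreadingMixIn, socketserver.TCPServer):
    pass

# Usage: ThreadedTCPServer(("localhost", 9999), EchoHandler).serve_forever()
```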

Read more »


Introduction

When I first learned about closures while studying JavaScript, I didn’t quite grasp the concept. For interviews I memorized a vague “definition”: a function nested inside another function, with the inner function returned from the outer one, carrying the outer function’s environment along with it. At the time I tried to understand closures from the literal meaning and a few classic examples, and since the Chinese translation “闭包” (“closed package”) doesn’t convey the underlying principle, the idea always felt a bit fuzzy. Recently, work has had me writing Python, where I met closures again; this time I came across some novel and interesting material that finally connected several concepts (first-class functions, binding, scope, and so on) and gave me a deeper understanding of closures.
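
A minimal Python example of that definition at work:

```python
def make_counter():
    count = 0                # a free variable captured by the inner function
    def increment():
        nonlocal count       # Python 3: rebind the name in the enclosing scope
        count += 1
        return count
    return increment         # returned together with its environment

c = make_counter()
print(c(), c(), c())         # 1 2 3 -- state survives after make_counter returns
```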

The reference materials are listed at the end; I highly recommend reading them.

Read more »

Building a Hexo Blog

One of my New Year’s resolutions is to push myself to write a blog post every week. A fresh start calls for cleaning house, which is my usual style; besides, I never found the Jekyll engine easy to use, so I wanted to switch to a new one: Hexo. Last year I noticed more and more blogs using this engine, paid attention, and found it indeed quite good (themes, modes, etc.), so I decided to just do it. Figuring that, as a CS major, following other people’s tutorials would be too lame, I went straight to the official docs. Pitfalls were of course inevitable; more on them below.

Read more »

Preface

After finishing lab 2A (Raft leader election) last time, I was stuck on log replication for a long time. Then last night, after I optimized AppendEntries (skipping over all log entries of the conflicting term when prevLog doesn’t match), the long-troubling TestBackup2B magically passed. I ran it twice and still couldn’t quite believe it, so I deliberately reverted the change and watched it fail again before I felt relieved; apparently the naive version was simply too slow and timed out.
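
Not the lab’s Go code, but a small Python sketch of my reading of that optimization: on a prevLog mismatch the follower reports the conflicting term and the first index holding it, so the leader can back nextIndex up past the whole term in one step instead of one entry at a time.

```python
from collections import namedtuple

Entry = namedtuple("Entry", ["term", "cmd"])

def conflict_hint(log, prev_index):
    """Follower side: prevLog mismatched; report the conflicting term
    and the first index that holds it."""
    term = log[prev_index].term
    first = prev_index
    while first > 0 and log[first - 1].term == term:
        first -= 1
    return term, first

# Leader side: on a failed AppendEntries, set nextIndex to `first`
# (skipping every entry of the conflicting term) rather than nextIndex - 1.
log = [Entry(1, "a"), Entry(2, "b"), Entry(2, "c"), Entry(2, "d")]
print(conflict_hint(log, 3))   # (2, 1): jump back past the whole term-2 run
```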

While it’s still fresh, I might as well write down the blood, sweat, and tears of this period tonight.

Read more »