In engineering practice, there are small tricks for structuring code whose underlying ideas also show up in everyday life. This series collects such odd associations between life and engineering. This is the first installment: multi-pass decomposition. Many things we are used to doing in one go can become much simpler and more efficient when broken down into multiple passes.
When I do code reviews, I often see novice developers trying to do too many things in a single for loop. This often leads to deeply nested code or gigantic for loop bodies. At this point, if performance is not significantly impacted, I usually suggest breaking down the tasks into multiple steps, with one for loop per step. You can even make each step a separate function.
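To make the suggestion concrete, here is a minimal sketch (the data and function names are invented for illustration, not from any real codebase) contrasting one loop that does everything with the same logic split into one pass per concern:

```python
def summarize_one_pass(orders):
    """Filtering, transformation, and aggregation crammed into one loop."""
    total = 0
    for o in orders:
        if o["status"] == "paid":
            amount = o["price"] * o["qty"]
            if amount > 0:
                total += amount
    return total


def summarize_multi_pass(orders):
    """Same result, but one concern per pass; each step can be read alone."""
    paid = [o for o in orders if o["status"] == "paid"]       # pass 1: filter
    amounts = [o["price"] * o["qty"] for o in paid]           # pass 2: transform
    return sum(a for a in amounts if a > 0)                   # pass 3: aggregate
```

Each pass in the second version could also be pulled out into its own named function, which is exactly the refactoring suggested above.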
Of course, all of this is from a maintainability perspective. Humans can't keep too many things in mind at once; doing things step by step, rather than mashing them together, makes each step's logic much clearer. I often call the latter the "pancake-spreading" style of code: it feels natural to write but is painful to maintain, because mixing details together always makes complexity explode. The concept of a minimum viable prototype in software engineering follows a similar philosophy.
Author: Muniao’s Notes https://www.qtmuniao.com/2023/08/21/life-engineering-many-passes Please indicate the source when reprinting
This philosophy is also everywhere in “functional” programming, where when operating on a dataset, we apply a series of transformation functions in a chain, making the data flow clearly visible. In big data processing, this paradigm is even more common. For example, as mentioned in the Spark paper:
```scala
errors.filter(_.contains("HDFS"))
```
SQL query engines use a similar mechanism when executing queries: a query statement is converted into a series of operators, each applied as a pass over a two-dimensional dataset composed of rows and columns, as shown in the figure below.

Image source: CMU 15-445, Query Execution Lecture Notes.
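The operator-chain idea can be sketched in a few lines. This is a toy example with invented log data, not any real engine's API; each pass consumes the whole intermediate result and produces the next, just like the chained Spark transformations above:

```python
rows = [
    {"level": "ERROR", "msg": "HDFS write failed"},
    {"level": "INFO",  "msg": "checkpoint ok"},
    {"level": "ERROR", "msg": "timeout"},
]

# Pass 1: a filter operator selects the matching rows.
errors = [r for r in rows if r["level"] == "ERROR"]
# Pass 2: a second filter refines the result.
hdfs = [r for r in errors if "HDFS" in r["msg"]]
# Pass 3: a projection operator keeps only one column.
msgs = [r["msg"] for r in hdfs]
```

Because each stage is a separate whole-dataset pass, the data flow reads top to bottom, and any stage can be inspected or swapped out independently.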
I learned a little sketching in high school. Although I never got far with it, its multi-pass technique left a deep impression on me: first sketch the outline, then refine layer by layer. Even when hatching, you work over the whole piece layer by layer, rather than finishing one area before moving to the next.

These days I often translate articles. At first, I always aimed to get the translation perfect in one pass, but progress was so slow that I easily gave up. Later, I switched to a multi-pass, layer-by-layer polishing method: first have ChatGPT produce a rough translation, then check it against the original to correct the semantics, and finally do a pass to adjust word order and smooth out the sentences. As the saying goes, good writing comes from rewriting; it must be the same principle.
Professor Srinivasan Keshav of the University of Waterloo, in his “How to Read a Paper”, expounds the classic “three-pass approach” to reading papers, which follows a similar idea:
- The first pass: a bird’s-eye skimming, focusing on the abstract, section headings, conclusions, and other key points.
- The second pass: a bit more detailed, but don’t get bogged down in details.
- The third pass: read carefully to achieve complete understanding.
You can stop after any pass, since this may turn out not to be a paper you need. I used to fall into the opposite pitfall when reading papers, which I like to call the "carpet-bombing" reading method: going through every detail word by word. I made the same mistake when I first started doing code reviews.
Doing everything at once, in sequence, is most people's instinct, but this instinct is often inefficient, and we have to overcome it through deliberate practice. Speaking of which, ordering at a restaurant is often a two-pass affair as well: in the first pass, add everything you want to eat; in the second, apply various constraints (how much you crave each dish, price, whether you've had it before, and so on) to narrow the dishes down to a reasonable range.
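The two-pass ordering method is itself a generate-then-filter pattern. A toy sketch, with an invented menu and a single budget constraint standing in for the real-life ones:

```python
menu = [
    {"name": "dumplings", "price": 12, "appealing": True},
    {"name": "hotpot",    "price": 48, "appealing": True},
    {"name": "salad",     "price": 8,  "appealing": False},
]

# Pass 1: add everything that looks appealing, ignoring constraints.
candidates = [d for d in menu if d["appealing"]]

# Pass 2: apply constraints (here just a budget) to narrow the list down.
BUDGET = 20
order = [d for d in candidates if d["price"] <= BUDGET]
```

Trying to weigh appeal and every constraint simultaneously for each dish is exactly the "one giant loop" instinct; splitting it into two passes makes each decision trivial.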
I think the underlying reasons are:
- Human attention is limited, so we are only good at focusing on one thing at a time.
- Human cognition is also a process from shallow to deep, and layer-by-layer refinement takes advantage of this characteristic.
This article is from my paid Xiaobot column Daily Record of Systems, focusing on distributed systems, storage, and databases. It includes series on graph databases, code deep dives, translations of high-quality English podcasts, database learning, paper interpretations, and more. Friends who like my articles are welcome to subscribe 👉 Column to support me. Your support is very important for me to continue creating high-quality articles. Below is the current list of articles:
Graph Database Series
- Graph Database Resources
- Translation: Factorization & Great Ideas from Database Theory
- Memgraph Series (Part 2): Serialization Implementation
- Memgraph Series (Part 1): Multi-Version Data Management
- Graph Database Series (Part 4): “Fate” and “Conflict” with the Relational Model
- Graph Database Series (Part 3): Graph Representation and Storage
- Graph Database Series (Part 2): A First Look at Cypher
- Graph Database Series (Part 1): What Is the Property Graph Model and Its Shortcomings 🔥
Databases
- Translation: Database Research Trends Over Fifty Years
- Translation: Code Generation in Databases (Codegen in Databas…
- Facebook Velox Runtime Mechanism Analysis
- Distributed System Architecture (Part 2) — Replica Placement
- Recommended Reading: Pipeline Construction in DuckDB
- Translation: How Much Do You Know About the Currently Popular Vector Databases?
- The Great Unification of Data Processing — From Shell Scripts to SQL Engines
- Firebolt: How to Assemble a Commercial Database in Eighteen Months
- Paper Review: NUMA-Aware Query Evaluation Framework
- High-Quality Information Sources: Distributed Systems, Storage, Databases 🔥
- Vector Database Milvus Architecture Analysis (Part 1)
- The Modeling Philosophy Behind the ER Model
- What Is a Cloud-Native Database?
Storage
- Storage Engine Overview and Resources 🔥
- Translation: How RocksDB Works
- RocksDB Optimization Notes (Part 2): Prefix Seek Optimization
- RocksDB Optimization Notes (Part 3): Async IO
- Experiences Using RocksDB in Large-Scale Systems
Code & Programming
- Three “Codes” That Influence How I Write Code 🔥
- Folly Asynchronous Programming: Futures
- On Interfaces and Implementations
- C++ Private Function Override
- ErrorCode or Exception?
- Infra Interview Data Structures (Part 1): Blocking Queue
- Data Structures and Algorithms (Part 4): Recursion and Iteration
Daily Database Learning Series
- Daily Database Learning Lecture #06: Memory Management
- Daily Database Learning Lecture #05: Data Compression
- Daily Database Learning Lecture #05: Workload Types and Storage Models
- Daily Database Learning Lecture #04: Data Encoding
- Daily Database Learning Lecture #04: Log-Structured Storage
- Daily Database Learning Lecture #03: Data Layout
- Daily Database Learning Lecture #03: Database and OS
- Daily Database Learning Lecture #03: Storage Hierarchy
- Daily Database Learning Lecture #01: Relational Algebra
- Daily Database Learning Lecture #01: Relational Model
- Daily Database Learning Lecture #01: Data Models
Miscellaneous
- Common Misconceptions in Database Interviews 🔥
- Life Engineering (I): Multi-Pass Decomposition🔥
- Some Interesting Conceptual Pairs in Systems
- Simplicity and Completeness in System Design
- The Cycle of Engineering Experience
- On Borrowing “Names”
- Cache and Buffer Are Both Caches — What’s the Difference?
