Introduction
With advances in communication technology, the spread of the mobile Internet, and the rise of IoT, connected vehicles, and AI, the amount of data generated each day is growing explosively. Data at this scale is beyond what a traditional single-machine system can process; it has to be handled by large-scale distributed systems. As a result, distributed systems have become a prominent field of study. For a beginner, however, it is easy to feel overwhelmed by the vast, unorganized sea of learning materials available online.
Still, distributed systems have their own fundamental research areas and evolutionary threads, such as:
- Some fundamental research problems: ordering, consistency, fault tolerance, consensus algorithms, concurrency control, etc.
- Some fundamental theorems: CAP, PACELC, FLP
- Gradually evolving industrial systems: MapReduce, Spark, GFS, Dynamo, Cosmos
Therefore, by approaching distributed systems along the two dimensions of “time” and “space,” one can grasp the essentials and learn with a clearer map. “Time” refers to the evolutionary thread of distributed systems, which can be followed by reading papers from different periods in academia and industry. “Space” refers to the decomposition of the fundamental problems studied in distributed systems, which can be understood by reading books to build a systematic body of knowledge. This article briefly summarizes the materials I collected while studying distributed systems, grouped by category for your reference. Within each category the materials are listed in no particular order; pick what you need.
Note: Most of the recommended materials are in English. If you have difficulty reading them, I recommend using the Chrome browser with the “Google Translate” extension installed, which allows one-click “Translate this page.”
Author: 木鸟杂记, https://www.qtmuniao.com/2021/05/16/distributed-system-material/. Please credit the source when reposting.
Books
Dr. Martin Kleppmann. Designing Data-Intensive Applications
Chinese title: 《构建数据密集型应用》. Site: https://dataintensive.net/buy.html. The author provides a free English version for download, which can also be found online.
The book is divided into three major parts:
- Foundations of Data Systems
- Distributed Data
- Derived Data
The Foundations of Data Systems part explores some general aspects of data systems:
- Reliable, Scalable, and Maintainable Applications
- Data Models and Query Languages
- Storage and Retrieval
- Encoding and Evolution
The Distributed Data part discusses the principles and problems faced when building data systems distributed across multiple machines:
- Replication
- Partitioning
- Transactions
- The Trouble with Distributed Systems
- Consistency and Consensus
The Derived Data part explores how data is processed in systems distributed across multiple machines, including:
- Batch Processing
- Stream Processing
- The Future of Data Systems
In recent years, streaming and batch systems have converged, allowing users to process and transform raw data more flexibly and efficiently.
The division of these chapters is excellent. By studying this book thoroughly, you will be able to decompose a new system into multiple components as skillfully as a master butcher, and understand the trade-offs behind each component.
M. van Steen and A.S. Tanenbaum, Distributed Systems, 3rd ed., distributed-systems.net, 2017.
Chinese title: 《分布式系统》, 3rd edition. https://www.distributed-systems.net/index.php/books/ds3/. The author provides a free English PDF download link.
The book is divided into nine chapters:
- Introduction
- Architectures
- Processes
- Communication
- Naming
- Coordination
- Consistency and Replication
- Fault Tolerance
- Security
The author also provides Python sample code and figures for download.
Mikito Takada. Distributed systems: for fun and profit
A free booklet on distributed systems: http://book.mixu.net/distsys/. It introduces some key concepts and design considerations in distributed systems, helping you understand the design principles behind well-known commercial systems such as Dynamo, BigTable, MapReduce, and Hadoop. The author boils down the considerations of distributed programming to two aspects:
- Information is transmitted at the speed of light
- Separate components fail independently
The book has five chapters:
- Basics: A coarse-grained introduction to some terms and concepts, exploring the goals of systems and the difficulty of achieving them
- Up and down the level of abstraction: Introduces the CAP theorem and FLP impossibility, then explores various consistency models
- Time and order: One of the keys to understanding distributed systems is understanding how dispersed components determine the order of events
- Replication: preventing divergence: How multiple replicas maintain consistency
- Replication: accepting divergence: How multiple replicas handle conflicts
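To make the "Time and order" idea concrete, here is a minimal sketch of a Lamport logical clock, one classic way for separate components to agree on an ordering of events without synchronized wall clocks. The class and function names (`LamportClock`, `tick`, `receive`, `demo`) are my own choices for illustration and are not taken from the booklet.

```python
from dataclasses import dataclass


@dataclass
class LamportClock:
    """A logical clock: an integer counter that only moves forward."""
    time: int = 0

    def tick(self) -> int:
        """Advance the clock for a local event or for sending a message."""
        self.time += 1
        return self.time

    def receive(self, sender_time: int) -> int:
        """Merge the timestamp carried by an incoming message."""
        self.time = max(self.time, sender_time) + 1
        return self.time


def demo():
    """Two nodes, one message from A to B (hypothetical scenario)."""
    a, b = LamportClock(), LamportClock()
    a.tick()                    # local event on A, A's clock becomes 1
    sent = a.tick()             # A sends a message stamped 2
    b.tick()                    # unrelated local event on B, B's clock becomes 1
    received = b.receive(sent)  # B jumps to max(1, 2) + 1 = 3
    after = b.tick()            # next event on B is stamped 4
    return sent, received, after
```

Calling `demo()` returns `(2, 3, 4)`: the receive event is always stamped later than the send that caused it, even though B's own clock was behind A's when the message arrived.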
Courses
MIT 6.824: Distributed Systems
One of the most classic distributed systems courses: https://pdos.csail.mit.edu/6.824/schedule.html.
Highlights of the course:
- A curated list of papers
- Well-designed labs
Very suitable for self-study.
Cambridge Concurrent and Distributed Systems
The University of Cambridge’s course on concurrent and distributed systems: https://www.cl.cam.ac.uk/teaching/2021/ConcDisSys/materials.html
Taught by DDIA author Martin Kleppmann.
CMU 15-440: Distributed Systems
CMU’s distributed systems course: https://www.cs.cmu.edu/~dga/15-440/S14/syllabus.html.
Stanford CS244b: Distributed Systems
Stanford’s distributed systems course: http://www.scs.stanford.edu/20sp-cs244b/
CS244b is a seminar course and also provides a list of classic papers.
UW CSE490H: Distributed Systems
The University of Washington’s distributed systems course: https://courses.cs.washington.edu/courses/cse490h/11wi/. The course has not been offered or made public in recent years; the most recent one is from 2011. It also provides a good paper reading list.
Open Source Projects
Storage
- Hadoop, https://github.com/apache/hadoop, Java: Includes open-source implementations of MapReduce and GFS (the latter as HDFS); early versions of the code can be browsed by tag
- SeaweedFS, https://github.com/chrislusf/seaweedfs, Go: Its design draws on Facebook's Haystack and f4
- MinIO, https://github.com/minio/minio, Go: A classic open-source object storage implementation
- TiDB, https://github.com/pingcap/tidb, Go: A distributed database providing a MySQL-compatible access interface
Consensus Algorithms
- etcd, https://github.com/etcd-io/etcd, Go: An implementation of Raft, used in Kubernetes. Can also be used for control-plane data storage in any distributed system
- ZooKeeper, https://github.com/apache/zookeeper, Java: Implements the Zab consensus protocol, originally used in Hadoop to store metadata, with a role similar to etcd
Computation
- Spark, https://github.com/apache/spark, Scala: A big data processing and analytics engine
- Flink, https://github.com/apache/flink, Java: A unified stream-batch data processing engine
- Ray, https://github.com/ray-project/ray, Python/C++: A general-purpose compute engine with powerful expressiveness
Blog Series
Notes on Distributed Systems for Young Bloods
Jeff Hodges https://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/
The author summarizes lessons learned from working on distributed systems, which is very helpful for newcomers who need to shift their mindset when entering the field. Key points include:
- Frequent failures are a distinctive feature of distributed systems compared to other systems
- Building robust distributed systems is far more difficult than building single-machine systems
- Robust open-source distributed systems are much less common than robust single-machine systems
- Multi-machine coordination is very hard
- Slowness is very difficult to pinpoint in distributed systems
- Seek ways to make services partially available
- Make full use of locality
- Use the CAP theorem to examine your distributed system
- …
Distributed Systems Theory for the Distributed Systems Engineer
https://www.the-paper-trail.org/post/2014-08-09-distributed-systems-theory-for-the-distributed-systems-engineer/
The author lays out an entry path and reference materials for distributed systems:
- First steps: Recommends some books
- Failure and Time: The two most important cornerstones of distributed systems; provides references to some classic papers
- The basic tension of fault tolerance: Tolerating faults requires redundant work, but every bit of redundancy costs performance and resources
- Basic primitives: Links to papers on basic concepts in distributed systems, including election algorithms, consistent snapshots, consensus protocols, distributed state machines, broadcast, and chain replication
- A list of industrial system papers: Mostly from Google, with some from others as well
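The tension between fault tolerance and redundancy mentioned above can be illustrated with a little arithmetic. The sketch below assumes a majority-quorum scheme (the approach used by Raft-style systems such as etcd); that choice, and the helper names `majority`, `tolerated_failures`, and `summarize`, are my own assumptions for illustration and do not come from the blog post.

```python
def majority(n: int) -> int:
    """Smallest number of replicas that forms a majority of n."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """How many replicas can fail while a majority is still reachable."""
    return n - majority(n)


def summarize(sizes):
    """Return (replicas, acks needed per write, failures tolerated) per size."""
    return [(n, majority(n), tolerated_failures(n)) for n in sizes]


if __name__ == "__main__":
    for n, acks, failures in summarize([3, 4, 5]):
        print(f"{n} replicas: {acks} acks per write, tolerates {failures} failure(s)")
```

Under this assumption, going from 3 to 4 replicas raises the acknowledgements needed per write from 2 to 3 without tolerating any additional failure, which is one reason odd cluster sizes are usually preferred.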
Meetups
Papers We Love
Computer science paper sharing from PapersWeLove: https://www.zhihu.com/column/c_1353678180390162432
Microsoft-Distributed-System-Meetup
A distributed systems meetup organized by Microsoft folks; activities include studying 6.824 together, reading DDIA together, and interesting talks: https://microsoft-distributed-system-meetup.github.io/home/
Distributed Systems Reading Group
A paper reading group organized by MIT folks in 2013: http://dsrg.pdos.csail.mit.edu/papers/
Topics include consensus protocols, data replication, transactions, concurrency issues, etc.
Computer Systems Study Group
A systems study group organized by @胡津铭: https://learn-sys.github.io/cn/
One Last Thing
Finally, here is a classic awesome-series repo for distributed systems on GitHub: https://github.com/theanalyst/awesome-distributed-systems
