木鸟杂记

Large-Scale Data Systems

Distributed Systems Learning Resources

Introduction

With advances in communication technology, the spread of the mobile Internet, and the rise of IoT, connected vehicles, and AI, the amount of data generated each day is growing explosively. Data at this scale cannot be processed by a traditional single-machine system; it has to be handled by large-scale distributed systems. As a result, distributed systems have become a prominent field of study. Beginners, however, can easily feel overwhelmed by the vast, unorganized sea of learning materials available online.

Yet distributed systems have their own fundamental research areas and distinct evolutionary threads, such as:

  1. Some fundamental research problems: ordering, consistency, fault tolerance, consensus algorithms, concurrency control, etc.
  2. Some fundamental theorems: CAP, PACELC, FLP
  3. Gradually evolving industrial systems: MapReduce, Spark, GFS, Dynamo, Cosmos

Therefore, by grasping distributed systems along the two dimensions of “time” and “space,” one can master the essentials and learn more clearly. “Time” refers to the evolutionary thread of distributed systems, which can be understood by reading papers from different periods in academia and industry. “Space” refers to the decomposition of fundamental problems studied in distributed systems, which can be understood by reading books to build a knowledge system. This article briefly summarizes some materials I collected during my study of distributed systems, categorized for your reference. The materials are listed in no particular order; please adopt them as needed.

Note: Most of the recommended materials are in English. If you have difficulty reading them, I recommend using the Chrome browser with the “Google Translate” extension installed, which allows one-click “Translate this page.”

Author: 木鸟杂记, https://www.qtmuniao.com/2021/05/16/distributed-system-material/. Please credit the source when reposting.

Books

Dr. Martin Kleppmann. Designing Data-Intensive Applications

《构建数据密集型应用》, https://dataintensive.net/buy.html. The author provides a free English version for download, which can also be found online.

The book is divided into three major parts:

  1. Foundations of Data Systems
  2. Distributed Data
  3. Derived Data

The Foundations of Data Systems part explores some general aspects of data systems:

  1. Reliable, Scalable, and Maintainable Applications
  2. Data Models and Query Languages
  3. Storage and Retrieval
  4. Encoding and Evolution

The Distributed Data part discusses the principles and problems faced when building data systems distributed across multiple machines:

  1. Replication
  2. Partitioning
  3. Transactions
  4. The Trouble With Distributed Systems
  5. Consistency and Consensus

The Derived Data part explores data processing in systems distributed across multiple machines, including:

  1. Batch Processing
  2. Stream Processing
  3. The Future of Data Systems

In recent years, streaming and batch systems have converged, allowing users to process and transform raw data more flexibly and efficiently.

The division of these chapters is excellent. By studying this book thoroughly, you will be able to decompose a new system into multiple components as skillfully as a master butcher, and understand the trade-offs behind each component.

M. van Steen and A.S. Tanenbaum, Distributed Systems, 3rd ed., distributed-systems.net, 2017.

Distributed Systems, third edition: https://www.distributed-systems.net/index.php/books/ds3/. The author provides a free English PDF for download.

The book is divided into nine chapters:

  • Introduction
  • Architecture
  • Processes
  • Communication
  • Naming
  • Coordination
  • Consistency and Replication
  • Fault Tolerance
  • Security

The author also provides Python sample code and figures for download.

Mikito Takada. Distributed systems: for fun and profit

A free booklet on distributed systems: http://book.mixu.net/distsys/. It introduces some key concepts and design considerations in distributed systems, helping you understand the design principles behind well-known commercial systems such as Dynamo, BigTable, MapReduce, and Hadoop. The author boils down the considerations of distributed programming to two aspects:

  1. Information travels at the speed of light
  2. Separate components fail independently

Then the book is divided into five chapters:

  1. Basics: A coarse-grained introduction to some terms and concepts, exploring the goals of systems and the difficulty of achieving them
  2. Up and down the level of abstraction: Introduces the CAP theorem and FLP impossibility, then explores various consistency models
  3. Time and order: One of the keys to understanding distributed systems is understanding how dispersed components determine the order of events
  4. Replication: preventing divergence: How multiple replicas maintain consistency
  5. Replication: accepting divergence: How multiple replicas handle conflicts
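
To make the "Time and order" chapter more concrete, here is a minimal sketch of a Lamport logical clock, one classic way for dispersed components to agree on an order of events without synchronized physical clocks. The class name, method names, and the two-node scenario are illustrative assumptions, not code taken from the book.

```python
from dataclasses import dataclass


@dataclass
class LamportClock:
    """Logical clock (illustrative): a counter that only moves forward."""
    time: int = 0

    def tick(self) -> int:
        """Advance the clock for a local event or just before sending a message."""
        self.time += 1
        return self.time

    def receive(self, sent_time: int) -> int:
        """Merge the timestamp carried by an incoming message, then advance."""
        self.time = max(self.time, sent_time) + 1
        return self.time


# Two hypothetical nodes exchanging a single message.
a, b = LamportClock(), LamportClock()
a.tick()                      # local event on A: A is now at 1
stamp = a.tick()              # A sends a message stamped 2
b.tick()                      # unrelated local event on B: B is now at 1
received = b.receive(stamp)   # B receives: max(1, 2) + 1 = 3
print(stamp, received)        # 2 3
```

The receive timestamp is always larger than the send timestamp, so any event that causally precedes another gets a smaller number. The converse does not hold: a smaller timestamp alone does not prove causality.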

Courses

MIT 6.824: Distributed Systems

One of the most classic distributed systems courses: https://pdos.csail.mit.edu/6.824/schedule.html.

Highlights of the course:

  1. A curated list of papers
  2. Well-designed labs

Very suitable for self-study.

Cambridge Concurrent and Distributed Systems

The University of Cambridge’s course on concurrent and distributed systems: https://www.cl.cam.ac.uk/teaching/2021/ConcDisSys/materials.html

Taught by DDIA author Martin Kleppmann.

CMU 15-440: Distributed Systems

CMU’s distributed systems course: https://www.cs.cmu.edu/~dga/15-440/S14/syllabus.html.

Stanford CS244b: Distributed Systems

Stanford’s distributed systems course: http://www.scs.stanford.edu/20sp-cs244b/

CS244b is a seminar course and also provides a list of classic papers.

UW CSE490H: Distributed Systems

The University of Washington’s distributed systems course: https://courses.cs.washington.edu/courses/cse490h/11wi/. The course has not been offered or made public in recent years; the most recent one is from 2011. It also provides a good paper reading list.

Open Source Projects

Storage

  1. Hadoop, https://github.com/apache/hadoop, Java: You can browse early code by tag; includes open-source implementations of MapReduce and GFS
  2. SeaweedFS, https://github.com/chrislusf/seaweedfs, Go: Its design draws on Facebook's Haystack and f4
  3. MinIO, https://github.com/minio/minio, Go: A classic open-source object storage implementation
  4. TiDB, https://github.com/pingcap/tidb, Go: A distributed database providing a MySQL-compatible access interface

Consensus Algorithms

  1. etcd, https://github.com/etcd-io/etcd, Go: A key-value store built on an implementation of Raft, used in Kubernetes. It can also store control-plane data for any distributed system
  2. ZooKeeper, https://github.com/apache/zookeeper, Java: Implements the Zab consensus protocol, originally used in Hadoop to store metadata, with a role similar to etcd

Computation

  1. Spark, https://github.com/apache/spark, Scala: A big data processing and analytics engine
  2. Flink, https://github.com/apache/flink, Java: A unified stream-batch data processing engine
  3. Ray, https://github.com/ray-project/ray, Python/C++: A general-purpose compute engine with powerful expressiveness

Blog Series

Notes on Distributed Systems for Young Bloods

Jeff Hodges https://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/

The author summarizes lessons learned from working on distributed systems, which helps newcomers to the field shift their mindset. Key points include:

  1. Frequent failures are a distinctive feature of distributed systems compared to other systems
  2. Building robust distributed systems is far more difficult than building single-machine systems
  3. Open-source collaboration in distributed systems differs from that in single-machine systems
  4. Multi-machine coordination is very hard
  5. Slowness is very difficult to pinpoint in distributed systems
  6. Seek ways to make services partially available
  7. Make full use of locality
  8. Use the CAP theorem to examine your distributed system

Distributed Systems Theory for the Distributed Systems Engineer

https://www.the-paper-trail.org/post/2014-08-09-distributed-systems-theory-for-the-distributed-systems-engineer/

The blogger provides an entry path and reference materials for distributed systems:

  1. First steps: Recommends some books
  2. Failure and Time: The two most important cornerstones of distributed systems; provides references to some classic papers
  3. The basic tension of fault tolerance: Fault tolerance requires redundancy, but excessive redundancy hurts performance
  4. Basic primitives: Links to papers on basic concepts in distributed systems, including election algorithms, consistent snapshots, consensus protocols, distributed state machines, broadcast, and chain replication
  5. A list of industrial system papers: Mostly from Google, with some from others as well
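
The tension between redundancy and cost can be illustrated with a little arithmetic. The sketch below assumes crash failures (not Byzantine ones) and a majority-quorum protocol such as Raft or Paxos; the function names are hypothetical and are not taken from the blog post.

```python
def replicas_needed(f: int) -> int:
    """Replicas a majority-quorum protocol needs to survive f crash failures."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 2 * f + 1


def majority(n: int) -> int:
    """Smallest number of replicas that forms a majority of n."""
    return n // 2 + 1


for f in range(3):
    n = replicas_needed(f)
    print(f"tolerate {f} failure(s): {n} replicas, {majority(n)} acks per write")
```

Each additional failure tolerated adds two replicas and one more acknowledgement to wait for on every write, which is the performance price of the extra redundancy.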

Meetups

Papers We Love

Papers We Love sharing sessions on computer science papers: https://www.zhihu.com/column/c_1353678180390162432

Microsoft-Distributed-System-Meetup

A distributed systems meetup organized by Microsoft folks; activities include studying 6.824 together, reading DDIA together, and interesting keynote talks: https://microsoft-distributed-system-meetup.github.io/home/

Distributed Systems Reading Group

A paper reading group organized by MIT folks in 2013: http://dsrg.pdos.csail.mit.edu/papers/

Topics include consensus protocols, data replication, transactions, and concurrency issues.

Computer Systems Study Group

A computer systems study group organized by @胡津铭: https://learn-sys.github.io/cn/

The Last Thing

Finally, here is a classic awesome-series repo for distributed systems on GitHub: https://github.com/theanalyst/awesome-distributed-systems



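I am 青藤木鸟, a programmer who enjoys photography and focuses on large-scale data systems. You are welcome to follow my WeChat official account, “木鸟杂记”, which has more articles on distributed systems, storage, and databases. After following it, reply “资料” to get a collection of distributed database learning materials I have put together, or reply “优惠券” to get a 20%-off coupon for 《系统日知录》, my paid column on large-scale data systems.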

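We also run chat groups on distributed systems and databases. Add my WeChat ID, qtmuniao, and I will invite you; please include the note “分布式系统群” when adding me. If you would rather not join a group, there is also a forum on distributed systems and databases (linked in the original post) where you are welcome to drop by.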
