
What Is a Distributed System

In the tech world, "distributed system" has become an unavoidable term, because the scale of today's data no longer matches the storage and processing capacity of a single machine. That leaves two approaches: building bigger machines (scaling up) or connecting many machines together (scaling out). The former is costly and inflexible, so the latter is increasingly favored. By the law of conservation of cost, cost does not simply disappear: as hardware costs go down, software design costs go up. Distributed systems theory is the key to containing this software cost.

What It Is

Leslie Lamport [1], a pioneer of distributed systems, wrote in one of his most important papers, “Time, Clocks, and the Ordering of Events in a Distributed System” [2]:

A system is distributed if the message transmission delay is not negligible compared to the time between events in a single process.

Lamport explained this with a thought process akin to relativity. Consider two time scales: the message transmission delay between processes, and the interval between events within a single process. If the former is not negligible compared to the latter, then this group of processes constitutes a distributed system.

Understanding this definition requires grasping a few key concepts (formal definitions are always like this, shrug): process, message, and event. To avoid infinite recursion, I won’t go too deep here, but here’s an intuitive analogy: a process is a worker responsible for getting things done; the work can be broken down into multiple steps, each step is an event, and messages are the way workers communicate with each other.
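The worker/step/message analogy above can be sketched in a few lines of Python. This is purely illustrative: the two "workers" are threads, each statement they execute is an event, and queues stand in for the message channel between them.

```python
import queue
import threading

# Two "processes" (workers) that collaborate via messages.
# Every step each worker performs is an event; the queues are
# the only way they communicate.

def worker_a(inbox, outbox, log):
    log.append("A: start task")         # event: local work
    outbox.put("please review step 2")  # event: send a message
    reply = inbox.get()                 # event: receive a message
    log.append(f"A: got reply '{reply}'")

def worker_b(inbox, outbox, log):
    request = inbox.get()               # event: receive
    log.append(f"B: handling '{request}'")
    outbox.put("step 2 looks fine")     # event: reply

a_to_b, b_to_a, log = queue.Queue(), queue.Queue(), []
ta = threading.Thread(target=worker_a, args=(b_to_a, a_to_b, log))
tb = threading.Thread(target=worker_b, args=(a_to_b, b_to_a, log))
ta.start(); tb.start(); ta.join(); tb.join()
print(log)
```

Note that the only ordering we can rely on comes from the messages themselves: B's work happens after A's send because B had to receive the message first.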

This also aligns with the definition of distributed computing [3] (distributed systems are also called distributed computing) given by Wikipedia:

  1. There are several autonomous computational entities (computers or nodes), each of which has its own local memory.
  2. The entities communicate with each other by message passing.

This involves the most important kinds of resources in a computer system: computation, storage (memory), and the network that connects them.

In summary, we can describe distributed systems from another angle:

Externally, a distributed system appears as a unified whole, providing specific functionality based on aggregate storage and computing power.

Internally, a distributed system appears as a group of individuals that communicate via network messages and collaborate through division of labor.

The design goal of a distributed system is to maximize overall resource utilization while handling local failures and maintaining external availability.

Author: Muniao’s Notes https://www.qtmuniao.com/2021/10/10/what-is-distributed-system. Please indicate the source when reposting.

Characteristics

When building a distributed system, the following aspects should be considered at the logical level:

  1. Scalability: Scalability is the most essential requirement of a distributed system, meaning the system design allows us to respond to ever-growing external demand simply by adding more machines.
  2. Fault Tolerance / Availability: This is a side effect of scalability. As the system scale grows, single-machine failures become the norm. The system needs to handle these failures automatically and maintain external availability.
  3. Concurrency: Without a global clock for coordination, dispersed machines naturally exist in “parallel universes.” The system needs to guide this concurrency into cooperation to decompose and execute cluster tasks.
  4. Heterogeneity (internal): The system needs to handle differences in hardware, operating systems, and middleware within the cluster, and be able to accommodate new heterogeneous components joining the system.
  5. Transparency (external): Shield the outside world from the system’s internal complexity and present it as a single logical entity.
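The concurrency point deserves a concrete illustration. Lamport’s logical clocks, from the paper cited above, are the classic way to impose an order on events without any global clock: increment a local counter before every event, stamp outgoing messages with it, and on receipt advance to max(local, received) + 1.

```python
# A minimal sketch of Lamport logical clocks: each process keeps a
# counter; receiving a message pulls the receiver's clock forward,
# so a receive is always ordered after the corresponding send.

class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        """Advance for a local event or a send."""
        self.time += 1
        return self.time

    def recv(self, msg_time):
        """Advance past both our clock and the message's timestamp."""
        self.time = max(self.time, msg_time) + 1
        return self.time

p1, p2 = LamportClock(), LamportClock()
t_send = p1.tick()        # p1 sends a message stamped 1
p2.tick()                 # p2 does unrelated local work (clock = 1)
t_recv = p2.recv(t_send)  # p2 receives: max(1, 1) + 1 = 2
print(t_send, t_recv)
```

The guarantee is one-directional: if event a causally precedes event b, then clock(a) < clock(b), but two events with different timestamps may still be concurrent.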

Types

When organizing a distributed system, the following physical architectures are possible:

  1. Master-Workers (master-workers): One machine is in charge of coordination while the others do the actual work, as in Hadoop. The advantage is that design and implementation are relatively easy; the disadvantage is that the master is both a single point of failure and a potential bottleneck.
  2. Peer-to-Peer (peer-to-peer): All machines are logically equivalent, as in Amazon Dynamo. The advantage is that there is no single point of failure; the disadvantage is that coordinating the machines is difficult and consistency is hard to guarantee. If the system is stateless, however, this architecture is a very good fit.
  3. Multi-Tier (multi-tier): This is a composite architecture and also the most commonly used in practice. For example, the increasingly popular separation of storage and computation. Each tier can be designed according to different characteristics (IO-intensive, compute-intensive) and can even reuse existing components (cloud-native).
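The master-workers pattern can be sketched as a toy task queue. This is a single-process stand-in (threads instead of machines, queues instead of the network) and does not follow any particular framework’s API: the master splits a job into tasks, workers pull tasks and push results back.

```python
import queue
import threading

# Toy master-workers: the master enqueues tasks, a pool of workers
# pulls them, does the "actual work", and reports results back.

def worker(tasks, results):
    while True:
        n = tasks.get()
        if n is None:          # sentinel: no more work for this worker
            break
        results.put(n * n)     # the actual work: square the number

tasks, results = queue.Queue(), queue.Queue()
workers = [threading.Thread(target=worker, args=(tasks, results))
           for _ in range(3)]
for w in workers:
    w.start()

for n in range(1, 6):          # the master hands out 5 tasks
    tasks.put(n)
for _ in workers:              # one sentinel per worker to shut down
    tasks.put(None)
for w in workers:
    w.join()

total = sum(results.get() for _ in range(5))
print(total)                   # 1 + 4 + 9 + 16 + 25 = 55
```

Even this toy shows the architecture’s weakness: if the master (the main thread here) dies, no new tasks are assigned, which is exactly the single point of failure mentioned above.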

Pros and Cons

To reiterate, distributed systems are a reluctant choice made when the capacity of a single machine no longer matches the scale of data. Therefore, when designing a system, prioritize a single-machine system. After all, the complexity of distributed systems rises exponentially.

Now let’s summarize the pros and cons of distributed systems.

Advantages

High availability, high throughput, high scalability

  1. Unlimited Scalability: If well designed, you can respond to ever-growing demand by linearly increasing machine resources.
  2. Low Latency: Deploy in multiple regions and route user requests to the nearest data center for processing.
  3. High Availability, Fault Tolerance: Even if some machines fail, the system can still provide service externally.

Disadvantages

The biggest problem is complexity.

  1. Data Consistency. Machine failures are frequent at scale (crashes, restarts, shutdowns), so data may be lost, stale, or corrupted. Making the system tolerate these faults while still guaranteeing data correctness to the outside requires quite complex design.
  2. Network and Communication Failures. The network is unreliable: messages may be lost, delayed, reordered, or stalled, which brings tremendous complexity to coordination between machines. Basic network protocols like TCP solve some of these problems, but many more must be handled at the system level itself. That is before even considering message forgery on open networks.
  3. Management Complexity. When the number of machines reaches a certain scale, how to effectively monitor them, collect logs, and balance load are all significant challenges.
  4. Latency. Network communication latency is several orders of magnitude higher than intra-machine communication. The more components there are and the more network hops involved, the higher the latency, all of which ultimately affect the quality of service the system provides externally.
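The most basic defense against lost or delayed messages is retry with exponential backoff, sketched below. `FlakyService` is an illustrative stand-in for any network call, not a real API; real systems additionally need idempotent requests so that retries are safe to repeat.

```python
import time

# Retry with exponential backoff: retry a failing call, doubling the
# wait between attempts, and give up after a fixed number of tries.

class FlakyService:
    """Simulates a service whose first few calls time out."""
    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def call(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise TimeoutError("message lost")
        return "ok"

def with_retries(fn, attempts=5, base_delay=0.01):
    for i in range(attempts):
        try:
            return fn()
        except TimeoutError:
            if i == attempts - 1:
                raise                        # out of attempts: give up
            time.sleep(base_delay * 2 ** i)  # 10ms, 20ms, 40ms, ...

svc = FlakyService(failures=3)
result = with_retries(svc.call)
print(result, svc.calls)  # succeeds on the 4th attempt
```

Note how even this tiny mitigation adds latency (the backoff sleeps) and state (attempt counters), a small taste of the complexity tax described above.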

References

  1. Wikipedia, Leslie Lamport: https://en.wikipedia.org/wiki/Leslie_Lamport
  2. Leslie Lamport, Time, Clocks, and the Ordering of Events in a Distributed System: https://lamport.azurewebsites.net/pubs/time-clocks.pdf
  3. Wikipedia, Distributed Computing: https://en.wikipedia.org/wiki/Distributed_computing
  4. Confluent, The Complete Guide to Distributed Systems: https://www.confluent.io/learn/distributed-systems/
  5. Splunk, What Are Distributed Systems: https://www.splunk.com/en_us/data-insider/what-are-distributed-systems.html
