In the tech world, "distributed systems" has become an increasingly unavoidable term, because the scale of data in this era has outgrown the storage and processing capacity of a single machine. That leaves two approaches: build a bigger machine, or connect many machines together. The former is costly and inflexible, so the latter has gained favor. By the law of conservation of cost, cost does not simply disappear: as hardware costs fall, software design costs rise. Distributed systems theory is the key to keeping this software cost down.
What It Is
Leslie Lamport [1], a pioneer of distributed systems, wrote in one of his most important papers, “Time, Clocks, and the Ordering of Events in a Distributed System” [2]:
A system is distributed if the message transmission delay is not negligible compared to the time between events in a single process.
Lamport explained this with a thought process akin to relativity. Consider two time scales: the message transmission delay between processes, and the interval between events within a single process. If the former is not negligible compared to the latter, then this group of processes constitutes a distributed system.
Understanding this definition requires grasping a few key concepts (formal definitions are always like this, shrug): process, message, and event. To avoid infinite recursion, I won’t go too deep here, but here’s an intuitive analogy: a process is a worker responsible for getting things done; the work can be broken down into multiple steps, each step is an event, and messages are the way workers communicate with each other.
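These three concepts are exactly what Lamport's paper [2] formalizes with logical clocks: each process numbers its events with a counter, attaches the counter to outgoing messages, and merges counters on receipt, so events can be ordered without any shared wall clock. A minimal sketch (the `Process` class and method names here are illustrative, not from the paper):

```python
# A minimal sketch of Lamport logical clocks, the mechanism from the
# paper cited above. Each process keeps a counter: it ticks on every
# local event, the counter rides along on outgoing messages, and the
# receiver merges it with its own via max() before ticking again.

class Process:
    def __init__(self, name):
        self.name = name
        self.clock = 0  # a logical clock, not wall time

    def local_event(self):
        self.clock += 1
        return self.clock

    def send(self):
        self.clock += 1
        return self.clock  # timestamp carried by the message

    def receive(self, msg_timestamp):
        # Merge rule: take the max of local and message clock, then tick.
        self.clock = max(self.clock, msg_timestamp) + 1
        return self.clock

p, q = Process("P"), Process("Q")
p.local_event()   # P ticks to 1
t = p.send()      # P ticks to 2; the message carries timestamp 2
q.receive(t)      # Q jumps to max(0, 2) + 1 = 3
```

The merge rule guarantees that if event a causally precedes event b (same process, or a message from a's process to b's), then a's timestamp is strictly smaller, which is all that "ordering of events" requires.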
This also aligns with the definition of distributed computing [3] (distributed systems are also called distributed computing) given by Wikipedia:
- There are several autonomous computational entities (computers or nodes), each of which has its own local memory.
- The entities communicate with each other by message passing.
This touches the most important kinds of resources in computer systems: computation, storage (memory), and the network that connects them.
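The two properties above can be modeled in a few lines: two "nodes" that each hold private local state and interact only by passing messages. The sketch below is a toy, with queues standing in for the network and threads standing in for machines; the `node` function and its fields are invented for illustration:

```python
import threading
import queue

# Toy model of the Wikipedia definition: two autonomous nodes, each with
# its own local memory, cooperating purely by message passing (queues
# stand in for the network; no state is shared directly).

def node(inbox, outbox, initial):
    local = {"value": initial}   # local memory, never read by the peer
    outbox.put(local["value"])   # send our value to the peer
    peer_value = inbox.get()     # receive the peer's value
    local["sum"] = local["value"] + peer_value
    return local

a_to_b, b_to_a = queue.Queue(), queue.Queue()
results = {}
t1 = threading.Thread(target=lambda: results.update(a=node(b_to_a, a_to_b, 1)))
t2 = threading.Thread(target=lambda: results.update(b=node(a_to_b, b_to_a, 2)))
t1.start(); t2.start()
t1.join(); t2.join()
print(results["a"]["sum"], results["b"]["sum"])  # both nodes compute 3
```

Each node sends before it receives, so neither blocks waiting for the other; both arrive at the same sum without ever touching the other's memory.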
In summary, we can describe distributed systems from another angle:
Externally, a distributed system appears as a unified whole, providing specific functionality based on aggregate storage and computing power.
Internally, a distributed system appears as a group of individuals that communicate via network messages and collaborate through division of labor.
The design goal of a distributed system is to maximize overall resource utilization while handling local failures and maintaining external availability.
Author: Muniao’s Notes https://www.qtmuniao.com/2021/10/10/what-is-distributed-system. Please indicate the source when reposting.
Characteristics
When building a distributed system, the following aspects should be considered at the logical level:
- Scalability: Scalability is the most essential requirement of a distributed system, meaning the system design allows us to respond to ever-growing external demand simply by adding more machines.
- Fault Tolerance / Availability: This is a side effect of scalability. As the system scale grows, single-machine failures become the norm. The system needs to handle these failures automatically and maintain external availability.
- Concurrency: Without a global clock for coordination, dispersed machines naturally exist in “parallel universes.” The system needs to guide this concurrency into cooperation to decompose and execute cluster tasks.
- Heterogeneity (internal): The system needs to handle differences in hardware, operating systems, and middleware within the cluster, and be able to accommodate new heterogeneous components joining the system.
- Transparency (external): Shield the external world from the system’s complexity and provide logical singleness.
Types
When organizing a distributed system, the following physical architectures are possible:
- Master-Workers: One machine is in charge of coordination while the other machines do the actual work, as in Hadoop. The advantage is that design and implementation are relatively easy; the disadvantage is that the master becomes both a performance bottleneck and a single point of failure.
- Peer-to-Peer: All machines are logically equivalent, as in Amazon Dynamo. The advantage is that there is no single point of failure; the disadvantage is that coordinating the machines is difficult and consistency is hard to guarantee. If the system is stateless, however, this architecture is a very good fit.
- Multi-Tier: A composite architecture, and the most commonly used in practice. An example is the increasingly popular separation of storage and computation. Each tier can be designed according to its own characteristics (IO-intensive, compute-intensive) and can even reuse existing components (cloud-native).
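The master-workers division of labor can be sketched in a few lines: a master splits a job into tasks and places them on a queue, and a pool of workers pulls tasks and pushes results back. This is a single-process toy (threads and queues stand in for machines and the network), with all names invented for illustration:

```python
import threading
import queue

# Minimal master-workers sketch: the master enqueues tasks, workers pull
# tasks and push results back, and a None "poison pill" per worker
# signals shutdown.

def worker(tasks, results):
    while True:
        n = tasks.get()
        if n is None:          # poison pill: the master tells us to stop
            break
        results.put(n * n)     # the "actual work": square the input

tasks, results = queue.Queue(), queue.Queue()
workers = [threading.Thread(target=worker, args=(tasks, results))
           for _ in range(3)]
for w in workers:
    w.start()

for n in range(10):            # master hands out the work...
    tasks.put(n)
for _ in workers:              # ...then signals shutdown
    tasks.put(None)
for w in workers:
    w.join()

total = sum(results.get() for _ in range(10))
print(total)  # sum of squares 0..9 = 285
```

Note how the single shared queue is exactly the single-point bottleneck the text describes: every task flows through the master's queue, and if it dies the whole computation stops.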
Pros and Cons
To reiterate, distributed systems are a reluctant choice made when the capacity of a single machine no longer matches the scale of data. Therefore, when designing a system, prioritize a single-machine system. After all, the complexity of distributed systems rises exponentially.
Now let’s summarize the pros and cons of distributed systems.
Advantages
High availability, high throughput, high scalability
- Unlimited Scalability: If well designed, you can respond to ever-growing demand by linearly increasing machine resources.
- Low Latency: Deploy in multiple regions and route user requests to the nearest data center for processing.
- High Availability, Fault Tolerance: Even if some machines fail, the system can still provide service externally.
Disadvantages
The biggest problem is complexity.
- Data Consistency. Given how often machines fail (crashes, restarts, shutdowns), data may be lost, go stale, or become corrupted. Making the system tolerate these issues while still guaranteeing data correctness externally requires quite complex design.
- Network and Communication Failures. The unreliability of the network means messages may be lost, delayed, duplicated, or arrive out of order, which brings tremendous complexity to coordination between machines. Basic network protocols like TCP solve some of these problems, but more must be handled at the system level itself—not to mention possible message forgery on open networks.
- Management Complexity. When the number of machines reaches a certain scale, how to effectively monitor them, collect logs, and balance load are all significant challenges.
- Latency. Network communication latency is several orders of magnitude higher than intra-machine communication. The more components there are and the more network hops involved, the higher the latency, all of which ultimately affect the quality of service the system provides externally.
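One classic technique that trades some of this complexity for consistency under failure is quorum replication: keep N replicas, require W acknowledgments for a write and R replies for a read, and choose W + R > N so that every read set overlaps the last successful write set. The sketch below is a deliberately simplified model (the `write`/`read` functions, version numbers, and `reachable` parameter are all invented for illustration; real systems such as Dynamo add much more machinery):

```python
# Minimal quorum-replication sketch: a write succeeds once W of the N
# replicas ack it; a read queries R replicas and keeps the value with
# the highest version. W + R > N guarantees the read set intersects the
# most recent write set, even when some replicas are unreachable.

N, W, R = 3, 2, 2
assert W + R > N  # the quorum intersection condition

replicas = [{"version": 0, "value": None} for _ in range(N)]

def write(value, version, reachable):
    if len(reachable) < W:
        return False                 # too few replicas up: write fails
    for i in reachable[:W]:
        replicas[i] = {"version": version, "value": value}
    return True

def read(reachable):
    replies = [replicas[i] for i in reachable[:R]]
    return max(replies, key=lambda r: r["version"])["value"]

write("a", 1, reachable=[0, 1, 2])
write("b", 2, reachable=[1, 2])      # replica 0 is down; write still succeeds
print(read(reachable=[0, 1]))        # read set overlaps replica 1, sees "b"
```

Even this toy shows where the complexity comes from: the read must compare versions across replies because replica 0 still holds the stale value "a", and the availability gained by tolerating a dead replica is paid for in extra round trips on every operation.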
References
[1] Wikipedia, Leslie Lamport: https://en.wikipedia.org/wiki/Leslie_Lamport
[2] Leslie Lamport, Time, Clocks, and the Ordering of Events in a Distributed System: https://lamport.azurewebsites.net/pubs/time-clocks.pdf
[3] Wikipedia, Distributed Computing: https://en.wikipedia.org/wiki/Distributed_computing
[4] Confluent, The Complete Guide to Distributed Systems: https://www.confluent.io/learn/distributed-systems/
[5] Splunk, What Are Distributed Systems: https://www.splunk.com/en_us/data-insider/what-are-distributed-systems.html
