This is the opening article in the series of articles on the topic of Distributed Systems. My intention is to do a series of relatively in-depth articles on topics: Distributed Systems, Microservices, Transactions, Event sourcing, CQRS, Domain Driven Design, and if I have time, I may mention Blockchain. again.
A “Distributed Systems” (now for permission to use the English word) is defined as a set of independent computational processes, interconnected by a system. network (a network) so that these processes can transmit information (‘process’ here is defined as a computational unit operated with a separate memory space, not overlapping with other processes). These processes work together as a single entity to the external user to perform a certain task. Based on this definition, multiple processes on the same computer can also be referred to as Distributed Systems (DS). Of course, in reality, people are interested in running multiple computers together, so we can implicitly mean that these processes run on separate computers.
So why do we need to learn about DS ? DS appears very popular in practice. Most applications today, especially Internet applications, are deployed in the form of DS. Deploying software, especially large systems, across multiple computer units (instead of just using a single computer) has many benefits, for example:
- Provide more resources when the system needs to handle a larger amount of work.
- Using only one computer unit means there is a risk of software crashing if it crashes. Using multiple machine units will allow you to continue operating the software even if a problem occurs.
- When your system becomes complex and requires a combination of many different components, using DS allows you to break a large system into many small units. Each unit can operate independently, and can even be developed by different teams with different specializations.
- System users can have geographic distribution around the globe. To ensure quality of service and limit latency, the system needs to be distributed so that it can be present as close to the user as possible.
What’s the challenge to build Distributed Systems?
Although it brings many of the above benefits, DS also makes development and operation more complicated, due to its distributed properties. To build a well-functioning DS system, we will have to solve many problems, depending on the specifics of that system. Some common problems can be named such as:
- The problem of troubleshooting (failure): With a mainframe system, the machine malfunctioning or malfunction is a frequent occurrence. Troubleshooting so as not to affect the system (or at least minimize the impact) is the most core issue in building heritage.
- Consensus issue: Building DS means that the data of the system will also be distributed. This causes problems if we want the system to work as a unified entity, since the machines will sometimes disagree on the data side.
- Time asynchronous problem: Each computer in the system has a separate clock, and is not consistent with each other in terms of time. The computer clock also often occurs phase deviation (clock drift). This asynchronousness can lead to logical deviation of the system. For example, if a messaging application has been timed inconsistently, the order of messages from user may be erroneous between user and user.
- Other issues like security problem, communication problem, data storage problem and data backup …
There is an interesting thing in the field of heritage, which is that almost no theoretical solution can meet the above requirements perfectly (discussed in more depth in the following articles). However, this does not mean that they do not have appropriate solutions in practice, depending on the specific system. Most of the recent developments in the field of heritage have been aimed at the construction of practical solutions for systems with specific purposes.
Failure types in Distributed Systems
As mentioned above, problems in system operation happen on a regular basis. Here is a summary of the translation of Jeff Dean , a chief engineer at Google:
In Google’s data center, each year a cluster will happen about 1000 machine failures, thousands of hard drive failures, an average power failure causes about 500-1000 servers to crash in around 6 hours. In addition, an average of 5 racks will malfunction, losing half of the packets being transmitted, affecting about 5% of servers at any given time.
Because the problem occurs so often, the heritage systems need to have a mechanism to automatically handle it when a bad situation is encountered, because it is not always possible for people to intervene in a timely manner. To solve this problem, we need to understand what types of problems the system may encounter.
Each physical computer may, for a variety of reasons, experience a problem in its operation. These incidents are divided into several main categories:
- Fail-stop : This is the type of problem that causes the process on the machine to stop working (stop calculating and transmitting). The cause of this problem can be caused by a crash (software error, operating system error …), hardware failure, or external causes (eg power outage). This is the most common type of problem, so when people talk about ‘failure’ without mentioning anything more it is often implied this type of problem. Most of the algorithms developed in DS deal with this type of problem.
- Fail-recover : The process may stop working for a certain time, but then resume the activity again. The cause of this type of problem may be due to the machine automatically reboot for some reason. Often when it comes to this type of problem, it is assumed that the machine has the ability to store information on the hard drive and recover it after the problem occurs.
- Byzantine failure : The problem that the computer is not working as required. For example, the machine can send messages at will, or change the state at will, unlike what is programmed. This is the most annoying type of problem, which can occur when the system has an unexplained malfunction (e.g. RAM can be corrupted causing a bit-flip condition), or when a malicious system is attacked. .
The computer network is also a physical product, and therefore problems may also occur. One common type of problem is the ” network partitioning” problem , shown in the figure above. This problem occurs when the transmission line of one or more servers is cut off from the rest of the system, causing the system to be divided into parts that cannot communicate with each other. In fact, in data centers, a server cluster is usually connected by one or more switches. Switch port or wiring failure can result in one or more servers being disconnected, leading to the partitioning situation described above.
In addition, some other network problems can be mentioned as latency increased (due to congestion control), or network adapter of servers malfunctioning …
Design Distributed Systems to be ready to handle the problems
Due to frequent and diverse breakdowns as mentioned above, the heritage system should be designed to have an automatic trouble-handling mechanism to ensure uninterrupted or error-free operation. Here I will quote a scientific article titled “About the design and deployment of Internet services” , by James Hamilton (then working at Microsoft, current chief engineer at Amazon Web Service):
System design for troubleshooting is a core concept when developing large-scale services, consisting of many small components. These components will frequently crash, and sometimes one problem can lead to another problem. When the system is deployed on about 10,000 servers and 50,000 hard disk units, the problem will occur many times a day. If any problem requires human intervention, the system will not be able to operate smoothly and save money. Therefore, the system should have an incident response mechanism that does not require human intervention. This mechanism should be checked on a regular basis. A simple way to check is to intentionally cause frequent crashes during the system’s operation. This may seem ridiculous at first glance, however, if the crash mechanism is not frequently used then they will not function as needed.
That said, when developing a DS system, the problem must always be considered as part of the problem, and we absolutely must not assume that the problem never occurs.
A few documents & links to learn more about Distributed Systems
Distributed Systems is a relatively extensive field and in a few articles it is difficult to cover all the topics in detail. Here are a few sources of information for you to learn more about this field:
- MIT Distributed Systems key recorded on YouTube: https://www.youtube.com/playlist?list=PLrw6a1wE39_tb2fErI4-WkMbsvGQk9_UB
- Distributed Systems Course channel on Youtube: https://www.youtube.com/user/cbcolohan
- Reliable Distributed Algorithms (Free) Course: Part 1 and Part 2
In the next section, I will discuss the consensus problem, the most basic problem in Distributed Systems.