The Four Horsemen of network communication

Jargon ahead
This post contains technical jargon and industry-specific terms. While I strive to explain concepts clearly, some familiarity with cloud computing and distributed systems may be necessary to fully understand the content.

As already discussed in On distributed systems, one of the hallmark characteristics of a distributed system is that it requires at least occasional network communication to get the job done.

The ability to perform computational (or any other, really) tasks in a distributed, but mostly coordinated and orderly manner is what makes distributed systems stand out, and what allows them to accomplish feats simply impossible for a single physical computer. This is a double-edged sword, however, and in this post we will explore how networking can sometimes scar the very masterpiece created with its powers. Unfortunately, networks have a number of traits that are less than desirable. More often than not, they make networking a liability just as much as a feature:

Unreliability

Networks can, and do go down at the least appropriate time and without notice. Loss of cellular network, having poor Wi-Fi signal or your Bluetooth headphones disconnecting whenever you wander too far away from your desk are great everyday examples, but wired network can also experience partial or full outages due to software and hardware errors, loss of power, damaged wires or simply being too overwhelmed by the traffic.

A full-fledged outage, where no network communication can effectively take place, is only one of many possible symptoms of network unreliability. Due to its absolute nature, it is likely one of the easiest to reason about and address in distributed systems. Far more frequent, yet rather subtle are partial outages and intermittent problems, where the network is mostly functional, but experiences all sorts of glitches - seemingly random and difficult to pin down:

All of this can be quite annoying when you are playing online multiplayer games, catching up with your favorite TikTokers or if just want to sit back on your sofa and have an all-nighter with Netflix. After all, instead of entertainment you see stream hanging up, have lags in an FPS where every tick matters, or lose a match due to key player being thrown back to lobby in a critical moment.

However, if you are building distributed systems, network unreliability takes problems and annoyances to a whole new level:

Some Software Engineers and Architects simply choose to pretend the problem does not exist, or that it is too unlikely to be taken seriously. Some genuinely believe that they can escape the grim reality of consistency vs availability dilemma by throwing enough hacks, workarounds and hotfixes at the problem, or by simply paying a hardware / software / cloud vendor who would solve the problem for them. Sadly, such wishful thinking only leads to systems that may not be too reliable, but at least they are expensive and nearly impossible to maintain.

Oblivious inconsistency

Consider this system:

Synchronous OrchestratorService AService BService C...Service J

The orchestrator performs some business operation that coordinates multiple services, from A through J. Unfortunately, it was not design with network failures in mind, and the orchestrator only covers the happy path when all the interactions go smoothly, with no room for failure. If the failure does occur, forever - perhaps Service C malfunctions, or the server it ran broke down, or there was a random network timeout - the orchestrator ends up in an undefined state, incapable of either completing the distributed transaction, nor of undoing the work already done - since this is a distributed transaction, there is no way to rollback automatically as if it ran within a single DB transaction.

What can Software and Support Engineers do in such a situation?

A mature Engineering team should make conscious decisions on when to strive for system consistency, and when to focus on system availability - balancing these two traits to meet product requirements.

Taking (dis)advantage of probability

In the example above, the fact that an orchestrator needs to coordinate communication with 10 different services only exacerbates the issue. Let us assume for a moment that the company behind this system mandates a 4-nines availability standard, meaning that the system must be available 99.99% of the time. Now, when do we consider this system to be available? With this architecture, all of the components must be available. From probability calculus, we can tell that if all conditions must be met for the outcome to be successful, then the probability of a successful outcome is a product of all partial probabilities. In our case, it’s the probability that each component of a system is available:

$ P_{\text{system}} = \prod_{i=1}^{n} P_i $

In our case, we have a single API gateway and 10 upstream services, so we can assume n=11. Now, let us see what happens if all components of the system follow the mandated requirement of 99.99% availability to the letter. Let us find the predicted availability of the entire system:

$ P_{\text{system}} = \prod_{i=1}^{n} P_i = \prod_{i=1}^{11} 0.9999 = 0.9999^{11} \approx 0.9989 $

Despite each of the 11 components does meet the availability requirements, the entire system does not - at 99.89% availability, it is far from meeting 4-nines requirements, and even missed a 3-nines standard (99.9%) by a hair. If we are generous, we can round up and say it is a 3-nines system, but realistically speaking it would be classified as 99.5%. In order for the entire system to meet the 4-nines requirement, translating to just under an hour of annual downtime, individual components would have to have an availability of 99.9995%, meaning a mere 2.5 minutes of downtime per year.

However, probabilistic calculus can be taken to our advantage as well. Consider the following scenario:

ServiceLoad BalancerInstance AInstance BInstance CInstance DInstance E

This time, let us assume we run multiple instances of our service, all of them behind a Load Balancer that routes the traffic to available instances. In order for this setup to be available, we require the following:

For the sake of this example, let us assume the Load Balancer is already Highly Available with pre-configured failover to alternative LB instance, and guarantees a 99.99995% availability - a bit over 6-nines standard. Let us also assume our instances only meet a 2-nines standard individually, but we run 5 of them while expecting 1 to be sufficient to run the system.

In case of replication, probability is our asset. Since we only need one of the instances to be available at any time, we can derive availability from the likelihood that all the instances experience downtime simultaneously and independently:

$ P_{\text{replicas}} = 1 - P_{\text{failure}} = 1 - \prod_{i=1}^{n} P_{\text{i,failure}} = 1 - \prod_{i=1}^{n} (1 - P_i) $

At 99% availability, the results would be as follows:

$ P_{\text{replicas}} = 1 - P_{\text{failure}} = 1 - \prod_{i=1}^{5} P_{\text{i,failure}} = 1 - \prod_{i=1}^{5} (1 - 0.99) = 1 - 0.01^5 = 1 - 10^{\text{-10}} = 0.9999999999 $

That is correct - the likelihood that all the instances would go down independently, at random is just absurdly low, and should a failure occur - it is highly unlikely to be the reason. Combined with the availability of the Load Balancer, we get the following outcome:

$ P_{\text{system}} = P_{\text{instances}} \times P_{\text{LB}} = 0.9999999999 \times 0.9999995 \approx 0.99999949 $

As you can see, the service instances and the Load Balancer can still meet the 6-nines standard, with room to spare. In practice, the availability would likely be further limited by other factors:

Key takeaway
Network unreliability is a severe design constraint that should never be overlooked - and with current technology, this trait of networks is going to stay with us for the foreseeable future. As long as distributed systems need to utilize unreliable networks for communication, their designs must take this trait into consideration and adapt. Otherwise, in an event of network errors the system could start to behave in unpredictable ways, or the customer data would be corrupted, or various parts of the system would strive to achieve mutually exclusive goals - preserving consistency or availability.

Bandwidth and throughput

Historically speaking, networks are invariably, inherently slow. Surely enough, you can arrange a ≥1 Gbps optic fibre connection with your ISP and even get close to this transfer on a good day, and in data centers around the world they probably have connections capable of 100 Gbps, if not more. On the other hand, a typical wired connection is going to be much slower than that, and cellular networks are typically even slower. The reason is that even if you have the best LTE or 5G signal out there, it is useful as long as the BTS providing you the service is not too overwhelmed with nearby devices. In fact, my friend had plenty of base stations around when he lived in Karpacz, Poland, and yet when throngs of tourists came over to Karpacz, or when the Economic Forum was held there, the network would become hopelessly overwhelmed - to the point calling emergency numbers would often be impossible.

Compare all of this with what computers are capable of internally. Even disk storage has become significantly faster in the last decade or so, first with widespread adoption of SSD - and more recently large-scale introduction of NVMe and PCIe consumer disks. Read and write speeds in excess of 1 GB/s (≈ 8 Gbps) are the norm for NVMe / PCIe disks, and this performance is quite consistent as opposed to network, where your IPS usually operates on a best-effort basis. High-end SSDs have already surpassed 12 GB/s (≈ 100 Gbps), besting the fastest optic fibre networks I have heard of so far, which are not available for an average Joe like us anyway. RAM and GPU buses are even faster, anyway, and the transfer speeds there are already approaching, if they have not yet exceeded 100 GB/s (≈ 800 Gbps), and even then MOBO-mounted RAM access is slower than what CPUs and GPUs can pull off with SoC memory and caches.

What is more, the hardware resources in a PC, laptop, or Smart TV are there for almost exclusive use of the owner. They get utilized by OS, programs and services you run on them - consciously or not - meaning that you have at least rudimentary control of these resources. If there are a lot of files to copy from a USB stick to your hard disk or vice versa, or if you are rendering videos, you can simply turn off less needed programs that would compete for the same resources and are not immediately needed. In case of networks, however, the resources are shared - if not with your co-workers in an office, then with your family in case of home WiFi, or people nearby whenever you use cellular network. What is more, even if you live alone and access the Internet over the wire - the Quality of Service is limited by the infrastructure your ISP has in your are, then on how well is your town or district connected to other areas, then how well connected your country is with the rest of the world. If an entire neighborhood is connected via a 10 Gbps optic fibre, it is physically impossible that everyone could enjoy their 1 Gbps internet and push it to its limits, simultaneously.

When is bandwidth a problem?

In my experience, there are several reasons why bandwidth could become a problem in distributed systems:

To begin with, server-to-server communication is quite privileged - in the sense that network bandwidth is usually good enough to not to have to worry about it, at least initially. We have grown used to using rather bloated, text-based data formats to exchange information, such as XML and JSON. Only recently we started appreciating binary formats again, and the likes of Protocol Buffers and communication based on gRPC are all the rage. An end user using a limited, low-bandwidth cellular network is in a far more precarious position - with a throughput measured in low megabits or even hundreds of kilobits per second, and oftentimes with limits on transfer usage imposed by their provider. In their case, excessive bandwidth usage is visible from day one, rather than after attaining a certain scale.

In day-to-day Software Engineer’s work, bandwidth can become problematic as various dependencies need to be downloaded on different occasions:

The best way to manage this problem is to be mindful about your project dependencies:

Lastly, one needs to be considerate about bandwidth usage induced by the systems they are building. This may be difficult to notice at a lower scale, but as the traffic keeps growing sending large JSON or XML payloads back and forth between servers can become problematic. To give one example, if a server send and receives 10MB payloads for each request, it means approximately 80 Mbps throughput in each direction just to handle a single request per second! At 100 rps, the throughput would already be 8 Gbps. It is quite likely that network bandwidth would become problematic - therefore, one could consider limiting payloads size. This can be done by limiting redundancy, increasing granularity (if feasible), introducing a more information-dense format, such as binary, and finally by utilizing compression.

Key takeaway
Networks can approach SSD drive transfers on a good day, though they will usually struggle to keep up for extended periods, and they are orders of magnitude slower than other media in a typical personal computer or server. This means that whenever large amounts of data need to be transferred from one destination to another, network is going to be the limiting factor. What is more, network infrastructure is shared with multiple participants, and overall usage of available bandwidth has significant impact on throughput attainable by individual participants.

Latency and jitter

Putting aside the sheer speed at which data can be transferred across a network, another limiting factor is how long it takes before a message of any kind reaches its recipient. This message can be as little as an ACK during SSL handshake, or as much as a SOAP response with 20 MB worth of XML payload - nevertheless, there is always a delay involved in network communication. In fact, it becomes more visible with growing frequency, not size of the messages involved - with multiple messages being exchanged between the participants, the delays can build up significantly before the communication is concluded.

The reasons for existence of such delays are at least twofold:

If you look closely at any professional latency report, you will notice these values are by no means absolute or definitive - instead, they are expressed in terms of latency at given percentile. The reason for this is jitter - because each individual latency is a result of a multitude factors, some of them rather volatile, the latency is subject to changes. Ten packets sent from the same sender to the same recipient may arrive after varying delays, not to mention packets sent from multiple senders or to multiple recipients. Because of this, we frequently read statements such as:

We observed a latency of 120ms at p50 and 200ms at p99 over the last 2 days.

What it means is that within the specified period, 50% of the time (p50) the latency was measured at 120ms or lower, while 99% of the time (p99) it would not exceed 200ms. We can also say that this connection - whatever it is - has a jitter that is not really that bad. It is not uncommon to see p99 latencies being many times higher than the median (p50)

Example - traceroute

As an exercise, let us see what output would we see when running traceroute command to trace packets sent to various hosts:

First, google.com:

~ traceroute google.com
traceroute to google.com (142.250.186.206), 30 hops max, 60 byte packets
 1  _gateway (192.168.0.1)  0.395 ms  0.485 ms  0.560 ms
 2  kra-bng2.neo.tpnet.pl (83.1.4.240)  6.608 ms  6.648 ms  6.683 ms
 3-12 (...)
13  209.85.252.117 (209.85.252.117)  23.511 ms 209.85.255.35 (209.85.255.35)  24.598 ms 142.250.239.81 (142.250.239.81)  23.634 ms
14  142.250.239.81 (142.250.239.81)  24.546 ms waw07s05-in-f14.1e100.net (142.250.186.206)  24.665 ms  24.620 ms

This host must be located not too far from Kraków - waw[...] prefix would even suggest it may be somewhere in Warsaw.

Then, let us try to hit a more remote server - I found out there is a Brazilian newspaper O Globo hosted on oglobo.globo.com:

~ traceroute oglobo.globo.com       
traceroute to oglobo.globo.com (201.7.177.244), 30 hops max, 60 byte packets
 1  _gateway (192.168.0.1)  0.422 ms  0.531 ms  0.617 ms
 2  kra-bng2.neo.tpnet.pl (83.1.4.240)  3.377 ms  3.417 ms  3.494 ms
3-9 (...)
10  boca-b3-link.ip.twelve99.net (62.115.123.29)  137.755 ms  137.776 ms  137.793 ms
11-29  * * *
30  * * *

Apparently, this server must be too far to trace its route, or perhaps behind some firewall preventing this information from reaching me. Even then, we can clearly see these packets were sent much further - the 10th hop took place after nearly 140ms, and there were 20 more before traceroute gave up.

Now, with that knowledge we can imagine what happens when two servers exchange messages, or when a user from Alaska has a video chat with friends from Germany - there is always a delay involved between the moment one device sends a packet, and the other receives it, and there is no way to escape this. Even in my vicinity, I can see it takes 4-7ms before a packet sent from my PC reaches the first server on my ISP’s side! Looking at AWS Latency Monitoring, this more or less coincides with what latency can be expected when operating within a single AWS region.

What does latency mean for distributed systems?

First of all, the more distributed they are in the geographical sense, the further distances their messages need to cover, contributing to greater latencies of the system. Furthermore, by co-locating components that need to communicate with each other we can significantly improve the systems performance - rather than hosting backend servers in the USA and databases in the UK, it is probably more efficient to have a few instances of each resource in each individual region or area of operation. It also happens to make compliance with all sorts of regulations easier, and allows you to provide customers with better user experience as everything they need is close to them - see google.com example above.

Moreover, even with minimal latencies - say, 3-5ms within a single AWS region or 1-2ms within data center premises - they still can, and do compound. Imagine these two systems:

Snake SystemsUserAPI GatewayPythons ServiceSnakes ServiceReptiles ServiceReptiles Database

In Snake Systems, there is a rather prevalent architecture where network requests need to make their way through multiple layers of microservices before reaching the actual data sources. Each server responds to the caller after receiving a response from the upstream server or database. With 5 requests followed by 5 responses, any latencies present in the system compound quite significantly. For example, if all of them had a constant latency of 100ms, the total latency would already be a staggering 1000ms.

Starfish SystemsUserAPI GatewayArm One ServiceArm Two ServiceArm Three ServiceArm Four ServiceArm Five ServiceArm One DatabaseArm Two DatabaseArm Three DatabaseArm Four DatabaseArm Five Databaseslowslowslowslow

In Starfish Systems, however, the architecture is entirely different - instead of going through layers and layers of servers, the architecture is pretty much flat with domain-specific services. In this scenario, let us imagine these services all need to be called by the API Gateway to respond to the caller, but since these services are queried independently, this can be done in parallel, and the gateway responds when the slowest service sends back a response. Now, let us assume that in this scenario, the latencies are also fixed at 100ms for each message, except for the slow ones - we will raise the bar for Starfish System and introduce a slow service and database, each experiencing latencies of 200ms instead. Even then, with this particular part slowing down the entire system - we can calculate that, again, it would take 1000ms for the initial caller to receive a response to their initial request.

Topology matters

When we compare the systems from the example above, they both attained their total latency of 1000ms, though in vastly different circumstances:

Real-life distributed systems are usually a mix of all three scenarios:

If preserving low latency for end users is crucial regardless of internal complexity of the task, it is worth considering if any parts of the lengthy process can be done in the background - asynchronously, without the need to keep the user waiting seconds to receive any response from the system. It is better to provide with them with an incomplete, but still useful answer early on, and let them await for the process to complete while keeping them updated somehow. One example could be responding to an online shopper that their payment is being processed, and giving them a tracker URL or sending email notification. Even though the payment could still be rejected by the bank or card provider, it is still a better user experience than watching a blank screen or an animated spinner until the payment is accepted.

Key takeaway
Distributed systems are inherently susceptible to the negative effects of high and variable network latency. Even in settings which generally enjoy negligible latencies, abuse of network communication result in dangerous compounding that could affect overall system performance. If there is no way to limit or streamline network communication to reduce compounding, or when the latencies are inherently high and cannot be immediately addressed, introduction of asynchronous processing can be helpful to improve user experience.

Security

As we have already established, networks are primarily shared resources. As such, we cannot always control who is using the network with us, and even when we do - we do not always know their intentions. Control over a network is not given forever, too - with new threats emerging every other day, employees falling victim to malware and being phished or even blackmailed to take actions they would not take under normal circumstances. What is more, your organization’s network is likely connected to the external world in one way or another, and with multiple offices it is likely that some traffic needs to cross the public networks in order to reach one destination from another.

Most organizations these days already have at least basic security practices in place, with their premises networks and remote employees connected via Virtual Private Network to ensure secure and encrypted traffic as it is being transmitted over untrusted media. The networks are protected by firewalls and SIEM solutions looking out for suspicious activity. Nevertheless, I am not qualified enough to discuss these topics in-depth - let us explore other, everyday aspects of network (in)security in the context of distributed systems.

Encryption frenzy

Let us face it, HTTP is not a secure protocol, as virtually everything sent over HTTP is sent in plaintext - including sensitive information such as credentials and authentication tokens. This is why a browser will warn you if you attempt to visit an HTTP rather than HTTPS website - if you happen to send your login and password there, and someone happens to be sniffing the network traffic, they could capture these credentials without anyone noticing.

Therefore, it is virtually mandatory to use encrypted HTTPS traffic as an alternative, except for situations where all communication is contained within a hermetic, trusted environment - say, a Virtual Private Cloud, inaccessible from the outside world except for a limited number of ingresses where SSL/TLS termination occurs. Otherwise, using secure protocols is a must even within a somewhat trusted company network, providing another layer of defense in case the network itself is compromised.

In case of distributed systems consisting of multiple components, handling HTTPS termination at each individual application could be problematic and would require far more maintenance. Additionally, it could become problematic with legacy applications which may simply not support HTTPS at all. Because of that, SSL/TLS termination before the traffic actually enters the application is often - but not always - an acceptable compromise.

Impostors

Another category of threats are all sorts of malicious actors, who can attempt to cause harm in various ways:

The first two groups are typically addressed by means of authentication - such as client and service tokens - supporting scopes or permissions for authorization purposes, and verifying if the caller authenticating themselves is in fact permitted to perform an action - before executing it. OAuth 2.0 is an example of industry-standard protocols built with exactly this purpose in mind.

MITM is more difficult to address, as the attacker can assume the identity of a genuine client communicating with a genuine server, while staying concealed. The feasibility from the attack stems from a simple fact - that we focus on establishing a secure connection, to the point we forget to double-check who are we connecting to. Simply switching from HTTP to HTTPS does not solve this problem - or, at least, no by itself. Since anyone can issue a certificate for themselves, or even sign it with a moderately trustworthy Root CA, it becomes crucial for both client and server to verify each other’s identity. A good example of this is mTLS, which involves both parties presenting each other with a certificate they already know and trust. Since both already hold a list of trusted certificates, they can verify if the other party’s identity can be trusted, and if yes they would encrypt traffic with the trusted public certificate. The recipient of such encrypted traffic needs the original private key in order to decrypt it, leaving little room for tampering.

While it would be difficult to require individual users to issue certificates for themselves, they can still install trusted public certificates, so that the identity of the websites they visit can be verified. In some scenarios, mutual authentication based on certificates or public/private key pairs can still be attained with individual users - an example of this is using SSH keys as an authentication method for GitHub. For service-to-service communication, mTLS is a viable option worth considering - especially when communication needs to leave trusted environments such as a single Kubernetes cluster. However, in my experience mTLS may prove labour-intensive and error prone if configured in the applications directly.

Blind spots

As a result of their sheer complexity, distributed systems are often ridden with security blind spots, which can later be exploited by the attackers:

The end result is that we leave attack surfaces quite needlessly, and often do not even realize that. This is especially dangerous when a system is highly complex, with a large number of legacy components which barely receive any maintenance. The situation is only exacerbated if the system has a large number of components that require security patching - those who were involved in patching Log4Shell in their systems certainly remember how frantic it was to patch and deploy a multitude of services in the most busy time of the year for multiple companies.

Key takeaway
Distributed systems are susceptible to a multitude of security threats, especially if they cannot be fully disconnected from external networks - not necessarily public. The fact they must communicate over often insecure networks, and usually have multiple components requiring attention and maintenance, it is no wonder they may fall victim to a plethora of attacks.

Summary

In this post, we have explored some of the most significant challenges for distributed systems, stemming from the fact they are bound to use network communication. One needs to keep in mind that this list is not even complete - in fact, due to the usage of networks and typical complexity of distributed systems, the list of things that can possibly go wrong is virtually endless.