On distributed systems

Cover photo: server racks full of colorful Ethernet cables and green diodes (photo by Taylor Vick on Unsplash)

What are these distributed systems, after all?

Briefly speaking, a software or computer system can be considered a distributed system as long as parts of it run on separate devices, and at least some of those parts need to communicate over a network rather than via local loopback, shared memory, and / or shared data storage.

Distributed systems distinguish themselves from regular ones through the characteristics below:

The processes involved may in fact be running the same program to share the workload, or different programs to complete the various tasks of the system - and frequently both approaches are applied in conjunction. The two extreme ends of the distributed systems spectrum could be a simple mobile application with a web backend, and a complex enterprise system with hundreds, if not thousands, of microservices. Likewise, a monolithic server application with multiple instances running to ensure resilience, backed by a replicated database, is quite a common, yet sometimes overlooked, example of a system that is already distributed.

Mobile application example

In the case of a mobile application, the system is distributed because the app communicates with its backend over a network. The backend server continues to serve traffic to other users even when one of them experiences an application crash, and it is not uncommon for mobile applications to offer limited functionality in offline mode or when the server is down.

Diagram: Users A and B each use the mobile application; both app instances send gRPC requests to the backend server - one receives a response, while the other's request fails.
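As a minimal illustration of that offline fallback, here is a hedged Go sketch of a client call that falls back to a locally cached value when the backend cannot be reached; the endpoint URL and the cached value are hypothetical:

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

// fetchProfile asks the backend for the user's profile, but falls back to a
// locally cached copy when the network call fails - the "limited offline
// functionality" mentioned above. The endpoint and cache are illustrative.
func fetchProfile(ctx context.Context, client *http.Client, cached string) string {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet,
		"https://api.example.com/v1/profile", nil)
	if err != nil {
		return cached
	}
	resp, err := client.Do(req)
	if err != nil {
		// Network partition, server down, timeout - serve stale data instead.
		return cached
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return cached
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return cached
	}
	return string(body)
}

func main() {
	profile := fetchProfile(context.Background(), http.DefaultClient, "cached profile")
	fmt.Println(profile)
}
```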

Microservices example

A system composed of hundreds of microservices is rather self-explanatory. In such systems, each microservice is expected to have its own distinct responsibility (though duplication happens) and typically has more than one instance running at a time to increase its availability. To further increase resilience and to ensure sufficient computing resources, such systems usually run on multiple servers and operating systems - either directly, on virtual machines, or in an orchestrated environment such as Kubernetes.

Diagram: a Kubernetes cluster with a control plane and several worker nodes (including a GPU-accelerated one), hosting namespaces A and Z; each namespace contains a microservice with its service mesh daemon and database, plus a queue in namespace Z.
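To give a flavour of what lets an orchestrator keep several replicas of a microservice in rotation, below is a minimal Go sketch of the liveness / readiness endpoints a Kubernetes probe would typically hit; the paths and port are illustrative, not prescriptive:

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"
)

// Each instance exposes liveness/readiness endpoints; unhealthy replicas are
// taken out of rotation while the remaining ones keep serving traffic.
func main() {
	var ready atomic.Bool
	ready.Store(true) // flipped to false during shutdown or when a dependency is lost

	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK) // process is alive
	})
	http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if ready.Load() {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```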

Replicated / sharded monolith example

When a monolithic application has multiple instances, its individual processes can only exchange information via shared state in a database (or a distributed cache, for that matter) or by exchanging messages (such as events via a message broker, HTTP requests, or RPC calls), which are almost invariably accessed over a network. Likewise, if a database is replicated, any changes made to one of its instances (the leader in single- or multi-leader databases, or any node in leaderless ones) need to be propagated to the remaining instances before they become aware of the change.

Diagram: external applications reach the monolithic application through a reverse proxy; its instances and databases are sharded by country ranges A-K, L-R, and S-Z.
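When several instances of the same monolith coordinate only through a shared database, concurrent writes need a guard. Below is a small Go sketch of one common approach, optimistic locking on a version column; the table and column names are hypothetical:

```go
package shared

import (
	"context"
	"database/sql"
	"errors"
)

// ErrConcurrentUpdate signals that another instance modified the row first,
// so the caller should re-read the current state and retry.
var ErrConcurrentUpdate = errors.New("row was modified by another instance")

// UpdateBalance only succeeds if the row still carries the version the caller
// saw when it read the data - a cheap way to detect conflicting writes from
// other instances of the monolith.
func UpdateBalance(ctx context.Context, db *sql.DB, accountID, newBalance, seenVersion int64) error {
	res, err := db.ExecContext(ctx,
		`UPDATE accounts SET balance = ?, version = version + 1
		 WHERE id = ? AND version = ?`,
		newBalance, accountID, seenVersion)
	if err != nil {
		return err
	}
	rows, err := res.RowsAffected()
	if err != nil {
		return err
	}
	if rows == 0 {
		return ErrConcurrentUpdate
	}
	return nil
}
```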

Why would anyone need them?

Distributed systems can be immensely useful in multiple scenarios, not necessarily limited to Big Tech companies serving millions of users at any time of the year:

Reliability improvements

One of the core strengths of a distributed system is that it can be made more tolerant to hard / catastrophic faults than a monolithic, single-process system, for a number of reasons:

Serving your customers better

There are a number of customer experience improvements which, while not impossible to introduce without distributed systems, become arguably easier if one can run an arbitrary number of software components and processes in a system:

Workload distribution

In some cases, your systems handle significant loads - perhaps serving thousands or millions of customers at any given time, or processing plenty of batch jobs for your back-office reporting and the like - to the point that they might grind to a halt, or the batch jobs would need to be scheduled weeks ahead, becoming a bottleneck for your organization. The workload may be fairly constant, or subject to substantial changes over time. The nature of distributed systems allows you to address these situations:
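As a toy illustration of workload distribution, the Go sketch below spreads a batch of jobs across several workers; in a real distributed system the workers would be separate instances pulling from a shared queue rather than goroutines in one process:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	jobs := make(chan int)
	results := make(chan int)

	var wg sync.WaitGroup
	for w := 0; w < 4; w++ { // four "instances" sharing the workload
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs {
				results <- j * j // stand-in for a real batch job
			}
		}()
	}

	// Producer: enqueue the batch, then signal there is no more work.
	go func() {
		for i := 1; i <= 10; i++ {
			jobs <- i
		}
		close(jobs)
	}()

	// Close results once every worker has drained the queue.
	go func() {
		wg.Wait()
		close(results)
	}()

	for r := range results {
		fmt.Println(r)
	}
}
```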

Loose coupling

If the dependencies between your teams boil down to binaries or lines of source code, it becomes painful to roll out changes, to synchronize them, and to prevent these changes from turning into one merge conflict after another. Big, complex, monolithic systems are notorious for having outdated dependencies and language versions, and applying even the simplest changes could easily take months of back-and-forth between dozens, if not hundreds, of engineers.

While distributed systems do not solve the problem of dependencies between teams as such, the fact that software components communicate over the network rather than through direct function / method invocations or shared memory makes it arguably easier for teams to operate on higher-level contracts, such as API definitions or event schemas. This way, teams are less likely to get in each other's way, and do so by less destructive means. In addition, solutions such as contract testing and schema registries can be leveraged as a failsafe in case human error breaks those contracts.
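To make the idea of higher-level contracts more concrete, here is a small Go sketch of a consumer-side contract check for a hypothetical OrderPlaced event; real setups would lean on tools such as Pact or a schema registry rather than hand-rolled checks:

```go
package contract

import (
	"encoding/json"
	"fmt"
)

// OrderPlaced is the consumer's view of the event contract. As long as the
// producing team keeps these fields stable, both teams can evolve their code
// independently. Field names are hypothetical.
type OrderPlaced struct {
	OrderID    string `json:"orderId"`
	CustomerID string `json:"customerId"`
	TotalCents int64  `json:"totalCents"`
}

// CheckContract is a miniature consumer-driven contract check: given a sample
// payload published by the producer (e.g. pulled from a registry or a pact
// file), verify it still carries everything the consumer relies on.
func CheckContract(sample []byte) error {
	var e OrderPlaced
	if err := json.Unmarshal(sample, &e); err != nil {
		return fmt.Errorf("payload no longer parses: %w", err)
	}
	if e.OrderID == "" || e.CustomerID == "" {
		return fmt.Errorf("required fields missing from producer payload")
	}
	return nil
}
```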

Should you always go for a distributed system?

The same question could be asked about vehicles:

Should you always go everywhere by train?

At this point, you might be thinking it’s a ridiculous question. After all, it depends on multiple factors:

Depending on these (and other) factors, you may figure the train is in fact the optimal option. On the other hand, if your destination is thousands of kilometers away, it may be more beneficial to book a flight, and in some remote areas your only choices could be, say, a boat or an off-road vehicle.

This rule applies to distributed systems as well - despite their unique strengths, they are only practical as long as the benefits outweigh the drawbacks. To make the decision even more challenging, there is no single flavour or architecture for such systems - even in these few examples, we have seen quite diverse approaches to building them, sometimes without even realizing the system is distributed. Making the right architecture choices when designing one can bring your software systems to a best-in-class level - while questionable, if not outright uninformed, decisions may have devastating effects on your system.

Why would anyone avoid them?

I am far from advertising this particular kind of software system as a silver bullet or a cure for all diseases. While distributed systems may be immensely powerful in certain scenarios, there are countless ways things can go wrong when you build one - and quite literally:

Ease of making mistakes

As an Engineering Director, Architect, or Software Engineer, you might not necessarily know what you are doing when first adopting distributed systems, such as when your organization decides to pivot towards the all-famous Cloud Transformation. Conversely, you may have solid prior experience and still make wrong decisions due to outdated or inaccurate information - or even make decisions before gathering critical inputs which, as unthinkable as it seems, does happen. The consulting company you hired to walk you through your transformation may provide recommendations optimized for their revenue rather than your needs. The infrastructure team may be too comfortable with their labour-intensive ways of working - this could have worked fine while they still maintained a handful of servers in that infamous Room 503, Floor B, but it results in increasingly sluggish and error-prone processes as the infrastructure for your system grows. Your Software Engineers or Architects might insist on building an in-house solution for everything, rather than following industry standards and using proven components.

The bitter truth is - any software architecture, distributed pattern, or toolkit might work great for one class of problems while proving detrimental for others. Every orchestration solution - or lack thereof - has its evangelists and detractors, as do public cloud, private cloud, and running your systems on VMs or bare metal. The key is to make well-informed decisions and evaluate the risks before it's too late.

Difficulty of designing the system

Consider a classic, sequential algorithm or system. Unless it is based on (pseudo-) randomness, most of the time it is going to have plenty of useful characteristics:

Unfortunately, the above does not necessarily hold true for distributed systems:

Costs

As you can imagine, if running an application once has a certain cost - monetary, operational, or in terms of electricity usage - running it as multiple processes on multiple servers is likely going to cost more. This is particularly true when dealing with public cloud vendors - on the one hand, you do not need to provision your own infrastructure and manage data center(s), though on the other, the vendor is invariably going to add their margin to the services and resources they provide to you.

This can become particularly harmful when combined with unlimited scalability and / or the lack of a failsafe such as rate limiting. The vendor does not really know whether your resource demands have exploded because your business is booming, or because you introduced a terrible software bug or misconfigured your cloud resources - so if your resource usage grows beyond control, so do your bills.
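A rate limiter is one such failsafe. The sketch below is a hand-rolled token bucket wrapped around an HTTP handler in Go, with illustrative numbers; in practice you would more likely reach for an API gateway or a library such as golang.org/x/time/rate:

```go
package main

import (
	"log"
	"net/http"
	"time"
)

// rateLimited wraps a handler with a simple token bucket: requests are served
// only while tokens are available, and the bucket refills at a fixed rate.
func rateLimited(next http.Handler, perSecond, burst int) http.Handler {
	tokens := make(chan struct{}, burst)
	for i := 0; i < burst; i++ {
		tokens <- struct{}{}
	}
	// Refill the bucket, dropping tokens silently when it is already full.
	go func() {
		ticker := time.NewTicker(time.Second / time.Duration(perSecond))
		defer ticker.Stop()
		for range ticker.C {
			select {
			case tokens <- struct{}{}:
			default:
			}
		}
	}()

	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-tokens:
			next.ServeHTTP(w, r)
		default:
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
		}
	})
}

func main() {
	hello := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	log.Fatal(http.ListenAndServe(":8080", rateLimited(hello, 10, 20)))
}
```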

New classes of problems

As pointed out earlier, in classic programs - unless dealing with truly mission-critical software, like ICBM, self-driving car, or spacecraft firmware - you can reasonably assume the hardware they run on is reliable given appropriate investment (e.g. in RAID disk arrays or ECC memory modules), so you rarely need to address inconsistency. And even if you do, most of the time the program is sequential and linear, and there is a single source of truth - hence, inconsistencies tend to be… rather consistent.

The distributed systems world is entirely different, though. Regardless of how reliable individual software and hardware components are, the network is inherently unreliable, and guarantees are given by network protocols on a best-effort basis. There will be partial failures and intermittent errors manifesting themselves as odd glitches and bugs rather than full-fledged crashes and hard failures, temporarily or permanently. Your data may be kept in multiple copies on various devices, so you can no longer trust a single source of truth, and keeping all of the data in sync invariably gives a headache.

In simpler systems, where the distributed adjective applies simply because the server and the database run on separate hardware, this problem can be (mostly) addressed by transactions - however, this comes at a cost: stronger consistency guarantees of a database tend to come with a throughput penalty, while opting for weaker guarantees might ensure data consistency within a transaction, but not necessarily between two independent transactions. If you are dealing with multiple sources of truth, you can no longer rely on enforcing consistency with database transactions - every individual database is going to run its own transaction, committed independently in some particular order - and that is assuming you only use databases, and only ones supporting ACID transactions. With distributed systems, it is still possible to ensure at least the infamous eventual consistency or, better, read-your-writes consistency, which are significantly weaker guarantees and allow at least some intermittent inconsistencies in the system. Rather than transactions, however, you might expect to see terms such as event sourcing, sagas, transactional outboxes, or perhaps 2-phase commit - a great explanation can be found in Thorben Janssen's article on the dual write problem.
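To make the transactional outbox concrete, here is a hedged Go sketch of its write path: the business row and the outbox row are inserted in a single local transaction, and a separate relay process (not shown) later publishes the outbox entries to the broker. Table and column names are hypothetical:

```go
package outbox

import (
	"context"
	"database/sql"
)

// PlaceOrder avoids the dual write problem: instead of writing to the
// database and publishing an event as two separate operations, both rows are
// written atomically in one local transaction.
func PlaceOrder(ctx context.Context, db *sql.DB, orderID string, payload []byte) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op if the transaction was committed

	if _, err := tx.ExecContext(ctx,
		`INSERT INTO orders (id, payload) VALUES (?, ?)`, orderID, payload); err != nil {
		return err
	}
	if _, err := tx.ExecContext(ctx,
		`INSERT INTO outbox (aggregate_id, event_type, payload) VALUES (?, 'OrderPlaced', ?)`,
		orderID, payload); err != nil {
		return err
	}
	return tx.Commit()
}
```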

Using a sledgehammer to crack a nut

A special case of applying distributed systems in a detrimental manner, and a great example of over-engineering, is building an excessively complex solution to a rather simple problem - deciding to build microservices before even defining the non-functional requirements for a system is a hallmark sign, to the point that it became a meme in my environment and has created quite a few opponents of microservices.

This is not to say microservices, or complex systems in general, are a bad thing - the point is rather that in the world of distributed systems, over- and under-engineering can be equally destructive for the system and the organization behind it. Remember that system architecture resembles a living being more than a set of diagrams carved in stone tablets. What works well today may need to change in two years, and what is likely to be a must-have in two years is probably not suitable just now - thus, it is more beneficial to think of the distributed system architecture as dynamic and evolving over time. Instead of investing in a grand architecture designed to handle it all, consider addressing these basic principles in your architecture:

Can / Should they be avoided at all?

Well, yes and no. Given that these days most web applications, and a good part of mobile applications, could be considered distributed systems, even if basic ones, your best bet for avoiding them entirely would be to withdraw from any online activity.

Rather than taking an all-or-nothing approach or perceiving this question in black-and-white, it is best to apply common sense and do prior research:

Summary

Distributed systems are an immensely diverse and powerful class of software systems, which come in countless flavours and may help you address a variety of business and technical challenges. That being said, they need to be designed and built with caution, and in a way that leverages their strengths while mitigating their drawbacks.