Cloud Transformation misconceptions
- Misconception 1: Cloud as an opposite of on-premise
- Misconception 2: Cloud Transformation can be done without modernizing the processes
- Misconception 3: Well-known problems call for in-house solutions
- Misconception 4: Cloud requires implementing new, dedicated applications
- Summary
Whether your organization is a tech company or not, you have probably heard of Cloud Transformation, if not participated in one. In this article, I will cover several examples of how misunderstandings led companies adopting cloud computing to struggle. Hopefully, awareness of such pitfalls will help you navigate your own efforts.
In a nutshell, Cloud Transformation is an organization’s strategic effort to adopt cloud computing for its external or internal operations, and can be considered another stage of modernization, similar to and often following Digital Transformation. While, in principle, such efforts should be undertaken with concrete practical benefits in mind, it is not unheard of for companies to commit to Cloud Transformation simply to be perceived as modern and cutting-edge by their customers and/or shareholders.
Misconception 1: Cloud as an opposite of on-premise
The misconception here is that in order to adopt cloud computing, you need to give up physical control of the underlying computing resources and migrate your - possibly sensitive - business operations or product to infrastructure controlled by a third party. While that is certainly a valid option in some scenarios, it is crucial to acknowledge that on-premise infrastructure can be, and has been, successfully leveraged by organizations to build their own private cloud infrastructure.
Consider a simplified example of private cloud infrastructure. Even though it operates in an on-premise data center, we can say with reasonable confidence that it is a private cloud, since the actual computing resources are abstracted away from the actors accessing them.
What does cloud mean, then?
The essence of cloud computing is that the computing resources - for example, servers - are abstracted away from the software running on them. Rather than having to track which application runs on which servers (… in which server room, on which floor, in which office), cloud computing allows allocating computing resources to a specific application from a pool that includes those servers. Rather than micro-managing the infrastructure to ensure that FizzBuzz Service runs on 192.168.244.112 because, for some arcane reason, it is one of the few suitable machines, the pooled computing resources are orchestrated by a solution such as Kubernetes or HashiCorp Nomad, and server nodes are labeled so that only resources meeting certain criteria are used for a given purpose (see the sketch after this list):
- You might want to run a daemon process on every server you have,
- You may need to run 3 instances of your application for redundancy, and for extra confidence have each of them run on a separate device,
- You might have multiple data centers and want to manage them so that applications run independently in each, while each can also serve as a standby in case another experiences an outage.
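To make label-based scheduling more tangible, here is a deliberately minimal Python sketch of the selection step an orchestrator performs behind the scenes. Real schedulers such as Kubernetes are far more sophisticated, and the node names and labels below are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A computing resource described by labels, not by location."""
    name: str
    labels: dict = field(default_factory=dict)

# The pool of abstracted resources - nobody tracks rooms or racks here.
pool = [
    Node("node-a", {"datacenter": "dc1", "disk": "ssd"}),
    Node("node-b", {"datacenter": "dc1", "disk": "hdd"}),
    Node("node-c", {"datacenter": "dc2", "disk": "ssd"}),
]

def eligible(nodes, constraints):
    """Return the nodes whose labels satisfy every constraint."""
    return [n for n in nodes
            if all(n.labels.get(k) == v for k, v in constraints.items())]

# "Run FizzBuzz Service on any SSD-backed node in dc1" - the scheduler
# picks a suitable node instead of a human remembering 192.168.244.112.
print([n.name for n in eligible(pool, {"datacenter": "dc1", "disk": "ssd"})])
```

In a real cluster, the same intent would be expressed declaratively, through node labels combined with selectors or affinity rules.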
How can cloud computing complement on-premise?
As mentioned, cloud computing is an abstract, high-level approach to how computing resources are managed and how applications are run - or scheduled - on them, and not at all about where those resources are located or who manages them.
Conversely, on-premise is all about who runs and owns the infrastructure, and where. In other words, on-premise implies that an organization has strong ownership of its computing resources, regardless of how exactly those resources are allocated for various purposes.
As you can see, the two concepts are practically independent of each other, and there is no reason why cloud computing could not be leveraged to manage on-premise computing resources. In fact, as on-premise infrastructure grows, it becomes increasingly difficult and labor-intensive to manage all of the resources efficiently without introducing higher levels of abstraction, separation of concerns between applications and infrastructure - and, ideally, automation of infrastructure management.
Is on-premise or third-party cloud objectively better?
There is no straightforward answer to this, as both have their strengths and limitations. The choice should also depend on whether the organization already runs its own data centers.
On-premise shines when confidence in controlling the infrastructure, and the data processed or stored on it, is critical. While cloud vendors implement measures to ensure their solutions are secure and their customers’ infrastructure remains isolated and protected from unauthorized access, there are scenarios where contractual guarantees are not enough, and an organization will choose to run its own infrastructure. Additionally, on-premise may actually be cheaper to run in the long term since, apart from maintenance costs, your organization does not contribute to a vendor’s margin, which can be considerable. Especially if you already have your own data centers, building a private cloud on top of them is a great way to benefit both from being cloud-enabled and from owning your infrastructure. On the other hand, building cloud infrastructure from the ground up can be highly challenging, and requires extensive expertise in infrastructure management. It may also require thoughtful up-front decisions and substantial initial investment, taking into account variability in demand for computing resources, the organization’s growth, and many other factors.
Conversely, leveraging third-party cloud infrastructure provided by a vendor can give your organization an edge if you need to become cloud-enabled quickly and without up-front investments in on-premise infrastructure. It also takes the burden of actually running the data centers off your shoulders, and may offer superior on-demand scalability if your demand is highly variable. However, it is a rather costly solution that may not be economical in the long run, and you need to be mindful of the implications of hosting applications and handling data on a third party’s infrastructure. It may also become problematic if resource usage spins out of control - it is not too difficult to quadruple your infrastructure costs through misuse and inefficiency.
Misconception 2: Cloud Transformation can be done without modernizing the processes
The problem in this case is that Vanilla Solutions invested in becoming cloud-enabled, yet its engineering decisions - or lack thereof - prevented the organization from reaping the benefits of cloud computing. It was also a missed opportunity to modernize and streamline existing processes, which would not only have allowed for more efficient cloud operations, but could also have been transferred back to non-cloud infrastructure to some extent.
If you struggle to see any fundamental difference between the non-cloud and cloud infrastructure in the diagram above - worry not. Both infrastructures are essentially the same, and neither is cloud-enabled, regardless of where it is hosted or whether it was set up with Terraform, from the command line, or with shell scripts.
Why is that an issue?
Fine-grained, manual control of cloud computing resources causes a number of problems:
- Unsuitability of cloud / infrastructure management tools,
- Complexity for end users, resulting in labor-intensiveness, error-proneness and excessive cognitive load when controlling extensive infrastructure.
Tools unsuitability
Infrastructure-as-Code tools are in fact optimized to work with high-level abstractions, and to build higher-level components out of more basic ones. An example of this philosophy is Terraform’s ability to define reusable modules and variables, so that the user can compose complex resources out of basic building blocks, or leverage off-the-shelf modules provided by Terraform itself, the open-source community, or other infrastructure engineers within the organization.
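As a loose illustration of that philosophy - Terraform modules are written in HCL, not Python, so treat the following only as an analogy with made-up resource types - composition turns low-level resources into reusable, parameterized building blocks:

```python
# A loose Python analogy for IaC composition: low-level "resources"
# are combined into a reusable, parameterized "module". The resource
# types and names are made up for illustration.
def server(name: str, size: str) -> dict:
    return {"type": "server", "name": name, "size": size}

def load_balancer(name: str, targets: list) -> dict:
    return {"type": "load_balancer", "name": name,
            "targets": [t["name"] for t in targets]}

def web_service(name: str, replicas: int, size: str = "small") -> list:
    """A reusable 'module': N servers behind a load balancer."""
    servers = [server(f"{name}-{i}", size) for i in range(replicas)]
    return servers + [load_balancer(f"{name}-lb", servers)]

# Infrastructure is composed from the module instead of
# hand-crafting each individual resource:
print(web_service("fizzbuzz", replicas=3))
```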
In order to exert such fine-grained control - to the point that I tend to call it infrastructure micro-management - an IaC user not only needs to give up all the benefits these tools and high-level abstractions provide, but also has to go the extra mile to force the end results to be exactly as intended.
High complexity
To give an analogy, consider driving a number of vehicles:
- A custom-built 1930s automobile with dozens of knobs and buttons on its dashboard and a non-synchronized gearbox, so you must match the revs to change gears,
- A simple, old-fashioned family car manufactured in the 1980s or 1990s. The operation is manual but not too complicated: a few dials, a clutch, brake and gas pedals, and a synchromesh gearbox that makes the whole affair far easier,
- A modern hybrid with an automatic, continuously variable gearbox that does not even require the driver to select gears anymore.
From the perspective of technical complexity, the 1930s automobile is likely the simplest, since it does not have any advanced features - the gearbox is basic, and so is the rest of its equipment. That said, it is by far the most complex for a driver to operate - instead of focusing on the surroundings and getting from point A to B safely, the driver must keep an eye on numerous dials, watch the revs closely while switching gears, and operate all the knobs and valves, not to mention stopping every 50 km or so to top up the cooling water.
On the other extreme, we have the hybrid car, which is by far the most complex mechanically - with a hybrid combustion/electric drivetrain and, most likely, an onboard computer controlling everything under the hood to optimize performance and fuel economy as needed. The CVT gearbox may even have inferior longevity compared to a well-maintained synchromesh. On the other hand, it is by far the easiest car to operate - the driver does not get distracted by changing gears at all, and all the fine-grained controls are handled by the computer. Moreover, the car has plenty of ingenious safety systems that either had not been invented yet or were available only in premium cars in the 1980s.
The analogy here is that high-level, modern infrastructure tools are often highly complex by themselves, as they need to be robust enough to reliably handle infrastructure changes. At the same time, they do their best to keep fine-grained knob-turning away from the end user of the infrastructure - in this case, the engineering team building systems to run in the cloud. While doing things the old-fashioned way certainly has its advocates and may not be too overwhelming at a smaller scale, keep in mind that as the infrastructure grows, the difficulty ramps up significantly.
The implications of resorting to a solution that is highly complex from the end user’s perspective are manifold:
- Labor-intensiveness, stemming from the fact that in order to achieve a high-level result, the user needs to complete multiple tasks, or apply many fine-grained changes,
- Error-proneness, caused by the lack of proper abstractions, as well as the natural human tendency to make mistakes from time to time - the more work there is to do and the less comprehensible the abstractions, the higher the risk that a routine task will result in an incident,
- Cognitive load, resulting from the sheer amount of information to process at a time - be it configurations, source code, or the mental model of how human-unfriendly values map to the desired properties of various infrastructure components. Higher cognitive load leads to outcomes of both inferior quality and quantity, contributing to both the labor-intensiveness and the error-proneness of the solution.
Misconception 3: Well-known problems call for in-house solutions
It is unclear why Vanilla Solutions was inclined to essentially re-invent the cloud from the ground up, especially given that the cloud platform was to be hosted on a vendor’s infrastructure and the company was hiring engineering talent with prior cloud experience. It might have been an odd habit stemming from how the organization operates its core business; on the other hand, this kind of behavior resembles not-invented-here syndrome (NIHS), and the argumentation provided would align with it.
Why is this an issue?
Even more alarming was the fact that while the in-house solutions closely resembled existing standards - in this particular case, Protocol Buffers - the organization was not willing even to take inspiration from established industry standards, leveraging their strengths while keeping the new protocol’s specific requirements in mind. The radical commitment to in-house solutions, combined with a refusal to follow industry best practices, led to a number of issues:
- The development of the new protocol was slowed down by the fact that new edge cases were discovered on a regular basis,
- The delivery of the cloud platform was essentially blocked until the protocol design was finalized and libraries, SDKs, or at least reference implementations were provided,
- The new protocol would naturally be unsupported by third-party components such as existing cloud libraries, SDKs, network components and so forth, leading to a chain reaction of one in-house solution necessitating another.
Can in-house solutions and industry standards co-exist?
In principle, they can and do on a regular basis. However, as a rule of thumb it is far easier and more practical to take off-the-shelf products - such as open source network components - and build higher-level in-house solutions on top of them. With this approach, organizations can leverage the best of both worlds:
- Highly customized components and deliverables can be created according to organization-specific needs,
- Following industry standards ensures that the solution will be easier for new joiners to the organization to onboard to,
- Well-maintained, industry-standard solutions are usually (though not always) well optimized for the task, in terms of performance and developer experience alike,
- The bulk of edge cases is addressed by the creators of third-party solutions, and for the remainder there is likely community support - oftentimes, someone has already run into a similar problem,
- Without the need to invent low-level components, solutions can be built more rapidly and with more room for experimentation and prototyping,
- The organization can focus on building the intended systems and delivering value, rather than creating enablers first.
Isn’t third-party software risky?
Naturally, any third-party software to be adopted by an organization should be thoroughly reviewed, especially for security and licensing issues.
Luckily, there are public databases of open source vulnerabilities maintained and kept up to date by well-recognized entities, such as CVE and NVD, making it possible not only to audit a library or component before adopting it, but also to monitor it and respond to vulnerabilities discovered after adoption. It is a best practice to incorporate at least an OSA/SCA scan into the software’s CI pipelines to ensure regular vulnerability scanning.
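As a rough sketch of what such a gate could look like in a Python project - assuming the open-source pip-audit tool, though any OSA/SCA scanner that exits non-zero on findings would do:

```python
# A minimal CI quality-gate sketch for a Python project: fail the
# build when the scanner reports known vulnerabilities. Assumes the
# open-source pip-audit tool is installed; any OSA/SCA scanner that
# exits non-zero on findings could be substituted.
import subprocess
import sys

result = subprocess.run(["pip-audit"], capture_output=True, text=True)
print(result.stdout)
if result.returncode != 0:
    print("Known vulnerabilities found - failing the build.", file=sys.stderr)
    sys.exit(1)
```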
That being said, open source software is not the only software susceptible to vulnerabilities. In-house solutions can also suffer from security issues, especially if created without sufficient expertise, in a high-pressure environment, and without relevant quality gates - the CVE and NVD databases are full of open- and closed-source software vulnerabilities alike.
As for licensing, most digital companies already have policies regarding allowed and prohibited licenses in third-party dependencies. Examples of commonly used licenses typically considered safe include the Apache License, Version 2.0 and the MIT License, while copyleft licenses such as the GPL are typically avoided due to their legal implications.
Misconception 4: Cloud requires implementing new, dedicated applications
The misunderstanding here was the assumption that the most difficult and labor-intensive part of building cloud applications is implementing the applications themselves, and that cloud computing resources are special in some vague, undefined way - to the point that regular software would not even run on them.
Are “cloud” applications that special?
The short answer is no. Cloud computing relies on countless programs, applications and operating systems that were created long before the term “cloud” was coined. The most notable examples include:
- Linux and Windows OS families,
- Common Linux/Unix utilities such as cURL, grep and bash,
- Most mainstream programming languages, including Python, C, JavaScript, C# and Java.
To be more specific, just about any program can be executed “in the cloud” as long as a number of requirements are met, most notably:
- The cloud infrastructure is capable of provisioning computing resources with the required hardware and software,
- The program can be configured to use resources provisioned through the cloud infrastructure,
- Ideally, the program in question should be runnable from the command line in non-interactive mode,
- Optionally, the program can be containerized to improve portability.
In fact, running a non-cloud application in the cloud is usually far easier than vice versa. The reason is that some cloud applications follow cloud-native practices, which may lead to tight coupling to cloud-specific abstractions, or even to services offered by a particular cloud vendor, as a result of using dedicated cloud libraries, integrations and SDKs. Such applications may be inherently difficult to run reliably outside of cloud environments. The opposite approach is called cloud-agnostic: coupling is kept as low as practical, and cloud resources are utilized with interoperability in mind. This kind of software is typically easier to run in another vendor’s cloud, and it is not uncommon for the applications themselves to be literally cloud-agnostic - the dependencies on specific cloud resources are confined to infrastructure-as-code definitions. A good example of literature covering distributed design patterns that help adapt software to cloud computing is Designing Distributed Systems by Brendan Burns.
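To illustrate the cloud-agnostic approach, here is a minimal Python sketch in which the application depends only on a narrow storage interface, while the vendor-specific adapter (here based on the AWS boto3 SDK) is kept at the edge and is trivially swappable. The interface and class names are illustrative:

```python
from typing import Protocol

class BlobStore(Protocol):
    """The narrow interface the application depends on."""
    def put(self, key: str, data: bytes) -> None: ...

class LocalBlobStore:
    """Filesystem-backed implementation - runs anywhere, cloud or not."""
    def __init__(self, root: str) -> None:
        self.root = root

    def put(self, key: str, data: bytes) -> None:
        with open(f"{self.root}/{key}", "wb") as f:
            f.write(data)

class S3BlobStore:
    """Vendor-specific adapter, kept at the edge of the system."""
    def __init__(self, bucket: str) -> None:
        import boto3  # AWS SDK; only this adapter knows about it
        self.bucket = bucket
        self.client = boto3.client("s3")

    def put(self, key: str, data: bytes) -> None:
        self.client.put_object(Bucket=self.bucket, Key=key, Body=data)

def save_report(store: BlobStore) -> None:
    # Application code neither knows nor cares which backend it uses.
    store.put("report.txt", b"quarterly numbers")
```

Swapping vendors, or moving back on-premise, then means providing another adapter rather than rewriting the application code.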
Hardware and software requirements
In a nutshell, not all software is fully multi-platform, and even multi-platform programs typically need a runtime environment or must be compiled for the target environment.
As a rule of thumb, third-party cloud vendors offer extensive support for cloud computing resources based on Linux distributions, and Windows or even macOS virtual machines are also available. Software that is compiled to native binaries, rather than to bytecode or run by an interpreter, may require a specific CPU architecture - and until recently, the most readily available cloud resources were based on x86 (more specifically, x86-64). This has changed in recent years, however, and ARM-based cloud resources have become available, making the offering more diverse.
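A trivial Python check of the kind that deployment tooling might perform, purely for illustration:

```python
# A quick sanity check of the runtime platform - native binaries must
# match the node's CPU architecture (e.g. x86_64 vs aarch64/ARM).
import platform

print(platform.system())   # e.g. "Linux"
print(platform.machine())  # e.g. "x86_64" or "aarch64"
```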
In the case of a private cloud, however, what is on offer in terms of hardware and software depends entirely on the owner. If you already run software tailored to a specific OS or hardware configuration, in principle there is no reason why that specific setup should not be incorporated into your private cloud as well.
Application configurability
Most software does not run in complete isolation, and has at least some dependencies on the so-called external world - upstream APIs, databases, message queues and file storage, to name a few.
In an ideal world, such dependencies are configurable, so that they can be adjusted to specific scenarios without modifying the codebase and re-building or re-compiling. Most notably, this approach allows environment-specific URLs of databases and upstream services to be configured. Similarly, environment-specific certificates or secrets can be bundled with the executable at the deployment stage, or an integration with a secret vault can be set up to provision these sensitive values without putting them into the configuration, where they could lead to leaks and security incidents.
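A minimal sketch of such externalized configuration in Python - the variable names are illustrative, and the pattern is commonly known as twelve-factor-style configuration:

```python
import os

# Environment-specific values are injected from the outside; nothing
# here requires a re-build to point the application at a different
# database or upstream service. The variable names are illustrative.
DATABASE_URL = os.environ.get("DATABASE_URL", "postgresql://localhost:5432/dev")
UPSTREAM_API_URL = os.environ.get("UPSTREAM_API_URL", "http://localhost:8080")

# Secrets should arrive via a vault integration or injected files,
# never hard-coded; here we simply fail fast if one is missing.
DB_PASSWORD = os.environ["DB_PASSWORD"]
```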
If your application is capable of all of this, there are pretty good chances it can already run in the cloud without changes - other than providing relevant configuration.
Non-interactive command-line invocation
Due to the somewhat ephemeral nature of cloud-based computing resources, applications should be able to start without human intervention. As such, all context required to start an application should be provided via (see the sketch after this list):
- command-line arguments,
- environment variables,
- configuration files,
- other non-interactive measures.
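A minimal Python sketch of such a non-interactive startup, with hypothetical flag and variable names:

```python
# All startup context comes from flags, environment variables, or a
# config file path, so an orchestrator or cron job can launch the
# process with no human at the keyboard. Flag names are illustrative.
import argparse
import os

parser = argparse.ArgumentParser(description="example cloud-friendly worker")
parser.add_argument("--port", type=int,
                    default=int(os.environ.get("PORT", "8080")))
parser.add_argument("--config",
                    default=os.environ.get("CONFIG_FILE", "config.yaml"))
args = parser.parse_args()

print(f"starting on port {args.port} with config {args.config}")
```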
If an application invariably requires interactive input to run, workarounds become necessary, and the application becomes a nuisance when configuring a cloud service that should self-recover from crashes or auto-scale, or when you need to schedule a cron job that triggers every so often. If you are familiar with systemd on Linux or with services on Windows, this should sound familiar - in cloud computing, the problem is analogous to scheduling daemons or background services in an OS, only to find out that the program asks the user for input on stdin.
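One small way to guard against this failure mode, sketched in Python with illustrative names:

```python
# A defensive pattern: refuse to prompt when no terminal is attached
# (as in a container, systemd unit, or cron job) instead of hanging
# forever on stdin. The function and variable names are illustrative.
import sys

def read_token() -> str:
    if not sys.stdin.isatty():
        raise RuntimeError("TOKEN must be provided non-interactively "
                           "(flag, environment variable, or config file)")
    return input("Enter token: ")
```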
Containerization
This trait is not strictly mandatory for running software in the cloud in general; however, many computing resources, cloud services and orchestrators require applications to be containerized according to the OCI specification. There are several reasons for this (a brief sketch follows the list):
- Containerization provides a fairly uniform interface for running applications written in any language, with any dependencies - it is easier to support a single OCI specification than dozens of programming languages and hundreds of frameworks,
- Apart from the executable itself, container images come with all dependencies bundled within the image - most notably system libraries, a specific OS/distribution, and build tools,
- Containers are somewhat easier to manage en masse than dedicated virtual machines per application, and typically have lower overhead than a VM.
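As a sketch of that uniformity - assuming the Docker SDK for Python and a locally running Docker daemon - the exact same call can run images built from entirely different stacks:

```python
# One uniform "run" interface for any containerized application,
# regardless of the language or framework inside the image. Assumes
# the Docker SDK for Python (pip install docker) and a running Docker
# daemon; the images below are arbitrary examples.
import docker

client = docker.from_env()
for image in ["python:3.12-slim", "node:22-alpine", "alpine:3"]:
    output = client.containers.run(image,
                                   command="echo hello from a container",
                                   remove=True)
    print(image, "->", output.decode().strip())
```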
Why is this bias damaging?
The belief that cloud computing requires special software leads to an excessive focus on software implementation, and may distract an organization from modernizing its infrastructure and processes. Apart from under-appreciating, if not overlooking, the modernization efforts, this mindset discourages code reuse, leading organizations to commit their limited resources and talent to efforts that might be unnecessary, and to miss opportunities to achieve better results with less effort and at lower risk.
Summary
To conclude, among the consequences of misunderstanding cloud computing, day-to-day communication inefficiencies and misunderstandings are probably the least severe. Far more concerning are the possible long-term consequences for the organization:
- excessive spending on Cloud Transformation,
- missed opportunities that could be leveraged by a successful transformation,
- creating new organizational challenges,
- exacerbating existing inefficiencies,
- disappointment of stakeholders who showed initial interest in the company’s modernization.