Infrastructure resources created on premises or in the cloud can either be modified in place or be completely replaced with a new instance. In the early days, modifying servers in place, whether that means upgrading libraries, rolling out new application code, or changing configuration settings, is usually not a problem. However, as your workloads scale, these in-place upgrades put the reliability of your applications at risk and complicate testing and operational procedures. Before diving into the benefits of immutable infrastructure, it’s important to introduce mutable infrastructure and explain its trade-offs. Let’s discuss this in the light of a real-life example, where I managed a Java-based web application on a Tomcat web server.

Mutable infrastructure

This is a traditional Java web stack, with web application archive (WAR) files deployed on Tomcat web servers. For high availability, four similarly configured machines were wired together to form a cluster and exposed to end users through HAProxy, a high-performance TCP/HTTP load balancer.

Any system update on the Linux servers, be it a new software rollout, a security patch, or an operating system configuration change, was carried out manually by initiating a remote SSH session and executing the required commands. As the application load increased, new servers were added to the pet fleet, increasing the count to 10. Manually executing commands was no longer practical, so I started leveraging tools such as Ansible. Ansible works by connecting to the managed nodes and injecting small programs known as Ansible modules. These modules represent the desired state of the system, and they are executed on the host, over SSH by default. This allowed me to automate the manual tasks with a configuration file, known as a playbook, describing the sequence of tasks to be executed. This file was then passed to Ansible along with the server inventory it had to be executed against, as shown in the sketch below. From there on, Ansible took over the responsibility of ensuring that all modifications were deployed across every single server in the fleet with a single command. The most important thing to highlight here is that these servers were still mutable. Whenever a change had to be rolled out, it was done in place, on top of previous modifications. Over time, you keep building on that state, in other words, moving further away from the state the server was in when it was freshly provisioned. This is completely OK, so long as the change process is reliable and you don’t lose the server itself.
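To make this concrete, here is a minimal sketch of what such a playbook might have looked like, together with the single command used to run it against the inventory. The host group, package names, and file paths are illustrative assumptions, not the exact setup described above.

```yaml
# deploy.yml – illustrative playbook; host group, package, and paths are assumptions
- hosts: tomcat_servers        # group defined in the server inventory file
  become: true
  tasks:
    - name: Ensure Tomcat is installed
      ansible.builtin.yum:
        name: tomcat
        state: present

    - name: Deploy the new WAR on top of the existing installation (in-place change)
      ansible.builtin.copy:
        src: files/shop.war
        dest: /var/lib/tomcat/webapps/shop.war

    - name: Restart Tomcat so the new code is picked up
      ansible.builtin.systemd:
        name: tomcat
        state: restarted

# Run against every server listed in the inventory with a single command:
#   ansible-playbook -i inventory.ini deploy.yml
```

Each task is applied over SSH to whatever state the server happens to be in at that moment, which is exactly what makes this approach mutable.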

As mentioned in Werner Vogels’ quote earlier: “… failures are a given and everything will fail over time.” The same happened with this software application stack. Ansible simplified the management of the server fleet by taking over the heavy lifting around SSH and manual command execution. However, it could not circumvent other problems around networking and security. Let’s discuss them:

  • Random unresponsiveness of Linux package repositories resulted in installation failures.
  • Since Ansible was essentially running over SSH, network packet drops resulted in incomplete command executions.
  • The non-deterministic state of servers further resulted in unique combinations of code and dependencies that hadn’t been tested before.
  • To fix the intermediate states of servers, developers were required to log into the servers and execute commands. Over time, this introduced additional configuration drift among the servers.
  • Having a dependency on developers and operations to SSH into live systems was a security risk.

As you can see, most of these errors would leave a subset of servers in a unique, untested state, as they would all fail in slightly different ways. This resulted in a lack of reliability and confidence around how those servers would behave until this interim state was fixed. Debugging 10 servers was still somewhat manageable, but imagine doing this on hundreds or thousands of machines.

Such half-failed scenarios contribute to major problems around the following:

  • Lack of testability of all possible permutations and combinations
  • Complex debugging procedures
  • Unreliable software applications

The life cycle of cloud resources is easy to manage and automate through the APIs and SDKs provided by cloud providers such as AWS, as the brief sketch below illustrates. This has paved the way for the adoption of immutable infrastructure practices – let’s see what it is and the benefits it provides.
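As an illustration of that life cycle automation, the same Ansible tooling can drive the AWS APIs: the sketch below launches a fresh EC2 instance from a pre-baked machine image and disposes of an old one, rather than modifying a live server. The AMI ID, region, and variable names are hypothetical, and the amazon.aws collection is assumed to be installed.

```yaml
# replace-server.yml – illustrative sketch of driving the AWS API through the
# amazon.aws collection; AMI ID, region, and variables are hypothetical
- hosts: localhost
  connection: local
  tasks:
    - name: Launch a fresh instance from a pre-baked image
      amazon.aws.ec2_instance:
        name: web-v2
        image_id: ami-0123456789abcdef0   # hypothetical AMI containing the new release
        instance_type: t3.medium
        region: eu-west-1
        state: running

    - name: Terminate the instance being replaced
      amazon.aws.ec2_instance:
        instance_ids: "{{ old_instance_ids }}"  # hypothetical list of old instance IDs
        region: eu-west-1
        state: terminated
```

Instead of layering change upon change on a long-lived server, the whole server is replaced, which is the essence of the approach discussed next.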
