Save Your Projects from Dependency Hell with Automation and Infrastructure-as-Code
In 2016, Google released an interesting and much-discussed book written by members of its internal DevOps team: engineers who practice a Google-flavored form of DevOps they call Site Reliability Engineering.
SRE, described in this book and in a series of videos released beginning in early March 2018, shares with DevOps the desire to avoid “toil”: the mass of critical, error-prone, complex, dependency-ridden, but also boring and exhaustingly repetitive labor involved in provisioning, monitoring, and managing platforms for software development and large-scale production application support. Toil takes SRE/DevOps engineers away from the more creative, collaborative, and strategic work of making diverse (and in Google’s case, planet-scale) applications run better and more cost-effectively.
Dependency Hell
Unless you work directly with computers, it may be hard to imagine what toil comprises, or grasp how much toil plagues the lives of IT administrators. Setting up a Linux webserver and applying best-practice rules to harden it for production use (in other words, creating a box able to host a basic, small-scale, templated business website safely) can involve a hundred or more discrete steps performed through multiple interfaces, including:
- Configuring real or virtual hardware
- Installing an operating system
- Creating system users and groups
- Applying updates to bring the OS distribution up to current standards
- Installing dependencies and core applications
- Creating application users
Then comes making everything run correctly: relatively intricate changes to configuration text files, made with crude terminal-based editors (are you an emacs, vi, or nano person?), followed by restarting services so they load the changed configurations, and testing.
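To give a feel for how much detail even a handful of these steps involve, here is a hedged sketch of a few of them written down as a cloud-init user-data file. The package names, user names, and file paths are illustrative assumptions, not a recommended build.

```yaml
#cloud-config
# Illustrative cloud-init user-data covering a handful of the steps above.
# Packages, users, and paths are placeholders for this sketch.

package_update: true          # bring the OS distribution up to date
package_upgrade: true

packages:                     # install dependencies and core applications
  - nginx
  - fail2ban

users:                        # create system and application users
  - name: webadmin
    groups: [sudo]
    shell: /bin/bash
  - name: appuser
    system: true

write_files:                  # one of many intricate config-file edits
  - path: /etc/nginx/conf.d/hardening.conf
    content: |
      server_tokens off;

runcmd:                       # restart services so they load the new config
  - systemctl restart nginx
```

And this covers only a fraction of the work; most of the hundred-plus steps still depend on the specifics of the machine in front of you.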
No part of this work is entirely mechanical. Virtually every step involves situational dependencies: on specifics of the hardware (or on the cloud APIs used to address virtualized hardware), on version and update specifics of the OS, on the utilities controlling installation of third-party software, plus app-vs.-OS and app-vs.-app interdependence issues.
These parts are all potentially in motion beyond anyone’s control, so DevOps engineers need to ensure (among many other variables) that the versions installed are compatible with the current build, that later updates don’t break things, and that critical security patches are applied non-disruptively. Business dependencies, e.g., OS and application licensing requirements and system-level administration, can add further complexity.
Important configuration steps are often idiomatic to the specific installation you’re creating. For example, apps and components will need to know keys and credentials for talking to databases. Some facts (e.g., hostname) may be unique to a single server build.
Automation
Automation – more than simple scripting – is the first-order solution to the problem of burgeoning dependencies. Configuration and deployment orchestration systems like Ansible, Puppet, Chef, Salt, Terraform and many others let you describe complex deployments precisely in structured documents (commonly .yaml files), creating a codebase that can be versioned and managed like application code.
Unlike bash scripts, which are procedural (making them both prescriptive and fragile), most deployment orchestrators take a template-driven, goal-state-seeking approach (i.e., “here’s a static description of the fully-configured server represented as a hierarchy – go figure out the order in which you need to install things”), letting you fall back on procedural approaches only as required to work around specific problems.
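As a concrete illustration of goal-state-seeking (a minimal sketch, not a production recipe), here is an Ansible-style playbook in which each task states a desired end state and the orchestrator works out what, if anything, needs doing. The inventory group, template name, and package are assumptions for illustration.

```yaml
# Goal-state sketch: each task describes a desired end state, not a procedure.
- name: Harden and configure a basic web server (illustrative)
  hosts: webservers            # assumed inventory group
  become: true
  tasks:
    - name: Nginx is installed
      ansible.builtin.apt:
        name: nginx
        state: present

    - name: Hardened config is in place
      ansible.builtin.template:
        src: nginx.conf.j2     # assumed template kept in the playbook repo
        dest: /etc/nginx/nginx.conf
      notify: Reload nginx

    - name: Nginx is enabled and running
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true

  handlers:
    - name: Reload nginx
      ansible.builtin.service:
        name: nginx
        state: reloaded
```

Running the same playbook twice changes nothing the second time: the system is already in the described state, which is exactly the property that fragile procedural scripts lack.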
Program files (Ansible calls them “playbooks”; other products use their own terminology) are human-readable and can be repurposed and reused conveniently by swapping out variable files containing deployment specifics, or by providing these interactively on the command line. The most sophisticated products (e.g., Terraform) work as part of a policy-managed automation ecosystem that includes task-dedicated servers for securely storing secrets and making them available, under encryption, to authorized deployments.
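To illustrate that reuse, the playbook above might pull its deployment specifics from a per-environment variables file; the file names, keys, and values below are hypothetical.

```yaml
# group_vars/dev.yml -- hypothetical per-environment specifics
site_hostname: dev.example.internal
tls_enabled: false
db_host: localhost
```

```yaml
# group_vars/prod.yml -- same keys, production values; swap the file, reuse the playbook
site_hostname: www.example.com
tls_enabled: true
db_host: db.prod.example.com   # real credentials would come from a secrets server, not this file
```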
Popular deployment managers offer ‘provider’ interfaces letting them communicate directly with the management APIs of public and private cloud platforms. This lets the system provision virtual resources as well as physical ones, creating and scaling dynamic stacks of functionality on demand. For example, Kubernetes can be installed on a set of Amazon VMs and scaled up by creating new VM nodes, attaching them to the cluster’s virtual network, provisioning them to host Kubernetes workers, then issuing the required ‘join’ command to hand their resource capacity over to the container orchestrator.
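As a sketch of the ‘provider’ idea, using Ansible’s AWS collection as one example (the AMI ID, instance type, and tags are assumptions, and Terraform or another tool would express the same intent in its own syntax), a play can ask the cloud API for new VMs before later plays configure them:

```yaml
# Provisioning sketch: talk to the cloud provider's API, then configure the result.
- name: Provision worker VMs for the cluster (illustrative)
  hosts: localhost
  connection: local
  tasks:
    - name: Worker instances exist
      amazon.aws.ec2_instance:              # provider module from the amazon.aws collection
        name: "k8s-worker"
        image_id: ami-0123456789abcdef0     # placeholder AMI
        instance_type: t3.medium
        region: us-east-1
        tags:
          role: kubernetes-worker
        state: present
      register: workers
      # Subsequent plays would configure these nodes and run the cluster 'join' step.
```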
Use of this kind of comprehensive automation vastly speeds up day-to-day operations, and dovetails naturally with the requirements of continuous integration and delivery (CI/CD, see below). In a fully-realized infrastructure-as-code scenario, you can use the exact same provisioning logic to create all the platforms you require: from small-scale developer environments running on a single VM, to large-scale production clusters on bare metal. In the ultimate evolution of such systems, triggering a deployment causes part or all of the underlying infrastructure to be built cleanly, before the application is deployed on top of it. This guarantees that software-defined infrastructure is in a known state (important for testing), and that its entire configuration is as represented in the codebase. Rapid, automated infrastructure deployment also makes advanced deployment techniques practical, such as Blue/Green (deployments without downtime) and Canary (controlled, gradual release of new code to some customers while other customers are still using the old, stable code).
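One way this dovetails with CI/CD, sketched here as a hypothetical GitHub Actions workflow (the workflow name, inventories, and playbook are assumptions), is to have the pipeline run the exact same provisioning code for every environment:

```yaml
# .github/workflows/deploy.yml -- hypothetical pipeline reusing one provisioning codebase
name: provision-and-deploy
on:
  push:
    branches: [main]

jobs:
  staging:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build staging from the same code as production
        run: ansible-playbook site.yml -i inventories/staging   # assumes Ansible on the runner

  production:
    needs: staging               # promote only after staging succeeds
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to production (could be gated for a Blue/Green or Canary rollout)
        run: ansible-playbook site.yml -i inventories/production
```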
Enter Containers and Microservices
Automation, however, doesn’t completely solve all dependency problems. For example, even a fully-automated deployment can fail because a component maker changes the name of a package, or deprecates a version and takes it out of circulation. A more deterministic answer, used heavily at Google, is to freeze components and their dependencies inside curated containers, letting you ignore local dependencies completely, while also speeding up deployments from ‘order of minutes to tens of minutes’ to ‘mere seconds.’
Using Docker tools in combination with standard Linux image-creation utilities, you can build a carefully-curated parent image for your containers, then add your application, its required directories, and any dependencies, define which ports your application needs exposed, and set other details. The result is a runnable, single-process application container that you can run individually, or reference, along with other containers, in a Compose file, letting you deploy multiple containers at one time (elaborations of the same concept can be used to create abstract application ‘deployments’ in Kubernetes). The container is immutable: unless you change it and create a new image, it’s immune to alteration, so it can’t be broken by updates. For the same reason, it has a very limited attack surface. It communicates with other components via network connections. And it scales horizontally by simple duplication.
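A minimal Compose-file sketch (service names, image tags, and ports are illustrative) shows the pattern: each service is an immutable image, exposes only the ports it needs, and talks to its neighbors over the network.

```yaml
# docker-compose.yml -- illustrative multi-container deployment
services:
  web:
    build: .                  # image built from the curated parent image plus app layers
    ports:
      - "8080:8080"           # only the port the app needs is exposed
    environment:
      CACHE_HOST: cache       # components talk to each other over the network
    depends_on:
      - cache

  cache:
    image: redis:7            # pinned, versioned image; nothing shifts underneath it
```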
This is the basis of what’s come to be known as the microservices design pattern. The idea is that applications, fully distributed as individual containerized components running on a container orchestrator (e.g., Docker Swarm, Mesos DC/OS, or Kubernetes), can be scaled out and back horizontally in response to changing demand for the services they provide.
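In Kubernetes terms, that horizontal scaling is expressed declaratively. A sketch of a Deployment (the service name and image are placeholders) asks for N identical replicas; changing the number scales the service out or back.

```yaml
# Illustrative Kubernetes Deployment: horizontal scaling by duplication
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-service          # placeholder microservice name
spec:
  replicas: 3                   # scale out or back by changing this (or via an autoscaler)
  selector:
    matchLabels:
      app: orders
  template:
    metadata:
      labels:
        app: orders
    spec:
      containers:
        - name: orders
          image: registry.example.com/orders:1.4.2   # immutable, versioned image
          ports:
            - containerPort: 8080
```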
How component services find each other and negotiate interoperation is, of course, a separate question. The creator of a deployment can use labels, environment variables, or other methods to help associated containerized components work together, but this won’t solve problems that arise when critical information only becomes available at runtime. For this, developers need tools like etcd, the resilient key/value store, to provide information to newly created containers in a known location, to help containerized apps learn about services in their environment, and to enable components to advertise their presence to one another. If you’re interested in exploring further, this recent article, by Bilgin Ibryam, Principal Architect at Red Hat, clearly enumerates the main design requisites for cloud-native applications.
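To close with one concrete illustration of the label-based approach described above (Kubernetes builds its own discovery machinery on top of etcd), a Service object gives the replicas from the earlier Deployment a stable name that other components can look up at runtime; the names here are again placeholders.

```yaml
# Illustrative Service: a stable name and virtual IP for whatever pods match the label
apiVersion: v1
kind: Service
metadata:
  name: orders                # other components reach this as http://orders:80 in-cluster
spec:
  selector:
    app: orders               # label-based discovery of the Deployment's pods
  ports:
    - port: 80
      targetPort: 8080
```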