DevOps in Desperation - Did Someone Say Ansible?
Over the past few weeks, I’ve been composing a complex monitoring demo that my colleague, Bill Bauman (Innovation Lead at Opsview), and I will present at Percona Live on April 24 in Santa Clara, CA. We’re excited to be talking about new paradigms for application delivery and monitoring, and we hope to meet some of you there!
‘Kay, so … this demo. Our thesis (and this may have changed by the week after next) is that today’s IT folks are facing a new version of an age-old, familiar problem. Back in the day, we were more rearward-looking about it: we talked a lot about ‘legacy infrastructure’ and how to keep it supported and viable for decades.
Then a bunch of things happened, among them: Y2K, Moore’s Law, several generations of re-thinks about the Internet and applications of ‘internet-like’ technology, mobility, BYODevice, virtualization, cloud computing, IaaS, PaaS, containers, Agile software development, continuous integration, the birth of DevOps and its very-serious, Google-sponsored, grown-up cousin, Site Reliability Engineering (SRE).
Nowadays, as always in IT, the trend is forward-looking: a visionary DevOps/SRE culture is busy automating away previously manual tasks (what Google’s SREs call ‘toil’) and building the next generation of applications on novel, software-defined platforms.
Meanwhile, given the unchanging realities of business, the result is a time-shifted version of the same old deal. Everyone is managing a hyper-diverse mix of brand-new, process-breaking, agile stuff. But they’re also administering a lot of fairly monolithic, mission-critical stuff, like databases, as well as the ‘substrates’ for all of this, which (when you take off your DevOps goggles) still look very much like traditional servers. Monitoring thus needs to provide coherent, actionable insight into this fast-expanding, diversifying, and deepening stack: still based on physical compute, but with its workloads being broken down into ever-smaller components that are dependently cooperative, horizontally scalable, and individually ephemeral.
Our demo environment is designed to represent this kind of diverse, modern infrastructure. We have a big Percona MySQL 5.7 database. We have a Kubernetes 1.9 cluster with a master and several workers. On top of the Kubernetes cluster, we’ve installed OpenFaaS: a Functions-as-a-Service framework that lets you easily (well … easily if you’re a rocket scientist, anyway) create autoscaling functions in a wide range of languages. We’ve written a set of Python functions, a JavaScript traffic generator, and a database-connection-pooling thingamabob that collaborate to stress all parts of the system, with functions grabbing images from the web, writing them to the database, and reading them back (with optional front-side and back-side format conversion).
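To make that a little more concrete, here’s roughly what an OpenFaaS stack file for one of those functions looks like -- a minimal sketch, with hypothetical function, handler, and image names rather than our actual descriptor:

```yaml
# stack.yml -- hypothetical OpenFaaS function definition, deployed with 'faas-cli deploy'
provider:
  name: openfaas
  gateway: http://127.0.0.1:8080    # the OpenFaaS gateway exposed by the Kubernetes cluster

functions:
  fetch-image:                      # hypothetical name; grabs an image from the web and stores it
    lang: python3
    handler: ./fetch-image          # directory containing handler.py
    image: example/fetch-image:latest
    labels:
      com.openfaas.scale.min: "1"   # autoscaling bounds enforced by OpenFaaS
      com.openfaas.scale.max: "10"
```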
All of this is wired up to Opsview Monitor, Opsview Mobile, Prometheus, Elasticsearch, Kibana, and other monitoring tools, so we can show a range of perspectives on what happens as a complex stack reacts to changing traffic and operational demands.
Impressive, eh? Of course (see above about the unchanging realities of business), we’ve assembled all this on a bunch of big (lots of vCPUs, lots of RAM) VMs, running on a pair of Lenovo laptops and a spare Linux desktop in my home office, behind a very nice (but clearly consumer-grade) WiFi-enabled router that terminates a VPN, letting us reach the demo environment securely from anywhere (in theory; and yes, it’s been tested).
Which brings me around to why and how I became a DevOps guru just last weekend. In building this demo, I decided, early on, not to manually configure static IPs on my equipment. Because who wants to do that when you’d rather play with Opsview’s brand new, beta Kubernetes monitoring Opspack, right? But then, as work progressed, I realized that I was stacking up layer upon layer of complex, highly-conditional, tweaky, delicate manual labor (read ‘toil’): installing, configuring, and integrating tools and components across multiple VMs and their hosts. A big investment of time, and really, really, really hard to redo from scratch if anything goes wrong.
Snapshots? Cloned VMs? Heck yeah! You truly have to love the resiliency of Percona MySQL, Kubernetes, and Opsview Monitor itself in the face of sudden hardware failure and connectivity outages. (Seriously -- you can power-cycle machines at random in a Kubernetes cluster, and when everything comes back up, the nodes find each other again, the pods containing your workloads all light back up, and life is usually remarkably good -- it’s like a miracle.) But I had a growing fear of what might happen if an overnight blackout or hard-drive failure took out one of my lovingly hand-tweaked VM ‘pets,’ or (powers forbid) caused me to lose the leases on all my dynamically-assigned IP addresses (this particular router defaults to a non-changeable, 24-hour lease time). If that happened, my demo (and I) would be sad.
So, last Friday night, I decided to turn my infrastructure into code: learn Ansible and capture the entire demo configuration, so that, in the event of an accident, I could … you know … redeploy in mere minutes, with a single command.
The pain! Ansible is a great tool, don’t get me wrong. But what it really is, is five or six generations of great tool, all actively in use (built on Python, it’s ‘pythonesque’ in this way). So working with it (at least initially) can be a daunting matter of Googling solutions and finding them (the community around Ansible is huge and highly motivated to share, thankfully), but then realizing that even though these solutions work (because: backward compatibility!), they’ve often been superseded by completely different solutions that work much better (or not, and you have to experiment to figure that out).
In other words, Ansible turns out to be exactly like some of the demo’s major components, e.g., Kubernetes and OpenFaaS -- both of which are evolving so fast that when you hit a snag and Google it, you’re immediately reading active StackOverflow and GitHub threads with latest replies within the past two hours. But it’s very unlike Opsview Monitor or Percona MySQL, which are fully commercial efforts and extremely stable. Percona is great; they are not kidding when they call their product a “drop-in replacement for MySQL.”
It took the better part of two days to encode my setup. That meant first quiescing and snapshotting all the elements; bringing everything down; creating brand-new, clean VMs with uniform key-based security, meaningful hostnames (and /etc/hosts files), and the Python and MySQL client prerequisites Ansible needs; snapshotting all of those (because, as a jocular colleague quipped: “Ansible is a terrific tool for severely breaking a lot of things very fast, in a very disciplined, structured way”); and then slowly building the stack back up from my documentation, debugging as I went.
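To give a flavor of that early bootstrap phase, it looked something like this -- a minimal sketch under my assumptions (Ubuntu-flavored VMs, a hypothetical hosts template), not my literal playbook:

```yaml
# bootstrap.yml -- hypothetical first play: get clean VMs to the point where
# Ansible's regular modules (and its mysql_* modules) can work at all
- hosts: all
  become: yes
  gather_facts: no    # fact gathering needs Python, which may not be installed yet
  tasks:
    - name: Bootstrap Python so Ansible modules can run
      raw: test -e /usr/bin/python || (apt-get update && apt-get install -y python)

    - name: Install MySQL client prerequisites for Ansible's mysql_* modules
      apt:
        name: ['python-mysqldb', 'mysql-client']
        state: present

    - name: Give every VM a meaningful /etc/hosts (template name is hypothetical)
      template:
        src: templates/hosts.j2
        dest: /etc/hosts
```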
It was actually fun, and it involved a few meaningful challenges, even at my noob level of expertise and with my limited goals for code re-usability and structure. Challenge #1 was figuring out how to make every task idempotent: i.e., to make it not matter if the task is run again and again (inevitably the case when developing and testing). Some of the idempotency is handled for you by Ansible’s built-in modules (e.g., apt) and enforced by their state-aware logic: if you use apt to install packages and stipulate ‘state: present,’ Ansible won’t try to reinstall packages that are already there. Some of it really isn’t, and can’t be: for example, modifying complex config files that differ on every host. In those cases, you end up doing a lot of regular-expression parsing with complex backreferences, using modules like lineinfile (which, incidentally, can automatically back up config files before it messes with them).
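Here’s what those two flavors look like side by side -- a sketch, with a made-up kubelet config edit standing in for the real per-host surgery:

```yaml
# Idempotent by module design: apt checks state before acting
- name: Ensure Docker is installed (re-running this is a no-op)
  apt:
    name: docker.io
    state: present

# Idempotent only if you make it so: edit-in-place with backreferences
# (the file and argument names here are hypothetical)
- name: Pin kubelet to this node's current IP
  lineinfile:
    path: /etc/default/kubelet
    regexp: '^(KUBELET_EXTRA_ARGS=.*--node-ip=)\S+(.*)$'
    line: '\1{{ ansible_default_ipv4.address }}\2'
    backrefs: yes
    backup: yes    # the automatic config-file backup mentioned above
```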
Another challenge was working around Ansible’s default assumption of root in cases where the most efficient way to configure something was to assume the identity of a specific non-root user. Ansible has ‘become’ methods that handle these transitions elegantly. But they don’t work well in certain edge cases: around configuring Kubernetes and Docker, you actually need to fully assume the user’s environment as well. Trivial to work around for a Linux genius, but a real head-scratcher for me.
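What finally worked for me was escalating to the specific user while carrying the crucial environment variables along explicitly, since ‘become’ doesn’t source that user’s login profile for you. A sketch, with a hypothetical ‘kube’ user and path:

```yaml
- name: Check cluster state as the non-root kubernetes user (user and path hypothetical)
  become: yes
  become_user: kube
  environment:
    KUBECONFIG: /home/kube/.kube/config   # 'become' won't load the user's profile, so set this explicitly
  shell: kubectl get nodes
  register: cluster_state
  changed_when: false                     # a read-only check; keeps the task idempotent
```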
By Sunday, things were working, and I extended the roles to include deployment of OpenFaaS, Percona, various dashboards, post-install tweaks to Docker Engine, and other ephemera. I was proud of myself. But I have to admit that, given the time investment required, I was also questioning whether the project was really necessary in the first place.
In other words, I was doing what I’ll bet some IT managers do when contemplating the cultural change to DevOps. “Infrastructure as code,” huh? Big whoop. Days lost from critical-path work, and what did I have to show for it? A bunch of fragile, noob Ansible code, attached to an infrastructure constantly in flux, and ultimately disposable (can’t be running Kubernetes on every laptop in the house when you have kids who need their Netflix, right?).
Well … I’m here to tell you, that’s crazy talk. And here’s how I know. Last night, around 2:00 AM, I was experimenting with OpenFaaS functions: building and rebuilding sample apps with changes, pushing them to Docker Hub, and deploying them on my cluster. And I randomly deployed one community sample function (whose name I’ll omit, because I’m sure the problem had nothing to do with the function itself -- it was cosmic rays or something) that failed to terminate when removed. Long story short, in attempting to remove it, I made a dumb mistake and ended up nuking some important items in Kubernetes’ kube-system namespace, breaking my stack and sending other important components into crash loops.
While I’m sure a real Kubernetes expert would have been able to fix this quickly and easily, I am not that guy. (N.B. Nor was this Kubernetes’ fault. The version of Kubernetes I’m using actively prevents you from deleting things in kube-system. You have to be formidably careless and radically cavalier to end-run the system’s error checking the way I did.)
So I restored clean snapshots of all my VMs and ran ‘ansible-playbook -K top.yml’ (that’s a little Salt joke for y’all -- P.S. my understanding of Salt is pretty limited, too), and ten minutes later, I was back in business, deploying functions on my infrastructure.
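For the curious, top.yml is just a thin wrapper that strings the roles together, roughly like this sketch (the group and role names come from my inventory, so treat them as hypothetical):

```yaml
# top.yml -- hypothetical top-level playbook; '-K' just prompts for the sudo password
- hosts: all
  become: yes
  roles:
    - common            # keys, hostnames, /etc/hosts, prerequisites

- hosts: db
  become: yes
  roles:
    - percona           # Percona MySQL 5.7

- hosts: kube_master
  become: yes
  roles:
    - kube_master
    - openfaas          # deployed on top of the cluster

- hosts: kube_workers
  become: yes
  roles:
    - kube_worker
```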
Lesson learned! So now I’m a true believer. Any amount of time spent automating is time well spent. Predictable future disasters: now safely avoided.
Meanwhile: We hope you’ll join us at Percona Live. We should have an interesting demo and a lot to discuss.