Part one of a series objectively examining important topics in contemporary data center monitoring, including observability, automation, and cost...
How to accelerate delivery with IT operations automation
This article, written by Opsview Content Lead John Jainschigg, originally appeared on InformationAge.com. The series objectively examines important topics in contemporary IT monitoring, including observability, automation, and cost control.
DevOps is about accelerating delivery of new products and services at scale, reliably and affordably. Doing this requires IT operations automation -- using software to build, configure, deploy, scale, update, and manage other software.
We usually think of monitoring as happening alongside this process -- its job is to alert operators when things go wrong, help analyze issues, and confirm compliance with service-level objectives (SLOs). But it’s better practice to treat monitoring as a vital part of IT ops automation. A modern, full-featured monitoring platform can be a powerful automation engine in its own right, and a critical enabler for larger automation initiatives in application and infrastructure lifecycle management and problem mitigation. It can even, in many cases, work to enable autonomous operations like self-scaling and self-healing.
Here are some of the ways your monitoring system may be able to help you get more done, eliminate human error, and meet (not just comply with) SLOs:
Streamlined, automated monitoring system deployment and lifecycle management
On-premises monitoring solutions can co-reside with monitored infrastructure, both in classic private clouds and datacenters and in provider-hosted virtual private clouds (VPCs). This lets them comply with security, privacy, data governance and other regulations, and helps them overcome bandwidth and cost barriers that can limit scalability of SaaS monitoring solutions. On-premises monitoring must be deployed, scaled, and updated, however -- and this can be daunting for all but very simple, single-server configurations. Forward-looking makers of this kind of monitoring platform are starting to exploit popular deployment automation frameworks like Ansible, Puppet, and Chef (the same ones DevOps is using to automate infrastructure deployment and routine operations) to streamline monitoring-system deployment in scaled-out, highly available configurations. For operator convenience, they’re hiding deployment-tool complexity behind web UIs and simplified configurators, though the standard tooling remains accessible for DevOps folks who wish to dovetail monitoring-system or metrics-collector deployment with infrastructure roll-outs -- a best practice. Details of monitoring can then be defined and maintained as part of definitive “infrastructure as code” repositories.
Automated agent deployment and monitored-object registration via API
Standard deployment tools like Ansible can also be used to inject, configure, and update monitoring components (endpoint agents, required libraries, etc.) on hosts. The same tools can extract facts from deployment manifests or directly from hosts at deploy time, then use monitoring-system APIs to rapidly configure monitoring for host infrastructure and applications, as well as “unmonitor” hosts at end of life. Routinely putting systems under monitoring as soon as they’re deployed enables rapid detection of issues in staging or production, which can in turn be used to trigger rollbacks, if required -- an important best practice for continuous delivery.
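As a rough illustration of the registration step, the sketch below turns facts gathered at deploy time into a payload that could be sent to a monitoring system's API. Everything named here is an assumption made for the example: the `HostFacts` shape, the role-to-template mapping, and the payload fields do not correspond to any particular product's API, and the network call itself is left out.

```python
import json
from dataclasses import dataclass, field


@dataclass
class HostFacts:
    """Facts a deployment tool might gather at deploy time (hypothetical shape)."""
    hostname: str
    address: str
    os_family: str
    roles: list = field(default_factory=list)


# Hypothetical template names; a real monitoring system defines its own.
BASE_TEMPLATES = {
    "linux": ["cpu", "memory", "disk-usage"],
    "windows": ["cpu", "memory"],
}
ROLE_TEMPLATES = {
    "web": ["http-response", "tls-certificate-expiry"],
    "db": ["db-connections", "disk-usage"],
}


def build_registration(facts: HostFacts) -> dict:
    """Build the payload that would register this host for monitoring."""
    templates = list(BASE_TEMPLATES.get(facts.os_family, []))
    for role in facts.roles:
        for template in ROLE_TEMPLATES.get(role, []):
            if template not in templates:  # avoid duplicate checks
                templates.append(template)
    return {"name": facts.hostname, "address": facts.address, "templates": templates}


def build_deregistration(facts: HostFacts) -> dict:
    """Build the payload that would 'unmonitor' a host at end of life."""
    return {"name": facts.hostname, "action": "remove"}


if __name__ == "__main__":
    web01 = HostFacts("web01", "10.0.0.5", "linux", ["web", "db"])
    # In practice the deployment tool would send this to the monitoring API.
    print(json.dumps(build_registration(web01), indent=2))
```

Deriving the payload from the same facts that drive deployment means the monitoring configuration cannot drift from what was actually rolled out, which is the point of doing registration at deploy time.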
Some monitoring platforms can ingest data from operations management tools and configuration management databases (CMDBs), such as those offered by ServiceNow and similar vendors. This lets operators quickly and confidently configure monitoring for existing infrastructure, applications, and full business services -- avoiding laborious and error-prone manual compilation of system facts.
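A minimal sketch of that kind of ingestion follows, assuming a CMDB export has already been read into a list of records. The field names (`name`, `business_service`, `status`) are hypothetical and do not reflect ServiceNow's or any other vendor's schema.

```python
from collections import defaultdict


def group_by_service(records):
    """Group operational CMDB items by the business service they support.

    Each record is a dict with hypothetical keys: 'name',
    'business_service', and 'status'.
    """
    services = defaultdict(list)
    for record in records:
        if record.get("status") != "operational":
            continue  # skip retired or not-yet-commissioned items
        services[record["business_service"]].append(record["name"])
    return {service: sorted(hosts) for service, hosts in services.items()}


if __name__ == "__main__":
    export = [
        {"name": "web01", "business_service": "storefront", "status": "operational"},
        {"name": "db01", "business_service": "storefront", "status": "operational"},
        {"name": "old-db", "business_service": "storefront", "status": "retired"},
        {"name": "mail01", "business_service": "email", "status": "operational"},
    ]
    print(group_by_service(export))
```

The resulting mapping is the kind of structure a monitoring platform could use to create one monitored business service per key, with the listed hosts as its components.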
Discovery and automonitoring
Sophisticated monitoring solutions use an increasing range of methods, including direct access to hosts via SSH and indirect access via configuration repositories like Active Directory and services like Windows Discovery, to extract facts from existing infrastructure and speed up monitoring configuration by operators. Leading-edge products are now moving towards automating the process completely: creating comprehensive maps of infrastructure, apps, and complete business services and monitoring these things without the need for any manual intervention or direction.
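Whatever the discovery method, the final step usually comes down to reconciling what was found against what is already monitored. The sketch below shows that comparison with plain sets of hostnames; the function name and the shape of its result are illustrative assumptions, not taken from any specific product.

```python
def reconcile(discovered, monitored):
    """Compare discovered hosts with those already under monitoring.

    Both arguments are sets of hostnames. Returns which hosts to start
    monitoring, which to stop monitoring, and which need no change.
    """
    return {
        "add": sorted(discovered - monitored),
        "remove": sorted(monitored - discovered),
        "keep": sorted(discovered & monitored),
    }


if __name__ == "__main__":
    found = {"web01", "web02", "db01"}
    current = {"web01", "db01", "legacy01"}
    plan = reconcile(found, current)
    print(plan)
```

A cautious implementation would probably flag the "remove" list for operator review rather than act on it automatically, since a host that fails to respond to discovery may simply be down.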
Alert processing, notification, escalation, integration
Alerting is, of course, a powerful form of IT operations automation. It entails decision-making, which may be simple (e.g., some metric has surpassed a given threshold) or significantly more complex (e.g., several metrics, from separate systems, have entered states predictive of a particular kind of known failure for a critical business service). It involves sophisticated assignment and escalation based on issue, team rotas, time/date and other variables. It demands outbound integration with communications methods such as email, with multi-mode notification platforms such as PagerDuty, or, at the more sophisticated end, with issue-management (e.g., Jira), operations workflow management (e.g., ServiceNow), collaboration (e.g., Slack) and other solutions. All this automation power works together to get the right alert to the right person at the right time, while avoiding over-alerting and fatigue -- smoothing operations and helping teams avoid downtime and meet SLO commitments.
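The decision-making and escalation described above can be sketched in a few lines. This is a simplified illustration: the metric names, limits, and escalation schedule are invented for the example, and a real platform would add rotas, time-of-day rules, and outbound integrations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Simple rule: a metric has surpassed a given limit."""
    metric: str
    limit: float

    def breached(self, metrics):
        # A missing metric is treated as not breached in this sketch.
        return metrics.get(self.metric, float("-inf")) > self.limit


def composite_breached(rules, metrics):
    """Complex rule: every listed threshold is breached at the same time."""
    return bool(rules) and all(rule.breached(metrics) for rule in rules)


# Hypothetical escalation schedule: (minutes unacknowledged, recipient).
ESCALATION = [(0, "on-call engineer"), (15, "team lead"), (45, "operations manager")]


def recipient(minutes_unacknowledged):
    """Pick who should be notified, given how long the alert has gone unacknowledged."""
    current = ESCALATION[0][1]
    for after_minutes, who in ESCALATION:
        if minutes_unacknowledged >= after_minutes:
            current = who
    return current


if __name__ == "__main__":
    # Hypothetical pattern taken to predict a known failure mode.
    failure_pattern = [Threshold("db.connections", 900), Threshold("web.latency_ms", 500)]
    sample = {"db.connections": 950, "web.latency_ms": 620}
    if composite_breached(failure_pattern, sample):
        print("notify:", recipient(20))
```

Requiring several conditions to hold together before notifying anyone is one straightforward way to reduce over-alerting, since a single noisy metric no longer pages the team on its own.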
Proactive issue mitigation
Finally, sophisticated monitoring solutions now provide the ability to execute scripts on hosts, or trigger centralized IT operations automation (e.g., Ansible) to perform tasks based on monitored conditions: from rebooting a failed server to scaling up an infrastructure cluster. Over the next decade, developments in machine learning will gradually improve the ability of monitoring systems to deduce the abstract structure and function of business services, monitor them automatically, predict their failure modes, repair them and optimize their performance -- either autonomously, or by optimal allocation of operator resources to tasks.
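The condition-to-action mapping behind this kind of mitigation might look like the sketch below. The condition names, the attempt limit, and the action functions are all hypothetical; each action only returns a description of what it would do, where a real system would run a script on the host or trigger an automation tool.

```python
def reboot_server(host):
    """Stand-in for rebooting a failed server."""
    return f"reboot {host}"


def scale_up_cluster(host):
    """Stand-in for adding capacity to the cluster that contains this host."""
    return f"scale up cluster containing {host}"


# Hypothetical mapping from monitored conditions to remediation actions.
REMEDIATIONS = {
    "server_failed": reboot_server,
    "cluster_saturated": scale_up_cluster,
}

# Illustrative limit so a failing fix is not retried forever.
MAX_ATTEMPTS = 2


def mitigate(condition, host, attempts_so_far):
    """Choose an automated action, or hand the issue to a human operator."""
    action = REMEDIATIONS.get(condition)
    if action is None or attempts_so_far >= MAX_ATTEMPTS:
        return f"escalate {condition} on {host} to an operator"
    return action(host)


if __name__ == "__main__":
    print(mitigate("server_failed", "web01", attempts_so_far=0))
    print(mitigate("server_failed", "web01", attempts_so_far=2))
```

Capping the number of automated attempts and falling back to an operator keeps self-healing from masking a persistent fault or looping on a remedy that does not work.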