Don't Monitor Yourself into a Madhouse

Done right, IT monitoring provides clarity and promotes operational effectiveness. Done wrong, it can make your staff crazy and limit business growth. 

This article, written by Opsview Content Lead John Jainschigg, originally appeared on InformationAge.com. It is part of a series that objectively examines important topics in contemporary IT monitoring, including observability, automation, and cost control.

In our last column, we discussed how IT monitoring provides visibility into the health, performance, and resource consumption of infrastructure, key applications (e.g., databases), and business services. But visibility needs disciplined application and careful filtering to make operations more efficient. Too much decontextualized data sows confusion and stress, increases costs, and can kill IT scalability and stunt business growth. 

Imagine you’re an IT operations generalist. You have broad technical expertise about your organization’s large IT estate. You deeply understand the business value of critical services, and are held accountable for maintaining their availability. And now, for some reason, you’re getting text messages from your monitoring system -- saying that a metric called MAX_CONNECTIONS has gone to critical on a particular MySQL database cluster. 

What does that mean? As an ops generalist, you probably have no idea. Ask an application architect or MySQL specialist, however, and they’ll give you a nuanced answer. MAX_CONNECTIONS is a built-in MySQL historical metric that records the fact that, at some point since the last DB restart, connections spiked over a known threshold (set to a very low default value). It conveys important information, but only to people who can use that information in context. For them, it prompts examination of upstream workloads and their traffic, of why and how applications create and relinquish connections over time, and of options for connection pooling and other variables. It also helps them determine whether (and how) the database or apps need tuning, scaling, or other TLC.

In short: this is an example of a metric that IT ops generalists should probably not be alerted on -- at least not directly -- because it’s a bad match for their skills, role, goals, and accountabilities. Sending this alert to an ops generalist will cause either confusion or concern over an issue that they cannot address. This can create a middleman scenario: a fire drill in which the generalist calls for help from a database specialist, who in turn (unless they are familiar with the connected application) might still be unable to provide definitive answers. In this case, real root-cause analysis is likely to come only from the specific DevOps personnel who architected, deployed, configured, stress-tested, and now manage the app and its database instance in production, and who understand how app and database are supposed to interact.

Right Info, Right Time, Right People

The lesson: visibility alone isn’t enough for efficient IT monitoring -- especially in larger, more complex enterprises. To be genuinely useful, enterprise monitoring solutions need to collect raw data, convert it to actionable insight (information), and deliver a subset of that information, filtered for relevance and utility, to the right people. Key to success is accomplishing this through appropriate communications channels and within process envelopes that facilitate proactive maintenance and drive incident responses that are both effective and proportionate.
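
To make the idea concrete, here is a minimal sketch of role-based alert routing. It is not how any particular monitoring product works: the role names, the routing table, and the channel choices are all hypothetical, and a real platform would hold this information in its notification profiles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str   # e.g. "mysql-cluster-01"
    metric: str    # e.g. "MAX_CONNECTIONS"
    severity: str  # e.g. "warning" or "critical"


# Hypothetical routing table: which roles can act on which metrics.
ROUTES = {
    "MAX_CONNECTIONS": ["db-specialist", "app-devops"],
    "SERVICE_AVAILABILITY": ["ops-generalist"],
}

# Hypothetical preferred delivery channel for each role.
CHANNELS = {
    "db-specialist": "slack",
    "app-devops": "slack",
    "ops-generalist": "sms",
}


def route(alert: Alert) -> list[tuple[str, str]]:
    """Return the (role, channel) pairs that should receive this alert.

    Metrics with no routing entry are delivered to nobody in this sketch.
    """
    roles = ROUTES.get(alert.metric, [])
    return [(role, CHANNELS[role]) for role in roles]


alert = Alert("mysql-cluster-01", "MAX_CONNECTIONS", "critical")
recipients = route(alert)
for role, channel in recipients:
    print(f"{alert.metric} ({alert.severity}) -> {role} via {channel}")
```

With this table, the MAX_CONNECTIONS alert reaches the database and application specialists over chat, and the ops generalist's phone stays quiet.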

The monitoring platform must then provide DevOps, IT operators, teams, managers, and business leaders with additional tools, letting them coordinate and fix the problem. These include solutions for collaboration, inquiry, root-cause determination, documentation of fixes, cost analysis, resolving the issue with impacted customers, and post-mortem analysis and process optimization.

Doing it Right: Integration Touchpoints

Enterprise IT monitoring solutions anticipate and enable ops/business process integration for each relevant persona. They serve the needs of larger staffs and diverse IT and business specializations (both within the enterprise and possibly within customer organizations as well) and link with the appropriate, domain-specific IT operations and collaboration tools. Here are some of the types of integration involved:

1. Plugins and monitoring packs: Top monitoring providers work hard to maintain a delicate balance between ease of use and the ability to customize. To help IT staffs with diverse skills start monitoring quickly, solution makers often combine plugins -- the software needed to interface with device- or software-specific sources like host servers, operating systems, and databases -- with preconfigured sets of recommended service checks, post-processing logic, alerting thresholds, and other configuration information. By installing such an integration package (Opsview’s Opspacks are an example), operators can monitor nearly any device or application without needing domain-specific knowledge: install the package, aim it at the technology, and monitoring just works.

One potential caveat of this approach, however, is that pre-packaged metrics -- as best-practice collections -- may include too much information to suit the specific needs of a given persona. A pre-built monitoring pack for MySQL databases, for example, might thus include instructions to alert on a variable like MAX_CONNECTIONS: useful for specialists, but not for generalists. For this reason and others, monitoring software should also provide features enabling highly granular customization of prebuilt monitoring templates. Savvy IT operators and domain specialists can use these to create subsets and aggregates of relevant service checks in versions that serve different roles’ needs for insight and notification.

2. Notification profiles, preferences, on-call lists, escalation and contingency logic, and other platform-side alerting management features: Enterprise monitoring platforms let operators create notification profiles for applications, teams, roles, and individuals. They can precisely manage which alerts each role receives, ensure seamless delivery of alerts to role-representatives currently on-call, customize delivery of alerts in various channels (e.g., email, text message, Slack or other messaging solution, or via an external notification or broader-based ops management platform), drive effective escalation if specific alerts aren’t resolved in timely fashion, and document transmission and acknowledgment of alerts for later audit, analysis, and verification. Within bounds of agreed-on policy, these systems also enable individuals to further tweak notification behavior, e.g., by temporarily suppressing repeated alerts on a known condition, once one such alert has been acknowledged. The goal is to get the right alert to the right person at the right time, without bombarding them with unnecessary detail or waking them at 3 AM over something that isn’t business critical.

3. Integrations with enterprise ops management applications: Top enterprise monitoring solutions provide ease of integration with popular operations management suites (e.g., ServiceNow), notification providers (e.g., PagerDuty), ticketing and incident platforms (e.g., JIRA), collaboration frameworks (e.g., Slack), and other tooling. These integrations let the monitoring platform initiate operations and maintenance workflows in response to conditions. The goal is to reduce time to response, save money, and hold to “incidents per shift” goals by putting operations on a highly proactive footing: pushing required maintenance and other non-critical conditions into normal process, instead of alerting on them. The resultant stress reduction and business cost savings are well worth the planning and integration.

4. Integrations with collaboration tools: Once processes and workflows have been initiated, tickets created, etc., enterprise monitoring platforms may provide extensive and sophisticated integration with API-equipped team collaboration tools like Slack. The monitoring system can use these integrations to spin up and label multimedia group chats around specific issues, connect with pre-built chats for teams, and inject relevant metrics information into shared communications channels. Team communications in these channels become part of a (real time) audit trail for incident response, letting those who join later get up to speed immediately, and facilitating escalation.

5. Custom dashboards, Business Service Monitoring: These are critical to providing specific roles with exactly the information they need to be effective. Top monitoring solutions let you create situational and/or role-based dashboards that aggregate important metrics, provide simple, easily grasped visualization of key performance indicators, and simplify drill-down to detailed metrics to help specialists determine root causes. Dashboards can also help fulfill IT organizations’ need to provide KPIs to technical and business leaders in consumable forms. Business Service Monitoring (BSM) enables aggregation of metrics from all components and business logic that contribute to a specific business service. It imposes additional logic to evaluate the availability and health of that business service, based on the combined states of all the components that deliver it. This kind of dynamic, business-relevant view can be critical for ops generalists, whose main concern is whether the actual end-user-facing services (as opposed to just the infrastructure) are available and healthy.

6. External analytics, visualization, CMDB and other tools: Enterprise monitoring platforms provide feature-rich APIs that enable extraction of collected metrics (e.g., time-series data) by analytics and visualization platforms (e.g., Grafana, Splunk), letting specialists integrate and apply these tools holistically. Enterprise monitoring also exchanges data with Configuration Management Databases (CMDBs) -- consuming information to get target systems monitored more quickly, and providing information to help keep CMDB and monitoring configurations properly aligned.

7. Reports: Comprehensive reporting capabilities let monitoring feed customized information to different roles for operations, budgeting, and other management purposes. Reports are auditable long-term, and document compliance.
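
The Business Service Monitoring roll-up described in item 5 can be sketched in a few lines. This is an illustrative model only, assuming a three-level state scale and two simple rules (a redundant group tolerates some failed members; a business service is as healthy as its worst required part). The function names and the example service are hypothetical, and real BSM engines offer richer logic.

```python
from enum import IntEnum


class State(IntEnum):
    OK = 0
    WARNING = 1
    CRITICAL = 2


def redundant_group(states: list[State], min_healthy: int) -> State:
    """Evaluate a group of interchangeable components (e.g. web nodes).

    All members OK -> OK; at least min_healthy OK -> WARNING (degraded
    but still serving); otherwise CRITICAL.
    """
    healthy = sum(1 for s in states if s == State.OK)
    if healthy == len(states):
        return State.OK
    if healthy >= min_healthy:
        return State.WARNING
    return State.CRITICAL


def business_service(parts: dict[str, State]) -> State:
    """A business service is as healthy as its worst required part."""
    return max(parts.values(), default=State.OK)


# Hypothetical online shop: three web nodes (two required), two DB nodes (one required).
web = redundant_group([State.OK, State.CRITICAL, State.OK], min_healthy=2)
db = redundant_group([State.OK, State.OK], min_healthy=1)
shop = business_service({"web": web, "db": db})
print(f"web={web.name} db={db.name} shop={shop.name}")
```

One web node is down, yet the shop as a whole is only degraded. That service-level answer is the one an ops generalist needs, while the failed node itself can be routed to the team that owns it.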

Trust your Platform (Then Iterate and Optimize)

For larger organizations and more complex IT estates, achieving high operational efficiency with monitoring demands ongoing work to gradually build and optimize customizations and integrations supporting team efficiency and productivity. Best practice is to implement effective monitoring using enterprise-supported, domain-specific monitoring templates, then gradually customize as appropriate for your IT estate. It makes good sense to begin by iterating over notifications, since these will impact personnel and workflow most directly: working to reduce unneeded alerts and deal with more conditions proactively, within standardized process workflows.
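
One way to start iterating over notifications is to rank checks by how often they alert without anyone acting on them. The sketch below assumes you can export a notification log as (check name, was it acted on) pairs. The log contents, check names, and thresholds are invented for illustration.

```python
from collections import Counter

# Hypothetical notification log: (check name, was the alert acted on?)
LOG = [
    ("mysql.MAX_CONNECTIONS", False),
    ("mysql.MAX_CONNECTIONS", False),
    ("mysql.MAX_CONNECTIONS", False),
    ("web.http_response", True),
    ("disk.root_usage", False),
    ("disk.root_usage", True),
]


def noisy_checks(log, min_alerts=2, max_action_rate=0.25):
    """Return (check, alert count) pairs for checks that alert often but
    are rarely acted on, noisiest first."""
    sent = Counter(name for name, _ in log)
    acted = Counter(name for name, acted_on in log if acted_on)
    noisy = [
        (name, count)
        for name, count in sent.items()
        if count >= min_alerts and acted[name] / count <= max_action_rate
    ]
    return sorted(noisy, key=lambda item: item[1], reverse=True)


candidates = noisy_checks(LOG)
for name, count in candidates:
    print(f"{name}: {count} alerts, rarely acted on")
```

Checks that surface here are candidates for re-routing to a different role, for a threshold change, or for handling through a ticketing workflow instead of an alert.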
