Operations Alerts and the Tragedy of the Commons

Today I would like to continue my never-ending quest of finding parallels between IT, economics and the social sciences. I will start with a preamble, but if you are already familiar with the concept of an “operations alert” in the context of IT, you can skip it.

Preamble

I have spent a big part of my career in technology operations at small, medium and huge companies, so the concept of an “operations alert” is very dear to my heart. For those who are not familiar with it, an operations alert is an automated message about something that went wrong in your IT environment or infrastructure. For example, a server crashed or an application stopped responding. Some people call these things “alarms” instead of alerts.

These messages can take many forms. When a company is small, it almost always starts with alerts sent out as email or SMS messages. Later on, as the number of alerts sent and analyzed each day grows, companies usually deploy a dedicated system that centralizes, aggregates and presents the alerts in a more manageable way. It’s usually a client-server architecture, where the clients are monitoring agents deployed on all or most machines that send information to a central server for processing. Sometimes there are no agents, and the server itself regularly checks network services by sending probes (this is sometimes called active monitoring) and generates alarms based on the responses (or lack thereof). Examples of open source solutions in this area are Big Brother (and its clones and descendants), Hyperic, OpenNMS, Nagios, Zabbix, Zenoss and many others.
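To make the agentless variant concrete, here is a minimal sketch of an active check in Python. It is not how any of the tools above work internally; the function names (probe_tcp, run_checks) and the target list are made up for illustration, and a real system would add scheduling, retries and protocol-aware checks.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def run_checks(targets, probe=probe_tcp):
    """Probe every (name, host, port) target and return one alert per failure.

    The probe is a parameter so that a different check (or a fake one in
    tests) can be swapped in without touching the loop.
    """
    alerts = []
    for name, host, port in targets:
        if not probe(host, port):
            alerts.append(f"{name}: no response on {host}:{port}")
    return alerts


if __name__ == "__main__":
    # Hypothetical targets; replace with services you actually run.
    targets = [("dns", "127.0.0.1", 53), ("web", "127.0.0.1", 80)]
    for alert in run_checks(targets):
        print(alert)
```

Note that the alert produced here is just a statement of fact ("no response"). Nothing in the check itself says how important the failure is.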

When an organization gets a ton of alerts each day, it needs to prioritize them, and so the concept of “alert severity” is born. It’s usually one of “critical”, “major”, “minor”, “warning”, “info” and “debug”. The higher the severity, the more important an alert is and the sooner it needs to be analyzed. Usually, alerts are created by specialized engineers who are responsible for a particular server or application (called SMEs, or subject matter experts), while the people who receive them and react to them are generalists (engineers who do not focus on a particular technology but have very broad expertise in system and network administration).

Who Sets Severity?

I looked at many tools and observed how several organizations implemented operations monitoring, and I noticed a pattern: alert severity is set by SMEs (I was such an SME up until recently). An SME analyzes the pool of alerts that their systems can ever generate, rates them by how important they are, and assigns severities accordingly. Generalists monitor the dashboard and supposedly react to alarms in order of decreasing severity.

All good, right? Wrong! Enter the Tragedy of the Commons. Generalists’ attention and time are a finite resource. To get attention for the alerts sent from their systems, an SME tends to inflate the severity of those alerts. As a result, quite soon, all your alerts are marked “critical”. All SMEs combined would be better off if everyone assigned severity fairly, but each individual SME is better off if they inflate the severity of alerts sent by their own systems. Niiiiice!!!
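The incentive can be shown with a toy model in Python. Everything in it is an assumption made up for illustration: two SMEs (A and B), each owning one truly critical alert (value 3) and two minor ones (value 1), generalists who can handle only three alerts, and ties between equally rated alerts broken at random. An SME's payoff is the expected true value of their own alerts that get handled.

```python
from fractions import Fraction
from itertools import groupby

CRITICAL = 3  # the highest severity an SME can report in this toy model


def expected_payoffs(alerts, capacity):
    """alerts is a list of (owner, true_value, reported_severity) tuples.

    Generalists work down the dashboard by reported severity. When a group of
    equally rated alerts does not fit into the remaining capacity, each alert
    in the group is handled with equal probability.
    """
    payoffs = {}
    remaining = capacity
    ordered = sorted(alerts, key=lambda a: a[2], reverse=True)
    for _, group in groupby(ordered, key=lambda a: a[2]):
        group = list(group)
        share = min(Fraction(1), Fraction(remaining, len(group)))
        for owner, true_value, _ in group:
            payoffs[owner] = payoffs.get(owner, Fraction(0)) + share * true_value
        remaining = max(0, remaining - len(group))
    return payoffs


def scenario(a_inflates, b_inflates, capacity=3):
    """Build the alert pool for SMEs A and B and compute their payoffs."""
    true_values = [3, 1, 1]
    alerts = []
    for owner, inflates in (("A", a_inflates), ("B", b_inflates)):
        for value in true_values:
            reported = CRITICAL if inflates else value
            alerts.append((owner, value, reported))
    return expected_payoffs(alerts, capacity)


if __name__ == "__main__":
    for a in (False, True):
        for b in (False, True):
            result = scenario(a, b)
            print(f"A inflates={a}, B inflates={b}: "
                  f"A={float(result['A'])}, B={float(result['B'])}")
```

With these numbers, two honest SMEs get 3.5 each. If only A inflates, A gets 3.75 and B drops to 2.25. If both inflate, each gets 2.5. Inflating is the better move whatever the other SME does, yet both end up worse off than if they had both stayed honest.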

Solution

I think there might be a solution to the tragedy of the commons problem in IT operations monitoring after all. It’s easy to explain but difficult to implement. Your alerts should not have severity at all. In other words, when an alert message reaches the central server, it should carry no severity. Once an alert is received, its severity should be a function of the real-time status of the entire environment. One minute a fan failure on your secondary DNS server is top priority (and hence a “crit”), but the next minute a network interface failure on your primary DNS server becomes a much higher priority. And a web front door outage half an hour later easily trumps both of these problems (provided they are not related, of course).
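As a toy illustration only (not a worked-out design), the idea of severity as a function of the environment could look something like this in Python. The inventory, the service weights, the 0.3 factor for a host that is degraded but still up, and the scoring formula are all invented for this example.

```python
from dataclasses import dataclass

# Hypothetical inventory: which service each host belongs to, and how much
# each service matters to the business. All values are made up.
HOST_SERVICE = {"dns1": "dns", "dns2": "dns", "web-lb": "web"}
SERVICE_WEIGHT = {"dns": 5, "web": 50}


@dataclass(frozen=True)
class Alert:
    host: str
    problem: str
    host_down: bool  # a fact about the failure, not a severity


def score(alert, open_alerts):
    """Priority of one alert given everything else that is currently open."""
    service = HOST_SERVICE[alert.host]
    # Other hosts of the same service that also have open alerts: the more
    # redundancy is already in trouble, the more this alert matters.
    troubled_peers = {
        a.host for a in open_alerts
        if a.host != alert.host and HOST_SERVICE[a.host] == service
    }
    impact = 1.0 if alert.host_down else 0.3
    return SERVICE_WEIGHT[service] * impact * (1 + len(troubled_peers))


def rank(open_alerts):
    """Re-rank all open alerts; call this whenever the set of alerts changes."""
    return sorted(open_alerts, key=lambda a: score(a, open_alerts), reverse=True)


fan = Alert("dns2", "fan failure", host_down=False)
nic = Alert("dns1", "network interface failure", host_down=True)
web = Alert("web-lb", "front door outage", host_down=True)

print([a.problem for a in rank([fan])])
print([a.problem for a in rank([fan, nic])])
print([a.problem for a in rank([fan, nic, web])])
```

The fan failure is at the top while it is the only open alert, drops below the network interface failure once that arrives, and both fall below the front door outage. The SME only reports facts (which host, what broke, whether the host is down), so there is no severity field left to inflate.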

I have some ideas about how this can be implemented, but I am not ready to write them up yet. For now, when you evaluate monitoring solutions and vendors, consider that red severity field in their nice screenshots and ask yourself whether it’s going to help you achieve better operations efficiency, or lead you down the path of the tragedy of the commons.
