Risk in IT Systems

TL;DR - This post is not about techniques for running IT systems better or more reliably, or for dealing with failure more efficiently; it’s about mathematical models (or the lack thereof) of what risk in IT systems actually is. I bet at least 80% of you will stop reading now. For the other 20% - I am happy you are curious.

The ideas I describe in this post were shaped in large part by the literature on risk management and modeling in capital markets and portfolio management.

Do you know what our biggest problem in IT is? It’s our complete and utter inability to measure the risk of the systems and services that we build. Often, people don’t even understand what risk is and confuse it with other concepts.

Why do we need to measure risk? Among other things, to do A/B testing. When you implement two approaches to solving a problem and need to decide which one to put in production, you’d better be able to compare the risk of each approach and act accordingly. A deeper understanding of risk should also help us analytically determine the risk of bigger systems that consist of individual smaller systems, components and services whose risk we know. The ability to measure risk analytically would let us forecast a system’s risk before spending time and resources building it - a huge win.

Let’s say you have a system S. It can be an individual service in your stack, or it could be an entire web-facing system that consists of multiple frontend and backend services. Let A(t) be a function that somehow measures the overall health of your system S at time t, such that 0 ≤ A(t) ≤ 1. There is no universal rule for what A(t) should be. For some systems, it could be the percentage of incoming queries answered correctly within a predetermined time. For others, it could be the percentage of available workers.
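
To make this concrete, here is a minimal sketch (in Python) of one possible A(t) - the fraction of queries answered correctly within a deadline over a short window ending at t. The request-log structure, window length and deadline are assumptions made up for illustration.

    # Sketch of one possible A(t): the fraction of queries answered correctly
    # and within a deadline in a window ending at t. The "requests" structure
    # (tuples of timestamp, success flag, latency in seconds) is hypothetical.

    def health(requests, t, window=60.0, deadline=0.5):
        """Return A(t) in [0, 1] for the window (t - window, t]."""
        in_window = [req for req in requests if t - window < req[0] <= t]
        if not in_window:
            return 1.0  # assumption: no traffic counts as fully healthy
        good = sum(1 for ts, ok, latency in in_window if ok and latency <= deadline)
        return good / len(in_window)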

Now let’s pick two numbers f and r such that 0 ≤ f < r ≤ 1 (as you may have guessed, “f” stands for failure and “r” stands for recovery). We will say that system S experienced a failure if its A(t) falls to or below f, and we will say that system S has recovered from the failure when, after hitting f, A(t) climbs back up to at least r.

In this formalization, time to recovery (TTR) is going to be Tr - Tf such that:

  • A(Tf) ≤ f
  • A(Tr) ≥ r
  • Tf < Tr
  • ∀t, Tf ≤ t < Tr : A(t) < r

If you average TTRs over a period of time, you get a well known measure called mean time to recovery (MTTR).
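
Here is a minimal sketch of how these definitions could be applied to a sampled health series. The threshold values and the scan below are my own illustration, not a prescribed algorithm; "samples" is a list of (t, A(t)) pairs sorted by time.

    # Sketch: detect (Tf, Tr) failure/recovery pairs from samples of A(t)
    # using the two thresholds f and r, then compute TTRs and their mean (MTTR).

    def failure_recovery_pairs(samples, f=0.5, r=0.9):
        pairs, t_failure = [], None
        for t, a in samples:
            if t_failure is None and a <= f:        # failure: A(t) drops to or below f
                t_failure = t
            elif t_failure is not None and a >= r:  # recovery: A(t) climbs back up to r
                pairs.append((t_failure, t))
                t_failure = None
        return pairs

    def mttr(samples, f=0.5, r=0.9):
        ttrs = [tr - tf for tf, tr in failure_recovery_pairs(samples, f, r)]
        return sum(ttrs) / len(ttrs) if ttrs else 0.0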

Similarly, time between failures (TBF) is going to be Tf2 - Tf1 such that:

  • A(Tf2) ≤ f
  • A(Tf1) ≤ f
  • Tf1 < Tf2
  • ∃! Tr , Tf1 < Tr < Tf2 : A(Tr) ≥ r
  • ∀t, Tf1 < t < Tr : A(t) < r
  • ∀t, Tr < t < Tf2 : A(t) > f

(The last three conditions indicate that there was exactly one recovery between the failures at Tf1 and Tf2.)

If you average TBFs over a period of time, you get mean time between failures (MTBF).
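
Building on the sketch above, TBFs fall out of consecutive failure onset times; again, this is just an illustration under the same assumptions.

    # Sketch: TBF is the gap between consecutive failure onsets; by construction,
    # failure_recovery_pairs guarantees exactly one recovery between them.

    def mtbf(samples, f=0.5, r=0.9):
        failure_times = [tf for tf, _ in failure_recovery_pairs(samples, f, r)]
        tbfs = [t2 - t1 for t1, t2 in zip(failure_times, failure_times[1:])]
        return sum(tbfs) / len(tbfs) if tbfs else 0.0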

TBF is an indicator of how prone S is to failure (the cause could be bad system design, but it could also be a complex domain with a lot of external dependencies). TTR shows how good your tooling and processes around failure detection and remediation are. But by themselves, neither of them, nor their means, tells you anything about risk.

Instead, pick a duration for a period of time, sum all TTRs in each period and divide the sum by the length of the period. For example, let’s use a monthly scale: if in the month of January your TTRs were 3 hours, 5 hours and 1 hour, then 3 + 5 + 1 = 9 hours, divided by 24*31 = 744 (the number of hours in January), gives 1.21%.
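
In code, the January example is just a sum of TTRs divided by the length of the period:

    # The January example: TTRs of 3, 5 and 1 hours in a 31-day month.
    ttrs_hours = [3, 5, 1]
    hours_in_month = 24 * 31                       # 744 hours in January
    monthly_failure = sum(ttrs_hours) / hours_in_month
    print(f"{monthly_failure:.2%}")                # -> 1.21%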

If you repeat this arithmetic exercise for many months, you will end up with a series of numbers representing the amount of failure in S per month. We will assume here that the random variable of monthly amount of failure is normally distributed (this assumption could be proven wrong in the future, but I am going with it by default because many processes in nature are normally distributed; our industry doesn’t publish enough operations metrics datasets publicly, so confirming whether this is true is nearly impossible at this time - see below).

If you average these numbers, what you end up with is the expected amount of failure per month (the mean of the normal distribution, corresponding to the top of the bell curve). Is this risk? No.

Imagine a system that has been down 5% of the time every single month for the past 5 years, no matter what you do, no matter how much the business has grown over this period, how much more complex the system has become, or how much you have improved your automation. Risk in this system is almost 0 - you are almost positive its amount of failure next month will once again be 5%.

Risk corresponds to the standard deviation of the monthly amounts of failure, not to their mean (standard deviation is the square root of variance). A system S1 with smaller variance (for example, S1 is in failure mode roughly the same amount of time every month) has lower risk than a system S2 with larger variance (for example, S2 is in failure 10% of the time one month and 0.1% of the time another), even if the expected amount of failure of S1 is bigger than that of S2.
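
A small sketch contrasting two hypothetical systems like the ones above; the monthly figures are made up to match the description, with S1 deliberately given the larger expected amount of failure.

    # Sketch: S1 fails a steady ~6% every month; S2 swings between 10% and 0.1%.
    # S1 has the higher expected amount of failure but much lower risk (stdev).
    from statistics import mean, stdev

    s1 = [0.060, 0.061, 0.059, 0.060, 0.060, 0.062]  # steady ~6% (hypothetical)
    s2 = [0.100, 0.001, 0.100, 0.001, 0.100, 0.001]  # swings wildly (hypothetical)

    print(f"S1: expected failure {mean(s1):.2%}, risk {stdev(s1):.2%}")
    print(f"S2: expected failure {mean(s2):.2%}, risk {stdev(s2):.2%}")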

How about an individual change (say, something that you get asked about when you attend a change management meeting), either repetitive like pushing out a new release or performed only once? Similar to a system, a change has a success function with values between 0 (the change totally didn’t work) and 1 (the change worked as expected, without unexpected side effects). The same logic about the expected success rate, and about risk as the standard deviation of outcomes, applies here as well - you will be modeling the change as if it were performed over and over again.
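
Under the same assumptions, a repeated change can be scored the same way; the per-run scores below are hypothetical.

    # Sketch: score each execution of a repetitive change (e.g. a release push)
    # with its success function value in [0, 1]; the mean is the expected success
    # rate and the standard deviation is the change's risk.
    from statistics import mean, stdev

    release_scores = [1.0, 0.9, 1.0, 0.2, 1.0, 1.0, 0.8]  # hypothetical outcomes
    print(f"expected success: {mean(release_scores):.2f}, risk: {stdev(release_scores):.2f}")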

To sum up:

  • TBF shows how prone your system is to failure
  • TTR shows how good your troubleshooting is
  • mean amount of failure per period shows expected amount of failure per period
  • standard deviation of amount of failure per period is risk
  • risk and expected amount of failure are two totally different measures

Last year Christopher Brown, in his talk, discussed devops as a craft and as a science. In order to properly graduate from the former to the latter, our discipline needs to start sharing hard data about techops metrics so that research can be done and new theories can be tested against real-life historical data. I am primarily looking at shops of techops excellence like Amazon, Etsy, Facebook, Github, Google, Netflix, Twitter and others.

Categories: devops