Our field is currently undergoing a seismic shift toward becoming more and more quantitative. While in the past a chart was viewed by many as the state of the art, charts won’t surprise anyone today. In fact, we now have systems that can produce any number of charts at varying time scales with just a few clicks.
On our way towards data-driven technical operations, where runtime data are collected to be analyzed by machines instead of humans, we will inevitably see statistics play a bigger and more prominent role in our data analysis systems.
As someone who is very interested in statistics, I see the following five potential pitfalls waiting for us as the use of statistics becomes more widespread in tech ops. I will order them by decreasing severity (this, of course, is very subjective).
Applicability of current estimates to the future
Statistics is about estimating parameters of an entire population (see some terminology) based on a sample of observations. It’s very easy to take one of your backend services, calculate the 95th percentile of its response time during the top 4 usage hours on each of the last 8 Wednesdays, compute some summary statistics of this sample, and then say that next Wednesday will be somewhat like this, with such and such error.
The problem here is that we currently take for granted that the past 8 Wednesdays are a good predictor of what’s going to happen next Wednesday.
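The Wednesday scenario above can be sketched in a few lines. This is a minimal illustration with made-up numbers, not a recommended forecasting method; the point is that the final estimate is only valid if past Wednesdays actually resemble the next one.

```python
import statistics

# Hypothetical p95 response times (ms) for one backend service,
# one value per peak-usage window on each of the last 8 Wednesdays.
weekly_p95 = [212.0, 198.5, 205.1, 220.3, 201.7, 215.9, 208.4, 210.2]

mean_p95 = statistics.mean(weekly_p95)
# Standard error of the mean: sample stdev / sqrt(n)
sem = statistics.stdev(weekly_p95) / len(weekly_p95) ** 0.5

# "Next Wednesday will be somewhat like this, with such and such error" --
# an implicit assumption that the past 8 Wednesdays predict the next one.
print(f"estimated p95: {mean_p95:.1f} ms +/- {sem:.1f} ms (SEM)")
```

Everything here hinges on the unstated assumption of stationarity: the arithmetic is trivial, the assumption is not.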
Identification of good time boundaries for relevant samples
Let’s say you are studying the performance characteristics of your database cluster. You take data for the past 2 months, analyze it as a single sample, and come up with some results. What you neglected to account for, however, is that at some point during those 2 months you upgraded the disks on your database servers to much faster ones. Or you swapped out one library for another, which led to a significant throughput increase.
I see this all over the place. If your sample includes observations that are not similar, your sample is not good enough. See this for more details.
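A toy demonstration of the disk-upgrade scenario, with fabricated numbers: pooling observations from two different regimes produces summary statistics that describe neither regime.

```python
import statistics

# Hypothetical daily throughput (requests/s) around a disk upgrade.
# The first 30 days predate the upgrade; the next 30 follow it.
before = [1000 + (i % 7) * 5 for i in range(30)]  # pre-upgrade regime
after = [1400 + (i % 7) * 5 for i in range(30)]   # post-upgrade regime

pooled = before + after

# The pooled standard deviation is dominated by the regime change,
# not by day-to-day variation, and the pooled mean describes neither period.
print("pooled mean/stdev:", statistics.mean(pooled), statistics.stdev(pooled))
print("before mean/stdev:", statistics.mean(before), statistics.stdev(before))
print("after  mean/stdev:", statistics.mean(after), statistics.stdev(after))
```

Splitting the sample at the known change point (or detecting such points when they are unknown) is the fix; analyzing the pooled sample is the pitfall.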
Data exploration vs detection of actionable alerts
There are two distinct use cases for statistics in tech ops. On one hand, you have people looking to learn something from the data. It could be hypothesis testing, or trying to detect a trend, or predicting the future.
The other use case, which is completely distinct, is alerting off of statistical data in near real time.
These 2 use cases do overlap a bit, but in large part they are separate, and a tool meant for one may not be a good fit for the other.
Dissemination of insight gleaned from data
Most people who pay attention to what’s going on in the world get bombarded with thousands of statistics per day. These numbers often come from all sorts of “experts” and talking heads who throw them in just to sound more knowledgeable. Looking for an example? Any statement that includes “on average” without explaining how the sample was obtained.
This matters to us because sooner or later we will have to share the insight we gain from our data analysis with others, and we have to be aware that our audience’s understanding of statistics could be subpar. This is especially important if the person with whom you share the results of your analysis is going to make a decision based on his or her interpretation of those results.
Low bar set by enterprise buyers
A lot of software vendors whose tools include elements of statistical analysis sell to enterprises. Unfortunately, the state of the art in the enterprise is so low that when a salesperson says “standard deviation,” they immediately get enterprise buyers’ attention.
As a result, tools include statistics for all sorts of things and misuse them. Case in point: claiming anything like “roughly 68% of your sample lies within one standard deviation of the mean” without first establishing normality.
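The "within one standard deviation" figure is a property of the normal distribution specifically. A quick simulation shows how far off it can be for skewed data; here I use an exponential distribution as an illustrative non-normal example.

```python
import random
import statistics

random.seed(0)

# Exponential distribution with mean 1 -- heavily right-skewed,
# a rough stand-in for something like response-time data.
sample = [random.expovariate(1.0) for _ in range(100_000)]

mu = statistics.mean(sample)
sigma = statistics.stdev(sample)
within = sum(1 for x in sample if abs(x - mu) <= sigma) / len(sample)

# For a normal distribution this would be ~0.68; for an exponential
# it is ~0.86. The "68% rule" is not distribution-free.
print(f"fraction within 1 stdev: {within:.3f}")
```

Any vendor claim built on the 68% figure is implicitly claiming the data are normal, which for typical ops metrics (latencies, queue depths) is rarely true.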
Obtaining valid results by applying statistics in tech ops is not going to be easy, but it’s an inevitable next phase that should be embraced as a challenge.
For more, please see this post.