Troubleshooting

One of the areas of tech ops that doesn’t get its fair share of discussion is troubleshooting. It’s not easy to teach, possibly because how well you can troubleshoot a given system depends largely on your experience with that system and on the quality of its feedback loops (the accuracy and timeliness of monitoring data).

But even though troubleshooting is often more art than science, it has a set of general rules and guidelines, without which it is nothing more than guessing. These are all common-sense rules that formally come from Boolean algebra and first-order logic. They apply universally to the first half of troubleshooting - finding what’s wrong.

It’s important to emphasize that troubleshooting is always measured against two independent goals - finding and fixing the issue, and doing it as fast as possible. It’s the second goal that makes the use of logic mandatory - you usually can’t afford to mentally build a list of everything that could have gone wrong and then cross items off this list one by one. To speed things up, you analyze the symptoms and check only those hypotheses that plausibly match them. The ability to prioritize hypotheses properly comes purely from experience, but not wasting your time on things that can’t explain what you are observing has a lot to do with logic.

A key aspect of troubleshooting is causality: event A leads to event B, or A causes B, or A implies B (A -> B). A is sufficient for B here, and B is necessary for A.

A -> B is the same as NOT B -> NOT A. Imagine, for example, that A = "filesystem is full" and B = "writes to filesystem are failing." In this case A -> B. Therefore, if writes are working (NOT B), the filesystem is not full (NOT A). But if writes are failing (B), it does not automatically mean that the filesystem is full (for example, it could be mounted read-only).
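The filesystem example can be checked mechanically by enumerating every combination of conditions. The sketch below assumes a deliberately simplified toy model in which writes fail only when the filesystem is full or mounted read-only; the `writes_failing` function and the `worlds` list are hypothetical names used for illustration.

```python
from itertools import product

# Hypothetical toy model: writes fail if the filesystem is full OR read-only.
def writes_failing(full: bool, read_only: bool) -> bool:
    return full or read_only

# Every (full, read_only) combination the toy model allows.
worlds = list(product([False, True], repeat=2))

# A -> B: whenever the filesystem is full, writes fail.
a_implies_b = all(writes_failing(f, ro) for f, ro in worlds if f)

# NOT B -> NOT A: whenever writes work, the filesystem is not full.
contrapositive = all(not f for f, ro in worlds if not writes_failing(f, ro))

# B -> A does NOT hold: writes can fail while the filesystem is not full.
counterexamples = [(f, ro) for f, ro in worlds if writes_failing(f, ro) and not f]

print(a_implies_b, contrapositive)  # True True
print(counterexamples)              # [(False, True)]
```

The single counterexample is the read-only mount: writes fail although the filesystem is not full, which is why B alone never proves A.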

Another way to look at A -> B is (NOT A) OR B. This form can be easier to work with when you are applying negation - see below.

When A is both sufficient and necessary for B, A and B are always true together or false together. Another way of saying it is “A is true if and only if B is true.” This statement formally consists of two implications: A -> B and B -> A.

Then there are important rules about negation that are called De Morgan's laws:

NOT (A OR B) = (NOT A) AND (NOT B)
NOT (A AND B) = (NOT A) OR (NOT B)
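Both laws can be verified by brute force over all truth assignments, and the same check shows what negating an implication looks like once it is written as (NOT A) OR B. This is a minimal sketch; the variable names are chosen only for illustration.

```python
from itertools import product

# All four (A, B) truth assignments.
rows = list(product([False, True], repeat=2))

# NOT (A OR B) = (NOT A) AND (NOT B)
law_or = all((not (a or b)) == ((not a) and (not b)) for a, b in rows)

# NOT (A AND B) = (NOT A) OR (NOT B)
law_and = all((not (a and b)) == ((not a) or (not b)) for a, b in rows)

# Applied to implication written as (NOT A) OR B:
# NOT (A -> B) = NOT ((NOT A) OR B) = A AND (NOT B)
negated_implication = all(
    (not ((not a) or b)) == (a and (not b)) for a, b in rows
)

print(law_or, law_and, negated_implication)  # True True True
```

In troubleshooting terms, the last line says that the only observation that refutes "a full filesystem causes write failures" is a full filesystem with working writes.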

So how could you apply these rules in practice? First and foremost, never waste your time on checking A if you are observing NOT B and you know that A -> B.

Secondly, never assume that NOT A causes NOT B if you only know that A -> B.

Finally, never assume causality from the mere correlation of two events. If A and B tend to occur together, in bigger systems it’s often hard to determine whether there is any causality and which way it goes - further analysis is required.
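The first two rules can be turned into a small filter over hypotheses. The sketch below is only an illustration under stated assumptions: the knowledge base, the "host is down" cause, the "ping is failing" symptom, and the `prune` helper are all hypothetical and not part of any real tool.

```python
# Hypothetical knowledge base: each cause implies every symptom listed for it.
IMPLIES = {
    "filesystem is full": ["writes are failing"],
    "filesystem is mounted read-only": ["writes are failing"],
    "host is down": ["writes are failing", "ping is failing"],
}

def prune(implies, observed):
    """Split causes into rejected and still-possible, using only A -> B rules.

    observed maps a symptom to True (seen) or False (confirmed absent);
    symptoms missing from the dict are unknown and rule nothing out.
    """
    rejected, candidates = [], []
    for cause, symptoms in implies.items():
        # NOT B -> NOT A: one confirmed-absent symptom rejects the cause.
        if any(observed.get(s) is False for s in symptoms):
            rejected.append(cause)
        else:
            # B being present does not prove A; the cause is merely not excluded.
            candidates.append(cause)
    return rejected, candidates

observed = {"writes are failing": True, "ping is failing": False}
rejected, candidates = prune(IMPLIES, observed)
print(rejected)    # ['host is down']
print(candidates)  # ['filesystem is full', 'filesystem is mounted read-only']
```

Note that the filter never confirms anything: the remaining candidates still have to be ranked and checked, and that ordering is where experience comes in.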

The simple rules I mentioned in this post are not a complete guide to troubleshooting, but they can still help you save time and resources. Remember that any time you spend investigating a hypothesis that you should have rejected on pure logic is time wasted.
