Designing a fully automated or nearly fully automated computer system with many moving parts and dependencies is tricky, whether the system is distributed, hyper distributed or otherwise. Failures happen and must be dealt with. After a while, most folks graduate from “failures are rare and can be ignored” to “failures are not that rare and cannot be ignored” to “failures are common and should be taken into consideration” to “failures are frequent and must be planned for.” The last of these seems to represent the current prevailing point of view.
But here is the kicker: that’s not the end of it. I saw this tweet, read this post and checked out a book by Charles Perrow titled “Normal Accidents” from the library. Published in 1984, the book is not about IT, but its material fits our field nicely. And boy, was I enlightened!
The book’s main point: no matter how much thought is put into the system design, or how many safeguards are implemented, a sufficiently complex system will sooner or later experience a significant breakdown that was impossible to foresee, principally due to unexpected interaction between components, tight coupling or bizarre coincidence. For us in IT, it translates to "no matter how much planning you do or how many safeguards you implement, failures will still happen."
At least 3 common themes run through the book’s many case studies:
- A big failure was usually the result of multiple smaller failures, and these smaller failures were often not even related to one another
- Operators (people or systems) were frequently misled by inaccurate monitoring data
- In many cases, human operators were used to a given set of circumstances, and their thinking and analysis were misled by their habits and expectations ("when X happens, we always do Y and it comes back" - except for this one time, when it didn't)
I have had my share of outages and downtimes, and I can attest that I have seen these 3 factors play a big role in tech ops. Some incidents were bugs in management and monitoring code, some were human error, some were bizarre sets of dependencies, but all were a combination of multiple factors. For example, who would have thought that when the primary DNS resolution server failed, the VIP would not fail over to the secondary? And even though hosts had more than one “nameserver” line in /etc/resolv.conf, the application timed out waiting for DNS to respond before it ever got to ask the second nameserver. Without name resolution, multiple load balancers independently concluded that there was no capacity behind them (because management code calculated capacity in near real-time relying on worker hosts’ names) and disabled themselves, taking down the entire farm. Now I know, of course…
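That resolver behavior is tunable, by the way. The glibc resolver waits up to 5 seconds per nameserver by default (see resolv.conf(5)), which is plenty of time for an application-level timeout to fire first. A resolv.conf along these lines (addresses and values are illustrative, not a recommendation) shortens the per-server timeout so the second nameserver actually gets asked:

```
# /etc/resolv.conf - illustrative values only
nameserver 10.0.0.2
nameserver 10.0.0.3
# Without these options, the glibc resolver waits up to 5 seconds
# per query before trying the next nameserver; an application-level
# timeout can easily fire before the second server is ever reached.
options timeout:1 attempts:2
```

Of course, knowing which knob to turn is exactly the kind of thing you only learn after the outage.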
It turns out we can’t eliminate normal accidents altogether, but here are several techniques that I have been using to speed up detection and response in order to reduce downtime.
Complexity budget. Described by Benjamin Black, this is a technique for allocating complexity among components beforehand and strictly following the allocation during the implementation phase. It helps avoid unnecessary fanciness and leads to simpler code, which tends to be easier to troubleshoot and recover from a failure.
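I can’t speak for Benjamin Black’s exact formulation, but the core idea can be sketched in a few lines. All component names and point values below are made up for illustration; the point is that the allocation exists before implementation and designs are checked against it:

```python
# Hypothetical sketch of a complexity budget: each component gets an
# up-front allocation of "complexity points" (however your team counts
# them - external dependencies, config flags, states, etc.), and a
# proposed design is rejected if it exceeds its allocation.
BUDGET = {
    "load_balancer": 3,
    "worker": 2,
    "monitoring": 5,   # monitoring is deliberately allowed to be richer
}

def within_budget(component: str, proposed_points: int) -> bool:
    """Return True if a proposed design fits the component's allocation."""
    return proposed_points <= BUDGET.get(component, 0)

print(within_budget("worker", 2))  # a design that fits -> True
print(within_budget("worker", 4))  # too fancy -> False
```

The enforcement is social (design review), not technical; the table just makes the agreement explicit.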
Control knobs/switches for individual components. As John Allspaw shows on this slide, you need to be able to turn off any component in an emergency, or throttle it up or down. Planning for this and building it in from the very beginning is very important.
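A minimal sketch of what such switches can look like, assuming a shared flag table consulted before each unit of work (the component names and fields are my own illustration, not from Allspaw’s slides):

```python
# Per-component kill switches and throttles: code checks the switch
# before doing work, and an operator can flip one entry during an
# incident without touching the rest of the system.
import random

switches = {
    "recommendations": {"enabled": True, "throttle_pct": 100},
    "image_resizer":   {"enabled": True, "throttle_pct": 100},
}

def should_run(component: str) -> bool:
    """Consult the component's switch before doing any work."""
    s = switches.get(component)
    if s is None or not s["enabled"]:
        return False  # unknown or disabled components stay off
    if s["throttle_pct"] >= 100:
        return True
    # Throttling: admit only the configured percentage of requests.
    return random.uniform(0, 100) < s["throttle_pct"]

# During an emergency, turn one component off:
switches["image_resizer"]["enabled"] = False
```

In a real system the table would live in shared storage (a config service, a database) so all hosts see the flip, but the shape of the check is the same.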
Accuracy of monitoring data. Ensure your alarms are as accurate as possible. No matter how much chaos is going on inside the system during a severe failure, the last thing you can afford is to mislead the operators with wrong information. If you tried to ping host A and didn't get a response, your alarm should not say "host A is down", because that is not knowledge you obtained - it's an assumption you made. It should say "failed to ping host A from host B" - maybe the network on host B was the problem at the moment the ping was attempted; how do you know?
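The distinction fits in one function. A sketch, with hypothetical host and function names, of an alarm that reports the observation rather than a conclusion drawn from it:

```python
# Alarm text should describe exactly what was observed ("failed to
# ping X from Y"), never an inference ("X is down").
from typing import Optional

def ping_alarm(target: str, observer: str, got_reply: bool) -> Optional[str]:
    """Build an alarm message describing exactly what was observed."""
    if got_reply:
        return None
    # Not "host A is down": all we know is that this one ping attempt
    # failed, and the fault may just as well be on the observer's side.
    return f"failed to ping {target} from {observer}"

print(ping_alarm("hostA", "hostB", got_reply=False))
# -> failed to ping hostA from hostB
```

It looks pedantic until an outage, when the operator reading "failed to ping hostA from hostB" correctly keeps hostB on the suspect list.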
Availability of monitoring data. There is a reason the first thing a military force tries to do when attacking is disrupt the enemy's means of communication - it's that important, and the same applies to our case. Either design your systems so that you can get monitoring data even during the worst outage imaginable (ideally from more than one source), or at the very least get an alarm about the lack of such monitoring data (though that is a much weaker substitute).
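The weaker substitute - alarming on the absence of data - can be sketched as a dead man's switch. The threshold and timestamps here are illustrative assumptions:

```python
# Dead man's switch for monitoring itself: alarm when metrics stop
# arriving, not only when a metric crosses a threshold. A silent
# monitoring pipeline otherwise looks exactly like a healthy system.
import time

MAX_SILENCE_SECONDS = 120  # assumed acceptable gap between data points

def monitoring_stale(last_seen_ts: float, now: float = None) -> bool:
    """True if we have not heard from the monitoring source recently."""
    if now is None:
        now = time.time()
    return (now - last_seen_ts) > MAX_SILENCE_SECONDS

# Fresh data one minute ago: fine. Silence for ten minutes: alarm.
print(monitoring_stale(last_seen_ts=1000.0, now=1060.0))  # False
print(monitoring_stale(last_seen_ts=1000.0, now=1600.0))  # True
```

Critically, whatever fires this alarm should live outside the system being monitored; a check that dies with the monitoring pipeline defeats the purpose.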