As you have probably heard, on May 21 of this year the US East region of Amazon's EC2 cloud experienced a severe outage. The event received considerable coverage around the blogosphere - you can find the most comprehensive collection of links on the topic at highscalability.com. The guidelines of "design for failure" have now spread far and wide. But where do you start?
One good place to start is a thought exercise: try to come up with as many failure scenarios as you can imagine, and walk through how you would deal with each one of them. It should not take a lot of time or resources. All you need is to start a document recording your initial findings - you will update this document throughout your system's lifecycle as more information becomes available.
Again - this is a pure thought exercise. You are not writing any code here or redesigning any of your systems.
Create a table with the following columns.
1. Something bad happens. Examples: You lost your database master. Half of your frontend capacity is not responding. Your traffic increased sharply and unexpectedly. Some of your servers are unable to read from or write to their disks. Loss of cooling - servers overheating. Fire alarm in the datacenter.
The more bad things you can think of here, the better. Remember also that not all problems occur in an instant - some issues could be building up over time, others could be affecting only some fraction of your systems.
Something bad could be going on for hours or even days, or it could be a sub-second event with a huge impact on your service. Are your systems going to be impacted the same way regardless of how long an event is ongoing?
2. You or your monitoring systems find out that something bad happened. Don't skip this step - it's more important than you think!
Can your monitoring detect the error condition reliably (with a sufficient confidence level but without an unmanageable amount of false positives) and within the desired timeframe? (Hint: if you answer "yes" to this question for all your events during the initial assessment, either you didn't think of enough bad things that could happen, or your monitoring system is better than 99.98% of all monitoring systems deployed out there.)
3. You need to know in advance the approximate impact such an event will have on your systems. Once you find out about an actual event, you will need to confirm that your anticipated impact is indeed occurring.
Example: Did you think that a loss of this router would only cost you half of your backend capacity, while in reality (due to a recent change) you lost 50% of both backend and frontend?
4. You react. You should have a rough idea of what you would do in response to a given event. As they say, "The best improvisation is one that was prepared in advance."
Note that “put up a ‘We are sorry, site is down’ static page and wait for provider to fix a problem upstream” is a valid reaction.
5. Assign a likelihood to each event. The key here is that absolute values are not very important - you only need a hypothesis about which event is more likely to occur than another.
6. Assess expected impact relative to expected impacts of other events.
A combination of columns (5) and (6) should help you prioritize your ops work. I can't think of universal guidance on whether to go for "low likelihood, big impact" or "high likelihood, low impact" first - it probably makes the most sense to start somewhere in the middle. When prioritizing work, also pay attention to column (4) - especially if the current reaction is deemed insufficient.
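To make the exercise concrete, here is a minimal sketch of the table kept as structured data, with a rough prioritization. The field names, example scenarios, 1-5 scales, and the likelihood-times-impact score are all my own illustrative assumptions, not a prescribed format:

```python
# Hypothetical sketch of the failure-scenario table as structured data.
# Field names, scenarios, and the scoring scheme are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class FailureScenario:
    event: str            # (1) the bad thing that happens
    detection: str        # (2) how you or your monitoring find out
    expected_impact: str  # (3) anticipated impact on your systems
    reaction: str         # (4) planned response
    likelihood: int       # (5) relative likelihood, 1 (rare) to 5 (likely)
    impact: int           # (6) relative impact, 1 (minor) to 5 (severe)


scenarios = [
    FailureScenario("Database master lost", "replication alarm",
                    "writes fail", "promote a replica", 2, 5),
    FailureScenario("Sharp traffic spike", "request-rate alarm",
                    "elevated latency", "add frontend capacity", 4, 2),
    FailureScenario("Fire alarm in datacenter", "provider notification",
                    "total outage", "static 'we are sorry' page", 1, 5),
]

# One way to "start somewhere in the middle": rank by the product of
# relative likelihood and relative impact.
for s in sorted(scenarios, key=lambda s: s.likelihood * s.impact, reverse=True):
    print(f"{s.likelihood * s.impact:2d}  {s.event}")
```

A spreadsheet works just as well; the point is only that columns (5) and (6) combine into an ordering you can argue about and revise.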
For certain failures, you will have columns (2) and (4) automated - for example, leading NoSQL solutions can often tolerate the loss of one node in the cluster. I still recommend listing such events in this table, even though you might not need any additional manual reaction.
You also need to remember to revisit this document as you work on your system: adding components, growing capacity in response to demand, developing new features, fixing bugs, moving between platforms, and so on.
And finally, you can grow this document ad infinitum. Even when you think you have covered every single event imaginable, either wait a month or two or start planning for combinations of events (for example, loss of a router AND an independent traffic spike).
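Enumerating those combinations is mechanical; a sketch (the event names are placeholders, and in practice you would only keep the pairs that are plausibly independent):

```python
# Generate every unordered pair of single events as a candidate
# combined scenario. Event names here are placeholders.
from itertools import combinations

events = ["loss of router", "traffic spike", "database master lost"]

combined = [f"{a} AND {b}" for a, b in combinations(events, 2)]
for scenario in combined:
    print(scenario)
```

The list grows quadratically with the number of single events, which is another argument for pruning aggressively by likelihood.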
If you do this right, this exercise will help you weather your future operations storms with more confidence. But remember that if you have a sufficiently complex system, then no matter how much planning you do, sooner or later it will experience a "normal accident" caused by one or more unknown unknowns. Your goal is to postpone that normal accident for as long as you can.