Are You a Responsible Owner of Your Availability?

Last month AWS released the Reduced Redundancy Storage (RRS) feature of S3. Several aspects of this announcement appeal to different people, but I especially appreciated one part - S3 now offers a choice of less availability for a lower price.

Availability of your system, just like any other part of your service, is a feature. As with anything else, you need to invest time, effort and resources in building it out. And whatever you dedicate to availability (such as development time) can’t be used for other features - this is what’s known as opportunity cost. If you could put the same resources to better use somewhere else, investing them in availability may not be the optimal decision. Additionally, availability draws from your complexity budget, which is going to impact other areas - HA systems tend to be more complex and hence require more effort to develop, maintain and improve over time. In short, availability has a price tag that you will have to pay to get it. Because you own your site’s availability, it’s up to you to decide how much availability you want AND can afford to build.

The last point is very important. Our daily lives are filled with points of failure - home appliances (they can break), the usual route you take to work (it could be affected by road construction), your regular coffee place (your favorite barista could transfer to a different location). Do you maintain two different non-overlapping routes to work? Or do you frequent two coffee shops in order to have an alternative if one shop drops off your list? In other words, in our lives we regularly forgo availability when it doesn’t make sense - why shouldn’t we follow the same rule in our professional lives?

Availability is not a binary option. You could have an all-active N-tuple, an active-active pair, an active-passive pair with automatic failover, or the same active-passive pair with manual failover. And finally, in today’s cloudy world, you could also have just a single resource with the ability to replace it quickly if it goes down. Further options include geographic redundancy, vendor/provider diversity, and so on. Availability could be as simple as hosting your systems at a very reliable provider. Or, at the very least, being able to detect when there is a problem and to restore the system within a preset amount of time. Different levels of availability obviously don’t cost the same - pick one that you want and can afford.
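
To put rough numbers on these options, here is a minimal Python sketch. It assumes that replicas fail independently and that one surviving replica is enough to serve traffic, which real systems only approximate. The 99% figure for a single resource is a made-up input, not a measurement of any provider.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def redundant_availability(single: float, copies: int) -> float:
    """Availability of `copies` replicas when any one survivor is enough.

    Assumes failures are independent, which is optimistic: shared power,
    network or software bugs tend to take replicas down together.
    """
    return 1.0 - (1.0 - single) ** copies


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    single = 0.99  # hypothetical availability of one resource
    for copies in (1, 2, 3):
        a = redundant_availability(single, copies)
        hours = downtime_hours_per_year(a)
        print(f"{copies} copies: {a:.6f} available, about {hours:.2f} h/year down")
```

Each extra copy buys less additional uptime than the one before it, while the cost of running it stays roughly the same.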

Secondly, if your overall service consists of multiple smaller parts, you are free to choose different levels of availability for individual parts. Anything that responds to synchronous calls (calls that expect an immediate reply) - like the web front door - may need a higher level of availability, while background jobs may get by with a lower one. Designing each subsystem with the appropriate level of availability will reduce your costs and will most likely let you save some of your complexity budget for other things.
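
One way to see why the synchronous path deserves the higher level: if a request needs every component on its path to be up, the availabilities multiply. The sketch below assumes independent failures, and the component names and figures are hypothetical.

```python
from math import prod


def serial_availability(parts):
    """Availability of a chain in which every part must be up.

    Assumes the parts fail independently of each other.
    """
    return prod(parts)


if __name__ == "__main__":
    # Hypothetical components that every synchronous request touches.
    sync_path = {
        "load_balancer": 0.9999,
        "web_front_door": 0.9995,
        "database": 0.999,
    }
    # Hypothetical background job: it is not on the request path, so its
    # lower availability does not reduce what the user sees.
    background = {"report_job": 0.99}

    user_facing = serial_availability(sync_path.values())
    print(f"user-facing availability: {user_facing:.4%}")
    print(f"background job availability: {background['report_job']:.2%}")
```

The chain can never be more available than its weakest link, which is a reason to spend on the request path first and to keep background work off it.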

Thirdly, while availability is a single metric, the problems that impact it are not. Some problems are frequent and easy to deal with; others are rare and catastrophic. Do you want to build your service to withstand the failure of a host, of all hosts, of your entire ISP, of the entire Internet? It’s all about the tradeoffs between costs, the severity of each type of problem and the probability that it occurs.
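
This tradeoff can be sketched as a back-of-the-envelope comparison of expected loss against the cost of a defense. Everything here is an illustrative assumption: the threat names, frequencies, outage lengths and dollar amounts are invented, and the model pretends that a defense removes the loss completely.

```python
from dataclasses import dataclass


@dataclass
class Threat:
    name: str
    incidents_per_year: float       # assumed frequency
    hours_down_per_incident: float  # assumed severity


def expected_annual_loss(threat: Threat, revenue_per_hour: float) -> float:
    """Frequency times severity times what an hour of downtime costs."""
    return (threat.incidents_per_year
            * threat.hours_down_per_incident
            * revenue_per_hour)


def worth_defending(threat: Threat, revenue_per_hour: float,
                    defense_cost_per_year: float) -> bool:
    """True if the defense costs less than the loss it is assumed to prevent."""
    return expected_annual_loss(threat, revenue_per_hour) > defense_cost_per_year


if __name__ == "__main__":
    revenue_per_hour = 1_000.0  # hypothetical
    # (threat, hypothetical yearly cost of defending against it)
    cases = [
        (Threat("single host failure", 4, 2), 5_000.0),
        (Threat("whole datacenter outage", 0.05, 48), 200_000.0),
    ]
    for threat, cost in cases:
        loss = expected_annual_loss(threat, revenue_per_hour)
        verdict = "defend" if worth_defending(threat, revenue_per_hour, cost) else "accept the risk"
        print(f"{threat.name}: expected loss ${loss:,.0f}/yr vs defense ${cost:,.0f}/yr -> {verdict}")
```

A plain expected-value comparison like this understates rare catastrophic events that could end the business, so treat it as a starting point for the conversation rather than the answer.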

Fourthly, remember that the availability measures you build are your defenses against problems. Each type of problem that you want to protect against requires an availability measure targeted at that very problem - matching it in functionality, size and cost (though a single defense measure may work against multiple threats). An imbalance in any of these three categories between your defenses and the problems they are meant to prevent will lead to suboptimal results. After all, you don’t use a shield to defend against a cannon and you don’t duplicate your entire operation into a second datacenter just to protect against a router failure.

And finally, beware of peer pressure. If your web front door’s availability costs $1m per month and it’s bringing in $10m per month in revenue, it can be a no-brainer. But if you are investing 50% of your complexity budget in availability just because everybody else is doing it, I think that could be a problem.

Going back to AWS and putting my amateur behavioral economist’s hat on, I am curious how many people decided to take advantage of the lower price for lower availability of RRS. And even more interestingly, if S3 had initially been at RRS availability and AWS had announced better availability for a higher price, would we end up with the same distribution of people using higher and lower availability?

