On Importance of Planning for Failure

As you have probably heard, on April 21 of this year the US East region of the Amazon EC2 cloud experienced a severe outage. The event received considerable coverage around the blogosphere - you can find the most comprehensive collection of links on the topic at highscalability.com. The guidelines of designing for failure have now spread far and wide. But where do you start?

One good place to start is a thought exercise: try to come up with as many failure scenarios as you can imagine, and walk through the steps of how you would deal with each one of them. It should not take a lot of time or resources. All you need is to start a document where you record your initial findings - you will update this document throughout your system's lifecycle as more information becomes available.

Again - this is a pure thought exercise. You are not writing any code here or redesigning any of your systems.

Create a table with the following columns.

1. Something bad happens. Examples: You lost your database master. Half of your frontend capacity is not responding. Your traffic increased sharply and unexpectedly. Some of your servers are unable to read from or write to their disks. Loss of cooling - servers overheating. Fire alarm in the datacenter.

The more bad things you can think of here, the better. Remember also that not all problems occur in an instant - some issues can build up over time, and others may affect only a fraction of your systems.

Something bad could go on for hours or even days, or it could be a sub-second event with a huge impact on your service. Will your systems be impacted the same way regardless of how long an event lasts?

2. You or your monitoring systems find out that something bad happened. Don't skip this step - it's more important than you think!

Can your monitoring detect the error condition reliably (with a sufficient confidence level but without an unmanageable number of false positives) and within the desired timeframe? (Hint: if you answer “yes” to this question for all your events during the initial assessment, either you didn’t think of enough bad things that could happen, or your monitoring system is better than 99.98% of all monitoring systems deployed out there.)

3. You need to know in advance the approximate impact such an event will have on your systems. Once you find out about an actual event, you will need to confirm that the anticipated impact is indeed occurring.

Example: Did you think that the loss of this router would only cost you half of your backend capacity, while in reality (due to a recent change) you lost 50% of both backend and frontend?

4. You react. You should have a rough idea of what you would do in response to a given event. As they say, "The best improvisation is one that was prepared in advance."

Note that “put up a ‘We are sorry, the site is down’ static page and wait for the provider to fix the problem upstream” is a valid reaction.

5. Assign a likelihood to each event. The key here is that absolute values are not very important - you only need to offer a hypothesis about which events are more likely to occur than others.

6. Assess expected impact relative to expected impacts of other events.

A combination of columns (5) and (6) should help you prioritize your ops work. I can’t think of universal guidance on whether to tackle “low likelihood, big impact” or “high likelihood, low impact” first - it probably makes the most sense to start somewhere in the middle. When prioritizing work, also pay attention to column (4) - especially if a current reaction is deemed insufficient.
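The six columns above can live in a machine-readable register just as easily as in a spreadsheet. Here is a minimal sketch in Python - the field names, example entries, and the likelihood × impact scoring are all my own illustrative choices, not prescribed by the table:

```python
from dataclasses import dataclass

@dataclass
class FailureScenario:
    event: str            # column 1: something bad happens
    detection: str        # column 2: how you or monitoring find out
    expected_impact: str  # column 3: anticipated impact to confirm during an event
    reaction: str         # column 4: prepared response
    likelihood: int       # column 5: relative likelihood, e.g. 1 (rare) .. 5 (frequent)
    impact: int           # column 6: relative impact, e.g. 1 (minor) .. 5 (severe)

    def priority(self) -> int:
        # One possible way to "start somewhere in the middle":
        # rank scenarios by likelihood times impact.
        return self.likelihood * self.impact

scenarios = [
    FailureScenario(
        event="Database master lost",
        detection="Replication health check fails",
        expected_impact="Writes fail; reads served from replicas",
        reaction="Promote a replica; repoint application config",
        likelihood=2, impact=5),
    FailureScenario(
        event="Traffic spike (2x normal)",
        detection="Request-rate and latency dashboards",
        expected_impact="Elevated latency on the frontend tier",
        reaction="Add frontend capacity; enable static fallback page",
        likelihood=4, impact=2),
]

# Review the register in descending priority order.
for s in sorted(scenarios, key=FailureScenario.priority, reverse=True):
    print(f"{s.priority():>2}  {s.event}")
```

The point of keeping it structured is that re-scoring after each revisit of the document (new components, new capacity, new features) is a one-line sort rather than a manual re-shuffle.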

For certain failures, columns (2) and (4) will be automated - for example, leading NoSQL solutions can often tolerate the loss of one node in a cluster. I still recommend listing such events in this table, even though you might not need any additional manual reaction.

You also need to remember to revisit this document as you work on your system: adding components, growing capacity in response to demand, developing new features, fixing bugs, moving between platforms, and so on.

And finally, you can grow this document ad infinitum. Even when you think you have covered every single event imaginable, either wait a month or two, or start planning for combinations of events (for example, loss of a router AND an independent traffic spike).

If you do this right, this exercise will help you weather future operations storms with more confidence. But remember that if you have a sufficiently complex system, no matter how much planning you do, sooner or later it will experience a "normal accident" caused by one or more unknown unknowns. Your goal is to postpone a normal accident for as long as you can.

Categories: cloud-computing | devops | infrastructure-development |

Coffee and Design for Failure

In the wake of the Judgment Day Outage, I would like to offer you a story.

One morning John Doe woke up and decided he wanted a grande mocha. The nearest Starbucks was 3 minutes away by car. John went to his garage, but it looked like his garage door opener wouldn’t work - power was out at his house. John pulled the door up manually and started the engine.

As he approached the exit from his subdivision, he noticed that the road was closed and road crews were working on an emergency power issue (the one that had caused the outage at his house). The ETA was 30 minutes. The other exit from his subdivision was closed for 60 days due to road resurfacing.

John returned home, left his car in the garage and decided to walk to the nearby Starbucks. He noticed the cloudy sky and took an umbrella. But it started to rain with strong wind gusts, and his umbrella was not helping much. He was getting soaked.

At that point John gave up on the idea of a grande mocha from Starbucks and settled for his home brew. He returned home and didn’t have a Starbucks coffee that day.

Now, let me ask you - did John do a good job of designing for failure? After all, he was responsible for designing his process of obtaining coffee. He owned that process, so he was in charge.

Or is John a rational individual who acted rationally under a given set of circumstances?

Two other blog posts of mine that you may like in this context are Normal Accidents in Complex IT Systems and Are You a Responsible Owner of Your Availability?.

Categories: infrastructure-development |

Corporate Open Sourcing

The idea of open source should by now be familiar to most folks in the industry. When a new technology is open sourced by an individual, the situation seems to be well understood. But when a corporation open sources one of its products, it looks to me like not everyone is fully aware of all the peculiarities.

I can think of 3 common situations.

1. Something is developed for internal needs and then open sourced after some internal use, once it becomes obvious that others may benefit. I perceive Google and Facebook to be favoring this approach. This approach is the closest a corporation can get to the motives that drive individuals to open source their work - sharing knowledge, earning street cred, recruiting, and so on. The key differentiator here is that the owner pretty much doesn't care what others do with the technology - its internal plans more or less do not depend on whether the open sourced project becomes popular or not.

2. Something is open sourced while being developed or at an early stage (say, right after the first shippable milestone), without the owner's aspiration to build an ecosystem or get it adopted as a standard, at least initially. This is a business move - it's conceivable that a company's product needs to have its source code publicly available in order to help sales. How permissive the license is may or may not matter here, and neither does whether contributions are accepted or even encouraged.

3. Something is open sourced while being developed, and the goal right from the start is to build an ecosystem or for the technology to become a standard. In this case, the main driver for open sourcing is the ecosystem, or the aspiration to become a standard or some sort of fundamental building block (such as a "Linux kernel of the cloud"). Talk of licensing, code ownership, foundations, governance, commoditization, contributions, etc. is all found in this approach.

In all three scenarios, open sourcing is great, and I applaud every company that does it, no matter what its motives are. But #3 is extremely difficult. In fact, of all the open source tools and technologies that we commonly use today (Linux, Python, Ruby, any project in the Apache Incubator and many thousands of others), I can’t think of any that followed scenario #3.

For those who have not been following the industry lately, the Open Compute project initiated by Facebook seems to follow approach #1 - they have a finished datacenter based on this spec (in Prineville, OR), and they shared their findings after it was completed. They seem to like their design, and it looks like they will continue building to this spec regardless of whether it is adopted by others or not.

OpenStack and Cloud Foundry (led by Rackspace and VMware, respectively) seem to follow approach #3. It’s a great idea - all I am saying is that their job is more difficult, and they are truly blazing a trail that hasn’t been attempted thus far. Or if it has been attempted, I can’t think of it, or it didn’t work out. If they make it in their respective areas, they will be the first.

This post on GigaOm could be an interesting read in this context.

Can OpenStack and Cloud Foundry both succeed at the same time at something that has not worked out until now?

Categories: internet |

Reselling Your Cloud Instances

This post is a hoax, it was published on April Fool's Day.

One beautiful afternoon in March, I was aboard a Metra train on my way home from CohesiveFT’s offices in downtown Chicago. I was reading a book on my Kindle, and little did I know that I was about to find out about something huge that may soon change the IT landscape forever.

Not too far from my seat was a gentleman playing Solitaire on his laptop (I know, I know, no one is perfect). It looked like he had a conference call to attend, but since he thought his Solitaire was more important, he put his phone on speakerphone, turned up the volume, and placed the phone on the seat next to him. I tried to ignore the call, but it was so loud that one couldn’t do much about it - the volume was set that high, I am telling you.

The call turned out to be between all 7 major US cloud computing providers and top Wall Street firms. Apparently, a similar call had happened earlier that day between European cloud providers and London-based representatives of pretty much the same banks.

As you might have already guessed from the title of this post, they were discussing the ability to resell cloud instances (in an infrastructure-as-a-service context). While the providers will focus on implementing the actual mechanism by which an instance can be “reassigned” to another account, the role of the banks will be to turn this into a new class of commodity that can be traded. Essentially, the cloud providers as a group were pitching a new asset class to the investment bankers and wanted them to make a market for this commodity.

Here is how they said it’s going to work from a technical perspective.

Say you have an instance in the cloud. There will be an API call that constitutes an “offer” - you are saying you want to sell this instance at such-and-such price. Someone else may have an outstanding “bid” for a similar instance. When there is a match, a trade is executed: your old instance is terminated, the bidder’s image is placed exactly in the slot your instance used to occupy, and then the bidder’s instance is started.

The key point they emphasized was that this is not about customers buying capacity from various providers at dynamic prices. It is about customers’ ability to resell an instance to another customer - at any price a buyer is willing to pay! The providers also said that if they ever get a request for a lot of capacity that they can’t satisfy quickly enough, they would be willing to buy back capacity from their own customers on the open market.
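In the spirit of the joke, the offer/bid matching described above can be sketched as a toy order book. Everything here - class names, the `place_offer`/`place_bid` calls, the data shapes - is invented for illustration; no such API exists:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Order:
    account: str
    instance_type: str
    price: float  # offer: minimum asking price; bid: maximum a buyer will pay

class InstanceResaleBook:
    """Toy order book: matches the lowest offer with the highest bid per instance type."""

    def __init__(self):
        self.offers: list = []
        self.bids: list = []

    def place_offer(self, order: Order) -> Optional[Tuple[Order, Order]]:
        self.offers.append(order)
        return self._match(order.instance_type)

    def place_bid(self, order: Order) -> Optional[Tuple[Order, Order]]:
        self.bids.append(order)
        return self._match(order.instance_type)

    def _match(self, instance_type: str) -> Optional[Tuple[Order, Order]]:
        offers = [o for o in self.offers if o.instance_type == instance_type]
        bids = [b for b in self.bids if b.instance_type == instance_type]
        if not offers or not bids:
            return None
        best_offer = min(offers, key=lambda o: o.price)
        best_bid = max(bids, key=lambda b: b.price)
        if best_bid.price >= best_offer.price:
            # Trade executed: the seller's instance would be terminated and the
            # bidder's image started in the slot it used to occupy.
            self.offers.remove(best_offer)
            self.bids.remove(best_bid)
            return (best_offer, best_bid)
        return None
```

A real exchange would of course also settle on the clearing price, handle partial fills and cancellations, and so on - which is exactly why the providers wanted the banks to make the market rather than build it themselves.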

This is obviously huge! To read more, please see here.

Categories: fun |

Netflix: The Rising Star of the Cloud

Every disruptive change has its pioneers. Someone must be the first to think of it, someone must be the first to launch it, someone else could be the first to use it at scale. While most observers agree that Amazon Web Services is the undisputed pioneer of cloud computing on the supply side (i.e., among vendors and service providers), the year 2010 saw the emergence of Netflix as one of the pioneers in the use of cloud computing by enterprises.

Netflix, many say, is not a typical enterprise. Even though it’s a big company (it employs more than 2,000 people according to its press kit) that is publicly traded on NASDAQ with a current market capitalization of slightly over $11B, it’s relatively young (founded in 1997, according to Wikipedia) and caters primarily to consumers. And its IT is its business, as opposed to a typical enterprise, where IT supports the business: when Netflix was a DVD mailing company, its IT supported its business; as Netflix transforms itself into a content streaming company, its IT is becoming its business.

Netflix has always been a bit geeky. Their recommendation algorithm has been their prized asset for some time. Recall, for example, the competition they ran starting in 2006, offering a big reward to anybody who could develop a better one.

I started following Netflix’s cloud use in 2010. Netflix is a big operation, possibly even regarded by AWS as their “reference” customer. If you follow cloud computing, you couldn’t have missed it.

I watched several presentations given by Adrian Cockcroft (see his interview with Randy Bias) and subscribed to techblog.netflix.com. The latest post there is full of the wisdom of a practicing cloud operator.

For example, we learn that Netflix went to the cloud in search of high availability, and agility was almost a side effect. We learn that Netflix does not want to be a datacenter expert, because they regard that as accidental complexity. Being a consumer brand in a relatively new market segment, they did not want to worry about getting their capacity forecasts right.

If you look around the web, you will find that Netflix runs a Java application stack, which makes it similar to many other enterprises out there. But the key component that sets them apart is their internal operations platform tailored for Amazon EC2 - spanning development, QA, deployment, monitoring, trend analysis, and troubleshooting. (It’s important to realize that an “internal operations platform” is not only software - it’s also a set of processes, standard operating procedures, a mindset and an operations philosophy.)

And here is the point of this post (finally!). Netflix has essentially built an enterprise-friendly, world-class PaaS for Java. They probably built it without thinking of one day selling it as a standalone product to other enterprises, but I would like to ask - why not? If the world's biggest web retail operation managed to build a hosting business, why can't one of the world's geekiest enterprises build an IT ops platform business as well?

Think about it - Netflix Web Services, Java application hosting, for enterprise by enterprise…

I can’t say for sure if or when it will happen, but my gut feeling is that it will happen eventually - and if not Netflix, then some other cloud customer will enter the application hosting (PaaS) business with a platform originally developed for internal needs.

Netflix folks, are you reading this?

Categories: cloud-computing |
