Applying 5 Whys to Amazon EC2 Outage

Earlier this week AWS published a post-mortem report on last week’s outage: http://aws.amazon.com/message/67457/.

Of the several impairments and service disruptions caused by the outage, the hour-long unavailability of the us-east-1 control plane is, in my opinion, the most important. Let’s apply a 5 whys analysis to this impact. All answers below are direct quotes from the report, with my occasional notes where needed.

What happened?

There was a “service disruption which occurred last Friday night, June 29th.”

Why?

“From 8:04pm PDT to 9:10pm PDT, customers were not able to launch new EC2 instances, create EBS volumes, or attach volumes in any Availability Zone in the US-East-1 Region.”

Why?

“The control planes for EC2 and EBS were significantly impacted by the power failure” in a single AZ.

Why?

AWS were unable “to rapidly fail over to a new primary datastore” that internally serves their control plane.

Why?

“The EC2 and EBS APIs are implemented on multi-Availability Zone replicated datastores. These datastores are used to store metadata for resources such as instances, volumes, and snapshots. To protect against datastore corruption, currently when the primary copy loses power, the system automatically flips to a read-only mode in the other Availability Zones until power is restored to the affected Availability Zone or until we determine it is safe to promote another copy to primary.”

There was “blockage which forced manual assessment and required hand-managed failover for the control plane”.
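The behavior quoted above can be sketched as a small state machine. This is only a toy model built from the report's wording, not AWS's implementation: the class, the method names and the AZ names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicatedDatastore:
    """Toy model of a multi-AZ replicated metadata store (hypothetical names)."""
    primary_az: str
    replica_azs: list
    read_only: bool = False
    powered: dict = field(default_factory=dict)

    def __post_init__(self):
        self.replica_azs = list(self.replica_azs)  # do not alias the caller's list
        for az in [self.primary_az, *self.replica_azs]:
            self.powered.setdefault(az, True)

    def can_write(self):
        # Launching instances or creating volumes needs a writable datastore.
        return not self.read_only

    def power_lost(self, az):
        self.powered[az] = False
        if az == self.primary_az:
            # Per the report: flip to read-only rather than risk corruption.
            self.read_only = True

    def power_restored(self, az):
        self.powered[az] = True
        if az == self.primary_az:
            self.read_only = False

    def promote(self, az, confirmed_safe):
        """Hand-managed failover: only proceeds once someone has judged it safe."""
        if az not in self.replica_azs or not self.powered[az]:
            raise ValueError(f"{az} is not a powered replica")
        if not confirmed_safe:
            raise RuntimeError("promotion blocked pending manual assessment")
        self.replica_azs.remove(az)
        self.replica_azs.append(self.primary_az)
        self.primary_az = az
        self.read_only = False
```

In this model, losing power in a replica AZ changes nothing for writes, while losing power in the primary AZ makes the whole region's datastore read-only until either power returns or `promote` is called with `confirmed_safe=True`. The time spent waiting for that confirmation is the window in which the control plane is unavailable.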

Why?

The answer to this why was withheld from the public post-mortem report.


To me, this outage is the most worrisome of all AWS service disruptions that I know about. In a nutshell:

AWS effectively lost its control plane for an entire region as a result of a failure within a single AZ.
This was not supposed to be possible.

In hindsight, and knowing what we now know from the outage report (which is not necessarily what was known to the AWS folks working the outage at the time), one course of action could have been as follows.

Certain language in the report (“To protect against datastore corruption, currently when the primary copy loses power, the system automatically flips to a read-only mode in the other Availability Zones”) leads me to believe that during the hours preceding the events described in the report, the us-east-1 control plane primary lived in the AZ that was about to be affected by the generator failure.

Between 7:24pm PT and when utility power was restored some time before 7:57pm, AWS crews should have discovered that something was not right with the generators in this AZ (this is not a fact, this is an assumption - it’s possible this information was not available at the time). If they did, they could have immediately started moving the control plane primary out of this AZ just in case, because this AZ’s generators could not be trusted. This might have prevented the control plane outage. (Again - a lot of assumptions on my part here).
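To make that idea concrete, here is a minimal sketch of such a "move the primary just in case" rule. It rests entirely on my assumptions: the `AZStatus` fields, the `pick_primary` function and the AZ names are invented for illustration, and nothing in the report says AWS had these signals available in this form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AZStatus:
    """Hypothetical health summary for one Availability Zone."""
    name: str
    on_utility_power: bool
    generators_trusted: bool


def pick_primary(current, statuses):
    """Return the AZ that should host the control plane primary.

    Assumed rule: stay put while the current AZ has utility power and
    trustworthy generators; otherwise move to the first AZ that has both.
    If no AZ qualifies, keep the current one.
    """
    by_name = {s.name: s for s in statuses}
    cur = by_name[current]
    if cur.on_utility_power and cur.generators_trusted:
        return current
    for s in statuses:
        if s.name != current and s.on_utility_power and s.generators_trusted:
            return s.name
    return current
```

With utility power back but the generators in doubt, the rule already says to move, for example `pick_primary("az-a", [AZStatus("az-a", True, False), AZStatus("az-b", True, True)])` returns `"az-b"`. The point is that the move happens while the primary is still powered, so it does not have to go through the read-only fallback at all.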

And finally, I can’t leave without pointing out a surprise about this outage report, which I hope AWS will never repeat in the future. They say:

“While control planes aren’t required for the ongoing use of resources, they are particularly useful in outages where customers are trying to react to the loss of resources in one Availability Zone by moving to another.”

This puzzles me. The control plane, and the API endpoint that serves as the interface between it and AWS customers, are not merely useful in outages - THEY ARE CORE AND ESSENTIAL components, ESPECIALLY during an outage. If you could call in and dictate to an operator which AMI to launch with which security groups, elastic IPs and keypairs, I might have bought a “nice to have” argument. But there is no way to react to an outage other than by interacting with the API endpoint - hence the control plane is a “must have.”
