Concise Introduction to Infrastructure as Code

After my last post, I received several questions about how one could get started with infrastructure as code. While I can’t provide a thorough step-by-step guide that will cover all possible situations and nuances, I thought I’d post a very brief generic outline.

The end goal of infrastructure as code is to perform as many infrastructure tasks as possible programmatically. The key word is "automation."

To accomplish this goal, you will most likely first focus on two major areas - monitoring and deployment (when in doubt whether to focus on monitoring or deployment first, I recommend monitoring).

All levels below indicate a given activity is performed programmatically. Essentially, you should read each level as “to complete this level, I must be able to do this programmatically.” Just as in a computer game, you will want to move up from level to level for each individual piece of infrastructure that you have.

A few words on terminology. Whenever I say “infrastructure” below, I mean your systems, applications, services and data. Whenever I say “store” below, I mean store the information and be able to query, search and filter it programmatically.

Monitoring

Level 1 - Obtain runtime information about your infrastructure

Level 2 - Store historical runtime information somewhere (when in doubt, I recommend Graphite)

Level 3 - Generate mashup metrics and store them (by mashup metric I mean a metric that is not directly observable but one that is generated from various direct observations or other mashup metrics, potentially over a period of time)

Level 4 - Detect, generate and store events (points in time when something significant happened)

Level 5 - Generate alarms (notifications about particularly important series of events, meant to be analyzed by humans, usually in near real-time)

Level 6 - Detect, generate and store complex events (events that comprise events from different parts of your infrastructure)

Level 7 - Predict events (this is the highest level in monitoring; the ability to predict the future requires level 6, plus the ability to establish causality in addition to correlation, plus potentially some amount of statistics)
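
To make the lower levels (1 through 5) more concrete, here is a minimal sketch, assuming a Graphite/Carbon plaintext listener at a hypothetical graphite.example.com:2003; the metric names, values and the 5% alarm threshold are purely illustrative.

```python
import socket
import time

# Assumption: a Graphite/Carbon plaintext listener at this host:port
# (the protocol is simply "metric.path value timestamp\n" over TCP).
GRAPHITE_HOST = "graphite.example.com"
GRAPHITE_PORT = 2003

def store_metric(path, value, timestamp=None):
    """Level 2: store a data point so it can be queried and graphed later."""
    timestamp = timestamp or int(time.time())
    line = f"{path} {value} {timestamp}\n"
    with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))

# Level 1: direct observations (in reality scraped from the systems themselves)
requests_served = 1520   # hypothetical counter from a web tier
errors_returned = 38     # hypothetical counter from the same tier
store_metric("web.frontend.requests", requests_served)
store_metric("web.frontend.errors", errors_returned)

# Level 3: a mashup metric - not directly observable, derived from
# direct observations (here, an error rate over the same interval)
error_rate = errors_returned / requests_served
store_metric("web.frontend.error_rate", round(error_rate, 4))

# Levels 4-5, very crudely: detect and store an event, then raise an alarm
if error_rate > 0.05:
    store_metric("web.frontend.events.high_error_rate", 1)
    print("ALARM: web.frontend error rate above 5%")  # stand-in for a real notifier
```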

Deployment

Level 1 - Deploy a machine (reminder - need to do it programmatically)

Level 2 - Install OS on a machine (in IaaS, levels 1 and 2 are combined)

Level 3 - Machine boots up with network access and naming services (DNS) configured, network security enabled, and user accounts can log in

Level 4 - Applications and data are automatically installed and configured (when in doubt, I recommend Chef or Puppet)

Level 5 - Applications are automatically added to correct resource pools and automatically start responding to real requests
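
As a sketch of what the lower deployment levels look like on an IaaS provider, here is a hedged example using AWS and today's boto3 SDK; the AMI id, key pair and security group are placeholders, and the user-data script is just one common way to hand a freshly booted machine over to Chef or Puppet for level 4.

```python
import boto3

# Levels 1-2 (combined in IaaS): deploy a machine with an OS programmatically.
ec2 = boto3.client("ec2", region_name="us-east-1")

# A hypothetical bootstrap script - one common way to reach level 4 is to
# have the machine register itself with configuration management on first boot.
bootstrap = """#!/bin/bash
curl -L https://omnitruck.chef.io/install.sh | bash
"""

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",             # placeholder AMI
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    KeyName="ops-key",                           # placeholder key pair
    SecurityGroupIds=["sg-0123456789abcdef0"],   # placeholder security group
    UserData=bootstrap,
)
print("Launched", response["Instances"][0]["InstanceId"])
```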

Once you achieve high levels in monitoring and deployment (not necessarily the highest), you can start doing things like self-healing, autoscaling, testing through fault injection and other cool things that are also part of infrastructure as code but go beyond the scope of this blog post.

Categories: devops |

Infrastructure As Code - Tiki-Taka of TechOps

Techops has traditionally pursued a dual mandate. On one hand, part of its resources is dedicated to new projects and expansion initiatives. On the other hand, there has always been a significant effort to make sure existing systems are up and running. In each focus area, the industry has developed a significant knowledge base and accumulated a lot of experience.

This dual mandate has been in place for such a long time that not a lot of people are even questioning it now. So let’s pause and think - could there be a better way? I propose approaching this question with the help of a soccer analogy.

Just like techops, a soccer team on the field pursues a dual mandate - attack and defense. Over the years there have been numerous strategies and methodologies for how each phase of the game should be built. These strategies included recommendations and best practices on how to defend, how to attack and how to transition between the two.

But then they adopted tiki-taka in Spain (see also a thread on quora). While I am not going to go into significant detail here, for our purposes it’s important to note that tiki-taka deprioritizes the dual mandate - instead of focusing on attack and defense as completely separate situational configurations, tiki-taka focuses on maintaining ball possession as one of the primary means through which games are won.

In tiki-taka, possession is a primary goal, and particular decisions on offense and defense are a consequence. In other words, tiki-taka in large part replaces an offense/defense dual mandate with a single focus on possession.

Going back to techops, I see infrastructure as code playing the same role in techops as tiki-taka plays in soccer. Instead of trying to tailor your strategy to cover both new projects and keeping the lights on (it’s worth noting that these two areas oftentimes present different or even conflicting goals), by treating your infrastructure as a software engineering project you can eliminate the ambiguity and potential conflict.

Categories: devops |

Applying 5 Whys to Amazon EC2 Outage

Earlier this week AWS published a post-mortem report about their last week’s outage - http://aws.amazon.com/message/67457/.

Of the several impairments and service disruptions caused by the outage, the hour-long unavailability of the us-east-1 control plane is, in my opinion, the most important. Let’s apply 5 whys analysis to this impact. All answers below are direct quotes from the report, with my occasional notes where needed.

What happened?

There was a “service disruption which occurred last Friday night, June 29th.”

Why?

“From 8:04pm PDT to 9:10pm PDT, customers were not able to launch new EC2 instances, create EBS volumes, or attach volumes in any Availability Zone in the US-East-1 Region.”

Why?

“The control planes for EC2 and EBS were significantly impacted by the power failure” in a single AZ.

Why?

AWS were unable “to rapidly fail over to a new primary datastore” that internally serves their control plane.

Why?

“The EC2 and EBS APIs are implemented on multi-Availability Zone replicated datastores. These datastores are used to store metadata for resources such as instances, volumes, and snapshots. To protect against datastore corruption, currently when the primary copy loses power, the system automatically flips to a read-only mode in the other Availability Zones until power is restored to the affected Availability Zone or until we determine it is safe to promote another copy to primary.”

There was “blockage which forced manual assessment and required hand-managed failover for the control plane”.

Why?

The answer to this why was withheld from the public outage post-mortem report.


To me, this outage is the most worrisome of all AWS service disruptions that I know about. In a nutshell:

AWS effectively lost its control plane for the entire region as a result of a failure within a single AZ.
This was not supposed to be possible.

In hindsight, and knowing what we now know from the outage report (which is not necessarily what was known to AWS folks working the outage directly at the time), one course of action could be as follows.

Certain language in the report (“To protect against datastore corruption, currently when the primary copy loses power, the system automatically flips to a read-only mode in the other Availability Zones”) leads me to believe that during the hours preceding the events described in the report, the us-east-1 primary control plane lived in the AZ soon to be affected by the generator failure.

Between 7:24pm PT and the time utility power was restored (some time before 7:57pm), AWS crews should have discovered that something was not right with the generators in this AZ (this is not a fact, this is an assumption - it’s possible this information was not available at the time). If they did, they could have immediately initiated moving the control plane primary out of this AZ just in case, because this AZ’s generators could no longer be trusted. This might have prevented the control plane outage. (Again - a lot of assumptions on my part here.)

And finally, I can’t leave without pointing out a surprise about this outage report, which I hope AWS will never repeat in the future. They say:

While control planes aren’t required for the ongoing use of resources, they are particularly useful in outages where customers are trying to react to the loss of resources in one Availability Zone by moving to another.

This puzzles me. The control plane and the corresponding API endpoint, which serves as the interface between the control plane and AWS customers, are not merely useful in outages - THEY ARE CORE AND ESSENTIAL components, ESPECIALLY during an outage. If you could call in and dictate to an operator what AMI to launch with what security groups, elastic IPs and keypairs, I might have bought a “nice to have” argument. But there is no other way to react to an outage except by interacting with the API endpoint - hence the control plane is a “must have.”
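
To underline the point, here is roughly what “reacting to the loss of resources in one Availability Zone by moving to another” boils down to - a sketch using today’s boto3 SDK with placeholder identifiers. Every line of it goes through the control plane; if the control plane is down, none of it is possible.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch a replacement instance in a healthy AZ - only possible via the API.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",               # placeholder replacement AMI
    InstanceType="m5.large",
    MinCount=1,
    MaxCount=1,
    Placement={"AvailabilityZone": "us-east-1b"},  # a zone not affected by the outage
)
print("Replacement instance:", response["Instances"][0]["InstanceId"])
```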

Categories: cloud-computing |

How AWS Could Improve Spot Market

I recently noticed a surprising lack of new features being announced for the AWS spot market. This disappoints me, as I have had high hopes for this groundbreaking idea since it was first launched. Assuming the spot price is set only through supply and demand, I think the following three suggestions could be of interest.

Let’s start with the key problem for spot market customers today - managing anomalous price spikes. I emphasize that I use the word "manage" instead of "avoid" because I am focusing on the general case, not on any specific strategy. And in the general case, one can’t categorically demand that absolutely all spikes be avoided.

In order to be able to manage the spikes, customers must be able to see them coming (with some confidence level). Right now we can only rely on pricing charts and our ability to extrapolate past spike patterns into the future. This is extremely unreliable and error-prone. And besides, there is no guarantee that past patterns will still be valid in the future. Instead, customers need a better way.
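
For illustration, here is roughly all a customer can do today - pull the price history and look for past spikes. A minimal sketch assuming today’s boto3 SDK, an m5.large in us-east-1a, and an arbitrary “three times the weekly average” spike threshold.

```python
import datetime
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

now = datetime.datetime.now(datetime.timezone.utc)
history = ec2.describe_spot_price_history(
    StartTime=now - datetime.timedelta(days=7),
    EndTime=now,
    InstanceTypes=["m5.large"],        # illustrative instance type
    ProductDescriptions=["Linux/UNIX"],
    AvailabilityZone="us-east-1a",     # illustrative AZ
)["SpotPriceHistory"]

prices = [float(p["SpotPrice"]) for p in history]
if not prices:
    raise SystemExit("no price history returned")

baseline = sum(prices) / len(prices)
spikes = [p for p in history if float(p["SpotPrice"]) > 3 * baseline]
print(f"baseline ${baseline:.4f}/hr, {len(spikes)} spike samples in the last week")
```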

Suggestion #1: AWS should publish a real-time indicator of currently available supply.

It could be an absolute number (“N spot instance slots remaining”) or it could be a range (“the number of spot instance slots currently remaining is between N and M”).

The idea here is that the lower these numbers are, the more likely it is that the spot price is about to go up. Nothing is guaranteed, of course - supply could be decreasing without any significant price changes. But in theory every spike is preceded by a drop in available supply (while not every drop necessarily leads to a spike).

Next. Once we have a reasonably good idea of when spikes could be coming, we need to know how large they could potentially get. And this leads to the next suggestion.

Suggestion #2: AWS should publish the top of its order book in real-time.

Essentially it’s a list of the N highest bids, along with the corresponding size of each bid. This will kill two birds with one stone. On one hand, there will be a clear reference point for how high the spot price could get in an extreme situation if a spike were to occur right now. On the other hand, it will establish an upper-side price anchor that could reduce irresponsible bidding, which in turn could reduce the severity of the spikes - both of which, I expect, are going to be good news for customers.

For the provider, though, it’s important to avoid publishing the entire order book - as I showed before, the provider benefits from a diversity of bids, especially at the low end.
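
To show why the top of the book matters, here is a purely hypothetical sketch - AWS does not publish this data today, which is exactly what suggestion #2 asks for. The bids and sizes are invented.

```python
# Hypothetical top of the order book: the N highest bids and their sizes.
# (Invented numbers - nothing like this is published by AWS today.)
top_of_book = [
    (2.50, 40),    # ($ per instance-hour, number of instances bid for)
    (1.75, 120),
    (0.90, 300),
]

# If a spike happened right now, the published bids give a reference point
# for how high the clearing price could go - the highest published bid is
# a ceiling estimate for what you would need to bid to outlast the book.
price_ceiling = max(price for price, _size in top_of_book)
print(f"To outlast the published book, bid at or above ${price_ceiling:.2f}/hr")
```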

And finally:

Suggestion #3: Allow in-place modification of a running spot instance's bid.

Right now, even if you know when a spike is coming and can estimate how high it could be, there is nothing you as a customer can do except terminate your instance or let it run - you can’t do anything to survive a spike that is going to exceed your instance’s bid. This severely limits your options and makes proper management of price spikes pointless.

Conclusion

I realize that these could be far-reaching suggestions. Coupled with AWS’ preference for secrecy, they could even be nearly impossible. But without these (or something similar), the spot market will remain very difficult to bid on intelligently - in its current form, it’s just too random.

Read other posts on my blog tagged amazon-ec2-spot or cloud-pricing.

Categories: cloud-computing |

Amazon Web Services and Innovator's Dilemma

AWS rolled out yet another service last week - Simple Workflow Service (SWF). I haven’t had a chance to kick the tires yet but I liked what I saw in the docs.

But then on Twitter I started noticing people expressing concern about how this new service stacks up against existing tools in this domain (I am not an expert but acronyms such as BPMN and BPEL were mentioned several times). Some said outright that in their opinion SWF was a “retrograde step” considering “so much work [already] done in this domain.” These people are widely known as experts in their fields, and I have no doubt they have a valid point. But in spite of this, I found myself liking SWF even more.

Why?

First of all, SWF follows a well-established rollout strategy that AWS has been relying on since the very beginning - an initial rollout of at least a minimal viable product, followed by frequent and significant updates. Do you recall, for example, that EC2 launched without user-selectable kernels (aki-XXXXXX), EBS or even Elastic IPs? Can you imagine EC2 as you know it today without these features?

Secondly, SWF is most likely an attempt to productize a technology used internally. As such, the set of features available at rollout most likely approximates the set of the most important features in use at amazon.com. And if Amazon uses them, it’s likely others will use them too. It’s like writing a useful piece of code for internal use, then discovering its broader usefulness and publishing the code on github as an open source project. Except Amazon doesn’t publish the code - it turns it into a product instead.

I actually know firsthand a thing or two about this approach. CohesiveFT VPN-Cubed (disclosure - I am this project's lead engineer) was not initially envisioned as what it is today - a premier connectivity solution between clouds and datacenters. It's rooted in something I developed for internal use in a completely unrelated project. We saw its potential and it became a product, even though some key features were not even present in the first version.

And finally there is this thing called Innovator’s Dilemma. I know, I know - everybody has read the book, and everybody is already tired of hearing it mentioned. But it’s not fiction - it’s not enough to read the book, one must understand it. If you don’t understand the difference between sustaining and disruptive technological changes, you must re-read it, even if you are not interested in pursuing an MBA degree.

For the purposes of this post, however, here is an excerpt from the book. And here is a key part:

Generally disruptive innovations were technologically straightforward, consisting of off-the-shelf components put together in a product architecture that was often simpler than prior approaches. They offered less of what customers in established markets wanted and so could rarely be initially employed there. They offered a different package of attributes valued only in emerging markets remote from, and unimportant to, the mainstream.

SWF is meant not to be like BPEL, BPMN, etc. It is meant to be mostly ignored by existing users of these technologies - it is meant to be unimportant to them. This is all by design!

This is their M.O. - it’s what they are extremely good at, and it’s how they have been doing things since the very beginning. While past results are not necessarily a good indicator of future successes or failures, AWS’ track record is stellar. I wouldn’t bet against SWF if I were you.

And please - read the book (or re-read if necessary), you won’t regret it.

Categories: cloud-computing |
