The Rise of DevOps

If you are in IT, you have probably noticed that most of the industry's technical buzz lately has centered on three huge areas - cloud computing, NoSQL and devops. Unlike Web 2.0 or the Social Web, which are about models of content generation and consumption on the Internet, these three are about how software systems are built and operated - it is "engineering" vs "product."

DevOps is on the rise as a newly re-defined standalone discipline, as evidenced by the increasing number of good articles about it around the blogosphere. In this post, I am going to take a stab at outlining what DevOps means to me.

I've got some devops cred. Before joining my current employer, where my role morphed over time away from devops, I spent over 2 years at Orbitz.com in a group in charge of monitoring and automating a hugely distributed, multi-datacenter, custom airfare search application running on many hundreds of machines, with several times as many separate entities and processes that needed to be coordinated, restarted, tweaked and so on (our group was in charge of everything above hardware, OS and basic network services such as connectivity, DNS and DHCP). Before that, I held various sysadmin roles, all of which involved a large amount of coding beyond simple shell or Perl scripts.

To me, devops is a distinct discipline at the border between software engineering and ops, one that focuses on developing software for the infrastructure on top of which end-user-facing software runs. It is sometimes referred to as infrastructure software development, and it includes release deployment. Devops has the following distinguishing characteristics.

1. Ability to write code beyond simple scripts

Obvious necessary condition.

2. Focus on stability and uptime

Stability and uptime in devops almost always trump features.

3. Extra focus on moving between states

In dev land, I have often observed situations where the end result of a particular feature was analyzed on its own merits, without taking into consideration how the system would be moved from its current state to the desired future state. Devops pays extra attention to this problematic area.
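
As an illustration of what "paying attention to the transition" can look like in code, here is a minimal sketch (all names here are hypothetical): the infrastructure task is written as "converge each host from its observed state to the desired state, verifying every step", rather than as "produce the desired end state".

```python
# Hypothetical sketch: move a fleet to a desired version one host at a time,
# checking health after every transition instead of assuming the end state
# can be reached in a single jump.

def converge(hosts, desired_version, get_version, upgrade, healthy):
    for host in hosts:
        current = get_version(host)
        if current == desired_version:
            continue                    # already in the desired state, nothing to do
        upgrade(host, desired_version)  # the actual state transition
        if not healthy(host):
            raise RuntimeError(
                "%s unhealthy after %s -> %s" % (host, current, desired_version))
```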

4. Different angle on business revenue

While developers usually work on things that are meant to increase or sustain business revenue, devops often works on things that are meant to prevent or reduce loss of business revenue. This is somewhat similar to defense vs offense in team sports. The key word is "balance."

5. In devops, we are users of our own software

This is one of the most important distinctions. Unlike developers, who create software to be used by someone else (internal customers, end users, site visitors, etc.), devops is about developing software for internal needs. For example, you can certainly get sloppy about logging that error, but it is you, not someone else, who will suffer the consequences and waste extra time finding the necessary information.

6. Architect, developer, tester, product manager, project manager - all in one

In my personal experience with devops, I (or my team) get an area of responsibility, and it is up to me (or us) to make it happen. Assigning priorities, figuring out dependencies, reacting to unexpected changes, managing resources - all of these functions are performed in devops by the same group of individuals.

7. Awareness of normal accidents

I have an entire blog post dedicated to this - check it out.

8. QA in production

Some tasks in devops can't be adequately tested in smaller synthetic environments. Lack of scale, lack of unique hardware, lack of sufficient capacity in a vendor's test environment, lack of sufficient connectivity from the test site to the vendor's systems - all can be factors. Phased deployment and other techniques designed to reduce the risk of a complete meltdown are (or should be) used extensively in such scenarios, but the truth is that from time to time in devops I have had no choice but to run a test system in the live production environment.
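
As a rough sketch of the phased deployment idea (the function names and thresholds below are made up), the new code path is exposed to a small slice of production traffic first, and the slice is widened only while the observed error rate stays acceptable:

```python
import time

# Hypothetical sketch of a phased rollout: ramp the share of production traffic
# hitting the new code path, and back out if the error rate crosses a threshold.

def phased_rollout(set_traffic_share, error_rate, max_error_rate=0.01,
                   phases=(0.01, 0.05, 0.25, 1.0), soak_seconds=600):
    for share in phases:
        set_traffic_share(share)    # e.g. adjust a weight on the load balancer
        time.sleep(soak_seconds)    # let real production traffic exercise the new path
        if error_rate() > max_error_rate:
            set_traffic_share(0.0)  # roll back so the meltdown stays contained
            return False
    return True
```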

9. Manual first, then automate

In my experience, a devops task is more likely to start out as something done manually and to be automated later. In dev land, tasks rarely go through a manual phase before being coded up and shipped in a release.

10. Almost always distributed or hyper-distributed

Conclusion

Devops is on the rise primarily due to the realization that there is a big gap between developing end-user systems and bare-bones systems administration - a gap made obvious in large part by the fast growth of IaaS cloud computing. Devops originated at places where relatively few sysadmins were in charge of many hundreds or even thousands of hosts - places where doing the job without automation was impossible. As time goes on, I expect devops to further solidify its role as a first-class citizen and make inroads into non-cloudy companies as well.

Categories: devops | infrastructure-development |

Workloads in Cloud Computing

In computer science, according to Wikipedia, abstraction is a “mechanism to reduce and factor out details so that one can focus on a few concepts at a time.”

When you hear about abstraction in the context of virtualization-based IaaS cloud computing, the best-known abstraction is the computing resources themselves (encapsulation is at play here as well). You don't need to know the exact hardware your instance is running on, or the exact network setup - you only need to be able to treat your compute instances as nearly identical units that respond to a certain set of signals in a predictable way.

With the emergence of multiple IaaS clouds, however, there is a second abstraction that is going to play a big role - the workload.

A workload is an abstraction of the actual work that your instance or set of instances is going to perform. Running a web server or a web server farm, or being a Hadoop data node - these are all valid workloads. I treat a workload as an abstraction because I intentionally leave out a huge component - exactly how the work in a given workload gets mapped to the resources offered by a given cloud. When speaking in terms of workloads, I want to focus on what needs to be done, as opposed to how it is going to be done in the context of a particular cloud (remember that from a technical architecture perspective, clouds are far from identical).

For example, "run this blog for 1 year, for up to 100 visitors a day" is a what (a workload), while "run this blog on an m1.small EC2 instance in us-east-1 for 1 year" or "run this blog on a Terremark instance with 1 VPU and 1 GB of RAM for 1 year" is a how (for lack of a better word, I am going to call these deployments).

I think such an abstraction is very helpful. Running this blog may take 1 small instance in one cloud, half of a small instance in another, and a third of an instance plus a dedicated load balancer in a third. As you can see, once you map a workload to a set of compute, storage and network resources offered by one cloud, you can no longer simply move it to another cloud - your deployments are not transferable from one cloud to another. Workloads serve as the transferable equivalents of your cloud deployments.

Secondly, workloads by themselves may have properties or attributes that dictate where a workload can or cannot run. This justifies the existence of the workload as a separate entity - it is in theory possible to construct a workload for which no deployment can exist in any of the clouds available today.

There are many examples of attributes a workload may possess. A workload may have a compliance attribute, which says that it must run in an environment with a certain certification. Another attribute may be a geo-location requirement, whereby it must run within a certain geographic region for legal reasons.

A workload may be time-bound ("runs for 5 hours") or time-unbound. It may have a specific start time or a flexible start time, in which case it may have a hard stop time (for example, it must finish by a certain time in the future). It can be interruptible, or it may have to run without interruptions.

A workload may have a certain lower limit of resources that it needs, expressed in a work-independent form. For example, serving a WordPress blog to 1 visitor a day and serving it to 100 visitors an hour are two very distinct workloads (note that the workloads are different, while the application inside them is the same). The latter will certainly end up consuming more resources than the former.

A workload may have a budget associated with it, and it may have redundancy requirements. It may require a certain OS or distribution. It may require a certain feature (for example, a persistent disk or a non-private IP address directly attached to eth0). It may require a certain minimal access speed to some data source (for example, if my data are in S3 on the East Coast, I may want my workload to run somewhere nearby). Each requirement is a restriction - the more requirements a workload has, the fewer clouds can potentially run it.
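
To make the abstraction concrete, here is a minimal sketch of how a workload and its requirements could be modeled in code (the attribute and cloud names are made up for illustration): a workload carries only its requirements, and a deployment is possible only in a cloud that can satisfy all of them.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: a workload is a bag of requirements, a cloud advertises
# capabilities, and a deployment can exist only where every requirement is met.

@dataclass
class Workload:
    compliance: set = field(default_factory=set)   # e.g. {"PCI-DSS"}
    regions: set = field(default_factory=set)      # acceptable geographic regions
    features: set = field(default_factory=set)     # e.g. {"persistent-disk"}
    max_hourly_budget: float = float("inf")

@dataclass
class Cloud:
    name: str
    certifications: set
    regions: set
    features: set
    min_hourly_price: float

def candidate_clouds(workload, clouds):
    """Clouds where a deployment of this workload could exist."""
    return [c for c in clouds
            if workload.compliance <= c.certifications
            and (not workload.regions or workload.regions & c.regions)
            and workload.features <= c.features
            and c.min_hourly_price <= workload.max_hourly_budget]
```

The more requirements the workload carries, the shorter the list of candidate clouds - which is exactly the "each requirement is a restriction" point above.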

Conclusion

The answer to the question "Where is the best place to run this task?" used to be treated as a binary decision ("on premises" vs "in the cloud"), but not any more - because there are now many different and incompatible implementations of the latter. Looking at your tasks through the workloads/deployments prism may open new horizons for computing mobility. There is a saying: "select the right tool for the job." It can now be extended to "select the right tool and the right location for the job."

If you like the idea of cloud computing workloads, you may find this post by James Urquhart interesting as well.

P.S. Believe it or not, this is my 100th post on this blog. Not bad. Hope at least some of you enjoy reading my posts as much as I enjoy writing them.

Categories: cloud-computing |

On Dangers of Prematurely Making API Public

From time to time, I come across the statement that every service on the Internet must have an API, or the people behind the service are doing it wrong. This usually refers specifically to a publicly available API.

As a user who stands to benefit from a greater number of services allowing third-party applications and mashups, I certainly tend to agree. But as a developer, I realize that prematurely making an API public may be a disaster.

Publishing an API represents a long-term commitment. You as a developer are committing to supporting this API for some non-trivial amount of time (at least 12 months, I would imagine) and are essentially inviting other developers to build new functionality against it. No one likes to spend their time developing against an API only to discover shortly afterwards that the API has changed, or that some functionality that used to be offered is no longer available.

By making your API public, you are signaling that this part of your system is very stable, that its functionality is well established, understood and developed, and that its usage patterns are well thought out. Or at least that is how I, as a third-party developer, interpret your action.

If you know your audience well enough and are pretty confident that they won't mind your tweaking things after the initial publication, you may take the risk. Twitter famously launched their API very, very early, and in the end it proved a huge success for them. (So if they had listened to the advice in this post, they would have been worse off.)

But not all developer audiences are as agile and forgiving as Twitter's. I can imagine a very conservative, large user of your API objecting very strongly to any change to it. What do you do then? Maintain 2 versions? But what if underlying database schema changes make the old API incompatible with what you are trying to do in the future? Fork and host 2 different systems, old and new? I honestly can't imagine a worse scenario.
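
If you do end up having to keep an old public contract alive, the least painful pattern I know of is to carry a version in the URL (or a header) from day one, so old clients keep getting the old shape while new clients move on. A minimal sketch, with hypothetical handlers and no particular framework:

```python
# Hypothetical sketch: dispatch requests by API version so an old, frozen
# contract can coexist with a newer one while the internals evolve.

HANDLERS = {
    ("v1", "status"): lambda req: {"status": "ok"},                # frozen contract
    ("v2", "status"): lambda req: {"status": "ok", "uptime": 42},  # newer shape
}

def dispatch(path, request):
    """Expects paths like /api/v1/status."""
    _, version, resource = path.strip("/").split("/")
    handler = HANDLERS.get((version, resource))
    if handler is None:
        return 404, {"error": "unknown version or resource"}
    return 200, handler(request)
```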

My advice - before publishing your new API, make sure you are not going to paint yourself into a corner down the road. Only publish an API for those parts of your system that are very stable (both operationally and in terms of internal mechanics and functionality) and whose usage patterns are well researched and reasonably predictable. Don't rush it.

Categories: software-engineering |

Digging into EC2 Spot Price History

In December 2009, the Amazon Web Services team introduced yet another innovation - spot pricing for EC2 instances. Several sites sprang up shortly afterwards to track spot price history with price charts. But price charts are relatively boring - the juicy meat is in the dynamics hidden inside the series of numbers that make up the price history. Let's do some exploring!

Several notes first.

  1. All references to times and dates below are GMT for all regions.
  2. Spot instances went live on December 14, therefore I ignore all data points before that (for simplicity, my cutoff was set at UNIX timestamp 1260777600 - that is 8am GMT on December 14, which translates to midnight in Seattle, where AWS is headquartered).
  3. Spot price history was obtained on January 25, 2010 at 10:54pm via API and cached locally for analysis.
  4. In order to be able to deal with integers instead of floats, all prices below are represented in points where 1,000 points = $1 per compute hour.
  5. Each product is specified as [region, instance_type, product_description] tuple.
  6. I am only going to outline facts below, all interpretation is up to you.
  7. These results have not been exhaustively verified, and my analysis code may have bugs. Use at your own risk.

#1 Price averages

Here is a chart of the average spot price for each product relative to the regular price for the same product (the averages take into account how long each price remained in effect). The percentage next to each product identification represents the ratio between the average spot price and the regular price.
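
For reference, the "how long each price remained in effect" weighting works roughly like this (a sketch; the history is assumed to be a list of (unix_timestamp, price_in_points) pairs sorted by time, with each price staying in effect until the next revision or until the end of the observation window):

```python
# Sketch: time-weighted average of a spot price history given as a sorted list
# of (unix_timestamp, price_in_points) pairs; end_ts closes the last interval.

def time_weighted_average(history, end_ts):
    total_seconds = 0
    weighted_sum = 0
    for (ts, price), (next_ts, _) in zip(history, history[1:] + [(end_ts, None)]):
        duration = next_ts - ts
        total_seconds += duration
        weighted_sum += price * duration
    return weighted_sum / total_seconds
```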

#2 Price increases in a row

The maximum number of price increases in a row was 6. It occurred on January 23-24 for [us-west-1, m1.large, Windows], when the price went up from 256 to 273.

5 price increases in a row also happened once, 4 in a row - 16 times, 3 in a row - 95 times, 2 in a row - 643 times, and a single increase immediately followed by a price reduction happened 2,433 times. Of the latter, 684 (28%) were a single price increase followed by the price returning to where it had been right before the increase (X -> X+Y -> X).
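
Counting increases in a row is simple once each product's history is sorted chronologically; a sketch:

```python
# Sketch: length of the longest run of consecutive price increases in a
# chronologically sorted list of prices for a single product.

def longest_increase_streak(prices):
    best = streak = 0
    for prev, cur in zip(prices, prices[1:]):
        streak = streak + 1 if cur > prev else 0
        best = max(best, streak)
    return best
```

For example, longest_increase_streak([250, 256, 260, 273, 270]) returns 3.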

#3 Individual price increases

The maximum single price increase in absolute terms was 928 points - it occurred for [us-east-1, m2.2xlarge, Windows] when the price went up from 572 to 1,500. The second biggest was 890 for [us-east-1, m1.large, Linux/UNIX], and the third biggest was 551 for [us-east-1, m1.small, Windows]. Note that all of these occurred in us-east-1.

The biggest price increase as a percentage of the regular price was 460%, when the price for [us-east-1, m1.small, Windows] jumped from 49 to 600 on January 24. The second and third biggest in this category were a 262% increase for [us-east-1, m1.large, Linux/UNIX] (110 -> 1,000) and a 64% increase for [us-east-1, m2.2xlarge, Windows] (572 -> 1,500).

The same two biggest increases were also the biggest price increases as a percentage of the current spot price - 1,124% and 809%, respectively. Third place in this category was a 186% increase for [eu-west-1, m1.small, Linux/UNIX], when the price went up from 28 to 80.

Here is a chart showing price increases and reductions day by day.

#4 Number of datapoints per product and/or product family

There were a total of 4,469 spot price revisions for Windows and 3,885 for Linux/UNIX. By region, us-east-1 had the fewest price revisions in total - 2,491, of which 1,254 were for Windows and 1,237 for Linux/UNIX (50.3% vs 49.7%). The 2,809 price revisions in eu-west-1 were split 1,518 for Windows vs 1,291 for Linux/UNIX (54% vs 46%). The 3,054 price revisions in us-west-1 were split 1,697 for Windows vs 1,357 for Linux/UNIX (56% vs 44%).

[eu-west-1, m1.small, Windows] had the most price revisions - 287. [us-east-1, m2.4xlarge, Windows] had the fewest - 40.

Across all regions combined, the most price revisions per day happened on January 22, 2010 - 351 price revisions.

#5 Percentiles

Here is a Google Fusion table with percentile estimates for each product. I tried to calculate percentiles from the 50th through the 95th (in steps of 5), plus the 99th, but since the price function consists of discrete values, not all percentiles could be estimated. For each percentile, the nominal price is provided along with its percentage of the regular instance price for the given product. The percentiles take into account how long a given price remained in effect.
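
The duration weighting for percentiles works the same way as for the averages above: each price is treated as valid until the next revision, prices are sorted, and the cumulative duration is walked until it covers the requested fraction of the total time (a sketch):

```python
# Sketch: duration-weighted percentile of a spot price history given as a list
# of (duration_seconds, price_in_points) pairs.

def weighted_percentile(price_durations, pct):
    total = sum(duration for duration, _ in price_durations)
    threshold = total * pct / 100.0
    cumulative = 0
    for duration, price in sorted(price_durations, key=lambda pair: pair[1]):
        cumulative += duration
        if cumulative >= threshold:
            return price
    raise ValueError("empty price history")
```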

#6 Spot price over regular price

Situations where the spot price equals or exceeds the regular price are especially interesting. Most such situations occurred in us-east-1, and none occurred in eu-west-1.

The spot price reached but did not exceed the regular price twice for [c1.xlarge, Linux/UNIX], twice for [c1.medium, Windows], 6 times for [m1.small, Windows], and once for [m1.large, Windows] - all in us-east-1.

In us-west-1, spot price for [m1.large, Linux/UNIX] exceeded the regular price by 20 for under 2 hours on December 29.

Spot price for [us-east-1, m2.2xlarge, Windows] exceeded the regular price by 60 for over 20 hours on January 11-12.

Spot price for [us-east-1, m1.large, Linux/UNIX] exceeded the regular price by 64 on December 17 and by 660 twice on December 18.

And finally, spot price for [us-east-1, m1.small, Windows] exceeded the regular price by 480 once and by 430 once - both on January 24.

Conclusion

There are hardly any surprises in the spot price history so far, but it has been less than 2 months since the feature launched. As usage ramps up, I expect it to become more interesting. Kudos to the AWS team for coming up with this innovative pricing mechanism and being the first to introduce it at such a large scale in a real environment. Only time will tell whether it sticks in its current form or morphs into something else (I have a couple of ideas), but the first small step towards dynamic pricing of computing resources has been made.

Read other posts on my blog tagged amazon-ec2-spot.

Categories: cloud-computing |

Normal Accidents in Complex IT Systems

Designing a fully automated or nearly fully automated computer system with many moving parts and dependencies is tricky, whether the system is distributed, hyper-distributed or otherwise. Failures happen and must be dealt with. After a while, most folks graduate from "failures are rare and can be ignored" to "failures are not that rare and cannot be ignored" to "failures are common and should be taken into consideration" to "failures are frequent and must be planned for." The latter seems to be the current prevailing point of view.

But here is the kicker - that is not the end of the progression. I saw this tweet, read this post and checked out a book by Charles Perrow titled "Normal Accidents" from the library. Published in 1984, the book is not about IT, but its material fits our field nicely. And boy, was I enlightened!

The book's main point: no matter how much thought is put into a system's design, or how many safeguards are implemented, a sufficiently complex system will sooner or later experience a significant breakdown that was impossible to foresee, principally due to unexpected interactions between components, tight coupling or bizarre coincidence. For us in IT, it translates to "no matter how much planning you do or how many safeguards you implement, failures will still happen."

There are at least 3 common themes that are present in multiple illustrations in the book:

  1. A big failure was usually a result of multiple smaller failures; these smaller failures were often not even related
  2. Operators (people or systems) were frequently misled by inaccurate monitoring data
  3. In a lot of cases, human operators were used to a given set of circumstances, and their thinking and analysis were misled by their habits and expectations ("when X happens, we always do Y and it comes back" - except for this one time, when it didn't)

I have had my share of outages and downtimes, and I can attest that I have seen these 3 factors play a big role in tech ops. Some were bugs in management and monitoring code, some were human error, some were bizarre sets of dependencies, but all were a combination of multiple factors. For example, who would have thought that when the primary DNS resolution server failed, the VIP would not fail over to the secondary; that even though hosts had more than one "nameserver" line in /etc/resolv.conf, the application would time out waiting for DNS before it ever got to ask the second nameserver; and that without name resolution, multiple load balancers would independently decide that there was no capacity behind them (because the management code calculated capacity in near real time, relying on worker hosts' names) and disable themselves, taking down the entire farm - now I know, of course…

It turns out we can't eliminate normal accidents altogether, but here are several techniques that I have been using to speed up detection and response in order to reduce downtime.

Complexity budget. Described by Benjamin Black, this is a technique for allocating complexity among components beforehand and strictly following that allocation during the implementation phase. It helps avoid unnecessary fanciness and leads to simpler code, which tends to be easier to troubleshoot and to recover after a failure.

Control knobs/switches for individual components. As John Allspaw shows on this slide, you need to be able to turn off any component in an emergency, or throttle it up or down. Planning for this capability and building it in from the very beginning is very important.
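
A minimal sketch of what such knobs can look like (the file path and knob names are hypothetical): every risky component consults a runtime control source that an operator can change during an incident, without a code deploy.

```python
import json

# Hypothetical sketch: components read their on/off switch and throttle from a
# control file (or a config service) that operators can change at runtime.

def load_controls(path="/etc/myapp/controls.json"):
    # Example contents: {"recommendations": {"enabled": false, "max_qps": 50}}
    with open(path) as f:
        return json.load(f)

def component_settings(controls, component_name):
    knob = controls.get(component_name, {})
    return knob.get("enabled", True), knob.get("max_qps")  # (on/off, throttle)
```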

Accuracy of monitoring data. Ensure your alarms are as accurate as possible. No matter how much chaos is going on inside the system during a severe failure, the last thing you can afford is to mislead the operators with wrong information. If you tried to ping host A and didn't get a response, your alarm should not say "host A is down", because that is not the knowledge you obtained - it is an assumption you made. It should say "failed to ping host A from host B" - maybe the network on host B was the issue when the ping attempt was made; how do you know?
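
In code, the difference is simply in what goes into the alarm text - record the observation, not the conclusion. A sketch using a plain TCP probe:

```python
import socket

# Sketch: the alarm states what was actually observed ("probe from B to A
# failed"), not the conclusion an operator might jump to ("A is down").

def check_tcp(target_host, port, timeout=2.0):
    observer = socket.gethostname()
    try:
        with socket.create_connection((target_host, port), timeout=timeout):
            return None  # probe succeeded, no alarm
    except OSError as exc:
        # NOT "target_host is down" - that is an assumption, not an observation.
        return "failed to connect to %s:%d from %s: %s" % (target_host, port, observer, exc)
```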

Availability of monitoring data. There is a reason the first thing the military tries to do when attacking is to disrupt the enemy's means of communication - it is that important, and the same applies to our case. Either you design your systems so that you can get monitoring data even during the worst outage imaginable (ideally from more than one source), or you should at least get an alarm about the lack of such monitoring data (a very weak substitute, though).

All in all, I highly recommend the Normal Accidents book, as well as this whitepaper (linked from John Allspaw's blog), to everybody in IT.

Categories: distributed | infrastructure-development | software-engineering |
