On Amazon EC2 Spot Price Spikes

Last week I came across an interesting post on Amazon EC2 spot price spikes published on GigaOm. In the comments, in response to a question from a reader, the author stated that “I don’t think anyone ever expected that the market would behave like this.” I have been interested in this expectation for some time and decided to kick off the 2012 blogging season on somic.org with a post dedicated to this topic.

As I described before, a spot instance is technically not that different from a regular on-demand instance - it has the same CPU and RAM capacity and the same network traffic and bandwidth allowances. The only fundamental difference is that AWS can terminate it under certain circumstances (when the spot price exceeds the instance’s bid) - because of this, most people rationally expect spot instances to trade at a discount to the regular on-demand price.

Furthermore, somewhat similar to the concept known as the law of one price, intuition says that if the spot price exceeds the on-demand price, people will stop bidding on spot and will start launching regular instances instead, until the spot price comes down as a result of reduced demand.

But then we face a question. How is it possible that we see the following prices for m1.small/Linux in us-east-1 when its on-demand price is $0.085 per hour:

SPOTINSTANCEPRICE	0.500000	2011-11-16T14:39:39-0600	m1.small	Linux/UNIX
SPOTINSTANCEPRICE	2.000000	2011-11-16T14:53:40-0600	m1.small	Linux/UNIX
SPOTINSTANCEPRICE	3.000000	2011-11-16T15:32:37-0600	m1.small	Linux/UNIX
SPOTINSTANCEPRICE	2.000000	2011-11-16T17:51:35-0600	m1.small	Linux/UNIX
SPOTINSTANCEPRICE	3.000000	2011-11-16T20:33:19-0600	m1.small	Linux/UNIX
SPOTINSTANCEPRICE	2.100000	2011-11-17T01:43:24-0600	m1.small	Linux/UNIX
SPOTINSTANCEPRICE	3.000000	2011-11-17T02:38:30-0600	m1.small	Linux/UNIX
SPOTINSTANCEPRICE	2.100000	2011-11-17T05:34:07-0600	m1.small	Linux/UNIX
SPOTINSTANCEPRICE	2.000000	2011-11-17T10:05:29-0600	m1.small	Linux/UNIX
SPOTINSTANCEPRICE	3.000000	2011-11-17T12:22:39-0600	m1.small	Linux/UNIX
SPOTINSTANCEPRICE	2.000000	2011-11-17T13:39:53-0600	m1.small	Linux/UNIX
SPOTINSTANCEPRICE	5.000000	2011-11-17T13:53:19-0600	m1.small	Linux/UNIX
SPOTINSTANCEPRICE	1.000000	2011-11-17T14:50:19-0600	m1.small	Linux/UNIX
SPOTINSTANCEPRICE	0.500000	2011-11-17T19:49:52-0600	m1.small	Linux/UNIX
SPOTINSTANCEPRICE	1.000000	2011-11-18T07:05:59-0600	m1.small	Linux/UNIX
SPOTINSTANCEPRICE	0.500000	2011-11-18T09:23:32-0600	m1.small	Linux/UNIX

Note that the spot price in this timeframe fluctuated between 588% and 5,882% of the regular on-demand price. (I wonder if the users who had spot instances running at these prices know how to spell “overpaid.”)
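
As an aside, the listing above looks like raw output from the ec2-describe-spot-price-history API tool. If you want to run a similar check yourself, here is a minimal sketch using boto3 (which post-dates this post) - the on-demand figure and the time window are example values only.

    # Pull recent spot price history for m1.small/Linux in us-east-1 and
    # print every observation above the on-demand price. The 0.085 figure
    # and the 2-day window are assumptions for illustration.
    from datetime import datetime, timedelta, timezone

    import boto3

    ON_DEMAND_PRICE = 0.085  # example on-demand price for m1.small Linux

    ec2 = boto3.client("ec2", region_name="us-east-1")
    end = datetime.now(timezone.utc)
    start = end - timedelta(days=2)

    history = ec2.describe_spot_price_history(
        InstanceTypes=["m1.small"],
        ProductDescriptions=["Linux/UNIX"],
        StartTime=start,
        EndTime=end,
    )

    for point in history["SpotPriceHistory"]:
        price = float(point["SpotPrice"])
        if price > ON_DEMAND_PRICE:
            # flag every observation where spot traded above on-demand
            print(point["Timestamp"], point["AvailabilityZone"], price)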

I think there are several possible explanations.

Firstly, it’s possible that whoever does the actual bidding on spot instances is not the same person who pays the bill - for example, developers bid while the accounting department gets charged on its credit card. This may lead to careless bidding and drive the price to unnecessarily high levels.

Secondly, it’s possible that some customers don’t have enough sophistication built into their automated bidding systems. In the price history snippet above, the spot price remained extremely high for the entire day of November 17 (Thursday) - that shouldn’t have gone unnoticed even by a semi-automated system (i.e., one with occasional human supervision and monitoring). The right course of action was to cancel all outstanding bids and switch to on-demand. Any automated bidding system must monitor the spot price at all times and be prepared to switch new instance launches away from spot to on-demand if the spot price remains elevated for long periods of time, as well as to terminate currently running spot instances.
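
To make this concrete, here is a minimal sketch of the kind of watchdog I have in mind - current_spot_price() and cancel_bids_and_switch() are hypothetical hooks into your own price feed and provisioning system, and the thresholds are placeholders:

    # Minimal spot price watchdog sketch. current_spot_price() is assumed to
    # wrap a spot price history lookup; cancel_bids_and_switch() is assumed
    # to cancel outstanding bids, launch on-demand replacements and
    # terminate running spot instances.
    import time

    ON_DEMAND_PRICE = 0.085      # example on-demand price for the instance type
    POLL_INTERVAL = 300          # seconds between checks
    MAX_ELEVATED_CHECKS = 6      # ~30 minutes of spot trading above on-demand

    def watch_spot_market(current_spot_price, cancel_bids_and_switch):
        elevated_checks = 0
        while True:
            if current_spot_price() > ON_DEMAND_PRICE:
                elevated_checks += 1
            else:
                elevated_checks = 0
            if elevated_checks >= MAX_ELEVATED_CHECKS:
                # Spot has stayed above on-demand for too long: bail out.
                cancel_bids_and_switch()
                elevated_checks = 0
            time.sleep(POLL_INTERVAL)

The exact thresholds matter much less than having the switch-over path automated and tested before you need it.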

Thirdly, it’s possible that EC2 decided to shut down the spot market for this instance type by setting the spot price above all bids (I am referring to the $5 price point).

And finally, it’s possible that some customers want to gamble. They could be bidding above their true price to be able to weather the spikes. Their thinking could go like this: “Over the life of a spot instance, in normal times when the spot price is below on-demand, we realize good savings. We could give back some of those savings to offset the times when the spot price spikes, so that in the end we still come out ahead.”

Hence, in addition to the recommendations of the GigaOm post’s author, here is my advice.

  1. Avoid spot instances for workloads that must run non-stop for the foreseeable future (especially in us-east-1, where spot prices seem to fluctuate and spike a lot more)
  2. Do not set your spot bid above the on-demand price unless you really know what you are doing and have sufficient automated instrumentation in place to protect you in case the spot price does go through the roof
  3. Do not submit spot instance bids for immediate execution if the spot price is already significantly higher than on-demand
  4. Hoping that the spot price will come down to below on-demand very soon is not a bidding strategy, it's gambling

You can read my other posts about EC2 spot instances here.

Categories: cloud-computing |

VXLAN and NVGRE - Not a Long Term Answer

Last week I came across a blog post titled NVGRE Musings. It’s got some great links to posts about two recently introduced proposals - VXLAN and NVGRE. But what drew my attention was the following thought from the first paragraph:

Supporting an L2 service is important for virtualized servers, which need to be able to move from one physical server to another without changing their IP address or interrupting the services they provide.

I don’t know if this is what network vendors are hearing from their big customers, but I strongly believe it's not the right answer to failover needs in the long term. No matter how hard I look, I don't see failover within the application as the network layer's problem to solve. The fact that it can doesn’t mean it should. My experience, as well as my understanding of the current state of affairs in the biggest web-based tech ops organizations, leads me to the conclusion that the best place to handle application-level failover is the application software itself.

Since the application lives on top of the network layer, the network is not in a good position to provide custom-tailored failover solutions - all it can offer is generic functionality that must remain transparent to the app. The biggest selling point of this approach is that customers don’t need to re-architect their applications. There is a mounting body of evidence, however, that an application does benefit from at least some modernization when it is moved to a newer operating environment (this was the approach taken by Netflix, who kept the parts of their old application that had worked well and discarded the parts they had found lacking after years of running them in their own datacenter, or that were unfit for their new IaaS environment).

George Reese said it best:

[...] New apps -> app is responsible for availability; Old apps -> infrastructure is responsible for availability

With this said, I do see a part of the network stack where the need for a new standard is greatest. I am talking about service name resolution via DNS.

Some of this is being addressed by Google’s effort to include part of the end-user’s IP address in the resolution request, so that authoritative DNS servers can do more meaningful geo-distribution. But I think it should not stop there.

Right now, when a client requests the list of IP addresses corresponding to a given hostname, it simply gets a list of IPs (which, in the general case, is technically not even ordered). Instead, the response could include a lot more meta information: “here is a list of IP addresses corresponding to the service you requested - try them in this order; if all of them fail to respond within such and such timeout, here is a backup list; and finally, here is a token for your request - if you fail to get any response from any of the IPs listed within such and such timeout, retry this DNS query.”
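
No such protocol exists today, so the following is purely an illustration of how a client could consume this kind of richer response - every field in it is invented for the sake of the example:

    # Purely illustrative: what a client could do with the richer response
    # described above. The response format is made up; no existing resolver
    # returns anything like this.
    import socket

    example_response = {
        "primary": ["192.0.2.10", "192.0.2.11"],   # try these, in this order
        "backup": ["198.51.100.20"],               # then fall back to these
        "timeout": 2.0,                            # per-address timeout, seconds
        "retry_token": "abc123",                   # present when retrying the query
    }

    def connect_with_failover(response, port):
        for ip in response["primary"] + response["backup"]:
            try:
                return socket.create_connection((ip, port), timeout=response["timeout"])
            except OSError:
                continue
        # Every address failed within its timeout: at this point the client
        # would retry the DNS query, presenting response["retry_token"].
        return None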

With expanded retry and failover capabilities in the hostname resolution protocol on the client, it will become much easier to build hyper-distributed, highly available services and applications - and that’s what I think the network industry should be focusing on.

Categories: cloud-computing |

Complex Systems: Generalists and Specialists

The following tweet from @saschabates that appeared in my stream this morning caught my attention:

#surgecon emergent theme: complex systems cannot be effectively diagnosed without smart generalists who understand them end to end

This statement is correct (otherwise it wouldn’t have become an emergent theme at one of the best tech conferences). However, it caught my attention not because it’s right, but because it can be hugely misinterpreted.

It's not saying that smart generalists are the only key to diagnosing problems in complex systems - it's saying that without smart generalists operating a complex system is nearly impossible. The key to running a system successfully is having a balanced mix of generalists and specialists. Generalists are usually necessary but not sufficient.

There are generally two types of roles in tech ops - generalists and specialists. A DBA, for example, is initially a specialist. And so is a network engineer. Specialists focus on some part of a bigger system; they have detailed knowledge about all interactions between components within their area of expertise and usually will have significant understanding of how their part integrates into the whole system. Their view is from inside out.

Generalists, on the other hand, approach the system from the reverse angle - from outside in. Their focus is on the interaction of components within the system; while they study the behavior of the system as a whole, they will inevitably develop a deeper understanding of individual components, but the depth will vary by component.

Why can’t we all be generalists - know everything, be interchangeable, able to effectively resolve all issues by ourselves without any help? The answer is simple. You can know the tech. You can master troubleshooting techniques based on logic. But in a sufficiently complex system (which is usually changing at quite a fast pace), you won’t have enough time to accumulate enough experience with each component to become effective at running it single-handedly.

While the idea “let’s all be generalists” could sound appealing, it’s not achievable in the general case - the more complex a system gets, the less likely it is that one person can be both a generalist and a specialist in it.

But that’s not all. At times an attempt is made to run a system with specialists but without generalists. This approach also doesn’t work universally. While it could work in smaller teams, sooner or later it breaks down as the number of people increases. Generalists are called in to help direct the specialists.

The bottom line: both generalists and specialists are key to the successful operation of a complex system. These are distinct skills, even if they are assigned to the same individuals. The size of the overall tech ops team and the complexity of the system at hand play a decisive role in determining when it’s a good time to branch out into separate generalists and specialists.

Categories: devops |

Troubleshooting

One of the areas of tech ops that doesn’t get its fair share of discussion is troubleshooting. It’s not easy to teach troubleshooting - possibly because how successfully one can troubleshoot a given system largely depends on one’s experience with the system and on the quality of the system’s feedback loops (the accuracy and timeliness of monitoring data).

But despite the fact that troubleshooting is often more art than science, it has a set of general rules and guidelines, without which troubleshooting is nothing more than guessing. These are all common-sense rules that formally come from Boolean algebra and first-order logic. They universally apply to the first half of troubleshooting - finding what’s wrong.

It’s important to emphasize that troubleshooting activities are always measured against two independent goals - finding and fixing the issue, and doing it as fast as possible. It’s the second goal that makes the use of logic mandatory - you usually can’t afford to mentally build a list of everything that could have gone wrong and then start crossing items off this list one by one. To speed things up, you usually analyze symptoms and check only those hypotheses that plausibly match them. The ability to properly prioritize hypotheses comes purely from experience, but not wasting your time on things that can’t explain what you are observing has a lot to do with logic.

A key aspect of troubleshooting is causality: event A leads to event B, or A causes B, or A implies B (A -> B). A is sufficient for B here, and B is necessary for A.

A -> B is the same as NOT B -> NOT A. Imagine, for example, that A = "filesystem is full" and B = "writes to filesystem are failing." In this case A -> B. Therefore, if writes are working (NOT B), it means filesystem is not full (NOT A). But if writes are failing (B), it does not automatically mean that filesystem is full (for example, it could be mounted read-only).

Another way to look at A -> B is (NOT A) OR B. This form can be easier to work with when you are applying negation - see below.

When A is sufficient and necessary for B, it means that A and B are either both true or both false at the same time. Another way of saying it is “A is true if and only if B is true.” This statement formally consists of two: A -> B and B -> A.

Then there are important rules about negation, known as De Morgan's laws:

NOT (A OR B) = (NOT A) AND (NOT B)
NOT (A AND B) = (NOT A) OR (NOT B)
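
These identities are easy to verify mechanically. Here is a quick sketch (in Python, just for illustration) that brute-forces the contrapositive rule and De Morgan's laws over all truth assignments:

    # Verify the rules above by brute force over all truth assignments.
    from itertools import product

    def implies(a, b):
        return (not a) or b   # A -> B is the same as (NOT A) OR B

    for a, b in product([False, True], repeat=2):
        # A -> B is equivalent to NOT B -> NOT A (contrapositive)
        assert implies(a, b) == implies(not b, not a)
        # De Morgan's laws
        assert (not (a or b)) == ((not a) and (not b))
        assert (not (a and b)) == ((not a) or (not b))
        # Note: A -> B does NOT give NOT A -> NOT B; a=False, b=True breaks it.
    print("all identities hold")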

So how could you apply these rules in practice? First and foremost, never waste your time on checking A if you are observing NOT B and you know that A -> B.

Secondly, never assume that NOT A causes NOT B if you only know that A -> B.

Finally, never assume causality out of mere correlation of two events. If A and B tend to occur together, in bigger systems it’s often hard to determine whether there is any causality and which way it goes - further analysis is required.

The simple rules I mentioned in this post are not a complete guide to troubleshooting, but they can still help you save time and resources - remember that any amount of time spent investigating a hypothesis that you should have rejected based on pure logic is time wasted.

Categories: devops | distributed |

Amazon EC2 Spot Instances - A Flop?

When Amazon Web Services launched EC2 spot instances in December 2009, I was very excited about the beginnings of a potential revolution in how computing resources could be priced, bought and sold. I have followed this unprecedented phenomenon with great interest, blogging my thoughts along the way.

But today, over 1.5 years since the launch, I am not so sure anymore. While I have no insider information on what goals AWS set out for this program and how it’s been performing against them, there is a significant, publicly available indicator that convincingly shows that this thing AWS calls a “spot market” is not performing the function of a spot market (clearing at an equilibrium price). Instead, EC2 spot instances as of today are simply a discounted product with a couple of features removed (similar to an airline selling non-refundable tickets at a discount to the price of fully refundable ones).

There are basically four features that AWS strips out of its regular on-demand product to justify the discount:

  • a call to request an instance does not return an object corresponding to the instance you requested; instead you get a spot instance request object (see the sketch after this list)
  • there is an undefined time interval between the creation of the spot request object and the instance object (this interval is usually small, but technically it is not bounded)
  • a spot instance even during normal operation may never get started
  • a spot instance, once it’s started, can be terminated by EC2 under certain conditions even during normal operation
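
For illustration, here is a hedged sketch with boto3 (which did not exist at the time of writing) showing the first two points - the call returns a spot instance request, not an instance, and you have to poll to find out when (or whether) an instance gets attached to it. The AMI id and bid price are placeholders.

    # Sketch only: request one spot instance and poll until it is fulfilled.
    import time

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    response = ec2.request_spot_instances(
        SpotPrice="0.05",                      # your bid, as a string
        InstanceCount=1,
        Type="one-time",
        LaunchSpecification={
            "ImageId": "ami-00000000",         # placeholder AMI id
            "InstanceType": "m1.small",
        },
    )

    # Note: what comes back is a spot instance request, not an instance.
    request_id = response["SpotInstanceRequests"][0]["SpotInstanceRequestId"]

    for _ in range(20):                        # poll for up to ~10 minutes
        req = ec2.describe_spot_instance_requests(
            SpotInstanceRequestIds=[request_id]
        )["SpotInstanceRequests"][0]
        if "InstanceId" in req:
            print("fulfilled by", req["InstanceId"])
            break
        time.sleep(30)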

The officially stated goal of this discounting is for EC2 to be able to reduce unused capacity while retaining a legal right to reclaim that capacity quickly if a need arises suddenly. As currently designed, it’s a win-win for both EC2 and customers. It’s a terrific idea. But it’s not a market driven by supply and demand.

If you want to see for yourself, please open a new browser tab and head over to http://cloudexchange.org. Pick a product. Wait for a chart to load. Observe a nicely fluctuating price. So far so good.

But now, instead of looking at a weekly or monthly chart, look at the all-time chart (click “All” in the lower right). Do you see it? It's a flat line! Well, more specifically, you will see a predominantly constant-amplitude oscillator with constant upper and lower limits.

It's the fact that the oscillator's upper and lower limits are constant that shows this is not a true spot market. Why? Because such limits are easily identifiable - you only need to take a look at a long-term chart. And if bidders know in advance what the maximum price is going to be (occasional spikes notwithstanding), they should rationally bid above that known maximum. And if this were a real market driven by supply and demand, the oscillator should have swung higher on some later iteration (once enough bids above the current known maximum accumulated). But it doesn't.
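
If you prefer numbers to eyeballing charts, a crude way to check is to compute the monthly upper and lower envelope of the price history and see whether it moves over time. A minimal sketch, assuming you already have the history as a list of (timestamp, price) pairs, for example parsed from the spot price history API:

    # Crude envelope check: if the market were really clearing on supply and
    # demand, you would expect the monthly max and min to drift over time
    # rather than stay pinned to the same values.
    from collections import defaultdict

    def monthly_envelope(history):
        buckets = defaultdict(list)
        for timestamp, price in history:
            buckets[(timestamp.year, timestamp.month)].append(price)
        return {
            month: (min(prices), max(prices))
            for month, prices in sorted(buckets.items())
        }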

Note that it’s impossible to perform more extensive analysis due to lack of information - we don’t know how many bids are coming in or for what times, and we don’t know the available supply (which can fluctuate independently of bids, since it’s shared with the regular on-demand product). But constant upper and lower limits over the long term are very unlikely in a system driven, more or less, by supply and demand.

You might object to my calling this a flop. Maybe you are right. This pricing mechanism definitely serves a purpose. But the idea of spot instances was to form a spot market - otherwise AWS should have named them “discounted instances.”

I think such a renaming is the right thing to do, and with the knowledge they have accumulated over the last 18+ months, AWS should start a real spot market, one driven by supply and demand, with more market information available than just historical prices published via the API. That’s what pioneers do - they critically analyze the past and continue to build a fascinating future for all of us.

More on cloud pricing is here.

Categories: cloud-computing |
