Troubleshooting

One of the areas of tech ops that doesn’t get its fair share of discussion is troubleshooting. It’s not easy to teach troubleshooting - possibly because how successfully one can troubleshoot a given system largely depends on one’s experience with the system and on the quality of the system’s feedback loops (accuracy and timeliness of monitoring data).

But even though troubleshooting is often more art than science, it has a set of general rules and guidelines, without which troubleshooting is nothing more than guessing. These are all common-sense rules that formally come from Boolean algebra and first-order logic. They universally apply to the first half of troubleshooting - finding what’s wrong.

It’s important to emphasize that troubleshooting activities are always measured against two independent goals - finding and fixing the issue, and doing it as fast as possible. It’s the second goal that makes the use of logic mandatory - you usually can’t afford to mentally build a list of everything that could have gone wrong and then start crossing items off this list one by one. To speed things up, you usually analyze symptoms and check only those hypotheses that plausibly match them. The ability to properly prioritize hypotheses comes purely from experience, but not wasting your time on things that can’t explain what you are observing has a lot to do with logic.

A key aspect of troubleshooting is causality: event A leads to event B, or A causes B, or A implies B (A -> B). A is sufficient for B here, and B is necessary for A.

A -> B is the same as NOT B -> NOT A. Imagine, for example, that A = "filesystem is full" and B = "writes to filesystem are failing." In this case A -> B. Therefore, if writes are working (NOT B), it means the filesystem is not full (NOT A). But if writes are failing (B), it does not automatically mean that the filesystem is full (for example, it could be mounted read-only).

Another way to look at A -> B is (NOT A) OR B. This form can be easier to work with when you are applying negation - see below.

When A is sufficient and necessary for B, it means that A and B are always either both true or both false. Another way of saying it is “A is true if and only if B is true.” This statement formally consists of two implications: A -> B and B -> A.

Then there are important rules about negation that are called De Morgan's laws:

NOT (A OR B) = (NOT A) AND (NOT B)
NOT (A AND B) = (NOT A) OR (NOT B)
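
Because each of these identities involves only two variables, it can be checked by brute force over all four combinations of truth values. The snippet below is a minimal sketch of such a check; the function names are mine and purely illustrative.

```python
from itertools import product


def implies(a: bool, b: bool) -> bool:
    """Material implication A -> B, written as (NOT A) OR B."""
    return (not a) or b


def check_identities() -> bool:
    """Check the contrapositive rule and De Morgan's laws on every truth assignment."""
    for a, b in product([False, True], repeat=2):
        # A -> B is the same as NOT B -> NOT A
        if implies(a, b) != implies(not b, not a):
            return False
        # NOT (A OR B) = (NOT A) AND (NOT B)
        if (not (a or b)) != ((not a) and (not b)):
            return False
        # NOT (A AND B) = (NOT A) OR (NOT B)
        if (not (a and b)) != ((not a) or (not b)):
            return False
    return True


def converse_counterexamples() -> list:
    """Assignments where A -> B and B -> A disagree (the converse is not equivalent)."""
    return [(a, b) for a, b in product([False, True], repeat=2)
            if implies(a, b) != implies(b, a)]
```

`check_identities()` returns True, while `converse_counterexamples()` is non-empty: knowing A -> B tells you nothing about B -> A.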

So how could you apply these rules in practice? First and foremost, never waste your time on checking A if you are observing NOT B and you know that A -> B.

Secondly, never assume that NOT A causes NOT B if you only know that A -> B.

Finally, never assume causality out of mere correlation of two events. If A and B tend to occur together, in bigger systems it’s often hard to determine whether there is any causality and which way it goes - further analysis is required.
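
The first two rules can be sketched as a small hypothesis filter. Everything here is hypothetical: the table of causes and symptoms is made up for illustration (extending the filesystem example), and a real system would need a far richer model.

```python
from __future__ import annotations

# Hypothetical knowledge base: each cause A maps to the symptoms B it always
# produces, i.e. every entry encodes one or more implications A -> B.
KNOWN_IMPLICATIONS = {
    "filesystem is full": {"writes are failing"},
    "filesystem is mounted read-only": {"writes are failing"},
    "disk is unplugged": {"writes are failing", "reads are failing"},
}


def plausible_causes(symptoms_known_absent: set[str]) -> list[str]:
    """Reject every cause A that implies a symptom B known to be absent.

    This applies NOT B -> NOT A. A cause that survives the filter is only
    a candidate: seeing B does not prove A.
    """
    return [
        cause
        for cause, symptoms in KNOWN_IMPLICATIONS.items()
        if not (symptoms & symptoms_known_absent)
    ]
```

For example, if reads are known to work, `plausible_causes({"reads are failing"})` drops "disk is unplugged" without any further checks and leaves two candidates that still have to be investigated.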

The simple rules I mentioned in this post are not a complete guide to troubleshooting, but they can still help you save time and resources - remember that any amount of time you spend investigating a hypothesis that you should have rejected based on pure logic is time wasted.

Categories: devops | distributed |

Amazon EC2 Spot Instances - A Flop?

When Amazon Web Services launched EC2 spot instances in December 2009, I was very excited about the beginnings of a potential revolution in how computing resources could be priced, bought and sold. I have followed this unprecedented phenomenon with great interest, blogging my thoughts along the way.

But today, over 1.5 years since the launch, I am not so sure anymore. While I have no insider information on what goals AWS set out for this program and how it’s been performing against these goals, there is a significant publicly available indicator that convincingly shows that this thing that AWS calls a “spot market” is not performing the function of a spot market (clearing at an equilibrium price). Instead, EC2 spot instances as of today are simply a discounted product with a couple of features removed (similar to an airline selling non-refundable tickets at a discount relative to the price of fully refundable tickets).

There are basically 4 features that AWS strips out of their regular on-demand product to justify a discount:

  • a call to request an instance does not return an object corresponding to the instance you requested; instead you get a spot response object
  • there is an undefined time interval between creation of the spot response object and the instance object (this time interval is usually small but technically it’s not defined)
  • a spot instance even during normal operation may never get started
  • a spot instance, once it’s started, can be terminated by EC2 under certain conditions even during normal operation

The officially stated goal of this discounting is for EC2 to be able to reduce unused capacity while retaining a legal right to reclaim such capacity quickly if a need arises suddenly. As currently designed, it’s a win-win for both EC2 and customers. It’s a terrific idea. But it’s not a market driven by supply and demand.

If you want to see for yourself, please open a new browser tab and head over to http://cloudexchange.org. Pick a product. Wait for a chart to load. Observe a nicely fluctuating price. So far so good.

But now, instead of looking at a weekly chart or monthly chart, look at the all-time chart (click “All” in the lower right). Do you see it? It's a flat line! Well, more specifically, you will see a predominantly constant-amplitude oscillator with constant upper and lower limits.

It's the fact that the oscillator's upper and lower limits are constant that shows this is not a true spot market. Why? Because such limits are easily identifiable - you only need to look at a long-term chart. And if bidders know in advance what the maximum price is going to be (occasional spikes notwithstanding), they should rationally bid above the known maximum. And if this were a real market driven by supply and demand, the oscillator would have swung higher on some later iteration (once enough bids above the known maximum had accumulated). But it doesn't.

Note that it’s impossible to perform a more extensive analysis due to lack of information - we don’t know how many bids are coming in or for what times, and we don’t know the available supply (which can fluctuate independently of bids since it’s shared with the regular on-demand product). But constant upper and lower limits over the long term are very unlikely in a system driven more or less by supply and demand.
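
The "constant limits" observation can be made a bit more concrete with a rough check over a price history. This is only a sketch under assumptions of mine: the window size, the 5% tolerance and both sample series are made up for illustration and are not real spot price data, and occasional spikes are not filtered out.

```python
def window_bounds(prices, window):
    """Return (low, high) for each consecutive, non-overlapping full window."""
    return [
        (min(prices[i:i + window]), max(prices[i:i + window]))
        for i in range(0, len(prices) - window + 1, window)
    ]


def bounds_are_constant(prices, window, tolerance=0.05):
    """True if the per-window lows and highs each stay within `tolerance`.

    Tolerance is relative: (largest - smallest) / largest. Prices must be positive.
    """
    bounds = window_bounds(prices, window)
    lows = [low for low, _ in bounds]
    highs = [high for _, high in bounds]

    def spread(values):
        return (max(values) - min(values)) / max(values)

    return spread(lows) <= tolerance and spread(highs) <= tolerance


# Made-up series: one oscillates between fixed limits, the other drifts upward.
oscillating = [0.030, 0.034, 0.038, 0.034] * 6
drifting = [0.030 + 0.001 * i for i in range(24)]
```

With a window of 4, `bounds_are_constant(oscillating, 4)` is True and `bounds_are_constant(drifting, 4)` is False. A price that responded to accumulating higher bids would be expected to look more like the second series over a long enough period.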

You might object to my calling this a flop. Maybe you are right. This pricing mechanism definitely serves a purpose. But the idea of spot instances was to form a spot market - otherwise AWS should have named them “discounted instances.”

I think such renaming is the right thing to do, and with the knowledge they accumulated in the last 18+ months, AWS should start a real spot market, one driven by supply and demand, with more market information than just historical prices published via API. That’s what pioneers do - they critically analyze the past and continue to build a fascinating future for all of us.

More on cloud pricing is here.

Categories: cloud-computing |

Following on Twitter Using RSS

When I am trying to decide whether to follow a given account on Twitter or not, I usually look at the following three criteria:

  1. whether I am interested in this account's tweets
  2. whether this account's tweets include at least some degree of real-time relevance
  3. whether this account can participate in a discussion

Turns out however there are plenty of accounts that are missing properties #2 and/or #3. Bots that tweet links on a specific topic, for example. Or a celebrity comedian such as @shitmydadsays - this account probably won’t respond to any mentions and the tweets are rarely real-time sensitive.

I found it’s much more efficient to consume such tweets not via Twitter, but via RSS. Twitter used to include a feed on every account’s home page but not anymore. Here is how you can follow a Twitter account via RSS.

Let’s say you want to follow @StephenAtHome.

Open this link in your browser:

https://api.twitter.com/1/users/show.xml?screen_name=StephenAtHome

(If you prefer JSON, use https://api.twitter.com/1/users/show.json?screen_name=StephenAtHome).

Note the user id value in the response - for @StephenAtHome it’s 16303106.

Then add the following feed to your reader:

http://twitter.com/statuses/user_timeline/16303106.rss
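
If you follow many accounts this way, the two URLs can be built with a small helper. This is a minimal sketch: the function names are mine, the URL patterns are the ones quoted above, and the code only builds the strings without fetching anything, so it assumes nothing about how the endpoints respond.

```python
from urllib.parse import urlencode


def user_lookup_url(screen_name: str, fmt: str = "json") -> str:
    """URL of the user lookup call; `fmt` is "json" or "xml"."""
    query = urlencode({"screen_name": screen_name})
    return f"https://api.twitter.com/1/users/show.{fmt}?{query}"


def timeline_feed_url(user_id: int) -> str:
    """URL of the RSS feed for the account with the given numeric user id."""
    return f"http://twitter.com/statuses/user_timeline/{user_id}.rss"
```

For example, `timeline_feed_url(16303106)` produces the feed address shown above for @StephenAtHome.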

This method helps me better manage my Twitter reading experience by ensuring real-time sensitive content and conversations go to Tweetdeck and the rest ends up in Google Reader.

Categories: internet |

Network: From Hardware Past To Software Future

At this year’s GigaOm Structure conference, there was a single event that attracted my interest the most - the network virtualization panel (I didn’t attend the conference, I was only following along over the Internet). It wasn’t just because it involved OpenFlow. I think there is a bigger trend at play here - a lot of functionality that we are used to seeing in network gear is moving to the application level, from hardware to software. OpenFlow is just one of the manifestations of this bigger trend. Let me explain.

Networking was first about moving packets, in large quantities and with low latencies. This demand was met by specialized hardware which I assume was able to perform the job better than a general-purpose machine (“better” in this context means faster, more reliably and more cheaply). From their early days, network vendors have also focused extensively on what developers of modern distributed or hyper-distributed applications focus on today - failure detection and fault tolerance. When application servers were still growing vertically (bigger machines with redundant power supplies, for example), the network was already using distributed gossip-like protocols to exchange information.

Over time, however, more and more services found their home within the network layer - load balancing, virtual addresses, traffic encryption and so on. The idea was to let the application remain unaware of all the complexity on top of which it was sitting.

While this approach worked for a while, it eventually ran into a wall. Firstly, without direct control over the network from applications, setups were extremely inflexible and high-maintenance (dedicated network engineering staff, a change management process in addition to application code rollouts, etc). Secondly, features baked into hardware take longer to tweak (unless the vendor had sufficient foresight to plan for new requirements). Thirdly, hardware is harder to replace from a financial perspective (up-front payment plus maintenance).

The final hit was delivered relatively recently by infrastructure-as-a-service. Flexible IaaS models can’t effectively support customers’ hardware. While there are places where hardware is still very visible to customers (VPN connectivity from customers’ datacenters to their IaaS resources), this is a temporary phenomenon - there are numerous IaaS-compatible software solutions already (please see my disclosure in upper right).

Furthermore, a lot of non-packet-moving functionality can be efficiently delivered in software these days. Look at Heroku - their frontend routing mesh is a massively-scalable load balancer that could be tweaked in real-time. Good luck trying to accomplish the same in hardware.

We currently think of the Ciscos and Junipers of the world as hardware vendors. What they actually are is software companies - they just don't let their software run anywhere except on their own hardware. I bet we are going to see this transformation play out within the next 3-5 years. In the not-so-distant future, network gear will go back to focusing on the one thing it does exceptionally well - moving packets. All other functionality will turn into software products and will be used on application servers.

Categories: cloud-computing | infrastructure-development |

Two Weeks on Twitter Without Reading My Timeline

TL;DR: Twitter’s reading experience is extremely inflexible and does not scale, and the company discourages third-party developers from innovating in the general-purpose client niche. Twitter must significantly improve the reading experience or allow third-party developers more freedom.


In the first half of this month, I decided to perform an experiment. For at least two weeks I didn’t read my Twitter timeline. I only sent an occasional tweet or replied if necessary (the plan was to reply to mentions and to tweets surfaced by multiple searches that I read via RSS).

What could be the point of such a weird arrangement? Public tweets in general form the basis of three distinct activities - publishing, participating in a conversation and reading (Twitter as a whole also supports one-to-one private messaging via DMs).

Each activity delivers its own benefits at the cost of time and mental focus. Reading is unique among them, however, because in a system based on following other accounts (where each account is free to publish anything they want), its signal-to-noise ratio is significantly lower than that of the other activities.

Lower signal-to-noise ratio leads to higher costs (mental focus and time spent). As such, information I obtain via reading my Twitter timeline is relatively costly to me. The goal of the experiment was to see if I could replace Twitter timeline with a less costly way of obtaining the same information.

Turns out I couldn’t do it easily. Reading blogs as I always do and checking Techmeme and Hacker News kept me informed about the most important news, but the color added by many folks I follow on Twitter was missing.

This outcome was somewhat expected. But there was another thing that I realized during the experiment. Twitter the company has stopped paying attention to the reading experience (lists were their last innovation there). Even more worryingly, it is my understanding that they actively discourage third-party developers from building general-purpose Twitter clients. This leaves their official stance - "river of updates" - as the only way of consuming (reading) one's timeline.

Maybe “river of updates” is the best approach for many people (even though I doubt it). Maybe even for most. But saying it’s the best experience for absolutely everyone is a stretch. I want bookmarks (plural is not a typo). I want the ability to sort my timeline by attributes other than time (for example, location or sender). I want an “always on top” attribute. I want filters that can be shared between users. In addition to obvious criteria such as sender, time and location, I want advanced ones such as the current rate of my timeline (how many tweets per minute are appearing in my timeline now) and the send rate of a sender (how many tweets per minute the sender sent on average over the last minute, last 5 minutes and last 15 minutes).
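
To illustrate what I mean by send rate, here is a minimal sketch that computes it from a list of tweet timestamps. It assumes timestamps are plain numbers of seconds and that the client already has them; the function names and the sample data are made up.

```python
def rate_per_minute(timestamps, now, window_minutes):
    """Average tweets per minute over the last `window_minutes` ending at `now`."""
    window_seconds = window_minutes * 60
    recent = [t for t in timestamps if now - window_seconds < t <= now]
    return len(recent) / window_minutes


def send_rates(timestamps, now):
    """Rates over the last 1, 5 and 15 minutes, keyed by window length in minutes."""
    return {minutes: rate_per_minute(timestamps, now, minutes) for minutes in (1, 5, 15)}


# Made-up example: six tweets, three of them within the last minute before t=900.
sample = [10, 500, 700, 870, 880, 890]
```

Here `send_rates(sample, now=900)` gives 3.0 tweets per minute over the last minute, 0.8 over the last 5 minutes and 0.4 over the last 15, which is the kind of burst a filter could act on. The same function applied to every tweet in a timeline gives the current rate of the timeline.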

Granted, I don’t mind if Twitter itself doesn’t feel that these are features worthy of their official client. But if that’s the case, Twitter must not discourage third-party clients either. And if Twitter sticks to its guns on this, I hope it won’t be too long before it’s overtaken by someone else who will provide a better reading experience.

Categories: internet |
