Crash vs Connectivity Loss in Distributed Applications

10 Oct 2008

Designing a distributed application to be fault tolerant is one of my favorite things to do at work. First of all, the application should never fail under normal circumstances. Don’t believe people who tell you that circumstances are never normal - if that is the case, a fault-tolerant design is the least of your worries, and you need to get the overall environment at least somewhat stable first. But circumstances don’t stay unchanged for long - something will happen sooner or later. So you want to anticipate as many failure scenarios as you can think of, and for each one work out how the event will affect your application, how the app will find out that the event occurred, and what it should do about it.

But that’s not what I wanted to write about. As you might imagine, I read a lot on the subject - learning from other people’s mistakes and experiences in the distributed systems world has never been easier, thanks to blogging and a general tendency towards openness and disclosure. In the stream of reports that I read, the most frequent failure scenarios can typically be categorized as a “hardware crash” or a “software crash.” Something was running fine, and then - BAM! - it crashed. It no longer exists. Nothing can talk to it anymore. Nothing can ask it how it’s doing, or what the last thing it did was. It crashed. Died. Disappeared.

But is a crash the worst that could happen? Unfortunately not. Connectivity loss is much trickier to deal with. Your Nagios thinks your web server crashed because it’s not responding? You can’t tell - there is not enough information. All you know is that Nagios could not connect to the web server. It doesn’t mean that the server crashed. Or you can’t connect to your messaging backend - did it crash? Not necessarily. All you know at the moment is that connectivity between you and the remote end is broken.

So why do I say that connectivity loss is way worse than a crash?

A crash is the same crash to all clients: every one of them will fail to connect. Connectivity loss, however, can affect only a fraction of your client base. So half of your clients fail over to the secondary, while the other half are still attached to the primary. And you neglected to implement an alarm for that - and now your customers see only half of your inventory on the site? Oops.
A crash is usually a terminal state: your application can't easily leave it on its own. Connectivity loss is not terminal at all - connectivity can be restored without your direct intervention. The cause can range from route convergence after a backup link comes up to network congestion easing after a spike in traffic. Are you prepared for that?
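
The difference between the two failure modes can be sketched as a check across several vantage points. This is only an illustration under assumptions: the classify function, the vantage point names, and the result symbols are all made up for this example, and how each vantage point actually probes the service is left out.

```ruby
# Classify a service's state from probe results gathered at several vantage
# points. `observations` maps a (hypothetical) vantage point name to true when
# that vantage point could reach the service, and to false otherwise.
def classify(observations)
  raise ArgumentError, "no observations" if observations.empty?

  reachable = observations.values.count(true)
  if reachable == observations.size
    :reachable_everywhere
  elsif reachable.zero?
    # Still ambiguous: from the outside, a crash and a complete network cut
    # look exactly the same.
    :unreachable_everywhere
  else
    # Some clients can connect and some cannot - this cannot be a plain crash.
    :partial_connectivity_loss
  end
end

p classify({ "monitor-a" => false, "monitor-b" => true })
# prints :partial_connectivity_loss
```

The partial case is the one worth an alarm of its own, because it is the case in which some clients fail over while the rest stay where they are.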

And here is yet another twist. No matter what your position is on cloud computing, it is here to stay. And it is only a matter of time before many more of the services you rely on for your operations are scattered all over the world (or space, but that’s later). Connectivity loss will occur far more often than crashes, and unless you start approaching it as a different problem, you might be in for a big surprise.

Categories: distributed |

Rails update_attribute vs update_attributes

22 Sep 2008

In Rails, the update_attribute method bypasses model validations, while update_attributes and update_attributes! fail (returning false or raising an exception, respectively) if the record you are trying to save is not valid.

This means that if at some point during the project you adjust validations so that some records that used to be valid are now invalid, expect your code to stop working for no obvious reason. This happens because update_attributes no longer saves those old records, and unless you check its return value, you’ll never know that it’s failing.
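
The failure mode can be reproduced without Rails using a small stand-in. Note the assumptions: FakeRecord below is a hypothetical plain-Ruby class, not ActiveRecord. It only mimics the behaviour described above, and its "name can't be blank" rule plays the role of a validation added late in the project.

```ruby
# Plain-Ruby stand-in for a model (NOT ActiveRecord). `persisted` plays the
# role of the database row; `attributes` is the in-memory state.
class FakeRecord
  attr_reader :attributes, :persisted, :errors

  def initialize(attrs)
    @attributes = attrs.dup
    @persisted = attrs.dup
    @errors = []
  end

  # The validation that was introduced after old records were created.
  def valid?
    @errors = []
    @errors << "name can't be blank" if @attributes[:name].to_s.strip.empty?
    @errors.empty?
  end

  # Mimics update_attribute: saves without running validations.
  def update_attribute(key, value)
    @attributes[key] = value
    @persisted = @attributes.dup
    true
  end

  # Mimics update_attributes: returns false and saves nothing when invalid.
  def update_attributes(attrs)
    @attributes.merge!(attrs)
    return false unless valid?

    @persisted = @attributes.dup
    true
  end

  # Mimics update_attributes!: raises instead of returning false.
  def update_attributes!(attrs)
    return true if update_attributes(attrs)

    raise ArgumentError, "Validation failed: #{@errors.join(', ')}"
  end
end

# An old record that predates the new validation.
old = FakeRecord.new({ name: "", views: 1 })

if old.update_attributes({ views: 2 })
  puts "saved"
else
  # Without this branch the failure goes unnoticed.
  puts "not saved: #{old.errors.join(', ')}"
end
```

Checking the return value as above, or calling update_attributes! and letting the exception surface, turns the silent failure into a visible one.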

I stumbled upon this problem at least twice in the last month, and decided to write it up so that I finally remember to do it right next time.

Categories: ruby |

PalmOS Blazer-Friendly Browsing with GWT

18 Sep 2008

Those of us who still have a PalmOS-based device and use its Blazer browser will probably know that Blazer may take some time to render complex pages, and the end result might not even be readable on a small screen. I recently found a solution to this problem that works great, at least for me.

On your phone, head over to http://www.google.com/gwt/n and enter the URL you are trying to reach. GWT (which stands for Google Web Toolkit) will fetch the content and optimize it for your mobile browser. Additionally, it will adjust all links to also go through GWT, which makes surfing the Internet with Blazer not painful at all.

For example, I like checking Techmeme on the train on my way to work. They offer a mini version, which renders well on my Palm Centro from Sprint. But if I want to follow a story and click on a link, I usually get a page that is not optimized for mobile (there are several exceptions that detect the user agent and adjust content formatting). Instead, in my Blazer bookmarks, I have this - http://www.google.com/gwt/n?u=http%3A%2F%2Ftechmeme.com. From this page, I can jump to any story and get the content nicely formatted for my Centro.
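
A bookmark like this can be built for any site by URL-encoding the target address and passing it as the u parameter. Here is a minimal sketch; the base URL and parameter name are taken from the bookmark above, while transcoder_url is a hypothetical helper name used only for this example.

```ruby
require "uri"

# Base URL and the `u` parameter come from the bookmark in the post.
TRANSCODER_BASE = "http://www.google.com/gwt/n"

# Hypothetical helper: percent-encode the target so it can travel safely
# inside the query string.
def transcoder_url(target)
  "#{TRANSCODER_BASE}?u=#{URI.encode_www_form_component(target)}"
end

puts transcoder_url("http://techmeme.com")
# prints http://www.google.com/gwt/n?u=http%3A%2F%2Ftechmeme.com
```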

UPDATE: It looks like I might have confused Google Wireless Transcoder (GWT) with Google Web Toolkit (GWT).

Categories: uncategorized |

SEO and 301 Redirect

09 Sep 2008

I was under the assumption that when a site moves to a new domain or URL space, the best thing to do from an SEO perspective is to put the site up at the new place and set up the old site to respond with HTTP 301 redirects (Moved Permanently).
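
For illustration, the old-site side of such a setup can be sketched as a Rack-style application, that is, a callable that takes the request environment and returns a status, headers, and a body. This is not how this site is actually configured: NEW_BASE is a placeholder domain and OLD_SITE_REDIRECT is a made-up name.

```ruby
# Placeholder for the new address; substitute the real one.
NEW_BASE = "http://new.example.com"

# Rack-style app: redirect every request to the same path on the new site.
OLD_SITE_REDIRECT = lambda do |env|
  path = env["PATH_INFO"].to_s
  query = env["QUERY_STRING"].to_s
  location = NEW_BASE + path
  location += "?#{query}" unless query.empty?

  # 301 tells browsers and crawlers that the move is permanent.
  [301, { "location" => location, "content-type" => "text/plain" }, ["Moved Permanently\n"]]
end
```

Preserving the path and query string means each old URL points at its counterpart on the new site rather than at the home page.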

I did this a couple of weeks ago when I was moving this site to its current address, but noticed today that my old address still shows up in Google at the top of search results. I got curious and checked both Yahoo and MSN, and both of them correctly omit the links that have been redirected.

Am I missing anything, or is it a bug in GoogleBot?

Categories: blogging |

Slides for my AMQP/RabbitMQ Talk

31 Jul 2008

I recently gave a talk titled Introduction to AMQP Messaging with RabbitMQ at a big web technology company in Chicago. You can now see the slides here on Slideshare, or download the PDF.

Categories: rabbitmq |
