Designing a distributed application to be fault tolerant is one of my favorite things that I often get to do at work. First of all, it should never fail under normal circumstances. Don’t believe people who tell you that circumstances are never normal - if it’s the case, a fault-tolerant design is the least of your worries and you need to get overall environment to be at least somewhat stable first. But then, circumstances don’t remain unchanged for too long - something will happen sooner or later. So you want to expect as many possible failure scenarios as you can think of, try to anticipate how the event will impact your application, how the app will find out that the event occurred, and what to do about it.
But it’s not what I wanted to write about. As you might imagine, I read a lot on the subject - learning from other people’s mistakes and experiences in distributed systems world has never been easier, thanks to blogging and general tendency towards openness and disclosure. In all this stream of data that I get, the most frequent failure scenarios can by typically categorized as a “hardware crash” or “software crash.” Something was running fine, and then - BAM! - it crashed. It no longer exists. Nothing can talk to it anymore. Nothing can ask it how it’s doing, or what was the last thing it did. It crashed. Died. Disappeared.
But is crash the worst that could happen? Unfortunately not. Connectivity loss is way more tricky to deal with. Your Nagios thinks your web server crashed because it’s not responding? Can’t tell - not enough information. Everything you know is that nagios could not connect to the web server. It doesn’t mean that the latter crashed. Or you can’t connect to your messaging backend - did it crash? Not necessarily, everything you know at the moment is that connectivity between you and remote end is broken.
So why do I say the connectivity loss is way worse than crash?
- Crash is the same crash to all clients. All clients will fail to connect. Connectivity loss however can impact only a fraction of your client base. So half of your clients are failing over to the secondary, while the other half are still attached to primary. And you neglected to implement an alarm for that - and now your customers see only half of your inventory on the site? Oops.
- Crash is usually a terminal state, as in your application can't easily leave a crash state on its own. And what about connectivity? Oh, not at all - connectivity can be restored without your direct intervention. It can range from route convergence after a backup link gets up, to easing network congestion after a spike in traffic. Are you going to be prepared?