Shiny Cloud APIs - Necessary But Not Sufficient

In the stream of non-stop cloud computing chatter surrounding VMworld 2009, which wrapped up last week, I noticed a pattern: folks were paying a disproportionate amount of attention to APIs, API portability and API standardization, as opposed to the actual technology concepts and constructs that are going to power new clouds.

The API is indeed important - I have blogged about it before. But so is the curb appeal of a house you might be looking to buy, and you are not going to buy a house just because it looks nice from the outside, right? You will want to consider the interior, the location, and many other factors before making a decision. Similarly, the API alone (or portability of an API between multiple vendors) is not nearly enough to get you to choose one cloud over its competitor. Other things - features, infrastructure decisions, bandwidth, pricing, tech ops, technical support - play a significant role (or at least should) in your decision making.

A well-thought-out, scalable, responsive and easy-to-use API is a NECESSARY condition of a successful cloud, but not a SUFFICIENT one.

This means that a successful cloud implies a good API, but not vice versa. Another way to read the same statement is that a bad API implies an unsuccessful cloud (A->B is equivalent to (not B)->(not A)).
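The contrapositive equivalence in that parenthetical can be checked mechanically. A throwaway sketch (the propositions and names are mine, purely for illustration):

```python
from itertools import product

def implies(a: bool, b: bool) -> bool:
    """Material implication: A -> B is false only when A is true and B is false."""
    return (not a) or b

# A = "the cloud is successful", B = "the cloud has a good API".
# A -> B must agree with (not B) -> (not A) on every truth assignment.
for a, b in product([False, True], repeat=2):
    assert implies(a, b) == implies(not b, not a)

print("A -> B is equivalent to (not B) -> (not A)")
```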

I am very excited about recent developments in the infrastructure-as-a-service space, but I would like to see the core concepts and technologies that power clouds discussed as much as new APIs.

Categories: cloud-computing |

The Concept of Hyper Distributed Application

Most folks in the industry are familiar with “distributed applications.” If an app's components run on multiple hosts and need to communicate with each other over the network, the app is said to be distributed.

Distributed applications are known for the complexity of assuring that all components are on the same page as to what's going on around them. Hardware failures, network failures and operator errors can all cause chaos; distributed applications must foresee these exceptional situations and know how to deal with them.

Up until now, the network piece of the puzzle has usually been under the application owner's control - it could be a LAN, or it could be a leased line to a remote datacenter. Occasionally, a VPN would be used to provide a dedicated communication channel between locations over the public Internet, but its use was rarely focused on important stuff - a mission-critical application would usually get a leased line.

With the advance of public clouds such as Amazon EC2 and Google App Engine, however, these notions are changing. One day you may decide to leverage each cloud's strengths and distinct features to build your app, or you may want to avoid cloud lock-in or provide redundancy. In short, you may want to multi-source your infrastructure.

Your multi-sourced infrastructure will of course be a distributed application. But there is a significant difference between this and old-style distributed apps - this time you no longer have network connectivity under your control. As a result, you will face three significant phenomena that substantially complicate using today's distributed algorithms: uneven bandwidth, uneven latency and an increased probability of connectivity loss (I blogged about the latter here).

And this is what I call a hyper distributed application. In other words, a hyper distributed application is a distributed app which runs on a network with uneven bandwidth, uneven latencies and an increased probability of connectivity loss (as measured against a regular LAN), usually outside of the application owner's control (for example, the Internet).

One example of a hyper distributed application is VPN-Cubed, which we at CohesiveFT created to address the emerging need to multi-source infrastructure. By the very nature of the functionality it provides, its components (we call them VPN-Cubed Managers - they act as virtual routers and switches) are sometimes distributed over a LAN, sometimes over a WAN, sometimes both. Communications between manager 1 and manager 2 can be fast and reliable, while between managers 1 and 3 they are slow and less reliable, with more frequent resets. Or manager 3 may simply disappear (as seen by its peers) - no, it doesn't have to be down due to a crash; it can simply mean that its network connection to the outside world went down, possibly temporarily.
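A component talking to a peer over such a network has to treat a failed call as probably transient rather than fatal. A minimal sketch of that defensive pattern - retry with exponential backoff and jitter - not CohesiveFT code; all names here are hypothetical:

```python
import random
import time

def call_peer(send, max_tries=5, base_delay=0.1):
    """Call a remote peer, retrying with exponential backoff and jitter.

    On a LAN a single failure usually means the peer is down; over the
    public Internet it often means transient connectivity loss, so we
    retry a few times before declaring the peer unreachable.
    """
    for attempt in range(max_tries):
        try:
            return send()
        except ConnectionError:
            if attempt == max_tries - 1:
                raise
            # Back off exponentially, with jitter to avoid retry storms.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

# Simulate a flaky WAN link that resets twice, then succeeds.
attempts = {"n": 0}
def flaky_send():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("link reset")
    return "pong"

result = call_peer(flaky_send, base_delay=0.001)
print(result)  # -> pong
```

The same call against a healthy LAN peer succeeds on the first try; the retry loop only costs anything on the unreliable links.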

Hyper distributed applications are relatively rare, because most architects tend to avoid this pattern if they can. For example, Amazon EC2 has two regions - US and EU. Each region is a distinct EC2 system, with its own API endpoint, its own AMI IDs, kernel IDs, security groups and key pairs. There is no replication or conflict resolution between the regions - they are totally independent of each other. Why? Because it would be quite difficult to interconnect them into a single entity over the public Internet. (I won't be surprised if it gets implemented in the future, though.)

Another example showing that hyper distributed applications are a distinct breed comes from the Facebook Engineering blog post titled Scaling Out:

This setup works really well with only one set of databases because we only delete the value from memcache after the database has confirmed the write of the new value. That way we are guaranteed the next read will get the updated value from the database and put it in to memcache. With a slave database on the east coast, however, the situation got a little tricky.

When we update a west coast master database with some new data there is a replication lag before the new value is properly reflected in the east coast slave database. Normally this replication lag is under a second but in periods of high load it can spike up to 20 seconds.

This nicely illustrates how the hyper distributed nature of the application adds complexity on top of what a plain distributed app already has.
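The tricky window Facebook describes can be reproduced in a few lines. A toy model (not Facebook's actual stack - the classes and numbers are made up) where the cache is invalidated on write but a nearby read hits a lagging replica:

```python
import time

class LaggedReplica:
    """Toy master/replica pair where the replica applies writes after a delay."""
    def __init__(self, lag):
        self.lag = lag
        self.master = {}
        self.pending = []   # (apply_at, key, value) awaiting replication
        self.replica = {}

    def write(self, key, value):
        self.master[key] = value
        self.pending.append((time.monotonic() + self.lag, key, value))

    def read_replica(self, key):
        # Apply any replication events whose delay has elapsed.
        now = time.monotonic()
        still_pending = []
        for t, k, v in self.pending:
            if t <= now:
                self.replica[k] = v
            else:
                still_pending.append((t, k, v))
        self.pending = still_pending
        return self.replica.get(key)

cache = {}
db = LaggedReplica(lag=0.05)

db.write("user:1", "old")
time.sleep(0.06)            # replication catches up
db.read_replica("user:1")

db.write("user:1", "new")   # update the master ...
cache.pop("user:1", None)   # ... and invalidate the cache, as in the post

# A read served by the replica BEFORE replication catches up repopulates
# the cache with the stale value - exactly the window described above.
cache["user:1"] = db.read_replica("user:1")
print(cache["user:1"])  # -> old
```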

In conclusion, I would like to propose separating distributed applications that run on top of networks with uneven bandwidth and uneven latencies into their own category (I don't care much whether they end up being called hyper distributed or something else), and building up research and practical approaches focusing specifically on this area.

P.S. Also consider the future: when we reach interplanetary or intergalactic communications, latencies and bandwidth in space will not (initially) be the same as on Earth. Better to start working on this research now in order to be prepared…

Categories: distributed | software-engineering |

Electrical and Plumbing Analogies in Application Monitoring

Water and electricity are two components without which a modern home can’t function well. Both are provided as a utility, and both have strictly defined access points from which they can be consumed - taps for water and outlets for electricity.

But there are also differences. Every child knows that an electric shock can cause injury even after a short exposure - hence most perceive electricity as a powerful force. This force, however, has a binary switch attached to it, in the form of switches, circuit breakers and the distribution board. Turn it on and electricity flows; turn it off and it doesn't. When off, electricity can't leak, by design.

Water, on the other hand, is not perceived as such a great force, because damage from a short exposure is unlikely to be severe. Additionally, indoor plumbing has no binary on-off switches - it is measured by degrees of “open” or “closed,” “hot” or “cold.” As a result, leaks can and do occur from time to time. And it's these leaks that have the potential to do costly damage over time, yet they are still not perceived as dangerous enough to warrant immediate attention.

There are many things in software applications that are binary in nature - a web server daemon is up or down, for example. We all take these all-or-nothing components seriously, because when it's nothing, the app is down.

But we have our fair share of potentially leaky stuff as well - memory leaks, file descriptor leaks, network connection leaks, and so on. In other words, things that don't happen instantaneously but build up over time, often hidden behind other, bigger components. Some of us don't take these issues seriously enough because they lack the perceived power to cause significant damage quickly. And that's a mistake.

When monitoring a component of the “electricity” type, the most common test is to send a probe - if it returns OK, the component is up (“active monitoring,” “active polling” or simply “polling”). But this doesn't work when monitoring a component of the “plumbing” type - if water is flowing, that doesn't mean there is no leak. In this case, a set of alarms instrumented into the component itself is a better fit.
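One way to instrument such an alarm is to sample a resource gauge (open file descriptors, connections, heap bytes) and fire on sustained growth rather than on an up/down probe. A minimal sketch under that assumption - the class and thresholds are mine, not from any particular monitoring tool:

```python
class LeakAlarm:
    """Alarm instrumented into the component itself: fires when a resource
    gauge grows monotonically across N consecutive samples - a leak
    signature that a simple up/down probe would never see."""
    def __init__(self, window=5):
        self.window = window
        self.samples = []

    def record(self, value):
        """Add a sample and report whether a leak is currently suspected."""
        self.samples.append(value)
        self.samples = self.samples[-self.window:]
        return self.leaking()

    def leaking(self):
        if len(self.samples) < self.window:
            return False  # not enough history yet
        return all(b > a for a, b in zip(self.samples, self.samples[1:]))

healthy = [100, 103, 99, 104, 101]   # fluctuates around a baseline: no leak
leaky   = [100, 108, 117, 129, 140]  # monotonic growth: leak

alarm = LeakAlarm(window=5)
print(any(alarm.record(v) for v in healthy))  # -> False

alarm = LeakAlarm(window=5)
print(any(alarm.record(v) for v in leaky))    # -> True
```

The point is the shape of the check, not the exact rule: the alarm lives next to the component and watches a trend, while polling only asks "are you up right now?"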

The sooner we realize the different nature of the various components of our applications and the need to monitor them differently, the higher uptime we are going to achieve for our applications.

Categories: software-engineering |

New Era in Internet Search - Google vs Bing

This week marks the beginning of a new era in Internet search. For the first time in modern Internet history, there is a number 2 with a sizable market share. This is going to get interesting once Bing and Yahoo! finish their integration.

I switched to Google Search many years ago because it was the best - its results were the most relevant, its query language was the most predictable, and it was fast. In other words, it allowed me to find things more easily, faster and with the least amount of effort. Search was the first social application on the Web - by clicking on a search result you let the search engine know “this is what I was looking for,” a form of user participation which allows users to influence (“vote on”) the selection of content.

I now feel, however, that Bing's search is as good as Google's. When I end up working on someone else's machine without Firefox, I end up using IE. And while at first I always went to google.com explicitly before submitting my search, on a couple of occasions I got lazy and tried Bing (via the search textbox in the upper-right-hand corner). And surprise - the results didn't suck.

While up until now competition was based on quality and technology, it is now shifting to marketing, distribution, conversions, churn rates, and so on - because quality (I think) is pretty close and is no longer a distinguishing factor (in economics speak, search quality is no longer a competitive advantage). Interestingly, if you read the Wikipedia article on Competitive Advantage, you will see that “many forms of competitive advantage cannot be sustained indefinitely” - exactly what happened here.

I also think that this event re-emphasizes the increased importance of Google's relationship with Mozilla (which makes the #2 browser) - there is no way IE will default to anything but Bing. It also underscores the importance of Google's investment in Chrome, their own browser platform. If I were Mozilla, I would try to extract better terms from Google next time they renegotiate the contract. It's a win-win for both.

My final observation has nothing to do with search. With the technological competitive advantage gone, the Google vs Bing showdown is now going to be about execution and - most importantly - effectiveness at leveraging the network effect. I find this very interesting, because the same type of showdown may occur in other areas. Take micro-blogging, for example. Twitter is currently by far the #1 platform, and not due to a technological competitive advantage (their technology is complex and their traffic is huge, but it is not so mathematically complex as to require PhDs to figure it out). As a result, all potential entrants to the microblogging space face a single huge obstacle - overcoming Twitter's huge network effect. I suspect that Bing vs Google will let us observe and study whether and how a network effect can be tamed and ultimately reversed.

In other words, I am most interested in seeing whether a network effect on the social Internet can be sustained indefinitely as a competitive advantage or not.

Up until now, I can’t name a case when a social Internet site which dominated its field got pushed aside and slipped from #1. In all cases up until now, new entrants carve up a niche and end up dominating it, while the original #1 remains overall #1. Or did I miss an example - could you help in the comments below?

It’s about to get very interesting.

Categories: internet |

Evaluating Cloud Computing from Buy vs Rent Perspective

What is driving people, projects and organizations to adopt cloud computing?

There is no single answer. Everyone's situation is different, and everyone assigns different weights to different factors. But what is common in “to cloud or not to cloud” decision making is that fundamentally it is like the buy vs rent decision in housing.

*aaS is all about rent vs buy - rent is housing-as-a-service, pay-as-you-go, after all. You either want to be able to get out fast, or staying in one place for a long time doesn't scare you. Rent might be more expensive over time and might constrain you in certain ways (I have never met a landlord who would let tenants paint the walls bright green, for example), but on the other hand it does not require an up-front payment and allows a certain degree of flexibility. Buying involves a commitment, but may provide some benefits (like the ability to do that painting project).
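The "more expensive over time" part of that trade-off is just a break-even calculation. A trivial sketch with entirely made-up numbers, purely to show the shape of the comparison:

```python
def breakeven_months(buy_upfront, buy_monthly, rent_monthly, horizon=120):
    """First month at which the cumulative cost of buying drops below
    renting, or None if renting stays cheaper over the whole horizon."""
    for m in range(1, horizon + 1):
        if buy_upfront + buy_monthly * m < rent_monthly * m:
            return m
    return None

# Owned hardware: big up-front spend, low monthly running cost.
# Cloud "rent": zero up-front, higher effective monthly cost.
print(breakeven_months(buy_upfront=50_000, buy_monthly=1_000,
                       rent_monthly=4_000))  # -> 17
```

Of course, the real decision also prices in the flexibility and commitment factors above, which is exactly why no single number settles it.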

The key observation is that there is no single factor that would simplistically let you choose one over the other. Rent vs buy decisions are based on personal preferences, current situation, future plans and surrounding circumstances - all subjective. In nearly identical situations, one person will choose to rent and another will choose to buy - and both will end up making the right decision for themselves.

Similar logic should apply to cloud computing decisions. A popular phrase is “it depends on the workload” - which is another way of saying it depends on your use case, what you are trying to accomplish and which obstacles you are trying to overcome. It also depends on what kind of company you are, how you have done your infrastructure projects in the past, what your plans are, and so on.

There is no right or wrong in either case, and in spite of what some would like you to believe, cloud computing is not right for every use case. So focus on the right tool for the job at hand, with an eye toward the future.

Categories: cloud-computing |
