Most folks in the industry are familiar with “distributed applications.” If an app’s components run on multiple hosts and need to communicate with each other over a network, the app is said to be distributed.
Distributed applications are known for the complexity of making sure all components are on the same page about what’s going on around them. Hardware failures, network failures and operator errors can all cause chaos; distributed applications anticipate these exceptional situations and try to deal with them.
Up until now, the network piece of the puzzle has usually been under the application owner’s control - it could be a LAN, or it could be a leased line to a remote datacenter. Occasionally, a VPN would be used to provide a dedicated communication channel between locations over the public Internet, but it was rarely used for important stuff - a mission-critical application would usually get a leased line.
With the advance of public clouds such as Amazon EC2 and Google AppEngine, however, these notions are changing. One day you may decide to leverage each cloud’s strengths and distinct features to build your app, or you may want to avoid cloud lock-in or provide redundancy. In short, you may want to multi-source your infrastructure.
Your multi-sourced infrastructure will of course be a distributed application. But there is a significant difference between this and old-style distributed apps - this time you no longer have network connectivity under your control. As a result, you will face three phenomena that substantially complicate the use of today’s distributed algorithms - uneven bandwidth, uneven latency and an increased probability of connectivity loss (I blogged about the latter here).
And this is what I call a hyper distributed application. In other words, a hyper distributed application is a distributed app which runs on a network with uneven bandwidth, uneven latencies and an increased probability of connectivity loss (as measured against a regular LAN), usually outside of the application owner's control (for example, the Internet).
One example of a hyper distributed application is VPN-Cubed, which we at CohesiveFT created to address the emerging need to multi-source infrastructure. By the very nature of the functionality it provides, its components (we call them VPN-Cubed Managers - they act as virtual routers and switches) are sometimes distributed over a LAN, sometimes over a WAN, sometimes both. Communications between manager 1 and manager 2 can be fast and reliable, while those between manager 1 and manager 3 are slow and less reliable, with more frequent resets. Or manager 3 may simply disappear (as seen by its peers) - no, it doesn’t have to be down due to a crash; it can simply mean that its network connection to the outside world went down, possibly temporarily.
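To make the uneven-links point concrete, here is a minimal sketch in Python of how a component might keep a separate timeout for each peer instead of one global value. This is not how VPN-Cubed does it; the PeerLink class, the peer names and the thresholds are all made up for illustration. The smoothing constants are borrowed from the standard TCP retransmission timer (timeout = smoothed RTT + 4 x RTT deviation). Note that a peer with missed probes is reported as "unreachable", not "down", because from the outside the two cases look the same.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PeerLink:
    """Hypothetical per-peer link tracker (illustration only)."""
    name: str
    srtt: Optional[float] = None  # smoothed round-trip time, seconds
    rttvar: float = 0.0           # smoothed RTT deviation, seconds
    missed: int = 0               # consecutive probes with no reply

    def record_reply(self, rtt: float) -> None:
        """Fold a measured round-trip time into the running estimates."""
        if self.srtt is None:
            # First sample seeds both estimates.
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # Same weights as the TCP retransmission timer (1/4 and 1/8).
            self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - rtt)
            self.srtt = 0.875 * self.srtt + 0.125 * rtt
        self.missed = 0

    def record_miss(self) -> None:
        """Note that a probe went unanswered within the timeout."""
        self.missed += 1

    def timeout(self, floor: float = 0.05) -> float:
        """Per-link timeout: slow or jittery links get more patience."""
        if self.srtt is None:
            return 1.0  # assumed default before any measurement
        return max(floor, self.srtt + 4 * self.rttvar)

    def status(self, suspect_after: int = 3) -> str:
        """Classify the link by consecutive misses; never claims 'down'."""
        if self.missed == 0:
            return "reachable"
        if self.missed < suspect_after:
            return "degraded"
        return "unreachable"


# Manager 2 sits on the same LAN, manager 3 is across the Internet.
lan = PeerLink("manager-2")
for rtt in [0.001] * 5:
    lan.record_reply(rtt)

wan = PeerLink("manager-3")
for rtt in [0.08, 0.2, 0.12]:
    wan.record_reply(rtt)

print(lan.name, lan.timeout(), lan.status())
print(wan.name, wan.timeout(), wan.status())
```

With these made-up samples the WAN peer ends up with a much longer timeout than the LAN peer, so a slow answer from manager 3 is not mistaken for a failure, while the LAN peer is still held to a tight deadline.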
Hyper distributed applications are relatively rare, because most architects tend to avoid building them if they can. For example, Amazon EC2 has 2 regions - US and EU. Each region is a distinct EC2 system, with its own API endpoint, its own AMI IDs, kernel IDs, security groups and keypairs. There is no replication or conflict resolution between the regions - they are totally independent of each other. Why? Because it would be quite difficult to interconnect them into a single entity over the public Internet. (I won’t be surprised if it gets implemented in the future though.)
Another example showing that hyper distributed applications are a distinct breed comes from a Facebook Engineering blog post titled Scaling Out:
This setup works really well with only one set of databases because we only delete the value from memcache after the database has confirmed the write of the new value. That way we are guaranteed the next read will get the updated value from the database and put it in to memcache. With a slave database on the east coast, however, the situation got a little tricky.
When we update a west coast master database with some new data there is a replication lag before the new value is properly reflected in the east coast slave database. Normally this replication lag is under a second but in periods of high load it can spike up to 20 seconds.
It nicely illustrates how the hyper distributed nature of the application adds complexity on top of what a plain distributed app already has.
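Here is a toy simulation in Python of one way that replication lag can bite. It is my own simplification, not Facebook's code: the LaggedReplica class, the simulate function, the key name and the timings are all invented for the example, and time is just a number passed around by hand. The scenario: a write goes to the master, the cached value is deleted, and a reader on the far side misses the cache and reads from a replica that may not have received the write yet.

```python
class LaggedReplica:
    """Hypothetical read replica that applies writes after a fixed lag."""

    def __init__(self, lag, initial):
        self.lag = lag
        self.data = dict(initial)
        self.pending = []  # queued (apply_at, key, value) tuples

    def replicate(self, now, key, value):
        """Queue a write from the master; it becomes visible at now + lag."""
        self.pending.append((now + self.lag, key, value))

    def read(self, now, key):
        """Apply every queued write whose lag has elapsed, then read."""
        still_pending = []
        for apply_at, k, v in self.pending:
            if apply_at <= now:
                self.data[k] = v
            else:
                still_pending.append((apply_at, k, v))
        self.pending = still_pending
        return self.data.get(key)


def simulate(lag, read_delay):
    """Write at t=0, invalidate the cache, then read at t=read_delay.

    Returns the value the reader sees, which is also the value that
    ends up back in the cache.
    """
    master = {"name": "old"}
    replica = LaggedReplica(lag, master)
    cache = {"name": "old"}

    # t = 0: update the master, ship the change, delete the cached value.
    master["name"] = "new"
    replica.replicate(0.0, "name", "new")
    cache.pop("name", None)

    # t = read_delay: cache miss, so read the replica and refill the cache.
    if "name" not in cache:
        cache["name"] = replica.read(read_delay, "name")
    return cache["name"]


print(simulate(lag=0.0, read_delay=0.1))   # replica is already caught up
print(simulate(lag=20.0, read_delay=0.1))  # replica is still 20 s behind
```

In the first run the reader gets the new value. In the second run the reader gets the old value and, worse, puts it back into the cache, where it stays until something deletes it again. The invalidation logic is identical in both runs; only the lag changed.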
In conclusion, I would like to propose separating distributed applications that run on top of networks with uneven bandwidth and uneven latencies into a category of their own (I don’t care much if they end up being called hyper distributed or something else), and to start building up research and practical approaches focusing specifically on this area.
P.S. Also consider the future: when we reach interplanetary or intergalactic communications, you know that latencies and bandwidth in space will not be (initially) the same as on our planet Earth. Better start working on this research now in order to be prepared…