Developer's Attempt to Define Cloud Computing

I have been closely following cloud computing for many months now. As a developer, I often get frustrated by the lack of a clear and widely accepted definition of what cloud computing actually is. This is a problem, because without a definition, every imaginable operation performed over the Internet suddenly becomes a “cloud.” It dilutes the value and obscures the innovation that the cloud computing concept stood for in its early days.

The term “cloud computing” consists of two words: “cloud” and “computing.”

Cloud

Traditionally, an image of a cloud is used on network diagrams to denote an opaque network entity (for example, the Internet or an MPLS cloud). Opaque in this case means that to an enduser it’s a black box - you hook up inputs and outputs as directed, and you get functionality. In addition to opaqueness, there are other, less obvious properties that clouds on network diagrams usually possess:

  • the cloud is multi-tenant (many endusers use the same one)
  • cloud resources (links, bandwidth) are not dedicated (each enduser gets to use some up to their quota; if user A no longer uses a resource, the cloud can reassign it to user B)
  • the cloud is outside the enduser's full control

Computing

Firstly, allow me to note that I strongly disagree with the purely linguistic approach here - to linguists, “computing” and “computer” derive from the same root, so “computing” is any action that involves a “computer.” I disagree with this because it’s too general to be useful for our purposes.

I define “computing” as running user-provided software. The software doesn’t have to be developed by the user - one can download it from the web and run it, but it’s still the user who provides it in that case. In contrast, if you use a web site to perform a certain operation, you also use software - but that software is developed and operated by the web site, hence it’s a service, not computing.

My Definition of Cloud Computing

Cloud computing is a form of using opaque, multi-tenant networks of computers outside the enduser's full control, with the primary goal of running software provided by the enduser, in which computational resources are allocated dynamically (as opposed to being permanently assigned).

Examples and Caveats

  • If we take the well-known SPI model (Software as a Service, Platform as a Service, Infrastructure as a Service), then contrary to current mainstream thinking, only IaaS can be cloud computing - and only when the enduser provides the software to run.
  • I added a clause about "primary goal" to eliminate things like Google Spreadsheet from cloud computing - even though a spreadsheet program may run macros (which are software code) and such macros could be provided by the enduser, it's still not cloud computing, because the primary goal of a spreadsheet program is number crunching, not running macros.
  • Programming frameworks (such as Hadoop, for example) can go either way: Hadoop can be cloud computing when the enduser provides their own map and reduce functions; but if the enduser ends up running the defaults or functions that ship with the Hadoop distribution, there is no software supplied by the enduser, so it's not cloud computing.
  • Things like storage as a service and backup as a service are all "cloudy," but they are not computing. There is already a term for that - the Internet. Therefore, I consider "cloudy" by itself to be a redundant term.
  • Google AppEngine (GAE) is a cloud computing platform. Many don't put it into the IaaS category because it doesn't provide customers with access to low-level hypervisor-based VMs. But this alone doesn't make it non-IaaS from a developer's standpoint - after all, a VM in the hypervisor model is one thing and a VM in the language interpreter model is another (JVM, Erlang VM, Python VM, etc.), but each is still a VM in the sense that it encapsulates running code and proxies all system-level requests through its abstraction layer. GAE provides access to its BigTable infrastructure and its memcache infrastructure, so to me it's very much an IaaS system and satisfies my definition of "cloud computing."
  • In my opinion, multi-tenancy is a necessary condition of a cloud computing platform. The multiple tenants don't have to be different companies - they can be different business units or different departments. The key is that there must be dynamic allocation of resources and scarcity. If all resources are dedicated to one organization and simply switched between applications, it's not cloud computing - it's simply infrastructure controlled via an API.
  • The same goes for on-premises server farms with cloudy features - they are not cloud computing, because they are not opaque to the enduser and they are under the enduser's full control.

Conclusion

All in all, I hope this blog post gets us closer to figuring out, once and for all, what “cloud computing” is and what it isn’t.

Categories: cloud-computing | software-engineering |

Full Data vs Incremental Data in Messaging

My recent experiments with messaging for a distributed application led to a realization that I would like to share with you in this post. It’s not an earth-shaking discovery, but you may still find it interesting.

Do you remember dump, the old Unix command for creating tape backups? Remember its concept of levels? To refresh your memory: in a nutshell, level 0 (a full backup) includes all files on the filesystem, while any higher level corresponds to an incremental backup that includes only the files modified since the last backup at a lower level.
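For a concrete refresher, a typical pair of invocations might look like this (the tape device and filesystem paths are just placeholders):

    # Level 0: full backup of /home, recording the run in /etc/dumpdates (-u)
    dump -0u -f /dev/st0 /home

    # Level 1: incremental - only files changed since the last lower-level dump
    dump -1u -f /dev/st0 /home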

It turns out a somewhat similar concept applies to messaging - specifically, to the contents of the messages themselves.

A message, in general, is a piece of information that one system passes to another. On the one hand, a publisher may take an observation, extract information from it, package the entire current state into a blob, and send it out as a message, repeating the same sequence of operations at regular intervals. Examples of this model include sending a message describing the processes currently running on the system, the clients currently connected to a server, current RAM usage, and so on. This model roughly corresponds to dump’s level 0 - the consumer needs just a single message to obtain all the information the publisher sent; there is no need for the consumer to accumulate and merge a series of messages to get the full picture.

On the other hand, a publisher can send a message that contains information about a single event: a new client connected, a new job was submitted to the backend, a hard disk failed. This model is more like an incremental backup - a message contains only a delta; its payload doesn’t carry the entire state.
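To make the contrast concrete, here is a minimal shell sketch of both kinds of publishers. publish is a hypothetical helper that sends its stdin as a single message (with RabbitMQ it could be defined via amqp-publish from amqp-tools, as shown), and the event log path is made up:

    # Hypothetical helper: send stdin as one message to the broker.
    publish() { amqp-publish -e state -r host1; }

    # Full data model: every message carries the entire current state.
    while true; do
        ps auxww | publish
        sleep 60
    done

    # Incremental data model (alternative): every message is a single event.
    tail -F /var/log/app/events.log | while read -r event; do
        echo "$event" | publish
    done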

Each of these models has its good and bad sides. In the full data model, a single message is sufficient to transfer all knowledge about the current state from producer to consumer, and the consumer can start reading messages at any point in the queue - by design, it will catch up once it receives and processes at least one message. The downsides of this model are the waste of bandwidth and processing power (if nothing changes, the same contents get transferred over and over again) and the fact that deltas must be calculated by the consumer (for example, having received two “ps auxww” outputs, the consumer would have to diff them and parse the result).
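For example, a consumer computing deltas from consecutive full snapshots might look like this in bash (receive is a hypothetical counterpart of publish that prints one message body):

    prev=$(receive)                       # first full snapshot
    while true; do
        curr=$(receive)                   # next full snapshot
        # '<' lines disappeared since the last snapshot, '>' lines appeared
        diff <(echo "$prev") <(echo "$curr") || true
        prev=$curr
    done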

The incremental data model clearly provides an easy delta and is less wasteful of resources, but it requires the consumer to merge multiple messages to get the entire picture and, as a result, is sensitive to the point from which the consumer starts reading the queue.

A potential solution is to do what dump does - send full data once in a while, with deltas in between. This way, the consumer will eventually catch up - as soon as it gets a full data message (which will come sooner or later). Another caveat is that a consumer doesn't always need the full picture - in the classic supervisor-workers scalability model, workers rarely need more than the contents of their current job, carried in an incremental message.
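A sketch of that hybrid, with hypothetical current_state and next_event helpers standing in for whatever data is being tracked (publish is as before):

    i=0
    while event=$(next_event); do
        if [ $((i % 10)) -eq 0 ]; then
            current_state | publish       # "level 0": full state lets new consumers catch up
        fi
        echo "$event" | publish           # delta: a single event
        i=$((i + 1))
    done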

But that’s not the end of it. While working on a problem, I realized that I, as a developer, usually don’t even get to choose which model to use - it’s dictated by the nature of the information I am trying to pass from one system to another. Some data is easy to obtain in full form and very difficult to obtain incrementally, and some is the other way around. For example, a list of current processes on Linux is trivial to obtain in full (ps auxww) and quite difficult to obtain incrementally (I would need a notification whenever a process starts or dies). With incoming jobs it’s the opposite - it’s easy to obtain the delta (one job) but quite difficult to know the current status of all jobs.

My conclusion here is that there are two main factors to think about:

  1. Can my publisher get the data in full or incremental form?
  2. Does my consumer need the data in full or incremental form?

If the answers to these two questions are the same, you are good to go. But if they differ, you need to understand the potential issues discussed above and analyze further. I hope to be able to provide more practical thoughts on this in the future - stay tuned.

Categories: rabbitmq | software-engineering |

Why I Sometimes Prefer Shell To Ruby or Python

Shell was among the first things I got familiar with when I was introduced to Linux. It’s not a typical programming language, primarily due to its lack of easy-to-use high-level data structures such as hashes and arrays (anticipating your objection - note I said “easy-to-use”). This may explain why I often get funny looks from folks when I mention that I use shell quite a bit, often in quite non-trivial systems.

And here are my reasons.

Memory Management

Shell scripts are excellent at managing their memory, and one has to try really hard to make a shell script leak memory. This makes shell a very convenient tool for long-running processes: supervisors in multiple-worker models, daemons, and so on. There is an easy explanation for this: in shell, there are only a handful of built-in primitives - everything else is an external command, which starts and then finishes before giving control back to your script. If that command leaks memory, the leak won’t damage your calling script, and it will usually be insignificant anyway, because the operating system reclaims everything as soon as the command exits.
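A typical long-running shell supervisor illustrates the point - the loop itself allocates almost nothing, and all real work happens in short-lived external commands (process_one_job here is a hypothetical worker command):

    # Can run for months: any memory leaked by the worker is reclaimed by
    # the OS as soon as the worker process exits.
    while true; do
        process_one_job
        sleep 5
    done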

No Exceptions

This is a double-edged sword, and you need to be careful how you exploit this “weakness.” It allows me to write compact code that is easy to understand, without enclosing every single command in “try… except”. For the naysayers, I would like to point out that a strict mode exists, in which every error is treated as fatal and causes the script to exit (set -e).
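A quick sketch of strict mode in action (the file paths are made up):

    #!/bin/sh
    set -e                        # strict mode: any failing command aborts the script

    cp /etc/app.conf /tmp/        # if this copy fails, execution stops right here
    echo "runs only if the copy succeeded"

    rm /tmp/stale.lock || true    # opt out per command: failure is tolerated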

In general, not all unforeseen error conditions warrant a crash like the one you get in Python or Ruby when an unhandled exception propagates all the way to the top. If a problem is transient, it may be better to ignore it temporarily.

To ensure a Ruby or Python script doesn’t crash on some unforeseen transient problem, many people end up enclosing their entire program in a wildcard try… except block that catches any exception - but to me this approach is dangerous, even though I sometimes end up using it myself.

If you are writing a daemon process that performs some action in a loop, shell is often by far the most stable option.

When Not To Use Shell

My personal rule of thumb: don’t use shell when you expect to need high-level data structures like hashes or arrays beyond what a for loop can give you, when you can see potential for code reuse following OOP patterns like inheritance, or when your program needs to participate in orchestration schemes that go beyond creating and removing files on the filesystem.

Conclusion

I wouldn’t overlook shell if I were you.

Categories: software-engineering | ruby | python |

How Long Ago Was This EC2 Instance Started?

Today, by accident, I discovered an easy way to determine how long ago your EC2 instance was started. Note that uptime shows the time since the last reboot, so that’s not what we want here. Here is a bash implementation.
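A minimal sketch, assuming the instance identity document exposed by the metadata service (its pendingTime field records when the instance was launched):

    #!/bin/bash
    # Fetch the instance identity document from the metadata service and
    # pull out pendingTime, the timestamp at which the instance was launched.
    launched=$(curl -s http://169.254.169.254/latest/dynamic/instance-identity/document |
               grep pendingTime | cut -d'"' -f4)

    # Convert to epoch seconds (GNU date) and report the age in minutes.
    start=$(date -d "$launched" +%s)
    now=$(date +%s)
    echo "Instance started $(( (now - start) / 60 )) minutes ago ($launched)"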

Categories: cloud-computing |

Branching In Git When Working On Big New Features

A note to self.

When starting work on a big new feature, always set up two branches for it - say, FEATURE_work and FEATURE_integration. Do your regular development in FEATURE_work, committing as often as you want. When you reach a milestone (but the entire feature is not ready yet), squash-merge FEATURE_work into FEATURE_integration. When the entire feature is finished, merge FEATURE_integration into master.
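In git commands, the dance looks roughly like this (branch names as above, the milestone message is made up):

    # One-time setup: both branches start from master.
    git checkout -b FEATURE_integration master
    git checkout -b FEATURE_work FEATURE_integration

    # ... develop on FEATURE_work, committing early and often ...

    # At each milestone: collapse the work so far into a single commit.
    git checkout FEATURE_integration
    git merge --squash FEATURE_work
    git commit -m "Feature X: milestone 1"

    # When the entire feature is done:
    git checkout master
    git merge FEATURE_integration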

This gives you a much nicer history of commits, lets you group changes by milestone, and allows you to keep the big feature as multiple commits in master.

Categories: software-engineering |
