JSON vs XML in API

George Reese recently wrote a blog post about API design, and William Vambenepe commented here. This is an interesting topic, and I have a post on this subject too - it’s titled Developing API Server - Practical Rules of Thumb. In this post I would like to expand on the first point George made in his post - JSON vs XML.

As you may know, I led the design and development of the VPN-Cubed API at CohesiveFT, so I am approaching this subject primarily from the perspective of the API server side, not the client.

We designed the VPN-Cubed API to be able to support both JSON and XML. Since GET requests in HTTP normally carry no body, their arguments must be passed in the query string. For all other HTTP methods, we take the arguments as a hash, convert them to JSON (or XML), set the Content-Type header appropriately and send the resulting representation of the arguments as the body of our request. The client also selects the format in which it wants to get the response by using the Accept header. We assume a default value of application/json - which means JSON responses are sent by default.
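The request conventions above can be sketched from the client side as follows. This is a minimal illustration and not the actual VPN-Cubed client: the base URL, the `/instances` path, the argument names and the `build_request` helper are all hypothetical, and only the JSON body format is shown.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://api.example.com"


def build_request(method, path, args=None, accept="application/json"):
    """Build (but do not send) a request following the conventions above."""
    args = args or {}
    url = BASE_URL + path
    # The client picks the response format with the Accept header.
    headers = {"Accept": accept}
    body = None
    if method == "GET":
        # GET carries no body here, so arguments go into the query string.
        if args:
            url += "?" + urllib.parse.urlencode(args)
    else:
        # Every other method sends the argument hash as a JSON body.
        body = json.dumps(args).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=body, headers=headers, method=method)


listing = build_request("GET", "/instances", {"state": "running"})
creation = build_request("POST", "/instances", {"size": "small", "count": 2})
print(listing.full_url)
print(creation.data)
```

Nothing is sent over the network here; passing either object to `urllib.request.urlopen` would perform the call.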

Despite being designed for both JSON and XML, the VPN-Cubed API ended up shipping with JSON only. And there is a reason why we chose JSON over XML.

Generally speaking, an API is an exchange of messages - the client submits a request, the server returns its response (example: “need new instance with these parameters” - “here is current representation of the instance object you requested”). In most domains, the overwhelming majority of messages map nicely to nested lists (arrays) and hashes (dictionaries). This is a key insight that plays a role in the JSON vs XML battle.

There is no easy and universal way to represent nested hashes and arrays using XML (if there is, I hope to hear from you about it - I need stable libraries for all major programming languages that can convert arrays and hashes to XML and back and that interoperate with each other). Of course it’s possible and not terribly difficult, but it’s something that one must do.
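To make the point concrete, here is a minimal sketch of one possible hand-rolled mapping. Every convention in it is an assumption made for illustration: hash keys become element names, array members are wrapped in an element named `item`, and scalars become text. Another library could reasonably pick different conventions, which is exactly the interop problem.

```python
import xml.etree.ElementTree as ET


def to_xml(value, tag="root"):
    """Convert nested dicts and lists to XML using ad-hoc conventions."""
    elem = ET.Element(tag)
    if isinstance(value, dict):
        # Convention 1: each hash key becomes a child element name.
        for key, child in value.items():
            elem.append(to_xml(child, tag=str(key)))
    elif isinstance(value, list):
        # Convention 2: each array member is wrapped in an <item> element.
        for child in value:
            elem.append(to_xml(child, tag="item"))
    else:
        # Convention 3: scalars become text, so their type is lost.
        elem.text = str(value)
    return elem


doc = to_xml({"name": "web1", "ports": [80, 443]}, tag="instance")
xml_text = ET.tostring(doc, encoding="unicode")
print(xml_text)
# <instance><name>web1</name><ports><item>80</item><item>443</item></ports></instance>
```

A parser on the other side has to know all three conventions to rebuild the original structure, and it still cannot tell whether `80` was a number or a string.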

Contrast this situation with JSON - you don’t need to worry about this, it’s already taken care of for you. The only limitation of JSON that we faced was that JSON does not allow integers as hash keys - object keys must be strings, so you have to convert integer keys to strings or use an array instead of a hash.
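A short sketch using Python's standard `json` module shows both sides of this: nested hashes and arrays survive a round trip unchanged, while integer keys come back as strings. The message contents are made up for illustration.

```python
import json

# Nested hashes and arrays round-trip without any extra conventions.
message = {"instance": {"name": "web1", "ports": [80, 443], "tags": {"env": "prod"}}}
assert json.loads(json.dumps(message)) == message

# Integer keys are the exception: they are serialized as strings.
by_id = {1: "alpha", 2: "beta"}
round_tripped = json.loads(json.dumps(by_id))
print(round_tripped)
# {'1': 'alpha', '2': 'beta'}

# One workaround is an array of records instead of a hash keyed by integer.
as_array = [{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]
assert json.loads(json.dumps(as_array)) == as_array
```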

There are certainly some features in XML that JSON doesn’t have, but the lack of a standard mapping for nested hashes and arrays is a show stopper. While your mileage may vary, I think this is the biggest reason why JSON has been slowly gaining ground against XML in the API world recently. See also this post on Programmable Web.

Categories: software-engineering | cohesiveft |

IaaS vs PaaS

For a very long time, I had regarded platform-as-a-service (PaaS) as a catch-all bucket for everything cloudy that was not software delivered over the Internet on demand (SaaS) or infrastructure (IaaS). Over the past several months however, with the announcement of new players in the PaaS space such as CloudFoundry and OpenShift, I found myself thinking about PaaS in a new light.

PaaS currently seems to be converging on a concept that is essentially an expanded application server (typical examples of application servers are Tomcat, WebLogic, WebSphere, GlassFish, JBoss etc). You package your web application in a certain way and upload it to the server. The server then sets up the environment (such as, for example, your database connection pools) and runs your app.

PaaS of course adds a few twists (examples of functionality that PaaS could offer include multitenancy, autoscaling, API, off-premises hosting, multi-language) but fundamentally it feels to me like a glorified application server.

Several observations.

Firstly, every big software vendor seems to have at least one product in its current lineup that in some shape or form fits into the application server space. I expect each of these vendors to repackage their offerings into a PaaS or PaaS-like product - the more the merrier.

Secondly, the more I think about it, the more I become convinced that a private PaaS will dominate private IaaS at enterprises for applications developed in-house. If a company adopts one of the application servers today as an internal standard, it simply makes no sense to allow internal development of any applications that would not run on them.

Thirdly, you gotta hand it to Google - when everyone was crazy about a cloud model popularized by Amazon EC2, they didn’t cave in and didn’t start offering low-level OS VMs. They have focused on language VMs (Python VM, JVM) and up since the very beginning - this looks exactly like what PaaS has become now. In their latest release, they added backends for long-running background processes (in other words, all daemons that do not fit the HTTP request-response model). I expect other PaaS implementations to follow suit.

Fourthly (as a direct consequence of points #2 and #3 above), I now think that private IaaS clouds will become a place where enterprises run their vendor-supplied (possibly closed-source) non-web-based workloads. As a result, software vendors will need to adopt new ways to distribute their software. There will be no need to build installers and try to detect a machine’s hardware and OS. All software can be shipped as a VM image (with or without customer access, or maybe just partial customer access).

And finally, I am now convinced that today’s PaaS moniker should become application server as a service. Or - to make the acronym easier to pronounce - a webapp container as a service (WCaaS or ACaaS). There is simply too much “platform” beyond an application server use case - think data store as a service, messaging bus as a service, external connectivity as a service, load balancing as a service, naming as a service, and so on. Each of these could be a standalone service.

Good times for cloud computing!

Categories: cloud-computing |

On Importance of Planning for Failure

As you probably heard, on April 21 of this year the US East region of the Amazon EC2 cloud experienced a severe outage. The event received considerable coverage around the blogosphere - you can find the most comprehensive collection of links on the topic at highscalability.com. The guidelines of design for failure have now spread far and wide. But where do you start?

One good place to start is a thought exercise: try to come up with as many possible failure scenarios as you can imagine, and walk through the steps you would take to deal with each one of them. It should not take a lot of time or resources. All you need is to start a document where you will record your initial findings - you will update this document throughout your system’s lifecycle as more information becomes available.

Again - this is a pure thought exercise. You are not writing any code here or redesigning any of your systems.

Create a table with the following columns.

1. Something bad happens. Examples: You lost your database master. Half of your frontend capacity is not responding. Your traffic increased sharply and unexpectedly. Some of your servers are unable to read or write to their disk. Loss of cooling - servers overheating. Fire alarm in the datacenter.

The more bad things you can think of here, the better. Remember also that not all problems occur in an instant - some issues could be building up over time, others could be affecting only some fraction of your systems.

Something bad could be going on for hours or even days, or it could be a sub-second event with a huge impact on your service. Are your systems going to be impacted the same way regardless of how long an event is ongoing?

2. You or your monitoring systems find out that something bad happened. Don't skip this step - it's more important than you think!

Can your monitoring detect the error condition reliably (with sufficient confidence level but without unmanageable amount of false positives) and within desired timeframe? (Hint: if for all your events you answer “yes” to this question during initial assessment, either you didn’t think of enough bad things that could happen, or your monitoring system is better than 99.98% of all monitoring systems deployed out there.)

3. You need to know in advance the approximate impact such an event will have on your systems. Once you find out about an actual event, you will need to confirm that your anticipated impact is indeed occurring.

Example: Did you think that a loss of this router would only result in your loss of half of your backend capacity, while in reality (due to a recent change) you lost both 50% of backend and frontend?

4. You react. You should have a rough idea what you would do in response to a given event. As they say, "The best improvisation is one that was prepared in advance."

Note that “put up a ‘We are sorry, site is down’ static page and wait for provider to fix a problem upstream” is a valid reaction.

5. Assign a likelihood to each event. The key here is that absolute values are not very important - you only need to offer a hypothesis about which event is more likely to occur than another.

6. Assess expected impact relative to expected impacts of other events.

A combination of columns (5) and (6) should help you prioritize your ops work. I can’t think of universal guidance whether to go for “low likelihood, big impact” or “high likelihood, low impact” first - it probably makes the most sense to start somewhere in the middle. When prioritizing work, also pay attention to column (4) - especially if a current reaction is deemed insufficient.
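The six columns can be captured in a very small data structure. The sketch below is only one way to do it: the `FailureScenario` fields, the 1-5 relative scales, the sample rows and the likelihood-times-impact ordering are all assumptions made for illustration, and the ordering is not the universal guidance that, as noted above, I don't have.

```python
from dataclasses import dataclass


@dataclass
class FailureScenario:
    event: str            # column 1: something bad happens
    detection: str        # column 2: how you find out
    expected_impact: str  # column 3: anticipated impact to confirm
    reaction: str         # column 4: what you would do
    likelihood: int       # column 5: relative rank, 1 (rare) to 5 (common)
    impact: int           # column 6: relative rank, 1 (minor) to 5 (severe)


def prioritize(scenarios):
    """Order scenarios by a simple likelihood-times-impact score (a heuristic)."""
    return sorted(scenarios, key=lambda s: s.likelihood * s.impact, reverse=True)


# Hypothetical rows; the values are placeholders, not recommendations.
table = [
    FailureScenario("Fire alarm in the datacenter", "provider notice",
                    "all servers offline", "static sorry page", 1, 4),
    FailureScenario("Lost database master", "replication alert",
                    "writes fail", "promote a replica", 2, 5),
    FailureScenario("Half of frontend capacity not responding", "health checks",
                    "slow responses", "add frontend nodes", 3, 3),
]

for row in prioritize(table):
    print(row.likelihood * row.impact, row.event)
```

Because only relative values matter, any consistent scale works. Rows whose reaction column looks insufficient can be moved up by hand.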

For certain failures, you will have columns (2) and (4) automated - for example, leading NoSQL solutions can often tolerate a loss of one node in the cluster. I still recommend listing such events in this table, even though you might not need to do any additional manual reacting.

You also need to remember to revisit this document as you are working on your system: adding components, growing capacity in response to demand, developing new features, fixing bugs, moving between platforms, and so on.

And finally, you can grow this document ad infinitum. Even when you think you covered every single event imaginable, either wait a month or two or start planning for combinations of events (for example, loss of router AND independent traffic spike).
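Enumerating combinations is easy to mechanize once single events are listed. A minimal sketch, assuming a hypothetical list of event names taken from the examples in this post:

```python
from itertools import combinations

# Hypothetical single events already recorded in the table.
events = ["loss of router", "traffic spike", "lost database master"]

# Every unordered pair of independent events becomes a candidate new row.
pairs = [" AND ".join(pair) for pair in combinations(events, 2)]

for pair in pairs:
    print(pair)
# loss of router AND traffic spike
# loss of router AND lost database master
# traffic spike AND lost database master
```

The number of pairs grows quadratically with the number of events, so in practice you would only keep the combinations that seem plausible.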

If you do this right, this exercise will help you weather your future operations storms with more confidence. But remember that if you have a sufficiently complex system, no matter how much planning you do, sooner or later it will experience a "normal accident" caused by one or more unknown unknowns. Your goal is to postpone a normal accident for as long as you can.

Categories: cloud-computing | devops | infrastructure-development |

Coffee and Design for Failure

In the wake of the Judgment Day Outage, I would like to offer you a story.

One morning John Doe woke up and decided he wanted a grande mocha. The nearest Starbucks was 3 minutes away by car. John went to his garage, but it looked like his garage door opener wouldn’t work - power was out at his house. John pulled the door up manually and started the engine.

As he approached the exit from his subdivision, he noticed that the road was closed and road crews were working on an emergency power issue (the one that caused the power outage at his house). ETA was 30 minutes. The other exit from his subdivision was closed for 60 days due to road resurfacing.

John returned home, left his car in the garage and decided to walk to his nearby Starbucks. He noticed cloudy sky and took an umbrella. But it started to rain with strong wind gusts and his umbrella was not helping him a lot. He was soaking.

At that point John decided to give up on the idea of grande mocha from Starbucks and settled on his home brew. He returned home and didn’t have a Starbucks coffee that day.

Now, let me ask you - did John do a good job at designing for failure? After all, he was responsible for designing his process of obtaining coffee. He owned that process so he was in charge.

Or is John a rational individual who acted rationally under a given set of circumstances?

Two other blog posts of mine that you may like in this context are Normal Accidents in Complex IT Systems and Are You a Responsible Owner of Your Availability?.

Categories: infrastructure-development |

Corporate Open Sourcing

The idea of open source could by now be familiar to most folks in the industry. When a new technology is open sourced by an individual, the situation seems to be well understood. But when a corporation opens source code for one of its products, it looks to me like not everyone is fully aware of all peculiarities.

I can think of 3 common situations.

1. Something is developed for internal needs and then is open sourced after some internal use, once it becomes obvious that others may benefit. I perceive Google and Facebook to be favoring this approach. This approach is the closest a corporation can get to the motives that drive individuals to open source their work - sharing of knowledge, earning street cred, recruiting, and so on. The key differentiator here is that the owner pretty much doesn't care what others do with their technology - their internal plans more or less do not depend on whether the open sourced project becomes popular or not.

2. Something is open sourced while being developed or at early stages (say right after the first shippable milestone), without the owner aspiring to build an ecosystem or get it adopted as a standard, at least initially. This is a business move - it's conceivable that a company's product needs to have its source code publicly available in order to help sales. How permissive the license is may or may not matter here, and neither does whether contributions are accepted or even encouraged.

3. Something is open sourced while being developed and the goal right from the start is to build an ecosystem, or for the technology to become a standard. In this case, the main driver to open source was the ecosystem or aspirations to become a standard or some sort of fundamental building block (such as a "Linux kernel of the cloud"). Talk of licensing, code ownership foundations, governance, commoditization, contributions, etc is all found in this approach.

In all three scenarios, open sourcing is great and I applaud every company that does it, no matter what their motives are. But #3 is extremely difficult. In fact, of all open source tools and technologies that we commonly use today (Linux, Python, Ruby, any project in Apache Incubator and many thousands of others) - I can’t think of any that followed scenario #3.

For those who have not been following the industry lately, the Open Compute project initiated by Facebook seems to follow approach #1 - they have a finished datacenter based on this spec (in Prineville, OR) and they shared their findings after it was completed. They seem to like their design and it looks like they will continue building to this spec, regardless of whether it becomes adopted by others or not.

OpenStack and CloudFoundry (led by Rackspace and VMware, respectively) seem to follow approach #3. It’s a great idea - all I am saying is that their job is more difficult. They are truly blazing their own trail: this hasn’t been attempted thus far. Or if it has been attempted, I can’t think of it or it didn’t work out. If they make it in their respective areas, they will be the first.

This post on GigaOm could be an interesting read in this context.

Can OpenStack and CloudFoundry both succeed at the same time at something that has not worked out up until now?

Categories: internet |
