How I Organize Posts In Jekyll

Since November 2010, I have been running Jekyll to power this blog. Before Jekyll, I was on self-hosted WordPress, which explains some of my decisions about how I organize my posts: I didn’t want to break any existing links.

You can find some interesting resources on Jekyll that I collected while building this site at http://www.delicious.com/dsamovskiy/jekyll.

If you are reading this post via RSS, you can find all code snippets below at https://gist.github.com/852008 or open this post in a browser. As of the time of this writing, I use Jekyll 0.7.0.

Categories

I define categories for each post in its YAML Front Matter.
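For example, a post’s front matter with two categories looks roughly like this (the title here is just an illustration):

---
layout: post
title: "Some Post Title"
categories:
- ruby
- blogging
---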

My category layout uses category_name as a parameter, which I can access as page.category_name. For example, all posts in the current category are available in site.categories[page.category_name].
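Inside the category layout itself, listing those posts comes down to a simple loop (a sketch, not the exact markup I use):

<ul>
{% for post in site.categories[page.category_name] %}
  <li><a href="{{ post.url }}">{{ post.title }}</a></li>
{% endfor %}
</ul>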

And here is a plugin that I use to generate each category’s main page from an empty one called “category.html” (instead of explicitly having a separate page for each category):
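The exact code is in the gist linked above; in outline, it is a Jekyll generator along these lines (a simplified sketch - the class names and the /category/ output directory are illustrative):

module Jekyll
  # Sketch: build one page per category from the single empty category.html
  class CategoryPage < Page
    def initialize(site, base, dir, category)
      @site = site
      @base = base
      @dir  = dir
      @name = 'index.html'
      self.process(@name)
      # Reuse the empty category.html (and whatever layout it declares)
      self.read_yaml(@base, 'category.html')
      self.data['category_name'] = category
    end
  end

  class CategoryPageGenerator < Generator
    safe true

    def generate(site)
      site.categories.keys.each do |category|
        site.pages << CategoryPage.new(site, site.source, File.join('category', category), category)
      end
    end
  end
end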

Post Counts by Category

In my templates I use site.sorted_categories as a list of categories sorted by the number of posts in each category in descending order:
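Assuming sorted_categories is simply a list of category names (that’s how the plugin sketch below builds it), a sidebar snippet might look like:

<ul>
{% for category in site.sorted_categories %}
  <li>{{ category }} ({{ site.categories[category] | size }})</li>
{% endfor %}
</ul>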

And here is a plugin that I use to build this list:
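Again, the real code is in the gist; the idea is a small monkey patch on Jekyll::Site that adds the sorted list to the data Liquid can see (a sketch, assuming Jekyll 0.7.x where site_payload returns a plain hash):

module Jekyll
  class Site
    # Category names, most populous category first
    def sorted_categories
      categories.keys.sort_by { |category| -categories[category].size }
    end

    alias_method :site_payload_without_sorted_categories, :site_payload

    # Expose the list as site.sorted_categories in templates
    def site_payload
      payload = site_payload_without_sorted_categories
      payload['site']['sorted_categories'] = sorted_categories
      payload
    end
  end
end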

Related Posts

For each post I build a list of related posts, defined as the N most recent posts in the same categories. Here is how I populate post.related:
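The gist has the real thing; roughly, it is another generator that, for every post, picks the most recent other posts sharing a category (a sketch - the limit of 5 is illustrative):

module Jekyll
  # Sketch: populate post.related with the most recent posts
  # that share at least one category with the current post.
  class RelatedPostsGenerator < Generator
    safe true
    LIMIT = 5

    def generate(site)
      site.posts.each do |post|
        related = site.posts.select do |other|
          other != post && !(other.categories & post.categories).empty?
        end
        post.data['related'] = related.sort_by(&:date).reverse.first(LIMIT)
      end
    end
  end
end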

Categories: ruby | blogging |

On Nuances of Jevons Paradox in Cloud Computing

I am not a formally trained economist. I am a software engineer, and economics (and economics in IT in particular) is merely a hobby of mine.

I have now seen many mentions of Jevons paradox in the context of cloud computing. When I first learned about it, it sounded interesting and exciting, as does any proposition that on the surface defies common sense and explains something fundamental in one’s area of professional interest. But as I read how other people apply Jevons paradox in their lines of thinking, I realized that it’s not as clear-cut as one might imagine.

What is Jevons Paradox?

While you can read the entire paper here, Jevons paradox in short states that “improvements in fuel efficiency tend to increase, rather than decrease, fuel use” (both link and interpretation are from Wikipedia).

Let me first outline a hypothetical scenario in the cloud computing domain where a phenomenon similar to the one described by Jevons could be observed.

Imagine you are developing a model that can predict future spot prices in Amazon EC2. To obtain results with a 50% confidence level, your model needs to run for 5 hours on 100 cloud instances (500 machine-hours in total).
 
Now imagine that due to enhanced orchestration and coordination between various parallel tasks of your model, you no longer need to run all 100 instances for the entire duration - you can stop and start instances on demand at the right times such that your overall model now takes only 220 machine-hours. In other words, you switched to more efficient use of the resource (compute instances).
 
Looking at these significant savings, you start contemplating whether to improve the results by increasing the desired confidence level, which will obviously increase the amount of computation and hence drive up the number of required machine-hours.

Predict the future vs Explain the past

The last paragraph above describes what’s known as the rebound effect. If you choose not to require a higher confidence level, your rebound effect is 0: you fully realized all the savings made possible by the more efficient use of resources. This case is not Jevons paradox.

If you choose to require, say, a 60% confidence level and it leads to 300 machine-hours, you still realized some savings (500 machine-hours vs 300), just not as much as you would have realized had you stayed with the 50% level. This case is not Jevons paradox either.

Furthermore, if you choose to require an 80% confidence level, which happens to lead to your enhanced model requiring 500 machine-hours, your overall spend will remain the same (500 vs 500). Also not Jevons paradox.

And only if you choose to go still higher with the confidence level - say 90% - will you end up with your model taking more than 500 machine-hours, and in doing so you would end up in the situation described by Jevons.
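To put numbers on this (my own back-of-the-envelope framing, not something from Jevons’ paper): the rebound effect is often expressed as the share of the expected savings that gets eaten back by increased consumption, and anything above 100% is the Jevons case.

rebound = (actual use - efficient use) / (baseline use - efficient use)

60% confidence: (300 - 220) / (500 - 220) ≈ 29%  - partial rebound, still saving
80% confidence: (500 - 220) / (500 - 220) = 100% - expected savings fully consumed
90% confidence: anything above 500 machine-hours pushes the ratio past 100% - that is backfire, i.e. Jevons paradox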

All in all, Jevons paradox is a special case of the rebound effect, and in the general case there is absolutely nothing that can tell you in advance how big the rebound effect is going to be. Hence, you can’t tell in advance whether a rebound effect will be big enough to turn into the Jevons effect. Therefore, predicting in the general case that cloud computing will not save you money because of Jevons paradox alone is logically inaccurate.

On the other hand, if you saw me going from 500 machine-hours for 50% confidence level to 700 machine-hours for 90% level, you could explain what happened to me as Jevons paradox. Post factum.

Macro vs Micro

Jevons formulated his statement on a macro level. The rebound effect is also usually studied at the macro level. Macroeconomics focuses on the economy as a whole - or, in other words, on a collection of individual entities (firms and households), each of which is guided by its own self-interest, independent of that of others.

What happens to a set of entities as a result of their independent actions taken cumulatively can’t tell us how a given individual household or firm will behave - that’s the micro level.

Automatically assuming that whatever happens at the macro level happens to every participant at the micro level is the fallacy of division.

Therefore, even if cloud computing does not lead to a reduction in overall IT spend, saying that a given company can’t lower its IT spend by going to the cloud because of Jevons paradox alone is inaccurate.

Further Reading

Categories: cloud-computing | economics |

Cloud As Application Data Exchange Point

Among the numerous technical components that comprise an infrastructure-as-a-service cloud, there is one that usually draws the most criticism and causes the most annoyance. I am talking about the network. Too flat, too inflexible, too slow, too unpredictable - the cloud network has been accused of being each and every one of these. And while some of the complaints could very well be valid at times, it is the cloud network that holds huge untapped potential for many big things in cloud computing.

While a network in each cloud may be designed differently, there are usually two important characteristics that all networks in all clouds share - they are super fast (LAN speeds within a single region) and they are multi-tenant.

The latter is the key. It’s true that multi-tenancy is often presented as a drawback or undesirable side effect - it significantly increases the risks of running one’s systems in the cloud and could negatively impact the throughput of one’s system because of noisy neighbors.

But let’s look at it from another perspective. Applications want to exchange data. (And by “application” here I mean any piece of software that runs in the cloud - webapp, data store, queueing system, data warehouse, traffic encryption service, etc.) A webapp wants to receive a user’s request, send a query to a data store, and simultaneously send the visitor’s IP address to a geo-location service. The data store wants to send the response back to the webapp and send query statistics (how long it took to run the query, how many records were returned, etc.) to a monitoring system. The monitoring system wants to analyze the data and send alarms. And so on and so forth.

But what if the webapp in this scenario is run by one company, the monitoring system by another company that specializes in monitoring, and the data store by a data store specialist? This used to be nearly impossible. From the vendor’s standpoint, providing real-time support for data stores located on each client’s own network was extremely hard and costly. From the customer’s standpoint, connecting from their datacenter to the vendor’s system over the Internet or private links could be prohibitively slow. In the cloud it becomes a piece of cake, primarily due to the very fast network between different cloud tenants.

Furthermore, in clouds like Amazon EC2 there is no real difference in network connectivity between two servers under the same account and two servers under different accounts (this covers both speed and bandwidth).

The cloud network is to application data what an Internet Exchange Point is to Internet traffic.

What you used to need to have in your own network can now be easily outsourced to a specialist, without sacrificing connectivity speeds or bandwidth, if both of you host your systems in the same cloud region. This is huge.

Additionally, and for some potentially even more importantly, you now get an opportunity to interconnect with your customers, vendors and partners at LAN speeds, without having to spend a fortune - all you need is for everybody to get their systems (or at least points of presence) running in the cloud. The power of network effect!

George Reese predicted that 2011 would be a year of the network in the cloud. I fully subscribe to the idea that 2011 is going to be the year when cloud network is going to start playing a new and significantly expanded role.

Categories: cloud-computing |

My Doubts About Idea Behind SpotCloud.com

This is part 5 of my series on pricing in the cloud.

The moment a second provider joined Amazon Web Services in offering IaaS cloud computing services, people in the industry started talking about a cloud computing marketplace where providers would be able to list available resources and customers would be able to buy them - on-demand, pay-as-you-go, no-long-term-contracts, priced-by-the-hour, practically-infinite-capacity scalable infrastructure bliss would ensue.

And while most were still talking and dreaming, someone actually went ahead and started putting wheels in motion. Spotcloud.com was announced on November 1, 2010 on the personal blog of its founder. I was reminded of it when I came across an update its founder recently gave GigaOm, so I thought I’d post something on the topic.

Self-described as “the first cloud computing clearinghouse & marketplace” that follows an “opaque” sales model like hotwire.com, SpotCloud aims high. But despite its reported growth on the supply side (see the GigaOm link above), I personally have many doubts that the time for this idea has come.

In a nutshell, I doubt there will be sufficient demand from legal workloads whose primary focus is on CPU cores and RAM; I doubt a single workload can remain on "opaque" infrastructure for long; I doubt a cloud capacity marketplace today can withstand pricing pressure from real clouds; and I doubt competing with real clouds primarily on price is a viable strategy. Let’s explore these in more detail.

Sufficient Demand

From the post on GigaOm I learned that SpotCloud currently has supply commitments of 10,000-25,000 servers (please note that I am not aware of any independent sources that could confirm or deny this information). The numbers might look impressive to you, but remember that it’s only supply - providers with unused hardware that just sits in their datacenters are being offered a chance to sell it by the hour. It’s a “no lose” proposition, so supply looks OK, as was expected.

But as most of you are well aware, a marketplace can’t function well with only supply - it also needs matching demand. (Do you know what happens when supply overwhelms demand? Prices race down to zero or providers withdraw. Just FYI.)

Let’s start with all cloud workloads, as if all of them could run on top of SpotCloud.

First, we will have to eliminate workloads that require persistent storage or fast access to data. With an “opaque” sales model, you don’t know where your application will run next time, so you can’t expect to always find your data nearby.

Second, we will need to eliminate all workloads that are constrained by compliance and/or audit requirements, as these demand consistently getting the same operational environment, which can’t be guaranteed with an opaque model.

Then, we will need to eliminate all workloads that need to be discoverable and addressable from the outside (in other words, services that publish their IPs to public DNS) - the opaque model means you never know where you will end up, and hence you’re stuck with dynamic IP addresses.

And finally, from all workloads that are left, eliminate the ones that already run on-premises, in the cloud, or in managed hosting environments, and whose owners will not want to disturb the status quo and will opt not to move.

What are we left with? CPU/RAM-intensive applications that don’t need to access a lot of data or access it quickly, that don’t need to be listed in DNS, and that are reckless enough to try something new in terms of their hosting. Does that sound like sufficient demand to you?

Also, in the summary above I said “legal” because there are many workloads that can’t run in the cloud to begin with - malware distribution, spambots, botnets and other cyber creatures - because these workloads violate clouds’ terms of service or acceptable use policies (or equivalents). Do you think a seller participating in SpotCloud will be interested in running these?

Staying on "Opaque" Infrastructure

Here is what I mean. In the case of Hotwire, the “opaque” sales model only works because most people end up going to different places every time they travel. Imagine you went to New York once and booked your hotel room through Hotwire. Now imagine you go there again, once again book through Hotwire, and end up in the same hotel. If you are rational, next time you have to go to New York you will look up the hotel’s direct phone number and see if you can get better terms. In other words, if you keep going to the same place, the “opaque” sales model is irrelevant - you are better off dealing with the supplier directly. Logically, it is very unlikely that a provider can’t give you better terms than an intermediary.

If you have a workload for SpotCloud, you run it and end up at provider X. (How can you tell? curl http://checkip.dyndns.org + whois). Say you have a similar workload and you once again end up at X. What prevents you from cutting out the middle man and renting your infrastructure straight from X?
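To spell out that check (the IP address below is just a placeholder):

$ curl http://checkip.dyndns.org   # shows the public IP your workload is using
$ whois 203.0.113.10               # look up which provider owns that address block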

(Answer: exclusivity clause in contract between SpotCloud and X. But can you think of a hotel anywhere in the world that willingly agrees to sell its inventory exclusively through Hotwire? Me neither.)

Pricing pressure

The front page of Spotcloud.com touts “up to 60% off cloud capacity” as one of the key benefits to buyers. Even if you think SpotCloud can meet this claim initially, I strongly doubt it will manage to do so over time - IaaS prices are constantly under pressure to go down. Amazon alone has announced several price reductions over the past few years.

Without doing anything, a SpotCloud provider runs the risk that eventually real clouds will drop their prices to its level, thus negating a key value proposition it might have. Competing on price against clouds, with their direct distribution channels and enormous economies of scale, may not be the best long-term strategy.

Conclusion

The only thing going for SpotCloud.com appears to be the first-mover advantage. Financial exchanges put a lot of emphasis on it, as it matters a lot in their world. But the realities of financial markets are very different from those of the compute capacity market.

Hats off to SpotCloud for pioneering the implementation, but I have doubts it will work out in the near future. The idea of a cloud capacity marketplace sounds interesting, and it will eventually materialize, but not any time soon, in my opinion. Also, I don’t think a successful cloud capacity marketplace could follow the “opaque” sales model any time soon.

Categories: cloud-computing | economics |

chattr Against Sneaky Postinstall Scripts

I was once reading my Twitter stream and came across a link to a systems monitoring service I had never heard about before. I went to their website, liked what I saw in their list of features and their pricing, and signed up for an account.

Then it was time to get a test system into their monitoring dashboard. As is quite common in this domain, they offer an agent to be installed on my host. The agent software was conveniently packaged as a DEB file (I was targeting Ubuntu as a test system) - one bonus point to the vendor!

As with all new DEB packages that don’t come straight from Ubuntu mirrors, I like to download the file and inspect it before installing it.

If you have a DEB file, you can get to its contents without installing the package by running the following commands in a temporary directory:

$ mkdir /tmp/deb_analysis
$ cd /tmp/deb_analysis
$ cp /path/to/file.deb .
$ ar xf file.deb

You will end up with two files that you care about - data.tar.gz and control.tar.gz. Once you untar control.tar.gz, you will see the main control file of the package as well as the maintainer scripts that run at various phases of package installation and removal (what commands to run before the package is installed, after it is installed, etc).
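For example, extracting each archive into its own directory keeps the contents from mixing:

$ mkdir control data
$ tar xzf control.tar.gz -C control
$ tar xzf data.tar.gz -C data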

In control I usually pay attention only to the dependencies, to make sure the list is not insane - I don’t like installing unneeded things, and I look for things I intentionally don’t want on my system for one reason or another. (To give a hypothetical example - a package that depends on a specific kernel version, or one that depends on something I installed locally from source.)

Having not found anything bad in control, I moved on to postinst. It’s usually a shell script so it’s not very difficult to figure out what they are trying to do here.

In this particular case, I noticed that they wanted to add their library to /etc/ld.so.preload - thus ensuring that their library gets loaded into every single program that runs on my system (see man ld.so).

I immediately lost interest in continuing to evaluate this solution, because this level of integration for a monitoring system was beyond what I am comfortable with. But it raised an interesting question - what can one do to protect against such changes being made without one’s explicit knowledge and consent?

There could be many things that can help here, but recall that package installation requires root powers - so the postinst script is going to run as root. Things like AppArmor won’t necessarily be effective - root can edit and reload AppArmor profiles to allow itself to do what it needs.

I like a very simple trick in this situation - chattr +i. This command turns on the immutable flag on a given file. From man chattr:

A file with the `i' attribute cannot be modified: it cannot be deleted or renamed, no link can be created to this file and no data can be written to the file. Only the superuser or a process possessing the CAP_LINUX_IMMUTABLE capability can set or clear this attribute.

In my case, creating an empty /etc/ld.so.preload and setting its ‘i’ attribute will usually be sufficient to make sure this file is not modified without my knowledge.
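The whole trick comes down to three commands:

$ sudo touch /etc/ld.so.preload      # create it empty, on my terms
$ sudo chattr +i /etc/ld.so.preload  # set the immutable flag
$ lsattr /etc/ld.so.preload          # verify that the 'i' flag is now set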

That is, until people start putting chattr -i calls into their postinst scripts - which would clearly indicate evil intent. Don’t do it!

Categories: linux |
