How Enterprise IT Gave Rise to Cloud

Have you ever noticed that enterprise IT organizations often have (sometimes numerous) project managers but never product managers? Did you know that this fact is directly responsible for the rise of cloud computing? Read on if you want to learn more.

Enterprise IT is an interconnected set of people, processes, services, hardware and software. While its main purpose in life is to support the business, you could argue that the bulk of the productivity gains we have managed to achieve as a society in recent decades can be attributed to improving IT. So ultimately enterprise IT is a good thing.

Within an enterprise, IT has two different internal customers. The business itself is one, but there is another - application software developers. We could debate the significance of each, but that's not relevant here.

When you have something as complex as a typical enterprise IT organization, you need people to help manage it - that's where technology project managers come in. They perform many important functions and it's not my intention to question their value to the organization. But project managers have a very specific mandate - essentially to bring order to the flow of work, both internal changes and change requests from customers.

What’s missing from project managers’ mandate is asking their customers “what do you want us to be able to do for you?” That is a function of product managers. Unlike project managers, who take something as a given and bring order to its surroundings, product managers actively shape what that something should look like in order to maximize its value.

Let me give you an example. A Rails developer can develop a web application in any field - healthcare, social, logistics management and so on. She doesn’t need specific knowledge about the domain she’s working in to be successful. That knowledge comes from product managers - they are the ones who specify “requirements” and prioritize them.

Project managers and product managers are not interchangeable, and these roles require totally different backgrounds. A PMP certification helps in project management but won’t do anything for product management. Knowing the ins and outs of medical billing will help in product management at a medical billing company but won’t help much in project management.

Now that we’ve established the difference between project managers and product managers, let’s get back to our enterprise IT. Here is the main point of this post and I am going to highlight it.

Enterprise IT doesn't need product managers for its business customer because the business's requirements for IT are only very high level ("be able to run accounting", "have good uptime"), so there is no need for product managers there. But enterprise IT forgot that it has two customers. Application developers, because they are also technologists, have very specific, detailed things they want enterprise IT to do for them, but there is no one to shape enterprise IT from within to respond to those needs.

People managing enterprise IT as a product for internal application developers would be the ones working on problems like the following:

  • what characteristics of a data store developers need
  • what access to production developers need for safe and effective troubleshooting, and how enterprise IT can make it happen
  • how the network should be partitioned to support developers' needs for moving apps between environments

And this is exactly what IaaS cloud computing did - it built IT as a product offering for application developers, considering their needs and wants.

If you are an enterprise IT organization today, IaaS is your competition. Your application developer customers will choose you only if your product is better, or if corporate information security forces them to. Draw your own conclusions.

Categories: cloud-computing |

Response To Simon Wardley: Innovation in Interface Implementations

You have probably heard about the most recent episode of the “AWS API in Openstack” saga. If you haven’t, head over to Nati Shalom's blog to read one of the best recaps I have seen.

My personal position in this discussion is very simple. I am not saying Openstack should fully adopt the AWS API, nor am I saying Openstack should not fully adopt the AWS API. What I am saying is that which API is presented doesn’t matter, as long as an API exists and remains stable, with reasonable version management (so that developers can write against it easily). And I am not alone. Just give us a reasonable API; everything else doesn’t matter.

The ideological footing for the viewpoint that Openstack should just adopt AWS APIs can be traced to the works of Simon Wardley. Simon was the first to introduce the cloud computing community to several important concepts from science, and he is a well-established thought leader in information technology, widely followed by both cloud pundits and practitioners.

As part of the recent debate, Simon wrote a post titled The 'Innovation' Battle, in which he says in part:

[...] innovation is not about the interface but should be above and below the interface.

[...] [others] believe that innovation of the interface is paramount. They believe that what matters is the interface, they hence strongly oppose adoption of AWS APIs.

In this post, I would like to expose a flaw in Simon’s logic and convince you that both camps in the API debate are actually in line with Simon’s guiding principle that innovation of the interface doesn’t matter.

What is an interface?

First, let’s think about what an interface is, in the most general terms. An interface is a set of behaviors and intended use cases that something provides and adheres to. When two things provide the same or a very similar interface, they become interchangeable. Over time, as the number of things providing the same or a similar interface grows and the interface itself is refined (it gains new desired behaviors based on feedback and observations, and sheds behaviors that customers found unnecessary), it becomes a commodity.

Take a regular automobile. What is the interface of a car, in the most general terms? It consists of several behaviors. One has to have a way to get situated in a seating position in front of a steering wheel and an instrument dashboard. One has to have a way to make the car move forward, move back, turn and stop. One has to have a way to see what’s behind the car. You could go further and say that over time the car interface gained the expectation that you steer with your hands and brake and accelerate with your feet. Maybe the ability to listen to music on your iPhone is a must now - interfaces do evolve.

These are all general things that one expects from a car and that allow most people to operate any car of roughly the same size without too much difficulty or a user manual. This is the car’s interface.

What’s the interface of an IaaS cloud? In the most general terms, it’s the ability to programmatically request a running OS image (we now call it an instance) that is uniquely identifiable and addressable and connected to a specified network, along with the ability to stop it.
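
To make that concrete, here is a minimal sketch of this generic interface in Python. The class and method names (IaasCloud, run_instance and so on) are purely illustrative assumptions of mine; they don’t correspond to any real provider’s API.

```python
from abc import ABC, abstractmethod

class IaasCloud(ABC):
    """Illustrative generic IaaS interface; names and signatures are hypothetical."""

    @abstractmethod
    def run_instance(self, image_id: str, network_id: str) -> str:
        """Start a running OS image on the given network; return a unique instance id."""

    @abstractmethod
    def get_address(self, instance_id: str) -> str:
        """Return an address at which the instance is reachable."""

    @abstractmethod
    def stop_instance(self, instance_id: str) -> None:
        """Stop a previously started instance."""
```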

Now on to a key point. Try to think through the historical evolution of the cloud - from the first version of Amazon EC2, to other IaaS clouds, to newer versions of EC2. Has the generic IaaS interface changed in any significant way? Not really - no one is innovating in the interface anymore because a lot of the interesting work there has already been done. Even mighty Google, who came out with their IaaS cloud last year, hasn’t done any significant innovation in the interface (which doesn’t mean they haven’t innovated in other areas that matter more to customers).

Interfaces don’t expand simply because vendors want to sell more things - interfaces evolve only when customers are actually using new behaviors and functionality.

API as Implementation of the Interface

Even though all car manufacturers implement the same interface, their implementations are all different. Doors can open to the side or up. Door handles can be flush or not. The same controls are sometimes knobs and sometimes buttons. Steering wheel diameters differ. Control labeling differs.

Have you thought about why car manufacturers haven’t converged on a single implementation of the interface? I have. Because customers want differentiation, and customers don’t mind the differences in what looks to them like details.

Similarly, an IaaS API is an implementation of the general interface. The order of parameters; POST or GET or PUT for API calls; the details of how to sign your request; instance ids that are hex with an “i-” prefix or plain numbers; public IP addresses via NAT or directly on the network interface. The list could go on and on, and you can name tons of these little differences if you have had a chance to work with many clouds (not merely talk about many clouds but actually write code that uses many clouds).

These details don’t matter to most developers (and to every developer I have personally spoken to), just like whether windshield wipers are turned on by pushing the lever up or down doesn’t matter to most car drivers.
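
Here is a sketch of what those surface-level differences look like in code. Both providers below are hypothetical, and every name, parameter order and id format is made up for illustration - the point is only that they implement the same generic interface in different ways.

```python
import uuid

# Two hypothetical providers implementing the same generic IaaS interface.
# All names, parameter orders and id formats here are invented for illustration.

class CloudA:
    def run_instance(self, image_id, network_id):
        # hex instance id with an "i-" prefix
        return "i-" + uuid.uuid4().hex[:8]

class CloudB:
    def launch(self, network_id, image_id):    # different name, different parameter order
        # plain numeric instance id
        return str(uuid.uuid4().int % 10**8)

# Either way, the caller ends up with a uniquely identifiable running instance;
# only the surface details of the API differ.
```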

Conclusion

I think Simon is correct to point out that innovation should be happening outside of the interface. And it is. But his assumption that the generic interface to an IaaS cloud and the API are one and the same is flawed - they are not, and the cloud is not the only commodity where this is evident, as I showed above. The API is an implementation of the interface. Nobody is innovating in the generic interface on a large scale anymore - the feature set is pretty solid, shaped by the current capabilities of available technologies.

One can very well believe that Openstack would be better off adopting AWS APIs but it can’t be solely because of Simon’s “innovation of the interface” argument.

Categories: cloud-computing |

Risk in IT Systems

TL;DR - This post is not about ways and techniques to run IT systems better or more reliably, or to deal with failure more efficiently; it's about mathematical models (or the lack thereof) of what risk in IT systems actually is. I bet at least 80% of you will stop reading now. For the other 20% - I am happy you are curious.

The ideas I describe in this post were in large part shaped by the literature on risk management and modeling in capital markets and portfolio management.

Do you know what our biggest problem in IT is? It’s our complete and utter inability to measure the risk of the systems and services that we build. Often, people don’t even understand what risk is and confuse it with other concepts.

Why do we need to measure risk? To do A/B testing, among other things. When you implement two approaches to solving a problem and need to decide which one to put in production, you’d better be able to compare the risk of each approach and act accordingly. Also, a deeper understanding of risk should help us analytically determine the risk of bigger systems composed of individual smaller systems, components and services whose risk we know. The ability to measure risk analytically will allow us to forecast a system’s risk before spending time and resources building it - a huge win.

Let’s say you have a system S. It can be an individual service in your stack, or it could be an entire web-facing system that consists of multiple frontend and backend services. Let A(t) be a function that somehow measures the overall health of your system S at time t, such that 0 ≤ A(t) ≤ 1. There is no universal rule for what A(t) should be. For some systems, it could be the percentage of incoming queries answered correctly within a predetermined time. For others, it could be the percentage of available workers.

Now let’s pick two numbers f and r such that 0 ≤ f < r ≤ 1 (as you may have guessed, “f” stands for failure and “r” stands for recovery). We will say that system S experienced a failure if A(t) falls to or below f, and we will say that system S has recovered from the failure when, after hitting f, A(t) climbs back up to at least r.

In this formalization, time to recovery (TTR) is going to be Tr - Tf such that:

  • A(Tf) ≤ f
  • A(Tr) ≥ r
  • Tf < Tr
  • ∀ t, Tf ≤ t < Tr : A(t) < r

If you average TTRs over a period of time, you get a well-known measure called mean time to recovery (MTTR).
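
As a sanity check of these definitions, here is a small Python sketch (with made-up sample data and thresholds) that scans a sampled A(t) series, detects failure and recovery points according to f and r, and computes the TTRs and MTTR.

```python
f, r = 0.5, 0.9   # failure and recovery thresholds, 0 <= f < r <= 1

# hypothetical hourly samples of A(t)
samples = [1.0, 0.95, 0.4, 0.3, 0.6, 0.92, 1.0, 0.45, 0.7, 0.95, 1.0]

ttrs, t_failure = [], None
for t, a in enumerate(samples):
    if t_failure is None and a <= f:        # failure: A(t) dropped to f or below
        t_failure = t
    elif t_failure is not None and a >= r:  # recovery: A(t) climbed back to r or above
        ttrs.append(t - t_failure)          # TTR = Tr - Tf
        t_failure = None

mttr = sum(ttrs) / len(ttrs)
print(ttrs, mttr)   # [3, 2] 2.5
```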

Similarly, time between failures (TBF) is going to be Tf2 - Tf1 such that:

  • A(Tf2) ≤ f
  • A(Tf1) ≤ f
  • Tf1 < Tf2
  • ∃! Tr , Tf1 < Tr < Tf2 : A(Tr) ≥ r
  • ∀ t, Tf1 < t < Tr : A(t) < r
  • ∀ t, Tr < t < Tf2 : A(t) > f

(the last three conditions ensure there was exactly one recovery between the failures at Tf1 and Tf2)

If you average TBFs over a period of time, you get mean time between failures (MTBF).
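
TBF can be computed in the same spirit: given the failure start times Tf detected as in the earlier sketch, consecutive differences give the TBFs, and their average is MTBF. A minimal sketch with made-up timestamps:

```python
# hypothetical failure start times (hours since some epoch), each followed by a recovery
failure_starts = [2, 7, 40, 95]

tbfs = [t2 - t1 for t1, t2 in zip(failure_starts, failure_starts[1:])]  # TBF = Tf2 - Tf1
mtbf = sum(tbfs) / len(tbfs)
print(tbfs, mtbf)   # [5, 33, 55] 31.0
```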

TBF is an indicator of how prone S is to failure (it could be bad system design, but it could also be a complex domain with a lot of external dependencies). TTR shows how good your tooling and processes around failure detection and remediation are. But by themselves, neither of them, nor their means, tells you anything about risk.

Instead, pick a period length, sum all TTRs in each period and divide the sum by the length of the period. For example, let’s use a monthly scale and say that in the month of January your TTRs were 3 hours, 5 hours and 1 hour. 3 + 5 + 1 = 9 hours, divided by 24*31 = 744 (the number of hours in January), gives roughly 1.21%.
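
In code, the same arithmetic looks like this (the TTR values are the ones from the example above):

```python
ttrs_january = [3, 5, 1]            # hours spent in failure during January
hours_in_january = 24 * 31          # 744
failure_fraction = sum(ttrs_january) / hours_in_january
print(f"{failure_fraction:.2%}")    # 1.21%
```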

If you repeat this arithmetic exercise for many months, you will end up with a series of numbers representing the amount of failure in S per month. We will assume here that the random variable “monthly amount of failure” is normally distributed (this assumption could be proven wrong in the future, but I am going with it by default because many processes in nature are normally distributed; our industry doesn’t publish enough operations metrics publicly, so confirming whether this is true is nearly impossible at this time - see below).

If you average these numbers, what you end up with is the expected amount of failure per month (the mean of the normal distribution, corresponding to the top of the bell curve). Is this risk? No.

Imagine a system that has been down 5% of the time every single month for the past 5 years, no matter what you do, no matter how much the business has grown over this period, how much more complex the system has become and how much you have improved your automation. The risk in this system is almost 0 - you are almost positive its amount of failure next month will once again be 5%.

Risk corresponds to the standard deviation of the monthly amounts of failure, not to their mean (standard deviation is the square root of variance). A system S1 with smaller variance (for example, every month S1 is in failure mode roughly the same amount of time) has lower risk than a system S2 with larger variance (for example, one month it is in failure 10% of the time, another month 0.1% of the time), even if the expected amount of failure of S1 is bigger than that of S2.
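
Here is a small sketch of that comparison, with made-up monthly failure fractions for S1 and S2: S1 fails more on average but is far more predictable, so its risk (standard deviation) is much lower.

```python
from statistics import mean, stdev

# hypothetical monthly amounts of failure (fraction of the month spent in failure)
s1 = [0.050, 0.049, 0.051, 0.050, 0.052, 0.048]   # steady ~5% every month
s2 = [0.001, 0.100, 0.002, 0.080, 0.005, 0.090]   # anywhere from 0.1% to 10%

print(mean(s1), stdev(s1))   # expected failure ~0.050, risk ~0.0014
print(mean(s2), stdev(s2))   # expected failure ~0.046, risk ~0.048
```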

How about an individual change (say, something you get asked about when you attend a change management meeting), whether repetitive, like pushing out a new release, or performed only once? Similar to a system, a change has a success function with values between 0 (the change totally didn’t work) and 1 (the change worked as expected, without unexpected side effects). The same logic about expected success rate, and about risk as the standard deviation of outcomes, applies here as well - you model the change as if it were performed over and over again.

To sum up:

  • TBF shows how prone your system is to failure
  • TTR shows how good your troubleshooting is
  • mean amount of failure per period shows expected amount of failure per period
  • standard deviation of amount of failure per period is risk
  • risk and expected amount of failure are two totally different measures

Last year, in his talk, Christopher Brown discussed devops as a craft and as a science. In order to properly graduate from the former to the latter, our discipline needs to start sharing hard data about techops metrics so that research can be done and new theories can be tested against real-life historical data. I am primarily looking at shops of techops excellence like Amazon, Etsy, Facebook, Github, Google, Netflix, Twitter and others.

Categories: devops |

What "Software Defined" Actually Means

There seems to be a pretty widespread belief held by many IT practitioners that “software defined” stands for something that is dynamically configurable or something that offers all or most of its administration functions via an API. While I won’t argue that these aren’t necessary features of “SDX” (software defined something), it’s a mistake to view them as sufficient.

The term “software defined” first appeared in the context of networking, applied to a new paradigm for switches and routers.

Traditionally, these devices are built as relatively monolithic hardware units. Their logical structure, however, can be viewed as consisting of two parts: the data plane and the control plane.

The data plane is where data packets between hosts on the network travel on their way from point A to point B. Algorithms here are relatively straightforward, uniform and sensitive to speed, and they rarely if ever need to change drastically - in short, a perfect candidate for implementation in hardware.

The control plane, however, does not share these characteristics. As networks become more sophisticated, algorithms here can be pretty complex (and complexity often leads to bugs that need to be fixed). They are far from uniform because they are expected to support a wide range of use cases and deployment scenarios. And they can be very dynamic.

The whole point of SDN was the realization that the control plane is not a good match for rigid hardware. The idea was to implement each plane on top of something ideally suited to its characteristics - hardware for the data plane, software for the control plane.

This explains “software.” Let’s explain “defined” now.

Even in the past, you could control your network devices programmatically, and they had extensive capabilities to reconfigure themselves dynamically. While this was not an intended use case and was usually an afterthought, it was not impossible. But no matter how you did it, you still ended up managing an inflexible control plane inside the hardware. What SDN did was eliminate the control plane from the hardware. In SDN, there is no control plane within the hardware - it lives entirely outside the hardware, implemented in software.

In SDN, there must be a clear delineation between the data plane and the control plane. The data plane does not make any decisions by itself. It only performs its work, feeds runtime data to the control plane and accepts commands from it. There can be no commands from the data plane to the control plane.
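
As a conceptual sketch (not any real SDN protocol or controller API), the separation might look like this in code: the data plane only forwards according to rules it was given and reports counters back, while all decisions live in a separate software controller.

```python
# Conceptual sketch only: class and method names are illustrative, not a real SDN API.

class DataPlane:
    """Dumb forwarding element: applies rules, reports counters, makes no decisions."""

    def __init__(self):
        self.rules = {}      # destination prefix -> output port
        self.counters = {}   # destination prefix -> packets forwarded

    def install_rule(self, prefix, port):      # command from the control plane
        self.rules[prefix] = port

    def forward(self, prefix):
        port = self.rules.get(prefix)          # no rule -> drop; never asks "what should I do?"
        if port is not None:
            self.counters[prefix] = self.counters.get(prefix, 0) + 1
        return port

    def stats(self):                           # runtime data fed back to the control plane
        return dict(self.counters)


class Controller:
    """Control plane in pure software: computes rules and pushes them down."""

    def __init__(self, data_plane):
        self.data_plane = data_plane

    def apply_policy(self, routes):
        for prefix, port in routes.items():
            self.data_plane.install_rule(prefix, port)


# usage: decisions flow controller -> data plane, telemetry flows back
dp = DataPlane()
Controller(dp).apply_policy({"10.0.0.0/8": 1, "192.168.0.0/16": 2})
dp.forward("10.0.0.0/8")
print(dp.stats())   # {'10.0.0.0/8': 1}
```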

For example, you hear a vendor introduce an API and describe it as SDN. You start digging and find out that the API is just a facade over their old control plane, which still lives in hardware, intermingled with the data plane so that the two essentially form one inseparable piece. Is this true SDN? No.

To sum up, “software defined” is an excellent term (not a misnomer), and it actually stands for some important design principles - separation of the data plane from the control plane, with the control plane implemented purely in software.

All “software defined” is automation and dynamic reconfigurability but not all automation and dynamic reconfigurability is “software defined,” no matter how much marketing departments may want you to believe otherwise.

Categories: cloud-computing |

The Dilemma of API

I had an interesting conversation on Twitter earlier this week that indirectly helped me realize something very important.

We all know what an API is, and we all know lots of examples of successful services that were made possible by someone else’s API. We can also recall several examples when an API backfired, at least in part. Take, for instance, Twitter itself and its relationship with its developer ecosystem, which could probably best be described as “rocky at times.”

But instead of looking at an API from the ecosystem’s perspective, let’s look at it from the point of view of the provider.

Imagine you lead a company that offers a way to consume your service’s functionality programmatically - i.e., through an API or SDK.

It is indeed possible that your goal is to build an ecosystem around your main service. But, on the other hand, perhaps an ecosystem is not in your plans. There are many services and use cases where it’s significantly easier to interact with a service programmatically. For example, when you are importing your HR data into a new system, you most likely want to automate the process rather than manually enter each individual record one by one. In other words, maybe the reason you make an API available is not to form an ecosystem but to allow your customers to automate their interactions with your service.

And here is the dilemma.

At the time an API is introduced, a provider can’t credibly signal which way they want it to be used - in other words, they can’t send a signal credible enough to indicate whether they want to foster an ecosystem or only want to enable time-saving automation (history shows that executives’ interviews and blog posts are not credible enough).

Furthermore, while ecosystem-aspiring providers will never mind their API being used for pure automation, automation-aspiring providers can’t control in advance whether a developer uses their API to kickstart an ecosystem, even against the provider’s will.

When a service puts up its API with the stated intention of growing an ecosystem, what they actually mean is “we want you developers to be a part of our ecosystem so that our audience grows and more people are using us (even if we share the spoils with you) but if you do something (and as of now, we have no idea what it might be) that we like and that we think we should have more control over, we will adjust the rules of our ecosystem as we see fit at that time.”

The dilemma is that a provider can’t outline its rules early enough to avoid having to change them in the future because it can’t know early enough what uses of its API it will like and what use cases it won’t want to tolerate.

In this context, you may want to revisit links from my blog post from about 2 years ago titled Ecosystems and Platforms.

Categories: internet |
