Devops - Solution to a Problem, Not a Cure for All Ills

With great interest I read a recent post by Chris Hoff on the devops disconnect (make sure to read the comments too).

Devops as a way to promote a “collaborative and communicative culture” (see John Allspaw’s comment) - "devops the culture" henceforth - was born out of frustration on both sides of the house (dev and ops) when it was time to push out and troubleshoot a new code release. In most web companies, a new release would happen on a weekly, bi-weekly or monthly basis - and release night would turn into a night of bad blood, finger-pointing, cursing, non-stop conference calls and IM conversations, all-nighters and so on, with more of the same a couple of days later at the postmortem. Most frustratingly, anyone who has ever worked through anything similar inevitably realizes that much of the stress could have been avoided by simple coordination beforehand, when concerns could be raised and addressed.

Similarly, if you have ever had to troubleshoot someone else’s code in a stressful situation, you probably realized that the code in question was not written with such situations in mind - inconsistent and/or incomplete logging, no clear understanding of dependencies, failure to properly catch network socket exceptions. Again, this experience can be made much, much better by close involvement of ops in parts of the development process.

I believe it’s important to realize that “ops” in "devops the culture" does not stand for all of operations (which includes systems engineering, networking, storage and security). I specifically mean the subgroup of operations that is most commonly referred to as “application integration,” “application support” or “application engineering.” These are the folks who run the software infrastructure on top of which developers’ code runs. In bigger companies, these are dedicated people. In smaller ones, the same people may also wear other ops hats such as networking or storage.

Now, don’t get me wrong - I would LOVE to see "devops the culture" applied to all silos in IT. But it’s easier said than done, at least for purely technical decisions. I have been part of multi-silo technical teams tasked with making specific decisions. No matter how much discussion or coordination occurs, the decision still ends up being made by whoever knows the area best (have you ever witnessed developers giving suggestions to network engineers on how to design the network? what about storage people advising security? and the best part - systems engineers advising developers on how their code should run?). It’s hard to say that’s a bad thing either, especially from a responsibility and accountability standpoint.

Hence, in my humble opinion, "devops the culture" is not about how to bring collaboration to all of IT. It’s about avoiding frustrating experiences related to running one’s own code in production.

Network, security and systems do play a big role in "devops the infrastructure as code" (see Adam Jacob’s comment), however - there’s no question about that in my mind. How well these areas will lend themselves to automation remains to be seen, though.

Categories: devops |

Activity Streams, Cross-Posting and Pareto Efficiency

I once logged in to LinkedIn to reply to an InMail, and on the front page noticed several tweets from people with whom I am connected on LinkedIn and whom I also follow on Twitter. These were the same tweets I had just read in TweetDeck - and I ended up reading them twice! This got me thinking about cross-posting of social updates (Twitter, Facebook, Buzz, Foursquare, LinkedIn, etc).

First of all, it’s worth noting that incentives are misaligned here. For publishers, cross-posting of public updates may make some sense - the more channels their message appears in, presumably the better. For readers, however, the opposite is true - the more times you have to read the same update, the worse. As readers, we sometimes want to engage in a conversation - and it’s far from clear how to do that if you only receive a cross-posted copy, not the original update on the network where it was originally published.

Secondly, this emphasizes our current service-centric model of social media, as opposed to the person-centric model it should be. In a person-centric model, each individual would probably have several public streams (most will have only one, I guess) and several private streams (family, friends, work). You let others subscribe to the appropriate stream (with or without authorization, depending on your settings) and they configure their reading service to filter your updates in or out. For example, I may want to follow someone’s public stream but filter out their Foursquare notifications (unless they carry a payload that says something more than “I am at X”). For more on this, please check out Blaine Cook's recent post about identity.

And thirdly, this brings up the question of efficiency. Reading the same tweets twice is inefficient no matter how you look at it. Disconnecting from someone on LinkedIn just because I don’t want to see their updates from Twitter is also inefficient. Unfollowing them on Twitter and reading their updates on LinkedIn may make sense, but I don’t log in to LinkedIn very often - so this ends up being inefficient too.

It’s interesting to note that the question of efficiency does not exist in the context of a single network. You either follow/friend an individual on a given network and receive their updates, or you don’t. This is a binary choice, which is relatively easy. It’s when the same updates get duplicated across multiple networks that the issue of efficiency starts to play a role.

Efficiency is a tricky concept - it depends on the angle from which you are analyzing the current situation. This is why economics has the concept of Pareto efficiency. An outcome is considered Pareto efficient if no one can be made better off without making someone else worse off. Pareto-efficient outcomes have two interesting properties that apply to social media as well.

  1. There can be multiple Pareto-efficient outcomes. Imagine you and I are walking down the street and see $100 in $10 bills on the ground. We could split it 60-40, or 70-30, or 50-50. Each of these outcomes is Pareto efficient by definition. Similarly, I can follow you on Foursquare and get your location updates. Or I can follow you on Twitter while you configure your Foursquare to cross-post to Twitter, and I will get the same updates.
  2. The fact that an outcome is Pareto efficient doesn't necessarily mean it's socially desirable or "the best" (consider your feelings if we split the $100 above as $70 for me and $30 for you).

We have yet to define “better off” in the context of social media. I can think of at least 2 forms here. The first, weaker, form could be “not missing an update in which one is interested.” The second, stronger, form could be “the weaker form + not receiving updates in which one is not interested.”

Based on all this, it looks to me like what we have now is vendors racing to a weaker-form Pareto-efficient outcome - every service wants to be your destination for writing status updates and for reading those of your friends, without any regard for you, the time you waste re-reading the same things many times, or the fact that most users would actually prefer the stronger form. If such an outcome is reached, it may be very difficult to change (because someone will have to be made worse off, by definition) if we then want to get to a person-centric model.

What I would like to see instead of cross-posting (which essentially is cross-pushing of the same update to multiple channels) is cross-pulling and filtering - essentially creating customized activity streams based on the reader’s preferences, not the publisher’s. I am concerned, however, about the present lack of incentives for this behavior to emerge.
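
To make the idea concrete, here is a toy sketch (in Python, with entirely made-up data and field names) of cross-pulling and filtering on the reader’s side: the client pulls updates from several networks, drops the update types the reader opted out of, and de-duplicates cross-posted copies of the same text.

    updates = [
        {"network": "twitter",    "author": "alice", "kind": "status",  "text": "Shipped the new release!"},
        {"network": "linkedin",   "author": "alice", "kind": "status",  "text": "Shipped the new release!"},
        {"network": "foursquare", "author": "alice", "kind": "checkin", "text": "I am at Cafe X"},
    ]

    reader_prefs = {"exclude_kinds": {"checkin"}}

    def personalized_stream(updates, prefs):
        seen_texts = set()
        for update in updates:
            if update["kind"] in prefs["exclude_kinds"]:
                continue                      # the reader opted out of this update type
            if update["text"] in seen_texts:
                continue                      # cross-posted duplicate, already shown once
            seen_texts.add(update["text"])
            yield update

    for update in personalized_stream(updates, reader_prefs):
        print(update["network"], "-", update["text"])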

Categories: internet | economics |

Developing API Server - Practical Rules of Thumb

I have been doing a lot of reading lately on how one would go about developing an API server. It’s an interesting topic, with various established schools of thought and multiple real-world implementations to compare against. In this post, I am going to summarize my findings, for my own reference as well as for anyone who may find themselves in a similar position. These are my rules of thumb, geared towards practicality. I may very well be wrong on these - if your experience tells you this makes no sense, I would love to hear your thoughts in the comments. Most examples and references below are from the IaaS space.

Query API vs REST API

To start, one should read this blog post by Jan-Philip Gehrcke about the various types of AWS APIs and the differences between RESTful and query APIs, and this blog post by William Vambenepe where he analyzes various IaaS API implementations (it’s a series of 3 posts). Then read Martin Fowler’s description of the Richardson Maturity Model.

In a nutshell, I think that from a practical standpoint, if one’s domain maps easily to a set of entities (nouns) and the API operations on these entities are primarily CRUD, one’s best bet is to go with at least Level 2 REST. If either condition doesn’t hold, I’d go with Level 0 REST, which is essentially what a query API is.

My main reason for not going with Level 0 when entities and operations do map is that I hate to see this metadata go to waste when it costs almost nothing to include.

Between Level 2 REST and Level 3 REST, I think Level 2 is more practical. According to Fowler, “Level 3 introduces discoverability, providing a way of making a protocol more self-documenting.” It’s certainly a nice feature, but I am not sure this added benefit justifies the extra development effort and slightly increased complexity (some might argue it may actually reduce complexity, though).

API frontend vs API methods implementation

Keep the implementation of your API methods separate from whatever frontend you are deploying (REST, SOAP, etc). The API methods are probably going to be the same no matter how they are called, so they should be frontend-independent. This will make it easier for you to introduce new frontends (AMQP, for example) and should facilitate code maintenance.
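
A rough sketch of this separation in Python (the names ServerOps and RestFrontend are made up for illustration, not taken from any particular framework):

    # Frontend-independent implementation of the API methods.
    class ServerOps:
        def __init__(self):
            self._servers = {}

        def create_server(self, name):
            server_id = len(self._servers) + 1
            self._servers[server_id] = {"id": server_id, "name": name, "state": "running"}
            return self._servers[server_id]

        def get_server(self, server_id):
            return self._servers.get(server_id)

    # Thin REST-flavored adapter: it only translates requests, it holds no logic.
    class RestFrontend:
        def __init__(self, ops):
            self.ops = ops

        def handle(self, method, path, body=None):
            if method == "POST" and path == "/servers":
                return 201, self.ops.create_server(body["name"])
            if method == "GET" and path.startswith("/servers/"):
                server = self.ops.get_server(int(path.rsplit("/", 1)[1]))
                return (200, server) if server else (404, None)
            return 400, None

    # An AMQP or command line frontend could wrap the same ServerOps instance unchanged.
    frontend = RestFrontend(ServerOps())
    print(frontend.handle("POST", "/servers", {"name": "web-1"}))
    print(frontend.handle("GET", "/servers/1"))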

HTTP verbs

Read and delete operations are easy - they map to GET and DELETE.

Create and update are trickier. The canonical description of HTTP verbs can be found in Section 9 of RFC 2616, and I use the table here as an addendum. In short, for both create and update, if the operation is idempotent and the URI of the entity on which it is being performed is known, use PUT. Otherwise, use POST (it is often used on entities representing “factories” - say, a factory of new postings; you don’t know the URI of a posting before you create it, so you POST to the factory, which creates a new entity at a new URI; note that POST is not idempotent).
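
To make the rule concrete, here is a small sketch using Python’s standard http.client module; api.example.com and the URIs below are placeholders, not a real API:

    import http.client
    import json

    conn = http.client.HTTPConnection("api.example.com")
    headers = {"Content-Type": "application/json"}

    # The URI is known up front and repeating the call leaves the same end state: PUT.
    conn.request("PUT", "/servers/42/tags/env",
                 body=json.dumps({"value": "staging"}), headers=headers)
    resp = conn.getresponse()
    resp.read()
    print("PUT ->", resp.status)

    # The URI of a new posting is not known yet; the factory resource assigns it: POST.
    conn.request("POST", "/postings",
                 body=json.dumps({"title": "hello"}), headers=headers)
    resp = conn.getresponse()
    print("POST ->", resp.status, resp.getheader("Location"))  # e.g. 201 plus the new URI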

Note the RFC definition of idempotent methods (9.1.2) - it’s not defined as “multiple invocations must lead to the same result as a single invocation.” It’s “(aside from error or expiration issues) the side-effects of N > 0 identical requests is the same as for a single request.”

HTTP return codes

Section 10 of RFC 2616 is a canonical description of HTTP status codes.

Successful completion should be signaled with HTTP 200 OK and, if it’s important for the client to know that an entity was created as part of the operation, HTTP 201 Created. The latter may be redundant - the code that handles 200 and 201 will most likely be identical or very similar.

Speaking of errors, I don’t think it’s practical to map each type of error to its own HTTP error code. Unexpected server-side errors (frontend exceptions or uncaught exceptions raised by your API methods) could be HTTP 500 Internal Server Error. If a resource is not found, it should be HTTP 404 Not Found. If your API server uses an external service to perform certain operations and the upstream service did not respond or returned an unknown error, I would signal this fact with HTTP 502 Bad Gateway.

The rest of the errors are all client-side, and I like to classify them into 2 categories. When something is wrong with the submitted request itself (missing header, missing argument, argument of the wrong type), I think the server should return HTTP 400 Bad Request. This way the server is telling the client that no matter how many times this request is re-submitted, it won’t work and will produce the same response.

I then group all other client-side errors together and think they should lead to HTTP 403 Forbidden. It means the request by itself is fine, but something is preventing the server from executing it - such as a missing prerequisite. Re-submitting the request may work in this case, because by the time it is re-submitted, something might have happened and the prerequisite may already be in place.

The error response could include the application-level exception and its description - this way you let the client know exactly what went wrong. Whether processing of these ends up automated or not is up to the client.
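
A minimal sketch of this classification in Python (the exception class names are illustrative; your own application-level exceptions would slot in the same way):

    class BadRequestError(Exception):
        """Malformed request (missing header/argument, wrong type) - retrying won't help."""

    class MissingPrerequisiteError(Exception):
        """The request is fine, but a prerequisite isn't in place yet - retrying may help."""

    class UpstreamServiceError(Exception):
        """An external service we depend on failed or didn't respond."""

    def to_http_response(call):
        """Invoke an API method and translate its outcome into an HTTP status code."""
        try:
            return 200, call()
        except BadRequestError as e:
            return 400, {"error": type(e).__name__, "message": str(e)}
        except MissingPrerequisiteError as e:
            return 403, {"error": type(e).__name__, "message": str(e)}
        except UpstreamServiceError as e:
            return 502, {"error": type(e).__name__, "message": str(e)}
        except Exception as e:  # anything unexpected from the frontend or API methods
            return 500, {"error": type(e).__name__, "message": str(e)}

    def start_server(server_id):
        raise MissingPrerequisiteError("volume not attached yet")

    print(to_http_response(lambda: {"state": "running"}))   # (200, {...})
    print(to_http_response(lambda: start_server("i-123")))  # (403, {...})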

Message formats

I can’t easily justify this one, but I feel that the bodies of the request and the response should be in the same format (there could be exceptions - for example, when a client must upload a binary artifact). vCloud does it this way - the request body is XML, and the response is XML. The EC2 API sends request arguments in the query string (because all requests are GET, since it’s a query API) and the response is XML. The OCCI API defines the request body as form-urlencoded (application/x-www-form-urlencoded) and the response is XML as well (all of the above might support JSON too).

I have 2 weak justifications for this.

Firstly, it somewhat mimics regular human behavior. If 2 people are communicating in real time, they usually use the same medium and the same format. It’s rare for one person to be on IM speaking English while the other is on the phone speaking French - I am not saying it’s impossible, just relatively rare.

Secondly, in the future I foresee greater use of messaging in API operations (read this post by George Reese). The notions of request and response come from HTTP; in messaging it doesn’t matter - the same message could be a response to one message and a request to another. For example, a message requesting a server start may lead to a message saying “server started” being sent to the client. At the same time, the same “server started” message may go to an internal billing system, where it would be a request to start billing.

Having these messages in the same format might be beneficial.
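
As a toy illustration (no real message broker involved, and the field names are made up), the same “server started” message below acts as a response for the API client and as a request for a hypothetical billing component:

    import json

    def notify_client(message):
        # To the client, this message is the response to its earlier start request.
        print("client sees result:", message["event"], "for", message["server_id"])

    def billing_consumer(message):
        # To billing, the very same message is a request to start charging.
        print("billing starts metering server", message["server_id"])

    server_started = json.dumps({"event": "server_started", "server_id": "i-12345"})

    for consumer in (notify_client, billing_consumer):
        consumer(json.loads(server_started))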

Command line tool

AWS set the bar with EC2 here. For every API call, they ship a command line tool to perform that call. Whether you think this is right or wrong, I believe every provider should match this behavior. It’s good practice after all - when someone is about to try an API, it’s much easier to get going with command line tools than by embedding API calls straight into an application.

Instead of the EC2 practice of one command line tool per API call, however (even though internally they all still call ec2-cmd), I favor Sun Cloud’s approach - they were planning a single unified tool where the API call would be identified by an option or a subcommand.
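
Here is a minimal sketch of the single-tool-with-subcommands style using Python’s argparse; the tool name and subcommands are hypothetical, not Sun Cloud’s actual CLI:

    import argparse

    def main():
        parser = argparse.ArgumentParser(prog="cloud")
        sub = parser.add_subparsers(dest="command", required=True)

        run = sub.add_parser("run-instance", help="launch a new instance")
        run.add_argument("--image", required=True)

        describe = sub.add_parser("describe-instances", help="list instances")
        describe.add_argument("--id", help="limit output to a single instance")

        args = parser.parse_args()
        # Each subcommand maps one-to-one to an API call on the server side.
        print(f"would call API operation '{args.command}' with {vars(args)}")

    if __name__ == "__main__":
        main()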

Conclusion

As the Zen of Python goes, “practicality beats purity.” This should be your main guiding principle when designing the server side of an API.

Categories: software-engineering |

Probing Ports in Remote Security Groups in EC2

This is the third part of my series on Amazon EC2 security groups. In part 1, I described how security groups are possibly the most underappreciated feature in EC2. In part 2, I described a UDP hole punching technique, which led to some interesting conclusions.

On several occasions, when troubleshooting a connectivity issue or verifying a deployment’s security, I needed to check local and/or remote security group rules from within my instances. I never keep AWS credentials on the instances, so I couldn’t use the API. And the EC2 metadata server (169.254.169.254) returns only the names of security groups, not the rules themselves. In other words, I needed to answer the following question - “do the remote instance’s security groups allow communications from this instance on TCP/UDP port X or not?” Note that trying to open a connection and not getting a response doesn’t answer this question - packets could be blocked either by security groups or by the remote instance’s local firewall (iptables). It turns out there is a way…

Let me first start with a warning. You are probably OK using this technique against your own instances, or against instances of your friends, partners, vendors and customers with their permission, when you know the specific ports you need to probe, for verification or troubleshooting purposes. Probing someone else’s instances or random ports may get you in trouble - such activities could fall into a category prohibited by the AWS Acceptable Use Policy. You have been warned.

Summary

This technique works using private IP addresses, so both instances must be running within the same EC2 region.

To check whether the remote instance’s security groups allow communications from your instance on UDP port $X, run:

traceroute -p $X -w 1 -q 1 -m 1 $REMOTE_INSTANCE_PRIVATE_IP

To check whether the remote instance’s security groups allow communications from your instance on TCP port $X, run:

tcptraceroute -w 1 -q 1 -m 1 $REMOTE_INSTANCE_PRIVATE_IP $X

(tcptraceroute is not a standard tool installed by default; you can get it from here, or apt-get install tcptraceroute on Debian or Ubuntu. In both commands, -w 1 waits at most one second per probe, -q 1 sends a single probe per hop, and -m 1 limits the trace to a single hop.)

If you see “1 {IP address of first hop} XX.XXX ms”, the remote instance’s security groups allow such communications from your private IP address. If you see “1 *”, they don’t.

Background

This technique is based on one of the conclusions of my UDP hole punching post.

In that post, it looked to me like each dom0 in a region knows about all security group rules in that region. Additionally, it looked like when a dom0 knows that the remote dom0 will not accept traffic, it doesn’t even bother sending it. If we assume this is done on dom0 with iptables, they probably simply DROP such traffic (at least I would if I were them). And if they do, no ICMP Time Exceeded packet is sent back to us - and hence traceroute won’t be able to report the first hop.

If we do get that ICMP packet back from dom0, it means it didn’t drop our packet - which means it knows the other dom0 currently has a security group rule allowing it.

I set max_ttl to 1 (the -m 1 flag), since the first hop in traceroute is believed to represent dom0.

Obviously, I verified this theoretical hypothesis on a pair of EC2 instances in us-east-1, and didn’t see any indication that it could be wrong.

Protecting Against Such Probes

If you have instances in EC2, you may be wondering if you could somehow protect them from such probes. After all, AWS network monitoring is top-notch, but they have a lot of hosts to watch over, so some small amount of probing may still occur and you may not even know it.

I’ve got good and bad news on this front. The good news is that you can do it; the bad news is that it’s going to be ugly. Essentially, to avoid disclosing the ports open in your security groups to such probes, you must configure your security groups not to allow connections from arbitrary instances in your EC2 region. In other words, you must exclude 10.0.0.0/8 from all IP-address-based rules (rules granting access by the name of a security group are not affected). This means replacing each rule that references 0.0.0.0/0 with multiple rules individually referencing 128.0.0.0/1, 64.0.0.0/2, 32.0.0.0/3, 16.0.0.0/4, 1.0.0.0/5 (I would avoid 0.0.0.0 just in case - hence 1 as the first octet), 12.0.0.0/6, 8.0.0.0/7 and 11.0.0.0/8. Yep, 8 rules instead of 1. You also have to repeat the same exercise for every IP subnet referenced in your rules that includes 10.0.0.0/8.
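
If you would rather not work these blocks out by hand, Python’s standard ipaddress module can generate the decomposition (note that it yields the canonical 0.0.0.0/5 rather than the 1.0.0.0/5 workaround above):

    import ipaddress

    everything = ipaddress.ip_network("0.0.0.0/0")
    region_private = ipaddress.ip_network("10.0.0.0/8")

    # Yields: 128.0.0.0/1, 64.0.0.0/2, 32.0.0.0/3, 16.0.0.0/4,
    #         0.0.0.0/5, 12.0.0.0/6, 8.0.0.0/7, 11.0.0.0/8
    for block in everything.address_exclude(region_private):
        print(block)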

Conclusion

In the future, of course, I hope the security groups API can be extended to support “allow connections from the Internet excluding private IPs in this region” functionality. I am going to add it to my previous wishlist.

Categories: cloud-computing |

IaaS, Hype and Marginal Cost

Theo Schlossnagle published a great piece titled The cloud is great. Stop the hype. As a technologist, I totally agree that hype is what’s killing it. In fact, on several occasions I have mentioned to my co-workers that I often get the feeling cloud computing is already a bubble that will burst sooner rather than later.

Speaking of the public IaaS vs. datacenter virtualization runoff, however, I would like to point out that Theo’s points make a lot of sense for someone who already has an established, well-functioning server hardware and network bandwidth operation. If you do, then indeed, in a lot of cases, as Theo shows in his post, the IaaS cloud may be a weak alternative.

But the reality is that not all organizations fall into that category. I had to borrow a pen (!) in the office the other day, let alone find a screwdriver to rack all those servers and add more RAM to them. And I am not even talking about having a network cable punchdown tool at work. If I were tasked with building out a hardware platform to support the business, I’d be quite stuck.

Dealing with hardware is a complicated process. Having established relationships with vendors to get better service, building up volume to earn better prices, and so on are not things that get accomplished overnight (by the way, I have never dealt much with hardware, as you can probably tell). Having space in a rack, having enough power cords - the list goes on and on.

Bandwidth is the same story. It’s not like you can go to Best Buy and pick up whatever you need to get more bandwidth for the marketing campaign you are about to run.

It’s all about marginal cost - the cost of adding one more server or one more Mbps of bandwidth. If you have nothing (or what you have is dysfunctional due to required approvals, ridiculous wait times, etc), the IaaS cloud may very well look like the better alternative.

I hope you all follow Theo’s advice - “Use the cloud where it makes sense.” And please - stop the hype!

Categories: cloud-computing |
