Perlbal Reproxy and HTTP Auth

I use Perlbal in one of my systems to reproxy requests to internal URLs. Reproxying to a URL is a powerful feature that works like this:

  1. An HTTP request comes to Perlbal.
  2. Perlbal reverse-proxies it to one of its backend servers.
  3. The backend server does some work (in my case, extensive verification of the URL), but instead of returning the entire response (status, headers, body), it returns an X-REPROXY-URL header containing a list of URLs.
  4. Transparently to the end user, Perlbal fetches the content from one of these URLs and returns that new content to the user.
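
From the backend's point of view, step 3 amounts to returning an empty response whose headers carry the reproxy target list. Here is a minimal sketch as a Rack-style handler; the target URLs and the verification step are made up for illustration:

```ruby
# Hypothetical Rack-style backend handler: after verifying the request,
# return no body -- just an X-REPROXY-URL header whose value is a
# space-separated list of candidate URLs for Perlbal to fetch from.
REPROXY_TARGETS = "http://10.0.0.5/file.bin http://10.0.0.6/file.bin"

backend = lambda do |env|
  # ... extensive verification of env["PATH_INFO"] would go here ...
  [204, { "X-REPROXY-URL" => REPROXY_TARGETS }, []]
end

status, headers, _body = backend.call({ "PATH_INFO" => "/file.bin" })
```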

The other day I found out that Perlbal can’t reproxy to URLs that require HTTP basic authentication. Here is the part of Perlbal (from ClientProxy.pm) that parses the X-REPROXY-URL header; you can see clearly from the regex that it treats URLs as (host, port, path) tuples, with no room for credentials.

    # construct reproxy_uri list
    if (defined $urls) {
        my @uris = split /\s+/, $urls;
        $self->{currently_reproxying} = undef;
        $self->{reproxy_uris} = [];
        foreach my $uri (@uris) {
            next unless $uri =~ m!^http://(.+?)(?::(\d+))?(/.*)?$!;
            push @{$self->{reproxy_uris}}, [ $1, $2 || 80, $3 || '/' ];
        }
    }
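
To make the limitation concrete, here is roughly the same parsing logic transcribed into Ruby (my transcription, not Perlbal code): the regex captures only host, port, and path, so a `user:pass@` portion ends up glued to the host instead of being used as credentials.

```ruby
# Ruby transcription of Perlbal's X-REPROXY-URL parsing. Each URL is
# reduced to a [host, port, path] triple; there is simply no capture
# group for HTTP basic auth credentials.
def parse_reproxy_urls(header_value)
  header_value.split.map { |uri|
    m = uri.match(%r{\Ahttp://(.+?)(?::(\d+))?(/.*)?\z}) or next
    [m[1], (m[2] || 80).to_i, m[3] || "/"]
  }.compact
end

parse_reproxy_urls("http://internal:8080/a http://user:pw@internal/b")
# the second entry's "host" comes out as "user:pw@internal"
```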

And my backend Apaches do require HTTP auth. What to do?

RTFM to the rescue! Apache provides a very cool feature: the Satisfy (All|Any) directive. Essentially, for a Directory or Location I can specify both HTTP auth and IP-based access control, and with Satisfy Any, access is allowed if at least one of these conditions is met (the default is Satisfy All).

Here is what it looks like in the Apache config:

<Location /foo>
  # http auth
  AuthType basic
  AuthName "protected"
  AuthUserFile /etc/apache2/users
  Require valid-user

  order allow,deny
  # this is subnet where perlbal is running
  # backends see perlbal's reproxy requests from this subnet
  allow from 192.168.4 127.0.0.1
  satisfy any
</Location>

The alternative would have been to create a fake URI outside of the HTTP auth location and rewrite it with mod_rewrite, or possibly to use a symlink; both are far less transparent.

Categories: software-engineering |

Identification Friend or Foe (IFF) in IaaS Clouds

I was recently building a distributed system to run in the Amazon EC2 cloud. It consisted of several instances of the same AMI that were going to communicate with each other using private IP addresses assigned by EC2.

An interesting scenario occurred to me. What if, after the initial discovery of each peer’s internal IP address, one of the instances goes down (let’s say it was at IP1) and at least one other instance fails to notice and continues to communicate with IP1? EC2 assigns IP addresses dynamically, and as far as I can tell, IP1 could be assigned to someone else’s instance within the same minute. My instance would then unknowingly communicate with someone else’s instance, which is not something I want to allow.

A solution is something like what the military calls Identification Friend or Foe (IFF). You can read about it on Wikipedia or here. Note that you may want to consider IFF any time you run applications in an IaaS cloud that assigns IP addresses dynamically and/or gives you no way of predicting which IP address your next host will get.

My Basic IFF Solution

First of all, my instances do not have access to AWS credentials (here is why). Secondly, I required that all instances that need to communicate with each other be launched with the same user-data (from a running instance, you can obtain user-data from http://169.254.169.254/latest/user-data).

I then created two checksums (SHA1 or MD5): 4633e65fce4cf3b40648f574f4b60070 was a checksum of the user-data plus some file in the AMI (say /usr/share/doc/coreutils/NEWS.gz), and 7a66a9361b14e95c14d98522502b9487 was a checksum of the user-data plus another file in the AMI (say /bin/rmdir). Note that if the user-data on each instance is the same, these checksums will also be the same, because the files I selected are identical across instances of the same AMI.
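
A sketch of how such a checksum could be computed (MD5 here; the user-data value is a stand-in for whatever http://169.254.169.254/latest/user-data returns, and in practice the second argument would be something like File.binread("/bin/rmdir")):

```ruby
require "digest/md5"

# iff_checksum: digest of the shared user-data plus the contents of a
# file baked into the AMI. Instances launched from the same AMI with
# the same user-data all derive the same value.
def iff_checksum(user_data, anchor_file_contents)
  Digest::MD5.hexdigest(user_data + anchor_file_contents)
end

# Two friendly instances with identical inputs compute identical checksums:
a = iff_checksum("shared-user-data", "contents of /bin/rmdir")
b = iff_checksum("shared-user-data", "contents of /bin/rmdir")
```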

Then in Apache, I have the following configuration:

<Location /4633e65fce4cf3b40648f574f4b60070>
  AuthType basic
  AuthName 7a66a9361b14e95c14d98522502b9487
  AuthUserFile /etc/apache2/users
  Require valid-user
</Location>

Before establishing communications with a peer instance (and regularly afterwards), my instances fetch the HTTP headers for the above location (without actually submitting an HTTP auth username and password), check the WWW-Authenticate header, and look for the second checksum there. Easy and efficient. If both checksums match, the other instance is a friend; if not, a foe. I also assume that if I get no response at all, it’s not a friend: the instance might have gone down, it might not have Apache listening on port 80, or its web server might not know what to do with my URI.
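
The interrogation itself boils down to matching the Basic realm (Apache's AuthName) in the WWW-Authenticate header against the expected checksum. A sketch of that decision; in my setup the header would come from an unauthenticated HEAD request via Net::HTTP:

```ruby
# Decide friend or foe from a WWW-Authenticate response header.
# A friend answers our checksum-named Location with a 401 whose
# Basic realm (Apache's AuthName) is the second checksum.
def friend?(www_authenticate, expected_checksum)
  return false if www_authenticate.nil?   # no response at all => not a friend
  www_authenticate[/Basic realm="?([^"]+)"?/, 1] == expected_checksum
end
```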

You can further enhance this solution by creating new checksums every N minutes; this works reliably for as long as the EC2 infrastructure keeps instance clocks reasonably accurate. You can also embed a timestamp in the data used to generate the checksums. Furthermore, if you monitor your access logs for bad interrogations (for example, an old or wrong checksum), you can detect attacks against your IFF system.
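
Embedding a timestamp can be as simple as folding the current N-minute bucket into the digest input. A sketch, assuming instance clocks agree to within the bucket size (a tolerant interrogator would also accept the previous bucket to survive boundary skew):

```ruby
require "digest/md5"

# Rotating checksum: the same inputs plus the current N-minute time
# bucket. Instances with synchronized clocks derive the same value,
# and it changes every interval_minutes, limiting the replay window.
def rotating_checksum(user_data, anchor, interval_minutes, now = Time.now)
  bucket = now.to_i / (interval_minutes * 60)
  Digest::MD5.hexdigest("#{user_data}#{anchor}#{bucket}")
end
```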

A More Scalable IFF

The peer-to-peer IFF implementation I described above may not work well for large deployments or for enterprises. If you fall into either of these categories, I recommend you take a look at VPN-Cubed, a product offered by my employer CohesiveFT. Its features essentially serve as a scalable, encrypted IFF.

Categories: cloud-computing | distributed |

Introducing Rabbitbal

Inspired by Nanite, a very interesting project by Ezra Zygmuntowicz of EngineYard that uses RabbitMQ and the eventmachine-based Ruby amqp library by Aman Gupta, I sat down and wrote Rabbitbal, a reverse proxy for Rails (as well as other web frameworks, not necessarily limited to Ruby) on top of RabbitMQ. It’s now available on GitHub at http://github.com/somic/rabbitbal. The Rabbitbal code is based on Nanite.

Here are the benefits, as I see them, of the AMQP-based approach over traditional HTTP-based reverse proxies, taken from the Rabbitbal README file (in no particular order).

  1. A model where workers fetch work as fast as they can (each going at its own pace) should, in theory, provide more efficient load balancing than a model where the proxy assigns work based on certain criteria; methods based on round robin, arbitrary weights, or least connections simply become unnecessary.
  2. The RabbitMQ broker implements intelligent failover out of the box: if a web server disconnects before ack'ing, the request is automagically requeued for another server; all in all, RabbitMQ is way smarter than a bunch of low-level TCP connections.
  3. Enhanced security of actual web servers - servers behind Rabbitbal do not need inbound connectivity, they only need to be able to establish an outgoing connection to RabbitMQ broker(s).
  4. Rabbitbal does not need to know IPs or have direct connectivity into its web servers (use case: Amazon EC2 without Elastic IPs)
  5. Using the Duplication pattern of RabbitMQ (see Resources below), you could be reading requests and responses off the same broker in real time (access log aggregation, double-entry bookkeeping, logging, bot detection)
  6. You could relatively easily have one request go to more than one web server
  7. Add capacity as often and as much as you like - rabbitbal won't even know
  8. By slightly readjusting mapping between queues and URIs, you could add or remove capacity per URI if needed
  9. TCP overhead savings compared with HTTP proxies (AMQP uses persistent TCP connections)
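
Point 1 is the heart of it: instead of the proxy deciding who gets what, each worker pulls the next request when it is ready. The idea can be simulated with a plain in-process queue (no RabbitMQ here, just an illustration of pull-based dispatch with thread workers standing in for web servers):

```ruby
# Pull-based dispatch: workers take requests off a shared queue as fast
# as they individually can; a slow worker simply ends up taking fewer
# requests. With RabbitMQ the queue would live on the broker and each
# worker would ack after handling.
requests = Queue.new
20.times { |i| requests << "req-#{i}" }

handled = Hash.new(0)
mutex = Mutex.new

workers = 3.times.map do |id|
  Thread.new do
    loop do
      req = begin
        requests.pop(true)          # non-blocking pop; raises when empty
      rescue ThreadError
        break                       # queue drained, this worker is done
      end
      mutex.synchronize { handled[id] += 1 }
    end
  end
end
workers.each(&:join)
```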

Categories: ruby |

Technical Overview of CohesiveFT VPN-Cubed

A technical post on VPN-Cubed, to which I contributed several thoughts, is now up on the CohesiveFT Elastic Server blog.

Categories: cloud-computing | cohesiveft |

Forking Supervisor Daemon in Ruby

Here is my implementation of a forking supervisor daemon in ruby.
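
The core pattern behind such a daemon is small: fork a child to do the work, wait for it, and refork when it dies. A minimal sketch of the pattern only (not the implementation from the post, which also handles daemonizing and signals):

```ruby
# Minimal forking supervisor: run the given block in a child process
# and restart it each time it exits abnormally, up to max_restarts
# times. Returns the number of restarts performed.
def supervise(max_restarts, &work)
  restarts = 0
  loop do
    pid = fork { work.call }        # child runs the block, then exits
    Process.wait(pid)
    break if $?.success?            # clean exit: nothing to supervise
    break if restarts >= max_restarts
    restarts += 1
  end
  restarts
end
```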

Categories: ruby |
