Standalone Web Front Door a Must in EC2?

Most of you have probably heard about a recent outage at BitBucket. In a nutshell, their systems hosted on AWS came under a UDP flood DDoS attack, which dramatically increased inbound traffic, saturated their instance's local network interface, cut them off from their data stored on EBS, and ultimately left their application unresponsive.

This outage shed more light on some of EC2's internal design, as described here. It may also have exposed our over-confidence in EC2's ability to detect and defeat certain types of network attacks. But this post is about something else.

BitBucket was running their web front door and their backend application on the same instance. The front door is the part of the system that faces the Internet; its job is to accept connections from clients. For obvious reasons, the front door runs on the service's publicly discoverable IP address - whether they used an Elastic IP or not, bitbucket.org resolved to that IP. Note that a front door (usually) doesn't need EBS.

The backend, however, is what needs EBS for disk persistence. At the same time, the backend does not need to be publicly discoverable - as long as the front door knows where its backend worker(s) are running, the app functions just fine.

With the front door and the backend running on different instances, a UDP flood would have saturated only the former's network interface and would have had no impact on the backend and its EBS traffic.
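To make this concrete, here is a minimal sketch (in Python, purely illustrative - not what BitBucket actually runs) of such a front door: it terminates client HTTP connections on the public instance and relays each request to a backend worker over the private network. The backend address and ports are hypothetical placeholders; a real deployment would use nginx or haproxy rather than hand-rolled code.

```python
# Sketch of a "front door" instance: accepts client connections on the public
# address and relays requests to a backend instance over the private network.
# The front door itself touches no persistent storage (no EBS dependency).
import http.client
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

BACKEND_HOST = "10.0.1.20"   # hypothetical private IP of the backend instance
BACKEND_PORT = 8080          # hypothetical backend port

class FrontDoorHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Open a connection to the backend for each request and relay the
        # status, headers, and body back to the client (GET only, for brevity).
        backend = http.client.HTTPConnection(BACKEND_HOST, BACKEND_PORT, timeout=10)
        try:
            backend.request("GET", self.path,
                            headers={"Host": self.headers.get("Host", "")})
            resp = backend.getresponse()
            body = resp.read()
            self.send_response(resp.status)
            for name, value in resp.getheaders():
                if name.lower() not in ("transfer-encoding", "connection"):
                    self.send_header(name, value)
            self.end_headers()
            self.wfile.write(body)
        finally:
            backend.close()

if __name__ == "__main__":
    # Listen on the publicly discoverable address (port 80 needs privileges).
    # A UDP flood saturating this instance's NIC would not touch the backend
    # instance or its EBS traffic.
    ThreadingHTTPServer(("0.0.0.0", 80), FrontDoorHandler).serve_forever()
```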

I know that AWS reportedly fixed the flood issue, but it looks to me like separating the front door from the application backend may still be a good preventive measure - after all, it's considered good practice for a reason.

Please note that I am not trying to accuse BitBucket of running a bad architecture and causing their own outage. All I am doing is trying to learn a lesson.

Categories: cloud-computing

Comments (4)

Jesper Noehr // 13 Oct 2009

Our frontend needs disk access. A more reasonable way to fix the problem is to access EBS over an internal interface, which is what we assumed Amazon was already doing. That, or QoS (which was deployed but wasn't working, according to Amazon).

Dmitriy // 13 Oct 2009

@Jesper

Thanks for commenting. Totally agree with you - the hosting provider dropped the ball on this one. Besides, this incident changed our understanding of how EBS is implemented in the first place - I did not foresee that EBS traffic shares a NIC with regular traffic, for example.

I meant "front door" in a wider sense - anything where end-users' connections are terminated. A "network front door" as opposed to "webapp front door" (which gives out static content, etc).

"Network front door" can be implemented as reverse proxy for HTTP (nginx, squid, varnish) and as generic TCP forwarder for everything else (haproxy).

I think such "network front door" instance would not need access to your EBS disk.

Jesper Noehr // 13 Oct 2009

@Dmitriy,

I understand what you're saying, and that's part of what we're doing. nginx serves some static media, which is part of our repository, which lives on EBS.

Surely we could've prevented that from being the case, but any way you look at it, the site would've been down due to the DDoS. The problem lay in not being able to tell *what* the problem was.

Dmitriy // 13 Oct 2009

@Jesper

Agreed. Detection and mitigation of DDoS attacks is usually best accomplished at the hosting-provider level, precisely because the provider has more visibility.

And I certainly agree that it took AWS slightly longer than I'd have expected to diagnose the issue.