Most of you have probably heard about a recent outage at BitBucket. In a nutshell, their systems hosted at AWS came under a UDP flood DDoS attack. The flood saturated their instance's network interface, which left them unable to reach their data stored on EBS, which in turn made their application unresponsive.
This outage shed more light on some internal designs of EC2 itself, as described here. It might have also showcased our over-confidence in EC2’s ability to detect and defeat certain types of network attacks. But this post is about something else.
BitBucket was running their web front door and their backend application on the same instance. The front door is the part of the system that faces the Internet; its job is to accept connections from clients. For obvious reasons, the front door runs on the service's discoverable IP address - whether they used an Elastic IP or not, bitbucket.org resolved to that IP. Note that a front door (usually) doesn't need EBS.
The backend, however, is what needs EBS for disk persistence. At the same time, the backend does not need to be publicly discoverable - as long as the front door knows where its backend worker(s) is/are running, the application should function just fine.
Had the front door and backend been running on different instances, the UDP flood would have saturated only the former's network interface and would have had no impact on the backend and its EBS volumes.
I know that AWS reportedly fixed the flood issue, but it looks to me like separating the front door from the application backend may still be a good preventive measure - after all, it's considered a good practice for a reason.
Please note that I am not trying to accuse BitBucket of running a bad architecture and causing their own outage. All I am doing is trying to learn a lesson.