For regular agile admin readers, I wanted to point out the post I did on the Bazaarvoice engineering blog, How Bazaarvoice Weathered The AWS Storm, on how we have designed for resiliency to the point where we had zero end user facing downtime during last year’s AWS meltdown and Leapocalypse. It’s a bit late, I wrote it like in July and then the BV engineering blog kinda fell dormant (guy who ran it left, etc.) and we’re just getting it reinvigorated. Anyway, go read the article and also watch that blog for more good stuff to come!
Tag Archives: outage
As I write this, our ops team is working furiously to bring up systems outside Amazon’s US East region to recover from the widespread outage they are having this morning. Naturally the Twitterverse, and in a day the blogosphere, and in a week the trade rag…axy? will be talking about how the cloud is unreliable and you can’t count on it for high availability.
And of course this is nonsense. These outages make the news because they hit everyone at once, but all the outages people are continually having at their own data centers are just as impactful, if less hit-generating in terms of news stories.
Sure, we’re not happy that some of our systems are down right now. But…
- The outage is only affecting one of our beta products, our production SaaS service is still running like a champ, it just can’t scale up right now.
- Our cloud product uptime is still way higher than our self hosted uptime. We have network outages, Web system outages, etc. all the time even though we have three data centers and millions of dollars in network gear and redundant Internet links and redundant servers.
People always assume they’d have zero downtime if they were hosting it. Or maybe that they could “make it get fixed” when it does go down. But that’s a false sense of security based on an archaic misconception that having things on premise gives you any more control over them.We run loads of on premise Web applications and a batch of cloud ones, and once we put some Keynote/Gomez/Alertsite against them we determined our cloud systems have much higher uptime.
Now, there are things that Amazon could do to make all this better on customers. In the Amazon SLAs, they say of course you can have super high uptime – if you are running redundantly across AZs and, in this case, regions. But Amazon makes it really unattractive and difficult to do this.
What Amazon Can Do Better
- We can work around this issue by bringing up instances in other regions. Sadly, we didn’t already have our AMIs transferred into those regions, and you can only bring up instances off AMIs that are already in those regions. And transferring regions is a pain in the ass. There is absolutely zero reason Amazon doesn’t provide an API call to copy an AMI from region 1 to region 2. Bad on them. I emailed my Amazon account rep and just got back the top Google hits for “Amazon AMI region migrate”. Thanks, I did that already.
- We weren’t already running across multiple regions and AZs because of cost. Some of that is the cost of redundancy in and of itself, but more importantly is the hateful way Amazon does reserve pricing, which very much pushes you towards putting everything in one AZ.
- Also, redundancy only really works if you have everything, including data, in that AZ. If you are running redundant app servers across 4 AZs, but have your database in one of them – 0r have a database master in one and slaves in the others – you still get hosed by a particular region downtime.
Amazon needs to have tools that inherently let you distribute your stuff across their systems and needs to make their pricing/reserve strategy friendlier to doing things in what they say is “the right way.”
What We Could Do Better
We weren’t completely prepared for this. Once that region was already borked, it was impossible to migrate AMIs out of it, and there are so many ticky little region specific things all through the Amazon config – security groups, ELBs, etc – doing that on the fly is not possible unless you have specifically done it before, and we hadn’t.
We have an automation solution (PIE) that will regen our entire cloud for us in a short amount of time, but it doesn’t handle the base images, some of which we modify and re-burn from the Amazon ones. We don’t have that process automated and the documentation was out of date since Fedora likes to move their crap around all the time.
In the end, Amazon started coming back just as we got new images done in us-west-1. We’ll certainly work on automating that process, and hope that Amazon will also step up to making it easier for their customers to do so.