DevOps Tip: Design for Failure

We have had some interesting internal discussions lately about application reliability. It’s probably not a surprise to many of you that the cloud is unreliable, on a small scale that is. Sure, on the large scale you use the cloud to make highly resilient environments. But a certain percentage of calls to the cloud fail – whether it’s Amazon’s or Azure’s management APIs, or hitting Amazon or Azure storage, or going through an Amazon ELB, or hitting SQL Azure. Heck, on Azure they plainly state that they will pull your instances out from under you, restart them, and move them to other hardware without notice. If you’re running 2 or more, they won’t do them all at the same time – so again, you get large scale resilience but at the cost of some small scale unreliability.

The problem is, that people sometimes come from the assumption that their application is always working fine, unless you can prove otherwise. This is fundamentally the wrong assumption. You have to assume your application has problems, unless you can prove it doesn’t. This changes your approach to testing, logging, and monitoring profoundly.

Take the all too common example of an app with intermittent failures. Let’s say it’s as bad as 1 in 20 times. 1 in 20 times a customer hits your application, it fails somehow. It is very likely you don’t know this. Because by default, you don’t know it. How would you? Well, obviously, by monitoring, logging, and testing. I’ll follow this up with a series of posts describing how and why those often fail to detect problems. The short form is that “ha ha, no they don’t.”

Here’s a bad story I’ll tell on myself. Here at NI, we rolled out a PDF instant quote generation widget. We have over 250 apps on ni.com, so we don’t put synthetic monitors on all of them (remind me to tell you about the time early at NI that I discovered synthetic monitoring was producing 30% of our site load). Apparently the logging wasn’t all that good either, it didn’t trigger any of our log monitoring heuristics. Anyway, come to find out later on that the app was failing in production about 75% of the time. This is an application on a “monitored” site, where a developer and a tester signed off on the app. Whoops. If you do a cursory test and assume it’ll work – well you know what they say about assumptions – they make an ass out of “you” and “mption.” 🙂

Anyway, to me part of the good part about the cloud is that they come out and say “we’re going to fail 2-5% of the time, code for it.” Because before the cloud, there were failures all the time too, but people managed to delude themselves into thinking there weren’t; that an application (even a complex Internet-based application) should just work, hit after hit, day after day, on into the future. So by having handling failure built in – like a lesser version of the Chaos Monkey – you’re not really just making your app cloud friendly, you’re making it better.

Real engineers who make cars and whatnot know better. That’s why there was a big ol’ maintenance hatch on the side of the Hubble Space Telescope; if any of you have watched the Hubble 3D IMAX film you get to see them performing maintenance on it. If a billion dollar telescope in fricking space has problems and needs to be maintainable, so does your little Web app.

But I see so many apps that don’t really take failure into account. Oh, maybe they retry some connections if they fail. But what if you get to the end of your retries? What if the response you get back is an unexpected HTTP code or unexpected payload? You’d think in the age of try/catch and easily integrated logging frameworks you wouldn’t see this any more, but I see it all the time. It’s a combination of not realizing that failure is ubiquitous, and not thinking about the impact (especially the customer facing impact) of that failure.

This is one of the (many) great DevOps learning experiences – Ops helping teach Devs all the things that can go wrong that don’t really go wrong much in a “frictionless” lab environment. “So, what do you do if your hard drive is suddenly not there?” (Common with Amazon EBS failures.) “What do you do if you took data off that queue and then your instance restarts before you put it into the database?” (Hopefully a transaction.) “What do you do if you can’t make that network connection, are you retrying every 5 ms and then filling up the system’s TCP connections?” (True story.) “Hey, I’m sure your app is pure as the driven snow right now, but is it always going to work the same when the PaaS vendor changes the OS version under you?”

In all circumstances, you should

Plan for failure (understand failure modes, retry, design for it)
Detect failure (monitor, log, etc.)
Plan for and detect failure of your schemes to plan for and detect failure!

We do some security threat modeling here. I wonder if there’s not a lightweight methodology like that which could be readily adapted for reliability modeling of apps. Seems like something someone would have done… But a simple one, not like lame complicated risk matrices. I’ll have to research that.

Leave a comment Cancel reply

Subscribe

Recent Comments

Recent Posts

Austinites

Cloud

DevOps

@wickett

@ernestmueller

@iteration1

Archives