This is a follow-on to Why Your Monitoring Is Lying To You. How can an application go through a whole test phase, with two-day-long load tests, and still throw surprising errors when it gets to production? Well, here’s how… The same application I describe in the case study part of the monitoring article slipped through testing as well and almost went live with some issues. How, oh how could this happen…
I Didn’t See Any Errors!
Our developers quite reasonably said “But we’ve been developing and using this app in dev and test for months and haven’t seen this problem!” But consider the effects at work in But, You See, Your Other Customers Are Dumber Than We Are. Effects at a variety of levels keep you from seeing intermittent problems, and confirmation bias takes care of the rest.
The only fix here is rigor. If you hit your application in test and it errors, you can’t just ignore it. “I hit reload, it worked. Maybe they were redeploying. On with life!” Maybe it’s your layer, maybe it’s another layer – it doesn’t matter. You have to log it as a bug, follow up, and not just close the bug as “not reproducible” because you don’t see it yourself in 5 minutes of trying. Devs sometimes get frustrated with us when we won’t let up on transient errors, but if you don’t know why an error happened and haven’t done anything to fix it, it’s just a matter of time before it happens again, right?
We have a strict policy that every error is a bug, and if the error wasn’t detected it is multiple bugs – a bug with the monitoring, a bug with the testing, etc. If there was an error but “you don’t know why” – you aren’t logging enough or don’t have appropriate tools in place, and THAT’s a bug.
Our Load Test/Automated Tests Didn’t See Any Errors!
I’ll be honest, we don’t have much in the way of automated testing in place. So there’s that. But we do have long load tests we run. “If there were intermittent failures, they would have turned up over a two-day load test, right?” Well, not so fast. How confident are you that the error is visible to, and detected by, your load test? I have seen MANY load test results in my lifetime where someone was happily measuring the response time of what turned out to be 500 errors. “Man, my app is a lot faster this time! The numbers look great! Wait… It’s not even deployed. I hit it manually and I get a Tomcat page.”
Often we build deliberate “lies” into our software. We throw up “pretty” error pages instead of raw errors. We bowdlerize failures on the front end so we don’t leak information to customers. We retry maniacally in the face of failed connections, but don’t log it. We have to use constrained sets of return codes because the client consuming our services (like, say, Silverlight) is lobotomized and doesn’t savvy HTTP 401 or other such fancy-schmancy codes.
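If you are going to retry, at least make the retries visible. Here’s a minimal sketch of what I mean – the helper name and logger are mine for illustration, not from the app in question – retrying without hiding the evidence:

```python
import logging
import time

log = logging.getLogger("outbound")

def call_with_retry(fn, attempts=3, delay=1.0):
    """Call fn(), retrying on failure, but log every retry so it is visible."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Don't swallow this silently: a retry is an error event worth counting.
            log.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise
            time.sleep(delay)
```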
Be careful that load tests and automated tests are correctly interpreting responses. Look at your responses in Fiddler – we had what looked to the eye to be a 401 page that was actually passing back a 200 HTTP return code.
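One way to catch that in an automated check is to assert on the body as well as the status code. A quick sketch using the Python requests library (the error-page markers are just examples, tune them to your app):

```python
import requests

def assert_real_success(url):
    """Fail if the 'success' is actually a pretty error page with a 200 code."""
    resp = requests.get(url, timeout=10)
    assert resp.status_code == 200, f"got HTTP {resp.status_code} from {url}"
    body = resp.text.lower()
    # A 200 can still be an error page in disguise, so check the content too.
    for marker in ("unauthorized", "login required", "exception", "stack trace"):
        assert marker not in body, f"200 from {url} looks like an error page ({marker!r})"
```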
The best fix for this is test-driven development. Run the tests first so you see them fail, then write the code so you see them pass! Tests are code, and if you only write them against already-working code, you’re not really sure they’ll fail when something’s bad.
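For example, write the test for the “200 that’s really an error page” case before the detection code exists, run it, and watch it fail. The module and function here are hypothetical, just to show the shape:

```python
# Written *before* the code exists: run it, see it fail, then implement.
from myapp.status import classify_response  # hypothetical module under test

def test_error_page_with_200_is_still_an_error():
    body = "<html><title>401 Unauthorized</title></html>"
    assert classify_response(status_code=200, body=body) == "error"

def test_real_success_is_a_success():
    assert classify_response(status_code=200, body="<html>Welcome!</html>") == "success"
```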
You also need to perform positive and negative fault testing. Test failures end to end, including monitoring, logging, scaling, and the other management stuff. At the far end of this you have the cool, if a little crazy, Chaos Monkey. Most of us aren’t ready or willing to jack up our production systems regularly, but you should at least do it in test, and verify both that things work when they should and that when they fail you get proper notification and information.
Try this. Have someone Chaos Monkey you by breaking something at random – turn off a database, make a file system read-only, block a back-end Web service call. If you have redundancy built in to counter this, great – break one and watch the failover, but then have them break “all of it” to provoke a real failure. Do you see the failure and get alerted? More importantly, do you have enough information to tell what they broke? “One of the four databases we connect to” is NOT an adequate answer. Have someone break it, send you all the available logs and info, and if you can’t immediately pinpoint the problem, fix that.
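One cheap way to make “what exactly broke” answerable is a health report that checks every dependency by name, so the alert says which one is down. A sketch – the dependency names and ports are hypothetical placeholders:

```python
import socket

# Hypothetical dependency list; the point is that each one is named individually.
DEPENDENCIES = {
    "orders-db":    ("orders-db.internal", 5432),
    "billing-db":   ("billing-db.internal", 5432),
    "auth-service": ("auth.internal", 443),
}

def health_report():
    """Return per-dependency status so 'one of the four databases' never happens."""
    report = {}
    for name, (host, port) in DEPENDENCIES.items():
        try:
            socket.create_connection((host, port), timeout=2).close()
            report[name] = "ok"
        except OSError as exc:
            report[name] = f"FAILED: {exc}"
    return report
```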
How Complex Systems Fail, Invisibly
In the end, a lot of this boils down to How Complex Systems Fail. You can have failures at multiple levels – and not really failures, just assumptions – that stack on top of each other to both generate failures and prevent you from easily detecting those failures.
Also consider that you should be able to see those “short of failure” errors. If you’re failing over, or retrying, or whatnot – well, it’s great that you’re not down, but don’t you think you should know if you’re having to fail over 100x more this week? Log it and turn it into a metric. Our corporate Web site has hundreds of thousands of Web pages, so a certain level of 404s is expected. We don’t alert anyone on a 404. But we do metricize it, trend it, and take notice if the number spikes up (or down – where’d all that bad content go?).
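Turning those events into counters is a few lines of code. A sketch using a StatsD-style client (this assumes the Python statsd package and a local statsd daemon; the metric names are illustrative):

```python
from statsd import StatsClient  # assumes the 'statsd' package and a local statsd daemon

metrics = StatsClient(host="localhost", port=8125, prefix="website")

def record_response(status_code):
    # Not an alert, just a counter: trend it and notice when it spikes (or drops).
    if status_code == 404:
        metrics.incr("http.404")

def record_failover(backend_name):
    # Failing over isn't "down", but failing over 100x more this week is worth seeing.
    metrics.incr(f"failover.{backend_name}")
```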
Wholesale failures are easy to detect and mitigate. It’s the mini-failures – the things someone would argue aren’t really failures – at a given layer that line up with the same kinds of things at all the other layers, and those lined-up holes start letting problems slip through.