We all know from DevOps blameless retrospective wisdom that there is no such thing as a single “root cause.” One of the most common root causes people like to assign blame to is “human error”. Not to mince words, this is usually political, buck-passing CYA of the highest order.
I just read a great article on the recent U.S. Navy ship collision issues that I wanted to pass on. If you have been keeping up with the news, there has been a rash of Navy ships colliding with other vessels, causing fatalities. When you go Google it up, you see a whole bunch of “Navy attributes it to human error…”
But now go read this article, Something’s Wrong In The Surface Fleet And We’re Not Talking About It. It’s written by Capt. Michael Junge, an experienced Naval officer. The TL;DR is that you can say “human error” all you want, fire someone, and call it case closed, but these accidents stem from systemic understaffing of Naval surface ships and massive shortfalls in training and maintenance, a leading indicator of even worse to come should an actual wartime deployment be necessary.
Even in engineering, we are tempted to push the problem down onto the person who made a mistake. Fully engaging with the system that created the need for the action, the action that allowed the mistake, the lack of validation that made the mistake possible, and so on is hard thinkin’. It is threatening when people point out flaws in processes and systems and code you had a hand in. But the only way to actually improve your situation is to soberly assess what the actual contributors to issues are, and work towards fixing them.
Got here a little late – not enough time in these breaks!!!
Dan Milstein (@danmil) of Hut 8 talking on how to build a learn-from-failure friendly culture.
1. and 2. – missed ’em!
3. Relish the absurdities of your system. Don’t be embarrassed when you get a new hire and you show them your sucky deployment. Own it, enjoy it.
Axioms to follow to have a good postmortem:
- Everyone involved acted in good faith
- Everyone involved is competent
- We’re doing this to find improvements
Human error is the question, not the answer. Restate the problem to include time to recovery. “Why” is fine but look at time to detection, time to resolution. Why so long?
“Which of these is the root cause?” That’s a stupid and irrelevant question. Usually there’s not one; it’s a conjunction of factors, blah blah. Look for the “broadest fix.” [Ed: Need to get a “Root cause is a myth” shirt to go with my “Private cloud is a myth” one.]
Incrementalism or you’re fired! You can’t boil the ocean and “replace it wholesale.” Engineers love to say “it’s so terrible we just can’t touch it we have to replace it.” No. You have 4 hours to do the simplest thing to make it better, go.
“Well… OK I guess we could put a wrapper script around it…” OK, great! [Ed: We need to do that with all our database-affecting command line tools… Wrapper script that checks replication lag and also logs who ran it… Done and done!]
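That wrapper-script idea is simple enough to sketch. Here’s a minimal, hypothetical version in Python: everything here (the tool path, the lag threshold, the log location, the lag-check stub) is an assumption for illustration, not anything from the talk. The wrapper refuses to run the real tool when replication lag is too high, and logs who ran what.

```python
import getpass
import shlex
import subprocess
import sys
import time

REAL_TOOL = "/usr/local/bin/dangerous-db-tool"  # hypothetical path to the real tool
MAX_LAG_SECONDS = 30                            # hypothetical lag threshold
AUDIT_LOG = "db-tool-audit.log"                 # hypothetical audit log location

def check_replication_lag():
    """Stub: swap in a real check for your database, e.g. querying
    replica status. Returns lag in seconds."""
    return 0

def should_run(lag_seconds, max_lag=MAX_LAG_SECONDS):
    """The gate: refuse to touch the database while replicas are behind."""
    return lag_seconds <= max_lag

def run_wrapped(argv):
    """Check lag, log who ran what, then hand off to the real tool."""
    lag = check_replication_lag()
    if not should_run(lag):
        print(f"refusing to run: replication lag is {lag}s", file=sys.stderr)
        return 1
    with open(AUDIT_LOG, "a") as log:
        log.write(f"{time.strftime('%Y-%m-%dT%H:%M:%S')} "
                  f"{getpass.getuser()} ran: {shlex.join(argv)}\n")
    return subprocess.call([REAL_TOOL, *argv])
```

Four hours, simplest thing, done: the dangerous tool now has a safety check and an audit trail, and nothing got rewritten wholesale.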
Don’t think about automation, think about tools. People think that computers are perfectly reliable and we should remove the humans. Evidence shows this doesn’t work well. Skynet syndrome – lots of power, often written by those who don’t do the job. Tools -> humans solve the problem, iterate on giving them better tools. Not everyone brings this baggage with automation but many do. “Do the routine task” – automate. “Should I do this task and how should it be done?” – human.
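One way that routine/judgment split might look in practice (my sketch, not the speaker’s; the cleanup scenario and all names are made up): the tool computes what it *would* do, which is the routine part, and a human confirms whether it *should* be done.

```python
def plan_cleanup(file_ages, cutoff_days=90):
    """Routine part, safe to automate: compute which files look stale.
    file_ages maps filename -> age in days."""
    return [name for name, age in file_ages.items() if age > cutoff_days]

def run_cleanup(file_ages, confirm):
    """Judgment part stays human: confirm(doomed) is asked before
    anything would actually be deleted."""
    doomed = plan_cleanup(file_ages)
    if not doomed:
        return []
    if not confirm(doomed):  # e.g. input(f"delete {doomed}? [y/N] ") == "y"
        return []
    return doomed  # in real life: delete here, after the human said yes
```

The tool does the tedious scanning; the “should I do this?” question is never automated away.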
Things are in partial failure mode all the time. [Ed: Peco calls this “near miss” syndrome from the way they make flying safer – learn from near misses, not just crashes.]
To get started:
- Elect a postmortem boss
- Look for a goldilocks incident
- Expect awkwardness (use some humor to defuse)
- THERE MUST BE FIXES
- incrementally improve the incremental improvements (and this process)
Reading list! Dang get the slides.