How to Run a Post-Mortem With Humans (Not Robots)
Got here a little late – not enough time in these breaks!!!
Dan Milstein (@danmil) of Hut 8 talking about how to build a learn-from-failure friendly culture.
1. and 2. – missed ’em!
3. Relish the absurdities of your system. Don’t be embarrassed when you get a new hire and you show them your sucky deployment. Own it, enjoy it.
Axioms to follow to have a good postmortem:
- Everyone involved acted in good faith
- Everyone involved is competent
- We’re doing this to find improvements
Human error is the question, not the answer. Restate the problem to include time to recovery. Asking “why did it break?” is fine, but also look at time to detection and time to resolution. Why did each of those take so long?
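[Ed: To make "restate the problem to include time to recovery" concrete, here is a minimal sketch of pulling those numbers out of an incident timeline. The function name and the timestamps are made up for illustration; nothing here is from the talk.]

```python
from datetime import datetime


def incident_durations(started, detected, resolved):
    """Split an incident into time to detection, time to resolution,
    and total time to recovery (all returned as timedelta objects)."""
    if not (started <= detected <= resolved):
        raise ValueError("expected started <= detected <= resolved")
    time_to_detection = detected - started
    time_to_resolution = resolved - detected
    time_to_recovery = resolved - started
    return time_to_detection, time_to_resolution, time_to_recovery


# Hypothetical timeline, invented for this example.
started = datetime(2024, 1, 1, 2, 0)    # things actually broke
detected = datetime(2024, 1, 1, 2, 45)  # someone noticed
resolved = datetime(2024, 1, 1, 4, 15)  # service was back

ttd, ttr, total = incident_durations(started, detected, resolved)
print(ttd)    # 0:45:00
print(ttr)    # 1:30:00
print(total)  # 2:15:00
```

With the numbers split out like this, "why did it take 45 minutes to notice?" and "why did it take 90 minutes to fix?" become two separate questions for the postmortem, each with its own possible fixes.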
“Which of these is the root cause?” That’s a stupid and irrelevant question. Usually there’s not one, it’s a conjunction of factors blee blee. Look for the “broadest fix.” [Ed: Need to get a “Root cause is a myth” shirt to go with my “Private cloud is a myth” one.]
Corrective actions/remediations/fixes
Incrementalism or you’re fired! You can’t boil the ocean and “replace it wholesale.” Engineers love to say “it’s so terrible we just can’t touch it we have to replace it.” No. You have 4 hours to do the simplest thing to make it better, go.
“Well… OK I guess we could put a wrapper script around it…” OK, great! [Ed: We need to do that with all our database-affecting command line tools… Wrapper script that checks replication lag and also logs who ran it… Done and done!]
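[Ed: A rough sketch of what that wrapper might look like, under some loud assumptions. How you measure replication lag depends entirely on your database, so the sketch takes a caller-supplied `get_lag_seconds` function rather than pretending to know. The `run_guarded` name, the 30-second threshold, and the log file name are all hypothetical.]

```python
import getpass
import shlex
import subprocess
import sys
from datetime import datetime, timezone


def run_guarded(cmd, get_lag_seconds, log_path="db_tool_audit.log", max_lag=30.0):
    """Log who is running cmd, then run it only if replication lag is acceptable.

    cmd is a list of arguments, e.g. ["some-db-tool", "--flag"].
    get_lag_seconds is a zero-argument callable returning the current lag
    in seconds (an assumption: supply whatever check fits your database).
    Returns the command's exit code, or 1 if the command was refused.
    """
    try:
        user = getpass.getuser()
    except Exception:
        user = "unknown"  # no usable login name in this environment

    lag = get_lag_seconds()
    allowed = lag <= max_lag
    stamp = datetime.now(timezone.utc).isoformat()

    # Append one audit line per attempt, whether or not the command is allowed.
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(
            f"{stamp} user={user} lag={lag:.1f}s allowed={allowed} "
            f"cmd={shlex.join(cmd)}\n"
        )

    if not allowed:
        print(
            f"Refusing to run: replication lag {lag:.1f}s exceeds {max_lag:.1f}s",
            file=sys.stderr,
        )
        return 1

    return subprocess.run(cmd).returncode
```

This is the "4 hours, simplest thing" version: it does not replace the tool, it just puts a check and an audit trail in front of it. Note that it logs refused attempts too, so near misses show up in the log.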
Don’t think about automation, think about tools. People think that computers are perfectly reliable and we should remove the humans. Evidence shows this doesn’t work well. Skynet syndrome – automation has lots of power and is often written by those who don’t do the job. With tools, humans solve the problem, and you iterate on giving them better tools. Not everyone brings this baggage with automation but many do. “Do the routine task” – automate. “Should I do this task and how should it be done?” – human.
Things are in partial failure mode all the time. [Ed: Peco calls this “near miss” syndrome from the way they make flying safer – learn from near misses, not just crashes.]
To get started:
- Elect a postmortem boss
- Look for a goldilocks incident
- Expect awkwardness (use some humor to defuse)
- THERE MUST BE FIXES
- Incrementally improve the incremental improvements (and this process)
Reading list! Dang, need to get the slides.