Velocity 2013 Day 3 Liveblog: How to Run a Post-Mortem With Humans (Not Robots)

How to Run a Post-Mortem With Humans (Not Robots)

Got here a little late – not enough time in these breaks!!!

Dan Milstein (@danmil) of Hut 8 talking on how to build a learn-from-failure friendly culture.

1. and 2. – missed ’em!

3. Relish the absurdities of your system.  Don’t be embarrassed when you get a new hire and you show them your sucky deployment.  Own it, enjoy it.

Axioms to follow to have a good postmortem:

  • Everyone involved acted in good faith
  • Everyone involved is competent
  • We’re doing this to find improvements

Human error is the question, not the answer. Restate the problem to include time to recovery. “Why” is fine but look at time to detection, time to resolution. Why so long?

“Which of these is the root cause?” That’s a stupid and irrelevant question. Usually there’s not one, it’s a conjunction of factors blee blee. Look for the “broadest fix.” [Ed: Need to get a “Root cause is a myth” shirt to go with my “Private cloud is a myth” one.]

Corrective actions/remediations/fixes

Incrementalism or you’re fired! You can’t boil the ocean and “replace it wholesale.” Engineers love to say “it’s so terrible we just can’t touch it we have to replace it.” No. You have 4 hours to do the simplest thing to make it better, go.
“Well… OK I guess we could put a wrapper script around it…” OK, great! [Ed: We need to do that with all our database-affecting command line tools… Wrapper script that checks replication lag and also logs who ran it… Done and done!]

Don’t think about automation, think about tools. People think that computers are perfectly reliable and we should remove the humans.  Evidence shows this doesn’t work well. Skynet syndrome – lots of power, often written by those who don’t do the job.  Tools -> humans solve the problem, iterate on giving them better tools. Not everyone brings this baggage with automation but many do. “Do the routine task” – automate.  “Should I do this task and how should it be done?” – human.

Things are in partial failure mode all the time.  [Ed: Peco calls this “near miss” syndrome from the way they make flying safer – learn from near misses, not just crashes.]

To get started:

  • Elect a postmortem boss
  • Look for a goldilocks incodent
  • Expect awkwardness (use some humor to defuse)
  • THERE MUST BE FIXES
  • incrementally improve the incremental improvements (and this process)

Reading list!  Dang get the slides.

Leave a comment

Filed under Conferences, DevOps

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s