Velocity 2013 Liveblog Day 3: Managing Incidents In The Wild

Managing Incidents In The Wild

Got here late! By Jonathan Reichhold (@jreichhold) from Twitter.

Step 2: Set Expectations

Set expectations for times of failure: set communication methods, test your escalation tree.

Be realistic & ambitious. Prioritize what can be fixed, and fix it in due time.

Postmortems – improvement has to be part of the process.

Teamwork – management has to support site reliability as a feature, or you'll burn out your ops guys.

Distributed systems fail – you have to be robust against things that don’t happen “a lot” at small scale. A 1 in 1,000,000 issue is EVERY DAMN MINUTE at scale. Design for more robustness.
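[Ed: To put rough numbers on the "1 in 1,000,000" point, here is a back-of-the-envelope sketch. The request rates are hypothetical figures I picked for illustration, not numbers from the talk.]

```python
def expected_events_per_minute(requests_per_second: float,
                               failure_probability: float) -> float:
    """Expected number of rare failures per minute at a given request rate."""
    return requests_per_second * 60 * failure_probability


ONE_IN_A_MILLION = 1e-6

# Hypothetical request rates, chosen only to show how the math scales.
for rps in (10, 1_000, 20_000):
    per_minute = expected_events_per_minute(rps, ONE_IN_A_MILLION)
    print(f"{rps:>6} req/s -> {per_minute:.4f} expected events per minute")
```

At the assumed 20,000 requests per second, that works out to about 1.2 expected events per minute, which is the "every minute" regime. At 10 requests per second the same bug would show up roughly once a day or less, so a small site may never notice it.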

Large systems take time to design, stabilize in prod.

Don’t assume.  Be rigorous and vigilant.

Degrade gracefully, shed load.
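[Ed: The talk didn't show code for this. As a minimal sketch of what shedding load can look like, assume a service that caps in-flight requests and turns away low-priority traffic first. The class name, the limits, and the 80% cutoff are all made up for illustration.]

```python
import threading


class LoadShedder:
    """Caps concurrent requests; low-priority work is rejected first.

    Hypothetical sketch: the limit and the low-priority cutoff are
    assumptions, not values from the talk.
    """

    def __init__(self, max_in_flight: int, low_priority_percent: int = 80):
        self.max_in_flight = max_in_flight
        # Low-priority requests are only admitted below this threshold,
        # leaving the remaining headroom for high-priority traffic.
        self.low_priority_limit = max_in_flight * low_priority_percent // 100
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self, high_priority: bool = False) -> bool:
        """Return True if the request may proceed, False if it is shed."""
        with self._lock:
            limit = self.max_in_flight if high_priority else self.low_priority_limit
            if self._in_flight >= limit:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Mark one admitted request as finished."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1
```

A handler would call try_acquire() on entry, return a fast "try again later" response when it gets False, and call release() in a finally block otherwise. The point is that rejecting some requests cheaply beats letting every request time out.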

Don’t “learn bad lessons” from retrospectives like “never touch the X!”

Capacity planning – do it just in time but be realistic.  Figure out real buffers. “Facebook with their huge custom datacenters is all nice but that’s not us.”

Hardware has lead time. [Ed: That’s why it’s for punks]
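[Ed: One way to combine "real buffers" with hardware lead time is to work out the last day you can place an order. This is a rough sketch under my own assumptions (steady compound daily growth, a fixed headroom fraction); the function names and example numbers are hypothetical.]

```python
import math


def days_until_buffer_exhausted(current_load: float,
                                capacity: float,
                                daily_growth_rate: float,
                                headroom_fraction: float) -> float:
    """Days until load reaches capacity minus the reserved headroom.

    Assumes load grows by a constant daily_growth_rate (0.01 means 1% per day).
    """
    usable = capacity * (1.0 - headroom_fraction)
    if current_load >= usable:
        return 0.0  # already eating into the buffer
    if daily_growth_rate <= 0:
        return math.inf  # no growth, the buffer is never exhausted
    return math.log(usable / current_load) / math.log(1.0 + daily_growth_rate)


def days_left_to_order(current_load: float,
                       capacity: float,
                       daily_growth_rate: float,
                       headroom_fraction: float,
                       lead_time_days: float) -> float:
    """Days remaining before an order must be placed; negative means too late."""
    runway = days_until_buffer_exhausted(
        current_load, capacity, daily_growth_rate, headroom_fraction)
    return runway - lead_time_days


# Hypothetical example: 50% utilized, 1% daily growth, 20% headroom, 30-day lead time.
print(days_left_to_order(50, 100, 0.01, 0.2, 30))
```

With those made-up numbers there is a bit over two weeks left to place the order, even though the fleet is only half utilized today. Just-in-time only works if the lead time is inside the calculation.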

This is a marathon, not a sprint. You have to keep yourself healthy or you’ll crash. Maintain your systems and yourself.

