Managing Incidents In The Wild
Got here late! By Jonathan Reichhold (@jreichhold) from Twitter.
“Facebook is for useless posts, Twitter is for making fun of celebrities, and Instagram is for young people.” – My 11-year-old
Step 2: Set Expectations
Set expectations for times of failure: agree on communication methods and test your escalation tree.
Be realistic & ambitious. Prioritize what can be fixed and fix it in due time.
Postmortems – improvement has to be part of the process.
Teamwork – management has to support site reliability as a feature, and not burn out your ops guys.
Distributed systems fail – you have to be robust against things that don’t happen “a lot” at small scale. A 1 in 1,000,000 issue is EVERY DAMN MINUTE at scale. Design for more robustness.
Large systems take time to design and to stabilize in prod.
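To put rough numbers on the "1 in 1,000,000" point, here is a back-of-the-envelope sketch. The request rates are made-up illustrations, not figures from the talk; the only thing taken from the talk is the one-in-a-million odds.

```python
def expected_failures_per_minute(requests_per_second: float,
                                 failure_probability: float) -> float:
    """Expected number of times a rare failure fires in one minute."""
    return requests_per_second * 60 * failure_probability


ONE_IN_A_MILLION = 1e-6

# Hypothetical traffic levels, chosen only for illustration.
small_site = expected_failures_per_minute(10, ONE_IN_A_MILLION)
large_site = expected_failures_per_minute(20_000, ONE_IN_A_MILLION)

print(f"small site: about {small_site:.4f} failures per minute")
print(f"large site: about {large_site:.1f} failures per minute")
```

At the assumed 10 requests per second the issue shows up roughly once a day; at the assumed 20,000 it shows up more than once a minute, which is the point of the quote.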
Don’t assume. Be rigorous and vigilant.
Degrade gracefully, shed load
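One way to picture "degrade gracefully, shed load" is an admission gate that turns away optional work first and only refuses critical work at a hard limit. This is a minimal sketch, not anything shown in the talk; the `LoadShedder` class, its limits, and the critical/optional split are all assumptions for illustration.

```python
import threading


class LoadShedder:
    """Hypothetical admission gate: optional work is shed before critical work."""

    def __init__(self, soft_limit: int, hard_limit: int):
        if soft_limit > hard_limit:
            raise ValueError("soft_limit must not exceed hard_limit")
        self._soft_limit = soft_limit  # above this, optional requests are refused
        self._hard_limit = hard_limit  # above this, everything is refused
        self._in_flight = 0
        self._lock = threading.Lock()

    def admit(self, critical: bool) -> bool:
        """Return True if the request may proceed, False if it is shed."""
        with self._lock:
            if self._in_flight >= self._hard_limit:
                return False
            if not critical and self._in_flight >= self._soft_limit:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Call when an admitted request finishes."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1


shedder = LoadShedder(soft_limit=3, hard_limit=5)
results = [shedder.admit(critical=False) for _ in range(4)]
print(results)  # the fourth optional request is shed
```

A caller that gets `False` back would serve a cheaper fallback (a cached page, a trimmed response) or fail fast, rather than queueing and dragging the whole system down.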
Don’t “learn bad lessons” from retrospectives like “never touch the X!”
Capacity planning – do it just in time but be realistic. Figure out real buffers. “Facebook with their huge custom datacenters is all nice but that’s not us.”
Hardware has lead time. [Ed: That’s why it’s for punks]
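The buffer and lead-time notes combine into a simple "when do we have to order?" calculation. The sketch below assumes steady compound monthly growth, which real traffic rarely follows exactly, and every number in it is hypothetical.

```python
import math


def months_until_exhausted(current_load: float, capacity: float,
                           monthly_growth: float, buffer_fraction: float) -> float:
    """Months until load reaches usable capacity (capacity minus the buffer).

    Assumes compound growth: load(n) = current_load * (1 + monthly_growth) ** n.
    """
    usable = capacity * (1.0 - buffer_fraction)
    if current_load >= usable:
        return 0.0  # already eating into the buffer
    if monthly_growth <= 0:
        return math.inf  # flat or shrinking load never gets there
    return math.log(usable / current_load) / math.log(1.0 + monthly_growth)


# Hypothetical inputs: 600 of 1000 units used, 10% monthly growth,
# a 20% safety buffer, and a 2-month hardware lead time.
runway = months_until_exhausted(600, 1000, 0.10, 0.20)
lead_time_months = 2
order_within = max(0.0, runway - lead_time_months)

print(f"runway: {runway:.1f} months, order within: {order_within:.1f} months")
```

With these assumed inputs the runway is about three months, so with a two-month lead time the order has to go out in roughly a month. "Just in time" leaves less slack than it sounds.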
This is a marathon not a sprint. You have to keep yourself healthy or you’ll crash. Maintain your systems and yourself.