Velocity 2013 Liveblog Day 3: Managing Incidents In The Wild

Got here late! By Jonathan Reichhold (@jreichhold) from Twitter.

“Facebook is for useless posts, Twitter is for making fun of celebrities, and Instagram is for young people.” -My 11 year old

Step 2: Set Expectations

Set expectations for times of failure: set communication methods and test your escalation tree.

Be realistic & ambitious. Prioritize what can be fixed and fix it in due time.

Postmortems – improvement has to be part of the process.

Teamwork – management has to support site reliability as a feature, or you’ll burn out your ops guys.

Distributed systems fail – you have to be robust against things that don’t happen “a lot” at small scale. A 1 in 1,000,000 issue is EVERY DAMN MINUTE at scale. Design for more robustness.
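To put rough numbers on the “every minute” claim, here is a small back-of-the-envelope sketch. The request rates are made-up illustrations (the talk gave no figures), and the function names are hypothetical; it also assumes each request fails independently.

```python
def seconds_between_occurrences(one_in: int, requests_per_second: float) -> float:
    """Average seconds between occurrences of a '1 in N' issue.

    Assumes every request hits the issue independently with odds 1 in `one_in`.
    """
    return one_in / requests_per_second


def rate_for_interval(one_in: int, interval_seconds: float) -> float:
    """Request rate at which a '1 in N' issue shows up once per interval."""
    return one_in / interval_seconds


ONE_IN = 1_000_000

# Hypothetical request rates, from a small site to a very large one.
for rps in (10, 1_000, 100_000):
    gap = seconds_between_occurrences(ONE_IN, rps)
    print(f"{rps:>7} req/s: one occurrence every {gap:,.0f} seconds")

# Rate needed for the issue to show up once a minute.
print(f"once a minute at about {rate_for_interval(ONE_IN, 60):,.0f} req/s")
```

At 10 requests per second the issue is a once-a-day curiosity; somewhere in the tens of thousands of requests per second it becomes a constant background failure you have to design for.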

Large systems take time to design and to stabilize in prod.

Don’t assume.  Be rigorous and vigilant.

Degrade gracefully, shed load
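One way to read “degrade gracefully, shed load” is a cap on in-flight work with a cheap fallback once the cap is hit. This is only a minimal sketch of that idea, not anything shown in the talk; `LoadShedder`, `handle`, and the limit of 1 in the usage below are all hypothetical.

```python
import threading


class LoadShedder:
    """Caps the number of requests being worked on at once."""

    def __init__(self, max_in_flight: int):
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Claim a slot; returns False when the service is already full."""
        with self._lock:
            if self._in_flight >= self._max:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Give the slot back once the request is done."""
        with self._lock:
            self._in_flight -= 1


def handle(shedder, full_response, fallback):
    """Serve the full response if there is room, otherwise the cheap fallback."""
    if not shedder.try_acquire():
        return fallback()  # shed the expensive work, degrade instead of failing
    try:
        return full_response()
    finally:
        shedder.release()
```

With a limit of 1 and one request already in flight, `handle(shedder, render_full_page, render_cached_page)` would return the cached page (both callables are placeholders); once the slot is released it goes back to the full response.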

Don’t “learn bad lessons” from retrospectives like “never touch the X!”

Capacity planning – do it just in time but be realistic.  Figure out real buffers. “Facebook with their huge custom datacenters is all nice but that’s not us.”

Hardware has lead time. [Ed: That’s why it’s for punks]
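The just-in-time versus lead-time tension can be sketched as simple arithmetic. Everything here is an assumption for illustration: linear growth, the 20% buffer, the 45-day lead time, and the function names are mine, not numbers from the talk.

```python
def days_until_buffer_hit(current_load: float, capacity: float,
                          daily_growth: float, buffer_fraction: float) -> float:
    """Days until load eats into the safety buffer, assuming linear growth."""
    usable = capacity * (1 - buffer_fraction)
    if current_load >= usable:
        return 0.0
    return (usable - current_load) / daily_growth


def days_left_to_order(current_load: float, capacity: float,
                       daily_growth: float, buffer_fraction: float,
                       lead_time_days: float) -> float:
    """Days left before hardware must be ordered; negative means already late."""
    runway = days_until_buffer_hit(current_load, capacity,
                                   daily_growth, buffer_fraction)
    return runway - lead_time_days


# Hypothetical numbers: 500 units of load on 1000 units of capacity,
# growing 5 units a day, keeping a 20% buffer, with a 45-day hardware lead time.
slack = days_left_to_order(500, 1000, 5, 0.2, 45)
print(f"order within {slack:.0f} days")
```

Real growth is rarely linear, so treat the result as a prompt to look at the real curve rather than as a deadline.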

This is a marathon not a sprint.  You have to keep yourself healthy or you’ll crash.  Maintain your systems and yourself.

Filed under Conferences, DevOps