In the afternoon, we move into full session mode. There are two tracks, and I can only cover one, but that’s what I have Peco and Robert around for! Well, that and to have someone to outdrink. (Ooo burn!) They’ll be posting their writeups at some point as well – you can go to the Velocity schedule page to see the other sessions and to the presentations page to get slides where they exist.
First afternoon session: My panel! I am on the “Measuring Performance” panel with Steve Souders, Ryan Breen of Gomez, Bill Scott of Netflix, and Scott Ruthfield from whitepages.com (a fellow Rice U/Lovetteer!). It went well. We talked about end user performance monitoring, all the other kinds of tools you can use and their drawbacks, and about “newfangled” monitoring of perf w/AJAX, SOA, RIAs, etc. No questions; not sure if the audience liked it or not. But I did get a number of people saying “good work” later so I’ll declare victory. 🙂
Second session: “Actionable Logging for Smoother Operation and Faster Recovery,” by Mandi Walls of AOL. It’s a quick 30-minute session. Logging should be actionable – concise, and expressing symptoms. Anything logged is something fixable. It should give you less downtime – a shorter time to resolution. Logging takes resources, so make it worth it.
Filter down your logs to be concise and actionable. Production logging has different goals from dev/QA logging. You’re looking for problem diagnosis and recovery, and then statistics and monitoring. Insight into what the app’s doing.
You need a standard log file location. On our UNIX servers, the UNIX team gives us “/opt/apps” as the place where we can put stuff and gets cranky about any files outside of that. We make everyone log to one place – /opt/apps/logs/<appname> for this reason. Makes it easy to manage disk space, rotate logs, run “find”s, etc.
Roll your logs and have a standard file naming format. We prefer log.YYYYMMDD[HHMMSS] because it’s then sorted in date order.
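To make the naming point concrete, here’s a minimal Python sketch of the convention we use. The helper name `rolled_log_path` and the app name `myapp` are made up for illustration; only the /opt/apps/logs/<appname> location and the log.YYYYMMDD[HHMMSS] pattern come from our setup.

```python
from datetime import datetime
from pathlib import PurePosixPath

# The shared log root described above.
LOG_ROOT = PurePosixPath("/opt/apps/logs")


def rolled_log_path(app_name: str, when: datetime, with_time: bool = False) -> PurePosixPath:
    """Build /opt/apps/logs/<appname>/log.YYYYMMDD[HHMMSS] for a rolled log.

    `rolled_log_path` is a hypothetical helper, not part of any library.
    """
    stamp = when.strftime("%Y%m%d%H%M%S" if with_time else "%Y%m%d")
    return LOG_ROOT / app_name / f"log.{stamp}"


if __name__ == "__main__":
    # Deliberately out of order: a plain string sort puts them in date order.
    days = [datetime(2008, 6, 24), datetime(2007, 12, 31), datetime(2008, 1, 2)]
    for name in sorted(rolled_log_path("myapp", d).name for d in days):
        print(name)
```

Because the most significant field (the year) comes first and every field is zero-padded, a plain alphabetical sort is also a chronological sort, which is the whole reason for the format.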
You want standard, good timestamps, formats, etc. Ideally. Hard to do in practice, which is why at NI we use Splunk for log file management – it can detect/be told about different formats, timestamps, etc. and it’ll do this for you. Have a standard, that’s fine, but most 3p software and some of your programmers won’t follow it.
Use log levels. Don’t log too much or not enough, and standards for levels help with that. Log lines should be helpful – what program module? What were the variables at hand?
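As a rough illustration (my sketch, not something shown in the talk), here’s what that looks like with Python’s standard `logging` module. The logger name `billing.invoice`, the `build_logger` helper, and the order variables are invented for the example.

```python
import io
import logging


def build_logger(name: str, stream, level: int = logging.WARNING) -> logging.Logger:
    """Return a logger that writes level, module name, and message to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.handlers.clear()   # avoid stacking handlers if called twice
    logger.propagate = False  # keep the example's output in one place
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    out = io.StringIO()
    log = build_logger("billing.invoice", out)  # hypothetical module name

    # Too chatty for production: dropped because the level is WARNING.
    log.debug("entering render()")

    # Actionable: says which module, what went wrong, and the variables at hand.
    order_id, retries = 1234, 3
    log.error("payment gateway timeout order_id=%s retries=%s", order_id, retries)

    print(out.getvalue(), end="")
```

The level threshold is the knob for “too much vs. not enough”: dev/QA can run the same code at DEBUG while production stays at WARNING.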
Don’t log passwords, usernames, etc. Splunk has facilities to automatically suppress these by the way. I don’t own stock in them or anything, I’m just sayin’.
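If you don’t have a tool doing it for you, you can scrub at the source. Here’s a small Python sketch using a standard `logging.Filter`; the key names in the regex and the `[REDACTED]` marker are my assumptions, and this has nothing to do with how Splunk does it.

```python
import io
import logging
import re

# Assumed key names; adjust to whatever your apps actually emit.
SENSITIVE = re.compile(r"\b(password|passwd|username|user)=(\S+)", re.IGNORECASE)


def redact(text: str) -> str:
    """Mask the value of any sensitive key=value pair in `text`."""
    return SENSITIVE.sub(lambda m: f"{m.group(1)}=[REDACTED]", text)


class RedactingFilter(logging.Filter):
    """Rewrite each record so sensitive values never reach the log file."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first, so values passed in as args are scrubbed too.
        record.msg = redact(record.getMessage())
        record.args = ()
        return True  # keep the record, just cleaned up


if __name__ == "__main__":
    out = io.StringIO()
    handler = logging.StreamHandler(out)
    handler.addFilter(RedactingFilter())
    log = logging.getLogger("auth.demo")  # hypothetical logger name
    log.addHandler(handler)
    log.propagate = False
    log.warning("login failed username=%s password=%s host=web01", "bob", "hunter2")
    print(out.getvalue(), end="")
```

A pattern-based scrubber like this only catches what it knows to look for, so it’s a safety net, not a substitute for not logging the values in the first place.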
Logs are often the first line of information for troubleshooting, so the better they are, the faster you can recover.
My take on this session – all pretty basic, but solid. Logging 101.
Third session, another 30 minute quickie, is by Goranka Bjedov from Google, on stress, load, and performance testing in QA. She focuses on the back end, as opposed to Steve’s client side focus. She analyzes scalability, bottlenecks, probable issues, etc. and feeds them to ops.
QA is not brain surgery, she says, and it should be expected for them to provide this kind of information. And you don’t have to perfectly reproduce the production environment for it. You can learn 80% of it on a modest server under modest load. She totally eliminates the network, which “someone else should be looking at” (who?).
Tests aren’t 100% reproducible. You have to go statistical – run the tests several times and look at the averages and deviation. She prefers JMeter, The Grinder, and FunkLoad – consider OpenSTA on Windows. She finds they are as good as LoadRunner etc. They use log replay; I’m not sure with what tool.
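The “go statistical” part is easy to sketch in Python. This is my own minimal illustration, not her tooling: `time_runs`, `summarize`, and the stand-in scenario are hypothetical, and a real test would drive one of the load tools above instead of a local function.

```python
import statistics
import time
from typing import Callable, List, Tuple


def time_runs(scenario: Callable[[], None], runs: int = 5) -> List[float]:
    """Run the same scenario several times and collect wall-clock durations."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        scenario()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: List[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation); needs at least two runs."""
    return statistics.mean(durations), statistics.stdev(durations)


if __name__ == "__main__":
    def fake_scenario() -> None:
        # Stand-in workload; replace with a call into your load tool.
        sum(i * i for i in range(100_000))

    mean, spread = summarize(time_runs(fake_scenario, runs=10))
    print(f"mean={mean:.4f}s stdev={spread:.4f}s over 10 runs")
```

A single run tells you very little; a large deviation relative to the mean is a hint that the numbers aren’t stable enough to compare builds yet.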
And that’s it! She writes about performance on the Google blog. I’ll check it out!
This session needed slides – “performance testing is easy” and “use open source” aren’t much to get out of one of these sessions.
Next, another longer 45-minute session – “Incident Command for IT: What We Can Learn From The Fire Department,” by Brent Chapman from Great Circle.
The core idea is that public safety agencies all deal with emergencies all the time. What are some best practices we can glean from them? They organize on the fly, coordinate efforts of multiple agencies, and evolve the organization as the incident progresses.
Example: a car hits a fire hydrant. You have fire, ambulance, police, water, power people all involved and in a specific order, and it’s a time critical event. Another example is SoCal wildfires. Obvious IT analogies (data center outage…).
So an “Incident Command System” was developed to address questions like this. It’s a set of standard tools for command, control, and coordination of incidents. Started in SoCal but has evolved into a national standard.
ICS recommends a modular, scalable org structure, consisting of command, ops, logistics, planning, and admin sections. Can be one person until more folks show up. Command section plans. Operations section does the work, and assists command in development of a consolidated action plan. It’s usually the largest. Planning maintains status & plans. Logistics section gets stuff. Admin/finance pays, tracks costs, etc. Sections are created/grown as needed.
The senior-most first responder is usually incident commander and transfer of command is explicit. Delegates work as necessary and possible.
Maintain a manageable span of control. Each supervisor should have 3-7 subordinates (5 ideal). New levels are created as needed.
Unity of command. In an incident each person has one boss, period. Matrixes have to be avoided in an emergency.
Transfers of responsibility are always explicit, and a more senior person arriving doesn’t necessarily take over.
Clear communications. All comms have to be clear and complete (no code). Talk directly to resources when possible, traversing the tree to get to them (keeping management informed).
Consolidated action plans. Command communicates high level action plan per operational period (hour to shift to day to whatever). Write it down, especially if it crosses organizational or specialty boundaries.
Management by objective. Tell people what to accomplish, not how.
Comprehensive resource management. All assets & personnel tracked via Admin section. Sign in and be assigned.
Designated incident facilities – a command post. And a staging area for resources.
Then he walks through a case study involving one of two data centers going offline. Hopefully the slides’ll be available because this is a lot of typing. It’s engaging though. We have tended to “roughly” follow this model in practice just by instinct – like I always make sure there’s one person who “has the ball” during an incident (command). I think one of the biggest takeaways is to understand that as the first one on the scene you’re mainly Command – and Ops, and Status – until you spin those off explicitly. Too many ops folks just do the ops and don’t do command or status.
In closing, you should practice ICS and use it for planned events like moves/upgrades. Download preso from www.greatcircle.com.
We do some things like that on our team. I am disappointed that this is basically a “what if” preso, not something he’s implemented in IT organizations… Seems like more of an Ignite candidate.
Now to try to hunt down treats… Apparently the Marriott staff brought some snacks out into the hall during the session, and quickly took them away before the break started. Boo.