Stop the Guessing: Performance Methodologies for Production Systems
Brendan Gregg, Joyent
Note to the reader – this session ruled.
He’s from the DTrace world, but he’s talking about performance for the rest of us. His book, Systems Performance: Enterprise and the Cloud, is coming soon.
Performance analysis – where do I start and what do I do? It’s like troubleshooting: it’s easy to fumble around without a playbook. “Tools” are not the answer any more than they’re the answer to “how do I fix my car?”
Guessing Methodologies and Not Guessing Methodologies (Former are bad)
Traffic light anti-method
Monitors green? You’re fine. But of course thresholds are a coarse-grained tool, and performance is complex. “Is X bad? Well sometimes, except when X, but then when Y, but…” False positives and false negatives abound.
You can improve it by showing subjective metrics differently (like weather icons) and keeping the traffic lights for objective ones – errors, alerts, SLAs – facts.
See the dtrace.org status dashboard blog post.
So traffic light is intuitive and fast to set up but it’s misleading and causes thrash.
Average anti-method
Measure the average/mean, assume a normal-like unimodal distribution and then focus your investigation on explaining the average.
This misses multiple peaks and outliers.
Fix this by adding histograms, density plots, frequency trails, scatter plots, heat maps.
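To make the point concrete, here is a minimal sketch with made-up latency samples (the numbers and the 20 ms bucket width are assumptions for illustration, not data from the talk). The mean lands in a range where no request actually falls, while even a crude histogram shows the two modes.

```python
import statistics
from collections import Counter

# Hypothetical latencies in ms: a fast mode (say, cache hits) and a slow mode.
latencies_ms = [2] * 80 + [3] * 10 + [100] * 8 + [120] * 2

mean_ms = statistics.mean(latencies_ms)


def bucket_counts(samples, bucket_width):
    """Count samples per fixed-width bucket, keyed by the bucket's lower bound."""
    counts = Counter((s // bucket_width) * bucket_width for s in samples)
    return dict(sorted(counts.items()))


hist = bucket_counts(latencies_ms, bucket_width=20)

print(f"mean = {mean_ms:.1f} ms")
for start, count in hist.items():
    # One '#' per sample; empty buckets are simply not printed.
    print(f"{start:>4}-{start + 19:<4} ms | {'#' * count}")
```

Nothing in the sample set sits anywhere near the mean, which is exactly the kind of thing an average hides.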
Concentration game anti-method
Pick a metric, find another that looks like it, investigate.
Simple and can discover correlations, but it’s time consuming and mostly you get more symptoms and not the cause.
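For illustration, the "find another metric that looks like it" step could be sketched as a correlation ranking. The metric names and values below are hypothetical, and Pearson correlation is just one assumed way to score "looks like"; a high score still only points at a fellow symptom, not a cause.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        return 0.0
    return cov / (dev_x * dev_y)


# Hypothetical per-interval samples.
latency_ms = [10, 12, 30, 11, 35, 10]
candidates = {
    "disk_queue_len": [1, 1, 4, 1, 5, 1],
    "cpu_busy_pct": [50, 52, 49, 51, 50, 51],
}

# Rank candidates by how strongly they track the picked metric.
ranked = sorted(
    ((name, pearson(latency_ms, series)) for name, series in candidates.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)

for name, r in ranked:
    print(f"{name}: r = {r:+.2f}")
```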
Workload characterization method
Who is causing the load, why, what, and how? The target is the workload, not the resulting performance.
This lets you eliminate unnecessary work. It only solves load issues though, and most things you examine won’t be a problem.
[Ed: When we did our Black Friday performance visualizer I told them “If I can’t see incoming traffic on the same screen as the latency then it’s bullshit.”]
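A minimal sketch of the "who" and "what" questions, assuming you already have some kind of request log to work from. The record fields and client names here are hypothetical.

```python
from collections import Counter

# Hypothetical request log records: who sent it, what it did, how big it was.
requests = [
    {"client": "web-frontend", "op": "read", "bytes": 20_000},
    {"client": "batch-report", "op": "read", "bytes": 4_000_000},
    {"client": "web-frontend", "op": "write", "bytes": 5_000},
    {"client": "batch-report", "op": "read", "bytes": 6_000_000},
    {"client": "web-frontend", "op": "read", "bytes": 25_000},
]


def load_by(records, key):
    """Total bytes grouped by one attribute of the workload, largest first."""
    totals = Counter()
    for record in records:
        totals[record[key]] += record["bytes"]
    return totals.most_common()


by_client = load_by(requests, "client")  # who is causing the load
by_op = load_by(requests, "op")          # what the load is

print(by_client)
print(by_op)
```

If one client turns out to account for nearly all the bytes, the next question is whether that work needs to happen at all, or at that time.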
USE method
For every resource, check utilization, saturation, errors.
Utilization: time the resource was busy
Saturation: degree of queued extra work
Errors: count of error events
Finds your bottlenecks quickly
Metrics that are hard to get become feature requests.
You can apply this methodology without knowledge of the system (he did the Apollo 11 command module as an example).
See the USE method blog post for detailed commands.
For cloud computing you also need the “virtual” resource limits, like instance network caps. The same goes for app-level resources like mutex locks and thread pools. Decompose the app environment into queueing systems.
[Ed: Everything is pools and queues…]
So go home and, for your system and app environment, create a USE checklist and fill in the metrics you have. Then you know what you have, you know what you don’t have, and you have a checklist for troubleshooting.
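A checklist like that can start as a plain table. Here is a minimal sketch, assuming a Linux box; the tool and column names are illustrative examples rather than a complete list (see the USE method blog post for the real one), and `worker_thread_pool` is a hypothetical app-level resource.

```python
# USE checklist: resource -> where each metric comes from (None = not collected yet).
# Tool/column names are illustrative Linux examples, not an authoritative list.
checklist = {
    "cpu": {
        "utilization": "vmstat: us + sy",
        "saturation": "vmstat: r (run queue length)",
        "errors": None,
    },
    "memory": {
        "utilization": "free",
        "saturation": "vmstat: si/so (swapping)",
        "errors": None,
    },
    "disk": {
        "utilization": "iostat -x: %util",
        "saturation": None,
        "errors": None,
    },
    # Hypothetical application-level resource, treated like any other queue.
    "worker_thread_pool": {
        "utilization": None,
        "saturation": None,
        "errors": None,
    },
}


def missing_metrics(checklist):
    """List (resource, metric) pairs with no known source: the feature requests."""
    return [
        (resource, metric)
        for resource, metrics in checklist.items()
        for metric in ("utilization", "saturation", "errors")
        if metrics.get(metric) is None
    ]


gaps = missing_metrics(checklist)
for resource, metric in gaps:
    print(f"no source yet for {resource} {metric}")
```

The gaps list is the "know what you don’t have" half of the exercise.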
So this is bad ass and efficient, but limited to resource bottlenecks.
Thread State Analysis Method
Six states – executing, runnable, anon paging, sleeping, lock, idle
Getting this isn’t super easy, but dtrace, schedstats, delay accounting, I/O accounting, and /proc can all help.
Where the time goes leads directly to actionable next steps.
Compare to e.g. database query time – it’s not self contained. “Time spent in X” – is it really? Or is it contention?
So this identifies, quantifies, and directs but it’s hard to measure all the states atm.
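Once you do have per-state times, the analysis itself is simple. A minimal sketch, assuming the times were already collected by one of the tools above; the sample numbers are made up, and the "next step" hints are my own paraphrase rather than a list from the talk.

```python
# The six states from the talk.
STATES = ("executing", "runnable", "anon_paging", "sleeping", "lock", "idle")

# Assumed mapping from dominant state to a follow-up (paraphrased, not authoritative).
NEXT_STEP = {
    "executing": "profile on-CPU code",
    "runnable": "check CPU load and resource limits",
    "anon_paging": "check memory usage and limits",
    "sleeping": "look at what I/O the thread is waiting on",
    "lock": "find the contended lock and who holds it",
    "idle": "check whether work is actually arriving",
}


def breakdown(ms_by_state):
    """Convert time per state (ms) into percentage of total time."""
    total = sum(ms_by_state.get(state, 0) for state in STATES)
    return {state: 100.0 * ms_by_state.get(state, 0) / total for state in STATES}


# Hypothetical measurement for one thread over a 10 second window.
sample_ms = {
    "executing": 1500,
    "runnable": 500,
    "anon_paging": 0,
    "sleeping": 2000,
    "lock": 5500,
    "idle": 500,
}

percentages = breakdown(sample_ms)
dominant = max(percentages, key=percentages.get)

for state in STATES:
    print(f"{state:<12} {percentages[state]:5.1f}%")
print(f"largest share: {dominant} -> {NEXT_STEP[dominant]}")
```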
There are many more methodologies if perf is your day job!
Stop the guessing and go with ones that pose questions and seek metrics to answer them. P.S. use dtrace!