Velocity 2013 Day 2 Liveblog: Performance Troubleshooting Methodology

Stop the Guessing: Performance Methodologies for Production Systems

Slides are on Slideshare!

Brendan Gregg, Joyent

Note to the reader – this session ruled.

He’s the DTrace guy, but he’s talking about performance for the rest of us. Coming soon: his Systems Performance: Enterprise and the Cloud book.

Performance analysis – where do I start and what do I do? It’s like troubleshooting: it’s easy to fumble around without a playbook. “Tools” are not the answer any more than they’re the answer to “how do I fix my car?”

Guessing Methodologies and Not Guessing Methodologies (the former are bad)

Anti-Methods

Traffic light anti-method

Monitors green?  You’re fine. But of course thresholds are a coarse-grained tool, and performance is complex.  Is X bad?  Well, sometimes, except when X, but then when Y, but… False positives and false negatives abound.

You can improve it by using more subjective indicators (like weather icons) – objective ones are errors, alerts, SLAs – facts.

See the dtrace.org status dashboard blog post.

So traffic light is intuitive and fast to set up but it’s misleading and causes thrash.

Average anti-method

Measure the average/mean, assume a normal-like unimodal distribution and then focus your investigation on explaining the average.

This misses multiple peaks, outliers.

Fix this by adding histograms, density plots, frequency trails, scatter plots, and heat maps.
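[Ed: a tiny Python sketch of the point – the latency numbers are made up, not from the talk. The mean describes almost nobody, while a crude histogram exposes the two modes:]

    import statistics

    # Hypothetical latency samples in ms: most requests are fast, with a second
    # mode around 500 ms that the mean completely hides.
    latencies = [12, 15, 11, 14, 13, 480, 510, 16, 12, 495, 14, 13]

    print("mean:", round(statistics.mean(latencies), 1))  # ~133.8 ms – describes nobody

    # Crude text histogram, bucketed by 100 ms, exposes the two modes.
    buckets = {}
    for ms in latencies:
        bucket = ms // 100 * 100
        buckets[bucket] = buckets.get(bucket, 0) + 1
    for bucket in sorted(buckets):
        print(f"{bucket:4d}-{bucket + 99:<4d} ms | {'#' * buckets[bucket]}")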

Concentration game anti-method

Pick a metric, find another that looks like it, investigate.

Simple and can discover correlations, but it’s time consuming and mostly you get more symptoms and not the cause.

Workload characterization method

Who is causing the load, why, what, and how. The target is the workload, not the performance.

Lets you eliminate unnecessary work. It only solves load issues, though, and most things you examine won’t be a problem.

[Ed: When we did our Black Friday performance visualizer I told them “If I can’t see incoming traffic on the same screen as the latency then it’s bullshit.”]
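[Ed: a rough Python sketch of the “who” part – tally requests per client from an access log. The log path and combined log format are just assumptions for illustration:]

    from collections import Counter

    # A rough "who is causing the load?" pass: requests per client IP from an
    # access log. The path and combined log format are assumptions.
    requests_by_client = Counter()
    with open("/var/log/nginx/access.log") as log:
        for line in log:
            client_ip = line.split(" ", 1)[0]  # first field in combined format
            requests_by_client[client_ip] += 1

    for ip, count in requests_by_client.most_common(10):
        print(f"{count:8d}  {ip}")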

USE method

For every resource, check utilization, saturation, errors.

util: time the resource was busy

sat: degree of extra queued work

err: count of error events

Finds your bottlenecks quickly

Metrics that are hard to get become feature requests.

You can apply this methodology without knowledge of the system (he did the Apollo 11 command module as an example).

See the USE method blog post for detailed commands.
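[Ed: a rough Linux-only Python sketch of the U and S for CPU – utilization from /proc/stat over an interval, saturation approximated by runnable tasks vs. CPU count. This is my approximation, not Brendan’s exact commands:]

    import os, time

    # CPU utilization: non-idle jiffies over a 1 s interval, from /proc/stat.
    def cpu_times():
        with open("/proc/stat") as f:
            fields = [int(x) for x in f.readline().split()[1:]]
        idle = fields[3] + fields[4]  # idle + iowait
        return idle, sum(fields)

    idle1, total1 = cpu_times()
    time.sleep(1)
    idle2, total2 = cpu_times()
    busy_pct = 100.0 * (1 - (idle2 - idle1) / (total2 - total1))

    # CPU saturation (roughly): runnable tasks vs. CPU count, from /proc/loadavg.
    with open("/proc/loadavg") as f:
        runnable = int(f.read().split()[3].split("/")[0])  # "2/1234" -> 2

    print(f"CPU utilization: {busy_pct:.1f}%")
    print(f"CPU saturation : {runnable} runnable tasks / {os.cpu_count()} CPUs")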

For cloud computing you also need the “virtual” resource limits – instance network caps. App stuff like mutex locks and thread pools.  Decompose the app environment into queueing systems.

[Ed: Everything is pools and queues…]

So go home and, for your system and app environment, create a USE checklist and fill out the metrics you have. Then you know what you have, you know what you don’t have, and you have a checklist for troubleshooting.
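[Ed: one hypothetical way to sketch such a checklist in Python – the metric sources are common Linux tools, but the mapping here is mine, not from the talk:]

    # Hypothetical USE checklist skeleton: per resource, record where each
    # metric comes from (or note the gap).
    use_checklist = {
        "CPU":     {"utilization": "mpstat (100 - %idle)",
                    "saturation":  "vmstat r column vs. CPU count",
                    "errors":      "(gap - feature request?)"},
        "Memory":  {"utilization": "free / MemAvailable",
                    "saturation":  "swapping, OOM kills",
                    "errors":      "(gap)"},
        "Disks":   {"utilization": "iostat %util",
                    "saturation":  "iostat avgqu-sz",
                    "errors":      "dmesg / smartctl"},
        "Network": {"utilization": "throughput vs. interface speed",
                    "saturation":  "drops / overruns",
                    "errors":      "netstat -s error counters"},
    }

    for resource, dims in use_checklist.items():
        for dim, source in dims.items():
            print(f"{resource:8} {dim:12} {source}")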

So this is bad ass and efficient, but limited to resource bottlenecks.

Thread State Analysis Method

Six states – executing, runnable, anon paging, sleeping, lock, idle

Getting this isn’t super easy, but you can use DTrace, schedstats, delay accounting, I/O accounting, and /proc.
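[Ed: a rough Python approximation using /proc on Linux – it only gives the kernel’s coarse thread states, not the talk’s six, and the pid is a placeholder:]

    import glob
    from collections import Counter

    # Coarse thread-state snapshot from /proc (Linux only): counts the kernel
    # state letter (R running, S sleeping, D uninterruptible, ...) for each
    # thread of a process. pid 1234 is a placeholder.
    pid = 1234
    states = Counter()
    for stat_path in glob.glob(f"/proc/{pid}/task/*/stat"):
        with open(stat_path) as f:
            # the state field comes right after the parenthesised command name
            states[f.read().rsplit(")", 1)[1].split()[0]] += 1

    print(dict(states))  # e.g. {'S': 30, 'R': 2, 'D': 1}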

Seeing where the time is spent leads to direct actionables.

Compare to e.g. database query time – it’s not self-contained. “Time spent in X” – is it really? Or is it contention?

So this identifies, quantifies, and directs, but it’s hard to measure all the states at the moment.

There are many more methodologies if perf is your day job!

Stop the guessing and go with methodologies that pose questions and seek metrics to answer them.  P.S. Use DTrace!

dtrace.org/blogs/brendan
