Tag Archives: APM

Ensuring Performance In Complex Architectures

The Agile Admin’s very own Peco Karayanev (@bproverb) gave this talk at Velocity this year.  Learn you some monitoring theory!

Leave a comment

Filed under Conferences, Monitoring

Velocity 2013 Day 2 Liveblog: Performance Troubleshooting Methodology

Stop the Guessing: Performance Methodologies for Production Systems

Slides are on Slideshare!

Brendan Gregg, Joyent

Note to the reader – this session ruled.

He’s from dtrace but he’s talking about performance for the rest of us. Coming soon, Systems Performance: Enterprises and the Cloud book.

Performance analysis – where do I start and what do I do?  It’s like troubleshooting, it’s easy to fumble around without a playbook. “Tools” are not the answer any more than they’re the answer to “how do I fix my car?”

Guessing Methodologies and Not Guessing Methodologies (Former are bad)


Traffic light anti-method

Monitors green?  You’re fine. But of course thresholds are a coarse grained tool, and performance is complex.  Is X bad?  Well sometimes, except when X, but then when Y, but…” Flase positives and false negatives abound.

You can improve it by more subjective metrics (like weather icons) – onjective is errors, alerts, SLAs – facts.

see dtrace.org status dashboard blog post

So traffic light is intuitive and fast to set up but it’s misleading and causes thrash.

Average anti-method

Measure the average/mean, assume a normal-like unimodal distribution and then focus your investigation on explaining the average.

This misses multiple peaks, outliers.

Fix this by adding histograms, density plots, frequency trails, scatter plots, heat maps

Concentration game anti-method

Pick a metric, find another that looks like it, investigate.

Simple and can discover correlations, but it’s time consuming and mostly you get more symptoms and not the cause.

Workload characterization method

Who is causing the load, why, what, how. Target is the workload not the performance.

lets you eliminate unnecessary work. Only solves load issues though, and most things you examine won’t be a problem.

[Ed: When we did our Black Friday performance visualizer I told them “If I can’t see incoming traffic on the same screen as the latency then it’s bullshit.”]

USE method

For every resource, check utilization, saturation, errors.

util: time resource busy

sat: degree of queued extra work

Finds your bottlenecks quickly

Metrics that are hard to get become feature requests.

You can apply this methodology without knowledge of the system (he did the Apollo 11 command module as an example).

See the use method blog post for detailed commands

For cloud computing you also need the “virtual” resource limits – instance network caps. App stuff like mutex locks and thread pools.  Decompose the app environment into queueing systems.

[Ed: Everything is pools and queues…]

So go home and for your system and app environment, create a USE checklist and fill out metrics you have. You know what you have, know what you don’t have, and a checklist for troubleshooting.

So this is bad ass and efficient, but limited to resource bottlenecks.

Thread State Analysis Method

Six states – executing, runnable, anon paging, sleeping, lock, idle

Getting this isn’t super easy, but dtrace, schedstats, delay accounting, I/O accounting, /proc

Based on where the time is leads to direct actionables.

Compare to e.g. database query time – it’s not self contained. “Time spent in X” – is it really? Or is it contention?

So this identifies, quantifies, and directs but it’s hard to measure all the states atm.

There’s many more if perf is your day job!

Stop the guessing and go with ones that pose questions and seek metrics to answer them.  P.S. use dtrace!


Leave a comment

Filed under Conferences, DevOps

Monitorin’ Ain’t Easy

The DevOps space has been aglow with discussion about monitoring.  Monitoring, much like pimping, is not easy. Everyone does it, but few do it well.

Luckily, here on the agile admin we are known for keeping our pimp hands strong, especially when it comes to monitoring. So let’s talk about monitoring, and how to use it to keep your systems in line and giving you your money!

In November, I posted about why your monitoring is lying to you. It turns out that this is part of a DevOps-wide frenzy of attention to monitoring.

John Vincent (@lusis) started it with his Why Monitoring Sucks blog post, which has morphed into a monitoringsucks github project to catalog monitoring tools and needs. Mainstream monitoring hasn’t changed in oh like 10 years but the world of computing certainly has, so there’s a gap appearing. This refrain was quickly picked up by others (Monitoring  Sucks. Do Something About It.) There was a Monitoring Sucks panel at SCALE last week and there’s even a #monitoringsucks hashtag.

Patrick Debois (@patrickdebois) has helped step into the gap with his series of “Monitoring Wonderland” articles where he’s rounding up all kinds of tools. Check them out…

However it just shows how fragmented and confusing the space is. It also focuses almost completely on the open source side – I love open source and all but sometimes you have to pay for something. Though the “big ol’ suite” approach from the HP/IBM/CA lot makes me scream and flee, there’s definitely options worth paying for.

Today we had a local DevOps meetup here in Austin where we discussed monitoring. It showed how fragmented the current state is.  And we met some other folks from a “real” engineering company, like NI, and it brought to mind how crap IT type monitoring is when compared to engineering monitoring in terms of sophistication.  IT monitoring usually has better interfaces and alerting, but IT monitoring products are very proud when they have “line graphs!” or the holy grail, “histograms!” Engineering monitoring systems have algorithms where they can figure out the difference between a real problem and a monitoring problem.  They apply advanced algorithms when looking at incoming metrics (hint: signal processing).  When is anyone in IT world who’s all delirious about how cool “metrics” going to figure out some math above the community college level?

To me, the biggest gap especially in cloud land – partially being addressed by New Relic and Boundary – in the space is agent based real user monitoring.  I want to know each user and incoming/outgoing transaction, not at the “tcpdump” level but at the meaningful level.  And I don’t want to have to count on the app to log it – besides the fact that devs are notoriously shitful loggers, there are so many cases where something goes wrong – if tomcat’s down, it’s not logging, but requests are still coming in…  Synthetic monitoring and app metrics are good but they tend to not answer most of the really hard questions we get with cloud apps.

We did a big APM (application performance management) tool eval at NI, and got a good idea of the strengths and weaknesses of the many approaches. You end up wanting many/all of them really. Pulling box metrics via SNMP or agents, hitting URLs via synthetic monitors locally or across the Internet, passive network based real user monitoring, deep dive metric gathering (Opnet/AppDynamics/New Relic/etc.)…  We’ll post more about our thoughts on all these (especially Peco, who led that eval and is now working for an APM company!).

Your thoughts on monitoring?  Hit me!

Leave a comment

Filed under DevOps

Application Performance Management in the Cloud

Cloud computing has been the buzz and hype lately and everybody is trying to understand what it is and how to use it. In this post, I wanted to explore some of the properties of “the cloud” as they pertain to Application Performance Management. If you are new to cloud offerings, here are a few good materials to get started, but I will assume the reader of this post is somewhat familiar with cloud technology.

As Spiderman would put it, “With great power comes great responsibility” … Cloud abstracts some of the infrastructure parts of your system and gives you the ability to scale up resource on demand. This doesn’t mean that your applications will magically work better or understand when they need to scale, you still need to worry about measuring and managing the performance of your applications and providing quality service to your customers. In fact, I would argue APM is several times more important to nail down in a cloud environment for several reasons:

  1. The infrastructure your apps live in is more dynamic and volatile. For example, cloud servers are VMs and resources given to them by the hypervisor are not constant, which will impact your application. Also, VMs may crash/lock up and not leave much trace of what was the cause of issues with your application.
  2. Your applications can grow bigger and more complex as they have more resources available in the cloud. This is great but it will expose bottlenecks that you didn’t anticipate. If you can’t narrow down the root cause, you can’t fix it. For this, you need scalable APM instrumentation and precise analytics.
  3. On-demand scaling is a feature that is touted by all cloud providers but it all hinges on one really hard problem that they won’t solve for you: Understanding your application performance and workload behave so that you can properly instruct the cloud scaling APIs what to do. If my performance drops do I need more machines? How many? APM instrumentation and analytics will play a key part of how you can solve the on-demand scaling problem.

So where is APM for the cloud, you ask? It is in its very humble beginnings as APM solution providers have to solve the same gnarly problems application developers and operations teams are struggling with:

  1. Cloud is dynamic and there are no fixed resources. IP addresses, hostnames, and MAC addresses change, among other things. Properly addressing, naming and connecting the machines is a challenge – both for deep instrumentation agent technology and for the apps themselves.
  2. Most vendors’ licensing agreements are based on fixed count of resources billing rather than a utility model, and are therefore difficult to use in a cloud environment. Charging a markup on the utility bill seems to be a common approach among the cloud-literate but introduces a kink in the regular budgeting process and no one wants to write a variable but sizable blank check.
  3. Well the cloud scales… a lot. The instrumentation / monitoring technology has to be low enough overhead and be able to support hundreds of machines.

On the synthetic monitoring end there are plenty of options. We selected AlertSite, a distributed synthetic monitoring SaaS provider similar to Keynote and Gomez, for simplicity but in only provides high level performance and availability numbers for SLA management and “keep the lights are on” alerting. We also rely on CloudKick, another SaaS provider, for the system monitoring part. Here is a more detailed use case on the Cloudkick implementation.

For deeper instrumentation we have worked with OPNET to deploy their AppInternals Xpert (former Opnet Panorama) solution in the Amazon EC2 cloud environment. We have already successfully deployed AppInternals Xpert on our own server infrastructure to provide deep instrumentation and analysis of our applications. The cloud version looks very promising for tackling the technical challenges introduced by the cloud, and once fully deployed, we will have a lot of capability to harness the cloud scale and performance. More on this as it unfolds…

In summary, as vendors sell you the cloud but be prepared to tackle the traditional infrastructure concerns and stick to your guns. No, the cloud is not fixing your performance problems and is not going to magically scale for you. You will need some form of APM to help you there. Be on the lookout for providers and do challenge your existing partners to think of how they can help you. The APM space in the cloud is where traditional infrastructure was in 2005 but things are getting better.

To the cloud!!! (And do keep your performance engineers on staff, they will still be saving your bacon.)


Leave a comment

Filed under Cloud