The Agile Admin’s very own Peco Karayanev (@bproverb) gave this talk at Velocity this year. Learn you some monitoring theory!
Filed under Conferences, Monitoring
Brendan Gregg, Joyent
Note to the reader – this session ruled.
He’s known for DTrace, but he’s talking about performance for the rest of us. Coming soon: his Systems Performance: Enterprise and the Cloud book.
Performance analysis – where do I start and what do I do? It’s like troubleshooting, it’s easy to fumble around without a playbook. “Tools” are not the answer any more than they’re the answer to “how do I fix my car?”
Monitors green? You’re fine. But of course thresholds are a coarse grained tool, and performance is complex. Is X bad? “Well, sometimes, except when X, but then when Y, but…” False positives and false negatives abound.
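One cheap way to cut the false-positive side of that problem is to alert on a sustained breach rather than a single sample. This is a minimal sketch of that idea (the function name, window size, and sample data are my own illustration, not from the talk):

```python
from collections import deque

def windowed_alert(samples, threshold, window=5, min_hits=4):
    """Alert only if at least min_hits of the last `window` samples exceed
    the threshold, instead of paging on every lone spike."""
    recent = deque(maxlen=window)
    alerts = []
    for value in samples:
        recent.append(value > threshold)
        alerts.append(len(recent) == window and sum(recent) >= min_hits)
    return alerts

# A lone 400 ms spike does not alert; a sustained breach does.
latencies = [80, 95, 400, 90, 85, 350, 360, 370, 380, 390]
print(windowed_alert(latencies, threshold=300))
```

It still misses the deeper problem the talk raises (the threshold itself is arbitrary), but it removes the thrash from transient blips.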
You can improve it with more subjective indicators (like weather icons) – objective ones are errors, alerts, SLAs – facts.
see dtrace.org status dashboard blog post
So traffic light is intuitive and fast to set up but it’s misleading and causes thrash.
Measure the average/mean, assume a normal-like unimodal distribution and then focus your investigation on explaining the average.
This misses multiple peaks, outliers.
Fix this by adding histograms, density plots, frequency trails, scatter plots, heat maps
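To see why the average misleads, here is a small sketch with a bimodal latency distribution (say, cache hits around 10 ms and misses around 200 ms – the data is invented for illustration). The mean lands in a region where no request actually lives; the median and a crude histogram tell the real story:

```python
import statistics

# Two populations: cache hits (~10 ms) and cache misses (~200 ms).
latencies = [10, 12, 9, 11, 10, 198, 205, 201, 11, 10, 202, 9]

mean = statistics.mean(latencies)           # 74.0 – describes nobody
p50 = statistics.median(latencies)          # 11.0 – the typical request
p95 = sorted(latencies)[int(0.95 * len(latencies))]  # 205 – the slow tail
print(f"mean={mean:.1f}  p50={p50}  p95={p95}")

# A crude text histogram makes the two peaks obvious.
buckets = {}
for v in latencies:
    b = (v // 50) * 50
    buckets[b] = buckets.get(b, 0) + 1
for b in sorted(buckets):
    print(f"{b:>4}-{b + 49:<4} {'#' * buckets[b]}")
```

The same principle is what heat maps and density plots give you continuously over time.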
Pick a metric, find another that looks like it, investigate.
Simple and can discover correlations, but it’s time consuming and mostly you get more symptoms and not the cause.
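The “find another metric that looks like it” step can be made less eyeball-driven with a correlation coefficient. A minimal Pearson sketch, with invented metric series (the names and values are assumptions for illustration):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-minute samples of three metrics.
latency_ms = [100, 120, 300, 110, 280, 105, 290]
disk_busy  = [20, 25, 90, 22, 85, 21, 88]
cpu_idle   = [70, 68, 72, 71, 69, 73, 70]

print(pearson(latency_ms, disk_busy))  # near 1.0: worth investigating
print(pearson(latency_ms, cpu_idle))   # near 0: probably unrelated
```

As the notes say, though: correlation hands you more symptoms, not the cause.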
Who is causing the load, why, what, how. Target is the workload not the performance.
lets you eliminate unnecessary work. Only solves load issues though, and most things you examine won’t be a problem.
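The who/what/why questions often reduce to counting an access log by dimension. A toy sketch (the log format and addresses are invented for illustration):

```python
from collections import Counter

# Hypothetical access-log lines: client, method, path, status, ms.
log = [
    "10.0.0.5 GET /api/search 200 120",
    "10.0.0.5 GET /api/search 200 115",
    "10.0.0.9 GET /health 200 2",
    "10.0.0.9 GET /health 200 2",
    "10.0.0.9 GET /health 200 3",
    "10.0.0.7 POST /api/orders 201 340",
]

by_client = Counter(line.split()[0] for line in log)  # who
by_path = Counter(line.split()[2] for line in log)    # what

print("who:", by_client.most_common())
print("what:", by_path.most_common())
# A high-frequency /health poller is load you may be able to
# eliminate, cache, or rate-limit – work avoided beats work tuned.
```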
[Ed: When we did our Black Friday performance visualizer I told them “If I can’t see incoming traffic on the same screen as the latency then it’s bullshit.”]
For every resource, check utilization, saturation, errors.
util: time resource busy
sat: degree of queued extra work
Finds your bottlenecks quickly
Metrics that are hard to get become feature requests.
You can apply this methodology without knowledge of the system (he did the Apollo 11 command module as an example).
See the use method blog post for detailed commands
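As a sketch of what such a checklist looks like in practice, here is a USE table for a Linux box expressed as data. The command suggestions are common Linux tools, not Gregg’s canonical list – see his blog post for the authoritative version:

```python
# A minimal USE checklist sketch for Linux. Each resource gets
# utilization, saturation, and errors; the commands are suggestions.
checklist = {
    "CPU":     {"utilization": "mpstat -P ALL 1",
                "saturation":  "vmstat 1 (r column vs CPU count)",
                "errors":      "dmesg (MCE messages)"},
    "Memory":  {"utilization": "free -m",
                "saturation":  "vmstat 1 (si/so swapping)",
                "errors":      "dmesg | grep -i oom"},
    "Disk":    {"utilization": "iostat -xz 1 (%util)",
                "saturation":  "iostat -xz 1 (queue length)",
                "errors":      "smartctl -a /dev/sda"},
    "Network": {"utilization": "sar -n DEV 1 vs interface speed",
                "saturation":  "netstat -s (retransmits, drops)",
                "errors":      "ip -s link (errors, dropped)"},
}

for resource, checks in checklist.items():
    for check, how in checks.items():
        print(f"{resource:<8} {check:<12} {how}")
```

The empty cells you can’t fill are exactly the “feature requests” the talk mentions.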
For cloud computing you also need the “virtual” resource limits – instance network caps. App stuff like mutex locks and thread pools. Decompose the app environment into queueing systems.
[Ed: Everything is pools and queues…]
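Once you see everything as pools and queues, Little’s Law (L = λW) is the workhorse for sizing them. A minimal worked example (the thread-pool numbers are illustrative):

```python
def littles_law_concurrency(arrival_rate_per_s, avg_latency_s):
    """Little's Law: L = lambda * W. The average number of requests in
    the system equals arrival rate times average time spent in it."""
    return arrival_rate_per_s * avg_latency_s

# A thread pool serving 200 req/s at 250 ms average latency has, on
# average, 50 requests in flight – so a 50-thread pool has zero
# headroom, and saturation (queueing) begins on any burst.
print(littles_law_concurrency(200, 0.250))  # 50.0
```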
So go home and for your system and app environment, create a USE checklist and fill out metrics you have. You know what you have, know what you don’t have, and a checklist for troubleshooting.
So this is bad ass and efficient, but limited to resource bottlenecks.
Six states – executing, runnable, anon paging, sleeping, lock, idle
Getting this isn’t super easy, but dtrace, schedstats, delay accounting, I/O accounting, /proc
Based on where the time is spent, this leads to direct actionables.
Compare to e.g. database query time – it’s not self contained. “Time spent in X” – is it really? Or is it contention?
So this identifies, quantifies, and directs, but it’s hard to measure all the states at the moment.
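The mechanics of the method are simple once you have the data: sum time per state, then investigate the biggest non-idle bucket. A sketch with invented per-thread numbers (real data would come from dtrace, schedstats, or delay accounting):

```python
# Hypothetical per-thread time (ms) in each of the six states over an
# interval; in reality this comes from dtrace/schedstats/delay accounting.
samples = [
    {"executing": 120, "runnable": 30, "anon_paging": 0,
     "sleeping": 700, "lock": 140, "idle": 10},
    {"executing": 100, "runnable": 20, "anon_paging": 0,
     "sleeping": 650, "lock": 220, "idle": 10},
]

totals = {}
for s in samples:
    for state, ms in s.items():
        totals[state] = totals.get(state, 0) + ms

# The dominant non-idle state directs the investigation:
# sleeping -> what is it blocked on (I/O)?  lock -> which lock?
worst = max((s for s in totals if s != "idle"), key=totals.get)
print(totals)
print("investigate:", worst)
```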
There’s many more if perf is your day job!
Stop the guessing and go with methodologies that pose questions and seek metrics to answer them. P.S. use dtrace!
Filed under Conferences, DevOps
The DevOps space has been aglow with discussion about monitoring. Monitoring, much like pimping, is not easy. Everyone does it, but few do it well.
Luckily, here on the agile admin we are known for keeping our pimp hands strong, especially when it comes to monitoring. So let’s talk about monitoring, and how to use it to keep your systems in line and giving you your money!
In November, I posted about why your monitoring is lying to you. It turns out that this is part of a DevOps-wide frenzy of attention to monitoring.
John Vincent (@lusis) started it with his Why Monitoring Sucks blog post, which has morphed into a monitoringsucks github project to catalog monitoring tools and needs. Mainstream monitoring hasn’t changed in oh like 10 years but the world of computing certainly has, so there’s a gap appearing. This refrain was quickly picked up by others (Monitoring Sucks. Do Something About It.) There was a Monitoring Sucks panel at SCALE last week and there’s even a #monitoringsucks hashtag.
Patrick Debois (@patrickdebois) has helped step into the gap with his series of “Monitoring Wonderland” articles where he’s rounding up all kinds of tools. Check them out…
However it just shows how fragmented and confusing the space is. It also focuses almost completely on the open source side – I love open source and all but sometimes you have to pay for something. Though the “big ol’ suite” approach from the HP/IBM/CA lot makes me scream and flee, there’s definitely options worth paying for.
Today we had a local DevOps meetup here in Austin where we discussed monitoring. It showed how fragmented the current state is. We also met some folks from a “real” engineering company like NI, which brought to mind how crap IT-type monitoring is compared to engineering monitoring in terms of sophistication. IT monitoring usually has better interfaces and alerting, but IT monitoring products are very proud when they have “line graphs!” or the holy grail, “histograms!” Engineering monitoring systems have algorithms that can tell the difference between a real problem and a monitoring problem; they apply advanced algorithms to incoming metrics (hint: signal processing). When is anyone in the IT world who’s all delirious about how cool “metrics” are going to apply some math above the community college level?
To me, the biggest gap in the space – especially in cloud land, and partially being addressed by New Relic and Boundary – is agent-based real user monitoring. I want to know each user and incoming/outgoing transaction, not at the “tcpdump” level but at the meaningful level. And I don’t want to have to count on the app to log it – besides the fact that devs are notoriously shitful loggers, there are so many cases where something goes wrong: if tomcat’s down, it’s not logging, but requests are still coming in… Synthetic monitoring and app metrics are good, but they tend not to answer most of the really hard questions we get with cloud apps.
We did a big APM (application performance management) tool eval at NI, and got a good idea of the strengths and weaknesses of the many approaches. You end up wanting many/all of them really. Pulling box metrics via SNMP or agents, hitting URLs via synthetic monitors locally or across the Internet, passive network based real user monitoring, deep dive metric gathering (Opnet/AppDynamics/New Relic/etc.)… We’ll post more about our thoughts on all these (especially Peco, who led that eval and is now working for an APM company!).
Your thoughts on monitoring? Hit me!
Filed under DevOps
Cloud computing has been the buzz and hype lately and everybody is trying to understand what it is and how to use it. In this post, I wanted to explore some of the properties of “the cloud” as they pertain to Application Performance Management. If you are new to cloud offerings, here are a few good materials to get started, but I will assume the reader of this post is somewhat familiar with cloud technology.
As Spiderman would put it, “With great power comes great responsibility” … Cloud abstracts some of the infrastructure parts of your system and gives you the ability to scale up resource on demand. This doesn’t mean that your applications will magically work better or understand when they need to scale, you still need to worry about measuring and managing the performance of your applications and providing quality service to your customers. In fact, I would argue APM is several times more important to nail down in a cloud environment for several reasons:
So where is APM for the cloud, you ask? It is in its very humble beginnings as APM solution providers have to solve the same gnarly problems application developers and operations teams are struggling with:
On the synthetic monitoring end there are plenty of options. We selected AlertSite, a distributed synthetic monitoring SaaS provider similar to Keynote and Gomez, for simplicity, but it only provides high level performance and availability numbers for SLA management and “keep the lights on” alerting. We also rely on CloudKick, another SaaS provider, for the system monitoring part. Here is a more detailed use case on the Cloudkick implementation.
For deeper instrumentation we have worked with OPNET to deploy their AppInternals Xpert (former Opnet Panorama) solution in the Amazon EC2 cloud environment. We have already successfully deployed AppInternals Xpert on our own server infrastructure to provide deep instrumentation and analysis of our applications. The cloud version looks very promising for tackling the technical challenges introduced by the cloud, and once fully deployed, we will have a lot of capability to harness the cloud scale and performance. More on this as it unfolds…
In summary: vendors will sell you the cloud, but be prepared to tackle the traditional infrastructure concerns and stick to your guns. No, the cloud is not fixing your performance problems and is not going to magically scale for you. You will need some form of APM to help you there. Be on the lookout for providers, and do challenge your existing partners to think about how they can help you. The APM space in the cloud is where traditional infrastructure was in 2005, but things are getting better.
To the cloud!!! (And do keep your performance engineers on staff, they will still be saving your bacon.)
Peco
Filed under Cloud