Tag Archives: monitoring

Monitoring and Observability

Ah, observability, the new buzzword of the day. Monitoring vendors aplenty are using the word, to basically mean “better monitoring!” You know, #monitoringlove not #monitoringsucks. Because monitoring doesn’t help with debugging and doesn’t have app instrumentation right?

Well, I have to say “bah” to that.  So here’s the thing.  I’m an electrical engineer by education, and I spent a lot of time working at National Instruments, an engineering test and measurement company.  You may be surprised to know these terms have actual definitions that don’t require Twitter arguments to discover.

Monitoring is an activity you perform. It’s simply observing the state of a system over a period of time.

Why do we monitor? For three reasons, in general.

  • Problem Detection – you know, alerting, or seeing issues on dashboards.
  • Problem Resolution – root cause and troubleshooting.
  • Continuous Improvement – capacity planning, financial planning, trending, performance engineering, reporting.

How do we monitor?  Well, that’s called instrumentation. You can instrument your systems and get CPU and stuff, you can use synthetic probes, you can use JavaScript bugs to get end user monitoring, you can emit metrics from applications, you can introspect services and apps via whatever parts are exposed (from JMX to nginx stats to sysdig traces), you can take network traces… (Some folks are similarly trying to redefine “instrumentation” to just mean application instrumentation, which is lame, and in defiance of the fact that application performance management tools that do app instrumentation have existed for decades.)

You can instrument metrics or events; metrics have certain sampling frequency and resolution…

So what is observability?  This isn’t a new term. It comes from system control theory. You know, the stuff that makes your A/C system and electrical plants and your car work.

Observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs.

Observability is a property of a system. You can monitor a system using various instrumentation, but if the system doesn’t externalize its state well enough that you can figure out what’s actually going on in there, then you’re stuck.

So is observability hippy bullcrap?  No, of course not. In a DevOps world, it’s very important that the apps and systems concentrate on making themselves both observable and controllable (I leave it to the reader to research controllability, unless I get agitated enough to post about that too). Do you make yourself “easy to monitor”?

Externalizing custom metrics contributes to observability (you know, like with dropwizard metrics).  So does good logging.  So does proper architecture!  Take a system that sticks all kinds of messages into one message queue rather than using separate queues for separate types – the latter is more observable; you can more readily see how many of what is flowing through.  (It’s more controllable too, as you can shut off one queue or another.)

Making your system observable is therefore important, so that if you monitor it with appropriate instrumentation, you understand the state of the system and can make short or long term plans to change it.

While a monitoring tool can definitely contribute to this via its innovation in instrumentation, analysis, and visualization, in large part observability is a battle won or lost before you start sticking tools on top of the system. It’s very important to take it into account when designing and implementing services. No tool is going to “give you” observability and that’s the usual silver bullet fallacy heard from someone who wants to sell you something.

I’m not saying every vendor is using the term wrongly (in fact I just came across this New Relic post that is very well done), but I have to say I am less than impressed when common engineering terms are so widely misused and misunderstood widely in our industry.

Would you like to know more?  Peco and I are working on a new lynda.com course on monitoring and observability!  There’ll be real engineering, a broad canvas of the different kinds of monitoring instrumentation, tips on implementation and use… We’ve both been using and/or building monitoring tools for decades now so we hope to have some useful info for you.

1 Comment

Filed under DevOps, Monitoring

Monitoring Survey

James Turnbull (@kartar) has this year’s monitoring survey up, so am reposting his call for participants…

TL;DR – Please take the 2015 Monitoring Survey at

Last year I ran a monitoring survey, whose data I also reviewed as a
series of posts on [my] blog
(http://kartar.net/2014/11/monitoring-survey—background/). I was
interested in running the survey because I think we’re seeing the
beginnings of a significant change in the maturity of the monitoring
landscape and I’d like to track that change.

I’ve decided to make the survey a yearly event and am coinciding the
launch of this year’s survey with Monitorama in Portland.

The survey takes about 5 minutes to fill out and the results will again
be presented on this blog, in some conference talks and made available
as Creative Commons licensed data. The survey is totally anonymous and
the data won’t be used for any commercial purposes.

You can find the survey here –

In related news, if you can’t be at Monitorama try to watch along at http://monitorama.com/#watch!

Leave a comment

Filed under Monitoring

An article I wrote for InfoWorld’s New Tech Forum on all the various monitoring techniques: Know your options for infrastructure monitoring

Leave a comment

by | July 3, 2014 · 2:43 pm

Meet The Agile Admins At Velocity/DevOpsDays Silicon Valley!

Three of the four agile admins (James, Karthik, and myself) will be out at Velocity and DevOpsDays this week. Say hi if you see us!

James will be doing a workshop with Gareth Rushgrove on Tuesday 9-10:30 AM, “Battle-tested Code without the Battle – Security Testing and Continuous Integration.” Get hands on with gauntlt and other tools! [Conference site] [Lanyrd]

Ernest is doing a 5 minute sponsor keynote on Thursday, “A 5 Minute Checklist for Application Monitoring.” OK, so it’s during the USA vs Germany game – come see me anyway!  I hate keynote sales pitches so I’m not doing one, I’ll be talking about a Lean approach to monitoring and stuff to cover in your MVP. There’s a free white paper too since what can you really say in 5 minutes? And so you know what to expect, the hashtag you’ll want to use is #getprobed! [Conference site] [Lanyrd]

Leave a comment

Filed under Conferences

Monitoring and the State of DevOps

If you haven’t read the new  2014 State of DevOps Report from Puppet Labs and other luminaries, check it out now!

I also pulled out some of their findings on monitoring to inspire a post for the Copperegg blog, Monitoring and the State of DevOps, which I thought I’d mention here too.

Leave a comment

Filed under DevOps, Monitoring

Filtering Your Datadog Event Stream

At both NI and Bazaarvoice I was a Datadog user; I wrote a piece for them on filtering the event stream that has just been published on the Datadog blog.  Check it out!

Leave a comment

Filed under DevOps

Velocity 2013 Day 2 Liveblog: Performance Troubleshooting Methodology

Stop the Guessing: Performance Methodologies for Production Systems

Slides are on Slideshare!

Brendan Gregg, Joyent

Note to the reader – this session ruled.

He’s from dtrace but he’s talking about performance for the rest of us. Coming soon, Systems Performance: Enterprises and the Cloud book.

Performance analysis – where do I start and what do I do?  It’s like troubleshooting, it’s easy to fumble around without a playbook. “Tools” are not the answer any more than they’re the answer to “how do I fix my car?”

Guessing Methodologies and Not Guessing Methodologies (Former are bad)


Traffic light anti-method

Monitors green?  You’re fine. But of course thresholds are a coarse grained tool, and performance is complex.  Is X bad?  Well sometimes, except when X, but then when Y, but…” Flase positives and false negatives abound.

You can improve it by more subjective metrics (like weather icons) – onjective is errors, alerts, SLAs – facts.

see dtrace.org status dashboard blog post

So traffic light is intuitive and fast to set up but it’s misleading and causes thrash.

Average anti-method

Measure the average/mean, assume a normal-like unimodal distribution and then focus your investigation on explaining the average.

This misses multiple peaks, outliers.

Fix this by adding histograms, density plots, frequency trails, scatter plots, heat maps

Concentration game anti-method

Pick a metric, find another that looks like it, investigate.

Simple and can discover correlations, but it’s time consuming and mostly you get more symptoms and not the cause.

Workload characterization method

Who is causing the load, why, what, how. Target is the workload not the performance.

lets you eliminate unnecessary work. Only solves load issues though, and most things you examine won’t be a problem.

[Ed: When we did our Black Friday performance visualizer I told them “If I can’t see incoming traffic on the same screen as the latency then it’s bullshit.”]

USE method

For every resource, check utilization, saturation, errors.

util: time resource busy

sat: degree of queued extra work

Finds your bottlenecks quickly

Metrics that are hard to get become feature requests.

You can apply this methodology without knowledge of the system (he did the Apollo 11 command module as an example).

See the use method blog post for detailed commands

For cloud computing you also need the “virtual” resource limits – instance network caps. App stuff like mutex locks and thread pools.  Decompose the app environment into queueing systems.

[Ed: Everything is pools and queues…]

So go home and for your system and app environment, create a USE checklist and fill out metrics you have. You know what you have, know what you don’t have, and a checklist for troubleshooting.

So this is bad ass and efficient, but limited to resource bottlenecks.

Thread State Analysis Method

Six states – executing, runnable, anon paging, sleeping, lock, idle

Getting this isn’t super easy, but dtrace, schedstats, delay accounting, I/O accounting, /proc

Based on where the time is leads to direct actionables.

Compare to e.g. database query time – it’s not self contained. “Time spent in X” – is it really? Or is it contention?

So this identifies, quantifies, and directs but it’s hard to measure all the states atm.

There’s many more if perf is your day job!

Stop the guessing and go with ones that pose questions and seek metrics to answer them.  P.S. use dtrace!


Leave a comment

Filed under Conferences, DevOps

Velocity 2013 Day 1 Liveblog – Bringing the Noise

Next up it’s the Etsy Crew!  A great bunch of guys.  Rembetsy is cutely nervous and proud about his guys presenting. Slides are available here!

And the topic is Bring the Noise: Making Effective Use of a Quarter Million Metrics by @abestanway and @jonlives. Anomaly detection is hard…

At Etsy we want to deploy lots – we have 250 committers, everyone has to deploy code, coder or not. Big “deploy to production” button. 30 deploys/day.

How can we control that kind of pace? Instead of fearing error, we put in the means to detect and recover quickly.

They use ganglia, graphite, and nagios – and they wrote statsd, supergrep, skyline, and oculus as well.

First line of defense – node daemon tailing log files and looking for errors using supergrep.

But not everything throws errors. 😦

So they use statsd to collect zillions of metrics and put them onto dashboards. But dashboards are manually curated “what’s important” – and if you have .25M metrics you just can’t do that.  So the dashboard approach has fallen over here.  And if no one’s watching the graph, why do you have it?

So that’s why Satan invented Nagios, to alert when you can’t look at a graph, but again it breaks down at scale.

Basically you have unknown anomalies and unknown correlations.

They have “kale,” their monitoring stack to try to solve this – skyline solves anomaly detection and oculus solves metrics correlation.


A realtime anomaly detection system (where realtime means ~90s). They have a 10s flush on statsd and a 1 min res on ganglia so that’s still fast.

They had to do this in memory and not disk, using Redis. But how to stream them all in?  They looked around and realized that the carbon-relay on graphite could be used to fork into it by pretending it’s another backup graphite destination.

They import from ganglia too via graphite reading its RRDs. Skyline also has other listeners.

To store time-series data in Redis, minimizing I/O and memory… redis.append() is constant time.

Tried to store in JSON but that was slow (half the CPU time was decoding JSON).

Found Messagepack, a binary-based serialization protocol. Much faster.

So they keep appending, but had to have a process go through and clean up old data past the defined duration. Hence “roomba.py.” Python because of all the good stats libraries. They just keep 24 hours of operational data.

But so what is an anomaly and how do you detect it?

Skyline uses the consensus model. [Ed. This is a common way of distinguishing sensor faults from process faults in real-world engineering.]

Using statistical process control – a metric is anomanouls if its latest datapoint is over three standard deviations above its moving average.

Use “Grubb’s test” and “ordinary least squares”… OK, most of the crowd is lost now. Histogram binning.

Problems – seasonality, spike influence (big spike biases average masking smaller spikes), normality (stddev is for normal distributions, and most data isn’t normal), and parameters. They are trying to further their algorithms.

OK, how about correlations?

Oculus does this.  Can we just compare the graphs? Image comparison is expensive and slow. Numerical comparison is not a hard problem.

“Euclidean Distance” is the most basic comparison of two time series. Dynamic Time Warping helps with phase shifts from time. But that’s expensive – O(n^2).

So how can we discard “obviously” dissimilar data?  Use a shape description alphabet – “basically flat, sharp increment,” etc.  Apply to graphs, cluster using elasticsearch, run dynamic time algorithm on that smaller sample size to polish it. But that’s still slow.  Luckily there’s a fast DTW variant that’s O(n).

So they do an elastic search phrase query with a high slop against the shape fingerprints.
Populate elastic search from redis using resque workers, but it makes it slow to update and search. Solved with rotating pool of elastic search servers – new index/last index. Allows you to purge the index and reindex. They cron-rotate every 2 min. Takes 25s to import, but queries take a while and you don’t want to rotate out from under it.
Sinatra frontend to query ES and render results off the live ES index.

Save collections of interesting correlations and then index those, so that later searches match against current data but also old fingerprints.

Devops is the key to us being able to do this. Abe the dev and Jon the ops guy managed to work all this out in a pretty timely manner.

Demo: Draw your query! He schetched a waveform and it found matching metrics -nice.

Leave a comment

Filed under Conferences, DevOps

Velocity 2013 Day 1 Liveblog – Monitoring and Observability

I’m in San Jose, California for this year’s Velocity Conference! James, Karthik and I flew in on the same flight last night.  I gave them a ride in my sweet rental minivan – a quick In-n-Out run, then to the hotel where we ended up drinking and chatting with Gene Kim, James Turnbull, Marcus, Rembetsy, and some other Etsyers, and even someone from our client Nordstrom’s.

Check out our coverage of previous Velocity events – Peco and I have been to every single one.

I always take notes but then don’t have time to go back and clean them up and post them all – so this time I’m just going to liveblog and you get what you get!

Theo Schlossnagle of OmniTI, getting back to his roots by rocking a psycho hillbilly hairstyle, kicked off the first workshop of the day on Monitoring and Observability. The slides are on Slideshare.

Theo Schlossnagle

The talk starts with a bunch of basic term definitions.

  • Observability is about measuring “things” or state changes and not alter things too bad while observing them.
  • A measurement is a single value from a point in time that you can perform operations upon.

“JSON makes all this worse, being the worst encoding format ever.” JSON lets you describe for example arbitrarily large numbers but the implementations that read/write it are inconsistent.

  • A metric is the thing you are measuring.  Version, cost, # executed, # bugs, whatever.

Basic engineering rule – Never store the “rate” of something.  Collect a measurement/timestamp for a given metric and calculate a rate over time.  Direct measurement of rates generates data loss and ignorance.

  • Measurement velocity is the rate at which new measurements are taken
  • Perspective is from where you’re taking the measurement
  • Trending means understanding the direction/pattern of your measurements on a metric
  • Alerting, durr
  • Anomaly detection is determining that a specific measurement is not within reason

All this is monitoring.  Management is different, we’re just going to talk about observation. Most people suck at monitoring and monitor the wrong things and miss the real important things.

Prefer high level telemetry of business KPIs, team KPIs, staff KPIs… Not to say don’t measure the CPU, but it’s more important to measure “was I at work on time?” not “what’s my engine tach?” That’s not “someone else’s job.”

He wrote reconnoiter (open source) and runs Circonus (service) to try to fix deficiencies.

“Push vs pull” is a dumb question, both have their uses. [Ed. In monitoring, most “X vs Y” debates are stupid because both give you different valid intel.]

Why pull?

  • Synthesized obervations desirable (e.g. “URL monitor”)
  • Observable activity infrequent
  • Alterations in observation/frequency are useful

Why push?

  • Direct observation is desirable
  • Discrete observed actions are useful (e.g. real user monitoring)
  • Discrete observed actions are frequent

“Polling doesn’t scale” – false. This is the age where Google scrapes every Web site in the world, you can poll 10,000 servers from a small VM just fine.

So many protocols to use…

  • SNMP can push (trap) and pull (query)
  • collectd v5/v5 push only
  • statsd push only
  • JMX, etc etc etc.

Do it RESTy. Use JSON now.  XML is better but now people stop listening to you when you say “XML” – they may be dolts but I got tired of swimming upstream. PUT/POST for push and GET for pull.

nad – Node Agent Daemon, new open source widget Theo wrote, use this if you’re trying to escape from the SNMP helhole.  Runs scripts, hands back in JSON. Can push or pull. Does SSL. Tiny.

But that’s not methodology, it’s technology. Just wanted to get “but how?” out of the way. The more interesting question is “what should I be monitoring?”  You should ask yourself this before, during, and after implementing your software. If you could only monitor one thing, what would it be?  Hint: probably not CPU. Sure, “monitor all the things” but you need to understand what your company does and what you really need to watch.

So let’s take an example of an ecomm site.  You could monitor if customers can buy stuff from your site (probably synthetic) or if they are buying stuff from your site (probably RUM). No one right answer, has to do with velocity.  1 sale/day for $600k per order – synthetic, want to know capability. 10 sales/minute with smooth trends – RUM, want to know velocity.

We have this whole new field of “data science” because most of us don’t do math well.

Tenet: Always synthesize, and additionally observe real data when possible.

Synthesizing a GET with curl gets you all kinds of stuff – code, timings (first byte, full…), SSL info, etc.

You can curl but you could also use a browser – so try phantomjs. It’s more representative, you see things that block users that curl doesn’t interpret.

Demo of nad to phantomjs running a local check with start and end of load timings.

Passive… Google Analytics, Omniture.  Statsd and Metrics are a mediocre approach here. But if you have lots of observable data, e.g. the average of N over the last X time is not useful. NO RATES I TOLD YOU DON’T MAKE ME STICK YOU! At least add stddev, cardinality, min/max/95th/99th… But these things don’t follow standard distributions so e.g. stddev is deceptive.  If you take 60k API hits and boil it down to 8 metrics you lose a lot.

How do you get more richness out of that data? We use statsd to store all the data and shows histograms. Oh look, it’s a 3-mode distribution, who knew.

A heat map of histograms doesn’t take any more space than a line graph of averages and is a billion times more useful.  Can use some tools, or build in R.

Now we’ll talk about dtrace… Stop having to “wonder” if X is true about your software in production right now. “Is that queue backed up? Is my btree imbalanced?” Instrument your software. It’s easy with DTrace but only a bit more work otherwise.

Use case – they wrote a db called Sauna that’s a metrics db. They can just hit and get a big JSON telemetry exposure with all the current info, rollups, etc.

Monitoring everything is good but make sure you get the good stuff first and then don’t alert on things without specific remediation requirements.

Collect once and then split streams – if you collect and alert in Zabbix but graph in graphite it’s just confusing and crappy.

Tenet: Never make an alert without a failure condition in plain English, the business impact of the failure condition, a concise and repeatable remediation procedure, and an escalation path. That doesn’t have to all be “in the alert” but linking to a wiki or whatever is good.

How to get there? Do alerting postmortems. Understand why it alerted, what was done to fix, bring in stakeholders, have the stakeholder speak to the business impact. [Ed. We have super awful alerting right now and this is a good playbook to get started!]

Q: How do you handle alerts/oncall?  Well, the person oncall is on call during the day too, so they handle 24×7. [Ed. We do that too…]

Q: How does your monitoring system identify the root cause of an issue?  That’s BS, it can’t without AI.  Human mind is required for causation.  A monitoring system can show you highly correlated behavior to guide that determination. Statistical data around a window.

Q: How to set thresholds?  We use lots. Some stock, some Holt-Winters, starting into some Markov… Human train on which algorithms are “less crappy.”

Q: Metrics db? We use a commercial one called Snowth that is cool, but others use cassandra successfully.

Q: How much system performance compromise is OK to get the data? I hate sampling because you lose stuff, and dropping 12 bytes into UDP never hurt anyone… Log to the network, transmit everything, then decide later how to store/sample.

Don’t forget to check out his conference, SURGE.


Filed under Conferences, DevOps

Monitorin’ Ain’t Easy

The DevOps space has been aglow with discussion about monitoring.  Monitoring, much like pimping, is not easy. Everyone does it, but few do it well.

Luckily, here on the agile admin we are known for keeping our pimp hands strong, especially when it comes to monitoring. So let’s talk about monitoring, and how to use it to keep your systems in line and giving you your money!

In November, I posted about why your monitoring is lying to you. It turns out that this is part of a DevOps-wide frenzy of attention to monitoring.

John Vincent (@lusis) started it with his Why Monitoring Sucks blog post, which has morphed into a monitoringsucks github project to catalog monitoring tools and needs. Mainstream monitoring hasn’t changed in oh like 10 years but the world of computing certainly has, so there’s a gap appearing. This refrain was quickly picked up by others (Monitoring  Sucks. Do Something About It.) There was a Monitoring Sucks panel at SCALE last week and there’s even a #monitoringsucks hashtag.

Patrick Debois (@patrickdebois) has helped step into the gap with his series of “Monitoring Wonderland” articles where he’s rounding up all kinds of tools. Check them out…

However it just shows how fragmented and confusing the space is. It also focuses almost completely on the open source side – I love open source and all but sometimes you have to pay for something. Though the “big ol’ suite” approach from the HP/IBM/CA lot makes me scream and flee, there’s definitely options worth paying for.

Today we had a local DevOps meetup here in Austin where we discussed monitoring. It showed how fragmented the current state is.  And we met some other folks from a “real” engineering company, like NI, and it brought to mind how crap IT type monitoring is when compared to engineering monitoring in terms of sophistication.  IT monitoring usually has better interfaces and alerting, but IT monitoring products are very proud when they have “line graphs!” or the holy grail, “histograms!” Engineering monitoring systems have algorithms where they can figure out the difference between a real problem and a monitoring problem.  They apply advanced algorithms when looking at incoming metrics (hint: signal processing).  When is anyone in IT world who’s all delirious about how cool “metrics” going to figure out some math above the community college level?

To me, the biggest gap especially in cloud land – partially being addressed by New Relic and Boundary – in the space is agent based real user monitoring.  I want to know each user and incoming/outgoing transaction, not at the “tcpdump” level but at the meaningful level.  And I don’t want to have to count on the app to log it – besides the fact that devs are notoriously shitful loggers, there are so many cases where something goes wrong – if tomcat’s down, it’s not logging, but requests are still coming in…  Synthetic monitoring and app metrics are good but they tend to not answer most of the really hard questions we get with cloud apps.

We did a big APM (application performance management) tool eval at NI, and got a good idea of the strengths and weaknesses of the many approaches. You end up wanting many/all of them really. Pulling box metrics via SNMP or agents, hitting URLs via synthetic monitors locally or across the Internet, passive network based real user monitoring, deep dive metric gathering (Opnet/AppDynamics/New Relic/etc.)…  We’ll post more about our thoughts on all these (especially Peco, who led that eval and is now working for an APM company!).

Your thoughts on monitoring?  Hit me!

Leave a comment

Filed under DevOps