I’m in San Jose, California for this year’s Velocity Conference! James, Karthik and I flew in on the same flight last night. I gave them a ride in my sweet rental minivan – a quick In-n-Out run, then to the hotel where we ended up drinking and chatting with Gene Kim, James Turnbull, Marcus, Rembetsy, and some other Etsyers, and even someone from our client Nordstrom’s.
Check out our coverage of previous Velocity events – Peco and I have been to every single one.
I always take notes but then don’t have time to go back and clean them up and post them all – so this time I’m just going to liveblog and you get what you get!
The talk starts with a bunch of basic term definitions.
- Observability is about measuring “things” or state changes without altering them too much in the act of observing.
- A measurement is a single value from a point in time that you can perform operations upon.
“JSON makes all this worse, being the worst encoding format ever.” JSON the format lets you describe, for example, arbitrarily large numbers, but the implementations that read and write it handle them inconsistently.
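A quick illustration of that inconsistency (this particular number is my example, not one from the talk): the JSON grammar happily carries an integer bigger than 2^53, Python’s parser keeps it exact, but any implementation that maps JSON numbers to IEEE-754 doubles (as JavaScript does) silently rounds it.

```python
import json

# The JSON *format* allows arbitrarily large numbers...
doc = '{"count": 9007199254740993}'  # 2**53 + 1

# Python's json module parses it as an exact int:
exact = json.loads(doc)["count"]

# ...but an implementation that stores numbers as IEEE-754 doubles
# (as JavaScript does) silently rounds it:
as_double = float(exact)

print(exact)           # 9007199254740993
print(int(as_double))  # 9007199254740992 -- off by one
```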
- A metric is the thing you are measuring. Version, cost, # executed, # bugs, whatever.
Basic engineering rule – Never store the “rate” of something. Collect a measurement/timestamp pair for a given metric and calculate rates over time from those. Storing rates directly throws away data and leaves you ignorant of the underlying measurements.
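The rule in miniature (my sketch, not code from the talk): keep the raw (timestamp, value) samples of a counter, and derive the rate whenever you need it, at whatever resolution you need it.

```python
def rate(samples):
    """Derive per-second rates from raw (timestamp, value) measurements
    of a monotonically increasing counter. The raw samples are kept;
    the rate is always recomputable, at any resolution, after the fact."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        rates.append(((t0, t1), (v1 - v0) / (t1 - t0)))
    return rates

# e.g. a request counter sampled every 10 seconds:
samples = [(0, 100), (10, 250), (20, 250), (30, 550)]
print(rate(samples))  # [((0, 10), 15.0), ((10, 20), 0.0), ((20, 30), 30.0)]
```

Store only the 15.0 and you can never recover the 100 and the 250; store the samples and every rate window is still available.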
- Measurement velocity is the rate at which new measurements are taken
- Perspective is from where you’re taking the measurement
- Trending means understanding the direction/pattern of your measurements on a metric
- Alerting, durr
- Anomaly detection is determining that a specific measurement is not within reason
All this is monitoring. Management is different; we’re just going to talk about observation. Most people suck at monitoring – they monitor the wrong things and miss the really important ones.
Prefer high level telemetry of business KPIs, team KPIs, staff KPIs… Not to say don’t measure the CPU, but it’s more important to measure “was I at work on time?” not “what’s my engine tach?” That’s not “someone else’s job.”
He wrote reconnoiter (open source) and runs Circonus (service) to try to fix deficiencies.
“Push vs pull” is a dumb question, both have their uses. [Ed. In monitoring, most “X vs Y” debates are stupid because both give you different valid intel.]
- Synthesized observations desirable (e.g. “URL monitor”)
- Observable activity infrequent
- Alterations in observation/frequency are useful
- Direct observation is desirable
- Discrete observed actions are useful (e.g. real user monitoring)
- Discrete observed actions are frequent
“Polling doesn’t scale” – false. This is the age where Google scrapes every Web site in the world, you can poll 10,000 servers from a small VM just fine.
So many protocols to use…
- SNMP can push (trap) and pull (query)
- collectd v4/v5 push only
- statsd push only
- JMX, etc etc etc.
Do it RESTy. Use JSON now. XML is better, but people stop listening to you when you say “XML” – they may be dolts, but I got tired of swimming upstream. PUT/POST for push and GET for pull.
nad – Node Agent Daemon, a new open source widget Theo wrote; use this if you’re trying to escape from the SNMP hellhole. Runs scripts, hands results back as JSON. Can push or pull. Does SSL. Tiny.
But that’s not methodology, it’s technology. Just wanted to get “but how?” out of the way. The more interesting question is “what should I be monitoring?” You should ask yourself this before, during, and after implementing your software. If you could only monitor one thing, what would it be? Hint: probably not CPU. Sure, “monitor all the things” but you need to understand what your company does and what you really need to watch.
So let’s take an example of an ecomm site. You could monitor if customers can buy stuff from your site (probably synthetic) or if they are buying stuff from your site (probably RUM). No one right answer, has to do with velocity. 1 sale/day for $600k per order – synthetic, want to know capability. 10 sales/minute with smooth trends – RUM, want to know velocity.
We have this whole new field of “data science” because most of us don’t do math well.
Tenet: Always synthesize, and additionally observe real data when possible.
Synthesizing a GET with curl gets you all kinds of stuff – code, timings (first byte, full…), SSL info, etc.
You can curl, but you could also use a browser – so try phantomjs. It’s more representative; you see things that block users that curl doesn’t interpret.
Demo of nad to phantomjs running a local check with start and end of load timings.
Passive… Google Analytics, Omniture. Statsd and Metrics are a mediocre approach here when you have lots of observable data – e.g. the average of N over the last X time is not useful. NO RATES I TOLD YOU DON’T MAKE ME STICK YOU! At least add stddev, cardinality, min/max/95th/99th… But these things don’t follow standard distributions, so e.g. stddev is deceptive. If you take 60k API hits and boil them down to 8 metrics you lose a lot.
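Here’s the boil-down problem made concrete (my numbers, not the talk’s): a bimodal latency distribution – fast cache hits plus slow DB misses – where the mean describes a latency almost nobody actually experienced and stddev is huge but meaningless for the shape.

```python
import statistics

# Two latency populations: fast cache hits and slow DB misses.
latencies_ms = [5] * 900 + [500] * 100  # 1000 "API hits"

mean = statistics.mean(latencies_ms)     # 54.5 -- a latency almost nobody saw
stdev = statistics.pstdev(latencies_ms)  # 148.5 -- deceptive on a bimodal shape
ordered = sorted(latencies_ms)
median = ordered[len(ordered) // 2]                    # 5
p99 = ordered[int(0.99 * len(ordered)) - 1]            # 500

print(mean, stdev, median, p99)
```

No single summary number – and not even all four together – tells you there are exactly two modes; that’s the information the histogram keeps and the rollup throws away.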
How do you get more richness out of that data? We use statsd to store all the data and show histograms. Oh look, it’s a 3-mode distribution, who knew.
A heat map of histograms doesn’t take any more space than a line graph of averages and is a billion times more useful. Can use some tools, or build in R.
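Part of why the heat map stays cheap: each time window collapses to a small set of bins and counts rather than every raw value. A minimal sketch of the usual log-linear binning trick (hypothetical helper, not any particular tool’s format) – bins have fixed relative error, and storage is bounded no matter how many measurements land in a window:

```python
from collections import Counter

def log_linear_bin(v):
    """Bucket a positive measurement into a log-linear bin, keyed as
    (mantissa, exponent): the bin covers [m * 10**e, (m+1) * 10**e).
    Two significant digits => fixed ~10% worst-case relative error."""
    if v <= 0:
        return (0, 0)
    exp = 0
    while v >= 100:
        v /= 10.0
        exp += 1
    while v < 10:
        v *= 10.0
        exp -= 1
    return (int(v), exp)

# One time window's worth of latencies collapses to bin -> count:
window = [3.2, 3.4, 87, 91, 412, 415, 418]
hist = Counter(log_linear_bin(v) for v in window)
print(hist)  # the three 41x values share one bin: (41, 1) -> 3
```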
Now we’ll talk about dtrace… Stop having to “wonder” if X is true about your software in production right now. “Is that queue backed up? Is my btree imbalanced?” Instrument your software. It’s easy with DTrace but only a bit more work otherwise.
Use case – they wrote a metrics db called Sauna. They can just hit it and get a big JSON telemetry exposure with all the current info, rollups, etc.
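The general pattern, in a hypothetical sketch (these class and field names are mine, not Sauna’s actual interface): keep internal counters and gauges as you go, and make the whole current state dumpable as one JSON document so nobody has to wonder what the queue depth is right now.

```python
import json
import threading

class Telemetry:
    """Hypothetical sketch: instrument your software so its current
    internal state is always exposable as a single JSON document."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = {}
        self._gauges = {}

    def incr(self, name, by=1):
        with self._lock:
            self._counters[name] = self._counters.get(name, 0) + by

    def gauge(self, name, value):
        with self._lock:
            self._gauges[name] = value

    def snapshot(self):
        with self._lock:
            return json.dumps({"counters": dict(self._counters),
                               "gauges": dict(self._gauges)})

t = Telemetry()
t.incr("writes")
t.incr("writes")
t.gauge("queue_depth", 17)
print(t.snapshot())
```

Serve `snapshot()` from an HTTP handler and you have the “hit it and get JSON” style of exposure described above.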
Monitoring everything is good but make sure you get the good stuff first and then don’t alert on things without specific remediation requirements.
Collect once and then split streams – if you collect and alert in Zabbix but graph in graphite it’s just confusing and crappy.
Tenet: Never make an alert without a failure condition in plain English, the business impact of the failure condition, a concise and repeatable remediation procedure, and an escalation path. That doesn’t have to all be “in the alert” but linking to a wiki or whatever is good.
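The tenet as a data structure (my sketch; field names are hypothetical): if an alert definition can’t be constructed without all four pieces, you can’t forget one.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    """An alert isn't allowed to exist without all four pieces."""
    failure_condition: str  # plain English
    business_impact: str
    remediation: str        # concise and repeatable; a wiki link is fine
    escalation_path: str

checkout_errors = Alert(
    failure_condition="Checkout error rate above 1% for 5 minutes",
    business_impact="Customers cannot complete purchases; direct revenue loss",
    remediation="https://wiki.example.com/runbooks/checkout-errors",
    escalation_path="On-call -> payments team lead -> VP engineering",
)
```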
How to get there? Do alerting postmortems. Understand why it alerted, what was done to fix, bring in stakeholders, have the stakeholder speak to the business impact. [Ed. We have super awful alerting right now and this is a good playbook to get started!]
Q: How do you handle alerts/oncall? Well, the person oncall is on call during the day too, so they handle 24×7. [Ed. We do that too…]
Q: How does your monitoring system identify the root cause of an issue? That’s BS, it can’t without AI. The human mind is required to determine causation. A monitoring system can show you highly correlated behavior – statistical data around a time window – to guide that determination.
Q: How to set thresholds? We use lots. Some stock, some Holt-Winters, starting into some Markov… Human train on which algorithms are “less crappy.”
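For flavor, the simplest ancestor of the Holt-Winters family he mentions (my sketch; Holt-Winters proper adds trend and seasonal components on top of this): single exponential smoothing, whose output can serve as a dynamic baseline to threshold against.

```python
def exponential_smoothing(values, alpha=0.3):
    """Single exponential smoothing. The smoothed series is a dynamic
    baseline: alert when a new measurement strays too far from it,
    instead of hand-picking a static threshold."""
    smoothed = [values[0]]
    for v in values[1:]:
        smoothed.append(alpha * v + (1 - alpha) * smoothed[-1])
    return smoothed

series = [10, 10, 10, 50, 10, 10]
baseline = exponential_smoothing(series)
# The spike to 50 pulls the baseline up only gradually:
print([round(s, 1) for s in baseline])
```

A low `alpha` makes the baseline sluggish and spike-tolerant; a high one makes it track (and forgive) recent behavior quickly – exactly the kind of knob that needs human training on which settings are “less crappy.”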
Q: Metrics db? We use a commercial one called Snowth that is cool, but others use cassandra successfully.
Q: How much system performance compromise is OK to get the data? I hate sampling because you lose stuff, and dropping 12 bytes into UDP never hurt anyone… Log to the network, transmit everything, then decide later how to store/sample.
Don’t forget to check out his conference, SURGE.