Next up it’s the Etsy Crew! A great bunch of guys. Rembetsy is cutely nervous and proud about his guys presenting. Slides are available here!
And the topic is Bring the Noise: Making Effective Use of a Quarter Million Metrics by @abestanway and @jonlives. Anomaly detection is hard…
At Etsy we want to deploy lots – we have 250 committers, everyone has to deploy code, coder or not. Big “deploy to production” button. 30 deploys/day.
How can we control that kind of pace? Instead of fearing error, we put in the means to detect and recover quickly.
They use ganglia, graphite, and nagios – and they wrote statsd, supergrep, skyline, and oculus as well.
First line of defense – node daemon tailing log files and looking for errors using supergrep.
But not everything throws errors. 😦
So they use statsd to collect zillions of metrics and put them onto dashboards. But dashboards are manually curated “what’s important” – and if you have .25M metrics you just can’t do that. So the dashboard approach has fallen over here. And if no one’s watching the graph, why do you have it?
So that’s why Satan invented Nagios, to alert when you can’t look at a graph, but again it breaks down at scale.
Basically you have unknown anomalies and unknown correlations.
They have “kale,” their monitoring stack to try to solve this – skyline solves anomaly detection and oculus solves metrics correlation.
Skyline is a realtime anomaly detection system (where realtime means ~90s). They have a 10s flush on statsd and a 1 min res on ganglia so that’s still fast.
They had to do this in memory and not disk, using Redis. But how to stream them all in? They looked around and realized that the carbon-relay on graphite could be used to fork into it by pretending it’s another backup graphite destination.
They import from ganglia too via graphite reading its RRDs. Skyline also has other listeners.
To store time-series data in Redis, minimizing I/O and memory… redis.append() is constant time.
Tried to store in JSON but that was slow (half the CPU time was decoding JSON).
Found MessagePack, a binary serialization format. Much faster.
So they keep appending, but had to have a process go through and clean up old data past the defined duration. Hence “roomba.py.” Python because of all the good stats libraries. They just keep 24 hours of operational data.
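To make the append-then-clean-up idea concrete, here is a minimal sketch. It is not Skyline's code: a plain dict of bytearrays stands in for Redis, and fixed-width `struct` packing stands in for MessagePack so the example runs on the standard library alone. The names `store`, `append_point`, `read_points` and `trim_old` are made up for illustration.

```python
import struct

# Hypothetical in-memory stand-in for Redis: metric name -> packed bytes.
store = {}

# Each datapoint is a fixed-width (timestamp, value) pair of doubles.
# (The talk used MessagePack; struct is a stdlib stand-in for this sketch.)
POINT = struct.Struct("!dd")


def append_point(name, timestamp, value):
    """Append one packed datapoint to the end of the metric's byte string."""
    store.setdefault(name, bytearray()).extend(POINT.pack(timestamp, value))


def read_points(name):
    """Unpack the whole series into a list of (timestamp, value) tuples."""
    return list(POINT.iter_unpack(store.get(name, b"")))


def trim_old(name, now, max_age_seconds):
    """Roomba-style cleanup: rewrite the series without expired datapoints."""
    kept = [p for p in read_points(name) if now - p[0] <= max_age_seconds]
    store[name] = bytearray(b"".join(POINT.pack(*p) for p in kept))
```

Writers only ever append, so the hot path stays cheap; the occasional full rewrite is pushed into the separate cleanup pass.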
But so what is an anomaly and how do you detect it?
Skyline uses the consensus model. [Ed. This is a common way of distinguishing sensor faults from process faults in real-world engineering.]
Using statistical process control – a metric is anomalous if its latest datapoint is over three standard deviations above its moving average.
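A rough sketch of that three-sigma rule, using only the standard library. The window size and the handling of a perfectly flat history are my assumptions, not something stated in the talk.

```python
import statistics


def is_anomalous(series, window=30, sigmas=3.0):
    """Return True if the latest point is more than `sigmas` standard
    deviations above the mean of the preceding `window` points."""
    if len(series) < 3:
        return False  # not enough history to estimate a spread
    history = series[-(window + 1):-1]
    mean = statistics.mean(history)
    stddev = statistics.pstdev(history)
    if stddev == 0:
        # Assumption: with a perfectly flat history, any rise counts.
        return series[-1] > mean
    return series[-1] - mean > sigmas * stddev
```

For example, a series hovering around 10 with a final value of 50 is flagged, while a final value of 11 is not.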
Use “Grubbs’ test” and “ordinary least squares”… OK, most of the crowd is lost now. Histogram binning.
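One way to read the consensus model is that several independent detectors each vote, and a metric is only flagged when enough of them agree. The sketch below assumes that reading. The three detectors are simplified illustrations: the median-deviation check is my own stand-in (Grubbs’ test needs t-distribution tables that the standard library lacks), and the thresholds and the `consensus` parameter are hypothetical.

```python
import statistics


def three_sigma(series):
    """Latest point is over three standard deviations from the mean."""
    history, latest = series[:-1], series[-1]
    spread = statistics.pstdev(history)
    return spread > 0 and abs(latest - statistics.mean(history)) > 3 * spread


def least_squares(series):
    """Fit a line to the history by ordinary least squares and flag the
    latest point if it lands far from where the line projects."""
    history, latest = series[:-1], series[-1]
    n = len(history)
    xs = range(n)
    x_mean, y_mean = statistics.mean(xs), statistics.mean(history)
    denom = sum((x - x_mean) ** 2 for x in xs)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, history)) / denom
    intercept = y_mean - slope * x_mean
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, history)]
    spread = statistics.pstdev(residuals)
    projected = slope * n + intercept
    return spread > 0 and abs(latest - projected) > 3 * spread


def median_deviation(series):
    """Stand-in detector: latest point is far from the median, measured
    in units of the median absolute deviation."""
    history, latest = series[:-1], series[-1]
    median = statistics.median(history)
    mad = statistics.median(abs(y - median) for y in history)
    return mad > 0 and abs(latest - median) > 6 * mad


DETECTORS = [three_sigma, least_squares, median_deviation]


def is_anomalous(series, detectors=DETECTORS, consensus=2):
    """Flag the series only if at least `consensus` detectors agree."""
    if len(series) < 4:
        return False  # too little history for any detector
    votes = sum(1 for detect in detectors if detect(series))
    return votes >= consensus
```

Requiring agreement means a single twitchy algorithm cannot page anyone on its own, at the cost of missing anomalies only one detector can see.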
Problems – seasonality, spike influence (big spike biases average masking smaller spikes), normality (stddev is for normal distributions, and most data isn’t normal), and parameters. They are trying to further their algorithms.
OK, how about correlations?
Oculus does this. Can we just compare the graphs? Image comparison is expensive and slow. Numerical comparison is not a hard problem.
“Euclidean Distance” is the most basic comparison of two time series. Dynamic Time Warping (DTW) handles phase shifts along the time axis. But that’s expensive – O(n^2).
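To see why the two measures differ, here is a small sketch of both. The DTW below is the textbook dynamic-programming version with an absolute-difference step cost, which I am assuming for illustration; it is not Oculus’s implementation.

```python
import math


def euclidean_distance(a, b):
    """Point-by-point distance; both series must have the same length."""
    if len(a) != len(b):
        raise ValueError("series must have equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(a, b):
    """Classic DTW: O(len(a) * len(b)) time and memory."""
    cost = [[math.inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: stretch a, stretch b, or advance both.
            cost[i][j] = step + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[len(a)][len(b)]
```

Take the same bump shifted by one sample, `[0, 0, 1, 2, 1, 0, 0]` versus `[0, 1, 2, 1, 0, 0, 0]`: the Euclidean distance is 2.0, while DTW warps the time axis and reports 0.0. The nested loops are where the quadratic cost comes from.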
So how can we discard “obviously” dissimilar data? Use a shape description alphabet – “basically flat, sharp increment,” etc. Apply to graphs, cluster using Elasticsearch, run dynamic time warping on that smaller sample size to polish it. But that’s still slow. Luckily there’s a fast DTW variant that’s O(n).
So they do an Elasticsearch phrase query with a high slop against the shape fingerprints.
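A shape fingerprint along those lines might be built roughly like this. The word list and the `flat` and `sharp` thresholds are invented for the sketch; the talk did not give the actual alphabet.

```python
def shape_fingerprint(series, flat=0.05, sharp=0.5):
    """Describe each step between datapoints as a word, based on the
    change relative to the series' overall range."""
    span = (max(series) - min(series)) or 1.0  # avoid dividing by zero
    words = []
    for prev, curr in zip(series, series[1:]):
        change = (curr - prev) / span
        if abs(change) <= flat:
            words.append("flat")
        elif change > 0:
            words.append("sharp_up" if change >= sharp else "up")
        else:
            words.append("sharp_down" if -change >= sharp else "down")
    return " ".join(words)
```

Because each step is normalised by the range, `[0, 0, 10, 10, 9]` and `[0, 0, 100, 100, 90]` both come out as `flat sharp_up flat down`. That makes the fingerprint ordinary text that a search engine can index and match loosely, leaving the expensive numerical comparison for the few candidates that survive.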
Populate Elasticsearch from Redis using resque workers, but it makes it slow to update and search. Solved with a rotating pool of Elasticsearch servers – new index/last index. Allows you to purge the index and reindex. They cron-rotate every 2 min. Takes 25s to import, but queries take a while and you don’t want to rotate out from under them.
Sinatra frontend to query ES and render results off the live ES index.
Save collections of interesting correlations and then index those, so that later searches match against current data but also old fingerprints.
Devops is the key to us being able to do this. Abe the dev and Jon the ops guy managed to work all this out in a pretty timely manner.
Demo: Draw your query! He sketched a waveform and it found matching metrics. Nice.