Velocity 2013 Day 1 Liveblog – Bringing the Noise

Next up it’s the Etsy Crew!  A great bunch of guys.  Rembetsy is cutely nervous and proud about his guys presenting. Slides are available here!

And the topic is Bring the Noise: Making Effective Use of a Quarter Million Metrics by @abestanway and @jonlives. Anomaly detection is hard…

At Etsy we want to deploy lots – we have 250 committers, everyone has to deploy code, coder or not. Big “deploy to production” button. 30 deploys/day.

How can we control that kind of pace? Instead of fearing error, we put in the means to detect and recover quickly.

They use ganglia, graphite, and nagios – and they wrote statsd, supergrep, skyline, and oculus as well.

First line of defense – node daemon tailing log files and looking for errors using supergrep.

But not everything throws errors. 😦

So they use statsd to collect zillions of metrics and put them onto dashboards. But dashboards are manually curated “what’s important” – and if you have .25M metrics you just can’t do that.  So the dashboard approach has fallen over here.  And if no one’s watching the graph, why do you have it?

So that’s why Satan invented Nagios, to alert when you can’t look at a graph, but again it breaks down at scale.

Basically you have unknown anomalies and unknown correlations.

They have “kale,” their monitoring stack to try to solve this – skyline solves anomaly detection and oculus solves metrics correlation.

Skyline

A realtime anomaly detection system (where realtime means ~90s). They have a 10s flush on statsd and a 1 min res on ganglia so that’s still fast.

They had to do this in memory and not disk, using Redis. But how to stream them all in?  They looked around and realized that the carbon-relay on graphite could be used to fork into it by pretending it’s another backup graphite destination.

They import from ganglia too via graphite reading its RRDs. Skyline also has other listeners.

To store time-series data in Redis, minimizing I/O and memory… redis.append() is constant time.

Tried to store in JSON but that was slow (half the CPU time was decoding JSON).

Found MessagePack, a binary serialization format. Much faster.

So they keep appending, but had to have a process go through and clean up old data past the defined duration. Hence “roomba.py.” Python because of all the good stats libraries. They just keep 24 hours of operational data.

But so what is an anomaly and how do you detect it?

Skyline uses the consensus model. [Ed. This is a common way of distinguishing sensor faults from process faults in real-world engineering.]

Using statistical process control – a metric is anomalous if its latest datapoint is over three standard deviations above its moving average.
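The three-sigma rule is easy to sketch. This is only an illustration, not Skyline's actual code: the helper name `three_sigma_anomalous?` is made up, and a plain mean over the earlier points stands in for the moving average.

```ruby
# Hypothetical helper (not from Skyline): flags a series whose latest point
# sits more than `sigmas` standard deviations above the mean of the points
# that came before it.
def three_sigma_anomalous?(series, sigmas: 3.0)
  history = series[0...-1]
  return false if history.size < 2

  mean = history.sum.to_f / history.size
  variance = history.sum { |v| (v - mean)**2 } / history.size
  stddev = Math.sqrt(variance)
  return false if stddev.zero? # a perfectly flat history gives no scale to judge by

  series.last > mean + sigmas * stddev
end

steady = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11]
puts three_sigma_anomalous?(steady + [10]) # false
puts three_sigma_anomalous?(steady + [40]) # true
```

Note that the check only looks upward, matching "above its moving average"; a metric that drops to zero would need a second test.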

Use “Grubbs’ test” and “ordinary least squares”… OK, most of the crowd is lost now. Histogram binning.
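The consensus idea can be sketched as a vote among independent detectors. The three detectors here are deliberately simple stand-ins chosen for illustration (they are not Grubbs' test or least squares), and every method name is an assumption rather than anything from Skyline.

```ruby
def mean(xs)
  xs.sum.to_f / xs.size
end

def stddev(xs)
  m = mean(xs)
  Math.sqrt(xs.sum { |v| (v - m)**2 } / xs.size)
end

# Detector 1: latest point is more than three standard deviations above the mean.
def above_three_sigma?(series)
  history = series[0...-1]
  series.last > mean(history) + 3 * stddev(history)
end

# Detector 2: latest point is above the highest value seen before.
def above_previous_max?(series)
  series.last > series[0...-1].max
end

# Detector 3: latest point is more than double the (upper) median of the history.
def above_twice_median?(series)
  sorted = series[0...-1].sort
  series.last > 2 * sorted[sorted.size / 2]
end

# Consensus: report an anomaly only when at least `needed` detectors agree.
def consensus_anomalous?(series, needed: 2)
  votes = [
    above_three_sigma?(series),
    above_previous_max?(series),
    above_twice_median?(series)
  ]
  votes.count(true) >= needed
end

steady = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11]
puts consensus_anomalous?(steady + [12.5]) # false: only one detector fires
puts consensus_anomalous?(steady + [13.5]) # true: two detectors fire
```

Requiring agreement trades a little sensitivity for fewer false alarms from any single noisy test.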

Problems – seasonality, spike influence (big spike biases average masking smaller spikes), normality (stddev is for normal distributions, and most data isn’t normal), and parameters. They are trying to further their algorithms.

OK, how about correlations?

Oculus does this.  Can we just compare the graphs? Image comparison is expensive and slow. Numerical comparison is not a hard problem.

“Euclidean Distance” is the most basic comparison of two time series. Dynamic Time Warping helps with phase shifts in time. But that’s expensive – O(n^2).
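To make the difference concrete, here is a minimal sketch of both measures. It is the textbook dynamic-programming form of DTW with absolute difference as the step cost, not Oculus's implementation, and the method names are illustrative.

```ruby
# Straight-line distance between two equal-length series.
def euclidean_distance(a, b)
  raise ArgumentError, "series must be the same length" unless a.size == b.size

  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Classic dynamic time warping: fills an (n+1) x (m+1) cost table, which is
# where the quadratic cost comes from.
def dtw_distance(a, b)
  inf = Float::INFINITY
  cost = Array.new(a.size + 1) { Array.new(b.size + 1, inf) }
  cost[0][0] = 0.0

  (1..a.size).each do |i|
    (1..b.size).each do |j|
      step = (a[i - 1] - b[j - 1]).abs
      cost[i][j] = step + [cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]].min
    end
  end

  cost[a.size][b.size]
end

spike   = [0, 0, 5, 0, 0, 0]
shifted = [0, 0, 0, 5, 0, 0]
puts euclidean_distance(spike, shifted) # about 7.07: the shift looks like a big difference
puts dtw_distance(spike, shifted)       # 0.0: warping lines the two spikes up
```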

So how can we discard “obviously” dissimilar data?  Use a shape description alphabet – “basically flat, sharp increment,” etc.  Apply to graphs, cluster using Elasticsearch, run dynamic time warping on that smaller sample to polish it. But that’s still slow.  Luckily there’s a fast DTW variant that’s O(n).
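A shape alphabet could look something like the sketch below. The words and thresholds are invented for illustration; the talk did not spell out Oculus's real alphabet or how it normalizes a series before encoding it.

```ruby
# Hypothetical alphabet: one word per step between neighbouring datapoints.
# The `flat` and `sharp` thresholds are assumptions and would need tuning
# (or normalized input) for real metrics.
def shape_word(delta, flat: 0.5, sharp: 3.0)
  magnitude = delta.abs
  return "flat" if magnitude <= flat

  size = magnitude >= sharp ? "sharp" : "slight"
  direction = delta.positive? ? "increase" : "decrease"
  "#{size}_#{direction}"
end

# Turns a series into a space-separated fingerprint that a text search
# engine can index and match as a phrase.
def shape_fingerprint(series)
  series.each_cons(2).map { |a, b| shape_word(b - a) }.join(" ")
end

puts shape_fingerprint([1, 1, 2, 6, 6, 2])
# flat slight_increase sharp_increase flat sharp_decrease
```

Because the fingerprint is just words, cheap text matching can narrow the candidates before the expensive numerical comparison runs.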

So they do an Elasticsearch phrase query with a high slop against the shape fingerprints.

They populate Elasticsearch from Redis using resque workers, but that makes it slow to update and search. Solved with a rotating pool of Elasticsearch servers – new index/last index. That allows you to purge the index and reindex. They cron-rotate every 2 min. It takes 25s to import, but queries take a while and you don’t want to rotate out from under them.

Sinatra frontend to query ES and render results off the live ES index.

Save collections of interesting correlations and then index those, so that later searches match against current data but also old fingerprints.

Devops is the key to us being able to do this. Abe the dev and Jon the ops guy managed to work all this out in a pretty timely manner.

Demo: Draw your query! He sketched a waveform and it found matching metrics – nice.

Filed under Conferences, DevOps

Velocity 2013 Day 1 liveblog: Avoiding web performance regression

Avoiding web performance regression

By Marcel Duran (@marcelduran)

Works on the #web-core team at Twitter.
Check out #flight

Problem: After a new release, apps get slower sometimes…

Monitoring is a reactive solution to solve performance issues.

Tools used: HTTP Archive (HAR) files for YSlow, YSlow itself, Cuzillion, Fiddler, ShowSlow

HARs can be generated by Fiddler, PhantomJS, or YSlow.
Install YSlow locally (needs Node.js)

CI and CD at Yahoo
Crazy amounts of tests but no performance tests…

PhantomJS is a simple, repeatable way to test web page performance times, among other things.

Make performance tests a part of your CI process…
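One way to picture a CI performance gate is a comparison of the current run against a stored baseline. This is a rough sketch under stated assumptions: the metric names, the 10% tolerance, and the `perf_regressions` helper are all hypothetical, and the numbers would come from whatever tool produces your HARs or YSlow reports.

```ruby
# Hypothetical gate: returns the metrics that got worse by more than
# `tolerance` relative to the baseline. Assumes lower is better for every
# metric passed in.
def perf_regressions(baseline, current, tolerance: 0.10)
  current.select do |metric, value|
    base = baseline[metric]
    !base.nil? && value > base * (1 + tolerance)
  end.keys
end

baseline = { "onload_ms" => 1800, "requests" => 42, "page_kb" => 650 }
current  = { "onload_ms" => 2300, "requests" => 43, "page_kb" => 655 }

slower = perf_regressions(baseline, current)
puts "Regressed: #{slower.join(', ')}" unless slower.empty?
# Regressed: onload_ms
```

A CI job could exit non-zero when the list is not empty, which fails the build the same way a broken unit test would.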

Next up: instead of just having perf tests in your CI process, graduate to a new level by measuring custom metrics on each performance run…

Peregrine is a tool used at Twitter, based on WebPagetest.
Peregrine takes code, deploys it to performance boxes, and integrates with WebPagetest to run perf tests.

Peregrine will likely be open sourced soon.

Filed under DevOps

Operations Level Up Storify Notes from @wickett

These are some notes from the Operations Level Up talk at the Velocity 2013 Conference. The Agile Admin crew is out at Velocity Conference this year and live-blogging as we go.

by | June 18, 2013 · 12:47 pm

Velocity 2013 Day 1 Liveblog – Monitoring and Observability

I’m in San Jose, California for this year’s Velocity Conference! James, Karthik and I flew in on the same flight last night.  I gave them a ride in my sweet rental minivan – a quick In-n-Out run, then to the hotel where we ended up drinking and chatting with Gene Kim, James Turnbull, Marcus, Rembetsy, and some other Etsyers, and even someone from our client Nordstrom’s.

Check out our coverage of previous Velocity events – Peco and I have been to every single one.

I always take notes but then don’t have time to go back and clean them up and post them all – so this time I’m just going to liveblog and you get what you get!

Theo Schlossnagle of OmniTI, getting back to his roots by rocking a psycho hillbilly hairstyle, kicked off the first workshop of the day on Monitoring and Observability. The slides are on Slideshare.

Theo Schlossnagle

The talk starts with a bunch of basic term definitions.

  • Observability is about measuring “things” or state changes without altering them too much while you observe.
  • A measurement is a single value from a point in time that you can perform operations upon.

“JSON makes all this worse, being the worst encoding format ever.” JSON lets you describe for example arbitrarily large numbers but the implementations that read/write it are inconsistent.

  • A metric is the thing you are measuring.  Version, cost, # executed, # bugs, whatever.

Basic engineering rule – Never store the “rate” of something.  Collect a measurement/timestamp for a given metric and calculate a rate over time.  Direct measurement of rates generates data loss and ignorance.
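A small sketch of what that rule buys you: store timestamped readings of a cumulative counter and derive rates afterwards, over any window you like. The sample data and the `rates_per_second` helper are illustrative, and the sketch ignores counter resets.

```ruby
# Each sample is [unix_timestamp_seconds, cumulative_count].
# Returns [timestamp, rate] pairs, one per interval between samples.
def rates_per_second(samples)
  samples.each_cons(2).map do |(t0, v0), (t1, v1)|
    [t1, (v1 - v0).to_f / (t1 - t0)]
  end
end

samples = [[0, 100], [10, 150], [20, 150], [40, 350]]
p rates_per_second(samples)
# [[10, 5.0], [20, 0.0], [40, 10.0]]

# The raw samples still answer questions nobody asked at collection time,
# such as the average rate over the whole 40 seconds:
first, last = samples.first, samples.last
puts (last[1] - first[1]).to_f / (last[0] - first[0]) # 6.25
```

Had only the three per-interval rates been stored, the uneven sampling intervals would make that overall figure impossible to recover exactly.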

  • Measurement velocity is the rate at which new measurements are taken
  • Perspective is from where you’re taking the measurement
  • Trending means understanding the direction/pattern of your measurements on a metric
  • Alerting, durr
  • Anomaly detection is determining that a specific measurement is not within reason

All this is monitoring.  Management is different, we’re just going to talk about observation. Most people suck at monitoring and monitor the wrong things and miss the real important things.

Prefer high level telemetry of business KPIs, team KPIs, staff KPIs… Not to say don’t measure the CPU, but it’s more important to measure “was I at work on time?” not “what’s my engine tach?” That’s not “someone else’s job.”

He wrote reconnoiter (open source) and runs Circonus (service) to try to fix deficiencies.

“Push vs pull” is a dumb question, both have their uses. [Ed. In monitoring, most “X vs Y” debates are stupid because both give you different valid intel.]

Why pull?

  • Synthesized observations desirable (e.g. “URL monitor”)
  • Observable activity infrequent
  • Alterations in observation/frequency are useful

Why push?

  • Direct observation is desirable
  • Discrete observed actions are useful (e.g. real user monitoring)
  • Discrete observed actions are frequent

“Polling doesn’t scale” – false. This is the age where Google scrapes every Web site in the world, you can poll 10,000 servers from a small VM just fine.

So many protocols to use…

  • SNMP can push (trap) and pull (query)
  • collectd v4/v5 push only
  • statsd push only
  • JMX, etc etc etc.

Do it RESTy. Use JSON now.  XML is better but now people stop listening to you when you say “XML” – they may be dolts but I got tired of swimming upstream. PUT/POST for push and GET for pull.

nad – Node Agent Daemon, new open source widget Theo wrote, use this if you’re trying to escape from the SNMP hellhole.  Runs scripts, hands back in JSON. Can push or pull. Does SSL. Tiny.

But that’s not methodology, it’s technology. Just wanted to get “but how?” out of the way. The more interesting question is “what should I be monitoring?”  You should ask yourself this before, during, and after implementing your software. If you could only monitor one thing, what would it be?  Hint: probably not CPU. Sure, “monitor all the things” but you need to understand what your company does and what you really need to watch.

So let’s take an example of an ecomm site.  You could monitor if customers can buy stuff from your site (probably synthetic) or if they are buying stuff from your site (probably RUM). No one right answer, has to do with velocity.  1 sale/day for $600k per order – synthetic, want to know capability. 10 sales/minute with smooth trends – RUM, want to know velocity.

We have this whole new field of “data science” because most of us don’t do math well.

Tenet: Always synthesize, and additionally observe real data when possible.

Synthesizing a GET with curl gets you all kinds of stuff – code, timings (first byte, full…), SSL info, etc.

You can curl but you could also use a browser – so try phantomjs. It’s more representative, you see things that block users that curl doesn’t interpret.

Demo of nad to phantomjs running a local check with start and end of load timings.

Passive… Google Analytics, Omniture.  Statsd and Metrics are a mediocre approach here. But if you have lots of observable data, a single summary like the average of N over the last X time is not useful. NO RATES I TOLD YOU DON’T MAKE ME STICK YOU! At least add stddev, cardinality, min/max/95th/99th… But these things don’t follow normal distributions so e.g. stddev is deceptive.  If you take 60k API hits and boil it down to 8 metrics you lose a lot.

How do you get more richness out of that data? We use statsd to store all the data and show histograms. Oh look, it’s a 3-mode distribution, who knew.

A heat map of histograms doesn’t take any more space than a line graph of averages and is a billion times more useful.  Can use some tools, or build in R.
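A tiny example shows why the histogram beats the average. The latency numbers are made up and have two modes rather than three, and `histogram` is an illustrative helper rather than any particular tool's API.

```ruby
# Buckets values into fixed-width bins and returns [bin_start, count] pairs.
def histogram(values, bin_width:)
  values.group_by { |v| (v / bin_width).floor * bin_width }
        .sort
        .map { |bin_start, members| [bin_start, members.size] }
end

def mean(values)
  values.sum.to_f / values.size
end

latencies_ms = [12, 14, 15, 13, 11, 210, 205, 203, 14, 202]

puts mean(latencies_ms)                 # 89.9
p histogram(latencies_ms, bin_width: 50) # [[0, 6], [200, 4]]
```

The mean of 89.9 ms describes a request that never happens: every request is either fast (under 50 ms) or slow (over 200 ms), which only the histogram reveals. Stacking one such histogram per time slice gives the heat map.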

Now we’ll talk about DTrace… Stop having to “wonder” if X is true about your software in production right now. “Is that queue backed up? Is my btree imbalanced?” Instrument your software. It’s easy with DTrace and only a bit more work otherwise.

Use case – they wrote a db called Sauna that’s a metrics db. They can just hit and get a big JSON telemetry exposure with all the current info, rollups, etc.

Monitoring everything is good but make sure you get the good stuff first and then don’t alert on things without specific remediation requirements.

Collect once and then split streams – if you collect and alert in Zabbix but graph in graphite it’s just confusing and crappy.

Tenet: Never make an alert without a failure condition in plain English, the business impact of the failure condition, a concise and repeatable remediation procedure, and an escalation path. That doesn’t have to all be “in the alert” but linking to a wiki or whatever is good.
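That tenet lends itself to a mechanical check. As a sketch only, with invented field names and an invented example alert, a lint step could refuse any alert definition that leaves one of the four parts blank:

```ruby
# Hypothetical lint: returns the required parts an alert definition is missing.
def missing_alert_fields(alert)
  required = %i[failure_condition business_impact remediation escalation_path]
  required.select { |field| alert[field].to_s.strip.empty? }
end

disk_alert = {
  failure_condition: "Order database volume is more than 90% full",
  business_impact: "Checkout stops accepting orders when the volume fills",
  remediation: "",
  escalation_path: "On-call DBA, then the platform lead"
}

p missing_alert_fields(disk_alert) # [:remediation]
```

The values could just as well be links to a wiki page, since the tenet only asks that the information exists and is reachable from the alert.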

How to get there? Do alerting postmortems. Understand why it alerted, what was done to fix, bring in stakeholders, have the stakeholder speak to the business impact. [Ed. We have super awful alerting right now and this is a good playbook to get started!]

Q: How do you handle alerts/oncall?  Well, the person oncall is on call during the day too, so they handle 24×7. [Ed. We do that too…]

Q: How does your monitoring system identify the root cause of an issue?  That’s BS, it can’t without AI.  Human mind is required for causation.  A monitoring system can show you highly correlated behavior to guide that determination. Statistical data around a window.

Q: How to set thresholds?  We use lots. Some stock, some Holt-Winters, starting into some Markov… Humans train it on which algorithms are “less crappy.”

Q: Metrics db? We use a commercial one called Snowth that is cool, but others use cassandra successfully.

Q: How much system performance compromise is OK to get the data? I hate sampling because you lose stuff, and dropping 12 bytes into UDP never hurt anyone… Log to the network, transmit everything, then decide later how to store/sample.

Don’t forget to check out his conference, SURGE.

Filed under Conferences, DevOps

Welcome to Velocity 2013!

All the agile admins – James (@wickett), Peco (@bproverb), and Ernest (@ernestmueller) are reunited at the Web performance and operations conference Velocity again this year!  And with us we have newer cool guys, dev extraordinaire Karthik (@iteration1) from Mentor Graphics and operations muscleman Bryan (@bryguy1211) from Bazaarvoice, and some of our old NI colleagues, Eric and Matt.  Then some of us are staying over for DevOpsDays Mountain View.

So buckle in and experience one of the handiest Web/DevOps conferences by proxy! We’re encouraging everyone to liveblog along here on the agile admin.  I always take notes but always run out of time to prettify and post them after so I’m trying liveblogging in hopes of staying caught up. Comment if you are getting value out of it to encourage us to keep it up!

I hear the Velocity marketing stuff is using one of my quotes from the blog, which is cool; it credits the old defunct webadminblog.com, but we’ve moved here now!

Filed under Conferences, DevOps

DevOpsDays Austin 2013 Is Upon Us

Well, I hope you got a ticket for next week’s event because we’re sold out, sponsor slots sold out, all filled up with volunteers, the train has left the station.  It’s going to be a sweet ride.  Check out the program – Patrick Debois, John Willis, Gene Kim, Nick Galbreath, and many more will be speaking.

We have a sweet venue, the Marchesa; and on the first evening, Tuesday, you should free up your night.  We’ve got a happy hour from Dell, the band Lord Buffalo is playing, and then the Austin Film Society will be doing a private screening of Office Space for us! Then at 10 if you’re still rarin’ to go we can hook you up. We’re providing breakfast and lunch both days; expect breakfast tacos, barbecue, all the Texas standbys.

Of the agile admins, James and I have been working (along with the many other great volunteers who put many hours into putting together the event) to make it a fun and informative time for everyone who’s signed up!  Come early, leave late!

Filed under Conferences, DevOps

Chef your haproxy load balancer and add encryption

As of last September, HAProxy supports SSL, so you no longer have to put stud/stunnel/nginx in front of HAProxy. It can also connect to SSL on backend servers, so you can have encrypted traffic the whole way to the app server.  Most people decrypt on the load balancer and then pass traffic to their app servers unencrypted, but I am not a big fan of that architecture.  This post shows you how to set up HAProxy with Chef, with SSL all the way to the app servers. Big thanks to @jtimberman for his post on encrypted data bags, which helped me figure this out.

Set up your Chef encrypted data bag to store your SSL cert

The first step is to create a secret key for your data bag to use. This will be used to encrypt your data bag and later by chef nodes to decrypt the data bag so that they can read from the data bag. Do not store the encrypted_data_bag_secret in source control as-is. Instead, you can put this into a keepass database and then store that in source control if you want to.

openssl rand -base64 512 > ~/.chef/encrypted_data_bag_secret

Next you have to create the data bag, which we have aptly called secrets:

knife data bag create secrets

Now we can store our wildcard cert in the secrets data bag. This will open an editor where you can copy and paste your cert and key. The item goes to the Chef server and not onto local disk. I set three fields: id, cert, and key.

knife data bag create secrets wildcard --secret-file ~/.chef/encrypted_data_bag_secret

The last step uploaded your wildcard cert to the chef server and encrypted it.   The next step allows us to save off the json export of our encrypted wildcard cert which we can check into source control and version.  Later if we get in a bind we can tell chef to import the databag using this json export.

mkdir data_bags/secrets
knife data bag show secrets wildcard -Fj > data_bags/secrets/wildcard.json

This next step is to just do a sanity check to make sure the databag export looks good.  It should look like this:

cat data_bags/secrets/wildcard.json
{
  "id": "wildcard",
  "cert": "encrypted string here",
  "key": "encrypted string here"
}
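If you would rather not eyeball the export, a few lines of plain Ruby can do the same sanity check. This is only a sketch: `missing_databag_fields` is a made-up helper, it checks that the three fields exist and are non-empty, and it does not verify that the values are actually encrypted.

```ruby
require "json"

# Hypothetical check: returns the expected fields that are absent or blank
# in a data bag item export.
def missing_databag_fields(json_text, required: %w[id cert key])
  item = JSON.parse(json_text)
  required.select { |field| item[field].to_s.strip.empty? }
end

export = '{ "id": "wildcard", "cert": "encrypted string here", "key": "" }'
p missing_databag_fields(export) # ["key"]
```

Against the real file you would pass in `File.read("data_bags/secrets/wildcard.json")` instead of the inline string.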

Create your own wrapper cookbook

Now in your chef cookbook you can access this wildcard cert. The next step requires you to write your own wrapper cookbook which doesn’t do very much other than set default attributes, pull the wildcard cert from the databag, write it to a file and then call the haproxy cookbook to do the install.  (My cookbook for this is in a private github repo because we do some custom steps and set some settings that don’t apply to everyone, but if you create a new cookbook and follow these steps, you should be set.)

Create a cookbook:

knife cookbook create my-loadbalancer

Next, change the recipes/default.rb to look like this:

# Pull the certs from the encrypted data bag
wildcard = Chef::EncryptedDataBagItem.load("secrets", "wildcard")
my_cert = wildcard['cert'].chomp # you may not need this chomp, but I did
my_key = wildcard['key'].chomp # you may not need this chomp, but I did

# Feed the cert and key into the Chef template
template "/etc/ssl/private/haproxy.pem" do
  source "haproxy_pem.erb"
  owner "root"
  group "root"
  mode 0400
  variables(:wildcard_key => my_key,
            :wildcard_crt => my_cert)
end

# Install haproxy; we are using a forked version of the haproxy cookbook to install 1.5-17 from source and add SSL
include_recipe "haproxy::app_lb"

Add this template to your cookbook. The template we used for this file haproxy.pem is pretty basic. Here are the contents of templates/default/haproxy_pem.erb

<%= @wildcard_crt %>
<%= @wildcard_key %>

The line that calls include_recipe "haproxy::app_lb" is actually installing our forked version of the haproxy cookbook, which adds the line below to the file templates/default/haproxy-app_lb.cfg.erb to set up the SSL binding.

bind 0.0.0.0: ssl crt /etc/ssl/private/haproxy.pem

You can check out our fork for the chef-haproxy cookbook to see how we install from source, what default attributes you can set and how we have our haproxy.cfg template using the ssl certs.

To recap, we uploaded our cert to an encrypted databag, added a recipe to pull that out and put it in a file (haproxy.pem) and we changed the haproxy cookbook to use that file to handle ssl certs. Hope this helps and if you run into any problems let me know.

Filed under DevOps, Security

Must Read: The Phoenix Project

Have you read the famous systems-management novel The Goal?  No, I know you haven’t, don’t feel bad, I only got to it this year myself.

Well, Gene Kim, entrepreneur, consultant, founder of Tripwire, and general insatiable Tweeter, has written a sequel of sorts with his Visible Ops coauthors Kevin Behr and George Spafford.  Bearing the tongue-twisting title The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win, it’s like The Goal in that it’s written as a novel about the people in a large IT shop and how they’re faced with all the usual soul-crushing BS that we all get faced with, but use Lean and DevOps and pluck and courage to overcome it.

I got to be a pre-reader on large chunks of the book and I like it; it definitely has characters and situations directly torn from your IT department. (Man, the security guy… Just like every security guy…) And I’ve seen these techniques work, so I know it’s not just a wish-fulfillment novel.

If you’re wondering how DevOps can help you because “you live in the real world, man,” this is a good read that’ll give you some ideas along those lines! I’ve seen Gene speak everywhere from AppSec USA to South by Southwest Interactive to DevOpsDays to Velocity… Go read the testimonials from everyone from Cockcroft to Humble to even yours truly and then buy the book!

Filed under DevOps, Security

Awesome Austin Events!

Besides the “big one,” DevOpsDays Austin 2013, there’s a bunch of great events going on in Austin for techies.

The Agile Austin DevOps SIG meets every last Wednesday over lunch at Bazaarvoice; lunch is provided.  This month’s meeting on January 30 is Breaking the Barriers.

The Austin Cloud User Group meets every third Tuesday in the evening at Pervasive; dinner is provided. This month’s meeting is January 15, sponsored by Canonical, and there is a talk on Openstack Quantum, the network virtualization platform.

South by Southwest Interactive is here of course on March 8-12.

Hip security event BSides Austin is on March 21-22.

Data Day Austin is on January 29.

Texas Linux Fest will be May 31 – June 1.

It’s never been a better time to be a techie in Austin!

Filed under Cloud, Conferences, DevOps

DevOpsDays Austin 2013 Is Coming!

It’s been quiet here on the blog but we’ve been busy… And one of those items of busy-ness is setting up DevOpsDays Austin 2013!

Many of you came to DevOpsDays Austin 2012, the largest super-awesome Austin DevOps event ever!  Well, it’s even bigger this year.  Registration is open for DevOpsDays Austin 2013! Sign up quick to be assured a spot.

It’ll be April 30th and May 1st at the Marchesa in the middle of Austin. Because we had to pay for a venue this time to let more people attend, and to prevent no-shows from taking up the limited slots, we’ve instituted a $120 early bird fee for the event, which covers the food/venue/shirts for both days.

Proposals are open too, as are opportunities for companies to sponsor – the more sponsorships, the more cool activities and goodies for everyone involved!

I had a blast at DevOpsDays Austin in 2012 and this year stands to be huge.  Come on out and share expert tips with other elite DevOps practitioners from around the world! Patrick Debois and a growing list of other respected DevOps ninjas will be in attendance.

Email organizers-austin-2013@devopsdays.org with questions!

Filed under Conferences, DevOps