Tag Archives: velocityconf

Velocity 2013 Day 3 Liveblog: The Keynotes

Day Three of a week of convention. Convention Humpday.  The day it stops being a mini-vacation and you start earning your salary again.

As usual the keynotes are being livestreamed, so this liveblog is perhaps more for those who want the Cliff’s Notes summary later.  Yesterday’s keynotes were certainly compressible, so here we go! Also follow along on Twitter at hashtag #velocityconf to see what people are saying about the show.

Clarification on Keynote’s RUM – they announced it yesterday but I was like “haven’t they been trying to sell this to me for two years?” Apparently they were but it was in beta. And congrats to fast.ly who just got a $10M round of funding!

Winners of the survey follies…  Souders’ book, Release It!/Web Operations, and so on. Favorite co-host: Souders!

“Send us your comments!  Except about the wi-fi!” Actually it’s working OK so far this morning. Show is good, though they added yet another ‘vendor track’ which is unfortunate.  They have a front end dev track, an ops track, and a mobile track.  Last year they added a fourth track – “vendor talks.”  This year there is a fifth track – “more vendor talks.” Boo, let’s make space for real content.

Gamedays On The Obama Campaign

@dylanr on revamping the Obama site 18 mos. before election day.  40 engineers in 7 teams, ~300 repos, 200 deployed products, 3000 servers (AWS), millions of hits a day, million volunteers, 8000 staff. He had redone threadless’ site and was on to the next big thing!

Plan, build, execute, get out the vote.

Planning is the not-fast dreamtime. But for the tech folks, it means start building the blocks.

Build is when everyone starts building teams and soliciting $. Then tech builds the apps.

Execute is when everyone starts using it all, more and more of everything. Tech starts getting feedback (been building blind till now).

Get out the vote – final 4-day sprint. For tech, this means scale. A couple orders of magnitude in that span.

Funny picture of a “Don’t Fuck This Up” cake.  [Ed: That was my second standing order for my old WebOps team.  1. Make It Happen, 2. Don’t Fuck Up, 3. There’s the right way, the wrong way, and the standard way.]

They got one shot at this. So how do you do it?

Talk to your stakeholders, but they want every feature ever.  But working is better.  No feature is better than having a working app.  So frame the conversation as “if things fail what still needs to work?” Graceful degradation.

Failure – you can try to make it not fail, and learn to deal with failure.  You should do some of the former but not delude yourself into not doing the latter.  But not just the tech – make the team resilient to failure via practice.

“Game day” 6 weeks pre-election.  Prod-size staging, simulate, break and react. Two week hardening sprint and then on game day had a long agenda of “things to break.”  He lied about the order and timing though.

Devops (aggressors) vs engineers (defenders), organized in campfire, maintaining updated google doc

Learned that there were broken things we thought were fixed, learned what failure really looks like, how things fail, how to fix it

Made runbooks

They had a stressful day, went home, came back in – and databases started failing! AWS failure.  Utilized the runbooks and were good.

Build resilient apps by planning for failure, defining what matters, making your plan clear to stakeholders, and fail to the things that matter. And resilient teams – practice failing and learn from it, use your instruction manual.

Ops School

Check it!  Etsy and O’Reilly video classes on  how to be an ops engineer! Oh. They’ll be for sale, I got excited for a minute thinking it would be a free way to get a whole generation of ops engineers trained from schools etc. that know nothing about ops.  Guess not.  Damn.

W3C Web Performance Working Group Status

Arvind from Google gives us an update on the W3C. The web perf working group is working on navigation timing, user timing, page visibility, resource priorities, client side error logging, and other fundamental standards that will help with web performance measurement and improvement.  Good stuff if very boringly presented, but that’s standards groups for you.

Eliminating Web Site Performance Theft

Neustar tells us the world is online and brand reputation and revenue are at stake.  Quite.

Performance can affect your reputation and revenue!  Quite.

This talk is a great one for vaguely befuddled director level and above types, not the experts at this conference.  Twitter agrees.

Mobitest, the Latest

Akamai has cell nodes and devices for Mobitest, you can become a webpagetest node on your phone!  If you have an unlimited data plan 😛 The app is waiting on release in the App Store. See mobitest.akamai.com for more.

If You Don’t Understand People, You Don’t Understand Ops

Go to techleadershipnews.com and get Kate’s newsletter!

Influence – Gangster style!

How do you earn respect and influence without authority? Even if you’re not a “manager” you need to be able to do this to get things done.

You want people to hear what you have to say – need 3 things.

  • Accountability
  • Everyone is your ally
  • Reciprocity

Accountability – lead by example. Be the person who can get “it” done.  Always follow through on commitments. Generates the graph of trust. Treat everyone with respect. Be a reliable person. Be a superstar – always be hustling.

Everyone is an ally – make them your friend. It’s a small world. Who was nice to you? Who made you feel bad?  How about in return? Make every interaction positive.

Reciprocity – all about giving. You get what you give. What is your currency?  What do you have of value to others and how can you share it? How can you improve the lives of other people?

Success is about people. Influence is success.  Yay Kate!

Lightning Demos

httparchive and BigQuery

@igrigorik from Google about httparchive.org which crawls the Web twice a month and keeps stats. It’s all freely available for download – a subset is shown online at the site.

Google built Dremel, which is an interactive ad hoc query system for analysis of read only nested data.  So they put BigQuery + http archive together! Go to bigquery.cloud.google.com and it’s in there! Most common JS framework (jquery btw)? Median page load times? You can query it all.

In Google Docs you can script and make them interoperate (like send an email when this spreadsheet gets filled in).  Created dynamic query straight to bigquery. Oh look, dynamic graph! bit.ly/ha-query for more!

Patrick Meenan on Webpagetest

You can now do a tcpdump with your test! (Advanced tab). He shows an analysis with cloudshark – wireshark in the cloud! Nice.

Patrick Lightbody from New Relic

Real user monitoring is cool. newrelic.com/platform

Steve Reilly from Riverbed/Aptimize

An application aware infrastructure? We have abstractions for some layers – middleware, compute, storage – but not really for transport.  Software defined networking will be the next “washing” trend. It’s just a transport abstraction. Then we can make the infrastructure a function of the application. “Middle boxes” are now app features – GSLB, WAFs, etc.

Slightly confusing at this point – a lot of abstract words and not enough concrete.  Which is better than a thinly disguised product pitch, so still better than yesterday!

Decentralized decisionmaking… Location is no longer a constraint but a feature.  This makes me think of Facebook’s talk yesterday with Sonar and rewriting DNS/GTM/LB.

Jonathan LeBlanc from Paypal on API Design

Started with SOAP/XML SOA. But then the enlightenment happened and REST made your life less sucky and devs more efficient.

“Sure we support REST!  You can GET and POST!” Boo. And also religious REST principle following, instead of innovation.

Our lessons learned: Lower perceived latency, use HTTP properly, build in automation, offload complexity.

With no details this was very low value.



Filed under Conferences

Velocity 2013 Day 2 Liveblog: mobile performance and engagement

Guilin Chen from Facebook is the presenter…

UX is important and mobile users are less tolerant than desktop users.

What does performance mean for the Facebook study
– page load times
– scroll performance (how smooth)
– prefetch delay (infinite scrolling)

– page load times showed a strong correlation between slowness and user drop off.
– consistent scrolling experience matters more; slower scrolling is better than jittery scrolling.
– prefetch delay studies weren’t as conclusive, and thus didn’t matter as much.


Filed under Conferences, DevOps

Velocity 2013 Day 2 Liveblog – Application Resilience Engineering and Operations at Netflix

Application Resilience Engineering and Operations at Netflix  by Ben Christensen (@benjchristensen)

Netflix and resilience.  We have all this infrastructure failover stuff, but once you get to the application each one has dozens of dependencies that can take them down.

Needed speed of iteration, to provide client libraries (just saying “here’s my REST service” isn’t good enough), and a mixed technical environment.

They like the Bulkheading pattern (read Michael Nygard’s Release it! to find out what that is). Want the app to degrade gracefully if one of its dozen dependencies fails. So they wrote Hystrix.

1. Use a tryable semaphore in front of every library they talk to. Use it to shed load. (circuit breaker)

2. Replace that with a thread pool, which adds the benefit of thread isolation and timeouts.

A request gets created and goes through the circuit breaker, runs, then health gets fed back into the front. Errors all go back into the same channel.

The “HystrixCommand” class provides fail fast, fail silent (intercept, especially for optional functionality, and replace with an appropriate null), stubbed fallback (try with the limited data you have – e.g. can’t fetch the video bookmark, send you to the start instead of breaking playback), fallback via network (like to a stale cache or whatever).
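To make the fail-fast and stubbed-fallback ideas concrete, here’s a minimal Python sketch of a circuit breaker wrapped around one dependency. It is not Hystrix’s API (Hystrix is a Java library); the `CircuitBreaker` class, its thresholds, and the bookmark functions are all hypothetical names made up for illustration.

```python
import time


class CircuitBreaker:
    """Wraps calls to one dependency; trips open after repeated failures."""

    def __init__(self, failure_threshold=3, reset_after=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.reset_after = reset_after              # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def is_open(self):
        if self.opened_at is None:
            return False
        # Once the cool-down has passed, let a trial call through again.
        return self.clock() - self.opened_at < self.reset_after

    def call(self, func, fallback):
        if self.is_open():
            return fallback()            # fail fast: do not touch the dependency
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()            # stubbed fallback instead of an error
        self.failures = 0                # a success closes the circuit
        self.opened_at = None
        return result


def fetch_bookmark():
    raise TimeoutError("bookmark service is down")  # simulated outage


def start_of_video():
    return 0  # fallback: play from the beginning


breaker = CircuitBreaker(failure_threshold=2)
positions = [breaker.call(fetch_bookmark, start_of_video) for _ in range(3)]
```

The first two calls hit the failing dependency and fall back; the third never reaches it because the breaker is already open. Playback still starts either way, which is the whole point.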

Moved into operational mode.  How do you know that failures are generating fallbacks?  Line graph syncup in this case (weird). But they use instrumentation of this as part of a purty dashboard. Get lots of low latency granular metrics about bulkhead/circuit breaker activity pushed into a stream.

Here’s where I got confused.  I guess we moved on from Hystrix and are just on to “random good practices.”

Make low latency config changes across a cluster too. Push across a cluster in seconds.

Auditing via simulation (you know, the monkeys).

When deploying, deploy to canary fleet first, and “Zuul” routing layer manages it. You know canarying.  But then…

“Squeeze” testing – every deploy is burned as an AMI, we push it to perf degradation point to know rps/load it can take. Most lately, added

“Coalmine” testing – finding the “unknown unknowns” – an env on current code capturing all network traffic esp, crossing bulkheads. So like a new feature that’s feature flagged or whatever so not caught in canary and suddenly it starts getting traffic.

So when a problem happens – the failure is isolated by bulkheads and the cluster adapts by flipping its circuit breaker.

Distributed systems are complex.  Isolate relationships between them.

Auditing and operations are essential.


Filed under Conferences, DevOps

Velocity 2013 Day 2 Liveblog: a baseline for web performance with phantomjs

The talk I’m most excited about for today is next! I made sure to get here early…

@wesleyhales from apigee and @ryanbridges from CNN

Quick overview on load testing tools- firebug, charles, har viewers and whatnot; but it’s super manual.
Better- selenium, but it’s old yo and not hip anymore.
There are services out there: harstorage.com, harviewer etc that you can use too.
Webpagetest.org is pimped again, but apparently caused an internal argument in CNN.

Performance basics
– caching
– gzip: don’t gzip what’s already compressed (like jpegs)
– know when to pull from cdn’s
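The gzip point is easy to check for yourself. A quick Python sketch, assuming random bytes are a fair stand-in for already-compressed data like a JPEG (they aren’t a real image, just similarly incompressible); the sample markup is made up:

```python
import gzip
import os

text = b"<div class='story'>Breaking news</div>\n" * 500   # repetitive markup
random_like = os.urandom(len(text))                         # stand-in for JPEG bytes

text_gz = gzip.compress(text)
random_gz = gzip.compress(random_like)

text_ratio = len(text_gz) / len(text)
random_ratio = len(random_gz) / len(random_like)
print(f"text: {text_ratio:.2%} of original, random: {random_ratio:.2%} of original")
```

The markup shrinks to a small fraction of its size, while the incompressible payload comes out slightly larger than it went in, so you pay CPU for nothing.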

Ah! New term bingo! “Front end ops”- aka sucks to code something and then realize there need to be a ton of things to do to make things perform even more. Continued definition:
– keep an eye on perf
– manager of builds and dependencies (grunt etc)
– expert on delivering content from server to browser
– critical of new http requests/file sizes and load times

I’m realizing that building front ends is a lot more like building server side code….

Wes recommends having a front end performance ops position and better analytics.

A chart of CNN’s web page load times is shown.

So basically, every time CNN.com is built by bamboo, the page load time is measured, saved and analyzed. They use phantomjs for this, which became Loadreport.js.

Loadreport.wesleyhales.com is the URL for it.

Filmstrip is a cool idea that stores filmstrips of all pages loaded.
Speed reports is another visualization that was written.

Hard parts
– performance issues need more thought; figure out your baseline early
– advertisers use document.write
– server location
– browser types: DIY options are harder
– CPU activity: use a consistent environment

All in all, this has many of the same concerns when you’re doing server side performance

CI setup
– bamboo
– Jenkins
– barebones Linux without x11
– vagrant

Demo was shown that used Travis ci as the ci system.

All in all, everyone uses phantomjs for testing; check it out; look at fluxui on github for more!


Filed under Conferences, DevOps

Velocity 2013 Day 2 Liveblog: Performance Troubleshooting Methodology

Stop the Guessing: Performance Methodologies for Production Systems

Slides are on Slideshare!

Brendan Gregg, Joyent

Note to the reader – this session ruled.

He’s from dtrace but he’s talking about performance for the rest of us. Coming soon, Systems Performance: Enterprises and the Cloud book.

Performance analysis – where do I start and what do I do?  It’s like troubleshooting, it’s easy to fumble around without a playbook. “Tools” are not the answer any more than they’re the answer to “how do I fix my car?”

Guessing Methodologies and Not Guessing Methodologies (Former are bad)


Traffic light anti-method

Monitors green?  You’re fine. But of course thresholds are a coarse grained tool, and performance is complex.  “Is X bad?  Well sometimes, except when X, but then when Y, but…” False positives and false negatives abound.

You can improve it by more subjective metrics (like weather icons) – objective is errors, alerts, SLAs – facts.

see dtrace.org status dashboard blog post

So traffic light is intuitive and fast to set up but it’s misleading and causes thrash.

Average anti-method

Measure the average/mean, assume a normal-like unimodal distribution and then focus your investigation on explaining the average.

This misses multiple peaks, outliers.

Fix this by adding histograms, density plots, frequency trails, scatter plots, heat maps
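Here’s a small Python sketch of why the average misleads on a bimodal distribution. The latency numbers are invented for illustration (a fast mode and a slow mode), and the three buckets are an arbitrary choice:

```python
import statistics
from collections import Counter

# Hypothetical latencies in ms: mostly fast responses plus a slow second mode.
latencies = [10] * 80 + [12] * 10 + [500] * 10

mean = statistics.mean(latencies)


def bucket(ms):
    """Crude histogram bucket by order of magnitude."""
    if ms < 100:
        return "<100ms"
    if ms < 1000:
        return "100ms-1s"
    return ">=1s"


histogram = Counter(bucket(ms) for ms in latencies)
for name, count in histogram.items():
    print(f"{name:>9} {'#' * (count // 2)} {count}")
print(f"mean = {mean} ms")
```

The mean lands between the two modes, at a latency no request actually experienced, while even this crude histogram shows both peaks right away.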

Concentration game anti-method

Pick a metric, find another that looks like it, investigate.

Simple and can discover correlations, but it’s time consuming and mostly you get more symptoms and not the cause.

Workload characterization method

Who is causing the load, why, what, how. Target is the workload not the performance.

lets you eliminate unnecessary work. Only solves load issues though, and most things you examine won’t be a problem.

[Ed: When we did our Black Friday performance visualizer I told them “If I can’t see incoming traffic on the same screen as the latency then it’s bullshit.”]

USE method

For every resource, check utilization, saturation, errors.

util: time resource busy

sat: degree of queued extra work

Finds your bottlenecks quickly

Metrics that are hard to get become feature requests.

You can apply this methodology without knowledge of the system (he did the Apollo 11 command module as an example).

See the use method blog post for detailed commands

For cloud computing you also need the “virtual” resource limits – instance network caps. App stuff like mutex locks and thread pools.  Decompose the app environment into queueing systems.

[Ed: Everything is pools and queues…]

So go home and for your system and app environment, create a USE checklist and fill out metrics you have. You know what you have, know what you don’t have, and a checklist for troubleshooting.
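A USE checklist can be as simple as a table of resources with a note on how you’d measure each of the three things, plus a way to list the gaps. A rough Python sketch follows; the `UseEntry` class is hypothetical, and the measurement notes are only illustrative placeholders (see the USE method blog post for the real per-OS commands):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UseEntry:
    resource: str
    utilization: Optional[str] = None   # how to measure time-busy
    saturation: Optional[str] = None    # how to measure queued extra work
    errors: Optional[str] = None        # how to count error events

    def missing(self):
        """Names of the USE metrics we have no way to measure yet."""
        return [name for name in ("utilization", "saturation", "errors")
                if getattr(self, name) is None]


checklist = [
    UseEntry("cpu",
             utilization="vmstat: us + sy columns",
             saturation="vmstat: r column vs CPU count"),
    UseEntry("disk",
             utilization="iostat -x: %util column",
             saturation="iostat -x: average queue size"),
    UseEntry("thread pool"),   # app-level resource, nothing instrumented yet
]

# Gaps become the to-do list (or the feature requests).
gaps = {entry.resource: entry.missing() for entry in checklist if entry.missing()}
```

Whatever ends up in `gaps` is exactly the “know what you don’t have” part of the exercise.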

So this is bad ass and efficient, but limited to resource bottlenecks.

Thread State Analysis Method

Six states – executing, runnable, anon paging, sleeping, lock, idle

Getting this isn’t super easy, but dtrace, schedstats, delay accounting, I/O accounting, /proc

Based on where the time is leads to direct actionables.

Compare to e.g. database query time – it’s not self contained. “Time spent in X” – is it really? Or is it contention?

So this identifies, quantifies, and directs but it’s hard to measure all the states atm.

There’s many more if perf is your day job!

Stop the guessing and go with ones that pose questions and seek metrics to answer them.  P.S. use dtrace!



Filed under Conferences, DevOps

Velocity 2013 Day 2 Liveblog: CSS and gpu cheat sheet

I was headed to the CSS and gpu talk by Colt McAnlis (#perfmatters on twitter)

CSS properties and their paint times aren’t free. Depending on what properties you use, you could end up with slow rendering speeds. Box shadows and border radius strokes are the slowest (1.09ms) per render. That is pretty crazy, and I didn’t realize that it could be that slow.

We’re mostly talking about CSS optimizations that can be used by using the gpu, CPU on chrome.

Kinds of Layering controls
– load time layer promotion: some elements get their own layer by default. (Ex canvas, plugins, video, I frame)
– assign time layer promotion: (translate z, rotatex/y/z)
– animations
– stacking context and relative scrolling

– Too many layers uses additional memory; and you fill up the gpu tile cache.
– chrome prepaints tiles that are visible and not yet visible.

Side note: Colt loves ducks, and is sad about losing his hair 😦

– large images resized take forever. The resized images aren’t cached in the gpu. Think more about this for mobile devices.

– turn on show layer borders in devtools in chrome. It’ll help with translate z issues etc.
– use continuous paint mode to continuously paint the page to see

– gpu and layers helps with faster rendering
– too many layers is a bad idea
– CSS tags impact page loads and rendering


Filed under Conferences, DevOps

Velocity 2013 Day 2 Liveblog – The Keynotes

Had some tasty Afghan food last night and turned in reasonably early to prepare for the deluge today!

So, the keynotes. Steve Souders & John Allspaw kick us off as the MCs. It’s streamed live so  you should be able to watch it (this will let you know what parts to skip… Hint, everything but the Swede.)

The wireless is completely borked.  I’m having to come back to my hotel room over lunch to upload this.  Boo.

Allspaw is rocking a New York shirt.  “New York!” Very light applause, lol.  There’s now a NYC Velocity, London, and China.  Maybe it’s my own MC style talking but there’s not near enough ass jokes.

Allspaw is the philosopher of the group. First night we were here, Gene Kim and I were talking with Marcus from Etsy about him.  Gene: “He’s a philosopher!  He’s a warrior poet!”  Me: “Yep, he sure Yodas that shit up!” Drinks were involved.

Go to bit.ly/VelocityFavorites and vote for your favorite books and stuff!

They also want speaker feedback, give 5 and get a signed O’Reilly book at 6 tonight! Ok, you asked for it…

What, Where And When Is Risk In System Design?

In what turned out to be the best part of all the keynotes, Johan Bergstrom from Lund U in Sweden spoke about risk in system design (when will Amazon go down again).

Is risk from unreliable components or from complexity?  Traditional risk evaluation is about determining the likelihood of every single failure event and its impact.

It’s reliable when all the parts work according to the rules; reductionist.

The most unreliable component is the human actor – that’s what gets blamed by AWS etc for outages. Example of monetizing tech debt/risk with incremental risk of outage * cost of outage.

So what do we do to mitigate this risk?  Redundant barriers, the defense in depth or “layers of Swiss cheese.”

Or reduce variability by removing humans from the mix. Process and automation.

But what if risk is a product of non-linear interactions and relations (complexity)?

An ecosystem model, hard to completely characterize and barriers may increase interactions.

So risk as a path dependent process and as a control problem.

Path dependency – software is so complex now no one can fully understand, evaluate, or test it.

Technical debt vs normalization of deviance

Control problem.  Have boundaries of unacceptable functionality risk, workload, and finances/efficiency. You can only know when you’ve crossed the risk boundary when you’ve passed it.  The other boundaries provide pressure to a least effort/most efficient solution.

risk and safety are both products of performance variability.

So to manage risk in this sense,

  • Keep talking about risk even when things look safe
  • Invite minority opinion and doubt
  • debate boundaries
  • monitor gap between work as prescribed and performed
  • Focus on how people make the tradeoffs guaranteeing safety

Hollnagel – Safety management is not about avoiding – it is about achieving

Which is it? We ask the wrong question ha ha!

Risk is a game played between values and frames of reference.

Make your values explicit.

slides at jbsafety.se


Vik Chaudhary from Keynote for his annual sales pitch

I like Keynote and we’re a Keynote customer, but I like Keynote a little less every time I have to sit through this crap.


Alois Reitbauer on Compuware APM. “We do mobile now!” Another sales pitch.


Obama for America

Kyle Rush on the Obama for America site (dir of tech, new yorker)

Started with small simple site, load balancer to 7 web nodes and 2 payment nodes.

Added a reverse proxied payment API

Then went to Jekyll Ruby CMS and github for version control, static in S3

Added Akamai as a CDN, did other front end perf engineering

Much faster and lighter

optimize.ly for A/B testing and faster page had 14% higher conversion rate ($32M)

GTM failover to 2 regions under route 53 round robin

1101 front end deploys, 4k lines js, 240 a/b tests


Lightning demos!

Guy (@guypod)  from Akamai on Akamai IO, the Internet Observatory, check out Web-wide stats. Basically their massive Web logs as data graphs.


@ManishLachwani from Appurify on their mobile continuous integration and testing platform

Runtime HTML5 and native debugger for mobile.

100k SDK will be free.


@dougsillars from AT&T on Application Resource Optimizer (developer.att.com/ARO)

See data flow from app, suggest improvements

Takes pcap traces from mobile, grades against best practices

Nice, like ACE+YSlow for mobile.


Making the Web Faster

Arvind Jain from Google on making the Web faster.

Peak connection speeds have tripled in 5 years

Latency going down, cable 26 ms avg

js speed improvements

But, pages are getting fatter – 1.5 MB average!!!

Net YOY is desktop 5% faster, mobile 30%.

devs will keep adding in till they hit about 3s


Filed under Conferences, DevOps

Velocity 2013 Day 1 Liveblog – Hands-on Web Performance Optimization Workshop

OK we’re wrapping up the programming on Day 1 of Velocity 2013 with a Hands-on Web Performance Optimization Workshop.

Velocity started as equal parts Web front end performance stuff and operations; I was into both but my path led me more to the operations side, but now I’m trying to catch up a bit – the whole CSS/JS/etc world has grown so big it’s hard to sideline in it.  But here I am!  And naturally performance guru Steve Souders is here.  He kindly asked about Peco, who isn’t here yet but will be tomorrow.

One of the speakers is wearing a Google Glass, how cute.  It’s the only other one I’ve seen besides @victortrac’s. Oh, the guy’s from Google, that explains it.

@sergeyche (TruTV), @andydavies (Asteno), and @rick_viscomi (Google/YouTube) are our speakers.

We get to submit URLs in realtime for evaluation at man.gl/wpoworkshop!

Tool Roundup

Up comes webpagetest.org, the great Web site to test some URLs. They have a special test farm set up for us, but the abhorrent conference wireless largely prevents us from using it. “It vill disappear like pumpkin vunce it is over” – sounds great in a Russian accent.

YSlow the ever-popular browser extension is at yslow.org.

Google Pagespeed Insights is a newer option.

showslow.com trends those webpagetest metrics over time for your site.

Real Page Tests

Hmm, since at Bazaarvoice we don’t really have pages per se, we’re just embedded in our clients’ sites, not sure what to submit!  Maybe I’ll put in ni.com for old times’ sake, or a BV client. Ah, Nordstrom’s already submitted, I’ll add Yankee Candle for devious reasons of my own.

redrobin.com – 3 A’s, 3 F’s. No excuse for not turning on gzip. Shows the performance golden rule – 10% of the time is back end and 90% is front end.

“Why is my time to first byte slow?”  That’s back end, not front end, you need another tool for that.

nsa.gov – comes back all zeroes.  General laughter.

Gus Mayer – image carousel, but the first image it displays is the very last it loads.  See the filmstrip view to see how it looks over time. Takes like 6 seconds.

Always have a favicon – don’t have it 404. And especially don’t send them 40k custom 404 error pages. [Ed. I’ll be honest, we discovered we were doing that at NI many years ago.] It saves infrastructure cost to not have all those errors in there.

Use 85% lossy compression on images.  You can’t tell even on this nice Mac and it saves so much bandwidth.

sitespeed.io will crawl your whole site

speedcurve is a paid service using webpagetest.

Remember webpagetest is open source, you can load it up yourself (“How can we trust your dirty public servers!?!” says a spectator).


webpagetest has some mobile agents

httpwatch for iOS


Filed under Conferences, DevOps

Velocity 2013 Day 1 Liveblog – Using Amazon Web Services for MySQL at Scale

Next up is Using Amazon Web Services for MySQL at Scale. I missed the first bit, on RDS vs EC2, because I tried to get into Choose Your Weapon: A Survey For Different Visualizations Of Performance Data but it was packed.

AWS Scaling Options

Aside: use boto

vertical scaling – tune, add hw.

table level partitioning – smaller indexes, etc. and can drop partitions instead of deleting

functional partitioning (move apps out)

need more reads? add replicas, cache tier, tune ORM

replication lag? see above, plus multiple schemas for parallel rep (5.6/tungsten). take some stuff out of db (timestamp updates, queues, nintrans reads), pre-warm caches, relax durability

writes? above plus sharding

sharding by row range requires frequent rebalancing

hash/modulus based- better distro but harder to rebalance; prebuilt shards

lookup table based
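The three sharding schemes above can be sketched as three small routing functions. This is Python for illustration only; the shard count, range bounds, tenant names, and the choice of MD5 as the hash are all assumptions of mine, not anything from the talk.

```python
import bisect
import hashlib

NUM_SHARDS = 4


def hash_shard(user_id: int) -> int:
    """Hash/modulus: even spread, but changing NUM_SHARDS remaps most keys."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS


# Row range: exclusive upper bound of each shard's id range.
RANGE_BOUNDS = [1_000, 2_000, 3_000]   # shard 3 takes everything >= 3000


def range_shard(user_id: int) -> int:
    """Row range: trivial to reason about, but new ids pile onto the last shard."""
    return bisect.bisect_right(RANGE_BOUNDS, user_id)


# Lookup table: explicit placement, so one tenant can be moved on its own,
# at the price of keeping the table itself available and consistent.
LOOKUP = {"acme": 0, "globex": 2}


def lookup_shard(tenant: str) -> int:
    return LOOKUP[tenant]
```

With the range scheme every newly created id lands on the highest shard, which is where the frequent rebalancing comes from. With the modulus scheme, changing `NUM_SHARDS` remaps most keys, which is presumably why you’d prebuild more shards than you need up front.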


In EC2 you have regions and AZs. AZs are supposed to be “separate” but have some history of going down with each other.

A given region is about 99.2% up historically.
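For a sense of scale, 99.2% works out to roughly 70 hours of downtime a year. A back-of-the-envelope Python sketch; the two-region figure assumes the regions fail independently, which is my assumption and one that real outages may well violate:

```python
HOURS_PER_YEAR = 365 * 24   # 8760, ignoring leap years


def downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


one_region = downtime_hours(0.992)

# Assumption: two regions fail independently, so both are down together
# only with probability (1 - 0.992) squared.
both_down = (1.0 - 0.992) ** 2
two_regions = downtime_hours(1.0 - both_down)

print(f"one region: {one_region:.1f} h/year, two regions: {two_regions:.2f} h/year")
```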

RDS has multi-AZ replica failover

Pure EC2 options:

  • master/replicas – async replication. but, data drift, fragile (need rapid rebuild). MySQL MHA for failover. haproxy (see palomino blog)
  • tungsten – replaced replication and cluster manager. good stuff.
  • galera – galera/xtradb/mariadb synchronous replication


io/storage: provisioned IOPS.  Also, SSD for ephemeral power replicas

rds has better net perf, the block replication affects speed

instance types – gp, cpu op, memory op, storage op.  Tend to use memory op, EBS op.  cluster and dedicated also available.

EC2 storage – ephemeral, ephemeral SSD (superfast!), EBS slightly slower, EBS PIOPS faster/consistent/expensive/lower fail

Mitigating Failures

Local failures should not be a problem.  AZs, run books, game days, monitoring.

Regional failures – if you have good replication and fast DNS flipping…

You may do master/master but active/active is a myth.

Backups – snap frequently, put some to S3/glacier for long term. Maybe copy them out of Amazon from time to time to make your auditors happy.


Remember, you spend money every minute.  There’s some tools out there to help with this (and Netflix released Ice today to this end).


Filed under Cloud, Conferences, DevOps

Velocity 2013 Day 1 Liveblog – Bringing the Noise

Next up it’s the Etsy Crew!  A great bunch of guys.  Rembetsy is cutely nervous and proud about his guys presenting. Slides are available here!

And the topic is Bring the Noise: Making Effective Use of a Quarter Million Metrics by @abestanway and @jonlives. Anomaly detection is hard…

At Etsy we want to deploy lots – we have 250 committers, everyone has to deploy code, coder or not. Big “deploy to production” button. 30 deploys/day.

How can we control that kind of pace? Instead of fearing error, we put in the means to detect and recover quickly.

They use ganglia, graphite, and nagios – and they wrote statsd, supergrep, skyline, and oculus as well.

First line of defense – node daemon tailing log files and looking for errors using supergrep.

But not everything throws errors. 😦

So they use statsd to collect zillions of metrics and put them onto dashboards. But dashboards are manually curated “what’s important” – and if you have .25M metrics you just can’t do that.  So the dashboard approach has fallen over here.  And if no one’s watching the graph, why do you have it?

So that’s why Satan invented Nagios, to alert when you can’t look at a graph, but again it breaks down at scale.

Basically you have unknown anomalies and unknown correlations.

They have “kale,” their monitoring stack to try to solve this – skyline solves anomaly detection and oculus solves metrics correlation.


Skyline is a realtime anomaly detection system (where realtime means ~90s). They have a 10s flush on statsd and a 1 min res on ganglia so that’s still fast.

They had to do this in memory and not disk, using Redis. But how to stream them all in?  They looked around and realized that the carbon-relay on graphite could be used to fork into it by pretending it’s another backup graphite destination.

They import from ganglia too via graphite reading its RRDs. Skyline also has other listeners.

To store time-series data in Redis, minimizing I/O and memory… redis.append() is constant time.

Tried to store in JSON but that was slow (half the CPU time was decoding JSON).

Found Messagepack, a binary-based serialization protocol. Much faster.

So they keep appending, but had to have a process go through and clean up old data past the defined duration. Hence “roomba.py.” Python because of all the good stats libraries. They just keep 24 hours of operational data.
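The append-then-clean-up pattern can be sketched without Redis at all. In the Python below, a dict of bytearrays stands in for Redis string keys, and `struct` packing stands in for MessagePack; both are swapped in only so the example is self-contained. The `SeriesStore` class and the metric name are hypothetical, and this is not Skyline’s actual code.

```python
import struct

POINT = struct.Struct("!dd")   # one (timestamp, value) pair as two 8-byte doubles


class SeriesStore:
    """In-memory stand-in for Redis string keys holding packed datapoints."""

    def __init__(self):
        self._keys = {}

    def append(self, metric, timestamp, value):
        # Like a Redis APPEND: add bytes to the end, never rewrite the series.
        self._keys.setdefault(metric, bytearray()).extend(POINT.pack(timestamp, value))

    def read(self, metric):
        raw = bytes(self._keys.get(metric, b""))
        return [POINT.unpack_from(raw, offset)
                for offset in range(0, len(raw), POINT.size)]

    def trim(self, metric, cutoff):
        # The "roomba" pass: drop datapoints older than the cutoff timestamp.
        kept = [point for point in self.read(metric) if point[0] >= cutoff]
        self._keys[metric] = bytearray(b"".join(POINT.pack(*p) for p in kept))


store = SeriesStore()
for t in range(0, 100, 10):                  # ten datapoints, 10 s apart
    store.append("web.requests", float(t), t * 1.5)
store.trim("web.requests", cutoff=50.0)      # keep only the newest five
points = store.read("web.requests")
```

Writes stay cheap because they only ever add to the end; the expensive rewrite happens in the separate cleanup pass, off the hot path.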

But so what is an anomaly and how do you detect it?

Skyline uses the consensus model. [Ed. This is a common way of distinguishing sensor faults from process faults in real-world engineering.]

Using statistical process control – a metric is anomalous if its latest datapoint is over three standard deviations above its moving average.
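The three-sigma rule is only a few lines of Python. This is a sketch of that one rule, not Skyline’s code; the function name, the 30-point window, and the sample data are assumptions, and the real system takes a consensus across several algorithms.

```python
import statistics


def is_anomalous(series, window=30, sigmas=3.0):
    """True if the latest point is more than `sigmas` stddevs above the moving average."""
    history, latest = series[-(window + 1):-1], series[-1]
    if len(history) < 2:
        return False                      # not enough data to estimate spread
    mean = statistics.mean(history)
    stddev = statistics.pstdev(history)
    if stddev == 0:
        return latest > mean              # perfectly flat history: any rise stands out
    return (latest - mean) > sigmas * stddev


steady = [100, 102, 98, 101, 99] * 6      # 30 points hovering around 100
normal_tick = is_anomalous(steady + [103])
spike = is_anomalous(steady + [150])
```

Note how a single huge spike inside the window would inflate both the mean and the standard deviation, which is the “spike influence” problem mentioned below.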

Use “Grubbs’ test” and “ordinary least squares”… OK, most of the crowd is lost now. Histogram binning.

Problems – seasonality, spike influence (big spike biases average masking smaller spikes), normality (stddev is for normal distributions, and most data isn’t normal), and parameters. They are trying to further their algorithms.

OK, how about correlations?

Oculus does this.  Can we just compare the graphs? Image comparison is expensive and slow. Numerical comparison is not a hard problem.

“Euclidean Distance” is the most basic comparison of two time series. Dynamic Time Warping helps with phase shifts from time. But that’s expensive – O(n^2).
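To see why DTW helps with phase shifts, here’s a textbook version of both distances in Python. It’s a sketch of the classic quadratic algorithm, not Oculus’s implementation, and the two sample series are invented:

```python
import math


def euclidean(a, b):
    """Point-by-point distance; assumes both series have the same length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(a, b):
    """Classic dynamic time warping cost using an O(len(a) * len(b)) table."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # reuse b's point (stretch b)
                                    cost[i][j - 1],      # reuse a's point (stretch a)
                                    cost[i - 1][j - 1])  # advance both series
    return cost[-1][-1]


spike = [0, 0, 5, 0, 0, 0]
shifted = [0, 0, 0, 5, 0, 0]   # same shape, one step later

euclid_cost = euclidean(spike, shifted)
dtw_cost = dtw(spike, shifted)
```

Euclidean distance treats the shifted spike as a big difference, while DTW warps the time axis and finds the two shapes identical. The nested loop over both series is where the quadratic cost comes from.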

So how can we discard “obviously” dissimilar data?  Use a shape description alphabet – “basically flat, sharp increment,” etc.  Apply to graphs, cluster using elasticsearch, run dynamic time warping on that smaller sample size to polish it. But that’s still slow.  Luckily there’s a fast DTW variant that’s O(n).

So they do an elastic search phrase query with a high slop against the shape fingerprints.
Populate elastic search from redis using resque workers, but it makes it slow to update and search. Solved with rotating pool of elastic search servers – new index/last index. Allows you to purge the index and reindex. They cron-rotate every 2 min. Takes 25s to import, but queries take a while and you don’t want to rotate out from under it.
Sinatra frontend to query ES and render results off the live ES index.

Save collections of interesting correlations and then index those, so that later searches match against current data but also old fingerprints.

Devops is the key to us being able to do this. Abe the dev and Jon the ops guy managed to work all this out in a pretty timely manner.

Demo: Draw your query! He sketched a waveform and it found matching metrics – nice.


Filed under Conferences, DevOps