
Velocity 2013 Day 1 liveblog: Avoiding web performance regression

Avoiding web performance regression

By Marcel Duran (@marcelduran)

Works on the #web-core team at Twitter.
Check out #flight

Problem: After a new release, apps get slower sometimes…

Monitoring is a reactive solution to solve performance issues.

Tools used: HTTP Archive (HAR) for YSlow, YSlow, Cuzillion, Fiddler, ShowSlow

HARs can be generated by Fiddler, PhantomJS, and YSlow.
Install YSlow locally (needs Node.js).

CI and CD at Yahoo
Crazy amounts of tests but no performance tests…

PhantomJS is a simple, repeatable way to test web page performance times, among other things.

Make performance tests a part of your CI process…
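
As a rough illustration of what a perf gate in CI could look like (my sketch, not something shown in the talk): the function below reads a parsed HAR document and reports budget violations. The field names (log.pages[].pageTimings.onLoad, log.entries) come from the HAR format; the function name and the budget numbers are made up for the example.

```python
def check_har_budget(har, max_onload_ms=3000, max_requests=80):
    """Return a list of budget violations found in a parsed HAR document.

    The thresholds are hypothetical defaults; pick budgets that fit your pages.
    """
    log = har["log"]
    violations = []
    for page in log.get("pages", []):
        onload = page.get("pageTimings", {}).get("onLoad", -1)
        # HAR uses -1 (or omits the field) when a timing is unavailable.
        if onload is not None and onload > max_onload_ms:
            violations.append(
                f"{page.get('id')}: onLoad {onload} ms > {max_onload_ms} ms"
            )
    request_count = len(log.get("entries", []))
    if request_count > max_requests:
        violations.append(f"{request_count} requests > {max_requests}")
    return violations


# Tiny hand-written HAR fragment standing in for a real capture.
example = {
    "log": {
        "pages": [{"id": "page_1", "pageTimings": {"onLoad": 4200}}],
        "entries": [{}, {}, {}],
    }
}
for problem in check_har_budget(example):
    print("PERF REGRESSION:", problem)
```

A CI job would load the HAR with json.load and fail the build whenever the returned list is non-empty.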

Next up: instead of just having perf tests in your CI process, graduate to a new level by measuring custom metrics on each performance run…

Peregrine is a tool used at Twitter, based on WebPagetest.
Peregrine takes code and deploys to performance boxes and integrates with WebPagetest to run perf tests.

Peregrine will likely be open sourced soon.


Velocity 2013 Day 1 Liveblog – Monitoring and Observability

I’m in San Jose, California for this year’s Velocity Conference! James, Karthik and I flew in on the same flight last night.  I gave them a ride in my sweet rental minivan – a quick In-n-Out run, then to the hotel where we ended up drinking and chatting with Gene Kim, James Turnbull, Marcus, Rembetsy, and some other Etsyers, and even someone from our client Nordstrom’s.

Check out our coverage of previous Velocity events – Peco and I have been to every single one.

I always take notes but then don’t have time to go back and clean them up and post them all – so this time I’m just going to liveblog and you get what you get!

Theo Schlossnagle of OmniTI, getting back to his roots by rocking a psycho hillbilly hairstyle, kicked off the first workshop of the day on Monitoring and Observability. The slides are on Slideshare.

Theo Schlossnagle

The talk starts with a bunch of basic term definitions.

  • Observability is about measuring “things” or state changes while not altering them too much by observing them.
  • A measurement is a single value from a point in time that you can perform operations upon.

“JSON makes all this worse, being the worst encoding format ever.” JSON lets you describe for example arbitrarily large numbers but the implementations that read/write it are inconsistent.

  • A metric is the thing you are measuring.  Version, cost, # executed, # bugs, whatever.

Basic engineering rule – Never store the “rate” of something.  Collect a measurement/timestamp for a given metric and calculate a rate over time.  Direct measurement of rates generates data loss and ignorance.
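
To make the “no rates” rule concrete, here is a minimal sketch (mine, not from the slides) that keeps raw (timestamp, counter) measurements and derives the rate afterwards. The sample data and the function name are invented for the example.

```python
def rate_per_second(samples):
    """samples: list of (timestamp_seconds, counter_value), oldest first.

    Returns (timestamp, rate) for each consecutive pair of measurements.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        rates.append((t1, (v1 - v0) / dt))
    return rates


# Raw measurements of a monotonically increasing request counter.
samples = [(0, 100), (10, 150), (20, 150), (30, 450)]
print(rate_per_second(samples))  # [(10, 5.0), (20, 0.0), (30, 30.0)]
```

Because the raw counter is what gets stored, you can recompute the rate over any window later; if only the rate had been stored, the quiet interval and the burst could not be reconstructed.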

  • Measurement velocity is the rate at which new measurements are taken
  • Perspective is from where you’re taking the measurement
  • Trending means understanding the direction/pattern of your measurements on a metric
  • Alerting, durr
  • Anomaly detection is determining that a specific measurement is not within reason

All this is monitoring.  Management is different, we’re just going to talk about observation. Most people suck at monitoring and monitor the wrong things and miss the real important things.

Prefer high level telemetry of business KPIs, team KPIs, staff KPIs… Not to say don’t measure the CPU, but it’s more important to measure “was I at work on time?” not “what’s my engine tach?” That’s not “someone else’s job.”

He wrote reconnoiter (open source) and runs Circonus (service) to try to fix deficiencies.

“Push vs pull” is a dumb question, both have their uses. [Ed. In monitoring, most “X vs Y” debates are stupid because both give you different valid intel.]

Why pull?

  • Synthesized observations desirable (e.g. “URL monitor”)
  • Observable activity infrequent
  • Alterations in observation/frequency are useful

Why push?

  • Direct observation is desirable
  • Discrete observed actions are useful (e.g. real user monitoring)
  • Discrete observed actions are frequent

“Polling doesn’t scale” – false. This is the age where Google scrapes every Web site in the world, you can poll 10,000 servers from a small VM just fine.

So many protocols to use…

  • SNMP can push (trap) and pull (query)
  • collectd v5/v5 push only
  • statsd push only
  • JMX, etc etc etc.

Do it RESTy. Use JSON now.  XML is better but now people stop listening to you when you say “XML” – they may be dolts but I got tired of swimming upstream. PUT/POST for push and GET for pull.

nad – Node Agent Daemon, new open source widget Theo wrote, use this if you’re trying to escape from the SNMP hellhole.  Runs scripts, hands back results in JSON. Can push or pull. Does SSL. Tiny.

But that’s not methodology, it’s technology. Just wanted to get “but how?” out of the way. The more interesting question is “what should I be monitoring?”  You should ask yourself this before, during, and after implementing your software. If you could only monitor one thing, what would it be?  Hint: probably not CPU. Sure, “monitor all the things” but you need to understand what your company does and what you really need to watch.

So let’s take an example of an ecomm site.  You could monitor if customers can buy stuff from your site (probably synthetic) or if they are buying stuff from your site (probably RUM). No one right answer, has to do with velocity.  1 sale/day for $600k per order – synthetic, want to know capability. 10 sales/minute with smooth trends – RUM, want to know velocity.

We have this whole new field of “data science” because most of us don’t do math well.

Tenet: Always synthesize, and additionally observe real data when possible.

Synthesizing a GET with curl gets you all kinds of stuff – code, timings (first byte, full…), SSL info, etc.

You can curl but you could also use a browser – so try phantomjs. It’s more representative, you see things that block users that curl doesn’t interpret.

Demo of nad to phantomjs running a local check with start and end of load timings.

Passive… Google Analytics, Omniture.  Statsd and Metrics are a mediocre approach here. If you have lots of observable data, a summary like the average of N over the last X time is not useful. NO RATES I TOLD YOU DON’T MAKE ME STICK YOU! At least add stddev, cardinality, min/max/95th/99th… But these things don’t follow standard distributions, so e.g. stddev is deceptive.  If you take 60k API hits and boil them down to 8 metrics you lose a lot.

How do you get more richness out of that data? We use statsd to store all the data and show histograms. Oh look, it’s a 3-mode distribution, who knew.

A heat map of histograms doesn’t take any more space than a line graph of averages and is a billion times more useful.  Can use some tools, or build in R.
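
A small sketch of why the histogram beats the average, using made-up latencies with three clusters (the data, bucket width, and function name are all my own assumptions, not numbers from the talk):

```python
import statistics
from collections import Counter


def histogram(values, bucket_width):
    """Count values per bucket; each key is the bucket's lower bound."""
    counts = Counter((v // bucket_width) * bucket_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical latencies in ms: a fast mode, a medium mode, and a slow mode.
latencies = [10] * 50 + [100] * 30 + [900] * 20

print("mean:", statistics.mean(latencies))        # a value no request actually had
print("stddev:", statistics.pstdev(latencies))    # large, but says nothing about shape
print("histogram:", histogram(latencies, 100))    # the three modes are visible
```

One histogram per time slice, colored by count, is exactly what a heat map plots.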

Now we’ll talk about dtrace… Stop having to “wonder” if X is true about your software in production right now. “Is that queue backed up? Is my btree imbalanced?” Instrument your software. It’s easy with DTrace but only a bit more work otherwise.

Use case – they wrote a db called Sauna that’s a metrics db. They can just hit and get a big JSON telemetry exposure with all the current info, rollups, etc.

Monitoring everything is good but make sure you get the good stuff first and then don’t alert on things without specific remediation requirements.

Collect once and then split streams – if you collect and alert in Zabbix but graph in graphite it’s just confusing and crappy.

Tenet: Never make an alert without a failure condition in plain English, the business impact of the failure condition, a concise and repeatable remediation procedure, and an escalation path. That doesn’t have to all be “in the alert” but linking to a wiki or whatever is good.
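
One way to enforce that tenet is to make an alert impossible to define without all four pieces. This is only a sketch of the idea; the class and field names are hypothetical and not from any particular monitoring tool.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AlertDefinition:
    name: str
    failure_condition: str   # plain English description of what is broken
    business_impact: str     # what the failure costs the business
    remediation: str         # concise, repeatable procedure (or a wiki link)
    escalation_path: str     # who gets called if remediation fails

    def __post_init__(self):
        # Reject alerts where any required piece is empty or whitespace.
        missing = [f.name for f in fields(self) if not getattr(self, f.name).strip()]
        if missing:
            raise ValueError(f"alert is missing: {', '.join(missing)}")


alert = AlertDefinition(
    name="checkout_synthetic_failed",
    failure_condition="Synthetic purchase has failed 3 times in a row",
    business_impact="Customers cannot buy anything",
    remediation="wiki: Runbooks/CheckoutDown",
    escalation_path="on-call engineer, then payments team lead",
)
print(alert.name)
```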

How to get there? Do alerting postmortems. Understand why it alerted, what was done to fix, bring in stakeholders, have the stakeholder speak to the business impact. [Ed. We have super awful alerting right now and this is a good playbook to get started!]

Q: How do you handle alerts/oncall?  Well, the person oncall is on call during the day too, so they handle 24×7. [Ed. We do that too…]

Q: How does your monitoring system identify the root cause of an issue?  That’s BS, it can’t without AI.  Human mind is required for causation.  A monitoring system can show you highly correlated behavior to guide that determination. Statistical data around a window.

Q: How to set thresholds?  We use lots. Some stock, some Holt-Winters, starting into some Markov… Humans train it on which algorithms are “less crappy.”

Q: Metrics db? We use a commercial one called Snowth that is cool, but others use cassandra successfully.

Q: How much system performance compromise is OK to get the data? I hate sampling because you lose stuff, and dropping 12 bytes into UDP never hurt anyone… Log to the network, transmit everything, then decide later how to store/sample.
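
For a sense of how cheap “dropping a few bytes into UDP” is, here is a minimal fire-and-forget sender. It uses the statsd-style counter line format (name:value|c); the function name, metric name, and default host/port are assumptions for the example, and nothing here is Theo’s actual code.

```python
import socket


def send_counter(name, value, host="127.0.0.1", port=8125):
    """Send one statsd-style counter datagram and return the bytes sent.

    UDP is fire-and-forget: there is no connection setup and no waiting
    for a reply, so the caller is barely slowed down.
    """
    payload = f"{name}:{value}|c".encode("ascii")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
    finally:
        sock.close()
    return payload


print(send_counter("checkout.sale", 1))
```

The trade-off is that datagrams can be dropped silently, which is why the decision about storage and sampling happens downstream rather than in the application.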

Don’t forget to check out his conference, SURGE.



Welcome to Velocity 2013!

All the agile admins – James (@wickett), Peco (@bproverb), and Ernest (@ernestmueller) are reunited at the Web performance and operations conference Velocity again this year!  And with us we have newer cool guys, dev extraordinaire Karthik (@iteration1) from Mentor Graphics and operations muscleman Bryan (@bryguy1211) from Bazaarvoice, and some of our old NI colleagues, Eric and Matt.  Then some of us are staying over for DevOpsDays Mountain View.

So buckle in and experience one of the handiest Web/DevOps conferences by proxy! We’re encouraging everyone to liveblog along here on the agile admin.  I always take notes but always run out of time to prettify and post them after so I’m trying liveblogging in hopes of staying caught up. Comment if you are getting value out of it to encourage us to keep it up!

I hear the Velocity marketing stuff is using one of my quotes from the blog, which is cool; it credits the old defunct webadminblog.com, but we’ve moved here now!


Velocity 2012 Day Two

After John Allspaw and Steve Souders caper about in fake muscles, we get started with the keynotes.

Building for a Billion Users (Facebook)

Jay Parikh, Facebook, spoke about building for a billion users.  Several folks yesterday warned with phrases like “if you have a billion users and hundreds of engineers then this advice may be on point…”

Today in 30 minutes, Facebook did…

  • 10 TB log data into hadoop
  • Scan 105 TB in Hive
  • 6M photos
  • 160m newsfeed stories
  • 5bn realtime messages
  • 10bn profile pics
  • 108bn mysql queries
  • 3.8 tn cache ops

Principle 1: Focus on Impact

Day one code deploys. They open sourced Phabricator for code review. 30 day coder boot camp – Ops does coder boot camp, then weeks of ops boot camp. Mentorship program.

Principle 2: Move Fast

  • Commits scale with people, but you have to be safe! Perflab does a performance test before every commit with log-replay. Also does checks for slow drift over time.
  • Gatekeeper is the feature flag tool, A/B testing – 500M checks/sec. We have one of these at Bazaarvoice and it’s super helpful.
  • Claspin is a high density heat map viewer for large services
  • fast deployment technique – used shared memory segment to connect cache, so they can swap out binaries on top of it, took weeks of back end deploys down to days/hours
  • Built lots of ops tools – people | tools | process
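
For readers who haven’t used a feature flag tool like Gatekeeper, the core check can be sketched in a few lines. This is a generic percentage rollout based on hashing, not Facebook’s implementation; the function and flag names are invented.

```python
import hashlib


def is_enabled(feature, user_id, rollout_percent):
    """Deterministically place a user in a bucket 0..99 for this feature.

    The same user always lands in the same bucket, so raising
    rollout_percent only ever adds users, it never flips existing ones off.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rollout_percent


enabled = sum(is_enabled("new_newsfeed", f"user{i}", 25) for i in range(10000))
print(f"{enabled / 100:.1f}% of users see the feature")
```

Including the feature name in the hash keeps the buckets independent per feature, so the same users are not always the guinea pigs.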

Random Bits:

  • Be bold
  • They use BGP in the data center
  • Did we mention how cool we are?
  • Capacity engineering isn’t just buying stuff, it’s APM to reduce usage
  • Massive fail from gatekeeper bug.  Had to pull dns to shut the site down
  • Fix more, whine less

Investigating Anomalies, Amazon

John Rauser, Amazon data scientist, on investigating anomalies.  He gave a well received talk on statistics for operations last year.
He used a very long “data in the time of cholera” example.  Watch the video for the long version.
You can’t just look at the summary, look at distributions, then look into the logs themselves.
Look at the extremes and you’ll find things that are broken.
Check your long tail, monitor percentiles long in the tail.
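
A quick sketch of what “monitor percentiles long in the tail” buys you, using the nearest-rank method on invented latency data (the numbers and function name are mine, not from the talk):

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("no data")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies in ms: 1.5% of requests are very slow.
latencies = [20] * 985 + [2000] * 15

print("mean:", statistics.mean(latencies))
for pct in (50, 95, 99):
    print(f"p{pct}:", percentile(latencies, pct))
```

In this data the median and the 95th percentile both look healthy; only the 99th percentile exposes the slow requests.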

Building Resilient User Experiences

Mike Brittain, Etsy on Building Resilient User Experiences – here’s the video.

Don’t mess up a page just because one of the 14 lame back end services you compose into it is down. Distinguish critical back end service failures and just suppress other stuff when composing a product page. Consider blocking vs non-blocking ajax. Load critical stuff synchronously and noncritical async/not at all if it times out.
Google Apps has the nice “there’s a problem, retrying in N seconds” messages (use exponential backoff in your UI!)
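
Exponential backoff itself is tiny. A minimal sketch of the delay schedule (parameter names and defaults are my own choices):

```python
def backoff_delays(base=1.0, factor=2.0, cap=60.0, attempts=6):
    """Retry delays in seconds: base, base*factor, base*factor^2, ... capped at cap."""
    return [min(cap, base * factor ** n) for n in range(attempts)]


print(backoff_delays())          # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
print(backoff_delays(cap=10.0))  # [1.0, 2.0, 4.0, 8.0, 10.0, 10.0]
```

In a real UI you would usually add some random jitter to each delay so that all clients don’t retry at the same instant.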

Planning for 100% availability is not enough; plan for failure.  Your UI should adapt to failure. That’s usually a joint dev+ops+product call.
Do operability reviews, postmortems. Watch your page views for your error template!  (use an error template)

Speeding up the web using prediction of user activity

Arvind Jain/Dominic Hamon from Google – see the video.

Why things would be so much faster if you just preload the next pages a user might go to.  Sure.  As long as you don’t mind accidentally triggering all kinds of shit.  It’s like cross-site request forgery as a feature!
<link rel=’prerender’> to provide guidance.

Annoyed by this talk, and since I’ve read How Complex Systems Fail already, I spent the rest of the morning plenary time wandering the vendor floor.

It’s All About Telemetry

From the famous and irascible Theo Schlossnagle (@postwait). Here’s the slides.
Monitor what matters! Most new big data problems are created by our own solutions in the first place, and are thus solvable despite their ROI. e.g. logs.

What’s the cost/benefit of your data?
Don’t erode granularity (as with RRD).  It controls your storage but your ability to, say, do YOY black friday compares sucks.
As you zoom in you see obscured major differences and patterns.

There’s a cost/benefit curve for monitoring that goes positive but eventually goes negative. Value = benefit - cost.  So you don’t go to the end of the curve, you want the max difference!

Technique 1: Text
Just store changes – but be careful not to have too many changes to store

Technique 2: Numeric
store rollups, over 1 minute min/max/avg/stddev/covar/50/95/99%
store first order derivative and then derivative of that (jerkiness)
db replication – lag AND RATE OF LAG CHANGE
“It’s a lot easier to see change when you’re actually graphing change”
Project numbers out.   Graph storage space going down!
With simple numeric data you can do prediction (Holt-Winters) even hacked into RRD
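
The “lag AND RATE OF LAG CHANGE” point can be sketched like this: take the first derivative of a series, then the derivative of that. The replication lag numbers are invented and the function is just a finite difference, not anything from Theo’s tooling.

```python
def derivative(series):
    """series: list of (t, value), oldest first, with strictly increasing t.

    Returns (t, change in value per unit time) for each consecutive pair.
    """
    return [
        (t1, (v1 - v0) / (t1 - t0))
        for (t0, v0), (t1, v1) in zip(series, series[1:])
    ]


# Hypothetical db replication lag in seconds, sampled once a minute.
lag = [(0, 2.0), (60, 2.0), (120, 8.0), (180, 26.0)]

lag_rate = derivative(lag)         # how fast the replica is falling behind
lag_accel = derivative(lag_rate)   # whether that is getting worse

print(lag_rate)
print(lag_accel)
```

A lag of 26 seconds might not trip a static threshold yet, but a positive and growing rate of change says where it is headed.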

Technique 3: Histograms
about 2k for a 1 minute histogram (5b for a single bucket)

Event correlation – change mgmt vs performance, is still human eye driven

Monitor everything.

  • the business – financials, marketing, support! problems, custsat, res time
  • operations! durr
  • system/db/yeah
  • middleware! messaging, apis, etc.

They use reconnoiter, statsd, d3, flot for graphing.

This was one of the best sessions of the day IMO.

RUM For Breakfast

Carlos from Facebook on routing users to the closest datacenter with Doppler. Normal DNS geo-routing depends on resolvers being near the user, etc.
They inject js into some browsers and log packet latency.  Map resolver IP to user IP, datacenter, latency. Then you can cluster users to nearest data centers. They use for planning and analysis – what countries have poor latency, where do we need peering agreements
Akamai said doing it was impractical, so they did it.
Amazon has “latency based routing” but no one knows how it works.
Google as part of their standard SOP has proposed a DNS extension that will never be adopted by enough people to work.

Looking at log-normal perf data – look at the whole spread but that can be huge, so filter then analyze.
Margin of error – 1.96*sd/sqrt(num), need less than 5% error.
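
That margin of error formula in code, as a minimal sketch. I’m assuming “less than 5% error” means the margin relative to the sample mean; the notes don’t say, and the sample data and function names are invented.

```python
import math
import statistics


def margin_of_error(samples, z=1.96):
    """95% margin of error of the sample mean: z * sd / sqrt(n)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    return z * statistics.stdev(samples) / math.sqrt(n)


def relative_error(samples):
    """Margin of error as a fraction of the mean (assumed meaning of the 5% rule)."""
    return margin_of_error(samples) / statistics.mean(samples)


# Hypothetical load times in seconds for one country/datacenter pair.
load_times = [1.8, 2.1, 2.4, 1.9, 2.6, 2.2, 2.0, 2.3, 1.7, 2.5]
print("margin:", margin_of_error(load_times))
print("enough data:", relative_error(load_times) < 0.05)
```

Since the margin shrinks with the square root of the sample count, halving the error takes roughly four times as many samples.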

How does performance influence human behavior?
Very fast sessions had high bounce rates (probably errors)
Strong bounce rate to load time correlation, and esp front end speed
Toxicology has the idea of a “median lethal dose”-  our LD50 is where we pass 50% bounce rate
These are: Back end 1.7s, dom load 1.8s, dom interactive 2.75s, front end 3.5s, dom complete 4.75s, load event 5.5s

Rollback, the Impossible Dream

by James Turnbull! Here’s the slides.

Rollback is BS.  You are insane if you rely on it. It’s theoretically possible, if you apply sufficient capital, and all apps are idempotent, resources…

  • Solve for availability, not rollback. Do small iterative changes instead.
  • Accept that failure happens, and prevent it from happening again.
  • Nobody gives a crap whose fault it was.
  • Assumption is the mother of all fuckups.
  • This can’t be upgraded like that because…  challenge it.

Stability Patterns

by Michael Nygard (@mtnygard) – Slides are here.   For more see his Pragmatic Programmers book Release It!

Failures come in patterns. Here’s some major ones!

Integrations are the #1 risk to stability, from both “in spec” and “out of spec” errors. Integrations are a necessary evil.  To debug you have to peel back the layers of abstraction; you have to diagnose a couple levels lower than the level an error manifests. Larger systems fail faster than small ones.

Chain Reactions – #2!
Failure moves horizontally across tiers, search engines and app servers get overloaded, esp. with connection pools. Look for resource leaks.

Cascading Failure
Failure moves vertically across tiers. Common in SOA and enterprise services. Contain the damage – decouple higher tiers with timeouts, circuit breakers.

Blocked Threads
All threads blocked = “crash”. Use java.util.concurrent or System.Threading (or ruby/php, don’t do it). Hung request handlers = less capacity and frustrated users. Scrutinize resource pools. Decompile that lame ass third party code to see how it works, if you’re using it you’re responsible for it.

Attacks of Self-Denial
Making your own DoS attack, often via mass unexpected promotions. Open lines of communication with marketers

Unbalanced Capacities
Your environment scaling ratios are different dev to qa to prod. Simulate back end failures or overloading during testing!

Unbounded result sets
Dev/test have smaller data volumes and unreal relationships (what’s the max in prod?). SOA with chatty providers – client can’t trust not being hurt. Don’t trust data producers, put limits in your APIs.

Stability patterns!

Circuit Breaker
The naive approach is a remote call wrapped with a retry loop (always 3!)
Immediate retries are very likely to fail again – TCP fixes packet drops for you man
Retrying makes the user wait longer for their error, and floods the servers (cascading failure)
Instead, count failures (leaky bucket) and stop calling the back end for a cool-off period
Can critical work be queued for later, or rejected, or what?
State of circuit breakers is a good dashboard!
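
Here is a minimal circuit breaker sketch to go with those notes. It is a simplification: it counts consecutive failures rather than using a true leaky bucket, and the class, parameter names, and defaults are my own, not code from Release It!

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected without touching the back end."""


class CircuitBreaker:
    def __init__(self, max_failures=3, cooloff_seconds=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooloff_seconds = cooloff_seconds
        self.clock = clock          # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooloff_seconds:
                # Fail fast: the user gets the error now and the back end gets a rest.
                raise CircuitOpenError("back end call skipped during cool-off")
            # Cool-off elapsed: let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Exposing failures and opened_at for every breaker is what makes them useful on a dashboard.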

Bulkheads
Partition the system and allow partial failure without losing service.
Classes of customer, etc.
If foo and bar are coupled with baz, then hitting baz can bork both. Make baz pools or whatever.
Less efficient resource use but important with shared-service models
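
The “make baz pools” idea can be sketched with a bounded semaphore per caller: foo and bar each get their own limited set of slots for calling baz, so one of them hammering baz can’t use up the other’s capacity. The class name and slot counts are hypothetical.

```python
import threading


class Bulkhead:
    """Caps concurrent calls through this partition and rejects the overflow."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, func, *args, **kwargs):
        # Do not queue: if the partition is full, fail immediately.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full, rejecting call")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# Separate partitions for each service that depends on baz.
baz_for_foo = Bulkhead(max_concurrent=5)
baz_for_bar = Bulkhead(max_concurrent=5)

print(baz_for_foo.run(lambda: "baz answered foo"))
```

The cost is the one noted above: slots reserved for foo sit idle even when bar could have used them.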

Test Harness
Real-world failures are hard to create in QA
Integration tests don’t find those out-of-spec errors unless you force them
Service acting like a back end but doing crazy shit – slow, endless, send a binary
Supplement testing methods

The Colo or the Cloud?

Ken King/James Sheridan, Yammer

They started in a colo.
Dual network, racks have 4 2U quad and 10 1Us
EC2 provides you: network, cabling, simple load balancing, geo distribution, hardware repair, network and rack diagrams (?)
There are fixed and variable costs, assume 3 year depreciation
AWS gives you 20% off for 2M/year?
Three year commit, reserve instances
one rack one year – $384k colo, $378k ec2, $199k reserved
20 racks one year – $105k colo, $158k ec2 even with reserve and 20% discount
20 racks 3 years – $50k colo, $105k ec2

But – speed (agility) of ramping up…

To me this is like cars.  You build your own car, you know it.  But for arbitrary small numbers of cars, you don’t have time for that crap.  If you run a taxi fleet or something, then you start moving back towards specialized needs and spec/build your own.

Colo benefits – ownership and knowledge.  You know it and control it.

  • Load balancing (loadbalancer.com or zeus/riverbed only cloud options)
  • IDS etc. (appliances)
  • Choose connectivity
  • know your IO, get more RAM, more cores
  • Throw money at vertical scalability
  • Perception of control = security

Cloud benefits – instant. Scaling.

  • Forces good architecture
  • Immediate replacement, no sparing etc.
  • Unlimited storage, snapshots (1 TB volume limit)
  • No long term commitment
  • Provisioning APIs, autoscaling, EU storage, geodist


  • crocodoc and encoder yammer partners are good to be close to
  • need to burst work
  • dev servers, windows dev, demo servers banished to AWS
  • moving cross connects to ec2 with vpc

Whew!  Man I hope someone’s reading this because it’s a lot of work.  Next, day three and bonus LSPE meeting!



Velocity 2012 Day One

Hello all! The Velocity cadre grows as the agile admins spread out.  I’m here with Chris, Larry, and Victor from Bazaarvoice and our new friends Kevin, Bob, and Morgan from Powerreviews which is now Bazaarvoice’s West Coast office; Peco is here with Charlie from Opnet, and James is here… with himself, from Mentor Graphics.  Our old friends from National Instruments Robert, Eric, and Matt are here too. We have quite a Groupme going!

Chris, Peco, James and I were on the same flight, all went well and we ended up at Kabul for a meaty dinner to fortify us for the many iffy breakfasts and lunches to come.  Sadly none of us got into the conference hotel so we were spread across the area.  I’m in the Quality Inn Santa Clara, which is just fine so far (alas, the breakfast is skippable, unlike that place Peco and I always used to stay).

I’m sharing my notes in mildly cleaned up fashion – sorry if it gets incoherent, but this is partially for me and partially for you.

Now it’s time for the first session!  Spoiler alert – it was really, really good and I strongly agree with large swaths of what he has to say.  In retrospect I think this was the best session of Velocity.  It combined high level guidance and tech tips with actionable guidelines. As a result I took an incredible number of notes.  Strap in!

Scaling Typekit: Infrastructure for Startups

by Paul Hammond (@ph) of Typekit, Slides are here: paulhammond.org/2012/startup-infrastructure

Typekit does Web fonts as a service; they were acquired by Adobe early this year. The characteristics of a modern startup are extreme uncertainty and limited money. So this is basically an exercise in effective debt management.

Rule #1 – Don’t run out of money.

Your burn rate is likely # of people on the team * $10k because the people cost is the hugely predominant factor.

Rule #2 – Your time is valuable, Don’t waste it.

He notes the three kinds of startups  – venture funded, bootstrapped, and big company internal.  Sadly he’s not going to talk about big company internal startups, but heck, we did that already at National Instruments so fair enough!  He does say in that case, leverage existing infrastructure unless it’s very bad, then spend effort on making it better instead of focusing on new product ideas.  “Instead of you building a tiny beautiful cloud castle in the corner that gets ignored.” Ouch! The ex-NI’ers look ruefully at each other. Then he discussed startup end states, including acquisition.  Most possible outcomes mean your startup infrastructure will go away at some point. So technical debt is OK, just like normal debt; it’s incurred for agility but like financial must be dealt with promptly.

Look for “excuses” to build the infrastructure you need (business and technical). He cites Small Batch Inc., which did a “How to start a company” conference first thing, forcing incorporation and bank accounts and liability insurance and all that, and then Wikirank, which was not “the product” but an excuse to get everyone working together and learn new tech and run a site as a throwaway before diving into a product. Typekit, in standard Lean Startup fashion, announced in a press release before there was anything to gauge interest, then a funding round, then 6 months later (of 4 people working full time) to get 1.0 out.  Launching a startup is very hard.  Do whatever you can to make it easier.

When they launched their stack was merb/datamapper/resque/mysql/redis/munin/pingdom/chef-solo/ubuntu/slicehost/dynect/edgecast/github/google apps/dropbox/campfire/skype/join.me/every project tracking tool ever.

Now about the tech stack and what worked/didn’t work.

  • Merb is a Web framework like Rails. It got effectively end of lifed and merged into Rails 3, and to this day they’re still struggling with the transition. Lesson: You will be stuck with your technology choices for a long time.  Choose wisely.
  • Datamapper – a Ruby ORM. Not as popular as ActiveRecord but still going.  Launched on v0.9.11!  Over the long term, many bugs. A 1.0 version came out but it has unknown changes, so they haven’t ported.  The code that stores your data, you need 100% confidence in.  Upgrading to ActiveRecord was easier because you could do both in parallel.   Lesson: Keep up with upgrades.  Once you’re a couple behind it’s over.
  • Resque – queueing system for Ruby. They love it. Gearman is also a great choice. Lesson: You need a queue – start with one. Retrofitting makes things much harder.
  • Data: MySQL/Redis (and Elasticsearch)
    • MySQL: You have to trust your database like nothing else. You want battle tested, well understood infrastructure here. And scaling mySQL is a solved problem, just read Cal Henderson’s book.
    • Redis: Redis doesn’t do much, which is why it’s awesome.
    • Elasticsearch: Our search needs are small, and elastic search is easy to use.
    • Lessons from their data tier: Choose your technology on what it does today, not promises of the future. They take a couple half hour downtimes a year for schema upgrades. You don’t need 99.999% availability yet as a startup.  Sure, the Facebook/Yahoo/Google presentations about that are so tempting but you’re 4 guys, not them.
  • Monitoring
    • Munin – monitoring, graphing, alerting.  Now collectd, nagios and custom code and they hate it.
    • Pingdom is awesome. It’s the service of last resort.
    • Pagerduty is also awesome. Makes sure you get woken up and you know who does.
    • Papertrail is hosted syslog. “It’s not splunk but it’s good enough for our needs.” “But a syslog server is easy to run.  Why use papertrail?” The tools around it are better than what they have time to build themselves.  Hosted services are usually better and cheaper than what you can do yourself.  If there’s one that does what you need, use it.  If it costs less than $70/month buy without thinking about it, because the AWS instance to run whatever open source thingy you were going to use instead costs that much.
    • #monitoringsucks shout-out!  “I don’t know anyone who’s happy with their monitoring that doesn’t have 3-4 full time engineers working on it.”  However, #monitoringsucks isn’t delivering. Every single little open source doohickey you use is something else to go wrong and something they all need to understand.  Nothing is meeting small startups’ needs.  A lot of the hosting ones too, they charge per metric or per host (or both) and that’s discouraging to a startup.  You want to be capturing and graphing as much as you can.
  • Chef – started with chef-solo and rsync; moved to Chef Hosted in 2011 and have been very happy with it.
  • Ubuntu LTS 10.04.  “I don’t think any startup has ever failed because they picked the wrong Linux distribution.”
  • Slicehost – loved it but then Rackspace shut it down, and the migration sucked – new IPs, hours of downtime. Migrated to Rackspace and EC2. Lots of people are going to bash cloud hosting later at the conference as a waste of money. Counterpoint – “Employees are the biggest cost to a startup.”
  • Start with EC2, period, unless you’re an infra company or totally need super bare metal performance.
  • But – credentials… use IAM to manage them. We use it at BV but it ends up causing a lot of problems too (“So you want your stuff in different IAM accounts to talk to each other like with VPC?  Oh, well, not really supported…”)  Never use the root credentials.
  • Databases in the cloud.  Ephemeral or EBS? Backups? They get a high memory instance, run everything in memory, and then stop worrying about disk IO.  Shazam!  Figure it out later.
  • DynECT – Invisible and fine.
  • Edgecast – cool. CDNs are not created equal, and they have different strengths in regions etc. If you don’t want to hassle with talking to someone on the phone, screw Akamai/Limelight/etc. If you’re not haggling you’re paying too much.  But as a startup, you want click to start, credit card signup. Amazon Cloudfront, Fastly. For Typekit they needed high uptime and high performance as a critical part of the service.  Story time, they had a massive issue with Edgecast as about.me was going live. See Designing for Disaster by Jeff Veen from Velocity Europe. Systems perform in unexpected ways as they grow.  Things have unexpected scaling behavior. Know your escape plan for every infrastructure provider.  That doesn’t have to be “immediate hot backup available,” just a plan.
  • Github – using organizations.
  • Google Apps – yay.  Using Google App Engine for their status page to put it on different infrastructure. They use Stashboard, which we used at NI!

“Buy or build?”

Buy, unless nothing meets your needs.  Then build.  Or if it’s your core business and you’re eating your own dog food.
If it costs more than your annual salary, build it.

A third party provider having an outage is still YOUR problem. Still need a “sorry!” Write your update without naming your service provider.  [You should take responsibility but that seems close to not being transparent to me. -Ed.]  Anyway, buy or build option is “neither” if it’s not needed for the minimum viable product.

You’re not Facebook or Etsy with 100 engineers yet. You don’t need a highly scalable data store.  A half hour outage is OK. You don’t need multi-vendor redundancy, you need a product someone cares about.

Rule #3 – Set up the infrastructure you need.

Rule #4 – Don’t set up infrastructure you don’t need.

Almost every performance problem has been on something they didn’t yet measure.  All their scaling pain points were unexpected.  You can’t plan for everything and the stuff you do plan for may be wasted.

Brain twister: He spent a week writing code to automatically bring up a front end Tomcat server in AWS if one of theirs crashes.  That has never happened in years.  Was that work worthwhile, does it really meet ROI?

Rule #5 – Don’t make future work for yourself.

There’s a difference between not doing something yet and deliberately setting yourself up for redo.  People talk about “technical debt” but just as in finance, there’s judicious debt and then there’s payday loans. Optimize for change. Every time you grow 10x you’ll need to rewrite. Just make it easy to change.

“You ain’t gonna need it”

Everyone’s startup story:

  1. Find biggest problem
  2. Fix biggest problem
  3. Repeat

The story never reads like:

  1. Up front, plan and build infrastructure based on other companies
  2. Total success!

Minimum Viable Infrastructure for a Startup:

  1. Source control
  2. Configuration management
  3. Servers
  4. Backups
  5. External availability monitoring

So you really could get started with github orgs, rsync/bash, EC2, s3cmd, pingdom, then start improving from there. Well, he’s not really serious you should start that way, he wouldn’t start with rsync again.  But he’s somewhat serious, in that you should really consider the minimum (but good) solution and not get too fancy before you ship.

Watch out for

  • Black swans
  • Vendor lock-in
  • Unsupported products
  • Time wasting

Woot! This was a great session, everything from straight dope on specific techs, mistakes made and lessons learned, high level guidance with tangible rules of thumb.

Question and Answer Takeaways:
If you’re going to build, build and open source it to make the ecosystem better
Monitoring – none of them have a decent dashboard. The Ganglia, Nagios, and Munin UIs all suck.


Discussion with Mike Rembetsy and other Etsyans about why JIRA and Confluence are ubiquitously used but people don’t like talking about it.  His theory is that everyone has to hack them so bad that they don’t want to answer 100 questions about “how you made JIRA do that.”

Turning Operational Data Into Gold At Expedia

By Eddie Satterly, previously of Expedia and now with Splunk. This is starting off bad.  I was hoping that with Expedia having top billing it was going to be more of a real use case, but we’re getting the stock Splunk vendor pitch.

Satterly was senior director of architecture at Expedia.  They put 6 TB/day into Splunk. Highlights:

  • They built an SDK for Cassandra data stores, and they archive specific Splunk data to Hadoop for long term retention and batch analysis
  • The big data integration really ramped up the TB/day
  • They do external lookups – geo, ldap, etc.
  • Puppet deploy of the agents/SCCM and gold images
  • A lot of the tealeaf RUM/Omniture Web analytics stuff is being done in splunk now
  • Zenoss integration but moving more to splunk there too
  • Using the file integrity monitoring stuff
  • Custom jobs for unusual volumes and “new errors”

Session was high on generalities; sadly I didn’t really come away with any new insights on splunk from it. Without the sales pitch it could have been a lightning talk.

11 Ways To Hack Puppet For Fun and Productivity

by Luke Kanies. I got here late but all I missed was a puppet overview. Slides on Slideshare.


  1. Puppet as you.  It doesn’t have to run as root.
  2. Curl speaks.  You can pull catalogs etc. easily, decouple see facts/pull catalog/run catalog/run report.
  3. Data, and lots of it. Catalogs, facts, reports.
  4. Static compiler. Refer to files with checksum instead of URL. And it reduces requests for additional files.
  5. config_version. Find out who made changes in this version.
  6. report processor.
  7. Function
  8. Fact
  9. Types
  10. Providers
  11. Face

Someone’s working on a puppet IDE called geppetto (eclipse based).

I don’t know much puppet yet, so most of this went right by me.

Develop and Test Configuration Management Scripts With Vagrant

By Mitchell Hashimoto from Kiip (@mitchellh). Slides on Speakerdeck.

Sure, you can bring up an ec2 instance and run chef and whatnot, but that gets repetitive. This tempts you to not do incremental systems development, because it takes time and work. So you just “set things up once” and start gathering cruft.

Maybe you have a magic setup script that gets your Macbook all up and running your new killer app. But it’s unlikely, and then it’s not like production.  Requires maintenance, what about small changes… Bah. Or perhaps an uber-readme (read: Confluence wiki page). Naturally prone to intense user error. So, use Vagrant!

We’ll walk through the CLI, VM creation, provisioning, scripted config of vm, network, fs, and setup

Install Virtualbox and Vagrant – All that’s needed are vagrantfile and vagrant CLI
vagrantfile: Per project configuration, ruby DSL
CLI: vagrant <something> e.g. “vagrant up”

vagrant box – set up base boxes.  It’s just a single file. “vagrant box add name url”.
Go to vagrantbox.es for more base boxes. They’re big (It’s a vm…)

Project context. “vagrant init <boxtype>” will dump you a file.

“vagrant up” makes a private copy, doesn’t corrupt base box

vagrant up, status, reload, suspend (freeze), halt (shutdown), destroy (delete)

Provides shared folders, NFS to share files host to guest
Shared folder performance degrades with # of files, go to NFS

Provisioning – scripted install of packages, etc.  It supports shell/puppet/chef and soon cfengine.
Use the same scripts as production. vagrant up runs provisioning, but vagrant reload or vagrant provision runs it in isolation

Networking – port forwarding, host-only, bridged

port forwarding exposes ports on the guest via ports on the host, even to the outside.
Simple, over 1024 and open
host only makes a private net of VMs and your host. set IPs or even DHCP it. Beware of IP collisions.
bridge – get IPs from a real router. makes them real boxes, though bad networks won’t do it.

multi vm.  Configure multiple VMs in one file and hook ’em up.  In multi mode you can specify a target on each command to not have it do on all

vagrant package “burns a new AMI” off the current system.
package up installed software, use provisioners for config and managing services

Great for developing and testing chef/puppet/etc scripts. Use prod-quality ops scripts to set up dev env’s, QA. It brings you a nice standard workflow.


  • other virtualization, vmware, ec2, kvm
  • vagrant builder: ami creator
  • any guest OS

End, Day One!

And we’re done with “Tutorial” day!  The distinction between tutorials and other conference sessions is very weak and O’Reilly would do better to just do a three day conference and right-size people’s presentations – some, like the Typekit one, deserve to be this long.  Others should be a normal conference session and some should be a lightning talk.

Then we went to the Ignites and James and I did Ignite slide karaoke where you have to talk to random slides.  Check out the deck, I got slides 43-47 which were a bit of a tough row to hoe. I got to use my signature phrase “keep your pimp hand strong” however.

Filed under Conferences, DevOps

Velocity 2011: The Workshops

Peco and I split up to cover more ground.  I went to four workshops and here’s the details… Peco will have to chime in on his.

First, Adrian Cockcroft, Director of Cloud Architecture for Netflix, spoke on Netflix in the Cloud. This session was excellent.  He talked about the importance of model driven architecture, a runtime registry, how too many of the monitoring etc. tools don’t do cloud worth a damn…  All great stuff.  Included a love letter to AppDynamics, a cool cloud-friendly app instrumentation tool similar to our beloved Opnet Panorama.

Next, I saw John Rauser of Amazon talk about Just Enough Statistics To Be Dangerous. He talked about basic probability stats and how to use them.  Pretty good, though could have used more “and here’s how this applies to WebOps” examples instead of “how many quarters are in this jar” examples. I missed a bit of this because I ran out to go to the head and Patrick Debois grabbed me to talk to a guy from Dell about DevOps, which was loads of fun!  I missed the part on Bayesian stats though, I’ll have to watch the session video once it’s available.

Over lunch we met up with all the other guys here from NI, and my college friend Jon Whitney! Woot!  Rice University in the house!

After lunch, it was John Allspaw talking about reliability engineering and Postmortems and Human Error. Root cause is a myth!  So is human error!  Mindbending stuff. You should read the “How Complex Systems Fail” chapter in the Web Ops book to lube you up first, then watch the video for this session. Very relevant to all ops folks. We were a little split, though, on how a militant no-blame philosophy jibes with places that aren’t hiring the absolute cream of the crop – if you don’t work at Etsy or similar 3l33t place, you do have some folks that are… a disproportionate source of errors.

My last workshop was a little disappointing – Automating Web Performance Testing by 5 PM, by the Neustar crew. There was some good info in there – Selenium, proxies, HAR format – but delivery was weak.  Sample code though, you can download some Python and Java automation examples. But “I can’t read text that small” combined with bad presentation technique (asking 5 times for “raise your hand if you don’t know X,” for example) made it a bit of a chore. Ah well.

Now it’s time for dinner and then the evening Ignite! sessions!

Filed under Conferences, DevOps

Velocity 2011 Kickoff!

Two of the agile admins, Ernest and Peco, are in Santa Clara this week for our fourth Velocity conference! We’ve been to all of them and always get a lot out of them.  It’s the first conference focused on Web performance and operations. Today is the day of workshops, then Wed-Thurs is normal sessions.  On Fri-Sat we’re going to DevOpsDays 2011 Mountain View. The third agile admin, James, is in Penang hanging out with our follow-the-sun WebOps staff!

If any of you are out in sunny CA for these events (or heck, if you’re in Penang and bored), ping us, we’d love to meet you! Tweet me at @ernestmueller for the hookup.

Now to our first workshops – I’m watching Adrian Cockcroft talk about Netflix’s use of the Amazon cloud and Peco is going to see the OpenStack workshop!

Filed under Conferences, DevOps

Velocity 2010 – Facebook Performance Shenanigans

Pipelining, Progressive Enhancement, and More: Making Facebook Twice as Fast by Jason Sobel (Facebook), Changhao Jiang (Facebook)

It’s the last session of Velocity already! The companies are tearing down their booths, people are escaping to the airport. Today went really, really fast. The room is still mostly full though!

As we’ve heard before, they have loads of users.  They have a central performance team, but performance people are also distributed and embedded throughout the company.

The core site speed team started working on PHP speed.  Then they read Steve Souders’ book and realized “Oh, crap…”

They are working on a “perflab” to measure performance impacts of all changes.  And detect regressions.

What are the three things they measure at Facebook?

  1. Server time
  2. Network time
  3. Client/render time

What are we optimizing for?  Shouldn’t be any of those three.  Optimize for people.  That doesn’t even mean end user response time – that means impression of performance.

How fast is Facebook?  Well, determine what the core of the experience is.  What do people look at first, what defines the experience?  Lazy load the rest of that crap.

Metric: Time To Interact (TTI).  It’s a very custom metric.  When is the user getting value out of the site?  This is subjective and requires you to really know your users.

For this, the critical pieces have to be there and have to WORK – you can’t just display it and have the functionality not there yet.  You can’t pick visible but not functional yet.

Techniques used to speed things up:

  1. Early flush.  Get them a list of the crucial elements.
  2. Components.  Pages used the same components with different names, the color blue was defined a thousand times.  Make a reusable set of visual components that can appear on any page and share the same CSS rules.  Besides enforcing visual standards, you can optimize them and then reuse them.  Theirs are a grid, an image block, some buttons, page headers…
  3. JavaScript!  We love it, but it is hard.  They wrote a lot before they knew what they were doing.  They have something called “primer.”  There’s a simple JS library that lives in the head and can bootstrap the rest of the javascript and respond to simple stuff devs were writing over and over again.  An event handler that can do a popup, get and insert content, or do a form submit.  And go get other javascript.  Then you tag something with a rel=”dialog” and it pops a dialog.  And once the page is done you can go get the stuff instead of making it on demand.  “async” gets content. In the feedback interface; Like and View and Delete use it.
  4. BigPipe is an attempt to rethink how we present pages.  The problem is the page generation, network latency, and page rendering being serial.  They render personalized pages and have to query several back end services to make the page.  The page is waiting on the slowest back end query.  So pipeline it out!  Decompose pages into “pagelets” and pipeline them through different execution stages in the server and browser.  They give priorities to different pagelets.
    How does it work?  First you get a nearly empty doc.  In the head, script src bigpipe.js.  Then there are divs on the page with IDs; a template with the logical structure of the page.  Each pagelet is flushed separately in a script tag, JSON encoded.  BigPipe on the client downloads CSS for the pagelet, displays it, downloads JS, and executes onLoad()s.
    This gave them a 2x improvement in perceived latency (defined by TTI) across all browsers.
    What about search engines?  Well, first of all, for devs to use the pipe, they have to write pagelets, and they have a pagelet abstraction for them to use.  Only has three functions: initialize, prepare, and render.  To pipeline you create a BigPipe instance, specify your page layout and place holders, add pagelets to the pipe (source file and wrapper id) and then call render.  So you can do pipeline, singleflush, parallel, or prepare modes.  One parameter in Bigpipe::GetInstance controls it. Use singleflush for search and non-JS stuff.  Prepare lets you batch multiple pages.  Parallel lets you use multiple threads for different pagelets (at the cost of server resources!).
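Here is a minimal Python sketch of the pipelining idea, to make the flow concrete. It is not Facebook’s code: `fake_backend`, `render_pipelined`, and the `bigPipe.onPageletArrive` call in the emitted script tags are all made-up names for illustration, and the threads stand in for whatever the real server does in its parallel mode.

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_backend(pagelet_id, delay):
    """Stand-in for a slow back end query (hypothetical)."""
    time.sleep(delay)
    return {"id": pagelet_id, "content": f"<p>{pagelet_id} ready</p>"}


def render_pipelined(pagelets):
    """Yield the page in chunks: skeleton first, then each pagelet as it finishes."""
    # 1. Flush a nearly empty document with one placeholder div per pagelet.
    divs = "".join(f'<div id="{pid}"></div>' for pid, _ in pagelets)
    yield f"<html><head><script src='bigpipe.js'></script></head><body>{divs}"

    # 2. Run the back end queries in parallel and flush each result when done,
    #    so the page no longer waits on the slowest query.
    with ThreadPoolExecutor(max_workers=max(1, len(pagelets))) as pool:
        futures = [pool.submit(fake_backend, pid, delay) for pid, delay in pagelets]
        for future in as_completed(futures):
            payload = json.dumps(future.result())
            # Hypothetical client-side hook that fills in the matching div.
            yield f"<script>bigPipe.onPageletArrive({payload});</script>"

    yield "</body></html>"


if __name__ == "__main__":
    for chunk in render_pipelined([("newsfeed", 0.2), ("chat", 0.05), ("ads", 0.1)]):
        print(chunk)
```

The point to notice is that chunks come out in completion order, not declaration order, which is why each one has to carry the ID of the div it belongs to.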

Whew!  All this was a success – on Dec 22 they got their goal of making Facebook twice as fast.

Combine with ESIs for even more fun!

Thoughts from the Site Speed team:

To build a culture of performance…  Make tshirts!  They gave shirts to those who made improvements.

  1. Getting the right metrics, that people buy into, then they’ll work on optimizing it.
  2. Build the right abstraction to make the site fast by default, and if devs use them then you are fast without them having to do loads of work.
  3. Partnership.  If other teams are committed you’ll have success.  Find the people that get it and work with them.  Ignore the ignorant.

Final thought – spriting!  He likes spriting.  It’s crazy, but the platform is a little broken so you have to do stuff like that.  But let’s fix the platform so you don’t have to do crazy stuff.  Fast by default!!!

And that’s a wrap for Velocity 2010!  Next, stuff from DevOpsDays, and my thoughts and reflections on what we’ve learned!

Filed under Conferences, DevOps

Velocity 2010 – Always Ship Trunk

Always Ship Trunk: Managing Change In Complex Websites by Paul Hammond (Typekit)

No rest for the wicked.  More sessions to write up.  Let’s find out how to do feature switches, Flickr-style.  My comments are in italics.

Use revision control. Branching is sad because of merging.  But Mercurial and git make it all magically delicious.

Revision control is nice but what it doesn’t answer is what is running on a given Web server.

There are three kinds of software.

  1. Installed
  2. Open Source installed
  3. Web apps/SaaS

Web apps are not like installed apps.  Revision control is meant to deal with loads of versions.  With a Web app there’s about 1 version of your app in use.  If you administer every computer your software is installed on, you don’t have to worry about a lot of stuff.  Once you upgrade, the old code will never be run again.  It has a very linear flow.

But not really.  Upgrades don’t happen on every box simultaneously.  And shouldn’t – best practice is rolling to a subset.

And you push to a staging/QA environment first.  So suddenly you have more “installs.”  And beta environments.

You have stuff (dependencies) outside your control – installed library dependencies, Web service dependencies – all that change has to be managed.

Coordinating lots of people working at the same time is hard.

Deep thought alert: Nobody knows you just deployed unless you tell them.

You can separate the code deployment from the launch.  You can rewrite your infrastructure and keep the UI the same and no one knows.

Deep thought alert 2: You can run different versions in production at the same time.

Put it out.  Ramp up usage.  Different people can see different UIs and they don’t know.
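The talk didn’t say how the ramp-up is decided per user. One common way to do it, sketched here in Python purely as an assumption, is to hash the user ID into a stable bucket so the same person always sees the same version while you raise the percentage. The function and feature names are made up.

```python
import hashlib


def in_rollout(user_id, feature, percent):
    """Return True if this user falls inside the first `percent` of 100 buckets."""
    # Hash feature and user together so different features get different cohorts.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in 0..99
    return bucket < percent


if __name__ == "__main__":
    for pct in (1, 10, 50, 100):
        seen = sum(in_rollout(f"user{i}", "new_ui", pct) for i in range(10000))
        print(f"{pct}% ramp: {seen} of 10000 users see the new UI")
```

Because the bucket never changes, anyone included at 10% is still included at 50%, so nobody flips back and forth as you ramp.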

What we need is a revision control system that lets us manage multiple parallel versions of the code and switch between them at runtime.

Branches don’t solve that problem for us (by themselves).  And they don’t help with dependency changes that affect all branches at once – if someone changes their Web API you call, it affects every version!

revision vs version.

Manage the different versions within your application – “branching in code.”  You know, if statements.

This is really dangerous if you don’t have super duper regression testing right?  I’m rolling a new version but not really…  Good luck on that.

This is the “switch concept.”  It allows for feature testing on production servers.

Join it with cookies and you can have a “feature flip” page!  You can put all kinds of private functionality into the app and rely on whatever if statement you wrote to make sure no one bad gets to it!  Good Lord!

There are benefits to production testing (even if it’s not from end users) – firewall stuff, CDN stuff, et cetera.  It’s very flexible.  You can do dark launches.  Run the code in the background and don’t display it.  Now that’s clever.
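A dark launch can be sketched roughly like this in Python. This is my own illustration under assumptions, not anything shown in the talk: `serve_with_shadow` is a made-up helper, and it runs the new path inline for simplicity, where a real one would more likely push it off the request path.

```python
import logging

log = logging.getLogger("dark_launch")


def serve_with_shadow(primary, shadow, request, enabled=True):
    """Return the primary result; exercise the shadow path without showing it."""
    result = primary(request)
    if enabled:
        try:
            if shadow(request) != result:
                log.warning("dark launch mismatch for %r", request)
        except Exception:
            # A failure in the dark code must never reach the user.
            log.exception("dark launch path failed for %r", request)
    return result


if __name__ == "__main__":
    def old_sort(items):
        return sorted(items)

    def new_sort(items):  # hypothetical rewrite under test
        return sorted(items, reverse=True)

    print(serve_with_shadow(old_sort, new_sort, [3, 1, 2]))
```

The user only ever gets the old answer; the new code just gets real production traffic and leaves a log trail when it disagrees or blows up.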

There are three types of feature flags

  1. user facing feature development
  2. infrastructure development
  3. kill switches

Disable login!

They have loads of $cfg[‘disable_random_feature’] = false
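For a feel of what that looks like in practice, here is a small Python sketch (their config is PHP; this is a translation of the idea, and every flag and function name in it is made up). It shows a kill switch and a development flag that can also be flipped per user with a cookie.

```python
# All switches live in one place. Names are hypothetical.
CONFIG = {
    "disable_login": False,          # kill switch (operational)
    "disable_photo_upload": False,   # kill switch (operational)
    "enable_new_photo_page": False,  # development flag
}


class FeatureDisabled(Exception):
    """Raised when a kill switch has turned a feature off."""


def login(username, config=CONFIG):
    if config["disable_login"]:
        raise FeatureDisabled("login is temporarily switched off")
    return f"welcome, {username}"


def photo_page(photo_id, cookies=None, config=CONFIG):
    cookies = cookies or {}
    # Development flag: on for everyone, or flipped on for one user via cookie.
    use_new = (
        config["enable_new_photo_page"]
        or cookies.get("flip_new_photo_page") == "1"
    )
    return f"new page for {photo_id}" if use_new else f"old page for {photo_id}"


if __name__ == "__main__":
    print(login("ernest"))
    print(photo_page("p1"))
    print(photo_page("p1", cookies={"flip_new_photo_page": "1"}))
```

Note that the cookie can only turn on a development flag here; the kill switches read the config alone, so a user can’t cookie their way around a disabled login.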

The cost of this is complexity.

Separate your operational controls from development flags.

Be disciplined about removing unused feature flags so it’s not full of cruft.

If you’re going to do this,  just go all in and always deploy trunk to every server on every deploy and manage versions with config.

Definitely daring.  I wonder if it’s appropriate for more “real” workloads than “I’m uploading my pics to a free service for kicks” though.

Joel Spolsky sayeth:  This is retarded.

With new style distributed merge, instead:

  • Use branches for early development. Branches should be merged into trunk.
  • Use flags for rollout of almost-finished code.

Is there a better alternative?  Everyone who makes revision control systems makes them for installed software, not Web software – what would one for Web software look like?

Q&A Tidbits: Put all the switches in one place… Not spread through the code.

What about Sarbanes/Oxley division of labor?  Pshaw.  This is for apps that are just for funsies.

You have to build some culture around devs not just hitting deploy and wandering off, but following up on production state.

Filed under Conferences, DevOps

Velocity 2010 – Performance Indicators In The Cloud

Common Sense Performance Indicators in the Cloud by Nick Gerner (SEOmoz)

SEOmoz has been  EC2/S3 based since 2008.  They scaled from 50 to 500 nodes.  Nick is a developer who wanted him some operational statistics!

Their architecture has many tiers – S3, memcache, app, lighttpd, ELB.  They needed to visualize it.

This will not be about waterfalls and DNS and stuff.  He’s going to talk specifically about system (Linux system) and app metrics.

/proc is the place to get all the stats.  Go “man proc” and understand it.

What 5 things does he watch?

  • Load average – like from top.  It combines a lot of things and is a good place to start but explains nothing.
  • CPU – useful when broken out by process, user vs system time.  It tells you who’s doing work, if the CPU is maxed, and if it’s blocked on IO.
  • Memory – useful when broken out by process.  Free, cached, and used.  Cached + free = available, and if you have spare memory, let the app or memcache or db cache use it.
  • Disk – read and write bytes/sec, utilization.  Basically is the disk busy, and who is using it and when?  Oh, and look at it per process too!
  • Network – read and write bytes/sec, and also the number of established connections.  1024 is a magic limit often.  Bandwidth costs money – keep it flat!  And watch SOA connections.
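To show how little it takes to pull a couple of the numbers above out of /proc, here is a Python sketch. It is not what SEOmoz runs (they use collectd); the function names are mine, and `available_kb` just encodes the talk’s “cached + free = available” rule of thumb.

```python
def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into 1, 5, and 15 minute averages."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def parse_meminfo(text):
    """Parse the contents of /proc/meminfo into a dict of field name -> value."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            # Most values are in kB; a few (e.g. HugePages_Total) are plain counts.
            fields[name.strip()] = int(parts[0])
    return fields


def available_kb(meminfo):
    """The talk's rule of thumb: cached + free = available."""
    return meminfo["MemFree"] + meminfo["Cached"]
```

On a Linux box you would feed these the contents of `/proc/loadavg` and `/proc/meminfo`. Newer kernels also report a `MemAvailable` field, which is the kernel’s own, more careful estimate of the same thing.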

Perf Monitoring For Free

  1. data collection – collectd
  2. data storage – rrdtool
  3. dashboard management – drraw

They put those together into a dashboard.  They didn’t want to pay anyone or spend time managing it.  The dynamic nature of the cloud means stuff like nagios have problems.

They’d install collectd agents all over the cluster.  New nodes get a generic config, and node names follow a convention according to role.

Then there’s a dedicated perf server with the collectd server, a Web server, and drraw.cgi.  In a security group everyone can connect in to.

Back up your performance data – it’s critical to have history.

Cloudwatch gives you stuff – but not the insight you have when breaking out by process.  And Keynote/Gomez stuff is fine but doesn’t give you the (server side) nitty gritty.

More about the dashboard. Key requirements:

  • Summarize nodes and systems
  • Visualize data over time
  • Stack measurements per process and per node
  • Handle new nodes dynamically w/o config change

He showed their batch mode dashboard.  Just a row per node, a metric graph per column.  CPU broken out by process with load average superimposed on top.  You see things like “high load average but there’s CPU to spare.”  Then you realize that disk is your bottleneck in real workloads.  Switch instance types.

Memory broken out by process too.  Yay for kernel caching.

Disk chart in bytes and ops.  The steady state, spikes, and sustained spikes are all important.

Network – overlay the 95th percentile cause that’s how you get billed.
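For reference, a 95th percentile over bandwidth samples can be computed like this in Python. This is a sketch using the nearest-rank method; how a given provider actually samples and rounds for billing is an assumption you should check against your own contract.

```python
def percentile_95(samples):
    """Nearest-rank 95th percentile: the top 5% of samples are thrown away."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer form of ceil(0.95 * n), which avoids floating point surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


if __name__ == "__main__":
    # Nineteen quiet intervals and one huge burst, in Mbit/s.
    samples = [10] * 19 + [1000]
    print(percentile_95(samples))
```

A short burst that lands in the top 5% of samples doesn’t move the number, but sustained traffic does, which is why it’s worth drawing the line right on the graph.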

Web Server dashboard from an API server is a little different.

Add Web requests by app/request type.  app1, app2, 302, 500, 503…  You want to see requests per second by type.

mod_status gives connections and children idleness.

System wide dashboard.  Each graph is a request type, then broken out by node.  And aggregate totals.

And you want median latency per request.  And any app specific stuff you want to know about.
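If you have per-request records, the two numbers he is asking for (requests per second by type, and median latency) fall out of a few lines of Python. This is a sketch only; the record shape of `(request_type, latency_ms)` and the name `summarize` are my assumptions, not something from the talk.

```python
from collections import defaultdict
from statistics import median


def summarize(requests, window_seconds):
    """Group (request_type, latency_ms) records seen during one time window."""
    latencies = defaultdict(list)
    for request_type, latency_ms in requests:
        latencies[request_type].append(latency_ms)
    return {
        request_type: {
            "per_second": len(values) / window_seconds,
            "median_ms": median(values),
        }
        for request_type, values in latencies.items()
    }


if __name__ == "__main__":
    window = [("app1", 10), ("app1", 30), ("app1", 20), ("503", 500)]
    for request_type, stats in summarize(window, window_seconds=2).items():
        print(request_type, stats)
```

Median rather than mean keeps one pathological request from hiding what most users actually saw.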

So get the basic stats, over time, per node, per process.

Understand your baseline so you know what’s ‘really’ a spike.
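He didn’t give a formula for this. One simple way to turn “know your baseline” into code, offered here strictly as an assumption, is to call something a spike only when it sits several standard deviations above recent history. The threshold of 3 and the name `is_spike` are arbitrary choices for the sketch.

```python
from statistics import mean, stdev


def is_spike(history, value, threshold=3.0):
    """True if `value` is more than `threshold` standard deviations above baseline."""
    if len(history) < 2:
        return False  # not enough history to know what normal looks like
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return value > baseline
    return (value - baseline) / spread > threshold


if __name__ == "__main__":
    load_history = [10, 12, 11, 9, 10, 11, 12, 10]
    print(is_spike(load_history, 11))
    print(is_spike(load_history, 30))
```

This only works if the history covers a representative period; a nightly batch job looks like a spike every night unless the baseline includes previous nights.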

Ad hoc tools – try ’em!

  • dstat -cdnml for system characteristics
  • iotop for per process disk IO
  • iostat -x 3 for detailed disk stats
  • netstat -tnp for per process TCP connection stats

His slides and other informative blog posts are at nickgerner.com.

A good bootstrap method… You may want to use more/better tools but it’s a good point that you can certainly do this amount for free with very basic tooling, so something you pay for best be better! I think the “per process” intuition is the best takeaway; a lot of otherwise fancy crap doesn’t do that.

But in the end I want more – baselines, alerting, etc.

Filed under Cloud, Conferences, DevOps