Monthly Archives: June 2010

DevOps Time!

All right!  After the last three days of Velocity 2010, we’ve talked a lot about ops and even hinted at devops, although often in a “recycled from previous Velocity” fashion.  But today it’s time to mainline it with DevOpsDays!

I’m going to be too busy actually participating to do full writeups like I did from Velocity, but I’ll distill down the best takeaways and bring them here as soon as I can.  If you just can’t wait and aren’t here, follow along on twitter at #devopsdays!

Leave a comment

Filed under DevOps

Velocity 2010 – Facebook Performance Shenanigans

Pipelining, Progressive Enhancement, and More: Making Facebook Twice as Fast by Jason Sobel (Facebook), Changhao Jiang (Facebook)

It’s the last session of Velocity already! The companies are tearing down their booths, people are escaping to the airport. Today went really, really fast. The room is still mostly full though!

As we’ve heard before, they have loads of users.  They have a central performance team but also distributed and embedded throughout the company.

The core site speed team started working on PHP speed.  Then they read Steve Souder’s book and realized “Oh, crap…”

They are working on a “perflab” to measure performance impacts of all changes.  And detect regressions.

What are the three things they measure at Facebook?

  1. Server time
  2. Network time
  3. Client/render time

What are we optimizing for?  Shouldn’t be any of those three.  Optimize for people.  That doesn’t even mean end user response time – that means impression of performance.

How fast is Facebook?  Well, determine what the core of the experience is.  What do people look at first, what defines the experience?  Lazy load the rest of that crap.

Metric: Time To Interact (TTI).  It’s a very custom metric.  When is the user getting value out of the site?  This is subjective and requires you to really know your users.

For this, the critical pieces have to be there and have to WORK – you can’t just display it and have the functionality not there yet.  You can’t pick visible but not functional yet.

Techniques used to speed things up:

  1. Early flush.  Get them a list of the crucial elements.
  2. Components.  Pages used the same components with different names, the color blue was defined a thousand times.  Make a reusable set of visual components that can appear on any page and share the same CSS rules.  Besides enforcing visual standards, you can optimize them and then reuse them.  Theirs are a grid, an image block, some buttons, page headers…
  3. JavaScript!  We love it, but it is hard.  They wrote a lot before they knew what they were doing.  They have something called “primer.”  There’s a simple JS library that lives in the head and can bootstrap the rest of the javascript and respond to simple stuff devs were writing over and over again.  An event handler that can do a popup, get and insert content, or do a form submit.  And go get other javascript.  Then you tag something with a rel=”dialog” and it pops a dialog.  And once the page is done you can go get the stuff instead of making it on demand.  “async” gets content. In the feedback interface; Like and View and Delete use it.
  4. BigPipe is an attempt to rethink how we present pages.  The problem is the page generation, network latency, and page rendering being serial.  They render personalized pages and have to query several back end services to make the page.  The page is waiting on the slowest back end query.  So pipeline it out!  Decompose pages into “pagelets” and pipeline them through different execution stages in the server and browser.  They give priorities to different pagelets.
    How does it work?  First you get a nearly empty doc.  In the head, script src bigpipe.js.  Then there are divs on the page with IDs; a template with the logical structure of the page.  For each pagelet, it’s flushed separately in a script tag, JSON encoded.  BigPipe on th  client downloads CSS for the pagelet, displays it, downloads JS, and executes onLoad()s.
    This gave them a 2x improvement in perceived latency (defined by TTI) across all browsers.
    What about search engines?  Well, first of all, for devs to use the pipe, they have to write pagelets, and they have a pagelet abstraction for them to use.  Only has three functions: initialize, prepare, and render.  To pipeline you create a BigPipe instance, specify your page layout and place holders, add pagelets to the pipe (source file and wrapper id) and then call render.  So you can do pipeline, singleflush, parallel, or prepare models.  One parameter in Bigpipe::GetInstance controls it. Use singleflush for search and non-JS stuff.  Preparelets you batch multiple pages.  Parallel lets you use multiple threads for different pagelets (at the cost of server resources!).

Whew!  All this was a success – on Dec 22 they got their goal of making Facebook twice as fast.

Combine with ESIs for even more fun!

Thoughts from the Site Speed team:

To build a culture of performance…  Make tshirts!  They gave shirts to those who made improvements.

  1. Getting the right metrics, that people buy into, then they’ll work on optimizing it.
  2. Build the right abstraction to make the site fast by default, and if devs use them then you are fast without them having to do loads of work.
  3. Partnership.  If other teams are committed you’ll have success.  Find the people that get it and work with them.  Ignore the ignorant.

Final thought – spriting!  He likes’s crazy but the platform is a little broken so you have to do stuff like that.  But let’s fic the platform so you don’t hav eto do crazy stuff.  Fast by default!!!

And that’s a wrap for Velocity 2010!  Next, stuff from DevOpsDays, and my thoughts and reflections on what we’ve learned!

Leave a comment

Filed under Conferences, DevOps

Velocity 2010 – Always Ship Trunk

Always Ship Trunk: Managing Change In Complex Websites by Paul Hammond (Typekit)

No rest for the wicked.  More sessions to write up.  Let’s find out how to do feature switches, Flickr-style.  My comments are in italics.

Use revision control. Branching is sad because of merging.  But Mercurial and git make it all magically delicious.

Revision control is nice but what it doesn’t answer is what is running on a given Web server.

There are three kinds of software.

  1. Installed
  2. Open Source installed
  3. Web apps/SaaS

Web apps are not like installed apps.  Revision control is meant to deal with loads of versions.  With a Web app there’s about 1 version of your app in use.  If you administer every computer your software is installed on, you don’t have to worry about a lot of stuff.  Once you upgrade, the old code will never be run again.  It has a very linear flow.

But not really.  Upgrades don’t happen on every box simultaneously.  And shouldn’t – best practice is rolling to a subset.

And you push to a staging/QA environment first.  So suddenly you have more “installs.”  And beta environments.

You have stuff (dependencies) outside your control – installed library dependencies, Web service dependencies – all that change has to be managed.

Coordinating lots of peopel working at the same time is hard.

Deep thought alert: Nobody knows you just deployed unless you tell them.

You can separate the code deployment from the launch.  You can rewrite your infrastructure and keep the UI the same and no one knows.

Deep thought alert 2: You can run different versions in production at the same time.

Put it out.  Ramp up usage.  Different people can see different UIs and they don’t know.

What we need is a revision control system that lets up manage multiple parallel versions of the code and switch between them at runtime.

Branches don’t solve that problem for us (by themselves).  And they don’t help with dependency changes that affect all branches at once – if someone changes their Web API you call, it affects every version!

revision vs version.

Manage the different versions within your application – “branching in code.”  You know, if statements.

This is really dangerous if you don’t have super duper regression testing right?  I’m rolling a new version but not really…  Good luck on that.

This is the “switch concept.”  It allows for feature testing on production servers.

Join it with cookies and you can have a “feature flip” page!  You can put all kinds of private functionality into the app and rely on whatever if statement you wrote to make sure no one bad gets to it!  Good Lord!

There are benefits to production testing (even if it’s not from end users) – firewall stuff, CDN stuff, et cetera.  It’s very flexible.  You can do dark launches.  Run the code in the background and don’t display it.  Now that’s clever.

There are three types of feature flags

  1. user facing feature development
  2. infrastructure development
  3. kill switches

Disable login!

They have loads of $cfg[‘disable_random_feature’] = false

The cost of this is complexity.

Separate your operational controls from development flags.

Be disciplined about removing unused feature flags so it’s not full of cruft.

If you’re going to do this,  just go all in and always deploy trunk to every server on every deploy and manage versions with config.

Definitely daring.  I wonder if it’s appropriate for more “real” workloads than “I’m uploading my pics to a free service for kicks” though.

Joel Spolsky sayeth:  This is retarded.

With new style distributed merge, instead:

  • Use branches for early development. Branches should be merged into trunk.
  • Use flags for rollout of almost-finished code.

Is there a better alternative?  Everyone who makes revision  control systems makes them for installed software not Web software – what would one for installed software look like?

Q&A Tidbits: Put all the switches in one place… Not spread through the code.

What about Sarbanes/Oxley division of labor?  Pshaw.  This is for apps that are just for funsies.

You have to build some culture stuff to about devs not jsut hitting deploy and wandering off, but following up on production state.

1 Comment

Filed under Conferences, DevOps

Velocity 2010 – Performance Indicators In The Cloud

Common Sense Performance Indicators in the Cloud by Nick Gerner (SEOmoz)

SEOmoz has been  EC2/S3 based since 2008.  They scaled from 50 to 500 nodes.  Nick is a developer who wanted him some operational statistics!

Their architecture has many tiers – S3, memcache, appl, lighttpd, ELB.  They needed to visualize it.

This will not be about waterfalls and DNS and stuff.  He’s going to talk specifically about system (Linux system) and app metrics.

/proc is the place to get all the stats.  Go “man proc” and understand it.

What 5 things does he watch?

  • Load average – like from top.  It combines a lot of things and is a good place to start but explains nothing.
  • CPU – useful when broken out by process, user vs system time.  It tells you who’s doing work, if the CPU is maxed, and if it’s blocked on IO.
  • Memory – useful when broken out by process.  Free, cached, and used.  Cached + free = available, and if you have spare memory, let the app or memcache or db cache use it.
  • Disk – read and write bytes/sec, utilization.  Basically is the disk busy, and who is using it and when?  Oh, and look at it per process too!
  • Network – read and write bytes/sec, and also the number of established connections.  1024 is a magic limit often.  Bandwidth costs money – keep it flat!  And watch SOA connections.

Perf Monitoring For Free

  1. data collection – collectd
  2. data storage- rrdtool
  3. dashboard management – drraw

They put those together into a dashboard.  They didn’t want to pay anyone or spend time managing it.  The dynamic nature of the cloud means stuff like nagios have problems.

They’d install collectd agents all over the cluster.  New nodes get a generic config, and node names follow a convention according to role.

Then there’s a dedicated perf server with the collectd server, a Web server, and drraw.cgi.  In a security group everyone can connect in to.

Back up your performance data- it’s critical to have history.

Cloudwatch gives you stuff – but not the insight you have when breaking out by process.  And Keynote/Gomez stuff is fine but doesn’t give you the (server side) nitty gritty.

More about the dashboard. Key requirements:

  • Summarize nodes and systems
  • Visualize data over time
  • Stack measurements per process and per node
  • Handle new nodes dynamically w/o config chage

He showed their batch mode dashboard.  Just a row per node, a metric graph per column.  CPU broken out by process with load average superimposed on top.  You see things like “high load average but there’s CPU to spare.”  Then you realize that disk is your bottleneck in real workloads.  Switch instance types.

Memory broken out by process too.  Yay for kernel caching.

Disk chart in bytes and ops.  The steady state, spikes, and sustained spikes are all important.

Network – overlay the 95th percentile cause that’s how you get billed.

Web Server dashboard from an API server is a little different.

Add Web requests by app/request type.  app1, app2, 302, 500, 503…  You want to see requests per second by type.

mod_status gives connections and children idleness.

System wide dashboard.  Each graph is a request type, then broken out by node.  And aggregate totals.

And you want median latency per request.  And any app specific stuff you want to know about.

So get the basic stats, over time, per node, per process.

Understand your baseline so you know what’s ‘really’ a spike.

Ad hoc tools -try ’em!

  • dstat -cdnml for system characteristics
  • iotop for per process disk IO
  • iostat -x 3 for detailed disk stats
  • netstat -tnp for per process TCP connection stats

His slides and other informative blog posts are at

A good bootstrap method… You may want to use more/better tools but it’s a good point that you can certainly do this amount for free with very basic tooling, so something you pay for best be better! I think the “per process” intuition is the best takeaway; a lot of otherwise fancy crap doesn’t do that.

But in the end I want more – baselines, alerting, etc.

Leave a comment

Filed under Cloud, Conferences, DevOps

Velocity 2010 – Grendel

Protecting “Cloud” Secrets With Grendel by Sam Quigley (Square, Inc) and Coda Hale (Yammer, Inc.)

Everyone stores private data.  Passwords, credit cards, documents, etc.  But also personal conversations, personal histories, usage patternns – that’s all private too.  So you store private info – yes you – so how do you protect it?  Firewalls and VPNs?  Passwords?  Bah.  They are useful against last decade’s attacks.

Application level attacks are the new hotness – see the OWASP Top 10.  What you want to do is encryption.  But that’s complex.  Veracode has analyzed a lot of apps and crypto problems are the #1 problem.

What do we do?  Here’s some ideas.


It is a secure document storage system. Open and does minimal/simple.  It does data storage, authentication, and access control using the OpenPGP message format and a RESTful interface, it’s in Java, and uses a normal DB backend.

OpenPGP – mature, flexible.  It’s for confidentiality and integrity.  It uses asymmetric keys.  The keys are stored encrypted with passphrases.  The keys are used to encrypt documents to one or more recipients.

REST API – http native.  Why REST?  For all the reasons everyone uses REST.  Ubiquitous, well understood, simple, easily debugged (charles), free features.

Java 1.6 + RDBMS.  Java because it’s fast and stable and well understood.  Uses hibernate.  RDBMS because you already have one.

Grendel is simple.  One config file.  DB location and password and some c3p0 stuff.

java -jar grendel.jar schema -c

generates a schema.  Three tables; users, documents, and links.

java -jar grendel.jar server -c -p 8080

starts it.

The API has users, docs, links, and linked docs.  JSON based.

You can create a user, which makes a new key set behind the scenes.

You can store a document.  PUT /users/name/documents/docname with a basic auth header.  It decrypts the user’s keys, signs and encrypts the doc, and stores it.

GET /users/name/documents gets you a JSON list.  Or get the document and you get the document (duh).

Then you can link the document to another user to share it with them.

So what’s the big deal?

Self defending data.  The data itself enforces the access control rules.  And business logic is enforced with math.

He didn’t even mention the brilliance of this related to scenarios like things like subpoenas causing Amazon to give up your S3 data to people…

Authentication done right.  It’s hard to do it right.  Adaptive hashing.  A centralized service model.  Resistant to modern attacks.

It makes it “sudo for the Web.”  You can grant long lived session coolies, and re-auth for privileged access.  Yeah, we do that in general not with encryption…  Like remembers you but when it’s purchase time you have to reauth securely.

It also mitigates XSS/CSRF attacks, kinda.

This creates a privacy wall.  You the admin are locked out of the data.  Insider threat defeated.

In the future…

Support for sessions.  OAuth 2.0.  And spreading the idea in general!

How is this better than symmetric encryption with the user’s password?  Since you’re proxying it anywhere.  Because you can’t share then.

I guess one downside is that you can’t see inside the docs to search index, etc.

You could use client side certs instead of passwords right?  No.

Does it have support for password change?  Yes.

I personally am psyched about this – I think we have a product underway that could really benefit from using it.

Leave a comment

Filed under Cloud, Conferences, DevOps, Security

Velocity 2010 – Memcached Scalability

After lunch, we start off with Hidden Scalability Gotchas in Memcached and Friends by Neil Gunther (Performance Dynamics Company), Shanti Subramanyam (Oracle Corporation), and Stefan Parvu (Oracle Finland).

Scaling up versus scaling out.  Bigger or more.  There is no “best approach” – you need to be quantitative, with controlled measurements and numbers to see the cost-benefit.

Data isn’t information, you need to transform it.  Capacity planning  has “planning” in it.  Like with finance, you need a model.  Metrics + models = information.

Controlled Measurements

You want to take measurements in a known environment with a specific workload.  Using production time series data is like predicting the stock market using Google finance graphs.

You need throughput measured in steady state.  No load vs vusers with varying throughput…

So they did some controlled tests.

Memcached scaling is thread limited.  Past about 4-6 threads, throughput levels off.

By using a SPARC multicore friendly hash map patch, it did scale up to maybe 30 threads.

Quantifying Scalability

1.  Equal bang for the buck.  Ideal parallelism is a linear load vs capacity graph, but really it plateaus and degenerates at some point. But there’s an early art of the graph that looks like this.

2.  Cost of sharing resources – when the curve falls away from linear.

3.  Resource limitation – where the curve tops out (Amdahl’s law)

4.  Degradation/negative return – more capacity makes things worse after a point.

Formula: c(N)=N/(1+a(N-1)+bN(N-1))

N is the number of threads.  1=concurrency, a=contention , b= coherency

Run it through excel USL analysis and calculate a and b.

As memcached versions came out, the concurrency was improved, but N didn’t budge.  People can say they make improvements, but if it doesn’t affect the data, then bah.

Anyway, the model is semi predictive but not perfect.  If you know whether your problem is a contention (like queuing) or coherence (like point to point transfers) issue you know what to look for in your code.

Memcached Gotchas

Throw more hardware at it!  Well, current strategies are around old cheap hardware with single CPUs.  As multicore arrives, if you can’t use all the cores, you won’t utilize your hardware fully.

As memcached is thread limited it’ll be a problem on multicore.

Take controlled measurements with steady state throughput to analyze data.

Quantify scalability using a model.  Reduce contention and coherency.

Follow them at:

There’s a lot of discussion about the model predictability because they had a case where the model predicted one thing until there were higher order data points and then it changed.  The more data, the more the model works – but he stresses you need to trust the model with what data you have.  You’re not predicting, you’re explaining with the model.  It’s not going to tell you exactly what is wrong…  Lots of questions, people are mildly confused.

Leave a comment

Filed under Conferences, DevOps

Velocity 2010 – Day 3 Demos and More

Check out the Velocity 2010 flickr set!  And YouTube channel!

Time for lightning demos.


A HTTP browser proxy that does the usual waterfalls from your Web pages.  Version 7 is out!  You can change fonts.  They work in IE and Firefox both, unlike the other stuff.

Rather than focus on ranking like YSlow/PageSpeed, they focus on showing specific requests that need attention.  Same kind of data but from a different perspective.  And other warnings, not all strictly perf related.  Security, etc.  Exports and consumes HAR files (the new standard for http waterfall interchange).

Based on AOL Pagetest, an IE module, but hosted.  Can be installed as a private site too.  It provides obejct timing and waterfalls.  Allows testing from multiple locations and network speeds, saved them historically. Like a free single-snapshot version of Keynote/Gomez kinds of things.

Shows stuff like browser CPU and bandwidth utilization and does visual comparisons, showing percieved performance in filmstrip view and video.

And does HAR  import/export!  Ah, collaboration.

The CPU/net metrics show some “why” behind gaps in the waterfalls.

The filmstrip side by side view shows different user experiences very well.  And you can do video, as that is sexy.

They have a new UI built by a real designer (thanks Neustar!).

Speed Tracer

But what about Chrome, you ask?  We have an extension for that now.  Similar to PageSpeed.  The waterfall timeline is beautiful, using real “Google Finance” style visualization.  The other guys aren’t like RRDTool ugly but this is super purty.

It will deobfuscate JavaScript and integates it with Eclipse!

They’re less worried about network waterfall and more about client side render.  A lot of the functionality is around that.

You can send someone a Speed Trace dump file for debug.


Are you tired of being browser dependent?  Fiddler has your back.

New features…  Hey, platform preview 3 for IE9 is out.  It has some tools for capture and export.; it captures traffic in a XML serialised HAR.  Fiddler imports JSON HAR from norms and the IE9 HAR!  And there’s HAR 1.1!  Eeek.  And wcat.  It imports lots of different stuff in other words.

I want one of these to take in Wireshark captures and rip out all the crap and give me the HTTP view!

FiddlerCap Web recorder ( lets people record transactions and send it to you.

Side by side viewingwith 2 fiddlers  if you launch with -viewer.

There’s a comparison extension called differ.  Nice!

You can replay captures, including binarieis now, with the AutoResponder tab.  And it’ll play latency soon.

We still await the perfect HTTP full capture and replay toolchain… We have our own HTTP log replayer we use for load tests and regression testing, if we could do this but in volume it would rock…

Caching analysis.  FiddlerCore library you can put in your app.

Now, Bobby Johnson of Facebook speaks on Moving Fast.

Building something isn’t hard, but you don’t know how people will use it, so you have to adapt quickly and make faster changes.

How do you get to a fast release cycle?  Their biggest requirement is “the site can’t go down.”  So they go to frequent small changes.

Most of the challenge when something goes wrong isn’t fixing it, it’s finding out what went wrong so you can fix it.  Smaller changes make that easier.

They take a new thing, push some fake traffic to it, then push a % of user traffic, and then dial back if there’s problems.

If you aren’t watching something, it’ll slip.  They had their performance at 5s; did a big improvement and got it to 2.5.  But then it slips back up.  How do you keep it fast (besides “don’t change anything?”)  They changed their organization to make quick changes but still maintain performance.

What makes a site get slow?  It’s not a deep black art.  A lot of it is jus tnot paying attention to your performance – you can’t foresee new code to not have bugs or not have performance problems.

  • New code is slow
  • More code is slow

“Age the code” by allocating time to shaking out performance problems – not just before, but after deploy.

  • Big pipe – break the page into small pieces and pipelines it.
  • Primer – a JavaScritp library that bootstraps by dl’ing the minimum first.

Both separate code into a fast path and a slow path and defaults to the slow path.

I have a new “poke” feature I want to try… I add it in as a lazy loaded thing and see if anyone cares, before I spend huge optimization time.

It gets popular!  OK, time to figure out performance on the fast path.

So they engineer performance per feature, which allows prioritization.

You can have a big metric collection and reporting tool.  But that’s different from alerting tools.  Granularity of alerting – no one wants to get paged on “this one server is slow.”  But no one cares about “the whole page is 1% slower” either.  But YOUR FEATURE is 50% slower than it was – that someone cares about.  Your alert granularity needs to be at the level of “what a single person works on.”

No one is going to fix things if they don’t care about it.  And also not unless they have control over it (like deploying code someone else wrote and being responsible for it breaking the site!). And they need to be responsible for it.

They have tried various structures.  A centralized team focused on performance but doesn’t have control over it (except “say no” kinds of control).

Saying “every dev is responsible for their perf” distributes the responsibility well, but doesn’t make people care.

So they have adopted a middle road.  There’s a central team that builds tools and works with product teams.  On each product team there is a performance point person.  This has been successful.

Lessons learned:

  • New code is slow
  • Give developers room to try things
  • Nobody’s job is to say no

Joshua from Strangeloop on The Mobile Web

Here we all know that performance links directly to money.

But others (random corporate execs) doubt it.  And what about the time you have to spend?  And what about mobile?

We need to collect more data

Case study – with 65% performance increase, 6% order size increase and 9% conversion increase.

For mobile, 40% perf led to 3% order size and 5% conversion.

They have a conversion rate fall-off by landing page speed graph, so you can say what a 2 second improvement is worth.  And the have preliminary data on mobile too.

I think  he’s choosing a very confusing way to say you need mttrics to establish the ROI of performance changes.  And MOBILE IS PAYING MONEY RIGHT NOW!

Cheryl Ainoa from Yahoo! on Innovation at Scale

The challenges of scale – technical complexity and outgrowing many tools and techniques, there are no off hours, and you’re a target for abuse.

Case Study: Fighting Spam with Hadoop

Google Groups was sending 20M emails/day to Taiwan and there’s only 18M Internet users in Taiwan.  What can help?  Nothing existing could do that volume (spamcop etc.)  And running their rules takes a couple days.  So they used hadoop to drive indications of “spammy” groups in parallel. Cut mail delivered by 5x.

Edge pods – small compute footprints to optimize cost and performance.  You can’t replicate your whole setup globally.  But adding on to a CDN is adding some compute capability to the edge in “pods.”  They have a proxy called YCPI to do this with.

And we’re out of time!

Leave a comment

Filed under Conferences, DevOps

Velocity 2010 – Day 3 Keynotes

Ohhh my aching head.  Apparently this is a commonly held problem, as the keynote hall is much more sparsely attended at 8:30 AM today than it was yesterday.  Some great fun last night, we hung with the Cloudkick and Turner Broadcasting guys, drinking and playing the “Can we name all the top level Apache projects” game.

It’s time for another morning of keynotes.  Presentations from yesterday should be appearing on their schedule pages (I link to the schedule page in each blog post, so they should only be a click away).  As always, my own comments below will be set off in italics to avoid libel suits.

First, we have John Rauser from Amazon on Creating Cultural Change.

Since we are technologists and problemsolvers, we of course tend to try to solve problems with technology – but many of our biggest, hardest problems are cultural.

It’s very true for Operations and performance because their goals are often in tension with other parts of the business.  Much like security, legal, and other “non-core” groups. And it’s easy to fall into an adversarial relationship when there are “two sides.”  Having a dedicated ops team is somewhat dangerous for this reason.  So you need to ingrain performance and ops into your org’s mentality.  Idyllic?  Maybe, but doable.

If you determine someone is a bad person, like the coffee “free riders” that take the last cup and don’t make more, you have a dilemma.  Complaining about “free riders” doesn’t work.  Nagging, shaming, etc. same deal  He had a friend that put in some humorous placards that marketed the “product” of making coffee.  And it worked.

Sasquatch dancing guy!  I don’t have the heart to go into it, just Google the video.  Anyway, people join in when there’s unabashed joy and social cover.  If you’re cranky, you add social cover for people to be cranky.  Welcome newcomers.  Lavish praise.  Help them succeed.  “Treat your beta testers as your most valuable resource and they will respond by becoming your most valuable resource.”

Shreddies vs Diamond Shreddies!  Rebranding and perception change.  Is DevOps our opportunity to turn “infrastructure people” into “agile admins?”

Anyway, be relentlessly happy and joyful.  I know I have positivity problems and I definitely look back and say “outcomes would have been better if I didn’t succumb to the temptation to bitch about those “dumbass programmers”…

1.  Try something new.  A little novelty gets through people’s mental filters.  If you’ve tried without metrics, try with.  If you’ve tried metrics, try business drivers.  If you tried that, pull out a single user session and simulate what it’s like.

2. Group identity.  Mark people as special.  Badges!  Authority!  Invite people to review their next project with an “ops expert” or “perf expert.”

3.  Be relentless.  Sending an email and waiting is a chump play.  And be relentlessly happy.

There was a lot of wisdom in this presentation.  As a former IT manager trying to run a team with complex relationships with other infrastructure teams, dev teams, and business teams, times where we hewed to this kind of theory things tended to work, and when we didn’t they tended not to.

Operations at Twitter by John Adams

What’s changed since they spoke last year?  They’ve made headway on Rails performance, more efficient use of Apache, many more servers, lbs, and people.  Up to 210 employees.

One of my questions about the devops plan is how to scale it – same problem agile development has.

More and more it’s API.  75% of the traffic to Twitter is API now.  160k registered apps, 100M searches a day, 65M tweets per day.

They’re trying to work on CM and other stuff.  Scaling doesn’t work the first time – you have to rebuild (refactor, in agile speak).  They’re doing that now.

Shortening Mean Time To Detect Problems drives shorter Mean Time To Recovery.

They continuously evaluate looking for bottlenecks – find the weakest part, fix it, and move on to the next in an iterative manner.

They are all about metrics using ganglia, and feed it to users on ans

Don’t be a “systems administrator” any more.  Combine statistical analysis and monitoring to produce meaningful results.  Make decisions based on data not gut instincts.

They’re working on low level profiling of Apache, Ruby, etc. Network –  Latency, network usage, memory leaks.  tcpdump + tcpdstat, yconalyze  Apps – Introspect with google perftools

Instrumenting the world pays off.  Data analysis and visualization are necessary skills nowadays.

Rails hasn’t really been their performance problem.  It’s front end problems like caching/cache invalidation problems, bad queries generated by ActiveQuery, garbage collection (20% of the issues!), and replication lag.

Analyze!  Turn data into information.  Understand where the code base is going.

Logging!  Syslog doesn’t work at scale.  No redundancy, no failure recovery.  And moving large files is painful.  They use scribe to HDFS/hadoop with LZO compression.

Dashboard – Theirs has “criticals” view (top 10 metrics), smokeping/mrtg, google analytics (not just for 200s!), XML feeds from managed services.

Whale Watcher – a shell script that looks for errors in the logs.

Change management is the stuff.  They use Reviewboard and puppet+svn.  Hundreds of modules, runs constantly.    It reuses tools that engineers use.

And Deploywatcher, another script that stops deploys if there’s system problems.  They work a lot on deploy.  Graph time of day next to CPU/latency.

They release features in “dark mode” and have on/off switches. Especially computationally/IO heavy stuff.  Changes are logged and reported to all teams (they have like 90 switches).  And they have a static/read-only mode and “emergency stop” button.

Subsystems!  Take a look at how we manage twitter.

loony – a central machine database in mySQL.  They use managed hosting so they’re always mapping names.  Python, django, paraminko SSH (twitter’s OSS SSH library).  Ties into LDAP.  When the data center sends mail, machine definitions are built in real time.  On demand changes with “run.”  Helps with deploy and querying.

murder – a bittorrent based deploy client, python+libtorrent.

memcached – network memory bus isn’t infinite.  Evictions make the cache unreliable for important configs.  They segment into pools for better performance.  Examine slab allocation and watch for high use/eviction with “peep. ”  Manage it like anything else.

Tiers – load balancer, apache, rails (unicorn), flock DB.

Unicorn is awesome and more awesome than mongrel for Rails guys.

Shift from Proxy Balancer to ProxyPass (the slide said PP to PB, but he spoke about it the other way, and putting our heads together we believe the latter more) Apache’s not better than nginx, it’s the proxy.

Aynchronous requests.  Do it.  Workers are expensive.  The request pipeline should not be used to handle third party communication or back end work.  Move long running work to daemons whenever possible.  They’re moving more parts to queuing

kestrel is their queuing server that looks like memcache.  set/get.

They have daemons, in fact have been doing consolidation (not one daemon per job, one per many jobs).

Flock DB shards their social graph through gizzard and stores it in mySQL.

Disk is the new tape

Caching – realtime but heck, 60 seconds is close.  Separate memcache pools for different types.  “Cache everything” is not the best policy – invalidation problems, cold memcache problems.  Use memcache to augment the database.  You don’t want to go down if you lose memcache, your db still needs to handle the load.

MySQL challenges – replication delay.  And social networks don’t fit RDBMS well. Kill long running SQL queries with mkill.  Fail fast!!!

He’s going over a lot of this too fast to write down but there’s GREAT info.  Look for the slides later.  It’s like drinking from a fire hose.

In closing –

  • Use CM early
  • log everything
  • plan to build everything more than once
  • instrument everything and use science

Tim Morrow from ShopZilla on Time Is Money

ShopZilla spoke previous years about their major performance redesign and the effects it had.  It opened their eyes to the $$ benefits and got them addicted to performance.

Performance is top in Fred Wilson’s 10 Golden Rules For Successful Web Apps.  And Google/Microsoft have put out great data this year on performance’s link to site metrics.

Performance can slip away if you take your eyes off the ball.  It doesn’t improve if left alone.

They took their eye off the ball because they were testing back end perf but not front end (and that’s where 80% of the time is spent).  Constant feature development runs it up.  A/B testing needs framework that adds JS and overhead.

It’s easier to attack performance early in the dev cycle, by infecting everyone with a performance mindset.

They put together a virtual performance team and went for more measurements.  Nightly testing using httpwatch and YSlow and stuff.  All these “one time” test tools really need to all be built into a regression rig.

They found they were doing some good things but found things they could fix.  Progressive rendering 8k chunks were too large and they set tomcat to smaller flush intervals.  Too many requests. Bandwidth contention with less critical page elements.

They wanted to focus on the perception of performance via progressive rendering, defer less important stuff.  Flushing faster got the header out quicker. They reordered stuff.   It improved their conversion rate by .4%.  Doesn’t’ sound like much but it’s comparable to a feature release, and they had a 2 month ROI given what they put into it.

They did an infrastructure audit looking for hotspots and underutilization, and saved a lot in future hardware costs ($480k).

Performance is an important feature, but isn’t free, but has a measurable value.  ROI!

Imad Mouline from Compuware on Performance Testing

Actually Gomez, which is now “the Web performance division of Compuware”.  He promises not to do a product pitch, but to share data.

Does better performance impact customer behavior and the bottom line?  They looked at their data.  Performance vs page abandonment.  If you improve your performance, abandon rate goes down by large percentages per second of speedup.

The Web is the new integration platform and the browser’s where the apps come together.  How many hosts are hit by the browser per user transaction on average?  8 or higher, across all industries and localities.

What percent of Web transactions touch Amazon EC2 for at least one object?  Like 20%.  It is going up hideously fast (like 4% in the last month).

Cloud performance concerns – loss of visibility and control especially because existing tools don’t work well.  And multitenant means “someone might jack me!”

They put the same app over many clouds and monitored it!  They have a nice graph that shows variance; I can’t find it with a quick Google search though, otherwise I’d link it here for you.  And availability across many clouds is about 99.5%.

How do you know if the problem is the cloud or you?  They put together to show off these apps and performance.  You can put in your own URL and soon you’ll get data on “is the cloud messed up, or is it you?”

Domain sharding is a common performance optimization.  With S3 you get that “for free” using buckets’ DNS names and you can get a big performance speedup.

The cloud allows dynamic provisioning… Yeah we know.

But… Domain sharding fails to show a benefit on modern browsers.  In fact, it hurts.

At NI we recently did a whole analysis of various optimizations and found exactly this – domain sharding didn’t help and in fact hurt performance a bit.  We thought we might have been crazy or doing something wrong.  Apparently not.

They can see the significant performance differences among browsers/devices.

You have to test and validate your optimizations.  Older wisdom (like “shard domains”) doesn’t always hold any more.

Check some of this stuff out at!

Coming up – lightning demos!

Leave a comment

Filed under Conferences, DevOps

Velocity 2010 – Dueling Cloud Management Suppliers

Two cloud systems management suppliers talk about their bidness!  My comments in italics.

Cloud Autoscaling in Enterprise Computing by George Reese (enStratus Networks LLC)

How the Top Social Games Scale on the Cloud by Michael Crandell (RightScale, Inc)

I am more familiar with RightScale, but just read Reese’s great Cloud Application Architectures book on the plane here.  Whose cuisine will reign supreme?


Reese starts talking about “naive autoscaling” being a problem.  The cloud isn’t magic; you have to be careful.  He defines “enterprise” autoscaling as scaling that is cognizant of financial constraints and not this hippy VC-funded twitter type nonsense.

Reactive autoscaling is done when the system’s resource requirements exceed demand.  Proactive autoscaling is done in response to capacity planning – “run more during the day.”

Proactive requires planning.  And automation needs strict governors in place.

In our PIE autoscaling, we have built limits like that into the model – kinda like any connection pool.  Min, max, rate of increase, etc.

He says your controls shouldn’t be all “number of servers,” but be “budget” based.  Hmmm.  That’s ideal but is it too ideal?  And so what do you do, shut down all your servers if you get to the 28th of the month and you run out of cash?

CPU is not a scaling metric. Have better metrics tied to things that matter like TPS/response time.  Completely agree there; scaling just based on CPU/memory/disk is primitive in the extreme.

Efficiency is a key cloud metric.  Get your utilization high.

Here’s where I kinda disagree – it can often be penny wise and pound foolish.  In the name of “efficiency” I’ve seen people put a bunch of unrelated apps on one server and cause severe availability problems.  Screw utilization.  Or use a cloud provider that uses a different charging model – I forget which one it was, but we had a conf call with one cloud provider that only charged on CPU used, not “servers provisioned.”

Of course you don’t have to take it to an extreme, just roll down to your minimum safe redundancy number on a given tier when you can.

Security – well, you tend not to do some centralized management things (like add to Active Directory) in the cloud.  It makes user management hard.  Or just makes you use LDAP, like God intended.

Cloud bursting – scaling from on premise into the cloud.

Case study – a diaper company.  Had a loyalty program.  It exceeded capacity within an hour of launch.  Humans made a scaling decision to scale at the load balancing tier, and enStratus executed the auto-scale change.  They checked it was valid traffic and all first.

But is this too fiddly for many cases?  If you are working with a “larger than 5 boxes” kind of scale don’t you really want some more active automation?


The RightScale blog is full of good info!

They run 1.2 million cloud servers!  hey see things like 600k concurrent users, 100x scaling in 4 days, 15k instances, 1:2000 management ratio…

Now about gaming and social apps.  They power the top 10 Facebook apps.  They are an open management environment that lives atop the cloud suppliers’ APIs.

Games have a natural lifecycle where they start small, maybe take off, get big, eventually taper off.  It’s not a flat demand curve, so flat supply is ‘tarded.

During the early phase, game publishers need a cheap, fast solution that can scale.  They use Chef and other stuff in server templates for dynamic boot-time configuration.

Typically, game server side tech looks like normal Web stuff!  Apache+HAproxy LB, app servers, db cache (memcached), db (sharded mySQL master/slave pairs).  Plus search, queues, admin, logs.

Instance types – you start to see a lot of larger instances – large and extra large.  Is this because of legacy comfort issues?  Is it RAM needs?

CentOS5 dominates!  Generic images, configured at boot.  One company rebundles for faster autoscale.  Not much ubuntu or Windows.  To be agile you need to do that realtime config.

A lot of the boxes are used for databases.  Web/app and load balancing significant too.  There’s a RightScale paper showing a 100k packets per second LB limit with Amazon.

People use autoscaling a lot, but mainly for web app tier.  Not LBs because the DNS changing is a pain.  And people don’t autoscale their DBs.

They claim a lot lower human need on average for management on RightScale vs using the APIs “or the consoles.”  That’s a big or.  One of our biggest gripes with RightScale is that they consume all those lovely cloud APIs and then just give you a GUI and not an API.  That’s lame.  It does a lot of good stuff but then it “terminates” the programmatic relationship. [Edit: Apparently they have a beta API now, added since we looked at them.]

He disagrees with Reese – the problem isn’t that there is too much autoscaling, it’s that it has never existed.  I tend to agree. Dynamic elasticity is key to these kind of business models.

If your whole DB fits into memcache, what is mySQL for?  Writes sometimes?  NoSQL sounds cool but in the meantime use memcache!!!

The cloud has enabled things to exist that wouldn’t have been able to before.  Higher agility, lower cost, improved performance with control, anew levels of resiliency and automation, and full lifecycle support.

1 Comment

Filed under Cloud, Conferences, DevOps

Velocity 2010 – Infrastructure Philharmonic

Whew!  This is a marathon.  Next, we have John Willis and Damon Edwards on The Infrastructure Philharmonic: How Out of Tune are Your Operations? I really wanted to see Lenny’s talk as well but had to make a hard decision.  Shout out to him, rocks!  As always, my personal comments will be in italics below.

Note – download the DevOps Cafe podcasts by John and Damon and give them a listen!  They rock!

What separates high and low performing IT organizations?  If you’re average, the leader is 2-3x better than you.  There are a lot of “good” qualities that are found in both.

What do high performing organizations specifically share?

Pretty simple:

  • Seeing the whole – holistic vision, common goals.
  • Tune the organization for maximum business agility.

What gets in the way?

Specialization.  We value deep specialization because that gets us paid more. And allows people to act like a-holes with impunity.

Why change?

Competition, durrr.

And on a personal basis, the new specialization is integration and people want that.

Our Analogy: The Philharmonic

Highly skilled individual contributors that need to contribute to a seamless whole.

  1. The sponsors – business & marketing.
  2. The musicians – network, systems, database, etc etc.
  3. The audience – users!
  4. The conductor – leadership.  Coordination, bridging.

I like them talking about the conductor as an occasional manager myself. Seems like a lot of the time people talk about this stuff like it should  magically emerge from among peers and that’s not the way the world works.

Antipatterns – But sometimes there’s not one conductor – there’s a dev manager and an ops manager.  And if the person responsible for the full lifecycle is more than 3 degrees away from the actual process, it doesn’t work.

An IT organization’s musical “ear” evaluates output, shares understanding of goals, impacts individual decisions, and is tuned for your specific business needs.

Antipatterns – individual focus, script based, limited reusability, etc.

Good patterns – ops as code, team focus, reusability, method/process, sournce control.

Developing your “ear” starts with what you can measure.  “Ear” isn’t all subjective, there’s some science to it.  You need to start at the top with measurements that are meaningful to the business.

Antipattern – There’s not enough visibility!  Time for a metrics project!  We’ll get a sea of information and suddenly via BI or something we’ll get the Matrix.  But really you end up with a bunch of crap data.

We did this at NI just recently!

Measurement: a set of observations that reduce uncertainty where the result is expressed as a quantity.  Doesn’t have to be perfectly precise.

It’s about the high level KPIs and not the low level metrics.  Start with THREE TO FIVE.  Don’t be a fool with your life.  To get KPIs:

  1. Step 1 – Get everyone together and pick ’em
  2. Step 2 – Tie back to lower level metrics
  3. Step 3 – Tie back to performance (and compensation!)
  4. Step 4 – Profit

High performing organizations:

1.  Automate as a way of life.  Check out the Tale of Two Startups post by Jesse Robbins on Radar.

Infrastructure as code requires:

  • Provisioning (gimme boxes!)
  • Configuration Management (add roles!)
  • Systems Integration/Orchestration (crossconnect!)

2.  Test as a way of life.  Testing as a skill -> testing as a culture -> quality as a culture.  Check out what kaChing does.  They built a “business immune system” with testing and monitoring deviation – in INTERVIEWS they have people write code to go to production.

3.  DevOps culture.  Get past the disconnects in culture, tools, and process.

Batching up deploys turns your agile dev into a waterfall result.  DON’T BE A FOOL WITH YOUR LIFE!!!

In my previous role in IT, I kept hearing some higher-ups wanting “fewer releases.”  Why release more than once a quarter?  Those monthly releases are so costly!  I always tried to not cuss people out when I heard it.  If your new code is so worthless it can wait an extra two months to go out, just don’t roll it out and save us all some hassle.

Operations wants…

To get out of the muck!  People want to add value and implement things and not just fight fires.  We’re in an explosion right now where ops gets to come out and play!  We want to be agile and say “yes”!



Filed under Conferences, DevOps