Monthly Archives: June 2010

Velocity 2010 – Day 3 Keynotes

Ohhh my aching head.  Apparently this is a commonly held problem, as the keynote hall is much more sparsely attended at 8:30 AM today than it was yesterday.  Some great fun last night, we hung with the Cloudkick and Turner Broadcasting guys, drinking and playing the “Can we name all the top level Apache projects” game.

It’s time for another morning of keynotes.  Presentations from yesterday should be appearing on their schedule pages (I link to the schedule page in each blog post, so they should only be a click away).  As always, my own comments below will be set off in italics to avoid libel suits.

First, we have John Rauser from Amazon on Creating Cultural Change.

Since we are technologists and problemsolvers, we of course tend to try to solve problems with technology – but many of our biggest, hardest problems are cultural.

It’s very true for Operations and performance because their goals are often in tension with other parts of the business.  Much like security, legal, and other “non-core” groups. And it’s easy to fall into an adversarial relationship when there are “two sides.”  Having a dedicated ops team is somewhat dangerous for this reason.  So you need to ingrain performance and ops into your org’s mentality.  Idyllic?  Maybe, but doable.

If you determine someone is a bad person, like the coffee “free riders” that take the last cup and don’t make more, you have a dilemma.  Complaining about “free riders” doesn’t work.  Nagging, shaming, etc. same deal  He had a friend that put in some humorous placards that marketed the “product” of making coffee.  And it worked.

Sasquatch dancing guy!  I don’t have the heart to go into it, just Google the video.  Anyway, people join in when there’s unabashed joy and social cover.  If you’re cranky, you add social cover for people to be cranky.  Welcome newcomers.  Lavish praise.  Help them succeed.  “Treat your beta testers as your most valuable resource and they will respond by becoming your most valuable resource.”

Shreddies vs Diamond Shreddies!  Rebranding and perception change.  Is DevOps our opportunity to turn “infrastructure people” into “agile admins?”

Anyway, be relentlessly happy and joyful.  I know I have positivity problems and I definitely look back and say “outcomes would have been better if I didn’t succumb to the temptation to bitch about those “dumbass programmers”…

1.  Try something new.  A little novelty gets through people’s mental filters.  If you’ve tried without metrics, try with.  If you’ve tried metrics, try business drivers.  If you tried that, pull out a single user session and simulate what it’s like.

2. Group identity.  Mark people as special.  Badges!  Authority!  Invite people to review their next project with an “ops expert” or “perf expert.”

3.  Be relentless.  Sending an email and waiting is a chump play.  And be relentlessly happy.

There was a lot of wisdom in this presentation.  As a former IT manager trying to run a team with complex relationships with other infrastructure teams, dev teams, and business teams, times where we hewed to this kind of theory things tended to work, and when we didn’t they tended not to.

Operations at Twitter by John Adams

What’s changed since they spoke last year?  They’ve made headway on Rails performance, more efficient use of Apache, many more servers, lbs, and people.  Up to 210 employees.

One of my questions about the devops plan is how to scale it – same problem agile development has.

More and more it’s API.  75% of the traffic to Twitter is API now.  160k registered apps, 100M searches a day, 65M tweets per day.

They’re trying to work on CM and other stuff.  Scaling doesn’t work the first time – you have to rebuild (refactor, in agile speak).  They’re doing that now.

Shortening Mean Time To Detect Problems drives shorter Mean Time To Recovery.

They continuously evaluate looking for bottlenecks – find the weakest part, fix it, and move on to the next in an iterative manner.

They are all about metrics using ganglia, and feed it to users on ans

Don’t be a “systems administrator” any more.  Combine statistical analysis and monitoring to produce meaningful results.  Make decisions based on data not gut instincts.

They’re working on low level profiling of Apache, Ruby, etc. Network –  Latency, network usage, memory leaks.  tcpdump + tcpdstat, yconalyze  Apps – Introspect with google perftools

Instrumenting the world pays off.  Data analysis and visualization are necessary skills nowadays.

Rails hasn’t really been their performance problem.  It’s front end problems like caching/cache invalidation problems, bad queries generated by ActiveQuery, garbage collection (20% of the issues!), and replication lag.

Analyze!  Turn data into information.  Understand where the code base is going.

Logging!  Syslog doesn’t work at scale.  No redundancy, no failure recovery.  And moving large files is painful.  They use scribe to HDFS/hadoop with LZO compression.

Dashboard – Theirs has “criticals” view (top 10 metrics), smokeping/mrtg, google analytics (not just for 200s!), XML feeds from managed services.

Whale Watcher – a shell script that looks for errors in the logs.

Change management is the stuff.  They use Reviewboard and puppet+svn.  Hundreds of modules, runs constantly.    It reuses tools that engineers use.

And Deploywatcher, another script that stops deploys if there’s system problems.  They work a lot on deploy.  Graph time of day next to CPU/latency.

They release features in “dark mode” and have on/off switches. Especially computationally/IO heavy stuff.  Changes are logged and reported to all teams (they have like 90 switches).  And they have a static/read-only mode and “emergency stop” button.

Subsystems!  Take a look at how we manage twitter.

loony – a central machine database in mySQL.  They use managed hosting so they’re always mapping names.  Python, django, paraminko SSH (twitter’s OSS SSH library).  Ties into LDAP.  When the data center sends mail, machine definitions are built in real time.  On demand changes with “run.”  Helps with deploy and querying.

murder – a bittorrent based deploy client, python+libtorrent.

memcached – network memory bus isn’t infinite.  Evictions make the cache unreliable for important configs.  They segment into pools for better performance.  Examine slab allocation and watch for high use/eviction with “peep. ”  Manage it like anything else.

Tiers – load balancer, apache, rails (unicorn), flock DB.

Unicorn is awesome and more awesome than mongrel for Rails guys.

Shift from Proxy Balancer to ProxyPass (the slide said PP to PB, but he spoke about it the other way, and putting our heads together we believe the latter more) Apache’s not better than nginx, it’s the proxy.

Aynchronous requests.  Do it.  Workers are expensive.  The request pipeline should not be used to handle third party communication or back end work.  Move long running work to daemons whenever possible.  They’re moving more parts to queuing

kestrel is their queuing server that looks like memcache.  set/get.

They have daemons, in fact have been doing consolidation (not one daemon per job, one per many jobs).

Flock DB shards their social graph through gizzard and stores it in mySQL.

Disk is the new tape

Caching – realtime but heck, 60 seconds is close.  Separate memcache pools for different types.  “Cache everything” is not the best policy – invalidation problems, cold memcache problems.  Use memcache to augment the database.  You don’t want to go down if you lose memcache, your db still needs to handle the load.

MySQL challenges – replication delay.  And social networks don’t fit RDBMS well. Kill long running SQL queries with mkill.  Fail fast!!!

He’s going over a lot of this too fast to write down but there’s GREAT info.  Look for the slides later.  It’s like drinking from a fire hose.

In closing –

  • Use CM early
  • log everything
  • plan to build everything more than once
  • instrument everything and use science

Tim Morrow from ShopZilla on Time Is Money

ShopZilla spoke previous years about their major performance redesign and the effects it had.  It opened their eyes to the $$ benefits and got them addicted to performance.

Performance is top in Fred Wilson’s 10 Golden Rules For Successful Web Apps.  And Google/Microsoft have put out great data this year on performance’s link to site metrics.

Performance can slip away if you take your eyes off the ball.  It doesn’t improve if left alone.

They took their eye off the ball because they were testing back end perf but not front end (and that’s where 80% of the time is spent).  Constant feature development runs it up.  A/B testing needs framework that adds JS and overhead.

It’s easier to attack performance early in the dev cycle, by infecting everyone with a performance mindset.

They put together a virtual performance team and went for more measurements.  Nightly testing using httpwatch and YSlow and stuff.  All these “one time” test tools really need to all be built into a regression rig.

They found they were doing some good things but found things they could fix.  Progressive rendering 8k chunks were too large and they set tomcat to smaller flush intervals.  Too many requests. Bandwidth contention with less critical page elements.

They wanted to focus on the perception of performance via progressive rendering, defer less important stuff.  Flushing faster got the header out quicker. They reordered stuff.   It improved their conversion rate by .4%.  Doesn’t’ sound like much but it’s comparable to a feature release, and they had a 2 month ROI given what they put into it.

They did an infrastructure audit looking for hotspots and underutilization, and saved a lot in future hardware costs ($480k).

Performance is an important feature, but isn’t free, but has a measurable value.  ROI!

Imad Mouline from Compuware on Performance Testing

Actually Gomez, which is now “the Web performance division of Compuware”.  He promises not to do a product pitch, but to share data.

Does better performance impact customer behavior and the bottom line?  They looked at their data.  Performance vs page abandonment.  If you improve your performance, abandon rate goes down by large percentages per second of speedup.

The Web is the new integration platform and the browser’s where the apps come together.  How many hosts are hit by the browser per user transaction on average?  8 or higher, across all industries and localities.

What percent of Web transactions touch Amazon EC2 for at least one object?  Like 20%.  It is going up hideously fast (like 4% in the last month).

Cloud performance concerns – loss of visibility and control especially because existing tools don’t work well.  And multitenant means “someone might jack me!”

They put the same app over many clouds and monitored it!  They have a nice graph that shows variance; I can’t find it with a quick Google search though, otherwise I’d link it here for you.  And availability across many clouds is about 99.5%.

How do you know if the problem is the cloud or you?  They put together to show off these apps and performance.  You can put in your own URL and soon you’ll get data on “is the cloud messed up, or is it you?”

Domain sharding is a common performance optimization.  With S3 you get that “for free” using buckets’ DNS names and you can get a big performance speedup.

The cloud allows dynamic provisioning… Yeah we know.

But… Domain sharding fails to show a benefit on modern browsers.  In fact, it hurts.

At NI we recently did a whole analysis of various optimizations and found exactly this – domain sharding didn’t help and in fact hurt performance a bit.  We thought we might have been crazy or doing something wrong.  Apparently not.

They can see the significant performance differences among browsers/devices.

You have to test and validate your optimizations.  Older wisdom (like “shard domains”) doesn’t always hold any more.

Check some of this stuff out at!

Coming up – lightning demos!

Leave a comment

Filed under Conferences, DevOps

Velocity 2010 – Dueling Cloud Management Suppliers

Two cloud systems management suppliers talk about their bidness!  My comments in italics.

Cloud Autoscaling in Enterprise Computing by George Reese (enStratus Networks LLC)

How the Top Social Games Scale on the Cloud by Michael Crandell (RightScale, Inc)

I am more familiar with RightScale, but just read Reese’s great Cloud Application Architectures book on the plane here.  Whose cuisine will reign supreme?


Reese starts talking about “naive autoscaling” being a problem.  The cloud isn’t magic; you have to be careful.  He defines “enterprise” autoscaling as scaling that is cognizant of financial constraints and not this hippy VC-funded twitter type nonsense.

Reactive autoscaling is done when the system’s resource requirements exceed demand.  Proactive autoscaling is done in response to capacity planning – “run more during the day.”

Proactive requires planning.  And automation needs strict governors in place.

In our PIE autoscaling, we have built limits like that into the model – kinda like any connection pool.  Min, max, rate of increase, etc.

He says your controls shouldn’t be all “number of servers,” but be “budget” based.  Hmmm.  That’s ideal but is it too ideal?  And so what do you do, shut down all your servers if you get to the 28th of the month and you run out of cash?

CPU is not a scaling metric. Have better metrics tied to things that matter like TPS/response time.  Completely agree there; scaling just based on CPU/memory/disk is primitive in the extreme.

Efficiency is a key cloud metric.  Get your utilization high.

Here’s where I kinda disagree – it can often be penny wise and pound foolish.  In the name of “efficiency” I’ve seen people put a bunch of unrelated apps on one server and cause severe availability problems.  Screw utilization.  Or use a cloud provider that uses a different charging model – I forget which one it was, but we had a conf call with one cloud provider that only charged on CPU used, not “servers provisioned.”

Of course you don’t have to take it to an extreme, just roll down to your minimum safe redundancy number on a given tier when you can.

Security – well, you tend not to do some centralized management things (like add to Active Directory) in the cloud.  It makes user management hard.  Or just makes you use LDAP, like God intended.

Cloud bursting – scaling from on premise into the cloud.

Case study – a diaper company.  Had a loyalty program.  It exceeded capacity within an hour of launch.  Humans made a scaling decision to scale at the load balancing tier, and enStratus executed the auto-scale change.  They checked it was valid traffic and all first.

But is this too fiddly for many cases?  If you are working with a “larger than 5 boxes” kind of scale don’t you really want some more active automation?


The RightScale blog is full of good info!

They run 1.2 million cloud servers!  hey see things like 600k concurrent users, 100x scaling in 4 days, 15k instances, 1:2000 management ratio…

Now about gaming and social apps.  They power the top 10 Facebook apps.  They are an open management environment that lives atop the cloud suppliers’ APIs.

Games have a natural lifecycle where they start small, maybe take off, get big, eventually taper off.  It’s not a flat demand curve, so flat supply is ‘tarded.

During the early phase, game publishers need a cheap, fast solution that can scale.  They use Chef and other stuff in server templates for dynamic boot-time configuration.

Typically, game server side tech looks like normal Web stuff!  Apache+HAproxy LB, app servers, db cache (memcached), db (sharded mySQL master/slave pairs).  Plus search, queues, admin, logs.

Instance types – you start to see a lot of larger instances – large and extra large.  Is this because of legacy comfort issues?  Is it RAM needs?

CentOS5 dominates!  Generic images, configured at boot.  One company rebundles for faster autoscale.  Not much ubuntu or Windows.  To be agile you need to do that realtime config.

A lot of the boxes are used for databases.  Web/app and load balancing significant too.  There’s a RightScale paper showing a 100k packets per second LB limit with Amazon.

People use autoscaling a lot, but mainly for web app tier.  Not LBs because the DNS changing is a pain.  And people don’t autoscale their DBs.

They claim a lot lower human need on average for management on RightScale vs using the APIs “or the consoles.”  That’s a big or.  One of our biggest gripes with RightScale is that they consume all those lovely cloud APIs and then just give you a GUI and not an API.  That’s lame.  It does a lot of good stuff but then it “terminates” the programmatic relationship. [Edit: Apparently they have a beta API now, added since we looked at them.]

He disagrees with Reese – the problem isn’t that there is too much autoscaling, it’s that it has never existed.  I tend to agree. Dynamic elasticity is key to these kind of business models.

If your whole DB fits into memcache, what is mySQL for?  Writes sometimes?  NoSQL sounds cool but in the meantime use memcache!!!

The cloud has enabled things to exist that wouldn’t have been able to before.  Higher agility, lower cost, improved performance with control, anew levels of resiliency and automation, and full lifecycle support.

1 Comment

Filed under Cloud, Conferences, DevOps

Velocity 2010 – Infrastructure Philharmonic

Whew!  This is a marathon.  Next, we have John Willis and Damon Edwards on The Infrastructure Philharmonic: How Out of Tune are Your Operations? I really wanted to see Lenny’s talk as well but had to make a hard decision.  Shout out to him, rocks!  As always, my personal comments will be in italics below.

Note – download the DevOps Cafe podcasts by John and Damon and give them a listen!  They rock!

What separates high and low performing IT organizations?  If you’re average, the leader is 2-3x better than you.  There are a lot of “good” qualities that are found in both.

What do high performing organizations specifically share?

Pretty simple:

  • Seeing the whole – holistic vision, common goals.
  • Tune the organization for maximum business agility.

What gets in the way?

Specialization.  We value deep specialization because that gets us paid more. And allows people to act like a-holes with impunity.

Why change?

Competition, durrr.

And on a personal basis, the new specialization is integration and people want that.

Our Analogy: The Philharmonic

Highly skilled individual contributors that need to contribute to a seamless whole.

  1. The sponsors – business & marketing.
  2. The musicians – network, systems, database, etc etc.
  3. The audience – users!
  4. The conductor – leadership.  Coordination, bridging.

I like them talking about the conductor as an occasional manager myself. Seems like a lot of the time people talk about this stuff like it should  magically emerge from among peers and that’s not the way the world works.

Antipatterns – But sometimes there’s not one conductor – there’s a dev manager and an ops manager.  And if the person responsible for the full lifecycle is more than 3 degrees away from the actual process, it doesn’t work.

An IT organization’s musical “ear” evaluates output, shares understanding of goals, impacts individual decisions, and is tuned for your specific business needs.

Antipatterns – individual focus, script based, limited reusability, etc.

Good patterns – ops as code, team focus, reusability, method/process, sournce control.

Developing your “ear” starts with what you can measure.  “Ear” isn’t all subjective, there’s some science to it.  You need to start at the top with measurements that are meaningful to the business.

Antipattern – There’s not enough visibility!  Time for a metrics project!  We’ll get a sea of information and suddenly via BI or something we’ll get the Matrix.  But really you end up with a bunch of crap data.

We did this at NI just recently!

Measurement: a set of observations that reduce uncertainty where the result is expressed as a quantity.  Doesn’t have to be perfectly precise.

It’s about the high level KPIs and not the low level metrics.  Start with THREE TO FIVE.  Don’t be a fool with your life.  To get KPIs:

  1. Step 1 – Get everyone together and pick ’em
  2. Step 2 – Tie back to lower level metrics
  3. Step 3 – Tie back to performance (and compensation!)
  4. Step 4 – Profit

High performing organizations:

1.  Automate as a way of life.  Check out the Tale of Two Startups post by Jesse Robbins on Radar.

Infrastructure as code requires:

  • Provisioning (gimme boxes!)
  • Configuration Management (add roles!)
  • Systems Integration/Orchestration (crossconnect!)

2.  Test as a way of life.  Testing as a skill -> testing as a culture -> quality as a culture.  Check out what kaChing does.  They built a “business immune system” with testing and monitoring deviation – in INTERVIEWS they have people write code to go to production.

3.  DevOps culture.  Get past the disconnects in culture, tools, and process.

Batching up deploys turns your agile dev into a waterfall result.  DON’T BE A FOOL WITH YOUR LIFE!!!

In my previous role in IT, I kept hearing some higher-ups wanting “fewer releases.”  Why release more than once a quarter?  Those monthly releases are so costly!  I always tried to not cuss people out when I heard it.  If your new code is so worthless it can wait an extra two months to go out, just don’t roll it out and save us all some hassle.

Operations wants…

To get out of the muck!  People want to add value and implement things and not just fight fires.  We’re in an explosion right now where ops gets to come out and play!  We want to be agile and say “yes”!



Filed under Conferences, DevOps

Velocity 2010 – Change Management

Andrew “Dice” Clay Schafer of Cloudscaling on Change Management – A Scientific Classification.

The Dunning-Kruger effect says we can’t always judge our own level of competence.
Often we prefer confidence over expertise.

You have probably confidently deployed a disaster.

Premise:  Everyone is working to enable value creation for a business.  (Or should be out the door.)

If you don’t understand what creates value for your organization, how can you be good at your job?

Change Management: A structured approach to transitioning individuals, teams,  and organizations from a current state to a desired future state.

You look into doing this in the technology sphere.  Soon you get to the ITIL framework and you crap your pants.

Web Operations Renaissance

Process innovation is a differentiator that is driving value creation.

You can add value with performance, scalability, performance, and agility.

The 6 Laws of Reliability by Joe Armstrong of Ericsson (all this comp sci theory stuff was worked out in like the 1960s, just go look it up.)

  1. isolation
  2. concurrency
  3. faulire detection
  4. fault identification
  5. live upgrade
  6. stable storage

One approach – Cowboysaurus Rex.  Make changes live to production!

It’s fast and flexible.  But we suck at doing it safely.  Throwing more meat at it is a poor way of scaling.

The next approach – Signaturus Maximus – laying in the bureaucracy.

It adds adult supervision and has fewer surprises, but it slows things down.  And people don’t follow the process.  When there’s a crisis, it’s back to cowboy.

Next approach, ITIL Incipio!

Now, there are some good guidelines.  But no one knows what it really says.  And the process can become an end unto itself.  And when there’s afire, you break out of it and go back to cowboy.

Condo Adeptus.

Computers are better than people at doing the same thing over and over.  But, nothing can bring down your systems like automation. If you don’t test it, you jack yourself.

Novus Probator.

Your system is an application.  Run it through a dev process, and testing.  But we suck at testing.  Testing is its own skill and is often an afterthought.

Prodo Continuus.

All production all the time- “continuous deploy.”  If you build up the scaffolding to support it, you can push to prod without fear (almost).  You get feedback from prod very fast.  But we still suck at testing.  But you don’t break the same thing twice.

What are we neglecting? ell, what are you building?

You have some things that are less important (twitter) and ones that are more critical (your bank) and crazy critical (medical).

When we talk about changes, we assume it’s mostly apps, kinda some OS as those are less sensitive.  Maybe CPU, bot not really storage and network.  There is a criticality path but as much as you can do infrastructure as code all the way down the stack, the better.

There’s also a continuum of situation – from building to firefighting.  Perhaps different approaches are somewhat merited.

Dependencies aren’t just technology.  There are people in the value chain who need to know, and they are in different roles from us.  We need to be transparent to all those other folks and communicate better.

The wall of confusion between roles (like dev and ops) makes people sad if there’s not communication.  Hence, DevOps!  But there are many groups and many walls.

David Christensen’s FSOP (flying by the seat of your pants) cycle describes the problem – we don’t even know what best practices are yet.  Find some, use the, but go back to flying.  Never calcify anything you don’t have to.

Change your organization.  “But it’s hard and there’s politics!”  Strive to make the place where you work the place you want to work,  And if that doesn’t work out, there’s a place you should come work!

Come to DevOpsDays on Friday for more on how to do this!

Leave a comment

Filed under Conferences, DevOps

Velocity 2010 – Drizzle

Monty Taylor from Rackspace talked about Drizzle, a MySQL variant “built for operations“. My thoughts will be in italics so you can be enraged at the right party.

Drizzle is “a database for the cloud”.  What does that even mean?  It’s “the next Web 2.0”, which is another away of saying “it’s the new hotness, beeyotch” (my translation).

mySQL scaling to multiple machines brings you sadness.  And mySQL deploy is crufty as hell.  So step 1 to Drizzle recovery is that they realized “Hey, we’re not the end all be all of the infrastructure – we’re just one piece people will be putting into their own structure.”  If only other software folks would figure that out…

Oracle style vertical scaling is lovely using a different and lesser definition of scaling.  Cloud scaling is extreme!  <Play early 1990s music>  It requires multiple machines.

They shard.  People complain about sharding, but that’s how the Internet works – the Internet is a bunch of sites sharded by functionality.  QED.

“Those who don’t know UNIX are doomed to repeat it.”  The goal (read about the previous session on toolchains) is to compose stuff easily, string them together like pipes in UNIX.  But most of the databases still think of themselves as a big black box in the corner, whose jealous priests guard it from the unwashed heathen.

So what makes Drizzle different? In summary:

  • Less features
  • Ops driven
  • Sane config
  • Plugins

Less features means less ways for developers to kill you.  Oracle’s “run Java within the database” is an example of totally retarded functionality whose main job is to ruin your life. No stored procedures, no triggers, no prepared statements.  This avoids developer sloppiness.  “Insert a bunch of stuff, then do a select and the database will sort it!” is not appropriate thinking for scale.

Ops driven means not marketing driven, which means driven by lies.  For example, there’s no marketdroids that want them to add a mySQL event scheduler when cron exists.  Or “we could sell more if we had ANSI compliant stored procedures!”  They don’t have a company to let the nasty money affect their priorities.

They don’t do competitive benchmarks, as they are all lies.  That’s for impartial third parties to do.  They do publish their regression tests vs themselves for transparency.

You get Drizzle via distros.  There are no magic “gold” binaries and people that do that are evil.  But distros sometimes get behind.  pandora-build

They have sane defaults.  If most people are doing to set something (like FRICKING INNODB) they install by default that way.  To install Drizzle, the only mandatory thing to say is the data directory.

install from apt/yum works.  Or configure/make/make install and run drizzled.  No bootstrap, system tables, whatever.

They use plugins.  mySQL plugins are pain, more of a patch really.  You can just add them at startup time, no SQL from a sysadmin.  And no loading during runtime – see “less features” above.  This is still in progress especially config file sniblets.  But plugins are the new black.

They have pluggable protocols.  It ships with mySQL and Drizzle, but you can plug console, HTTP/REST or whatever.  Maybe dbus…  Their in progress Drizzle protocol removes the potential for SQL injection by only delivering one query, has a sharding key in the packet header, supports HTTP-like redirects…

libdrizzle has both client and server ends, and talks mySQL and Drizzle.

So what about my app that always auths to the database with its one embedded common username/password?  Well, you can do none, PAM, LDAP (and well), HTTP. You just say authenticate(user, pass) and it does it.   It has pluggable authorization too, none, LDAP, or hard-coded.

There is a pluggable query filter that can detect and stop dumb queries- without requiring a proxy.

It has pluggable logging – none, syslog, gearman, etc. – and errors too.

Pluggable replication!  A new scheme based on Google protocol buffers, readable in Java, Python, and C++.  It’s logical change (not quite row) based.  Combined with protocol redirects, it’s db migration made easy!

Boots!  A new command line client, on  It’s pluggable, scriptable, pipes SQL queries, etc.

P.S. mySQL can lick me! (I’m paraphrasing, but only a little.)

1 Comment

Filed under Conferences, DevOps

Velocity 2010 – Facebook Operations

How The Pros Do It

Facebook Operations – A Day In The Life by Tom Cook

Facebook has been very open about their operations and it’s great for everyone.  This session is packed way past capacity.  Should be interesting.  My comments are  in italics.

Every day, 16 billion minutes are spent on Facebook worldwide.  It started in Zuckerberg’s dorm room and now is super huge, with tens of thousands of servers and its own full scale Oregon data center in progress.

So what serves the site?  It’s rerasonably straightforward.  Load balancer, web servers, services servers, memory cache, database.  THey wrote and 100% use use HipHop for PHP, once they outgrew Apache+mod_php – it bakes php down to compiled C++.  They use loads of memcached, and use sharded mySQL for the database. OS-wise it’s all Linux – CentOS 5 actually.

All the site functionality is broken up into separate discrete services – news, search, chat, ads, media – and composed from there.

They do a lot with systems management.  They’re going to focus on deployment and monitoring today.

They see two sides to systems management – config management and on demand tools.  And CM is priority 1 for them (and should be for you).  No shell scripting/error checking to push stuff.  There are a lot of great options out there to use – cfengine, puppet, chef.  They use cfengine 2!  Old school alert!  They run updates every 15 minutes (each run only takes like 30s).

It means it’s easy to make a change, get it peer reviewed, and push it to production.  Their engineers have fantastic tools and they use those too (repo management, etc.)

On demand tools do deliberate fix or data gathering.  They used to use dsh but don’t think stuff like capistrano will help them.  They wrote their own!  He ran a uname -a across 10k distributed hosts in 18s with it.

Up a layer to deployments.  Code is deployed two ways – there’s front end code and back end deployments.  The Web site, they push at least once a day and sometimes more.  Once a week is new features, the rest are fixes etc.  It’s a pretty coordinated process.

Their push tool is built on top of the mystery on demand tool.  They distribute the actual files using an internal BitTorrent swarm, and scaling issues are nevermore!  Takes 1 minute to push 100M of new code to all those 10k distributed servers.  (This doesn’t include the restarts.)

On the back end, they do it differently.  Usually you have engineering, QA, and ops groups and that causes slowdown.  They got rid of the formal QA process and instead built that into the engineers.  Engineers write, debug, test, and deploy their own code.  This allows devs to see response quickly to subsets of real traffic and make performance decisions – this relies on the culture being very intense.  No “commit and quit.”  Engineers are deeply involved in the move to production.  And they embed ops folks into engineering teams so it’s not one huge dev group interfacing with one huge ops group.  Ops participates in architectural decisions, and better understand the apps and its needs.  They can also interface with other ops groups more easily.  Of course, those ops people have to do monitoring/logging/documentation in common.

Change logging is a big deal.  They want the engineers to have freedom to make changes, and just log what is going on.  All changes, plus start and end time.  So when something degrades, ops goes to that guy ASAP – or can revert it themselves.  They have a nice internal change log interface that’s all social.  It includes deploys and “switch flips”.

Monitoring!  They like ganglia even tough it’s real old.  But it’s fast and allows rapid drilldown.  They update every minute; it’s just RRD and some daemons.  You can nest grids and pools.  They’re so big they have to shard ganglia horizontally across servers and store RRD’s in RAM, but you won’t need to do that.

They also have something called ODS (operational data store) which is more application focused and has history, reporting, better graphs.  They have soooo much data in it.

They also use nagios, even though “that’s crazy”.  Ping testing, SSH testing, Web server on a port.  They distribute it and feed alerting into other internal tools to aggregate them as an execution back end.  Aggregating into alarms clumps is critical, and decisions are made based on a tiered data structure – feeding into self healing, etc.  They have a custom interface for it.

At their size, there are some kind of failures going on constantly.  They have to be able to push fixes fast.

They have a lot of rack/cluster/datacenter etc levels of scale, and they are careful to understand dependencies and failure states among them.

They have constant communication – IRC with bots, internal news updates, “top of page” headers on internal tools, change log/feeds.  And using small teams.

How many users per engineer?  At Facebook, 1.1 million – but 2.3 million per ops person!  This means a 2:1 dev to ops ratio, I was going to ask…

To recap:

  • Version control everything
  • Optimize early
  • Automate, automate, automate
  • Use configuration management.  Don’t be a fool with your life.
  • Plan for failure
  • Instrument everything.  Hardware, network, OS, software, application, etc.
  • Don’t spend time on dumb things – you can slow people down if you’re “that guy.”
  • Priorities – Stability, support your engineers

Check for their blog!  And for their tools.

Leave a comment

Filed under Conferences, DevOps

Velocity 2010 – Getting Fast

The first session in Day 1’s afternoon is Getting Fast: Moving Towards a Toolchain for Automated Operations.  Peco, Jeff, and I all chose to attend it.  Lee Thompson and Alex Honor of dto gave it.

I have specific investment in this one, as a member of the devops-toolchain effort, so was jazzed to see one of its first outputs!

A toolchain is a set of tools you use for a purpose.  Choosing the specific tools should be your last thing.  They have a people over process over tools methodology that indicates the order of approach.

Back in their ControlTier days they wrote a paper on fully automated provisioning.  Then the devops-toolchain Google group and OpsCamp stuff was created to promote collaboration around the space.

Discussion on the Google group has been around a variety of topics, from CM to log management to DevOps group sizing/hiring.

Ideas borrowed from in the derivation of the devops toolchain concept:

  • Brent Chapman’s Incident Command System (this is boss; I wrote up the session on it from Velocity 2008)
  • Industrial control automation; it’s physical but works similarly to virtual and tends to be layered and toolchain oriented.  Layers include runbook automation, control, eventing, charting, measurement instrumentation, and the system itself.  Statistical process control FTW.
  • The UNIX toolchain as a study in modularity and composition; it’s one of the most durable best practices approaches ever.  Douglas McIlroy FTW!

Eric Raymond (The Art of UNIX Programming, The Cathedral and the Bazaar)

Interchangeable parts – like Honore Blanc started with firearms, and lean manufacturing and lean startup concepts today.

In manufacturing, in modern automation thought, you don’t make the product, you should make the robots that make the product.

Why toolchains?

  • Projects are failing due to handoff issues, and automation and tools reduce that.
  • Software operation – including release and operations – are critical nonfunctional requirements of the development process.
  • Composite apps mean way more little bits under manangement
  • Cloud computing means you can’t slack off and sit around with server racking being the critical path

Integrated tools are less flexible – integratable tools can be joined together to address a specific problem (but it’s more complex).

Commercial bundled software is integrated.  It has a big financial commitment and if one aspect of it is weak, you can’t replace it.  It’s a black box/silo solution that weds you to their end to end process.

Open source software is lots of independent integratable parts.  It may leave gaps, and done wrong it’s confused and complicated.  But the iterative approach aligns well with it.

They showed some devops guys’ approaches to automated infrastructure – including ours!  Woot!

KaChing’s continuous deployment is a great example of a toolchain in action.  They have an awesome build/monitor/deploy GUi-faved app for deploy and rollback.


Then they showed a new cut at the generalized architecture, with Control, Provisioning, Release, Model, Monitoring, and Sources as the major areas.

Release management became a huge area, with subcomponents of repository, artifact, build, SCM, and issue tracker.

In monitoring and control, they identified Runbook Automation, Op Console/Control, Alarm Management, Charting/History/SPC, and Measurement Instrumentation.

Provisioning consists of Application Service Orchestration, System Configuration, Cloud/VM or OS install.

This is all great stuff.  All these have open source tools named; I’ll link to wherever these diagrams are as soon as I find it!  I must not have been paying enough attention to the toolchain wiki!

Hot Tips

  • Tool projects fail if the people and process aren’t aligned.
  • Design the toolchain for interoperability
  • Adopt a SDLC for any tool you develop
  • Separate the dev release process from the package process
  • Need better interchange boundaries (the UNIX pipe equivalent)
  • No one size fits all – different tools is OK
  • Communication is your #1 ingredient for success

All in all an awesome recap of the DevOps toolchain effort!  Great work to everyone who’s done stuff on it, and I know this talk inspired me to put more time into it – I think this is a super important effort that can advance the state of our discipline!  And everyone is welcome to join up and join in.

Leave a comment

Filed under Conferences, DevOps