How The Pros Do It
Facebook Operations – A Day In The Life by Tom Cook
Facebook has been very open about their operations and it’s great for everyone. This session is packed way past capacity. Should be interesting. My comments are in italics.
Every day, 16 billion minutes are spent on Facebook worldwide. It started in Zuckerberg’s dorm room and now is super huge, with tens of thousands of servers and its own full scale Oregon data center in progress.
So what serves the site? It’s reasonably straightforward. Load balancer, web servers, services servers, memory cache, database. They wrote, and use 100%, HipHop for PHP, once they outgrew Apache+mod_php – it compiles PHP down to C++. They use loads of memcached, and use sharded MySQL for the database. OS-wise it’s all Linux – CentOS 5 actually.
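To make the memcached + sharded MySQL combo concrete, here’s a minimal sketch of the usual cache-aside read path over user-id sharding. This is just an illustration of the pattern, not Facebook’s actual code – all names (SHARDS, cache, get_user) are hypothetical.

```python
# Sketch of the memcached + sharded MySQL read path: check the cache,
# and on a miss, route the query to the shard owned by this user id.
SHARDS = ["db01", "db02", "db03", "db04"]  # hypothetical shard hosts
cache = {}  # stand-in for a memcached client

def shard_for(user_id: int) -> str:
    """Pick a database shard by hashing the user id."""
    return SHARDS[user_id % len(SHARDS)]

def get_user(user_id: int) -> dict:
    """Cache-aside read: try the cache first, fall back to the shard."""
    key = f"user:{user_id}"
    if key in cache:
        return cache[key]
    # Stand-in for "SELECT ... FROM users" against shard_for(user_id):
    row = {"id": user_id, "shard": shard_for(user_id)}
    cache[key] = row
    return row
```

The point of the shard function is that any web server can compute which database owns a user without a central lookup.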
All the site functionality is broken up into separate discrete services – news, search, chat, ads, media – and composed from there.
They do a lot with systems management. They’re going to focus on deployment and monitoring today.
They see two sides to systems management – config management and on demand tools. And CM is priority 1 for them (and should be for you). No shell scripting/error checking to push stuff. There are a lot of great options out there to use – cfengine, puppet, chef. They use cfengine 2! Old school alert! They run updates every 15 minutes (each run only takes like 30s).
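The core idea behind cfengine-style config management (as opposed to push scripts) is convergence: declare the desired state, then on every 15-minute run fix only whatever drifted. A toy sketch of that loop, with a dict standing in for the filesystem and a hypothetical desired-state table:

```python
# Convergence sketch: a run repairs only what differs from desired
# state, so a clean run is a no-op. Resources here are hypothetical.
desired = {"/etc/motd": "welcome\n"}  # desired file contents

def converge(fs: dict) -> list:
    """Bring `fs` (stand-in for the filesystem) to the desired state.
    Returns the paths that were repaired; a clean run repairs nothing."""
    repaired = []
    for path, content in desired.items():
        if fs.get(path) != content:
            fs[path] = content
            repaired.append(path)
    return repaired
```

Because each run is idempotent, running every 15 minutes is cheap – which is why their runs finish in ~30 seconds.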
It means it’s easy to make a change, get it peer reviewed, and push it to production. Their engineers have fantastic tools and they use those too (repo management, etc.)
On demand tools perform deliberate fixes or data gathering. They used to use dsh but don’t think stuff like capistrano will help them at their scale. They wrote their own! He ran a uname -a across 10k distributed hosts in 18s with it.
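They didn’t share details of that tool, but the basic dsh-style fan-out is easy to picture: run one command per host concurrently and collect the results. A self-contained sketch (it runs the command locally instead of over ssh, so it actually executes; swap in an ssh invocation in real life):

```python
# Hedged sketch of a parallel fan-out tool, not Facebook's actual one.
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_on(host: str, command: list) -> str:
    # Real tool would do: ssh <host> <command>. We run locally so the
    # sketch is self-contained.
    out = subprocess.run(command, capture_output=True, text=True)
    return f"{host}: {out.stdout.strip()}"

def fan_out(hosts: list, command: list, workers: int = 64) -> list:
    """Run `command` against every host in parallel, preserving order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda h: run_on(h, command), hosts))
```

With enough workers (and something smarter than plain ssh for connection setup), the wall-clock time is dominated by the slowest host, which is how 10k hosts can come back in seconds.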
Up a layer to deployments. Code is deployed two ways – front end pushes and back end deployments. The Web site, they push at least once a day and sometimes more. Once a week is new features, the rest are fixes etc. It’s a pretty coordinated process.
Their push tool is built on top of the mystery on demand tool. They distribute the actual files using an internal BitTorrent swarm, and scaling issues are nevermore! Takes 1 minute to push 100M of new code to all those 10k distributed servers. (This doesn’t include the restarts.)
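Quick back-of-the-envelope on why a BitTorrent swarm kills the scaling problem, using the numbers from the talk (100 MB payload, 10k hosts):

```python
# Single-source push vs. swarm, roughly. Every byte a naive push
# sends has to leave one distribution host.
payload_mb = 100
hosts = 10_000

single_source_total_mb = payload_mb * hosts  # ~1 TB out of one box

# In a swarm, each host re-seeds the pieces it already has, so the
# origin only needs to upload on the order of one full copy; the
# peers shoulder the rest of the ~1 TB between themselves.
origin_upload_mb = payload_mb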
On the back end, they do it differently. Usually you have engineering, QA, and ops groups and that causes slowdown. They got rid of the formal QA process and instead built that into the engineers. Engineers write, debug, test, and deploy their own code. This lets devs quickly see the response from subsets of real traffic and make performance decisions – this relies on the culture being very intense. No “commit and quit.” Engineers are deeply involved in the move to production. And they embed ops folks into engineering teams so it’s not one huge dev group interfacing with one huge ops group. Ops participates in architectural decisions, and better understands the apps and their needs. They can also interface with other ops groups more easily. Of course, those ops people have to do monitoring/logging/documentation in common.
Change logging is a big deal. They want the engineers to have freedom to make changes, and just log what is going on. All changes, plus start and end time. So when something degrades, ops goes to that guy ASAP – or can revert it themselves. They have a nice internal change log interface that’s all social. It includes deploys and “switch flips”.
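The change log they described boils down to a simple record: who, what, and a start/end window, so a degradation can be correlated with whatever changed right before it. A toy sketch of that shape (field names are my guess, not theirs):

```python
# Minimal change-log sketch: every change gets an author, a
# description, and a start/end timestamp for correlation with graphs.
import time

changelog = []

def log_change(who: str, what: str) -> dict:
    entry = {"who": who, "what": what, "start": time.time(), "end": None}
    changelog.append(entry)
    return entry

def finish_change(entry: dict) -> None:
    entry["end"] = time.time()
```

Given this, “ops goes to that guy ASAP” is just a query for changes whose window overlaps the start of the degradation.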
Monitoring! They like ganglia even though it’s real old. But it’s fast and allows rapid drilldown. They update every minute; it’s just RRD and some daemons. You can nest grids and pools. They’re so big they have to shard ganglia horizontally across servers and store RRD’s in RAM, but you won’t need to do that.
They also have something called ODS (operational data store) which is more application focused and has history, reporting, better graphs. They have soooo much data in it.
They also use nagios, even though “that’s crazy”. Ping testing, SSH testing, Web server on a port. They distribute it and use it as an execution back end, feeding its alerts into other internal tools that aggregate them. Aggregating alerts into alarm clumps is critical, and decisions are made based on a tiered data structure – feeding into self healing, etc. They have a custom interface for it.
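The aggregation step is the important bit: at 10k+ hosts you can’t page on individual checks. A sketch of clumping host-level alerts into one alarm per (cluster, check) – the grouping key here is my assumption, not their actual scheme:

```python
# Hedged sketch of alert aggregation: 200 per-host "http down" alerts
# become one alarm that says which cluster and how many hosts.
from collections import defaultdict

def clump(alerts: list) -> dict:
    """Group alerts by (cluster, check); value is the affected hosts."""
    groups = defaultdict(list)
    for a in alerts:
        groups[(a["cluster"], a["check"])].append(a["host"])
    return {key: sorted(hosts) for key, hosts in groups.items()}
```

A tiered structure then looks at clump size – one host down in a pool of hundreds might route to self-healing, a whole cluster down pages a human.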
At their size, there are some kind of failures going on constantly. They have to be able to push fixes fast.
They have a lot of rack/cluster/datacenter etc levels of scale, and they are careful to understand dependencies and failure states among them.
They have constant communication – IRC with bots, internal news updates, “top of page” headers on internal tools, change log/feeds. And using small teams.
How many users per engineer? At Facebook, 1.1 million – but 2.3 million per ops person! This means a 2:1 dev to ops ratio, I was going to ask…
- Version control everything
- Optimize early
- Automate, automate, automate
- Use configuration management. Don’t be a fool with your life.
- Plan for failure
- Instrument everything. Hardware, network, OS, software, application, etc.
- Don’t spend time on dumb things – you can slow people down if you’re “that guy.”
- Priorities – Stability, support your engineers