
Velocity 2013 Day 2 Liveblog: mobile performance and engagement

Guilin Chen from Facebook is the presenter…

UX is important, and mobile users are less tolerant than desktop users.

What does performance mean for the Facebook study?
– page load times
– scroll performance (how smooth)
– prefetch delay (infinite scrolling)

– Page load times showed a strong correlation between slowness and user drop-off.
– A consistent scrolling experience matters more than raw speed; slow-but-smooth scrolling beats jittery scrolling.
– The prefetch delay studies weren’t as conclusive, so that mattered less.


Velocity 2012 Day Two

After John Allspaw and Steve Souders caper about in fake muscles, we get started with the keynotes.

Building for a Billion Users (Facebook)

Jay Parikh of Facebook spoke about building for a billion users.  Several folks yesterday prefaced their advice with phrases like “if you have a billion users and hundreds of engineers then this advice may be on point…”

Today in 30 minutes, Facebook did…

  • 10 TB of log data into Hadoop
  • Scanned 105 TB in Hive
  • 6M photos
  • 160M newsfeed stories
  • 5B realtime messages
  • 10B profile pics
  • 108B MySQL queries
  • 3.8T cache ops

Principle 1: Focus on Impact

Engineers deploy code on day one. They open sourced Phabricator for code review. There’s a 30-day coder boot camp – ops hires do the coder boot camp too, then weeks of ops boot camp – plus a mentorship program.

Principle 2: Move Fast

  • Commits scale with people, but you have to stay safe! Perflab runs a performance test on every commit using log replay, and also checks for slow drift over time.
  • Gatekeeper is the feature-flag and A/B testing tool – 500M checks/sec. We have one of these at Bazaarvoice and it’s super helpful. (A minimal sketch of the idea follows this list.)
  • Claspin is a high density heat map viewer for large services
  • Fast deployment technique – they put the cache in a shared memory segment so binaries can be swapped out on top of it; that took back end deploys from weeks down to days/hours.
  • Built lots of ops tools – people | tools | process
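
Gatekeeper itself isn’t described in detail, so here’s a minimal sketch of the gated feature-flag/A-B idea – the class name, hashing-based bucketing, and percentages are my assumptions, not Facebook’s implementation:

```python
import hashlib

class FeatureGate:
    """Toy feature-flag / A/B gate (hypothetical; not Facebook's Gatekeeper)."""

    def __init__(self, name, rollout_percent=0, allowed_user_ids=None):
        self.name = name
        self.rollout_percent = rollout_percent            # 0-100 gradual rollout
        self.allowed_user_ids = set(allowed_user_ids or [])

    def is_enabled(self, user_id):
        # Explicit allow-list (e.g. internal testers) always wins.
        if user_id in self.allowed_user_ids:
            return True
        # Stable hash bucket so a given user always gets the same answer.
        digest = hashlib.sha1(f"{self.name}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < self.rollout_percent

# Usage: ramp a feature to 10% of users, plus one internal tester.
gate = FeatureGate("new_newsfeed", rollout_percent=10, allowed_user_ids={42})
if gate.is_enabled(user_id=42):
    pass  # serve the new code path
```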

Random Bits:

  • Be bold
  • They use BGP in the data center
  • Did we mention how cool we are?
  • Capacity engineering isn’t just buying stuff, it’s APM to reduce usage
  • Massive fail from a Gatekeeper bug.  Had to pull DNS to shut the site down.
  • Fix more, whine less

Investigating Anomalies, Amazon

John Rauser, Amazon data scientist, on investigating anomalies.  He gave a well-received talk on statistics for operations last year.
He used a very long “data in the time of cholera” example.  Watch the video for the long version.
You can’t just look at the summary statistics – look at distributions, then dig into the logs themselves.
Look at the extremes and you’ll find things that are broken.
Check your long tail; monitor percentiles far out in the tail.
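
To make the “monitor the tail” point concrete, here’s a small sketch (mine, not from the talk) comparing the mean to p95/p99 on a toy latency sample:

```python
def percentile(sorted_values, p):
    """Nearest-rank percentile (p in 0..100) over a pre-sorted list."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(0, int(round(p / 100.0 * len(sorted_values))) - 1)
    return sorted_values[rank]

latencies_ms = sorted([12, 15, 14, 13, 900, 16, 14, 15, 13, 1200])  # toy data
mean = sum(latencies_ms) / len(latencies_ms)
print(f"mean={mean:.0f}ms p95={percentile(latencies_ms, 95)}ms "
      f"p99={percentile(latencies_ms, 99)}ms")
# The mean looks modest; the tail percentiles expose the broken requests.
```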

Building Resilient User Experiences

Mike Brittain, Etsy on Building Resilient User Experiences – here’s the video.

Don’t mess up a page just because one of the 14 lame back end services you compose into it is down. Distinguish critical back end service failures from the rest, and just suppress the non-critical stuff when composing a product page. Consider blocking vs. non-blocking Ajax: load critical stuff synchronously, and load non-critical stuff async (or not at all if it times out).
Google Apps has those nice “there’s a problem, retrying in a bit” messages (use exponential back-off in your UI!).
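
The back-off advice looks roughly like this sketch (function names and delays are made up):

```python
import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry fn() with exponential back-off plus jitter (illustrative only)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                                     # out of retries, surface the error
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.0))  # jittered wait before retrying
```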

Planning for 100% availability is not enough; plan for failure.  Your UI should adapt to failure. That’s usually a joint dev+ops+product call.
Do operability reviews, postmortems. Watch your page views for your error template!  (use an error template)

Speeding up the web using prediction of user activity

Arvind Jain/Dominic Hamon from Google – see the video.

Why, things would be so much faster if you just preload the next pages a user might go to.  Sure.  As long as you don’t mind accidentally triggering all kinds of shit.  It’s like cross-site request forgery as a feature!
<link rel=’prerender’> to provide guidance.

Annoyed by this talk, and since I’ve read How Complex Systems Fail already, I spent the rest of the morning plenary time wandering the vendor floor.

It’s All About Telemetry

From the famous and irascible Theo Schlossnagle (@postwait). Here’s the slides.
Monitor what matters! Most new big data problems are created by our own solutions in the first place (e.g. logs), and are thus solvable, whatever their ROI.

What’s the cost/benefit of your data?
Don’t erode granularity (as RRD does).  It keeps storage bounded, but then your ability to, say, do year-over-year Black Friday comparisons sucks.
As you zoom in you see major differences and patterns that were obscured before.

There’s a cost/benefit curve for monitoring that goes positive but eventually goes negative. Value = benefit - cost.  So you don’t go to the end of the curve; you want the max difference!

Technique 1: Text
Just store changes – but be careful not to end up with so many changes that storing them becomes its own problem.

Technique 2: Numeric
Store rollups: over 1 minute, keep min/max/avg/stddev/covariance and the 50th/95th/99th percentiles.
Store the first-order derivative, and then the derivative of that (jerkiness).
DB replication – monitor lag AND the rate of lag change.
“It’s a lot easier to see change when you’re actually graphing change”
Project numbers out.   Graph storage space going down!
With simple numeric data you can do prediction (Holt-Winters), even hacked into RRD.
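
A hedged sketch of the one-minute rollup plus first-derivative idea (field names and bucket size are mine, not Theo’s):

```python
import statistics

def rollup(samples):
    """Collapse one minute of raw samples into a small summary record (illustrative)."""
    s = sorted(samples)
    return {
        "min": s[0],
        "max": s[-1],
        "avg": statistics.fmean(s),
        "stddev": statistics.pstdev(s),
        "p50": s[len(s) // 2],
        "p95": s[int(len(s) * 0.95) - 1],
        "p99": s[int(len(s) * 0.99) - 1],
    }

def first_derivative(series, interval_s=60):
    """Rate of change between consecutive rollups, e.g. replication lag per second."""
    return [(b - a) / interval_s for a, b in zip(series, series[1:])]

# Replication lag example: track the lag values AND how fast the lag is growing.
lag_seconds = [10, 12, 15, 21, 34]
print(first_derivative(lag_seconds))   # positive and increasing = trouble
```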

Technique 3: Histograms
About 2 KB for a 1-minute histogram (5 bytes for a single bucket).
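
And a toy version of cheap histogram bucketing (the log-scale bucket scheme here is my own approximation, not Circonus’s actual format):

```python
import math
from collections import Counter

def bucket(value_ms):
    """Map a latency to a coarse log-scale bucket so a minute of data stays tiny."""
    if value_ms <= 0:
        return 0.0
    exponent = math.floor(math.log10(value_ms))
    width = 10 ** exponent                  # e.g. 100-999ms values land in 100ms-wide buckets
    return math.floor(value_ms / width) * width

histogram = Counter(bucket(v) for v in [3, 7, 42, 48, 230, 260, 1200])
print(sorted(histogram.items()))            # a handful of (bucket, count) pairs per minute
```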

Event correlation – change management vs. performance – is still driven by the human eye.

Monitor everything.

  • the business – financials, marketing, support! problem counts, customer satisfaction, resolution time
  • operations! durr
  • system/db/yeah
  • middleware! messaging, apis, etc.

They use Reconnoiter, statsd, d3, and flot for graphing.

This was one of the best sessions of the day IMO.

RUM For Breakfast

Carlos from Facebook on routing users to the closest datacenter with Doppler. Normal DNS geo-routing depends on resolvers being near the user, etc.
They inject JS into some browsers and log packet latency, mapping resolver IP to user IP, datacenter, and latency. Then you can cluster users to their nearest data centers. They use it for planning and analysis – which countries have poor latency, where do we need peering agreements (see the sketch below).
Akamai said doing it was impractical, so they did it.
Amazon has “latency based routing” but no one knows how it works.
Google as part of their standard SOP has proposed a DNS extension that will never be adopted by enough people to work.
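
Here’s a toy sketch of the Doppler idea as I understood it – the data layout and clustering-by-average below are my simplification, not Facebook’s implementation:

```python
from collections import defaultdict

# (resolver_ip, datacenter, latency_ms) samples gathered from injected JS beacons
samples = [
    ("8.8.8.8", "us-west", 45), ("8.8.8.8", "us-east", 110),
    ("8.8.8.8", "us-west", 50), ("203.0.113.5", "eu", 30),
    ("203.0.113.5", "us-east", 140),
]

totals = defaultdict(lambda: [0, 0])          # (resolver, dc) -> [latency sum, count]
for resolver, dc, ms in samples:
    totals[(resolver, dc)][0] += ms
    totals[(resolver, dc)][1] += 1

best = {}
for (resolver, dc), (total, count) in totals.items():
    avg = total / count
    if resolver not in best or avg < best[resolver][1]:
        best[resolver] = (dc, avg)

print(best)   # route each resolver's users to its lowest-latency datacenter
```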

Looking at log-normal perf data – look at the whole spread, but that can be huge, so filter, then analyze.
Margin of error ≈ 1.96 × sd / sqrt(n); you need less than 5% error.
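
That rule of thumb in code (1.96 being the z-score for a 95% confidence interval; treating the 5% as relative to the mean is my reading):

```python
import math

def relative_margin_of_error(stddev, n, mean):
    """95% CI half-width as a fraction of the mean: 1.96 * sd / sqrt(n) / mean."""
    return (1.96 * stddev / math.sqrt(n)) / mean

# Keep sampling until the relative error drops under 5%.
print(relative_margin_of_error(stddev=400, n=1000, mean=900))   # ~0.028, good enough
```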

How does performance influence human behavior?
Very fast sessions had high bounce rates (probably errors).
There’s a strong correlation between bounce rate and load time, especially front end speed.
Toxicology has the idea of a “median lethal dose” – our LD50 is the load time where we pass a 50% bounce rate.
These are: back end 1.7s, DOM load 1.8s, DOM interactive 2.75s, front end 3.5s, DOM complete 4.75s, load event 5.5s.
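
A hedged sketch of how you might derive an LD50 from (load time, bounced) samples – the bucketing below is my own, not Facebook’s method:

```python
from collections import defaultdict

def ld50(samples, bucket_ms=500):
    """First load-time bucket where the observed bounce rate crosses 50%."""
    buckets = defaultdict(lambda: [0, 0])          # bucket -> [bounces, total sessions]
    for load_ms, bounced in samples:
        b = buckets[load_ms // bucket_ms * bucket_ms]
        b[0] += bounced
        b[1] += 1
    for bucket in sorted(buckets):
        bounces, total = buckets[bucket]
        if bounces / total >= 0.5:
            return bucket
    return None

samples = [(800, 0), (1200, 0), (2600, 0), (3600, 1), (3900, 1), (5200, 1)]
print(ld50(samples))   # -> 3500 with this toy data
```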

Rollback, the Impossible Dream

by James Turnbull! Here’s the slides.

Rollback is BS.  You are insane if you rely on it. It’s theoretically possible, if you apply sufficient capital, and all apps are idempotent, resources…

  • Solve for availability, not rollback. Do small iterative changes instead.
  • Accept that failure happens, and prevent it from happening again.
  • Nobody gives a crap whose fault it was.
  • Assumption is the mother of all fuckups.
  • This can’t be upgraded like that because…  challenge it.

Stability Patterns

by Michael Nygard (@mtnygard) – Slides are here.   For more see his pragmatic programmers book Release It!

Failures come in patterns. Here’s some major ones!

Integrations are the #1 risk to stability, from both “in spec” and “out of spec” errors. Integrations are a necessary evil.  To debug you have to peel back the layers of abstraction; you have to diagnose a couple levels lower than the level an error manifests. Larger systems fail faster than small ones.

Chain Reactions – #2!
Failure moves horizontally across tiers, search engines and app servers get overloaded, esp. with connection pools. Look for resource leaks.

Cascading Failure
Failure moves vertically cross tiers. Common in SOA and enterprise services. Contain the damage – decouple higher tiers with timeouts, circuit breakers.

Blocked Threads
All threads blocked = “crash”. Use java.util.concurrent or System.Threading (or Ruby/PHP – just don’t do it). Hung request handlers = less capacity and frustrated users. Scrutinize resource pools. Decompile that lame-ass third-party code to see how it works; if you’re using it, you’re responsible for it.

Attacks of Self-Denial
Making your own DoS attack, often via mass unexpected promotions. Open lines of communication with marketers.

Unbalanced Capacities
Your environment scaling ratios are different dev to qa to prod. Simulate back end failures or overloading during testing!

Unbounded result sets
Dev/test have smaller data volumes and unreal relationships (what’s the max in prod?). SOA with chatty providers – client can’t trust not being hurt. Don’t trust data producers, put limits in your APIs.

Stability patterns!

Circuit Breaker
Remote call wrapped with a retry loop (always 3!)
Immediate retries are very likely to fail again – TCP fixes packet drops for you, man
Makes the user wait longer for their error, and floods the servers (cascading failure)
Count failures (leaky bucket) and stop calling the back end for a cool-off period
Can critical work be queued for later, or rejected, or what?
State of circuit breakers is a good dashboard!
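
A minimal circuit-breaker sketch following that description (thresholds and naming are mine): count failures, open the circuit, and fail fast for a cool-off period instead of hammering the back end.

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooloff_seconds=30):
        self.failure_threshold = failure_threshold
        self.cooloff_seconds = cooloff_seconds
        self.failures = 0
        self.opened_at = None        # None = circuit closed (calls allowed)

    def call(self, fn):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.cooloff_seconds:
                raise RuntimeError("circuit open: failing fast")   # reject or queue for later
            self.opened_at = None    # cool-off over, probe the back end again
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1       # leaky-bucket-ish failure counting
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            raise
        self.failures = 0
        return result
```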

Bulkheads
Partition the system and allow partial failure without losing service.
Classes of customer, etc.
If foo and bar are coupled with baz, then hitting baz can bork both. Make baz pools or whatever.
Less efficient resource use but important with shared-service models
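
A minimal bulkhead sketch, assuming you’re partitioning a shared back end by customer class (the semaphore sizes and class names are made up): each pool gets its own bounded slice, so one noisy consumer can’t starve the others.

```python
import threading

# Hypothetical example: separate connection budgets per customer class,
# so "foo" exhausting its pool can't take "bar" down with it.
POOLS = {
    "premium": threading.BoundedSemaphore(30),
    "free":    threading.BoundedSemaphore(10),
}

def call_backend(customer_class, fn, timeout=2.0):
    pool = POOLS[customer_class]
    if not pool.acquire(timeout=timeout):
        raise RuntimeError(f"{customer_class} bulkhead full: shed load")
    try:
        return fn()
    finally:
        pool.release()
```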

Test Harness
Real-world failures are hard to create in QA
Integration tests don’t find those out-of-spec errors unless you force them
A service that acts like a back end but does crazy shit – responds slowly, never finishes, sends binary garbage.
Supplement testing methods
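
A tiny sketch of such an evil back end (my own; point a client under test at it and see what breaks):

```python
import os
import random
import socket
import time

def evil_backend(port=9999):
    """Accept connections and misbehave on purpose, to exercise client timeouts."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", port))
    srv.listen(5)
    while True:
        conn, _ = srv.accept()
        behavior = random.choice(["hang", "slow", "garbage"])
        if behavior == "hang":
            time.sleep(300)                      # never answer; client must time out
        elif behavior == "slow":
            for byte in b"HTTP/1.1 200 OK\r\n\r\nok":
                conn.sendall(bytes([byte]))      # dribble one byte per second
                time.sleep(1)
        else:
            conn.sendall(os.urandom(4096))       # binary garbage instead of HTTP
        conn.close()
```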

The Colo or the Cloud?

Ken King/James Sheridan, Yammer

They started in a colo.
Dual network; racks have four 2U quad servers and ten 1Us
EC2 provides you: network, cabling, simple load balancing, geo distribution, hardware repair, network and rack diagrams (?)
There are fixed and variable costs, assume 3 year depreciation
AWS gives you 20% off at $2M/year?
Three year commit, reserve instances
one rack, one year – $384k colo, $378k EC2, $199k with reserved instances
20 racks, one year – $105k colo, $158k EC2 even with reserved instances and the 20% discount
20 racks, 3 years – $50k colo, $105k EC2
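
For illustration, here’s roughly how that kind of comparison pencils out – all numbers and rates below are placeholders, not Yammer’s actual figures:

```python
def colo_monthly(capex, monthly_opex, depreciation_months=36):
    """Colo cost per month: hardware amortized over 3 years plus recurring opex."""
    return capex / depreciation_months + monthly_opex

def cloud_monthly(instances, hourly_rate, reserved_discount=0.0):
    """Cloud cost per month for always-on instances, after any reserved discount."""
    return instances * hourly_rate * 730 * (1 - reserved_discount)

# Placeholder inputs only – plug in your own quotes.
print(colo_monthly(capex=250_000, monthly_opex=4_000))            # one rack, owned
print(cloud_monthly(instances=40, hourly_rate=0.50,
                    reserved_discount=0.35))                      # rough EC2 equivalent
```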

But – speed (agility) of ramping up…

To me this is like cars.  You build your own car, you know it.  But for arbitrary small numbers of cars, you don’t have time for that crap.  If you run a taxi fleet or something, then you start moving back towards specialized needs and spec/build your own.

Colo benefits – ownership and knowledge.  You know it and control it.

  • Load balancing (loadbalancer.com or Zeus/Riverbed are the only cloud options)
  • IDS etc. (appliances)
  • Choose connectivity
  • know your IO, get more RAM, more cores
  • Throw money at vertical scalability
  • Perception of control = security

Cloud benefits – instant. Scaling.

  • Forces good architecture
  • Immediate replacement, no sparing etc.
  • Unlimited storage, snapshots (1 TB volume limit)
  • No long term commitment
  • Provisioning APIs, autoscaling, EU storage, geodist

Hybrid.

  • Crocodoc and encoder partners of Yammer are good to be close to
  • need to burst work
  • dev servers, Windows dev, and demo servers are banished to AWS
  • moving cross-connects to EC2 with VPC

Whew!  Man I hope someone’s reading this because it’s a lot of work.  Next, day three and bonus LSPE meeting!


Velocity 2010 – Facebook Operations

How The Pros Do It

Facebook Operations – A Day In The Life by Tom Cook

Facebook has been very open about their operations and it’s great for everyone.  This session is packed way past capacity.  Should be interesting.  My comments are  in italics.

Every day, 16 billion minutes are spent on Facebook worldwide.  It started in Zuckerberg’s dorm room and now is super huge, with tens of thousands of servers and its own full scale Oregon data center in progress.

So what serves the site?  It’s reasonably straightforward: load balancer, web servers, services servers, memory cache, database.  They wrote and 100% use HipHop for PHP, once they outgrew Apache+mod_php – it compiles PHP down to C++.  They use loads of memcached, and sharded MySQL for the database. OS-wise it’s all Linux – CentOS 5, actually.

All the site functionality is broken up into separate discrete services – news, search, chat, ads, media – and composed from there.

They do a lot with systems management.  They’re going to focus on deployment and monitoring today.

They see two sides to systems management – config management and on-demand tools.  And CM is priority 1 for them (and should be for you).  Don’t push stuff with shell scripts and hand-rolled error checking.  There are a lot of great options out there – cfengine, Puppet, Chef.  They use cfengine 2!  Old school alert!  They run updates every 15 minutes (each run only takes about 30s).

It means it’s easy to make a change, get it peer reviewed, and push it to production.  Their engineers have fantastic tools and they use those too (repo management, etc.)

On-demand tools do deliberate fixes or data gathering.  They used to use dsh but don’t think stuff like Capistrano will help them, so they wrote their own!  He ran uname -a across 10k distributed hosts in 18s with it.
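
Their tool wasn’t named or released; here’s a minimal sketch of the same fan-out idea using a thread pool over plain ssh (hostnames and concurrency are placeholders):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_remote(host, command="uname -a", timeout=15):
    """Run one command over ssh and return (host, output or error)."""
    try:
        out = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", host, command],
            capture_output=True, text=True, timeout=timeout,
        )
        return host, out.stdout.strip() or out.stderr.strip()
    except subprocess.TimeoutExpired:
        return host, "TIMEOUT"

hosts = [f"web{i:04d}.example.com" for i in range(10000)]   # placeholder hostnames
with ThreadPoolExecutor(max_workers=200) as pool:           # fan out in parallel
    for host, output in pool.map(run_remote, hosts):
        print(host, output)
```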

Up a layer to deployments.  Code is deployed in two ways – front end code pushes and back end deployments.  The web site they push at least once a day and sometimes more; once a week is new features, the rest are fixes etc.  It’s a pretty coordinated process.

Their push tool is built on top of the mystery on-demand tool.  They distribute the actual files using an internal BitTorrent swarm, and scaling issues are nevermore!  It takes 1 minute to push 100 MB of new code to all those 10k distributed servers.  (This doesn’t include the restarts.)

On the back end, they do it differently.  Usually you have engineering, QA, and ops groups, and that causes slowdown.  They got rid of the formal QA process and instead built that responsibility into the engineers.  Engineers write, debug, test, and deploy their own code.  This lets devs quickly see the response of subsets of real traffic and make performance decisions – it relies on the culture being very intense.  No “commit and quit.”  Engineers are deeply involved in the move to production.  And they embed ops folks into engineering teams so it’s not one huge dev group interfacing with one huge ops group.  Ops participates in architectural decisions and better understands the apps and their needs.  They can also interface with other ops groups more easily.  Of course, those ops people have to do monitoring/logging/documentation in common.

Change logging is a big deal.  They want the engineers to have freedom to make changes, and just log what is going on.  All changes, plus start and end time.  So when something degrades, ops goes to that guy ASAP – or can revert it themselves.  They have a nice internal change log interface that’s all social.  It includes deploys and “switch flips”.

Monitoring!  They like Ganglia even though it’s really old.  But it’s fast and allows rapid drilldown.  They update every minute; it’s just RRD and some daemons.  You can nest grids and pools.  They’re so big they have to shard Ganglia horizontally across servers and store RRDs in RAM, but you won’t need to do that.

They also have something called ODS (operational data store) which is more application focused and has history, reporting, better graphs.  They have soooo much data in it.

They also use Nagios, even though “that’s crazy”.  Ping testing, SSH testing, web server on a port.  They distribute it and feed alerting into other internal tools that aggregate it, using Nagios as an execution back end.  Aggregating alarms into clumps is critical, and decisions are made based on a tiered data structure – feeding into self healing, etc.  They have a custom interface for it.

At their size, there are some kind of failures going on constantly.  They have to be able to push fixes fast.

They have a lot of rack/cluster/datacenter etc levels of scale, and they are careful to understand dependencies and failure states among them.

They have constant communication – IRC with bots, internal news updates, “top of page” headers on internal tools, change log/feeds.  And using small teams.

How many users per engineer?  At Facebook, 1.1 million – but 2.3 million per ops person!  This means a 2:1 dev to ops ratio, I was going to ask…

To recap:

  • Version control everything
  • Optimize early
  • Automate, automate, automate
  • Use configuration management.  Don’t be a fool with your life.
  • Plan for failure
  • Instrument everything.  Hardware, network, OS, software, application, etc.
  • Don’t spend time on dumb things – you can slow people down if you’re “that guy.”
  • Priorities – Stability, support your engineers

Check facebook.com/engineering for their blog!  And facebook.com/opensource for their tools.
