Ohhh my aching head. Apparently this is a commonly held problem, as the keynote hall is much more sparsely attended at 8:30 AM today than it was yesterday. Some great fun last night, we hung with the Cloudkick and Turner Broadcasting guys, drinking and playing the “Can we name all the top level Apache projects” game.
It’s time for another morning of keynotes. Presentations from yesterday should be appearing on their schedule pages (I link to the schedule page in each blog post, so they should only be a click away). As always, my own comments below will be set off in italics to avoid libel suits.
First, we have John Rauser from Amazon on Creating Cultural Change.
Since we are technologists and problem-solvers, we of course tend to try to solve problems with technology – but many of our biggest, hardest problems are cultural.
It’s very true for Operations and performance because their goals are often in tension with other parts of the business. Much like security, legal, and other “non-core” groups. And it’s easy to fall into an adversarial relationship when there are “two sides.” Having a dedicated ops team is somewhat dangerous for this reason. So you need to ingrain performance and ops into your org’s mentality. Idyllic? Maybe, but doable.
If you determine someone is a bad person, like the coffee “free riders” who take the last cup and don’t make more, you have a dilemma. Complaining about “free riders” doesn’t work; nagging, shaming, etc. are the same deal. He had a friend who put up some humorous placards that marketed the “product” of making coffee. And it worked.
Sasquatch dancing guy! I don’t have the heart to go into it, just Google the video. Anyway, people join in when there’s unabashed joy and social cover. If you’re cranky, you add social cover for people to be cranky. Welcome newcomers. Lavish praise. Help them succeed. “Treat your beta testers as your most valuable resource and they will respond by becoming your most valuable resource.”
Shreddies vs Diamond Shreddies! Rebranding and perception change. Is DevOps our opportunity to turn “infrastructure people” into “agile admins?”
Anyway, be relentlessly happy and joyful. I know I have positivity problems, and I definitely look back and say outcomes would have been better if I hadn’t succumbed to the temptation to bitch about those “dumbass programmers”…
1. Try something new. A little novelty gets through people’s mental filters. If you’ve tried without metrics, try with. If you’ve tried metrics, try business drivers. If you tried that, pull out a single user session and simulate what it’s like.
2. Group identity. Mark people as special. Badges! Authority! Invite people to review their next project with an “ops expert” or “perf expert.”
3. Be relentless. Sending an email and waiting is a chump play. And be relentlessly happy.
There was a lot of wisdom in this presentation. As a former IT manager who ran a team with complex relationships with other infrastructure teams, dev teams, and business teams, I found that when we hewed to this kind of thinking, things tended to work, and when we didn’t, they tended not to.
Next up, Twitter. What’s changed since they spoke last year? They’ve made headway on Rails performance, more efficient use of Apache, and many more servers, load balancers, and people; they’re up to 210 employees.
One of my questions about the devops plan is how to scale it – same problem agile development has.
More and more it’s API. 75% of the traffic to Twitter is API now. 160k registered apps, 100M searches a day, 65M tweets per day.
They’re trying to work on CM and other stuff. Scaling doesn’t work the first time – you have to rebuild (refactor, in agile speak). They’re doing that now.
Shortening Mean Time To Detect Problems drives shorter Mean Time To Recovery.
They continuously evaluate looking for bottlenecks – find the weakest part, fix it, and move on to the next in an iterative manner.
They are all about metrics using Ganglia, and feed it to users on dev.twitter.com and status.twitter.com.
Don’t be a “systems administrator” any more. Combine statistical analysis and monitoring to produce meaningful results. Make decisions based on data not gut instincts.
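To make that concrete, here’s a toy sketch of making a call from data rather than gut: flag a metric only when it’s statistically out of line with recent history. The numbers and the three-sigma threshold are invented for illustration.

```python
import statistics

def is_anomalous(samples, current, sigma=3.0):
    """Flag a metric value that sits more than `sigma` standard
    deviations above the mean of recent samples."""
    mean = statistics.mean(samples)
    stdev = statistics.pstdev(samples)
    return current > mean + sigma * stdev

# Recent request latencies in ms; 950 ms is clearly an outlier.
history = [100, 105, 98, 110, 102, 97, 103]
print(is_anomalous(history, 950))   # → True
print(is_anomalous(history, 108))   # → False
```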
They’re working on low-level profiling of Apache, Ruby, etc. Network: latency, network usage, memory leaks (tcpdump + tcpdstat, yconalyzer). Apps: introspect with Google perftools.
Instrumenting the world pays off. Data analysis and visualization are necessary skills nowadays.
Rails hasn’t really been their performance problem. It’s front end problems like caching/cache invalidation, bad queries generated by ActiveRecord, garbage collection (20% of the issues!), and replication lag.
Analyze! Turn data into information. Understand where the code base is going.
Logging! Syslog doesn’t work at scale. No redundancy, no failure recovery. And moving large files is painful. They use scribe to HDFS/hadoop with LZO compression.
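As a rough illustration of the batch-and-compress approach, the sketch below groups log lines into batches and compresses each one before shipping, instead of emitting them line by line. zlib stands in for LZO (which isn’t in the Python standard library), and the log format is made up.

```python
import zlib

def batch_and_compress(lines, batch_size=1000):
    """Group log lines into batches and compress each batch before
    shipping, rather than syslogging them one at a time."""
    for i in range(0, len(lines), batch_size):
        batch = "\n".join(lines[i:i + batch_size]).encode()
        yield zlib.compress(batch)

logs = [f"GET /timeline 200 {n}ms" for n in range(5000)]
batches = list(batch_and_compress(logs))
print(len(batches))  # 5 compressed batches instead of 5000 log calls
```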
Dashboard – Theirs has “criticals” view (top 10 metrics), smokeping/mrtg, google analytics (not just for 200s!), XML feeds from managed services.
Whale Watcher – a shell script that looks for errors in the logs.
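I don’t have their script, but the idea is simple enough to sketch in a few lines of Python (log format and threshold are invented; the real Whale Watcher is a shell script):

```python
def whale_watch(log_lines, threshold=3):
    """Raise the alarm when the count of HTTP 5xx responses in a
    recent log window crosses a threshold. Assumes each line ends
    with the status code ('METHOD path status')."""
    errors = sum(1 for line in log_lines
                 if line.split()[-1].startswith("5"))
    return errors >= threshold

recent = ["GET / 200", "GET /t 503", "POST /x 500",
          "GET / 200", "GET /y 502"]
print(whale_watch(recent))  # → True: three 5xx lines in the window
```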
Change management is the stuff. They use Reviewboard and puppet+svn. Hundreds of modules, runs constantly. It reuses tools that engineers use.
And Deploywatcher, another script that stops deploys if there are system problems. They work a lot on deploy. Graph time of day next to CPU/latency.
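A Deploywatcher-style gate might look something like this sketch; the metric names and thresholds are made up, not Twitter’s actual checks.

```python
# Refuse to ship when key health metrics are out of bounds.
THRESHOLDS = {"error_rate": 0.01, "p99_latency_ms": 800}

def deploy_allowed(current_metrics):
    """Allow a deploy only if every watched metric is under its limit."""
    return all(current_metrics[name] <= limit
               for name, limit in THRESHOLDS.items())

healthy = {"error_rate": 0.002, "p99_latency_ms": 450}
on_fire = {"error_rate": 0.09, "p99_latency_ms": 2200}
print(deploy_allowed(healthy))  # → True
print(deploy_allowed(on_fire))  # → False
```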
They release features in “dark mode” and have on/off switches. Especially computationally/IO heavy stuff. Changes are logged and reported to all teams (they have like 90 switches). And they have a static/read-only mode and “emergency stop” button.
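A minimal sketch of on/off switches with change logging; the switch names and audit format here are invented, not Twitter’s actual system.

```python
# Toy feature switches for "dark mode" launches. Real systems back
# this with shared config and report every flip to all teams.
switches = {"new_search_backend": False}
audit_log = []

def set_switch(name, value, operator):
    """Flip a switch and record who did it, so changes are reported."""
    switches[name] = value
    audit_log.append(f"{operator} set {name}={value}")

def search(query):
    if switches["new_search_backend"]:
        return f"new-backend results for {query!r}"
    return f"old-backend results for {query!r}"

print(search("velocity"))                        # old backend
set_switch("new_search_backend", True, "ops")
print(search("velocity"))                        # dark-launched backend
```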
Subsystems! Take a look at how we manage twitter.
loony – a central machine database in MySQL. They use managed hosting so they’re always mapping names. Python, Django, and the Paramiko SSH library. Ties into LDAP. When the data center sends mail, machine definitions are built in real time. On-demand changes with “run.” Helps with deploy and querying.
murder – a bittorrent based deploy client, python+libtorrent.
memcached – the network memory bus isn’t infinite. Evictions make the cache unreliable for important configs. They segment into pools for better performance. Examine slab allocation and watch for high use/eviction with “peep.” Manage it like anything else.
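The pool-segmentation idea, sketched with plain dicts standing in for real memcached clients (the pool names and key scheme are invented):

```python
# Segment caches into pools so hot, evict-happy data (timelines)
# can't push critical config out of the cache.
pools = {"config": {}, "timelines": {}, "users": {}}

def pool_for(key):
    """Route a key to its pool by prefix; unknown prefixes share one pool."""
    prefix = key.split(":", 1)[0]
    return pools.get(prefix, pools["timelines"])

def cache_set(key, value):
    pool_for(key)[key] = value

def cache_get(key):
    return pool_for(key).get(key)

cache_set("config:ratelimit", 150)
cache_set("users:42", {"name": "jane"})
print(cache_get("config:ratelimit"))  # → 150
```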
Tiers – load balancer, apache, rails (unicorn), flock DB.
Unicorn is awesome and more awesome than mongrel for Rails guys.
Shift from Proxy Balancer to ProxyPass (the slide said PP to PB, but he spoke about it the other way, and putting our heads together we believe the latter more). It’s not that Apache is worse than nginx; the proxy module is what matters.
Asynchronous requests. Do it. Workers are expensive. The request pipeline should not be used to handle third party communication or back end work. Move long-running work to daemons whenever possible. They’re moving more parts to queuing.
kestrel is their queuing server that looks like memcache. set/get.
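So enqueueing and dequeueing look like cache operations. A toy in-memory imitation of those semantics (not the real kestrel, which is a standalone server):

```python
from collections import defaultdict, deque

# Queues that speak memcache-style semantics: "set" enqueues,
# "get" dequeues. Purely in-memory for illustration.
queues = defaultdict(deque)

def q_set(queue_name, item):
    queues[queue_name].append(item)

def q_get(queue_name):
    q = queues[queue_name]
    return q.popleft() if q else None

q_set("emails", "welcome-user-42")
q_set("emails", "welcome-user-43")
print(q_get("emails"))  # → 'welcome-user-42' (FIFO)
```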
They have daemons; in fact they’ve been consolidating them (not one daemon per job, but one handling many jobs).
FlockDB shards their social graph through Gizzard and stores it in MySQL.
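In spirit, sharding maps each id deterministically onto a fixed set of stores; a bare-modulo sketch follows (the real Gizzard is more sophisticated, using forwarding tables rather than a simple hash).

```python
# Deterministically map a user id onto one of N MySQL shards.
NUM_SHARDS = 4

def shard_for(user_id):
    """Same id always lands on the same shard."""
    return f"shard-{user_id % NUM_SHARDS}"

print(shard_for(12345))  # → 'shard-1'
```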
Disk is the new tape
Caching – realtime but heck, 60 seconds is close. Separate memcache pools for different types. “Cache everything” is not the best policy – invalidation problems, cold memcache problems. Use memcache to augment the database. You don’t want to go down if you lose memcache, your db still needs to handle the load.
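A sketch of that policy: a read-through cache where losing the cache tier degrades to hitting the database instead of taking the site down. The data and names are invented; dicts stand in for MySQL and memcached.

```python
db = {"user:1": "alice", "user:2": "bob"}   # stand-in for the database
cache = {}                                   # stand-in for memcached
cache_up = True

def get(key):
    """Read-through: serve from cache when possible, but the DB must
    be able to handle the load if the cache tier disappears."""
    if cache_up and key in cache:
        return cache[key]
    value = db[key]
    if cache_up:
        cache[key] = value
    return value

print(get("user:1"))   # miss → DB, then cached
cache_up = False       # simulate losing the memcache tier
print(get("user:1"))   # still served, straight from the DB
```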
MySQL challenges – replication delay. And social networks don’t fit RDBMS well. Kill long running SQL queries with mkill. Fail fast!!!
He’s going over a lot of this too fast to write down but there’s GREAT info. Look for the slides later. It’s like drinking from a fire hose.
In closing –
- Use CM early
- log everything
- plan to build everything more than once
- instrument everything and use science
ShopZilla spoke in previous years about their major performance redesign and the effects it had. It opened their eyes to the $$ benefits and got them addicted to performance.
Performance is top in Fred Wilson’s 10 Golden Rules For Successful Web Apps. And Google/Microsoft have put out great data this year on performance’s link to site metrics.
Performance can slip away if you take your eyes off the ball. It doesn’t improve if left alone.
They took their eye off the ball because they were testing back end perf but not front end (and that’s where 80% of the time is spent). Constant feature development runs it up. A/B testing needs a framework, which adds JS and overhead.
It’s easier to attack performance early in the dev cycle, by infecting everyone with a performance mindset.
They put together a virtual performance team and went for more measurements. Nightly testing using httpwatch and YSlow and stuff. All these “one time” test tools really need to be built into a regression rig.
They found they were doing some good things, but also found things they could fix. For progressive rendering, 8k chunks were too large, so they set Tomcat to smaller flush intervals. Too many requests. Bandwidth contention with less critical page elements.
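The flush-interval idea in miniature: emit the page in small chunks so the header can start rendering before the rest is ready. This is a generic Python generator with invented sizes, not ShopZilla’s actual Tomcat configuration.

```python
def render_progressively(html, chunk_size=2048):
    """Yield the page in small chunks (smaller flush intervals) so
    the browser can start rendering the header early."""
    for i in range(0, len(html), chunk_size):
        yield html[i:i + chunk_size]

page = "<html>" + "x" * 10000 + "</html>"
chunks = list(render_progressively(page))
print(len(chunks))  # the header goes out in the first, early flush
```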
They wanted to focus on the perception of performance via progressive rendering, deferring less important stuff. Flushing faster got the header out quicker. They reordered stuff. It improved their conversion rate by 0.4%. Doesn’t sound like much, but it’s comparable to a feature release, and they had a two-month ROI given what they put into it.
They did an infrastructure audit looking for hotspots and underutilization, and saved a lot in future hardware costs ($480k).
Performance is an important feature; it isn’t free, but it has a measurable value. ROI!
Next up is actually Gomez, which is now “the Web performance division of Compuware.” He promises not to do a product pitch, but to share data.
Does better performance impact customer behavior and the bottom line? They looked at their data. Performance vs page abandonment. If you improve your performance, abandon rate goes down by large percentages per second of speedup.
The Web is the new integration platform and the browser’s where the apps come together. How many hosts are hit by the browser per user transaction on average? 8 or higher, across all industries and localities.
What percent of Web transactions touch Amazon EC2 for at least one object? Like 20%. It is going up hideously fast (like 4% in the last month).
Cloud performance concerns – loss of visibility and control especially because existing tools don’t work well. And multitenant means “someone might jack me!”
They put the same app over many clouds and monitored it! They have a nice graph that shows variance; I can’t find it with a quick Google search though, otherwise I’d link it here for you. And availability across many clouds is about 99.5%.
How do you know if the problem is the cloud or you? They put together cloudsleuth.net to show off these apps and performance. You can put in your own URL and soon you’ll get data on “is the cloud messed up, or is it you?”
Domain sharding is a common performance optimization. With S3 you get that “for free” using buckets’ DNS names and you can get a big performance speedup.
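For reference, classic domain sharding just maps each asset deterministically to one of a few hostnames so older browsers open more parallel connections (the hostnames below are invented):

```python
import zlib

SHARDS = ["img0.example.com", "img1.example.com",
          "img2.example.com", "img3.example.com"]

def shard_host(path):
    """Hash the asset path to a hostname; the same path always maps
    to the same host, so browser caches stay warm."""
    return SHARDS[zlib.crc32(path.encode()) % len(SHARDS)]

print(shard_host("/img/logo.png") == shard_host("/img/logo.png"))  # → True
```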
The cloud allows dynamic provisioning… Yeah we know.
But… Domain sharding fails to show a benefit on modern browsers. In fact, it hurts.
At NI we recently did a whole analysis of various optimizations and found exactly this – domain sharding didn’t help and in fact hurt performance a bit. We thought we might have been crazy or doing something wrong. Apparently not.
They can see the significant performance differences among browsers/devices.
You have to test and validate your optimizations. Older wisdom (like “shard domains”) doesn’t always hold any more.
Check some of this stuff out at gomez.com/velocity!
Coming up – lightning demos!