Tag Archives: performance

Velocity 2010 – Day 3 Demos and More

Check out the Velocity 2010 flickr set!  And YouTube channel!

Time for lightning demos.

HTTPWatch

An HTTP browser plugin that does the usual waterfalls from your Web pages.  Version 7 is out!  You can change fonts.  It works in both IE and Firefox, unlike the other stuff.

Rather than focus on ranking like YSlow/PageSpeed, they focus on showing specific requests that need attention.  Same kind of data but from a different perspective.  And other warnings, not all strictly perf related.  Security, etc.  Exports and consumes HAR files (the new standard for http waterfall interchange).
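
Since all these tools are converging on HAR, it’s worth noting how simple the format is to consume yourself.  Here’s a minimal Python sketch (my own, not something from HTTPWatch; the filename is made up) that reads an exported HAR file and prints the ten slowest requests:

    import json

    # Read an exported HAR file and list the slowest requests.
    # "waterfall.har" is a hypothetical filename.
    with open("waterfall.har") as f:
        har = json.load(f)

    entries = har["log"]["entries"]
    for e in sorted(entries, key=lambda e: e["time"], reverse=True)[:10]:
        print(int(e["time"]), "ms", e["request"]["url"])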

webpagetest.org

Based on AOL Pagetest, an IE module, but hosted.  Can be installed as a private site too.  It provides object timing and waterfalls.  Allows testing from multiple locations and network speeds, and saves results historically.  Like a free single-snapshot version of Keynote/Gomez kinds of things.

Shows stuff like browser CPU and bandwidth utilization and does visual comparisons, showing perceived performance in filmstrip view and video.

And does HAR  import/export!  Ah, collaboration.

The CPU/net metrics show some “why” behind gaps in the waterfalls.

The filmstrip side by side view shows different user experiences very well.  And you can do video, as that is sexy.

They have a new UI built by a real designer (thanks Neustar!).

Speed Tracer

But what about Chrome, you ask?  We have an extension for that now.  Similar to PageSpeed.  The waterfall timeline is beautiful, using real “Google Finance” style visualization.  The other guys aren’t like RRDTool ugly but this is super purty.

It will deobfuscate JavaScript and integrates with Eclipse!

They’re less worried about network waterfall and more about client side render.  A lot of the functionality is around that.

You can send someone a Speed Tracer dump file for debugging.

Fiddler2

Are you tired of being browser dependent?  Fiddler has your back.

New features…  Hey, platform preview 3 for IE9 is out.  It has some tools for capture and export; it captures traffic in an XML-serialized HAR.  Fiddler imports the standard JSON HAR and the IE9 XML HAR!  And there’s HAR 1.1!  Eeek.  And wcat.  It imports lots of different stuff, in other words.

I want one of these to take in Wireshark captures and rip out all the crap and give me the HTTP view!

FiddlerCap Web recorder (fiddlercap.com) lets people record transactions and send them to you.

Side by side viewing with two Fiddlers if you launch with -viewer.

There’s a comparison extension called differ.  Nice!

You can replay captures, including binaries now, with the AutoResponder tab.  And it’ll replay latency soon.

We still await the perfect HTTP full capture and replay toolchain… We have our own HTTP log replayer we use for load tests and regression testing, if we could do this but in volume it would rock…

Caching analysis.  And there’s a FiddlerCore library you can embed in your app.

Now, Bobby Johnson of Facebook speaks on Moving Fast.

Building something isn’t hard, but you don’t know how people will use it, so you have to adapt quickly and make faster changes.

How do you get to a fast release cycle?  Their biggest requirement is “the site can’t go down.”  So they go to frequent small changes.

Most of the challenge when something goes wrong isn’t fixing it, it’s finding out what went wrong so you can fix it.  Smaller changes make that easier.

They take a new thing, push some fake traffic to it, then push a % of user traffic, and then dial back if there’s problems.
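
The “dial a percentage of user traffic” part is easy to picture in code.  A rough sketch of the idea (mine, not Facebook’s actual gatekeeper), bucketing users by a hash so each user gets a stable experience while you turn the dial:

    import hashlib

    def in_rollout(user_id: str, feature: str, percent: float) -> bool:
        # Deterministic bucket 0-99 per (feature, user); the same user stays
        # in or out as long as the percentage doesn't cross their bucket.
        digest = hashlib.md5(f"{feature}:{user_id}".encode()).hexdigest()
        return int(digest[:8], 16) % 100 < percent

    # Start at 1%, watch the graphs, then dial up -- or back down if it breaks.
    print(in_rollout("user-42", "new_chat", 1.0))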

If you aren’t watching something, it’ll slip.  They had their page load time at 5s; did a big improvement and got it to 2.5s.  But then it slips back up.  How do you keep it fast (besides “don’t change anything”)?  They changed their organization to make quick changes but still maintain performance.

What makes a site get slow?  It’s not a deep black art.  A lot of it is just not paying attention to your performance – you can’t expect new code to be free of bugs or performance problems.

  • New code is slow
  • More code is slow

“Age the code” by allocating time to shaking out performance problems – not just before, but after deploy.

  • Big pipe – break the page into small pieces and pipelines it.
  • Primer – a JavaScript library that bootstraps by downloading the minimum first.

Both separate code into a fast path and a slow path and default to the slow path.
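
To make the Big Pipe idea concrete, here’s a toy Python sketch of the “flush the skeleton, then stream each pagelet as it’s ready” pattern (my own illustration, nothing like Facebook’s implementation; the pagelet names and delays are invented):

    import time

    def render_page():
        # Send the page skeleton immediately...
        yield "<html><body><div id=chat></div><div id=feed></div>"
        # ...then emit each pagelet as its backend data arrives, rather than
        # blocking the whole page on the slowest piece.
        for pagelet, delay in (("chat", 0.1), ("feed", 0.4)):
            time.sleep(delay)   # stand-in for a backend call
            yield f"<script>fill('{pagelet}')</script>"
        yield "</body></html>"

    for chunk in render_page():
        print(chunk)   # a real server would flush each chunk to the client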

I have a new “poke” feature I want to try… I add it in as a lazy loaded thing and see if anyone cares, before I spend huge optimization time.

It gets popular!  OK, time to figure out performance on the fast path.

So they engineer performance per feature, which allows prioritization.

You can have a big metric collection and reporting tool.  But that’s different from alerting tools.  Granularity of alerting – no one wants to get paged on “this one server is slow.”  But no one cares about “the whole page is 1% slower” either.  But YOUR FEATURE is 50% slower than it was – that someone cares about.  Your alert granularity needs to be at the level of “what a single person works on.”
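
A sketch of what “alert at the granularity of what a single person works on” might look like (my illustration, not Facebook’s tooling; the baselines and threshold are made up):

    baseline_ms = {"news_feed": 900, "chat": 250, "photos": 1200}

    def check_feature_latency(feature, current_ms, threshold=1.5):
        base = baseline_ms[feature]
        if current_ms > base * threshold:
            # Page the feature's point person, not the whole ops rotation.
            print(f"ALERT {feature}: {current_ms:.0f} ms vs baseline {base} ms")

    check_feature_latency("chat", 400)   # 60% slower than its baseline -> alert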

No one is going to fix things if they don’t care about it.  And also not unless they have control over it (like deploying code someone else wrote and being responsible for it breaking the site!). And they need to be responsible for it.

They have tried various structures.  A centralized team focuses on performance but doesn’t have control over it (except “say no” kinds of control).

Saying “every dev is responsible for their perf” distributes the responsibility well, but doesn’t make people care.

So they have adopted a middle road.  There’s a central team that builds tools and works with product teams.  On each product team there is a performance point person.  This has been successful.

Lessons learned:

  • New code is slow
  • Give developers room to try things
  • Nobody’s job is to say no

Joshua from Strangeloop on The Mobile Web

Here we all know that performance links directly to money.

But others (random corporate execs) doubt it.  And what about the time you have to spend?  And what about mobile?

We need to collect more data.

Case study – with 65% performance increase, 6% order size increase and 9% conversion increase.

For mobile, 40% perf led to 3% order size and 5% conversion.

They have a conversion rate fall-off by landing page speed graph, so you can say what a 2 second improvement is worth.  And they have preliminary data on mobile too.

I think he’s choosing a very confusing way to say you need metrics to establish the ROI of performance changes.  And MOBILE IS PAYING MONEY RIGHT NOW!

Cheryl Ainoa from Yahoo! on Innovation at Scale

The challenges of scale – technical complexity, outgrowing many tools and techniques, no off hours, and being a target for abuse.

Case Study: Fighting Spam with Hadoop

Yahoo! Groups was sending 20M emails/day to Taiwan, and there are only 18M Internet users in Taiwan.  What can help?  Nothing existing could handle that volume (SpamCop, etc.), and running their rules takes a couple of days.  So they used Hadoop to flag “spammy” groups in parallel.  Cut mail delivered by 5x.
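
For flavor, here’s a toy Hadoop-streaming-style mapper/reducer for flagging high-volume groups (my own sketch of the idea, not Yahoo!’s pipeline; the input format and the daily threshold are invented):

    import sys
    from itertools import groupby

    def mapper(lines):
        for line in lines:                    # each line: "group_id<TAB>recipient"
            group_id, _recipient = line.rstrip("\n").split("\t")
            print(f"{group_id}\t1")

    def reducer(lines):                       # Hadoop sorts by key between phases
        parsed = (line.rstrip("\n").split("\t") for line in lines)
        for group_id, rows in groupby(parsed, key=lambda kv: kv[0]):
            total = sum(int(count) for _g, count in rows)
            if total > 100_000:               # suspiciously high daily volume
                print(f"{group_id}\t{total}\tSPAMMY")

    if __name__ == "__main__":
        (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)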

Edge pods – small compute footprints to optimize cost and performance.  You can’t replicate your whole setup globally.  But adding on to a CDN is adding some compute capability to the edge in “pods.”  They have a proxy called YCPI to do this with.

And we’re out of time!


Filed under Conferences, DevOps

Velocity 2010: Day 2 Keynotes Continued

Back from break and it’s time for more!  The dealer room was fricking MOBBED.  And this is with a semi-competing cloud computing convention, Structure,  going on down the road.

Lightning Demos

Time for a load of demos!

dynaTrace

Up first is dynaTrace, a new hotshot in the APM (application performance management) space.  We did a lot of research in this space (well, the sub-niche in this space for deep dive agent-driven analytics), eventually going with Opnet Panorama over CA/Wily Introscope, Compuware, HP, etc.  dynaTrace broke onto the scene since and it seems pretty pimp.  It does traditional metric-based continuous APM and also has a “PurePath” technology where they inject stuff so you can trace a transaction through the tiers it travels along, which is nice.

He analyzed the FIFA page for performance.  dynaTrace is more of a big agenty thing but they have a free AJAX Edition analyzer that’s lighter weight and client based.  A lot of the agent-based perf vendors just care about that part, and it’s great that they are also looking at front end performance because it all ties together into the end user experience.  Anyway, he shows off the AJAX Edition, which does a lot of nice performance analysis on your site.  Their version 2.0 of Ajax Edition is out tomorrow!

And, they’re looking at how to run in the cloud, which is important to us.

Firebug

A Firefox plugin for page inspection, but if you didn’t know about Firebug already you’re fired.  Version 1.6 is out!  And there’s like 40 addons for it now, not just YSlow etc, but you don’t know about them – so they’re putting together “swarms” so you can get more of ’em.

In the new version you can see paint events.  And export your findings in HAR format for portability across tools of this sort.  Like httpwatch, pagespeed, showslow.  Nice!

They’ve added breakpoints on network and HTML events.  “FireCookie” lets you mess with cookies (and breakpoints on this too).

YSlow

A Firebug plugin, was first on the scene in terms of awesome Web page performance analysis.  showslow.com and gtmetrix.com complement it.  You can make custom rules now.  WTF (Web Testing Framework) is a new YSlow plugin that tests for more shady dev practices.

PageSpeed

PageSpeed is like YSlow, but from Google!  They’ve been working on turning the core engine into an SDK so it can be used in other contexts.  Helps identify optimizations to improve time to first paint, making JS/CSS recommendations.

The Big Man

Now it’s time for Tim O’Reilly to lay down the law on us.  He wrote a blog post on operations being the new secret sauce that kinda kicked off the whole Velocity conference train originally.

Tim’s very first book was the Masscomp System Administrator’s Guide, back in 1983.  System administration was the core that drove O’Reilly’s book growth for a long time.

Applications’ competitive advantage is being driven by large amounts of data.  Data is the “intel inside” of the new generation of computing.  The “internet operating system” being built is a data operating system.  And mobile is the window into it.

He mentioned OpenCV (computer vision) and Collective Intelligence, which ended up using the same kinds of algorithms.  So that got him thinking about sensors and things like Arduino.  And the way technology evolves is hackers/hobbyists to innovators/entrepreneurs.  RedLaser, Google Goggles, etc. are all moves towards sensors and data becoming pervasive.  Stuff like the NHIN (nationwide health information network).  CabSense.  AMEE, “the world’s energy meter” (or at least the UK’s), determined you can tell the make and model of someone’s appliances based on the power spike!  Passur has been gathering radar data from sensors, feeding it through algorithms, and now doing great prediction.

Apps aren’t just consumed by humans, but by other machines.  In the new world every device generates useful data, in which every action creates “information shadows” on the net.

He talks about this in Web Squared.  Robotics, augmented reality, personal electronics, sensor webs.

More and more data acquisition is being fed back into real time response – Walmart has a new item on order 20 seconds after you check out with it.  Immediately as someone shows up at the polls, their name is taken off the call-reminder list with Obama’s Project Houdini.

Ushahidi, a crowdsourced crisis mapper.  In Haiti relief, it leveraged tweets and Skype and Mechanical Turk – all these new protocols were used to find victims.

And he namedrops Opscode, the Chef guys – the system is part of this application.  And the new Web Operations essay book.  Essentially, operations are who figures out how to actually do this Brave New World of data apps.

And a side note – the closing of the Web is evil.  In The State of the Internet Operating System, Tim urges you to collaborate and cooperate and stay open.

Back to ops – operations is making some pretty big – potentially world-affecting – decisions without a lot to guide us.  Zen and the Art of Motorcycle Maintenance will guide you.  And do the right thing.

Current hot thing he’s into – Gov 2.0!  Teaching government to think like a platform provider.  He wants our help to engage with the government and make sure they’re being wise and not making technophobic decisions.

Code for America is trying to get devs for city governments, which of course are the ass end of the government sandwich – closest to us and affecting our lives the most, but with the least funds and skills.

Third Party Pain

And one more, a technical one, on the effects of third party trash on your Web site – “Don’t Let Third Parties Slow You Down,” by Google’s Arvind Jain and Michael Kleber.  This is a good one – if you run YSlow on our Web site, www.ni.com, you’ll see a nice fast page with some crappy page tags and inserted JS junk making it half as fast as it should be.  Eloqua, Unica, and one of our own demented designs layer an extra 6s or so on top of a nice sub-2s page load.

Adsense adds 12% to your page load.  Google Analytics adds 5% (and you should use the new asynchronous snippet!).  Doubleclick adds 11.5%.  Digg, FacebookConnect, etc. all add a lot.

So you want to minimize blocking on the external publisher’s content – you can’t get rid of them and can’t make them be fast (we’ve tried with Eloqua and Unica, Lord knows).

They have a script called ASWIFT that makes show_ads.js a tiny loader script.  They make a little iframe and write into it.  Normally if you document.write, you block the hell out of everything.  Their old show_ads.js had a median of 47 ms and a 90th percentile of 288 ms latency – the new ASWIFT one has a median of 11 ms and a 90th %ile of 32 ms!!!

And as usual there’s a lot of browser specific details.  See the presentation for details.  They’re working out bugs, and hope to use this on AdSense soon!


Filed under Conferences, DevOps

Velocity 2010: Day 2 Keynotes

The Huddled Masses

Day 2 of Velocity 2010 starts off with a bang!  It’s Day 1 for a lot of people; Day 1 is optional workshops.  Anyway, Steve Souders (repping performance) and Jesse Robbins (repping ops) took the stage and told us that there are more than 1000 people at the show this year, more than both previous years combined!  And sure enough, we’re stuffed past capacity into the ballroom and they have satellite rooms set up; the show is sold out and there’s a long waiting list.  The fire marshal is thrilled.  Peco and I are long term Velocity alumni and have been to it every year – you can check out all the blog writeups from Velocity 2008 and 2009!  As always, my comments are in italics.

Jesse stripped down to show us the Velocity T-shirt, with the “fast by default” tagline on it.  I think we should get Ops representation on there too, and propose “easy by design.”   “fast by default/easy by design.”  Who’s with me?

Note that all these keynotes are being Webcast live and on demand so you can follow along!

Datacenter Revolution

The first speaker is James Hamilton of AWS on Datacenter Infrastructure Innovation.  There’s been a big surge in datacenter innovation, driven by the huge cloud providers (AWS, Google, Microsoft, Baidu, Yahoo!), green initiatives, etc. A fundamental change in the networking world is coming, and he’s going to talk about it.

Cloud computing will be affecting the picture fundamentally; driving DC stuff to the big guys and therefore driving professionalization and innovation.

Where does the money go in high scale infrastructure?  54% servers, 21% power distribution/cooling, 13% power, 8% networking, 5% other infrastructure.  Power stuff is basically 34% of the total and trending up.  Also, networking is high at 8% (19% of your total server cost).

So should you virtualize and crush it onto fewer servers and turn the others off?  You’re paying 87% for jack crap at this point.  Or should you find some kind of workload that pays more than the fractional cost of power?  Yes.  Hence, Amazon spot instances. The closer you can get to a flat workload, the better it is for you, everyone, and the environment.

Also, keeping your utilization up is critical to making $ return.

In North America, 11% of all power is lost in distribution.  Each step is 1-2% inefficient (substation, transformer, UPS, to rack) and it all adds up.  And that’s not counting the actual server power supply (80% efficient) and on board voltage regulators (80% efficient though you can buy 95% efficient ones for a couple extra dollars).
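
Multiplying those losses out shows why it matters; a back-of-the-envelope version (step efficiencies assumed from the figures above, and they vary by site):

    distribution = [0.985, 0.985, 0.99, 0.985]   # substation, transformer, UPS, rack
    server_psu = 0.80
    voltage_regulators = 0.80

    efficiency = server_psu * voltage_regulators
    for step in distribution:
        efficiency *= step
    print(f"{efficiency:.2f} W of useful power per W entering the facility")   # ~0.61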

Game consoles are more efficient computing resources than many data centers – they’ve solved the problem of operating in hot and suboptimal conditions.  There’s also potential for using more efficient cooling than traditional HVAC – like, you know, “open a window”.

Sea Change in Net Gear!  Network gear is oversubscribed.  ASIC vendors are becoming more competitive and innovating more – and Moore’s Law is starting to kick in there (as opposed to the ‘slow ass law’ that’s ruled for a while). Networking gear is one of the major hindrances to agility right now – you need to be able to put servers wherever you want in the datacenter, and that’s coming.

Speed Matters

The second keynote is Urs Hölzle from Google on “Speed Matters.”

Google, as they know and see all, know that the average load time of a Web page is 4.9 seconds, average size 320 kB.  Average user bandwidth is 1.8 Mbps.  Math says that load should be 1.4 seconds – so what up?
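
The math he’s referring to is just raw transfer time for the average page:

    page_size_kB = 320
    bandwidth_Mbps = 1.8
    # 320 kB * 8 bits / 1.8 Mbps, ignoring everything but raw transfer
    print(f"{page_size_kB * 8 / (bandwidth_Mbps * 1000):.1f} s")   # ~1.4 s vs. the observed 4.9 s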

Well, webpagetest.org shows you – it’s not raw data transfer, it’s page composition and render.  Besides being 320 kB, it has 44 resources, makes 7 DNS lookups, and doesn’t compress 1/3 of its content.

Google wants to make the Web faster.  First, the browser – Chrome!  It’s speed 20 vs Firefox at 10 vs IE8 at like 2.  They’re doing it as open source to spur innovation and competition (HTML5, DNS prefetch, VP8 codec, V8 JS engine).  So “here’s some nice open source you could adopt to make it better for your users, and if you don’t, here’s a reference browser using it that will beat your pants off.  Enjoy!”

TCP needs improvements too.  It was built for slow networks and nuke-level resiliency, not speed.  They have a tuning paper that shows some benefits – fast start, quick loss recovery makes Google stuff 12% faster (on real sites!).  And no handshake delay (app payload in SYN packets).

DNS needs to propagate the client IP in DNS requests to allow servers to better map to closest servers – when DNS requests go up the chain that info is lost.  Of course it’s a little Big Brother, too.

SSL is slow.  False start (reducing one round trip from the handshake) makes Android 10% faster.  Snap start and OCSP stapling are proposed improvements to avoid round trips to the client and CA.

HTTP itself, they promote SPDY.  Does header compression and other stuff, reduces packets by 40%.  It’s the round trips that kill you.

DNS needs to be faster too.  Enter Google’s Public DNS.  It’s not really that much data, so for them to load it into memory is no big deal.

And 1 Gbps broadband to everyone’s home!  Who’s with me?  100x improvement!

This is a good alignment of interests.  Everyone wants the Web to be faster when they use it, and obviously Google and others want it to be faster so you can consume their stuff/read their ads/give them your data faster and faster.

They are hosting for popular cross-Web files like jQuery, fonts, etc.  This improves caching on the client and frees up your servers.

For devs, they are trying to make tools for you.  Page Speed, Closure Compiler, Speed Tracer, Auto Spriter, Browserscope, Site Performance data.

“Speed is product feature #1.”  Speed affects search ranking now.  Go to code.google.com/speed and get your speed on!  Google can’t do it alone…  And slow performance is reducing your revenue (they have studies on that).

Are You Experienced?

Keynote 3 is From Browsers to Mobile Devices: The End User Experience Still Matters, by Vik Chaudhary (Keynote Systems, Inc.).  As usual, this is a Keynote product pitch.  I guess you have to pay the piper for that sponsorship.  Anyway, they’re announcing a new version of MITE, their mobile version of KITE, the Keynote Internet Testing Environment.

Mobile!  It’s big!  Shakira says so! Mee mee mee mee!

Use MITE to do mobile testing; it’ll test and feed it into the MyKeynote portal.  You can see performance and availability for a site on  iPhone, Blackberry, Palm Pre, etc.  You can see waterfalls and screenshots!  That is, if you pay a bunch extra, at least that’s what we learned from our time as a Keynote customer…

MITE is pretty sexy though.  You can record a transaction on an emulated iPhone.  And analyze it.  Maybe I’m biased because I already know all this, and because we moved off Keynote to Gomez despite their frankly higher prices because their service and technology were better.  KITE was clever but always seemed to be more of a freebie lure-them-in gimmick than a usable part of a real Keynote customer’s work.

Now it’s break time!  I’ll be back with more coverage from Velocity 2010 in a bit!


Filed under Conferences, DevOps

Velocity 2010: Scalable Internet Architectures

My first workshop is  Scalable Internet Architectures by Theo Schlossnagle, CEO of OmniTI.  He gave a nearly identical talk last year but I missed some of it, and it was really good, so I went!  (Robert from our Web Admin team attended as well.)

There aren’t many good books on scalability.  Mainly there are three – Art of Scalability, Cal Henderson’s Building Scalable Web Sites, and his own, Scalable Internet Architectures.  So any tips you can get a hold of are welcome.

Following are my notes from the talk; my own thoughts are in italics.

Architecture

What is architecture?  It encompasses everything from power up to the client touchpoint and everything in between.

Of necessity, people are specialized into specific disciplines but you have to overcome that to make a whole system make sense.

The new push towards devops (development/operations collaboration) tries to address this kind of problem.

Operations

Operations is a serious part of this, and it takes knowledge, tools, experience, and discipline.

Knowledge – Easy to get: the Internet, conferences (Velocity, Structure, Surge), user groups.

Tools – All tools are good; understand the tools you have.  Some of operations encourages hackiness because when there is a disruption, the goal is “make it stop as fast as possible.”

You have to know how to use tools like truss, strace, dtrace through previous practice before the outage comes.  Tools (and automation) can help you maintain discipline.

Experience comes from messing up and owning up.

Discipline is hardest.  It’s the single most lacking thing in our field. You have to become a craftsman. To learn discipline through experience, and through practice achieve excellence. You can’t be too timid and not take risks, or take risks you don’t understand.

It’s like my old “Web Admin Standing Orders” that tried to delineate this approach for  my ops guys – “1.  Make it happen.  2.  Don’t f*ck it up.  3.  There’s the right way, the wrong way, and the standard way.”  Take risks, but not dumb risks, and have discipline and tools.

He recommends the classic Zen and the Art of Motorcycle Maintenance for operations folks.  Cowboys and heroes burn out.  Embrace a Zen attitude.

Best Practices

  1. Version Control everything.  All tools are fine, but mainly it’s about knowing how to use it and using it correctly, whether it’s CVS or Subversion or git.
  2. Know Your Systems – Know what things look like normally so you have a point of comparison.  “Hey, there’s 100 database connections open!  That must be the problem!”  Maybe that’s normal.  Have a baseline (also helps you practice using the tools).  Your brain is the best pattern matcher.
    Don’t say “I don’t know” twice.  They wrote an open source tool called Reconnoiter that looks at data and graphs regressions and alerts on it (instead of Cacti, Nagios, and other time-consuming stuff).  Now available as SaaS!
  3. Management – Package rollout, machine management, provisioning. “You should use puppet or chef!  Get with the times and use declarative definition!”  Use the tools you like.  He uses kickstart and cfengine and he likes it just fine.

Dynamic Content

Our job is all about the dynamic content.  Static content – bah, use Akamai or CacheFly or Panther or whatever.  It’s a solved problem.

Premature optimization is the root of all evil – well, 97% of it.  It’s the other 3% that’s a bitch.  And you’re not smart enough to know where that 3% is.

Optimization means “don’t do work you don’t have to.”  Computational reuse and caching,  but don’t do it in the first place when possible.
He puts comments for things he decides not to optimize explaining the assumptions and why not.

Sometimes naive business decisions force insane implementations down the line; you need to re-check them.

Your content is not as dynamic as you think it is.  Use caching.

Technique – Static Element Caching

Applied YSlow optimizations – it’s all about the JavaScript, CSS, images.  Consolidate and optimize.  Make it all publicly cacheable with 10 year expiry.

RewriteRule (.*)\.([0-9]+)\.css $1.css maps /s/app.23412.css to /s/app.css – you get unique URLs that still point at the same file.  Bump up the number in the template when the file changes and clients fetch a fresh copy.  Use “cat” to consolidate files, freaks!

Images, put a new one at a new URI.  Can’t trust caches to really refresh.
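
The other half of that rewrite trick is the template helper that emits the versioned URLs.  A minimal sketch (hypothetical helper, not from the talk):

    ASSET_VERSION = 23412   # bump on deploy (or per-file change)

    def asset_url(path: str) -> str:
        # /s/app.css -> /s/app.23412.css; the rewrite rule maps it back to the
        # real file, but browsers see a brand-new, cacheable-forever URL.
        name, ext = path.rsplit(".", 1)
        return f"{name}.{ASSET_VERSION}.{ext}"

    print(asset_url("/s/app.css"))   # /s/app.23412.css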

Technique – Cookie Caching

Announcing a distributed database cache that always is near the user and is totally resilient!  It’s called cookies.  Sign it if you don’t want tampering.  Encrypt if you don’t want them to see its contents.  Done.  Put user preferences there and quit with the database lookups.
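
The “sign it if you don’t want tampering” part is a few lines of HMAC.  A minimal sketch (my own; the secret and the cookie layout are illustrative):

    import hashlib, hmac

    SECRET = b"server-side-secret"

    def sign_cookie(value: str) -> str:
        sig = hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()
        return f"{value}|{sig}"

    def verify_cookie(cookie: str):
        value, _, sig = cookie.rpartition("|")
        expected = hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()
        return value if hmac.compare_digest(sig, expected) else None

    cookie = sign_cookie("theme=dark;lang=en")
    print(verify_cookie(cookie))         # theme=dark;lang=en
    print(verify_cookie(cookie + "x"))   # None -- someone messed with it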

Technique – Data Caching

Data caching.  Caching happens at a lot of layers.  Cache if you don’t have to be accurate, use a materialized view if you do.    Figuring out the state breakdown of your users?  Put it in a separate table at signup or state change time, don’t query all the time.  Do it from the app layer if you have to.
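
A sketch of the “keep the breakdown in a separate table, updated at signup time” idea (illustrative schema, not from the talk; uses SQLite’s upsert, 3.24+):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, state TEXT)")
    db.execute("CREATE TABLE users_by_state (state TEXT PRIMARY KEY, n INTEGER)")

    def signup(user_id, state):
        db.execute("INSERT INTO users VALUES (?, ?)", (user_id, state))
        db.execute("""INSERT INTO users_by_state VALUES (?, 1)
                      ON CONFLICT(state) DO UPDATE SET n = n + 1""", (state,))

    signup(1, "TX"); signup(2, "TX"); signup(3, "CA")
    # Reports read the tiny summary table instead of scanning every user row.
    print(db.execute("SELECT * FROM users_by_state").fetchall())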

Technique – Choosing Technologies

Understand how you’ll be writing and retrieving data – and how everyone else in the business will be too!  (Reports, BI, etc.)  You have to be technology agnostic and find the best fit for all the needs – business requirements as well as consistency, availability, recoverability, performance, stability.  That’s a place where NoSQL falls down.

Technique – Database

Shard your database then shoot yourself.  Horizontal scaling isn’t always better.  It will make your life hell, so scale vertically first.  If you have to, do it, and try not to have regrets.

Do try “files,” NoSQL, cookies, and other non-ACID alternatives because they scale more easily.  Keep stuff out of the DB where you can.

When you do shard, partition to where you don’t need more than one shard per OLTP question.  Example – private messaging system.  You can partition by recipient and then you can see your messages easily.  But once someone looks for messages they sent, you’re borked.  But you can just keep two copies!  Twice the storage but problem solved.  Searching cross-user messages, however, borks you.
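
A sketch of the double-write for that messaging example (my own illustration; the shard count and the write function are stand-ins):

    NUM_SHARDS = 8

    def shard_for(user_id: int) -> int:
        return user_id % NUM_SHARDS

    def write(shard: int, row: tuple):
        print(f"shard {shard}: {row}")   # stand-in for the real datastore write

    def send_message(sender: int, recipient: int, body: str):
        # Two copies, twice the storage: "my inbox" and "my sent mail" can each
        # be answered from a single shard.
        write(shard_for(recipient), ("inbox", recipient, sender, body))
        write(shard_for(sender), ("sent", sender, recipient, body))

    send_message(sender=1001, recipient=42, body="hi")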

Don’t use multimaster replication.  It sucks – it’s not ready for prime time.  Outside ACID there are key-value stores, document databases, etc.  Eventual consistency helps.  MongoDB, Cassandra, Voldemort, Redis,  CouchDB – you will have some data loss with all of them.

NoSQL isn’t a cure-all; they’re not PCI compliant for example.  Shiny is not necessarily good.  Break up the problem and implement the KISS principle.  Of course you can’t get to the finish line with pure relational for large problems either – you have to use a mix; there is NO one size fits all for data management.

Keep in mind your restore-time and restore-point needs as well as ACID requirements of your data set.

Technique – Service Decoupling

One of the most fundamental techniques to scaling.  The theory is, do it asynchronously.  Why do it now if you can postpone it?  Break down the user transaction and determine what parts can be asynchronous.  Queue the info required to complete the task and process it behind the scenes.

It is hard, though, and is more about service isolation than postponing work.  The more you break down the problem into small parts, the more you have in terms of problem simplification, fault isolation, simplified design, decoupling approach, strategy, and tactics, simpler capacity planning, and more accurate performance modeling.  (Like SOA, but you know, that really works.)

One of my new mantras while building our cloud systems is “Sharing is the devil,” which is another way of stating “decouple heavily.”

Message queueing is an important part of this – you can use ActiveMQ, OpenAMQ, RabbitMQ (winner!).  STOMP sucks but is a universal protocol most everyone uses to talk to message queues.
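
A bare-bones sketch of queueing the deferrable part of a user transaction (my own example; it assumes the pika client and a local RabbitMQ broker, and the queue name and payload are invented):

    import json
    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = conn.channel()
    channel.queue_declare(queue="thumbnail_jobs", durable=True)

    def upload_photo(user_id, photo_path):
        # Do only what the user must wait for here; a worker picks up the rest.
        channel.basic_publish(
            exchange="",
            routing_key="thumbnail_jobs",
            body=json.dumps({"user": user_id, "path": photo_path}),
        )

    upload_photo(42, "/uploads/42/beach.jpg")
    conn.close()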

Don’t decouple something small and simple, though.

Design & Implementation Techniques

Architecture and implementation are intrinsically tied, you can’t wholly separate them.  You can’t label a box “Database” and then just choose Voldemort or something.

Value accuracy over precision.

Make sure the “gods aren’t angry.”  The dtrace guy was running mpstat one day, and the columns didn’t line up.  The gods intended them to, so that’s your new problem instead of the original one!  OK, that’s a confusing anecdote.  A better one is “your Web servers are only handling 25 requests per second.”  It should be obvious the gods are angry.   There has to be something fundamentally wrong with the universe to make that true. That’s not a provisioning problem, that’s an engineering problem.

Develop a model.  A complete model is nearly impossible, but a good queue theory model is easy to understand and provides good insight on dependencies.

Draw it out, rationalize it.  When a user comes in to the site, what all happens?  You end up doing a lot of I/O ops.  Given traffic, you should then know about what each tier will bear.
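
Even a toy M/M/1 model gives useful insight per tier.  A sketch with made-up numbers:

    service_time_s = 0.02    # 20 ms of work per request on this tier
    arrival_rate = 40        # requests/second offered to it

    utilization = arrival_rate * service_time_s          # 0.8
    response_time = service_time_s / (1 - utilization)   # ~100 ms once queueing kicks in
    print(f"utilization {utilization:.0%}, response time {response_time * 1000:.0f} ms")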

Complexity is a problem – decoupling helps with it.

In the end…

Don’t be an idiot.  A lot of scalability problems are from being stupid somewhere.  High performance systems don’t have to scale as much.  Here’s one example of idiocy in three acts.

Act 1 – Amusing Error By Marketing Boneheads – sending a huge mailing with a URL that redirects.  You just doubled your load, good deal.

Act 2 – Faulty Capacity Planning – you have 100k users now.  You try to plan for 10 million.  Don’t bother, plan only to 10x up, because you just don’t understand the problems you’ll have at that scale – a small margin of error will get multiplied.

Someone into agile operations might point out here that this is a way of stating the agile principle of “iterative development.”

Act 3 – The Traffic Spike – I plan on having a spike that gives me 3000 more visitors/second to a page with various CSS/JS/images.  I do loads of math and think that’s 5 machines worth.  Oh whoops I forgot to do every part of the math – the redirect issue from the amusing error above!  Suddenly there’s a huge amount more traffic and my pipe is saturated (Remember the Internet really works on packets and not bytes…) .

This shows a lot of trust in engineering math…  But isn’t this why testing was invented?  Whenever anyone shows me math and hasn’t tested it I tend to assume they’re full of it.

Come see him at Surge 2010!  It’s a new scalability and performance conference in Baltimore in late Sep/early Oct.

A new conference, interesting!  Is that code for “server side performance,” where Velocity kinda focuses on client side/front end a lot?


Filed under Conferences, DevOps

Upcoming Free Velocity WebOps Web Conference

O’Reilly’s Velocity conference is the only generalized Web ops and performance conference out there.  We really like it; you can go to various other conferences and have 10-20% of the content useful to you as a Web Admin, or you can go here and have most of it be relevant!

They’ve been doing some interim freebie Web conferences and there’s one coming up.  Check it out.  They’ll be talking about performance functionality in Google Webmaster Tools, MySQL, Show Slow, provisioning tools, and dynaTrace’s new AJAX performance analysis tool.

O’Reilly Velocity Online Conference: “Speed and Stability”
Thursday, March 17; 9:00am PST
Cost: Free


Filed under Conferences, DevOps

Velocity 2009 – Best Tidbits

Besides all the sessions, which were pretty good, a lot of the good info you get from conferences is by networking with other folks there and talking to vendors.  Here are some of my top-value takeaways.

Aptimize is a New Zealand-based company that has developed software to automatically do the most high value front end optimizations (image spriting, CSS/JS combination and minification, etc.).  We predict it’ll be big.  On a site like ours, going back and doing all this across hundreds of apps will never happen – we can engineer new ones and important ones better, but something like this which can benefit apps by the handful is great.

I got some good info from the MySpace people.  We’ve been talking about whether to run our back end as Linux/Apache/Java or Windows/IIS/.NET for some of our newer stuff.  In the first workshop, I was impressed when the guy asked who all runs .NET and only one guy raised his hand.   MySpace is one of the big .NET sites, but when I talked with them about what they felt the advantage was, they looked at each other and said “Well…  It was the most expeditious choice at the time…”  That’s damning with faint praise, so I asked about what they saw the main disadvantage being, and they cited remote administration – even with the new PowerShell stuff it’s just still not as easy as remote admin/CM of Linux.  That’s top of my list too, but often Microsoft apologists will say “You just don’t understand because you don’t run it…”  But apparently running it doesn’t necessarily sell you either.

Our friends from Opnet were there.  It was probably a tough show for them, as many of these shops are of the “I never pay for software” camp.  However, you end up wasting far more in skilled personnel time if you don’t have the right tools for the job.  We use the heck out of their Panorama tool – it pulls metrics from all tiers of your system, including deep in the JVM, and does dynamic baselining, correlation and deviation.  If all your programmers are 3l33t maybe you don’t need it, but if you’re unsurprised when one of them says “Uhhh… What’s a thread leak?” then it’s money.

ControlTier is nice, they’re a commercial open source CM tool for app deploys – it works at a higher level than chef/puppet, more like capistrano.

EngineYard was a really nice cloud provisioning solution (sits on top of Amazon or whatever).  The reality of cloud computing as provided by the base IaaS vendors isn’t really the “machines dynamically spinning up and down and automatically scaling your app” they say it is without something like this (or lots of custom work).  Their solution is, sadly, Rails only right now.  But it is slick, very close to the blue-sky vision of what cloud computing can enable.

And also, I joined the EFF!  Cyber rights now!

You can see most of the official proceedings from the conference (for free!):


Filed under Conferences, DevOps

Velocity 2009 – Monday Night

After a hearty trip to Gordon Biersch, Peco went to the Ignite battery of five minute presentations, which he said was very good.  I went to two Birds of a Feather sessions, which were not.  The first was a general cloud computing discussion which covered well-trod ground.  The second was by a hapless Sun guy on Olio and Faban.  No, you don’t need to know about them.  It was kinda painful, but I want to commend that Asian guy from Google for diplomatically continuing to try to guide the discussion into something coherent without just rolling over the Sun guy.  Props!

And then – we were lame and just turned in.  I’m getting old, can’t party every night like I used to.  (I don’t know what Peco’s excuse is!)


Filed under Conferences, DevOps