Velocity 2010: Scalable Internet Architectures

My first workshop is  Scalable Internet Architectures by Theo Schlossnagle, CEO of OmniTI.  He gave a nearly identical talk last year but I missed some of it, and it was really good, so I went!  (Robert from our Web Admin team attended as well.)

There aren’t many good books on scalability.  Mainly there are three – Art of Scalability, Cal Henderson’s Building Scalable Web Sites, and his, Scalable Internet Architectures.  So any tips you can get a hold of are welcome.

Following are my notes from the talk; my own thoughts are in italics.

Architecture

What is architecture?  It encompasses everything form power up to the client touchpoint and everything in between.

Of necessity, people are specialized into specific disciplines but you have to overcome that to make a whole system make sense.

The new push towards devops (development/operations collaboration) tries to address this kind of problem.

Operations

Operations is a serious part of this, and it takes knowledge, tools, experience, and discipline.

Knowledge
– Is easy to get; Internet, conferences (Velocity, Structure, Surge), user groups

Tools – All tools are good; understand the tools you have.  Some of operations encourages hackiness because when there is a disruption, the goal is “make it stop as fast as possible.”

You have to know how to use tools like truss, strace, dtrace through previous practice before the outage comes.  Tools (and automation) can help you maintain discipline.

Experience comes from messing up and owning up.

Discipline is hardest.  It’s the single most lacking thing in our field. You have to become a craftsman. To learn discipline through experience, and through practice achieve excellence. You can’t be too timid and not take risks, or take risks you don’t understand.

It’s like my old “Web Admin Standing Orders” that tried to delineate this approach for  my ops guys – “1.  Make it happen.  2.  Don’t f*ck it up.  3.  There’s the right way, the wrong way, and the standard way.”  Take risks, but not dumb risks, and have discipline and tools.

He recommends the classic Zen and the Art of Motorcycle Maintenance for operations folks.  Cowboys and heroes burn out.  Embrace a Zen attitude.

Best Practices

  1. Version Control everything.  All tools are fine, but mainly it’s about knowing how to use it and using it correctly, whether it’s CVS or Subversion or git.
  2. Know Your Systems – Know what things look like normally so you have a point of comparison.  “Hey, there’s 100 database connections open!  That must be the problem!”  Maybe that’s normal.  Have a baseline (also helps you practice using the tools).  Your brain is the best pattern matcher.
    Don’t say “I don’t know” twice.  They wrote an open source tool called Reconnoiter that looks at data and graphs regressions and alerts on it (instead of cacti, nagions, and other time consuming stuff).  Now available as SaaS!
  3. Management – Package rollout, machine management, provisioning. “You should use puppet or chef!  Get with the times and use declarative definition!”  Use the tools you like.  He uses kickstart and cfengine and he likes it just fine.

Dynamic Content

Our job is all about the dynamic content.  Static content – Bah, use akamai or cachefly or panther or whatever.  it’s a solved problem.

Premature optimization is the root of all evil – well, 97% of it.  It’s the other 3% that’s a bitch.  And you’re not smart enough to know where that 3% is.

Optimization means “don’t do work you don’t have to.”  Computational reuse and caching,  but don’t do it in the first place when possible.
He puts comments for things he decides not to optimize explaining the assumptions and why not.

Sometimes naive business decisions force insane implementations down the line; you need to re-check them.

Your content is not as dynamic as you think it is.  Use caching.

Technique – Static Element Caching

Applied YSlow optimizations – it’s all about the JavaScript, CSS, images.  Consolidate and optimize.  Make it all publicly cacheable with 10 year expiry.

RewriteRule (.*)\.([0-9]+)\.css $1.css makes /s/app.23412 to /s/app.css – you get unique names but with new cached copy.  Bump up the number in the template.  Use “cat” to consolidate files, freaks!

Images, put a new one at a new URI.  Can’t trust caches to really refresh.

Technique – Cookie Caching

Announcing a distributed database cache that always is near the user and is totally resilient!  It’s called cookies.  Sign it if you don’t want tampering.  Encrypt if you don’t want them to see its contents.  Done.  Put user preferences there and quit with the database lookups.

Technique – Data Caching

Data caching.  Caching happens at a lot of layers.  Cache if you don’t have to be accurate, use a materialized view if you do.    Figuring out the state breakdown of your users?  Put it in a separate table at signup or state change time, don’t query all the time.  Do it from the app layer if you have to.

Technique – Choosing Technologies

Understand how you’ll be writing and retrieving data – and how everyone else in the business will be too!  (Reports, BI, etc.)  You have to be technology agnostic and find the best fit for all the needs – business requirements as well as consistency, availability, recoverability, performance, stability.  That’s a place where NoSQL falls down.

Technique – Database

Shard your database then shoot yourself.  Horizontal scaling isn’t always better.  It will make your life hell, so scale vertically first.  If you have to, do it, and try not to have regrets.

Do try “files,” NoSQL, cookies, and other non-ACID alternatives because they scale more easily.  Keep stuff out of the DB where you can.

When you do shard, partition to where you don’t need more than one shard per OLTP question.  Example – private messaging system.  You can partition by recipient and then you can see your messages easily.  But once someone looks for messages they sent, you’re borked.  But you can just keep two copies!  Twice the storage but problem solved.  Searching cross-user messages, however, borks you.

Don’t use multimaster replication.  It sucks – it’s not ready for prime time.  Outside ACID there are key-value stores, document databases, etc.  Eventual consistency helps.  MongoDB, Cassandra, Voldemort, Redis,  CouchDB – you will have some data loss with all of them.

NoSQL isn’t a cure-all; they’re not PCI compliant for example.  Shiny is not necessarily good.  Break up the problem and implement the KISS principle.  Of course you can’t get to the finish line with pure relational for large problems either – you have to use a mix; there is NO one size fits all for data management.

Keep in mind your restore-time and restore-point needs as well as ACID requirements of your data set.

Technique – Service Decoupling

One of the most fundamental techniques to scaling.  The theory is, do it asynchronous.  Why do it now if you can postpone it?  Break down the user transaction and determine what parts can be asynchronous.  Queue the info required to complete the task and process it behind the scenes.

It is hard, though, and is more about service isolation than postponing work.  The more you break down the problem into small parts, the more you have in terms of problem simplification, fault isolation, simplified design, decoupling approach, strategy, and tactics, simpler capacity planning, and more accurate performance modeling.  (Like SOA, but you know, that really works.)

One of my new mantras while building our cloud systems is “Sharing is the devil,” which is another way of stating “decouple heavily.”

Message queueing is an important part of this – you can use ActiveMQ, OpenAMQ, RabbitMQ (winner!).  STOMP sucks but is a universal protocol most everyone uses to talk to message queues.

Don’t decouple something small and simple, though.

Design & Implementation Techniques

Architecture and implementation are intrinsically tied, you can’t wholly separate them.  You can’t label a box “Database” and then just choose Voldemort or something.

Value accuracy over precision.

Make sure the “gods aren’t angry.”  The dtrace guy was running mpstat one day, and the columns didn’t line up.  The gods intended them to, so that’s your new problem instead of the original one!  OK, that’s a confusing anecdote.  A better one is “your Web servers are only handling 25 requests per second.”  It should be obvious the gods are angry.   There has to be something fundamentally wrong with the universe to make that true. That’s not a provisioning problem, that’s an engineering problem.

Develop a model.  A complete model is nearly impossible, but a good queue theory model is easy to understand and provides good insight on dependencies.

Draw it out, rationalize it.  When a user comes in to the site all what happens?  You end up doing a lot of I/O ops.  Given traffic you should then know about what each tier will bear.

Complexity is a problem – decoupling helps with it.

In the end…

Don’t be an idiot.  A lot of scalability problems are from being stupid somewhere.  High performance systems don’t have to scale as much.  Here’s one example of idiocy in three acts.

Act 1 – Amusing Error By Marketing Boneheads – sending a huge mailing with an URL that redirects. You just doubled your load, good deal.

Act 2 – Faulty Capacity Planning – you have 100k users now.  You try to plan for 10 million.  Don’t bother, plan only to 10x up, because you just don’t understand the problems you’ll have at that scale – a small margin of error will get multiplied.

Someone into agile operations might point out here that this is a way of stating the agile principle of “iterative development.”

Act 3 – The Traffic Spike – I plan on having a spike that gives me 3000 more visitors/second to a page with various CSS/JS/images.  I do loads of math and think that’s 5 machines worth.  Oh whoops I forgot to do every part of the math – the redirect issue from the amusing error above!  Suddenly there’s a huge amount more traffic and my pipe is saturated (Remember the Internet really works on packets and not bytes…) .

This shows a lot of trust in engineering math…  But isn’t this why testing was invented?  Whenever anyone shows me math and hasn’t tested it I tend to assume they’re full of it.

Come see him at Surge 2010!  It’s a new scalability and performance conference in Baltimore in late Sep/early Oct.

A new conference, interesting!  Is that code for “server side performance, ” where Velocity kinda focuses on client side/front end a lot?

Leave a comment

Filed under DevOps, Conferences

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.