Velocity 2010 – Dueling Cloud Management Suppliers

Two cloud systems management suppliers talk about their bidness!  My comments in italics.

Cloud Autoscaling in Enterprise Computing by George Reese (enStratus Networks LLC)

How the Top Social Games Scale on the Cloud by Michael Crandell (RightScale, Inc)

I am more familiar with RightScale, but just read Reese’s great Cloud Application Architectures book on the plane here.  Whose cuisine will reign supreme?

enStratus

Reese starts talking about “naive autoscaling” being a problem.  The cloud isn’t magic; you have to be careful.  He defines “enterprise” autoscaling as scaling that is cognizant of financial constraints and not this hippy VC-funded twitter type nonsense.

Reactive autoscaling is done when the system’s resource requirements exceed demand.  Proactive autoscaling is done in response to capacity planning – “run more during the day.”

Proactive requires planning.  And automation needs strict governors in place.

In our PIE autoscaling, we have built limits like that into the model – kinda like any connection pool.  Min, max, rate of increase, etc.

He says your controls shouldn’t be all “number of servers,” but be “budget” based.  Hmmm.  That’s ideal but is it too ideal?  And so what do you do, shut down all your servers if you get to the 28th of the month and you run out of cash?

CPU is not a scaling metric. Have better metrics tied to things that matter like TPS/response time.  Completely agree there; scaling just based on CPU/memory/disk is primitive in the extreme.

Efficiency is a key cloud metric.  Get your utilization high.

Here’s where I kinda disagree – it can often be penny wise and pound foolish.  In the name of “efficiency” I’ve seen people put a bunch of unrelated apps on one server and cause severe availability problems.  Screw utilization.  Or use a cloud provider that uses a different charging model – I forget which one it was, but we had a conf call with one cloud provider that only charged on CPU used, not “servers provisioned.”

Of course you don’t have to take it to an extreme, just roll down to your minimum safe redundancy number on a given tier when you can.

Security – well, you tend not to do some centralized management things (like add to Active Directory) in the cloud.  It makes user management hard.  Or just makes you use LDAP, like God intended.

Cloud bursting – scaling from on premise into the cloud.

Case study – a diaper company.  Had a loyalty program.  It exceeded capacity within an hour of launch.  Humans made a scaling decision to scale at the load balancing tier, and enStratus executed the auto-scale change.  They checked it was valid traffic and all first.

But is this too fiddly for many cases?  If you are working with a “larger than 5 boxes” kind of scale don’t you really want some more active automation?

RightScale

The RightScale blog is full of good info!

They run 1.2 million cloud servers!  hey see things like 600k concurrent users, 100x scaling in 4 days, 15k instances, 1:2000 management ratio…

Now about gaming and social apps.  They power the top 10 Facebook apps.  They are an open management environment that lives atop the cloud suppliers’ APIs.

Games have a natural lifecycle where they start small, maybe take off, get big, eventually taper off.  It’s not a flat demand curve, so flat supply is ‘tarded.

During the early phase, game publishers need a cheap, fast solution that can scale.  They use Chef and other stuff in server templates for dynamic boot-time configuration.

Typically, game server side tech looks like normal Web stuff!  Apache+HAproxy LB, app servers, db cache (memcached), db (sharded mySQL master/slave pairs).  Plus search, queues, admin, logs.

Instance types – you start to see a lot of larger instances – large and extra large.  Is this because of legacy comfort issues?  Is it RAM needs?

CentOS5 dominates!  Generic images, configured at boot.  One company rebundles for faster autoscale.  Not much ubuntu or Windows.  To be agile you need to do that realtime config.

A lot of the boxes are used for databases.  Web/app and load balancing significant too.  There’s a RightScale paper showing a 100k packets per second LB limit with Amazon.

People use autoscaling a lot, but mainly for web app tier.  Not LBs because the DNS changing is a pain.  And people don’t autoscale their DBs.

They claim a lot lower human need on average for management on RightScale vs using the APIs “or the consoles.”  That’s a big or.  One of our biggest gripes with RightScale is that they consume all those lovely cloud APIs and then just give you a GUI and not an API.  That’s lame.  It does a lot of good stuff but then it “terminates” the programmatic relationship. [Edit: Apparently they have a beta API now, added since we looked at them.]

He disagrees with Reese – the problem isn’t that there is too much autoscaling, it’s that it has never existed.  I tend to agree. Dynamic elasticity is key to these kind of business models.

If your whole DB fits into memcache, what is mySQL for?  Writes sometimes?  NoSQL sounds cool but in the meantime use memcache!!!

The cloud has enabled things to exist that wouldn’t have been able to before.  Higher agility, lower cost, improved performance with control, anew levels of resiliency and automation, and full lifecycle support.

1 Comment

Filed under Cloud, Conferences, DevOps

One response to “Velocity 2010 – Dueling Cloud Management Suppliers

  1. Thanks for the excellent write-up, Ernest.

    Just wanted to add that RightScale does indeed have an API, which is actively used. In fact, there’s an iPhone app built entirely on our API that just released recently.

    Keep up the great work!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s