Category Archives: DevOps

Pertaining to agile system administration concepts and techniques.

Defining Agile Operations and DevOps

I recently read a great blog post by Scott Wilson that was talking about the definitions of Agile Operations, DevOps, and related terms.  (Read the comments too, there’s some good discussion.)  From what I’ve heard so far, there are a bunch of semi-related terms people are using around this whole “new thing of ours.”

The first is DevOps, which has two totally different frequently used definitions.

1.  Developers and Ops working closely together – the “hugs and collaboration” definition

2.  Operations folks uptaking development best practices and writing code for system automation

The second is Agile Operations, which also has different meanings.

1.  Same as DevOps, whichever definition of that I’m using

2.  Using agile principles to run operations – process techniques, like iterative development or even kanban/TPS kinds of process stuff.  Often with a goal of “faster!”

3.  Using automation – version control, automatic provisioning/control/monitoring.  Sometimes called “Infrastructure Automation” or similar.

This leads to some confusion, as most of these specific elements can be implemented in isolation.  For example, I think the discussion at OpsCamp about “Is DevOps an antipattern” was predicated on an assumption that DevOps meant only DevOps definition #2, “ops guys trying to be developers,” and made the discussion somewhat odd to people with other assumed definitions.

I have a proposed set of definitions.  To explain it, let’s look at Agile Development and see how it’s defined.

Agile development, according to wikipedia and the agile manifesto, consists of a couple different “levels” of thing.  To sum up the wikipedia breakdown,

  • Agile Principles – like “business/users and developers working together.”  These are the core values that inform agile, like collaboration, people over process, software over documentation, and responding to change over planning.
  • Agile Methods – specific process types.  Iterations, Lean, XP, Scrum.  “As opposed to waterfall.”
  • Agile Practices – techniques often found in conjunction with agile development, not linked to a given method flavor, like test driven development, continuous integration, etc.

I believe the different parts of Agile Operations that people are talking about map directly to these three levels.

  • Agile Operations Principles includes things like dev/ops collaboration (DevOps definition 1 above); things like James Turnbull’s 4-part model seem to be spot on examples of trying to define this arena.
  • Agile Operations Methods includes process you use to conduct operations – iterations, kanban, stuff you’d read in Visible Ops; Agile Operations definition #2 above.
  • Agile Operations Practices includes specific techniques like automated build/provisioning, monitoring, anything you’d have a “toolchain” for.  This contains DevOps definition #2 and Agile Operations definition #3 above.

I think it’s helpful to break them up along the same lines as agile development, however, because in the end some of those levels should merge once developers understand ops is part of system development too…  There shouldn’t be a separate “user/dev collaboration” and “dev/ops collaboration,” in a properly mature model it should become a “user/dev/ops collaboration,” for example.

I think the dev2ops guys’ “People over Process over Tools” diagram mirrors this about exactly – the people being one of the important agile principles, process being a large part of the methods, and tools being used to empower the practices.

What I like about that diagram, and why I want to bring this all back to the Agile Manifesto discussion, is that the risk of having various sub-definitions increases the risk that people will implement the processes or tools without the principles in mind, which is definitely an antipattern.  The Agile guys would tell you that iterations without collaboration is likely to not work out real well.

And it happens in agile development too – there are some teams here at my company that have adopted the methods and/or tools of agile but not its principles, and the results are suboptimal.

Therefore I propose that “Agile Operations” is an umbrella term for all these things, and we keep in mind the principles/methods/practices differentiation.

If we want to call the principles “devops” for short and some of the practices “infrastructure automation” for short I think that would be fine…   Although dev/ops collaboration is ONE of the important principles – but probably not the entirety; and infrastructure automation is one of the important practices, but there are probably others.

2 Comments

Filed under DevOps, Uncategorized

Upcoming Free Velocity WebOps Web Conference

O’Reilly’s Velocity conference is the only generalized Web ops and performance conference out there.  We really like it; you can go to various other conferences and have 10-20% of the content useful to you as a Web Admin, or you can go here and have most of it be relevant!

They’ve been doing some interim freebie Web conferences and there’s one coming up.  Check it out.  They’ll be talking about performance functionality in Google Webmaster Tools, mySQL, Show Slow, provisioning tools, and dynaTrace’s new AJAX performance analysis tool.

O’Reilly Velocity Online Conference: “Speed and Stability”
Thursday, March 17; 9:00am PST
Cost: Free

Leave a comment

Filed under Conferences, DevOps

A Case For Images

After speaking with Luke Kanies at OpsCamp, and reading his good and oft-quoted article “Golden Image or Foil Ball?“, I was thinking pretty hard about the use of images in our new automated infrastructure.  He’s pretty against them.  After careful consideration, however, I think judicious use of images is the right thing to do.

My top level thoughts on why to use images.

  1. Speed – Starting a prebuilt image is faster than reinstalling everything on an empty one.  In the world of dynamic scaling, there’s a meaningful difference between a “couple minute spinup” and a “fifteen minute spinup.”
  2. Reliability – The more work you are doing at runtime, the more there is to go wrong.  I bet I’m not the only person who has run the same compile and install on three allegedly identical Linux boxen and had it go wrong somehow on one of ’em.  And the more stuff you’re pulling to build your image, the more failure points you have.
  3. Flexibility – Dynamically building from stem cell kinda makes sense if you’re using 100% free open source and have everything automated.  What if, however, you have something that you need to install that just hasn’t been scripted – or is very hard to script?  Like an install of some half-baked Windows software that doesn’t have a command line installer and you don’t have a tool that can do it?  In that case, you really need to do the manual install in non-realtime as part of a image build.  And of course many suppliers are providing software as images themselves nowadays.
  4. Traceability – What happens if you need to replicate a past environment?  Having the image is going to be a 100% effective solution to that, even likely to be sufficient for legal reasons.  “I keep a bunch of old software repo versions so I can mostly build a machine like it” – somewhat less so.

In the end, it’s a question of using intermediate deliverables.  Do you recompile all the code and every third party package every time you build a server?  No, you often use binaries – it’s faster and more reliable.  Binaries are the app guys’ equivalent of “images.”

To address Luke’s three concerns from his article specifically:

  1. Image sprawl – if you use images, you eventually have a large library of images you have to manage.  This is very true – but you have to manage a lot of artifacts all up and down the chain anyway.  Given the “manual install” and “vendor supplied image” scenarios noted above, if you can’t manage images as part of your CM system than it’s just not a complete CM system.
  2. Updating your images – Here, I think Luke makes some not entirely valid assumptions.  He notes that once you’re done building your images, you’re still going to have to make changes in the operational environment (“bootstrapping”).  True.  But he thinks you’re not going to use the same tool to do it.  I’m not sure why not – our approach is to use automated tooling to build the images – you don’t *want* to do it manually for sure – and Puppet/Chef/etc. works just fine to do that.  So if you have to update something at the OS level, you do that and let your CM system blow everything on top – and then burn the image.  Image creation and automated CM aren’t mutually exclusive – the only reason people don’t use automation to build their images is the same reason they don’t always use automation on their live servers, which is “it takes work.”  But to me, since you DO have to have some amount of dynamic CM for the runtime bootstrap as well, it’s a good conservation of work to use the same package for both. (Besides bootstrapping, there’s other stuff like moving content that shouldn’t go on images.)
  3. Image state vs running state – This one puzzles me.  With images, you do need to do restarts to pull in image-based changes.  But with virtually all software and app changes you have to as well – maybe not a “reboot,” but a “service restart,” which is virtually as disruptive.  Whether you “reboot  your database server” or “stop and start your database server, which still takes a couple minutes”, you are planning for downtime or have redundancy in place.  And in general you need to orchestrate the changes (rolling restarts, etc.) in a manner that “oh, pull that change whenever you want to Mr. Application Server” doesn’t really work for.

In closing, I think images are useful.  You shouldn’t treat them as a replacement for automated CM – they should be interim deliverables usually generated by, and always managed by, your automated CM.  If you just use images in an uncoordinated way, you do end up with a foil ball.  With sufficient automation, however, they’re more like Russian nesting dolls, and have advantages over starting from scratch with every box.

Leave a comment

Filed under DevOps, Uncategorized

Agile Operations

It’s funny.  When we recently started working on an upgrade of our Intranet social media platform, and we were trying to figure out how to meld the infrastructure-change-heavy operation with the need for devs, designers, and testers to be able to start working on the system before “three months from now,” we broached the idea of “maybe we should do that in iterations!”  First, get the new wiki up and working.  Then, worry about tuning, switching the back end database, etc.  Very basic, but it got me thinking about the problem in terms of “hey, Infrastructure still operates in terms of waterfall, don’t we.”

Then when Peco and I moved over to NI R&D and started working on cloud-based systems, we quickly realized the need for our infrastructure to be completely programmable – that is, not manually tweaked and controlled, but run in a completely automated fashion.  Also, since we were two systems guys embedded in a large development org that’s using agile, we were heavily pressured to work in iterations along with them.  This was initially a shock – my default project plan has, in traditional fashion, months worth of evaluating, installing, and configuring various technology components before anything’s up and running.   But as we began to execute in that way, I started to see that no, really, agile is possible for infrastructure work – at least “mostly.”  Technologies like cloud computing help, but there’s still a little more up front work required than with programming – but you can get mostly towards an agile methodology (and mindset!).

Then at OpsCamp last month, we discovered that there’s been this whole Agile Operations/Automated Infrastructure/devops movement thing already in progress we hadn’t heard about.  I don’t keep in touch with The Blogosphere ™ enough I guess.  Anyway, turns out a bunch of other folks have suddenly come to the exact same conclusion and there’s exciting work going on re: how to make operations agile, automate infrastructure, and meld development and ops work.

So if  you also hadn’t been up on this, here’s a roundup of some good related core thoughts on these topics for your reading pleasure!

Leave a comment

Filed under DevOps

Enterprise Systems vs. Agility

I was recently reading a good Cameron Purdy post where he talks about his eight theses regarding why startups or students can pull stuff off that large enterprise IT shops can’t.

My summary/trenchant restatement of his points:

  1. Changing existing systems is harder than making a custom-built new one (version 2 is harder)
  2. IT veterans overcomplicate new systems
  3. The complexity of a system increases exponentially the work needed to change it (versions 3 and 4 are way way harder)
  4. Students/startups do fail a lot, you just don’t see those
  5. Risk management steps add friction
  6. Organizational overhead (paperwork/meetings) adds friction
  7. Only overconservative goons work in enterprise IT anyway
  8. The larger the org, the more conflict

Though I suspect #1 and #3 are the same, #2 and #5 are the same, and #6 and #8 are the same, really.

I’ve been thinking about this lately with my change from our enterprise IT Web site to a new greenfield cloud-hosted SaaS product in our R&D organization.  It’s definitely a huge breath of fresh air to be able to move fast.  My observations:

Complexity

The problem of systems complexity (theses #1 and #3) is a very real one.  I used to describe our Web site as having reached “system gridlock.”  There were hundreds of apps running dozens to a server with poorly documented dependencies on all kinds of stuff.  You would go in and find something that looked “wrong” – an Apache config, script, load balancer rule, whatever – but if you touched it some house of cards somewhere would come tumbling down.  Since every app developer was allowed to design their own app in its own tightly coupled way, we had to implement draconian change control and release processes in an attempt to stem the tide of people lining up to crash the Web site.

We have a new system design philosophy for our new gig which I refer to as “sharing is the devil.”  All components are separated and loosely coupled.  Using cloud computing for hardware and open source for software makes it easy and affordable to have a box that does “only one thing.”  In traditional compute environments there’s pressure to “use up all that CPU before you add more”, which results in a penny wise, pound foolish strategy of consolidation.  More and more apps and functions get crunched closer together and when you go back to pull them out you discover that all kinds of new connections and dependencies have formed unbidden.

Complication

Overcomplicating systems (#2 and #5) can be somewhat overcome by using agile principles.  We’ve been delving heavily into doing not just our apps but also our infrastructure according to an agile methodology.  It surfaces your requirements – frankly, systems people often get away with implementing whatever they want, without having a spec let alone one open to review.  Also, it makes you prioritize.  “Whatever you can get done in this two week iteration, that’s what you’ll have done, and it should be working.”  It forces focus on what is required to get things to work and delays more complex niceties till later as there’s time.

Conservatism

Both small and large organizations can suffer from #6 and #8.  That’s mostly a mindset issue.  I like to tell the story about how we were working on a high level joint IT/business vision for our Web site.  We identified a number of “pillars” of the strategy we were developing – performance, availability, TCO, etc.  I had identified agility as one, but one of the application directors just wasn’t buying into it.  “Agility, that’s weird, how do we measure that, we should just forget about it.”  I finally had to take all the things we had to the business head of the Web and say “of these, which would you say is the single most important one?”  “Agility, of course,” he said, as I knew he would.  I made it a point to train my staff that “getting it done” was the most important thing, more important than risk mitigation or crossing all the t’s and dotting all the i’s.  That can be difficult if the larger organization doesn’t reward risk and achievement over conservatism, but you can work on it.

Leave a comment

Filed under DevOps, Uncategorized

OpsCamp Debrief

I went to OpsCamp this last weekend here in Austin, a get-together for Web operations folks specifically focusing on the cloud, and it was a great time!  Here’s my after action report.

The event invite said it was in the Spider House, a cool local coffee bar/normal bar.  I hadn’t been there before, but other people that had said “That’s insane!  They’ll never fit that many people!  There’s outside seating but it’s freezing out!”  That gave me some degree of trepidation, but I still racked out in time to get downtown by 8 AM on a Saturday (sigh!).  Happily, it turned out that the event was really in the adjacent music/whatnot venue also owned by Spider House, the United States Art Authority, which they kindly allowed us to use for free!  There were a lot of people there; we weren’t overfilling the place but it was definitely at capacity, there were near 100 people in attendance.

I had just heard of OpsCamp through word of mouth, and figured it was just going to be a gathering of local Austin Web ops types.  Which would be entertaining enough, certainly.  But as I looked around the room I started recognizing a lot of guys from Velocity and other major shows; CEOs and other high ranked guys from various Web ops related tool companies.  Sponsors included John Willis and Adam Jacob (creator of Chef) from Opscode , Luke Kanies from Reductive Labs (creator of Puppet), Damon Edwards and Alex Honor from DTO Solutions (formerly ControlTier), Mark Hinkle and Matt Ray from Zenoss, Dave Nielsen (CloudCamp), Michael Coté (Redmonk), Bitnami, Spiceworks, and Rackspace Cloud.  Other than that, there were a lot of random Austinites and some guys from big local outfits (Dell, IBM).

You can read all the tweets about the event if you swing that way.

OpsCamp kinda grew out of an earlier thing, BarCampESM, also in Austin two years ago.  I never heard about that, wish I had.

How It Went

I had never been to an “unconference” before.  Basically there’s no set agenda, it’s self-emergent.  It worked pretty well.  I’ll describe the process a bit for other noobs.

First, there was a round of lightning talks.  Brett from Rackspace noted that “size matters,” Bill from Zenoss said “monitoring is important,” and Luke from Reductive claimed that “in 2-4 years ‘cloud’ won’t be a big deal, it’ll just be how people are doing things – unless you’re a jackass.”

Then it was time for sessions.  People got up and wrote a proposed session name on a piece of paper and then went in front of the group and pitched it, a hand-count of “how many people find this interesting” was taken.

Candidates included:

  • service level to resolution
  • physical access to your cloud assets
  • autodiscovery of systems
  • decompose monitoring into tool chain
  • tool chain for automatic provisioning
  • monitoring from the cloud
  • monitoring in the cloud – widely dispersed components
  • agent based monitoring evolution
  • devops is the debil – change to the role of sysadmins
  • And more

We decided that so many of these touched on two major topics that we should do group discussions on them before going to sessions.  They were:

  • monitoring in the cloud
  • config mgmt in the cloud

This seemed like a good idea; these are indeed the two major areas of concern when trying to move to the cloud.

Sadly, the whole-group discussions, especially the monitoring one, were unfruitful.  For a long ass time people threw out brilliant quips about “Why would you bother monitoring a server anyway” and other such high-theory wonkery.  I got zero value out of these, which was sad because the topics were crucially interesting – just too unfocused; you had people coming at the problem 100 different ways in sound bytes.  The only note I bothered to write down was that “monitoring porn” (too many metrics) makes it hard to do correlation.  We had that problem here, and invested in a (horrors) non open-source tool, Opnet Panorama, that has an advanced analytics and correlation engine that can make some sense of tens of thousands of metrics for exactly that reason.

Sessions

There were three sessions.  I didn’t take many notes in the first one because, being a Web ops guy, I was having to work a release simultaneously with attending OpsCamp 😛

Continue reading

10 Comments

Filed under DevOps, Uncategorized

Come To OpsCamp!

Next weekend, Jan 30 2009, there’s a Web Ops get-together here in Austin called OpsCamp!  It’ll be a Web ops “unconference” with a cloud focus.  Right up our alley!  We hope to see you there.

Leave a comment

Filed under DevOps, Uncategorized

Velocity 2009 – Best Tidbits

Besides all the sessions, which were pretty good, a lot of the good info you get from conferences is by networking with other folks there and talking to vendors.  Here are some of my top-value takeaways.

Aptimize is a New Zealand-based company that has developed software to automatically do the most high value front end optimizations (image spriting, CSS/JS combination and minification, etc.).  We predict it’ll be big.  On a site like ours, going back and doing all this across hundreds of apps will never happen – we can engineer new ones and important ones better, but something like this which can benefit apps by the handful is great.

I got some good info from the MySpace people.  We’ve been talking about whether to run our back end as Linux/Apache/Java or Windows/IIS/.NET for some of our newer stuff.  In the first workshop, I was impressed when the guy asked who all runs .NET and only one guy raised his hand.   MySpace is one of the big .NET sites, but when I talked with them about what they felt the advantage was, they looked at each other and said “Well…  It was the most expeditious choice at the time…”  That’s damning with faint praise, so I asked about what they saw the main disadvantage being, and they cited remote administration – even with the new PowerShell stuff it’s just still not as easy as remote admin/CM of Linux.  That’s top of my list too, but often Microsoft apologists will say “You just don’t understand because you don’t run it…”  But apparently running it doesn’t necessarily sell you either.

Our friends from Opnet were there.  It was probably a tough show for them, as many of these shops are of the “I never pay for software” camp.  However, you end up wasting far more in skilled personnel time if you don’t have the right tools for the job.  We use the heck out of their Panorama tool – it pulls metrics from all tiers of your system, including deep in the JVM, and does dynamic baselining, correlation and deviation.  If all your programmers are 3l33t maybe you don’t need it, but if you’re unsurprised when one of them says “Uhhh… What’s a thread leak?” then it’s money.

ControlTier is nice, they’re a commercial open source CM tool for app deploys – it works at a higher level than chef/puppet, more like capistrano.

EngineYard was a really nice cloud provisioning solution (sits on top of Amazon or whatever).  The reality of cloud computing as provided by the base IaaS vendors isn’t really the “machines dynamically spinning up and down and automatically scaling your app” they say it is without something like this (or lots of custom work).  Their solution is, sadly, Rails only right now.  But it is slick, very close to the blue-sky vision of what cloud computing can enable.

And also, I joined the EFF!  Cyber rights now!

You can see most of the official proceedings from the conference (for free!):

1 Comment

Filed under Conferences, DevOps

Velocity 2009 – Monday Night

After a hearty trip to Gordon Biersch, Peco went to the Ignite battery of five minute presentations, which he said was very good.  I went to two Birds of a Feather sessions, which were not.  The first was a general cloud computing discussion which covered well-trod ground.  The second was by a hapless Sun guy on Olio and Fabian.  No, you don’t need to know about them.  It was kinda painful, but I want to commend that Asian guy from Google for diplomatically continuing to try to guide the discussion into something coherent without just rolling over the Sun guy.  Props!

And then – we were lame and just turned in.  I’m getting old, can’t party every night like I used to.  (I don’t know what Peco’s excuse is!)

Leave a comment

Filed under Conferences, DevOps

Velocity 2009 – Scalable Internet Architectures

OK, I’ll be honest.  I started out attending “Metrics that Matter – Approaches to Managing High Performance Web Sites” (presentation available!) by Ben Rushlo, Keynote proserv.  I bailed after a half hour to the other one, not because the info in that one was bad but because I knew what he was covering and wanted to get the less familiar information from the other workshop.  Here’s my brief notes from his session:

  • Online apps are complex systems
  • A siloed approach of deciding to improve midtier vs CDN vs front end engineering results in suboptimal experience to the end user – have to take holistic view.  I totally agree with this, in our own caching project we took special care to do an analysis project first where we evaluated impact and benefit of each of these items not only in isolation but together so we’d know where we should expend effort.
  • Use top level/end user metrics, not system metrics, to measure performance.
  • There are other metrics that correlate to your performance – “key indicators.”
  • It’s hard to take low level metrics and take them “up” into a meaningful picture of user experience.

He’s covering good stuff but it’s nothing I don’t know.  We see the differences and benefits in point in time tools, Passive RUM, tagging RUM, synthetic monitoring, end user/last mile synthetic monitoring…  If you don’t, read the presentation, it’s good.  As for me, it’s off to the scaling session.

I hopped into this session a half hour late.  It’s Scalable Internet Architectures (again, go get the presentation) by Theo Schlossnagle, CEO of OmniTI and author of the similarly named book.

I like his talk, it starts by getting to the heart of what Web Operations – what we call “Web Admin” hereabouts – is.  It kinda confuses architecture and operations initially but maybe that’s because I came in late.

He talks about knowledge, tools, experience, and discipline, and mentions that discipline is the most lacking element in the field. Like him, I’m a “real engineer” who went into IT so I agree vigorously.

What specifically should you do?

  • Use version control
  • Monitor
  • Serve static content using a CDN, and behind that a reverse proxy and behind that peer based HA.  Distribute DNS for global distribution.
  • Dynamic content – now it’s time for optimization.

Optimizing Dynamic Content

Don’t pay to generate the same content twice – use caching.  Generate content only when things change and break the system into components so you can cache appropriately.

example: a php news site – articles are in oracle, personalization on each page, top new forum posts in a sidebar.

Why abuse oracle by hitting it every page view?  updates are controlled.  The page should pull user prefs from a cookie.  (p.s. rewrite your query strings)
But it’s still slow to pull from the db vs hardcoding it.
All blog sw does this, for example
Check for a hardcoded php page – if it’s not there, run something that puts it there.  Still dynamically puts in user personalization from the cookie.  In the preso he provides details on how to do this.
Do cache invalidation on content change, use a message queuing system like openAMQ for async writes.
Apache is now the bottleneck – use APC (alternative php cache)
or use memcached – he says no timeouts!  Or… be careful about them!  Or something.

Scaling Databases

1. shard them
2. shoot yourself

Sharding, or breaking your data up by range across many databases, means you throw away relational constraints and that’s sad.  Get over it.

You may not need relations – use files fool!  Or other options like couchdb, etc.  Or hadoop, from the previous workshop!

Vertically scale first by:

  • not hitting the damn db!
  • run a good db.  postgres!  not mySQL boo-yah!

When you have to go horizontal, partition right – more than one shard shouldn’t answer an oltp question.   If that’s not possible, consider duplication.

IM example.  Store messages sharded by recipient.  But then the sender wants to see them too and that’s an expensive operation – so just store them twice!!!

But if it’s not that simple, partitioning can hose you.

Do math and simulate it before you do it fool!   Be an engineer!

Multi-master replication doesn’t work right.  But it’s getting closer.

Networking

The network’s part of it, can’t forget it.

Of course if you’re using Ruby on Rails the network will never make your app suck more.  Heh, the random drive-by disses rile the crowd up.

A single machine can push a gig.  More isn’t hard with aggregated ports.  Apache too, serving static files.  Load balancers too.  How to get to 10 or 20 Gbps though?  All the drivers and firmware suck.  Buy an expensive LB?

Use routing.  It supports naive LB’ing.  Or routing protocol on front end cache/LBs talking to your edge router.  Use hashed routes upstream.  User caches use same IP.  Fault tolerant, distributed load, free.

Use isolation for floods.  Set up a surge net.  Route out based on MAC.  Used vs DDoSes.

Service Decoupling

One of the most overlooked techniques for scalable systems.  Why do now what you can postpone till later?

Break transaction into parts.  Queue info.  Process queues behind the scenes.  Messaging!  There’s different options – AMQP, Spread, JMS.  Specifically good message queuing options are:

Most common – STOMP, sucks but universal.

Combine a queue and a job dispatcher to make this happen.  Side note – Gearman, while cool, doesn’t do this – it dispatches work but it doesn’t decouple action from outcome – should be used to scale work that can’t be decoupled.  (Yes it does, says dude in crowd.)

Scalability Problems

It often boils down to “don’t be an idiot.”  His words not mine.  I like this guy. Performance is easier than scaling.  Extremely high perf systems tend to be easier to scale because they don’t have to scale as much.

e.g. An email marketing campaign with an URL not ending in a trailing slash.  Guess what, you just doubled your hits.  Use the damn trailing slash to avoid 302s.

How do you stop everyone from being an idiot though?  Every person who sends a mass email from your company?  That’s our problem  – with more than fifty programmers and business people generating apps and content for our Web site, there is always a weakest link.

Caching should be controlled not prevented in nearly any circumstance.

Understand the problem.  going from 100k to 10MM users – don’t just bucketize in small chunks and assume it will scale.  Allow for margin for error.  Designing for 100x or 1000x requires a profound understanding of the problem.

Example – I plan for a traffic spike of 3000 new visitors/sec.  My page is about 300k.  CPU bound.  8ms service time.  Calculate servers needed.  If I varnish the static assets, the calculation says I need 3-4 machines.  But do the math and it’s 8 GB/sec of throughput.  No way.  At 1.5MM packets/sec – the firewall dies.  You have to keep the whole system in mind.

So spread out static resources across multiple datacenters, agg’d pipes.
The rest is only 350 Mbps, 75k packets per second, doable – except the 302 adds 50% overage in packets per sec.

Last bonus thought – use zfs/dtrace for dbs, so run them on solaris!

1 Comment

Filed under Conferences, DevOps