Tag Archives: provisioning

Service Delivery Best Practices?

So… DevOps.  DevOps vs NoOps has been making the rounds lately. At Bazaarvoice we are spawning a bunch of decentralized teams not using that nasty centralized ops team, but wanting to do it all themselves.  This led me to contemplate how to express the things that Operations does in a way that turns them into developer requirements?

Because to be honest a lot of our communication in the Ops world is incestuous.  If one were to point a developer (or worse, a product manager) at the venerable infrastructures.org site, they’d immediately poop themselves and flee.  It’s a laundry list of crap for neckbeards, but spends no time on “why do you want this?” “What value will this bring to your customer?”

So for the “NoOps” crowd who want to do ops themselves – what’s an effective way to state these needs?  I took previous work – ideas from the pretty comprehensive but waterfally “Systems Development Framework” Peco and I did for NI, Chris from Bazaarvoice’s existing DevOps work – and here’s a cut.  I’d love feedback. The goal is to have a small discrete set of areas (5-7), with a small discrete set (5-7) of “most important” items – most importantly, stated in a way so that people understand that these aren’t “annoying things some ops guy wants me to do instead of writing 3l33t code” but instead stated as they should be, customer facing requirements like any other.  They could then be each broken into subsidiary more specific user stories (probably a whole lot of them, to be honest) but these could stand in as “epics” or “pillars” for making your Web thing work right.

Service Delivery Best Practices


  • I will make my service highly available, so that customers can use it constantly without issues
  • I will know about any problems with my service’s delivery to customers
  • I can restore my service quickly from any disaster or loss
  • I can make my data resistant to corruption from runtime issues
  • I will test my service under routine disruption to make it rugged and understand its failure modes


  • I know how all parts of the product perform and can measure performance against a customer SLA
  • I can run my application and its data globally, to serve our global customer base
  • I will test my application’s performance and I know my app’s limitations before it reaches them


  • I will make my application supportable by the entire organization now and in the future
  • I know what every user hitting my service did
  • I know who has access to my code and data and release mechanism and what they did
  • I can account for all my servers, apps, users, and data
  • I can determine the root cause of issues with my service quickly
  • I can predict the costs and requirements of my service ahead of time enough that it is never a limiter
  • I will understand the communication needs of customers and the organization and will make them aware of all upgrades and downtime
  • I can handle incidents large and small with my services effectively


  • I will make every service secure
  • I understand Web application vulnerabilities and will design and code my services to not be subject to them
  • I will test my services to be able to prove to customers that they are secure
  • I will make my data appropriately private and secure against theft and tampering
  • I will understand the requirements of security auditors and expectations of customers regarding the security of our services


  • I can get code to production quickly and safely
  • I can deploy code in the middle of the day with no downtime or customer impact
  • I can pilot functionality to limited sets of users
  • I will make it easy for other teams to develop apps integrated with mine

Also to capture that everyone starting out can’t and shouldn’t do all of this… We struggled over whether these were “Goals” or “Best Practices” or what… On the one hand you should only put in e.g. “as much security as you need” but on the other hand there’s value in understanding the optimal condition.


Filed under DevOps

Amazon CloudFormation: Model Driven Automation For The Cloud

You may have heard about Amazon’s newest offering they announced today, CloudFormation.  It’s the new hotness, but I see a lot of confusion in the Twitterverse about what it is and how it fits into the landscape of IaaS/PaaS/Elastic Beanstalk/etc. Read what Werner Vogels says about CloudFormation and its uses first, but then come back here!

Allow me to break it down for you and explain why this is such a huge leverage point for cloud developers.

What Has Come Before

Up till now on Amazon you could configure up a single virtual image the way you wanted it, with an AMI. You could even kind of construct a scalable tier of similar systems using Auto Scaling, by defining Launch Configurations. But if you wanted to construct an entire multitier system it was a lot harder.  There are automated configuration management tools like chef and puppet out there, but their recipes/models tend to be oriented around getting a software loadout on an existing system, not the actual system provisioning – in general they come from the older assumption you have someone doing that on probably-physical systems using bcfg2 or cobber or vagrant or something.

So what were you to do if you wanted to bring up a simple three tier system, with a Web tier, app server tier, and database tier?  Either you had to set them up and start them manually, or you had to write code against the Amazon APIs to explicitly pull up what you wanted. Or you had to use a third party provisioning provider like RightScale or EngineYard that would let you define that kind of model in their Web consoles but not construct your own model programmatically and upload it. (I’d like my product functionality in my own source control and not your GUI, thanks.)

Now, recently Amazon launched Elastic Beanstalk, which is more way over on the PaaS side of things, similar to Google App Engine.  “Just upload your application and we’ll run it and scale it, you don’t have to worry about the plumbing.” Of course this sharply limits what you can do, and doesn’t address the question of “what if my overall system consists of more than just one Java app running in Beanstalk?”

If your goal is full model driven automation to achieve “infrastructure as code,” none of these solutions are entirely satisfactory. I understand CloudFormation deeply because we went down that same path and developed our own system model ourselves as a response!

I’ll also note that this is very similar to what Microsoft Azure does.  Azure is a hybrid IaaS/PaaS solution – their marketing tries to say it’s more like Beanstalk or Google App Engine, but in reality it’s more like CloudFormation – you have an XML file that describes the different roles (tiers) in the system, defines what software should go on each, and lets you control the entire system as a unit.

So What Is CloudFormation?

Basically CloudFormation lets you model your Amazon cloud-based system in JSON and then provision and control it as a unit.  So in our use case of a three tier system, you would model it up in their JSON markup and then CloudFormation would understand that the whole thing is a unit.  See their sample template for a WordPress setup. (A mess more sample templates are here.)

Review the WordPress template; it lets you define the AMIs and instance types, what the security group and ELB setups should be, the RDS database back end, and feed in variables that’ll be used in the consuming software (like WordPress username/password in this case).

Once you have your template you can tell Amazon to start your “stack” in the console! It’ll even let you hook it up to a SNS notification that’ll let you know when it’s done. You name the whole stack, so you can distinguish between your “dev” environment and your “prod” environment for example, as opposed to the current state of the Amazon EC2 console where you get to see a big list of instance IDs – they added tagging that you can try to use for this, but it’s kinda wonky.

Why Do I Want This Again?

Because a system model lets you do a number of clever automation things.

Standard Definition

If you’ve been doing Amazon yourself, you’re used to there being a lot of stuff you have to do manually.  From system build to system build even you do it differently each time, and God forbid you have multiple techies working on the same Amazon system. The basic value proposition of “don’t do things manually” is huge.  You configure the security groups ONCE and put it into the template, and then you’re not going to forget to open port 23 AGAIN next time you start a system. A core part of what DevOps is realizing as its value proposition is treating system configuration as code that you can source control, fix bugs in and have them stay fixed, etc.

And if you’ve been trying to automate your infrastructure with tools like Chef, Puppet, and ControlTier, you may have been frustrated in that they address single systems well, but do not really model “systems of systems” worth a damn.  Via new cloud support in knife and stuff you can execute raw “start me a cloud server” commands but all that nice recipe stuff stops at the box level and doesn’t really extend up to provisioning and tracking parts of your system.

With the CloudFormation template, you have an actual asset that defines your overall system.  This definition:

  • Can be controlled in source control
  • Can be reviewed by others
  • Is authoritative, not documentation that could differ from the reality
  • Can be automatically parsed/generated by your own tools (this is huge)

It’s also nicely transparent; when you go to the console and look at the stack it shows you the history of events, the template used to start it, the startup parameters it used… Moving away from the “mystery meat” style of system config.

Coordinated Control

With CloudFormation, you can start and stop an entire environment with one operation. You can say “this is the dev environment” and be able to control it as a unit. I assume at some point you’ll be able to visualize it as a unit, right now all the bits are still stashed in their own tabs (and I notice they don’t make any default use of their own tagging, which makes it annoying to pick out what parts are from that stack).

This is handy for not missing stuff on startup and teardown… A couple weeks ago I spent an hour deleting a couple hundred rogue EBSes we had left over after a load test.

And you get some status eventing – one of the most painful parts of trying to automate against Amazon is the whole “I started an instance, I guess I’ll sit around and poll and try to figure out when the damn thing has come up right.”  In CloudFront you get events that tell you when each part and then the whole are up and ready for use.

What It Doesn’t Do

It’s not a config management tool like Chef or Puppet. Except for what you bake onto your AMI it has zero software config capabilities, or command dispatch capabilities like Rundeck or mcollective or Fabric. Although it should be a good integration point with those tools.

It’s not a PaaS solution like Beanstalk or GAE; you use those when you just have an app you want to deploy to something that’ll run it.  Now, it does erode some use cases – it makes a middle point between “run it all yourself and love the complexity” and “forget configurable system bits, just use PaaS.”  It allows easy reusability, say having a systems guy develop the template and then a dev use it over and over again to host their app, but with more customization than the pure-play PaaSes provide.

It’s not quite like OVF, which is more fiddly and about virtually defining the guts of a single machine than defining a set of systems with roles and connections.

Competitive Analysis

It’s very similar to Microsoft Azure’s approach with their .cscfg and .csdef files which are an analogous XML model – you really could fairly call this feature “Amazon implements Azure on Amazon” (just as you could fairly call Elastic Beanstalk “Amazon implements Google App Engine on Amazon”.) In fact, the Azure Fabric has a lot more functionality than the primitive Amazon events in this first release. Of course, CloudFormation doesn’t just work on Windows, so that’s a pretty good width vs depth tradeoff.

And it’s similar to something like a RightScale, and ideally will encourage them to let customers actually submit their own definition instead of the current clunky combo of ServerArrays and ServerTemplates (curl or Web console?  Really? Why not a model like this?). RightScale must be in a tizzy right now, though really just integrating with this model should be easy enough.

Where To From Here?

As I alluded, we actually wrote our own tool like this internally called PIE that we’re looking to open source because we were feeling this whole problem space keenly.  XML model of the whole system, Apache Zookeeper-based registry, kinda like CloudFormation and Azure. Does CloudFormation obsolete what we were doing?  No – we built it because we wanted a model that could describe cloud systems on multiple clouds and even on premise systems. The Amazon model will only help you define Amazon bits, but if you are running cross-cloud or hybrid it is of limited value. And I’m sure model visualization tools will come, and a better registry/eventing system will come, but we’re way farther down that path at least at the moment. Also, the differentiation between “provisioning tools” that control and start systems like CloudFormation and bcfg2 and “configuration” tools that control and start software like Chef and Puppet (and some people even differentiate between those and “deploy” tools that control and start applications like Capistrano) is a false dichotomy. I’m all about the “toolchain” approach but at some point you need a toolbelt. This tool differentiation is one of the more harmful “Dev vs Ops” differentiations.

I hope that this move shows the value of system modeling and helps people understand we need an overarching model that can be used to define it all, not just “Amazon” vs “Azure” or “system packages” vs “developed applications” or “UNIX vs Windows…” True system automation will come from a UNIVERSAL model that can be used to reason about and program to your on premise systems, your Amazon systems, your Azure systems, your software, your apps, your data, your images and files…


You need to understand CloudFormation, because it is one of the most foundational changes that will have a lot of leverage that AWS has come out with in some time. I don’t bother to blog about most of the cool new AWS features, because they are cool and I enjoy them but this is part of a more revolutionary change in the way systems are managed, the whole DevOps thing.


Filed under Cloud, DevOps

Velocity 2010 – Facebook Operations

How The Pros Do It

Facebook Operations – A Day In The Life by Tom Cook

Facebook has been very open about their operations and it’s great for everyone.  This session is packed way past capacity.  Should be interesting.  My comments are  in italics.

Every day, 16 billion minutes are spent on Facebook worldwide.  It started in Zuckerberg’s dorm room and now is super huge, with tens of thousands of servers and its own full scale Oregon data center in progress.

So what serves the site?  It’s rerasonably straightforward.  Load balancer, web servers, services servers, memory cache, database.  THey wrote and 100% use use HipHop for PHP, once they outgrew Apache+mod_php – it bakes php down to compiled C++.  They use loads of memcached, and use sharded mySQL for the database. OS-wise it’s all Linux – CentOS 5 actually.

All the site functionality is broken up into separate discrete services – news, search, chat, ads, media – and composed from there.

They do a lot with systems management.  They’re going to focus on deployment and monitoring today.

They see two sides to systems management – config management and on demand tools.  And CM is priority 1 for them (and should be for you).  No shell scripting/error checking to push stuff.  There are a lot of great options out there to use – cfengine, puppet, chef.  They use cfengine 2!  Old school alert!  They run updates every 15 minutes (each run only takes like 30s).

It means it’s easy to make a change, get it peer reviewed, and push it to production.  Their engineers have fantastic tools and they use those too (repo management, etc.)

On demand tools do deliberate fix or data gathering.  They used to use dsh but don’t think stuff like capistrano will help them.  They wrote their own!  He ran a uname -a across 10k distributed hosts in 18s with it.

Up a layer to deployments.  Code is deployed two ways – there’s front end code and back end deployments.  The Web site, they push at least once a day and sometimes more.  Once a week is new features, the rest are fixes etc.  It’s a pretty coordinated process.

Their push tool is built on top of the mystery on demand tool.  They distribute the actual files using an internal BitTorrent swarm, and scaling issues are nevermore!  Takes 1 minute to push 100M of new code to all those 10k distributed servers.  (This doesn’t include the restarts.)

On the back end, they do it differently.  Usually you have engineering, QA, and ops groups and that causes slowdown.  They got rid of the formal QA process and instead built that into the engineers.  Engineers write, debug, test, and deploy their own code.  This allows devs to see response quickly to subsets of real traffic and make performance decisions – this relies on the culture being very intense.  No “commit and quit.”  Engineers are deeply involved in the move to production.  And they embed ops folks into engineering teams so it’s not one huge dev group interfacing with one huge ops group.  Ops participates in architectural decisions, and better understand the apps and its needs.  They can also interface with other ops groups more easily.  Of course, those ops people have to do monitoring/logging/documentation in common.

Change logging is a big deal.  They want the engineers to have freedom to make changes, and just log what is going on.  All changes, plus start and end time.  So when something degrades, ops goes to that guy ASAP – or can revert it themselves.  They have a nice internal change log interface that’s all social.  It includes deploys and “switch flips”.

Monitoring!  They like ganglia even tough it’s real old.  But it’s fast and allows rapid drilldown.  They update every minute; it’s just RRD and some daemons.  You can nest grids and pools.  They’re so big they have to shard ganglia horizontally across servers and store RRD’s in RAM, but you won’t need to do that.

They also have something called ODS (operational data store) which is more application focused and has history, reporting, better graphs.  They have soooo much data in it.

They also use nagios, even though “that’s crazy”.  Ping testing, SSH testing, Web server on a port.  They distribute it and feed alerting into other internal tools to aggregate them as an execution back end.  Aggregating into alarms clumps is critical, and decisions are made based on a tiered data structure – feeding into self healing, etc.  They have a custom interface for it.

At their size, there are some kind of failures going on constantly.  They have to be able to push fixes fast.

They have a lot of rack/cluster/datacenter etc levels of scale, and they are careful to understand dependencies and failure states among them.

They have constant communication – IRC with bots, internal news updates, “top of page” headers on internal tools, change log/feeds.  And using small teams.

How many users per engineer?  At Facebook, 1.1 million – but 2.3 million per ops person!  This means a 2:1 dev to ops ratio, I was going to ask…

To recap:

  • Version control everything
  • Optimize early
  • Automate, automate, automate
  • Use configuration management.  Don’t be a fool with your life.
  • Plan for failure
  • Instrument everything.  Hardware, network, OS, software, application, etc.
  • Don’t spend time on dumb things – you can slow people down if you’re “that guy.”
  • Priorities – Stability, support your engineers

Check facebook.com/engineering for their blog!  And facebook.com/opensource for their tools.

Leave a comment

Filed under Conferences, DevOps

Velocity 2010: Infrastructure Automation with Chef

After a lovely lunch of sammiches, we kick into the second half of Workshop Day at Velocity 2010.  Peco and I (and Jeff and Robert, also from NI) went to Infrastructure Automation with Chef, presented by Adam Jacob, Christopher Brown, and Joshua Timberman of Opscode.  My comments in italics.

Chef is a library for configuration management, and a system written on top of it.  It’s also a systems integration platform, as we will see later.  And it’s an API for your infrastructure.

In the beginning there was cfengine.  Then came puppet.  Then came chef.  It’s the latest in open source UNIXey config management automation.

  • Chef is idempotent, which means you can rerun it and get the same result, and it does minimal work to get there.
  • Chef is reasonable, and has sane defaults, which you can easily change.  You can change its mind about anything.
  • Chef is open source and you can hack it easily.  “There’s more than one way to do it” is its mantra.

A lot of the tools out there (meaning HP/IBM/CA kinds of things) are heavy and don’t understand how quickly the world changes, so they end up being artifacts of “how I should have built my system 10 years ago.”

It’s based on Ruby.  You really need a third gen language to do this effectively; if they created their own config structure it would grow into an even less standard third gen language.  If you’re a sysadmin, you do indeed program, and people that say you’re not are lying to you.  Apache config is a programming language. Chef uses small composable primitives.

You manage configuration as idempotent resources, which are put together in recipes, and tracked like source code with the end goal of configuring your servers.

Infrastructure as Code

The devops mantra.  Infrastructure is code and should be managed with the same rigor.  Source control, etc.  Chef enables this approach.  Can you reconstruct your business from source code, data backup, and bare metal?  Well, you can get there.

When you talk about constraints that affect design, one of the largest and almost unstated assumptions nowadays is that it’s really hard to recover from failure.   Many aspects of technology and the thinking of technologists is built around that.  Infrastructure as code makes that not so true, and is extremely disruptive to existing thought in the field.

Your automation can only be measured by the final solution.  No one cares about your tools, they care about what you make with them.

Chef Basics

There is a chef client that runs on each server, using recipes to configure stuff.  There’s a chef server they can talk to – or not, and run standalone.  They call each system a “node.”

They get a bunch of data points, or attributes, off the nodes and you can search them on the server, like “what version of Perl are you running.”  “knife” is the command line tool you use to do that.

Nodes have a “run list.”  That’s what roles or recipes to apply to a node, in order.

Nodes have “roles.”  A role is a description of what a node should be, like “you’re a Web server.”  A role has a run list of its own, and attributes to modify them – like “base, apache2, modssl” and “maxchildren=50”.

Chef manages resources on nodes.  Resources are declarative descriptions of state.  Resources are of type package or service; basically software install and running software.  Install software at a given version; run a service that supports certain commands.  There’s also a template resource.

Resources take action through providers.  A provider is what knows how to actually do the thing (like install a package, it knows to use apt-get or yum or whatever).

Think about it as resources go through a platform to pick a provider.

Recipes apply resources in order.  Order of execution is determined by the order they’re listed, which is pretty intuitive.  Also, systems that fail within a recipe should generally fail in the same state.  Hooray, structured programming!

Recipes can include other recipes.  They’re just Ruby.  (Everything in Chef is Ruby or JSON). No support for asynchronous actions – you can figure out a way to do it (for file transfers, for example) but that’s really bad for system packages etc.

Cookbooks are packages for recipes.  Like “Apache.”  They have recipes, assets (like the software itself), and attributes.  Assets include files, templates (evaluated with a templating language called ERB), and attributes files (config or properties files).  They try to do some sane smart config defaults (like in nginx, workers = number of cores in the box).  Cookbooks also have definitions, libraries, resources, providers…

Cookbooks are sharablehttp://cookbooks.opscode.com/ FTW! They want the cookbook repo to be like CPAN – no enforced taxonomy.

Data bags store arbitrary data.  It’s kinda like S3 keyed with JSON objects .  “Who all plays D&D?  It’s like a Bag of Holding!”  They’re searchable.  You can e.g. put a mess of users in one.  Then you can execute stuff on them.  And say use it instead of Active Directory to send users out to all your systems.  “That’s bad ass!” yells a guy from the crowd.

Working with Chef

  1. Install it.
  2. Create a chef repo.  Like by git cloning their stock one.
  3. Configure knife with a .chef/knife.rb file.  There’s a Web UI too but it’s for feebs.
  4. Download some cookbooks.  “knife cookbook site vendor rails -d” gets the ruby cookbook and makes a “vendor branch” for it and merges it in.
  5. Read the recipes.  It runs as root, don’t be a fool with your life.
  6. Upload them to the server.
  7. Build a role (knife role create rails).
  8. Add cloud credentials to knife – it knows AWS, Rackspace, Terremark.
  9. Launch a new rails server (knife ec2 server create ‘role[rails]’) – can also bootstrap
  10. Run it!
  11. Verify it!  knife ssh does parallel ssh and does command, or even screen/tmux/macterm
  12. Change it by altering your recipe and running again.

Live Demo

This was a little confusing.  He started out with a data bag, and it has a bunch of stuff configured in it, but a lot of the stuff in it I thought would be in a recipe or something.  I thought I was staying with the presentation well, but apparently not.

The demo goal is good – configure nagios and put in all the hosts without doing manual config.

Well, this workshop was excellent up to here – though I could have used them taking a little more time in “Working with Chef” – but now he’s just flipping from chef file to chef file and they’re all full of stuff that I can’t identify immediately because I’m, you know, not super familiar with Chef.  THey really could have used a more “hello world”y demo or at least stepped through all the pieces and explained them (ideally in the same order as the “working with chef” spiel).

Chef 0.8 introduced the “chef shell,” shef.    You can run recipes line by line in it.

And then there was a fire alarm!  We all evacuate.  End of session.

Afterwards, in the gaggle, Adam mentioned some interesting bits, like there is Windows support in the new version.  And it does cloud stuff automatically by using the “fog” library.  And unicorn, a server for people that know about 200% more about Rails than me.  That’s the biggest thing about chef – if you don’t do any other Ruby work it’s a pretty  high adoption bar.

One more workshop left for Day 1!


Filed under Conferences, DevOps