Here’s a DevOps 101 presentation, based on the definition of DevOps here at The Agile Admin, that I’m delivering at Innotech San Antonio tomorrow as part of a devops.com attempt to spread DevOps learning to IT and the enterprise. (You probably want to go view it on SlideShare so you can read the notes, too…)
We’re entering cool event season… I thought I’d mention a bunch of the upcoming major events you may want to know about!
- Product Camp Austin, Saturday March 7 (free), for product management types
- Puppet Camp Austin, Tuesday March 10 ($50), for Puppet users
- SXSW Interactive, March 13-17 ($1295)
- Container Days Austin, Friday evening March 27 and Saturday March 28 ($60), for docker, CoreOS, and containerization fanatics.
- DevOpsDays Austin 2015, Monday May 4-Tuesday May 5 ($120), for lovers of DevOps. Sponsorships and call for presenters are also open.
- Keep Austin Agile 2015 – Friday May 8 ($285), for Agilists of all stripes.
In terms of repeating meetings you should be going to,
- CloudAustin – Evening meeting every 3rd Tuesday at Rackspace for cloud and related stuff aficionados! Large group, usually presentations with some discussion.
- Agile Austin DevOps SIG – Lunchtime discussion, Lean Coffee style, at BancVue about DevOps. Sometimes fourth Wednesdays, sometimes not. There are a lot of other Agile Austin SIGs and meetings as well.
- Austin DevOps – Evening meetup all about DevOps. Day and location vary.
- Docker Austin – First Thursday evenings at Rackspace, all about docker.
- Product Austin – Usually early in the month at Capital Factory. Product management!
It’s been a while since there’s been a post here. All the Agile Admins are insanely busy.
- James is working at Signal Sciences, trying to revolutionize app security with a DevOps approach, along with Zane Lackey and Nick Galbreath.
- Karthik is developing madly away at StackEngine, a Docker container orchestration startup, with Eric Anderson.
- Peco is product managing at Riverbed, working to bring their APM solutions to crush-the-market levels.
- I (Ernest) am working as a PM as well, at Idera – when I started it was CopperEgg, a SaaS monitoring startup (originally started by Eric of StackEngine), now part of a larger APM machine – Idera does database APM, and bought Precise, up.time, and CopperEgg to form an APM juggernaut of our own. Watch out Peco, we’re coming!
We’re all still in Austin – leaving Austin is for suckers. And though we’ve been away from National Instruments, where we worked together for so many years, we still have lunch weekly and generally help keep each other up to date on all this tech stuff!
Meanwhile the “recreational activities” are just as busy. Peco’s had his second baby in two years, so that’s his off hours work. But in other news,
- CloudAustin – James, Karthik, and I organize this monthly user group, which is in its fifth year and has a solid 40-60 people per meeting.
- DevOpsDays Austin – we also organize this, also coming back for its fourth year this May 4-5! It’ll be just as huge this year, two tracks, the Marchesa, bands, Austin food – get those plane tickets now.
- Container Days Austin– Karthik suckered me into helping him and Boyd Hemphill with this too, a brand new unconference focused on containers March 27-28. “Docker docker docker!” as all the kids say nowadays.
- And we all do random other things like that on our own. My most exciting current news – I’m one of the group of folks Gene Kim is having pre-read his new DevOps Cookbook to give feedback. So my life is changed before everyone else’s!
Some folks have been kind enough to ask me questions here on the blog, and some have actually called me up and asked for advice on their own DevOps journey, so I promise to get back to posting here and answering some of those outstanding questions. Thanks for visiting!
TL;DR – performance improvements and two huge announcements, Docker-based EC2 Container Service and cloud-CEP-like AWS Lambda.
I was in a meeting for the first 45 minutes but I hear I didn’t miss much. Happy customer use cases.
The first big theme of this morning’s keynote is “Containers” – often just shorthand for “docker.” I went to a previous event here in town with even large enterprises and government – State of Texas, Microsoft, Dell, Red Hat – all freaking out about Docker. Docker is similar to VMWare or cloud in that it is a new technology that requires new monitoring and management just for it. (Heck, Eric, the CopperEgg founder, is now running a startup around docker container management, StackEngine.)
- Keynote from pristine.io about how they implemented. Docker, the new low overhead containerization technology, is a heavily cited part of the power (they actually used Flux7 as the expert consultants, they’re based here in Austin!).
- Keynote from Werner Vogels on the new “Amazon EC2 Container Service,” announced to cheers and applause. It allows launching and terminating containers to sets of instances on EC2. Their PM did a demo where they had a big farm of r3 servers and then they deploy a redis cluster and rabbitmq across them, and then front end components on a farm of c3s, and then audio processing across all of them. If you’re new to this it’s basically VMs within VMs but without noticeable overhead.
- Next they had the actual docker cofounder and CEO Ben Golub. He mentioned that docker is only 18 months old and its huge success and ecosystem this early in is “surreal.”
Next… Leapfrogging PaaS?
- Werner is back to announce AWS Lambda available now in preview – event-driven computing service for dynamic applications. No instance running/management required, events go in and “cloud functions” run on them. Holy shit, this replaces a large number of servers running semi-trivial apps. 20 cents per million requests, plus some complex stuff for seconds of execution – free for 3.2M seconds/1M requests.
- Netflix chief product guy came on to show how they’re using lambda as a higher level abstraction and have eliminated a bunch of servers – no system monitoring/management, no inefficient polling, no gaps/opacity. They’re using it to encode video, run backups, run security and compliance checks against instances, and for operational monitoring and dashboards. Replacing procedural control systems with event-driven services.
- AWS core innovations… New c4 instance, Haswell based (crazy fast processor, 36 vCPUs). Diane Bryant, SVP/GM Data Center Group from Intel, came on to go into the CPU specifically. Larger and faster EBS volumes, up to 20,000 IOPS. Enhanced and consistent networking speeds.
And this has been your cloud update! Also see Ben Kepes in Forbes for a similar summary.
The container engine is cool – it’ll certainly remove a lot of instance gerrymandering and instance reservation pain if nothing else. But Lambda is the potential disruptor here. It’s taking the idea of “bring your own algorithm” from MapReduce and saying “hmmm you can probably replace your trivial web app just with this” – it’s halfway between a PaaS and a SaaS, none of the Beanstalk complexity, just “here take this function and run it on stuff when it comes in.” If a library of common lambdas becomes available, so much computing work done for trivial purposes becomes obsolete. Who hasn’t seen a Web service to “upload a file here, then zip it or something, then store it…” OK, no servers needed any more. Very interesting.
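To make the “upload a file, zip it, store it” case concrete, here is a minimal sketch of what such an event-driven function could look like in Python. The event shape (`name` and `body` keys) and the in-memory `STORE` dict are assumptions made purely for illustration; a real Lambda function would receive a service-defined event and write to durable storage, neither of which is shown here.

```python
import io
import zipfile

# Hypothetical stand-in for durable storage (for example an object store).
STORE = {}

def handler(event, context=None):
    """Zip the uploaded file described by `event` and store the archive.

    The event layout ({"name": str, "body": bytes}) is an assumption for
    this sketch, not the format any real service emits.
    """
    name = event["name"]
    body = event["body"]

    # Build the zip archive entirely in memory.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(name, body)

    key = name + ".zip"
    STORE[key] = buffer.getvalue()
    return {"stored_as": key, "bytes": len(STORE[key])}
```

The point is the shape: one function, triggered per event, with no server to provision or monitor around it.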
Sadly I couldn’t attend this year, but heck that’s what the Internet is for. Here’s the interesting bits from the AWS re:Invent Day 1 keynote (livestreamed here). Loads of interesting stuff.
- AWS is growing revenue >40% YOY, far outstripping other large IT companies – EC2 use grew 99% YOY and S3 usage 137%, they have 1M active customers now. (Microsoft cloud services report 128% YOY growth as well.)
- New product announcement for Aurora – new commercial-grade database engine – fully MySQL compatible but 5x the performance, available through Amazon RDS, 1/10 the cost of the commercial DB engines (starts at 29 cents an hour, ~$210/mo). Can do 6M inserts/second and 30M selects/second. Highly durable (11 9’s), crash recovery in seconds with no data loss. Nice!
- SDLC stuff!
- CodeDeploy (was internal tool called Apollo), a new code-deployment system that lets you do rolling updates, rollbacks, and tracks deployment health. This works for all languages and is free. They use it internally for 95 deploys/hour on their own stuff.
- In early 2015 will come some more software lifecycle management services – first is CodePipeline for continuous integration and deployment (also used internally)
- Second is CodeCommit as a managed code repository that can colocate with where you’re going to deploy and has no size limits of repos or files. These “integrate with” github, jenkins, chef, etc. though it’s not clear how they don’t cannibalize them.
- Security stuff! Big push to be able to say “we easily surpass the security you can do on premise.”
- FISMA, ITAR, FIPS, FedRAMP, HIPAA, ISO 9001
- Current encryption approach is either “let Amazon manage keys” or use their CloudHSM hosted key thing, both of which are still a pain. As a result they’re launching AWS Key Management Service as a HA service that manages keys, provides one-click encryption and transparent key rotation.
- AWS Config is a new-gen agile CMDB with full visibility into all your AWS resources. You can query it and see relationships and show scope of a config change. Streams all config changes out to you.
- A new-gen service catalog called AWS Service Catalog available early 2015. Create and share product portfolios, let internal people launch them, tracking and compliance.
- Enterprise Cloud Adoption Patterns
- Often the first wave of moving into the cloud for enterprises is moving dev and test environments to run in AWS for flexibility and spin up/down for cost savings and brand new apps, custom written for the cloud
- Second wave is web sites and digital transformation (media, corp sites, ecomm) and analytics, since mass processing and sharing is cheap in the cloud – data warehouses (like pfizer’s). And mobile app back ends – phone, tablet, gps, more.
- Third wave is business critical applications. Macmillan and Hoya run their SAP in AWS. Conde Nast runs HR and Legal there.
- New wave – you’re starting to see entire datacenter migration and consolidation as DCs come up for lease (Hess, Conde Nast, NewsCorp). SunCorp, Time Inc., GPT, and Nippon Express are moving “all in” to AWS – many ISVs as well. The CIA moved to AWS, and Intuit is now doing so as well.
- Intuit moved their “TurboTax AnswerXchange” app there to deal with tax time peaks last year and the scales fell from their eyes when they did so – 6x cost cut, setup 1/5 of the time, faster development. They started doing more and realized the global datacenters, ease of integration with acquisitions, and dev recruiting benefits. They have 33 services on AWS now, and have moved mint.com there. They have decided to move everything else there now. Funny how once companies start looking at how much they accomplish instead of just the monthly cost the “cloud is more expensive at scale” argument gets dropped like a flaming bag of poo.
- Hybrid cloud
- Various stuff like directory service (AD in the cloud) and identity federation and storage gateway and SystemCenter and vCenter integration already exist to power mixed shops
- Johnson & Johnson went on for a while about their use of AWS. They are planning a 25,000 seat deployment of Workspaces (virtual desktop offering, like Citrix).
Whew, that’s the quick notes version. Aurora is obviously of interest – a lot of the fretting over whether to use MySQL or RDS I’ve seen will get settled by this – it was just “well, run the same thing yourself or have them do it…” and now it’s “have them run something insanely better”. But the SDLC tools are also interesting – they made noise about how these “work with!” ansible, jenkins, git, etc. but that seems mildly disingenuous; without having looked into it any further yet, they sound more like direct competition for them. But the config and service catalog could be great extensions – yay for simple composable services, not huge painful “BSM/ITMOM suites”.
Feel free to share your thoughts on the announcements in the comments section!
OK, so it’s been a while since the last installment in this series. I had talked about how we’d brought Scrum to our operations team, which worked fine, and then added in the developers as well, which didn’t. Our first attempt at dividing the whole product up into four integrated DevOps service teams collapsed quickly. Here’s how we got past that to do it again and succeed, with a fully integrated DevOps setup managing both proactive/project/feature work and reactive/support/tactical work in an effective way.
The first challenge we had was just that I couldn’t manage four new scrum teams totally on my own. We had gotten the ops team working well on Scrum but the development team hadn’t been. We didn’t have any scrum masters other than me, and were low on people designated as tech leads as well. So step one, we just mushed everyone into a 30+-person Scrum team, which sucked but was the least of the evils, and I immediately worked on mentoring people up as both Scrum masters and tech leads. I basically led by example and asked for volunteers for who was interested in Scrum mastering and/or being a tech lead for each of the planned sub-teams.
This was interesting – people self-selected in ways I would not have predicted. Some of the willing were employees and others were contractors – fine, as far as I’m concerned. “From each according to his ability, to each according to his need.” I then maximally had them start to lead standups and take ownership over sprints and such, coaching them in the background. In that way, it only took a little while to re-form up into those four teams. This gave some organic burn-in time as well, so that the ops folks got more used to not being “the ops team” and leaning on people that were supposed to be on other sprint teams. As I empowered the new leaders this became self-correcting (“Stop working with that other guy on their stuff, we have things that need doing for our service!”).
The second and largest problem was managing the work.
Part 1: “Planned” Work
We had practically zero product manager support at the beginning, so it was up to us to try to balance planned work from many sectors – feature requests from PM, tooling requests from Customer Service and Technical Support, security stuff from Security, random requests from Legal and Compliance, brainwaves from other products’ engineering managers. It quickly began to take a full person worth of time to handle, because if not managed proactively it turned into attending meetings with people complaining as a reactive result. I went to the PM team and said “come on man, give us more PM support,” and once we got one, I worked with her on being able to manage the whole overall package.
One of the chronic product manager problems in a SaaS environment is the “not my domain” syndrome. PMs want to talk about features, but when it comes to balancing stakeholders like Security and internal operational projects, they wash their hands of it. At best you get a “Well… You guys get 50% of your time to figure out what to do with that stuff and then 50% of your bandwidth will be my features, OK?”
For the record, that’s not OK. As a product manager, you need to be able to balance all the work against all the work. Maybe you don’t have an ops background, that’s fine – you probably didn’t have a <business domain here> background when you came to work either. Learn. A lot of the success of a SaaS product is in the balancing of features against stability/scalability work against compliance work… If you want to take the “I’m the CEO of the product” role, then you need to step up and own all of it, otherwise you’re just that product’s Director of Wishful Thinking.
Anyway, luckily our PM, even though she didn’t have experience doing that, was willing to learn and take it on. So we could reason about spending a month putting in a new cluster versus adding a new display feature – they all went into the same roadmap.
Work Management in JIRA
We managed that in the standard, simple way of a quarterly spreadsheet roadmap correlating to epics in an Agile rapid board in JIRA; we’d do stakeholder meetings and stack rank them and then move them into the team backlogs and flesh them out when they were ready for execution. (It was important to keep the clutter out of the backlogs till then – even if it’s supposed to be future stuff, the more items sitting in a team’s backlog, the worse their focus is.)
We kept each service as one JIRA project but combined them into rapid boards as necessary – a given team might own a couple (like the “Workbench and CMS Team” had those two and a smaller tooling one). This way when we transferred a piece of tech around we could just move the JIRA project as well, and incorporate it into the target team’s rapid board.
Some people say that the ideal world is each team owning one microservice. I don’t agree with this – we had a number of teams under other parts of the org that were only like 2 people because they owned some little service. This was difficult to sustain and transition; when things like Black Friday came up that required 24×7 support from each team for a week it was brutal on them, and even worse once development eased up on a service it just got orphaned.
If you don’t keep a service portfolio and tightly manage who is tasked with supporting each one, you are opening yourself up to a world of hurt. And that’s where we started. Departmentwide, we’d have teams work on something and then wander off and if there was a problem they’d call on the dev who was somewhere working on some other deadline now. This worked terribly. I got the brunt of this earlier when I had joined the company as a release manager and asked “So, those core services, who owns those?” “Well… No one. I mean everyone!”
So for my teams, we put together a tracking list, used both as a service registry but also for cross-training. It was a simple three-level outline, with major services, service components, and low level elements. We had the whole team work together on filling out the list – since we were managing a 7-year-old legacy system we ended up with a list of 275 or so leaf items. Every one had to have an owning team (team, not individual), and unless you retired a service, there was no dropping it on the floor. You owned it, you retired or transitioned it, and I didn’t care if “the dev that worked on that moved to some other team.” Everything was included, from end user facing services like “the user portal” to internal services like our CI servers.
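A registry like this does not need special tooling. As a rough sketch, assuming a flat list of leaf entries (the field names, team names, and sample rows below are hypothetical; only “user portal” and CI servers are mentioned in the text), the “every leaf has an owning team” rule becomes a one-line check:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RegistryEntry:
    service: str      # level 1: major service
    component: str    # level 2: service component
    element: str      # level 3: low level element (the leaf)
    owning_team: Optional[str] = None  # a team name, never an individual

def unowned(entries):
    """Return the leaf entries that have no owning team."""
    return [e for e in entries if not e.owning_team]

def by_team(entries):
    """Group leaf element paths by owning team."""
    grouped = {}
    for e in entries:
        if e.owning_team:
            path = f"{e.service} / {e.component} / {e.element}"
            grouped.setdefault(e.owning_team, []).append(path)
    return grouped

# Hypothetical sample rows for illustration only.
REGISTRY = [
    RegistryEntry("User portal", "Login", "Password reset", "Portal Team"),
    RegistryEntry("User portal", "Login", "Session store", "Portal Team"),
    RegistryEntry("CI servers", "Build agents", "Artifact cleanup job"),
]
```

Running `unowned(REGISTRY)` surfaces anything that has been dropped on the floor, and `by_team(REGISTRY)` gives each team the list it is on the hook for.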
This transitions into how we managed the teams. Teams were standard “two-pizza size” – a team of 5-7 people is optimal and we would combine services up till there was enough work for a team of that size. This avoided the poor coverage of mini-teams for micro-services.
Then we also used the service registry as the “merit badge system.” We had a simple qualification procedure – to get qualified in an element, you had to roll out code on it and you could sign off for yourself to say that you were qualified on one of those leaf elements. To get your “merit badge” in a service component, you needed an existing subject matter expert (SME) to sign off that you knew what you were doing, and you needed to understand how the service was written, deployed, tested, monitored. To become a SME in a service, the existing SMEs would have a board of review. SMEs were then understood to be qualified to make architectural changes on their own initiative, those with merit badges were understood to be qualified to make code changes with nothing more than the usual code review and CI testing.
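The qualification rules above can be written down as a small lookup. This is only a sketch of the rules as described; the names `Level`, `can_grant`, and `allowed_changes` are made up for illustration, and what an element-level qualification permits is not stated in the text, so the sketch assumes it grants nothing beyond normal oversight.

```python
from enum import Enum

class Level(Enum):
    QUALIFIED = "qualified on an element"
    MERIT_BADGE = "merit badge in a service component"
    SME = "SME in a service"

def can_grant(level, *, rolled_out_code=False, sme_signoff=False,
              board_approved=False):
    """Return True if the evidence required for `level` is present."""
    if level is Level.QUALIFIED:
        return rolled_out_code   # self sign-off after rolling out code
    if level is Level.MERIT_BADGE:
        return sme_signoff       # an existing SME signs off
    if level is Level.SME:
        return board_approved    # board of review by the existing SMEs
    return False

def allowed_changes(level):
    """Kinds of change a person at `level` may make on their own initiative."""
    if level is Level.SME:
        return {"code", "architecture"}
    if level is Level.MERIT_BADGE:
        return {"code"}          # still subject to code review and CI
    return set()                 # assumption: not specified in the text
```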
This was very important for us because we were starting in a place where people had been allowed to specialize too much. Only one person was the SME on any given service and if the person who understood the Workbench was out, or quit, or whatever, suddenly no one knew a whole lot about it. That’s a terrible velocity burden even without attrition and it’s a clear and present danger to your service’s viability when there is. We started tracking the “merit badges” and had engineers set goals around how many they’d earn (or how many they’d review, for the more experienced). We used a lot of contract programmers and I told the contractor manager that I wanted to use that to rate the expertise of people on our account and that I wanted to see the numbers rise on each person over time.
Part 2 – “Unplanned” Work
Our team was only doing planned work 40% of the time, however. Since we were integrated DevOps teams working on a service with thousands of paying end customers, and that service was custom-integrated with customer Web sites and import/export of customer data, there was a continuous load of work coming in from Customer Support and from our own operations (alerts, etc.). All this “flow” work was interrupt-driven, and urgent to one degree or another.
The usual techniques for handling this have a lot of problems. “Let’s make a separate sustaining team!” Well, this turns into a set of second class developers. Thus you get worse devs there on that team. And those devs are more utilized one week, less the next; when things are quiet they get seconded to other efforts by people that haven’t read the Principles of Product Development Flow and think having everyone highly utilized is valuable, and then the load ramps up and something gives… Plans emerge to make it a training ground for new devs, till you realize that even new devs don’t want to put up with that and just quit… I’ve seen this happen in several places. I am a firm believer in dog fooding – if you are writing a service, you need to handle its problems. If there are a lot of problems, fix it!!! If you are writing the next version of the service – well, doing that in isolation from the previous service dooms you to making the same errors, adding in some new ones as well. No sustaining teams, only evergreen teams.
So we had the rule that each integrated team of devs and ops handled their own dirty laundry. And at 60% of the workload, that was a lot of laundry. How did we do it? Everyone was worried about how we could deliver decent feature velocity while overwhelmed by flow work. Therefore…
In addition to the teams’ daily standups, we had a daily triage meeting attended by the engineering manager, team leads, PM, representatives from Support and Sales, and whoever had a crisis item that they felt needed to be handled on an expedited basis (not waiting on the next sprint, in other words). Each new intake would be reviewed. In the beginning we had to kick a lot of requests back for insufficient detail, but that corrected itself fast. We’d all agree on priority – “that’s affecting one customer but they’re one of our top customers, we’ll work it as P2” or the like.
For customer reported issues, we had SLAs we committed to in contracts (1 day for P1, etc). Ops issues could be super urgent – “P1, one of our services is down!” – or doable later on. So what we did was to create a separate Kanban board in JIRA for all these kinds of issues. Anything that really could wait or was large/long term would get migrated into the Scrum backlogs, but anything where “we need someone to do this stat” went in here. It served the same purpose as an “expedite lane,” it was just a jumbo four-lane highway of an expedite lane because so much of the work was interrupt driven.
But does this mean “give up on Scrum?” No. Without the sprint cadence it was hard to hit a release cadence, and also it was easy for engineers to just get lost in the soup and stop delivering at a good rate, frankly. So we still ran sprints, and here were the rules of engagement we provided the engineers.
Need More Work?
- Pull things for your team/service off the triage queue
- If there’s nothing in the triage queue, pull the next item from the sprint backlog
- If there’s nothing in the sprint backlog or the triage queue, pull something off the top of the backlog. Or relax, either one.
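The pull order above is simple enough to express as a function. This is a sketch only: the item shape (dicts with `team` and `title` keys) and the function name are assumptions for illustration, not how the JIRA boards were actually queried.

```python
def next_work_item(triage_queue, sprint_backlog, product_backlog, team):
    """Pick the next item for an engineer on `team`, in the order above."""
    # 1. Anything on the triage queue for your team/service comes first.
    for item in triage_queue:
        if item["team"] == team:
            return item
    # 2. Otherwise, the next item from the sprint backlog.
    if sprint_backlog:
        return sprint_backlog[0]
    # 3. Otherwise, the top of the overall backlog.
    if product_backlog:
        return product_backlog[0]
    # Nothing to pull: relax.
    return None
```

Note that the function only reads the queues; actually claiming the item (assigning it, moving it across the board) is left to whatever tracker is in use.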
Then for standups, we had a master Agile board that contained everything – all projects, the triage board, everything. So when you looked at a given engineer’s swimlane, you could see the sprint work they had, the flow work they had, and anything they were working on from someone else’s project (“Hey, why are you doing that?”). Again, via JIRA agile board composition that’s easy to do. Sometimes teams would try to do standups just looking at swimlanes containing “their” projects and it always ended up with things falling in the gap, so each time that happened I reiterated the value of seeing everything that person is assigned, not just what you think/hope they are assigned, since they are exclusively attached to that one team.
At first, everyone fretted about the conflict between the flow work and the sprint work. “What if there’s a lot of support work today?!?” But as we went on sprint by sprint, it became evident that over the course of a two week sprint, the amount of support and operations work evened out. Sprint velocity was regular despite the team working flow work as well. Having full-sized sprint teams and two-week iterations meant that even if one day there was a big production issue and whoever grabbed it was slammed – over the rest of the time and people it evened out. This shouldn’t be too surprising, it’s called “flow” for a reason – there are small ebbs and surges but in general the amount over time at scale is the same. Was it as perfectly “efficient” as having some people working flow and others working sprints? No, there is definitely a % overhead that incurs. But maximum utilization of each person’s time, as I mentioned before, is a golden calf. Lean principles show us that we want the best overall outcome – and this worked.
Flow work was addressed quickly, and our customer ticket SLA attainment percentage whipped up from a starting level of less than 50% over several quarters to sit at 100%, meaning every single support ticket that came in was addressed within its advertised SLA. Once that number hit 100% the support ticket time to live started to fall as well.
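For anyone wanting to track the same number, SLA attainment is just the share of tickets addressed within the target for their priority. A minimal sketch follows; only the one-day P1 target comes from the text, while the P2 target and the ticket representation (priority, time-to-address pairs) are placeholders assumed for illustration.

```python
from datetime import timedelta

# Only the one-day P1 target is from the text; P2 is a made-up placeholder.
SLA = {"P1": timedelta(days=1), "P2": timedelta(days=3)}

def sla_attainment(tickets):
    """Percentage of tickets addressed within the SLA for their priority.

    `tickets` is a list of (priority, time_to_address) pairs.
    """
    if not tickets:
        return 100.0  # design choice: no tickets means nothing was missed
    met = sum(1 for priority, took in tickets if took <= SLA[priority])
    return 100.0 * met / len(tickets)
```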
At the same time, sprint velocity for each of the four sprint teams went up over time – that is, just the story points they were delivering out of the feature backlog improved, in addition to the improvements in flow work. We’d modify the process based on engineer feedback and try alterations for a sprint to see how they panned out, but in general by keeping the overall frame in place long enough that people got used to it, they became more and more productive. Flow work and planned work both improved at the same time, not at each others’ expense.
The Dynamic Duo
This scheme had two issues. One was that engineers were sometimes confused about what they should be working on, flow tickets with SLAs or their sprint tasks. Our Scrum masters were engineers on the teams too, they weren’t full time PMs that could afford to be manually managing everyone’s work. The second was that operational issues that came in and required sub-day response time couldn’t wait for triage and ended up with either frantic searches for anyone who could help (which often became “everyone sitting in the team area”) or missed items.
I have always been inspired by Tom Limoncelli’s Time Management for System Administrators. He advocates an “interrupt shield” where someone is designated as the person to handle walkups and crises. At NI I had instituted this in process, at BV in the previous Ops team there had been a “The Dude” role (complete with Jeff Bridges bobblehead) that had been that. Thus the Dynamic Duo were born.
The teams each had one dev and one DevOps on call at a given time; we managed the schedule in PagerDuty. Whoever was on call that week became “on the Dynamic Duo” during the days as well. When we went into a sprint, they would not pull sprint tasks and would be dedicated to operational and urgent support issues. It was the Dynamic Duo because we needed someone with ops and someone with dev expertise on this – one without the other meant problems were not effectively solved. I even made a cool wiki page with Batman and Robin stuff all over it and we got Bat-phones painted red. I evangelized this inside the company. “Don’t walk up and grab a dev with your question. Don’t chat your favorite engineer to get them to work on something. Come turn on the Bat-signal, or call the Bat-phone, and the Dynamic Duo will come help you.”
This was good because it blunted the sharp tip of very urgent requests. The remaining flow work (2 people couldn’t handle the 60% of the load that was interrupt driven) was easier for the sprinting devs to pull in as they finished sprint tasks without worrying about timeliness – the real crises were handled by the Dynamic Duo. The Duo also ran the triage meeting and even if they weren’t working on all the triage work, they bird dogged it as a kind of scrum/kanban/project manager over the flow work. In the rare case there wasn’t anything for them to jump on, they could slack – time to read, learn, etc. both as compensation for the oncall and adrenaline rushes that week but also because it’s hard to fit time for that into sprints… And as we know from the Principles of Product Development Flow, running teams at too high of a utilization is actually counterproductive.
That’s the short form of how we did it – I wanted to do a lot more but I realized that since it’s been a year since I intended to write this, I’d better shake and bake and get it out and then I’m happy to follow up with stuff any of you are curious about!
This got us to a pretty happy place. As time went on we tweaked the sprint process, the triage process, and the oncall/Duo process but for a set of teams of our size with our kind of workload it was close to an optimal solution. With largely the same team on the same product, the results of these process changes were:
- Flow work improved as measured by customer ticket SLA attainment and other metrics
- Sprint work velocity improved as measured by JIRA reports
- Engineering satisfaction improved as measured by internal NPS surveys
Improvement of all these factors was not slight, but was instead 50% or more in all cases.
Feel free to ask me about parts of this you find interesting and I’ll try to expand on them. It wasn’t as simple as “add Agile” or “add DevOps,” it definitely took some custom wrangling to balance our specific SaaS service’s needs in the best manner.
I was planning to go to this meeting here in town about “Preparing for the post-IaaS phase of cloud adoption” and it brought home to me how backwards people are generally thinking about cloud. So now you get Ernest’s Cloud Rant of the Day.
What people are doing is moving in order of comfort, basically. “I’ll start with private cloud… Then maybe public IaaS… Eventually we’ll look at that other whizbang stuff.” But here’s what your decision path should be instead.
Cloud Procurement Flowchart
- Is it available as a SaaS solution? If so, use that. You shouldn’t need to host servers or write code for many of your needs – everything from email to ERP is commoditized nowadays.
- Can I do it in a public PaaS? Then use a public PaaS (Heroku/Beanstalk/Google App Engine/Azure), unless you have some real (not FUD) requirements to do it in house.
- Can I do it in a private PaaS? Then use Cloudfoundry or similar. Or do you really (for non-FUD reasons) need access to the hardware?
- Can I do it in public IaaS? Then use Amazon. Or do you really (for non-FUD reasons) need it “on premise” (probably not really on premise, but in some datacenter you’re leasing – which is different from being outsourced in the cloud why)?
- Can I do it in a private cloud? This is your final recourse before doing it the old fashioned way – unless you have extremely unique hardware requirements, you probably can. Also, you can do hybrid cloud – basically private cloud plus public cloud (IaaS only really). This gets you some of the IaaS benefits while complicating your architecture.
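The flowchart above boils down to "take the first tier that actually works." Here is a minimal sketch of that in Python. The tier names come from the list; the `choose_tier` function and the idea of passing in a set of feasible tiers are my own illustration, and deciding what counts as feasible (real requirements, not FUD) is still on you.

```python
from typing import Iterable

# Tiers in the order the flowchart says to consider them.
# The last entry is the fallback: doing it the old fashioned way.
TIERS = [
    "SaaS",
    "public PaaS",
    "private PaaS",
    "public IaaS",
    "private cloud",
    "old fashioned way",
]


def choose_tier(feasible: Iterable[str]) -> str:
    """Return the first tier, in flowchart order, that is feasible.

    `feasible` holds the tiers that meet your real (non-FUD) requirements.
    This helper is hypothetical; it only encodes the ordering of the list.
    """
    feasible_set = set(feasible)
    for tier in TIERS[:-1]:
        if tier in feasible_set:
            return tier
    # Nothing cloudy fits, so fall through to the final recourse.
    return TIERS[-1]


# Example: the workload could run on public IaaS or in a private cloud.
example_choice = choose_tier({"private cloud", "public IaaS"})
```

The point of walking the list in a fixed order is that comfort never enters into it: a private cloud only wins when everything above it has been ruled out for a real reason.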
What About The Cost?
This seems to be inverted from how people are inching into the cloud. But the lower on this list you are, the less additional value you are getting from the solution (assuming the same price point). You should instead be reluctantly dragged into these lower levels – which require more effort and often (though not always) more expense. “But what about the cost,” you say, “the cloud gets more expensive than me running a couple servers?”
You need to keep in mind the real costs of your infrastructure when you do this – I see a lot of people putting a lot of work into private cloud who really shouldn’t be. If you simply compare “buying servers” with “cost per month in Amazon” it can seem like you need to go hybrid after a couple hundred thousand dollars appear on your bill. But:
1. Make sure you are taking into account your fully loaded cost (includes data center, power, cooling, etc.) of all assets (servers, storage, network…) you are using to do this privately. Use the real numbers, not the “funny money” numbers – at a previous company we allocated network and other shared costs across the entire company, while “our IT budget” had to pay for servers – don’t be a goon, you should consider what it’s costing your entire company. Storage especially is way cheaper in the cloud versus enterprise SANs.
2. Make sure you are taking into account the cost of the manpower to run it. And that’s not just their salary (fully loaded with benefits/bonuses); it’s also the proportion of each layer of management going up that has to deal with their concerns (even if the director only has to spend 30% of his time messing with the data center team, and the VP 10%, and the CTO 5%, and the CEO 1% – that’s a lot of freaking money). Then there’s the opportunity cost of having people (smart technical people) doing your plumbing instead of doing things to forward your company. I would argue that instead of putting in the employee’s salary, you’d do better to put in your revenue per employee! Why? Because for that same money you could be having someone improving product, making sales, etc. and making you additional revenue. If all you look at is “cost reduction” you are probably divorced enough from the business goals of your organization that you are not making good decisions. This isn’t to say you don’t need any of that manpower, but ideally with more plumbing being outsourced you can turn their technical skills to something of more productive use.
3. Make sure you are taking into account the additional lag time and the cost of that time to market delay from DIYing. Some people couch this as just for purposes of innovation – “well, if you’re a small, quick moving, innovative firm or startup, then this velocity matters to you – if you’re a larger enterprise, with yearly budget cycles, not so much.” That’s not true. Assuming you are implementing all this stuff with some end goal in mind, you are burning value along with time the longer it takes you to deliver it – we like to call that cost of delay. Heck, just plain cost of money over that period is significant – I’ve seen companies go through quite a set of gyrations to be able to bill 30 days earlier to get that additional benefit; if you can deliver projects a month earlier from leveraging reusable work (which is all SaaS/PaaS/IaaS are) then you accelerate your cashflow. If you have to wait 12 months for the IT group to get a private cloud working, you are effectively losing the benefit of your deliverable * 12 months.
4. Account for complexity. The problem with “hybrid cloud,” in most implementations, is that it’s not seamless from on prem to public, and therefore your app architecture has to be doubly complicated. In a previous position where I ran a large SaaS service, we were spread across AWS (virtual everything) and Rackspace (vserver, F5 LBs, etc.) and it was a total nightmare – we were trying to migrate all the way out to the cloud just so we could delete half of the cruft in all our code that touched the infrastructure – complexity that caused production issues (frequently) and slowed our rate of delivering new functionality. The KISS principle is wrathful when ignored.
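To make the first three points concrete, here is a rough sketch of the arithmetic in Python. Every number in it is made up for illustration, and the `PrivateOption` class and its field names are hypothetical, not from any real costing tool. It also leaves out point 4 (complexity), which is hard to put a number on, and it treats the hosted option as a flat monthly bill, which is a simplification since you would still need some people to run that too.

```python
from dataclasses import dataclass


@dataclass
class PrivateOption:
    """Hypothetical cost model for running something in house."""

    hardware_monthly: float           # amortized servers, storage, network
    facilities_monthly: float         # data center space, power, cooling
    staff_fte: float                  # people doing the plumbing, incl. slices of management
    value_per_fte_monthly: float      # loaded salary, or revenue per employee
    delay_months: float               # extra months before delivery vs. the hosted option
    deliverable_value_monthly: float  # benefit per month once the thing is live

    def monthly_run_cost(self) -> float:
        # Points 1 and 2: fully loaded assets plus the people running them.
        return (
            self.hardware_monthly
            + self.facilities_monthly
            + self.staff_fte * self.value_per_fte_monthly
        )

    def cost_of_delay(self) -> float:
        # Point 3: benefit of the deliverable times the months you wait for it.
        return self.delay_months * self.deliverable_value_monthly

    def total_cost(self, horizon_months: int) -> float:
        return self.monthly_run_cost() * horizon_months + self.cost_of_delay()


# Illustrative numbers only.
private = PrivateOption(
    hardware_monthly=10_000,
    facilities_monthly=4_000,
    staff_fte=2.5,
    value_per_fte_monthly=20_000,
    delay_months=12,
    deliverable_value_monthly=50_000,
)
horizon = 36
public_bill_monthly = 30_000  # assumed hosted bill for the same workload

naive_private = private.hardware_monthly * horizon  # the "buying servers" view
real_private = private.total_cost(horizon)
public_total = public_bill_monthly * horizon

print(f"naive private: {naive_private:,.0f}")
print(f"real private:  {real_private:,.0f}")
print(f"public:        {public_total:,.0f}")
```

With these invented inputs the "buying servers" comparison makes private look cheaper than the hosted bill, and the fully loaded comparison flips that. Your numbers will differ; the point is only which terms belong in the sum.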
I’m not saying hybrid cloud, private cloud, etc. are never the answer – but I would say that on average they are usually not the right answer, and if you are using them as your default approach then it’s better than even money you’re being inefficient. Furthermore, using SaaS and PaaS requires less expertise than IaaS, which in turn requires less than private cloud – people justify “starting with private” because you are “leveraging skill sets” or whatever – and then 6 months later you have a whole team still trying to bake off OpenStack vs Eucalyptus when you could have had your app already running in a public PaaS. I’m not sure why I need to say out loud “delivering the most amount of value with the least amount of effort, time, and expenditure is good” – but apparently I do. Just because you *can* do something does not mean you *should* do it. You need to carefully shepherd your time to delivery and your costs, and not just let things float in a morass of IT because “these things take time…”