Tag Archives: SaaS

The Cloud Procurement Pecking Order

I was planning to go to this meeting here in town about “Preparing for the post-IaaS phase of cloud adoption” and it brought home to me how backwards many organizations are when they start thinking about cloud options. So now you get Ernest’s Cloud Procurement Pecking Order.

What many people are doing is moving in order of comfort, basically, as they start moving from old school on prem into the cloud.  “I’ll start with private cloud… Then maybe public IaaS… Eventually we’ll look at that other whizbang stuff.” But here’s what your decision path should be instead. It’s the logical extension of the basic buy vs build strategy decision you’re used to doing.

Cloud Procurement Flowchart

Look at the functionality you are trying to fulfill.  Now ask, in order:

  1. Is it available as a SaaS solution?  If so, use that. You shouldn’t need to host servers or write code for many of your needs – everything from email to ERP is commoditized nowadays. This is the modern equivalent of “buy, don’t build.” You don’t get 100% control over the functionality if you buy it, but unless the function is super core to your business you should simply get over that.
  2. [Optional] Does it fit the functional profile to do it serverless? Serverless is basically “second-gen PaaS with less fiddly IaaS in it,” so this would be your second step. Amazon has Lambda, and Azure and Google have shipped competitors already. Right this moment serverless tech is still pretty bleeding edge, so you’d be forgiven for skipping this step if you don’t have pretty high-caliber techies on staff.
  3. Can I do it in a public PaaS?  Then use a public PaaS (Heroku/Beanstalk/Google App Engine/Azure), unless you have some real (not FUD) requirements to do it in house.
  4. Can I do it in a private PaaS? Then use Cloud Foundry or similar. Or do you really (for non-FUD reasons) need access to the hardware?
  5. Can I do it in public IaaS?  Then use Amazon or Azure. Or do you really (for non-FUD reasons) need it “on premise” (probably not really on premise, but in some datacenter you’re leasing – which is different from being outsourced to the cloud how, exactly)?  Even hardcore rendering work is done in the cloud nowadays (you can get GPU-driven instances, SSDs, etc.).
  6. Can I do it in a private cloud? Then use VMware Cloud or OpenStack. This is your final recourse before doing it the old-fashioned way – unless you have extremely unique hardware requirements, you probably can. You can also do hybrid cloud – basically private cloud plus public cloud (IaaS only, really). This gets you some of the IaaS benefits while complicating your architecture.
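To make the decision path concrete, here it is as a toy Python function. The attribute names are invented for illustration only – this is the shape of the pecking order, not any real framework:

```python
from types import SimpleNamespace

def choose_cloud_option(need):
    """Walk the pecking order top-down; the first level that fits wins.
    Attribute names here are hypothetical, purely for illustration."""
    if need.available_as_saas:
        return "SaaS"                # the modern "buy, don't build"
    if need.fits_serverless:
        return "serverless"          # optional: skip without strong techies
    if not need.must_run_in_house:   # a real, non-FUD requirement only
        return "public PaaS"
    if not need.needs_hardware_access:
        return "private PaaS"
    if not need.must_be_on_premise:
        return "public IaaS"
    return "private cloud"           # final recourse before old-school

# e.g. commodity email: SaaS wins immediately
email = SimpleNamespace(available_as_saas=True, fits_serverless=False,
                        must_run_in_house=False, needs_hardware_access=False,
                        must_be_on_premise=False)
print(choose_cloud_option(email))  # SaaS
```

The point of expressing it this way: each step is a gate you should only pass through for a real, articulated reason – never by default.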

What About Compliance?

Very few compliance requirements exist that cannot be satisfied in the cloud.  There are large financials operating in the cloud, people with SOX and PCI and FISMA and NIST and ISO compliance needs… If your reason for running on prem is “but compliance” there’s a 90% chance you are just plain wrong, and coasting on decade-old received wisdom instead of being well informed about the modern state of cloud technology and security and compliance. I’ve personally helped pure-cloud solutions hit ISO and TUV and various other compliance goals.

What About The Cost?

This ordering seems to be inverted from how people are inching into the cloud. But the lower an option sits on this list, the less additional value you are getting from it (assuming the same price point). You should instead be reluctantly dragged down into the lower levels on this list – which require more effort and often (though not always) more expense. A higher-level option needs to be a lot more expensive before the additional complexity and lag of doing more of the work yourself is justified.

“But what about the cost,” you say, “the cloud gets more expensive than me running a couple servers?” It’s easy to be penny wise but pound foolish when making cloud cost decisions.

You need to keep in mind the real costs of your infrastructure when you do this – I see a lot of people putting a lot of work into private cloud that they really shouldn’t be. If you simply compare “buying servers” with “cost per month in Amazon,” a naive analysis can make it seem like you need to go hybrid/on-prem once a couple hundred thousand dollars appear on your bill. But:

1. Make sure you are taking into account the fully loaded cost (data center space, power, cooling, etc.) of all the assets (servers, storage, network…) you are using to do this privately. Use the real numbers, not the “funny money” numbers. At a previous company we allocated network and other shared costs across the entire company, while “our IT budget” had to pay for servers – so servers became the only number used in the comparison, since only our own department’s costs were being considered. Don’t be a goon (technical term for a local optimizer); you should consider what it’s costing your entire company. Storage especially is way cheaper in the cloud than on enterprise SANs.

2. Make sure you are taking into account the cost of the manpower to run it. That’s not just the techies’ salaries (fully loaded with benefits/bonuses); it’s also the proportion of each layer of management above them that has to deal with their concerns (even if the director only spends 30% of his time messing with the data center team, the VP 10%, the CTO 5%, and the CEO 1% – that’s a lot of freaking money you need to account for). And it’s the opportunity cost of having smart technical people doing your plumbing instead of doing things that forward your company. I would argue that instead of putting the employees’ salaries into this calculation, you’d do better to put in your revenue per employee! Why? Because for that same money you could have someone improving the product, making sales, etc. and generating additional revenue. If all you are looking at is “cost reduction,” you are probably divorced enough from the business goals of your organization that you are not making good decisions. This isn’t to say you don’t need any of that manpower, but ideally, with more plumbing outsourced, you can turn their technical skills to more productive use.

3. Make sure you are taking into account the additional lag time and the cost of that time to market delay from DIYing. Some people couch this as just for purposes of innovation – “well, if you’re a small, quick moving, innovative firm or startup, then this velocity matters to you – if you’re a larger enterprise, with yearly budget cycles, not so much.” That’s not true. Assuming you are implementing all this stuff with some end goal in mind, you are burning value along with time the longer it takes you to deliver it – we like to call that cost of delay. Heck, just plain cost of money over that period is significant – I’ve seen companies go through quite a set of gyrations to be able to bill 30 days earlier to get that additional benefit; if you can deliver projects a month earlier from leveraging reusable work (which is all that SaaS/PaaS/IaaS solutions are) then you accelerate your cashflow. If you have to wait 12 months for the IT group to get a private cloud working, you are effectively losing the benefit of your deliverable * 12 months. “We saved $10k/year on hosting costs!”  “Great, can we deliver our product that will make us $10k/month now, or do we get to continue to put ourselves out of business with cost cutting?”

4. Account for complexity.  The problem with “hybrid cloud,” in most implementations, is that it’s not seamless from on prem to public, and therefore your app architecture has to be doubly complicated.  In a previous position where I ran a large SaaS service, we were spread across AWS (virtual everything) and Rackspace (vserver, F5 LBs, etc.) and it was a total nightmare – we were trying to migrate all the way out to the cloud just so we could delete half of the cruft in all our code that touched the infrastructure – complexity that caused production issues (frequently) and slowed our rate of delivering new functionality. The KISS principle is wrathful when ignored.
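To pull points 1 through 3 together, here’s a back-of-the-envelope sketch in Python. Every parameter name and number below is made up for illustration – plug in your own:

```python
def true_private_cloud_cost(hardware, shared_infra, staff_count,
                            revenue_per_employee, delay_months,
                            monthly_deliverable_value):
    """All parameters are illustrative placeholders.
    hardware:     servers, storage, network (fully loaded)
    shared_infra: data center space, power, cooling - the 'funny money'
                  usually hidden in someone else's budget
    """
    # Point 2: use revenue per employee, not salary - that's the
    # opportunity cost of smart people doing plumbing
    people = staff_count * revenue_per_employee
    # Point 3: cost of delay while IT gets the private cloud working
    delay = delay_months * monthly_deliverable_value
    return hardware + shared_infra + people + delay

# Echoing the post's example: a deliverable worth $10k/month
# delayed 12 months swamps a $10k/year hosting "saving"
print(true_private_cloud_cost(hardware=200_000, shared_infra=50_000,
                              staff_count=2, revenue_per_employee=300_000,
                              delay_months=12,
                              monthly_deliverable_value=10_000))  # 970000
```

The naive comparison stops after the first parameter; the other four are where the real money hides.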

I’m not saying hybrid cloud, private cloud, etc. are never the answer – but I would say that on average they are usually not the right answer, and if you are using them as your default approach, it’s better than even money you’re being inefficient. Furthermore, using SaaS and PaaS requires less expertise (and thus money) than IaaS, which requires less than private cloud. People justify “starting with private” because you are “leveraging skill sets” or whatever – and then six months later you have a whole team still trying to bake off OpenStack vs. Eucalyptus when you could have had your app (you know, the thing you actually need to fulfill a business goal) already running in a public PaaS. I’m not sure why I need to say out loud “delivering the most value with the least amount of effort, time, and expenditure is good” – but apparently I do. Just because you *can* do something does not mean you *should* do it.  You need to carefully shepherd your time to delivery and your costs, and not just let things float in a morass of IT because “these things take time…”


Filed under Cloud

Velocity 2013 Day 3 Liveblog: Retooling Adobe: A DevOps Journey from Packaged Software to Service Provider

Retooling Adobe: A DevOps Journey from Packaged Software to Service Provider

Srinivas Peri, Adobe and Alex Honor, SimplifyOPS/DTO

Adobe needed to move from desktop, packaged software to a cloud services model and needed a DevOps transformation as well.

Srini’s CoreTech Tools/Infrastructure group tries to transform wasted time to value time (enabling tools).

So they started talking SaaS and Srini went around talking to them about tooling.

Dan Neff came to Adobe from Facebook as an operations guru.  He said “let’s stop talking about tools.” He showed Srini the “10+ Deploys a Day” Flickr preso. Time to go to Velocity!  And there he met Alex and Damon of DTO and learned about loosely coupled toolchains.

They generated CDOT, a service delivery platform. Some teams started using it, then they bought Typekit and Paul Hammond thought it was just lovely.

And now all Adobe software is coming through the cloud.  They are now the CoreTech Solution Engineering team – who makes enabling services.

Do something next week! And don’t reinvent the wheel.

How To Do It

First problem to solve. There are islands of tools – CM, package, build, orchestration, package repos, source repos. Different teams, different philosophies.

And actually, probably in each business unit, you have another instantiation of all of the above.

CDOT – their service delivery platform, the 30k foot view

Many different app architectures and many data center providers (cloud and trad). CDOT bridges the gap.

CDOT has a UI and API service atop an integration layer. It uses Jenkins, Rundeck, Chef, Zabbix, and Splunk under the covers.

On the code side – what is that? App code, app config, and verification code. But also operations code! It is part of YOUR product. It’s an input to CDOT.

So build (CI). Takes code from Perforce/GitHub into pk/Jenkins, then into moddav/Nexus; for cloud stuff, bake it into an AMI; promote packages to S3 and AMIs to an AMI repo.

For deploy (CD), Jenkins calls Rundeck and the Chef server. Rundeck instantiates the CloudFormation (or whatever) and does the high-level orchestration, the AMIs pull Chef recipes and packages from S3, and Chef does the local orchestration.  Is it pull or push?  Both/either. You can bake and you can fry.
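A rough sketch of that deploy flow as it was described – every function and release name below is an invented placeholder showing the shape of the handoffs, not the real CDOT API:

```python
# Invented placeholder names sketching the described CI/CD handoffs.
steps = []

def jenkins_build(release):
    """CI: build, bake the AMI, promote artifacts."""
    steps.append(f"jenkins: build and promote {release}")
    return {"ami": f"ami-{release}", "nodes": ["web1", "web2"]}

def rundeck_deploy(stack, release):
    """CD: Rundeck does high-level orchestration (e.g. CloudFormation),
    then each node pulls recipes/packages and Chef converges locally."""
    steps.append(f"rundeck: instantiate stack for {release}")
    for node in stack["nodes"]:
        # 'bake' what's already in the AMI, 'fry' the rest at boot
        steps.append(f"{node}: pull chef recipes + packages from S3, converge")

stack = jenkins_build("1.4.2")
rundeck_deploy(stack, "1.4.2")
print(len(steps))  # 4
```

The key design point is the loose coupling: each tool owns one layer of orchestration and hands off through artifacts (packages in S3, AMIs in a repo), not through shared state.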

So feature branches – some people don’t need to CD to prod, but they sure do to somewhere.  So devs can mess with feature branches on dev boxes, but then all master checkins CD to a CD environment.  You can choose how often to go to prod.

Have a cool “devops workbench” UI with the deployment pipeline and state. So everyone has one-click self service deployment with no manual steps, with high confidence.

Now, CDOT video! It’s not really for us, it’s their internal marketing video to get teams to uptake CDOT.  Getting people on board is most of the effort!

What’s the value prop?

  • Save people time
  • Alleviate their headaches
  • Understand their motivations (for when they play politics)
  • Listen to and address their fears

Bring testimonials, data, presentations, do events, videos!  Sell it!

“Get out of your cube and go talk to people”

Think like a salesperson. Get users (devs/PMs) on board, then the buyers (managers/budget folks), partners and suppliers (other ops guys).


Filed under Conferences, DevOps

Speeding Up Releases

Hi all!  My new job’s been affording me few opportunities for blogging, but I’m getting into the groove, so you should see more of me now.

Releasing All The Time!

Continuous integration is the bomb.  We can all generally agree on that.  But my life has become one of halfway steps that I think will be familiar to many of you, and I don’t believe in hiding the real world that’s not all case study perfect out there.  So rather than give you the standard theory-list of “what you should do for nice futuristic DevOps releases,” let me tell you of our march from a 10 week to 2 week to 1 week release tempo at Bazaarvoice.

Biweekly Releases!

I started with BV at the start of February of this year. They said, “Our new release manager!  We’ve been waiting for you!  We adopted agile and then tried to move from our big-bang 10 week release cycle to 2 weeks and it blew up like you wouldn’t believe.  Get us to two week releases.  You’ve got a month. Go!”  The product management team really needed us to be able to roll out features more quickly, do piloting and A/B testing, and generally be way more agile in delivery to the customer and not just in dev-land.

Background – our primary application is for the collection and display of user-generated content – for example, ratings and reviews – and a lot of the biggest Internet retailers use our solution for that purpose. The codebase started seven years ago and grew monolithically for much of that time. (“The monolith” was the semi-affectionate code name for the stack when I started, as in “is your app’s code on the monolith?”) The app runs across multiple physical and cloud-based datacenters and pushes out billions of hits a day, so there’s a low tolerance window for errors – our end-user-facing display apps have to have zero downtime for releases, though we can take up to two hours of downtime in a 3-5 AM window for customer administrative systems. The stack is Java, Linux, MySQL, Solr, et al. Extremely complex, just like any app added onto for years.

There had been a SWAT team formed after the semi-disastrous 2 week release that identified the main problems.  Just like everywhere else, the main impediments were:

  • Lack of automation in testing
  • Poor SCM code discipline

Our CTO was very invested in solving the problem, so he supported the solution to #1 – the QA team hired up and got some automation folks in, and the product teams were told they had to stop feature development and do several sprints of writing and automating tests until they could sustain the biweekly cadence.

The solution to #2 had two parts.  One was a feature flagging system so we could launch code “dark.” We had a crack team of devs crank this one out. I won’t belabor it because Facebook etc. love to do DevOps presentations on the approach and its benefits, but it’s true.  Now we release all code dark first, and can enable it for certain clients or other segments.
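A minimal sketch of the feature-flag idea – the flag and client names below are hypothetical, and real flagging systems (including ours) carry much more machinery, but the core is just this:

```python
# Minimal feature-flag sketch; flag and client names are invented.
FLAGS = {
    # code ships "dark": present in production but off by default,
    # then enabled per client or segment
    "new_review_widget": {"default": False,
                          "enabled_clients": {"acme", "globex"}},
}

def is_enabled(flag, client_id):
    cfg = FLAGS.get(flag)
    if cfg is None:
        return False  # unknown flags stay dark
    return cfg["default"] or client_id in cfg["enabled_clients"]

print(is_enabled("new_review_widget", "acme"))     # True
print(is_enabled("new_review_widget", "initech"))  # False
```

Because the check happens at request time, turning a feature on (or back off, when it misbehaves) is a config change, not a deployment.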

Two was process – a new branching process where a single release branch comes off trunk every two weeks several days before release, and changes aren’t allowed to it except to fix issues found in QA, and those are approved and labeled into discrete release candidates. The dev environment gets trunk twice a day, the QA environment gets branch every time a new release candidate is labeled. Full product CIT must be passing to get a release candidate. As always, process steps like this sound like common sense but when you need 100 developers in 10 teams to uptake them immediately, the little issues come out and play.

There were a couple issues we couldn’t fix in the time allotted.  One was that our Solr indexes are Godawful huge.  Like 20 GB huge.  JVM GC tuning is a popular hobby with us. To make changes, reindex, and distribute the indexes in time to perform a zero-downtime deployment, with replication lag nipping at our heels, was a bigger deal.  The other was that our build and deploy pipeline was pretty bad.  All the keywords you want to hear are there – Puppet, TeamCity, Rundeck, svn, noah, maven/Nexus, yum…  But they are inconsistently implemented, embedded in a huge crufty bash script framework and parts have gone largely untended.

The timeframe was extremely aggressive.  I project managed the hell out of it and all the teams were very on board and helpful, and management was very supportive.  I actually got a slight delay, and was grateful for it, because our IPO date came up on the same date when we were supposed to start biweekly releases, and even the extremely ambitious were taken aback by the risk of cocking up the service on that day. We did our first biweekly release on March 6th and then every two weeks thereafter.  We had a couple rough patches, but they were good learning experiences.

For example, as our first biweekly release day approached, tests just weren’t passing. I brought all the dev managers to the go/no-go meeting (another new institution) and asked them, “are we go?” (The release manager role had been set up by upper management as more prescriptive, with the thought I’d be sitting there yelling at them “It’s no-go,” but that’s really not an effective long term strategy).  They all kinda shuffled, and hemmed, and hawed (a lot of pressure from internal stakeholders wanted this release to go out NOW), but then one manager said “No, we’re no go.  It’s just not safe.” Once she said that everyone else got over that initial taboo of saying “no go” and concurred that some of their areas were no go.  The release went out 5 calendar days late but a lot more smoothly than the last release did (44 major issues then, 5 this time).

The next release, though, was the real make-or-break.  On the one hand everyone had a first real pass through the process and so some of the “but I didn’t know I needed to have testing signoff by that day and time” breaking-in static was gone, but on the other hand they’d had 2 months between the previous two releases to test and plan, and this one allowed only two weeks.  It went off with no delay and only 1 issue.

Of course, we had deliberately sandbagged that a little because it coincided with a “test development only” sprint.  But anyone who thinks a complex release in a large-scale environment will go smoothly just because you’re deploying code with no functional changes has clearly never been closer than a 10-foot pole to real-world Web operations. As we ramped back up on feature development, the process was also becoming more ingrained and testing better, so it went well.

We had one release go bad in May, and when we looked at it we realized a lot of changes weren’t being sufficiently QA’ed.  So we simply added a set of fields to all JIRA tickets for the team to specify who tested each change, and we wrote a script to parse our Subversion commit comments and label JIRA tickets with the appropriate release (trying to get people to actually fill out tickets correctly is painful and usually doomed to failure, so we made an end run with automation).  Then, as a release came up, a wiki page listed all the tickets in the release, along with who tested them and how (automated, manual, did not test). We actually ran this for two releases with paper printouts and physical signoffs to develop the process before we automated it.  This corrected the issue, and we ran from then on with very low problem rates. As advertised, releasing fewer changes more frequently gets us both a higher throughput of changes and, paradoxically, higher quality with them.
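That commit-parsing end run can be sketched in a few lines of Python – the ticket-key pattern and sample comments below are assumptions for illustration, not our actual format:

```python
import re

# Assumes JIRA keys like PROJ-1234 appear in Subversion commit comments.
TICKET_RE = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")

def tickets_in_release(commit_comments):
    """Collect every JIRA key mentioned in a release branch's commit
    comments, so each ticket can be labeled with the release automatically."""
    found = set()
    for comment in commit_comments:
        found.update(TICKET_RE.findall(comment))
    return sorted(found)

comments = ["PRR-101: fix review pagination",
            "merge PRR-101 and PRR-205 to release branch",
            "bump build number"]
print(tickets_in_release(comments))  # ['PRR-101', 'PRR-205']
```

From that list, generating the wiki signoff table is just a join against the ticket fields – automation where compliance would otherwise depend on memory.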

Weekly Releases!

The process worked great through the summer. In the biweekly release communications and presentations, I had explained we’d be moving to weekly and then to continuous deployment as soon as we could make it happen. Well, the Solr index distribution problem took a while – two reorgs kicked it around, and it was an ambitious “use BitTorrent to distribute the index to all the servers in our various DCs” pretty propellerhead kind of thing that had to happen. It took the summer to get that squared away. In the meantime I also ran an internal project called “Neverland” to fix some of the most egregious technical debt in our TeamCity and Nexus setup and deployment scripts.

The real testament to the culture change that happened as part of the biweekly release project is that while that project was a “big deal” – I had stakeholders from all over the business, big all hands presentations, project plans out the yin-yang, the entire technical leadership team sweating the details – moving from biweekly to weekly releases was largely a non-event.

The QA team worked in the background leading up to it to push test automation levels up higher. Then we basically just said “Hey, you guys want to release faster don’t you?” “Well sure!” “OK. we’re going weekly in two weeks. Check out the updated process docs.” “All right.” And we did, starting the first release in September.  The Solr index got reindexed and redistributed (and man, it had been a while – it compacted down nicely) and deployment ran great. No change in error rate at all. We’ve been weekly since then, the only change  is when we don’t release during critical change freeze windows around Black Friday/Cyber Monday and other holiday prime times. We think our setup is robust enough that it’s safe to release even then, but, heck, no one’s perfect so it’s probably prudent to pause, and many of our clients are really adamant about holiday change freezes to us and to all their suppliers.

The one concern voiced by engineers about the overhead of the release process was addressed by automating it more and by educating.  For example, the go/no-go meeting was, at times, a little messy. Some of the other teams (especially ones not located in Austin) wouldn’t show up, or test signoffs wouldn’t be ready, and it would turn into delays and running around. The opportunity to do it more quickly actually helped a lot! Whereas the meeting had been 30 minutes if we were lucky when we started, now the meeting is taking 5 minutes, and only longer when someone screws around and doesn’t dial into the Webex on time.

“If it’s painful, do it more often” is a message some folks still balk at when confronted with it, but it is absolutely true.

Now, the path wasn’t easy and I was blessed with a very high caliber of people at Bazaarvoice – Dev, Ops, and QA. Everyone was always very focused on “how do we make this work” or “how do we improve this” with very little of the turf warring, blocking, and politics that I sadly have come to expect in a corporate environment. The mindset is very much “if we come up with a new way that’s better and we all agree on that, we will change to do that thing TOMORROW and not spend months dithering about it,” which is awesome and helped drive these changes through much faster than, honestly, I initially estimated it would take.

Releasing All The Time!

Continuous integration on “the monolith” was a distant myth initially, but now we’re seeing how we can get there and the benefits we’ll reap from doing so. Our main impediments remaining are:

1. CIT not passing.  We don’t have a rule where if CIT is failing checkins are blocked, mainly because there’s a bunch of old legacy tests that are flaky. This often results in release milestones being delayed because CIT isn’t passing and there’s 6 devs’ checkins in the last failing build. Step 1 is fix the flaky tests and step 2 is declare work stoppage when CIT is failing. The senior developers see the wisdom in this so I expect it to go down without much friction. Again, the culture is very much about ruthlessly adopting an innovation if the key players agree it will be beneficial.

2. Builds, CIT, and deployment are slow as molasses in January. Build 1 hour, CIT 40 minutes, deploy 3 hours. Why? Various legacy reasons that give me a headache when I have to listen to them. Basically “that’s how it is now, and complete rewrite is potentially beyond any one person’s ability and definitely would take multiple man-months.” We’re analyzing what to do here. We also have a “staging” environment customers use for integration, and so currently we have to deploy to dev, test, deploy to QA, test, deploy to staging (hitting the downtime window), test, deploy to production (hitting the downtime window), test. So basically 2 days minimum. However, staging is really production and step one is release them at the same time.  There’s a couple “but I can only test this kind of change in staging” items left that basically just require telling someone “Figure out how to test it in QA now.” Going to “always release trunk” will remove the whole branch deployment and separate dev and QA environments. So that’s 2 of 4 deployments removed, but then it’s a matter of figuring out cost vs benefit of smashing down parts of that 4:40. I have one proposal in front of me for chucking all the current deploy infrastructure for a Jenkins-driven one, I need to figure out if it is complete enough…

Am I Doing It Wrong?

Chime in in the comments below with questions, or if there’s some way I could have cut the Gordian knot better.  I think we’ve moved about as fast as you can given a lot of legacy code and technical debt (and having a lot of other stuff people need to be working on to keep a service up and get out new functionality).   The three-step process I used, which works here as it so often does, was:

  1. Communicate a clear vision
  2. Drive execution relentlessly
  3. Keep metrics and continually improve

Thanks for reading, and happy releasing!


Filed under DevOps

What Is Cloud Computing?

My recent post on how sick I am of people being confused by the basic concept of cloud computing quickly brought out the comments on “what cloud is” and “what cloud is not.” And the truth is, it’s a little messy; there’s not a clear definition, especially across “the three aaSes.” So now let’s have a post for the advanced students. Chip in with your thoughts!

Here’s my Grand Unified Theory of Cloud Computing. Rather than being a legalistic definition that will always be wrong for some instances of cloud, it attempts to convey the history and related concepts that inform the cloud.

The Grand Unified Theory of Cloud Computing

( ISP -> colo -> MSP ) + virtualization + HPC + (AJAX + SOAP -> REST APIs) = IaaS
( web site -> web app -> ASP ) + virtualization + fast ubiquitous Internet + [ RIA browsers & mobile ] = SaaS
( IDEs & 4GLs ) + ( EAI -> SOA ) +  SaaS + IaaS = PaaS
[ IaaS | PaaS | SaaS ] + [ devops | open source | noSQL ] = cloud

* Note, I don’t agree with all those Wikipedia definitions, they are only linked to clue in people unsure about a given term

Sure, that’s where the cloud comes from, but “what is the cloud?” Well, here are my thoughts, the Seven Pillars of Cloud Computing.  Having more of these makes something “more cloudy,” and having fewer makes something “less cloudy.” Arguing over whether some specific offering “is cloud” or not, however, is for people without sufficiently challenging jobs.

The Seven Pillars of Cloud Computing

“The Cloud” may be characterized as:

  • An outsourced managed service
  • providing hosted computing or functionality
  • delivered over the Internet
  • offering extreme scalability
  • by using dynamically provisioned, multitenant, virtualized systems, storage, and applications
  • controlled via REST APIs
  • and billed in a utility manner.

You can remove one or more of these pillars to form most of the things people sell you as “private cloud,” for example, losing specific cloud benefits in exchange for other concerns.

Now there’s also the new vs old argument. There’s the technohipsters that say “Cloud is nothing new, I was doing that back in the ’90’s.” And some of that is true, but only in the most uninteresting way. The old and the new have, via alchemy, begun to help users realize benefits beyond what they did before.

Benefits of Cloud – What and How

Not New:

  • Virtualization
  • Outsourcing
  • Integration
  • Intertubes

Pretty New:

  • Multitenancy
  • Massively scalable
  • Elastic self provisioning
  • Pay as you go

Resulting Benefits:

  • agility
  • economy of scale
  • low initial investment
  • scalable cost
  • resilience
  • improved service delivery
  • universal access

Okay, Clouderati – what do you think?


Filed under Cloud

Report from NIWeek

Hey all, sorry it’s been quiet around here – Peco and I took our families on vacation to Bulgaria!  Plus, we’ve been busy in the run-up to our company convention, NIWeek. I imagine most of the Web type folks out there don’t know about NIWeek, but it’s where scientists and engineers who use our products come to learn. It’s always awesome to see the technology innovation going on out there, from the Stormchasers getting data on tornadoes and lightning that no one ever has before, to high school kids solving real problems.

There were a couple of things that are really worth checking out.  The first is the demo David Fuller did of NI’s system designer prototype (you can skip ahead to 5:00 in if you want to). Though the examples he uses are of engineering-type systems, you can easily imagine using that same interface for designing Web systems – no “separate Visio diagram” BS any more. Imagine every level from architectural diagram to physical system representation to the real running code all being part of one integrated drill-down. It looks SUPER SWEET. It seems like science fiction to those of us IT types.

A quick guide to the demo – so first a Xilinx guy talks about their new ARM-based chip, and then David shows drill-up and down to the real hardware parts of a system.  NI now has the “traditional systems” problem in that people buy hardware, buy software, and are turning it into large distributed scalable architectures.  Not being hobbled by preconceptions of how that should be done, our system diagram team has come up with a sweet visualization where you can swap between architecture view (8:30 in), actual pictures and specs of hardware, then down (10:40 in) into the “implementation” box-and-line system and network diagram, and then down into the code (12:00 in for VHDL and 13:20 in for LabVIEW). LabVIEW code is natively graphical, so in the final drilldown he also shows programming using drawing/gestures.

Why have twenty years of “systems management” and design tools from IBM/HP/etc. not given us anything near this awesome for other systems?  I don’t know, but it’s high time. We led a session at DevOpsDays about diagramming systems, and “I make a Visio on the side” is the state of the art.  There was one guy who took the time to make awesome UML models, but real integration of design/diagram with the real system doesn’t exist. And it needs to. And not in some labor-intensive “How about UML, oh Lord, I pooped myself” kind of way, but as an easy and integral part of building the system.

I am really enjoying working in the joint engineering/IT world.  There are some things IT technology has figured out that engineering technology is just starting to bumble into (security, for example, and Web services). But there are a lot of things engineering does next to which IT efforts look like the work of a bumbling child. Take instrumentation and monitoring: the IT state of the art is vomitous when placed next to real engineering data-gathering (and analysis, and visualization) techniques.  Will system design also be revolutionized from that quarter?

The other cool takeaway was how cloud is gaining a foothold in the engineering space.  I was impressed as hell with Maintainable Software, the only proper Web 3.0 company in attendance. Awesome SaaS product, and I talked with the guys for a long time – they are doing all the cool DevOps stuff: automated provisioning, continuous deployment, feature knobs, all that Etsy/Facebook kind of whizbang shit. They’re what I want our team here to become, and it was great meeting someone in our space who is doing all that – I love goofy social media apps or whatever, but it can sometimes be hard to convey the appropriateness of some of those practices to our sector. “If it’s good enough to sell hand knitted tea cozies or try to hook up with old high school sweethearts, then certainly it’s good enough to use in your attempt to cure cancer!” Anyway, Mike and Derek were great guys, and it was nice to see that new kind of thinking making inroads into our sometimes too-traditional space.

Leave a comment

Filed under Conferences, General

But, You See, Your Other Customers Are Dumber Than We Are

Sorry it’s been so long between blog posts.  We’ve been caught in a month of Release Hell, where we haven’t been able to release because of supplier performance/availability problems. And I wanted to take a minute to vent about something I have heard way too much from suppliers over the last nine years working at NI, which is that “none of our other customers are having that problem.”

Because, you see, the vast majority of the time that ends up not being the case. And that’s a generous estimate.  It’s an understandable mistake on the part of the person we are talking to – we say “Hey, there’s a big underlying issue with your service or software.” The person we’re talking to hasn’t heard that before, so naturally assumes it’s “just us,” something specific to our implementation. Because they’re big, right, and have lots of customers, and surely if there was an underlying problem they’d already know from other people, right? And we move on, wasting lots of time digging into our use case rather than the supplier fixing their stuff.

But this reflects a fundamental misunderstanding of how problems get identified.  What has to happen to get an issue report to a state where the guy we’re talking to would have heard about it? If it’s a blatant problem – the service is down – then other people report it.  But what if it’s subtle?  What if it’s that 10% of requests fail? (That seems to happen a lot.)

Steps Required For You To Know About This Problem

  1. First, a customer has to detect the problem in the first place.  And most don’t have sufficient instrumentation to detect a sudden performance degradation, or an availability hit short of 100%. Usually they have synthetic monitoring at most, and synthetic monitors are usually poor substitutes for real traffic – plus, usually people don’t have them alert except on multiple failures, so any “33% of the time” problem will likely slip through.  No, most other customers do what the supplier is doing – relying on their customers in turn to report the problem.
  2. Second, let’s say a customer does think they see a problem.  Well, most of the time they think, “I’m sure that’s just one of those transient issues.” If it’s a SaaS service, it must be our network, or the Internet, or whatever.  If it’s software, it must be our network, or my PC, or cosmic rays. 50% of the time the person who detects the problem wanders off and probably never comes back to look for it again – if the way they detected it isn’t an obvious part of their use of the system, it’ll fly under the radar. Or they assume the supplier knows and is surely working on it.
  3. Third, the customer has to care enough to report it. Here’s a little thing you can try.  Count your users.  You have 100k users on your site? 20k visitors a day?  OK, take your site down. Or break something obvious.  Hell, break your home page lead graphic.  Now wait and see how many problem reports you get. If you’re lucky, a dozen. Out of THOUSANDS of users.  Do the math – what percentage is that?  Less than 1%.  OK, so even a brutally obvious problem gets reports from less than 1% of users. Now apply that percentage to the already-diminished percentages from the previous two steps. Many users, unless using your product is a huge part of their daily work, are going to go use something else, figuring someone else will take care of it eventually.
  4. Now, the user has to report the problem via a channel you are watching.  Many customers, and this happens to us too, have developed through long experience an aversion to your official support channel. Maybe it costs money and they aren’t paying you. Maybe they have gotten the “reboot your computer dance” from someone over VoIP in India too many times to bother with it.  Instead they report it on your forums, which you may have someone monitoring and responding to, but how many of those problems then become tickets or make their way to product development or someone important enough to hear/care?  Or they report it on Server Fault, or any number of places they’ve learned they get better support than your support channel. Or they email their sales rep or a technical guy they know at your company or accidentally go to your customer service email – all channels which may eventually lead to your support channel, but every one of those paths is lossy.
  5. So let’s say they do report it to your support channel.  Look, your front line support sucks. Yes, yours. Pretty much anyone’s.  Those same layers that are designed to appropriately escalate issues each have friction associated with them. As a customer you have to do a lot of work to get through level 1 to level 2 to level 3.  If the problem isn’t critical to you, often you’re just going to give up.  How many of your tickets are just abandoned by a customer without any meaningful explanation or fix provided by you?  Some percentage of those are people who have realized they were doing something wrong, but many just don’t have the time to mess with your support org any more – you’re probably one of 20 products they use/support. They usually have to invest days of dedicated effort to prove to each level of support (starting over each time, as no one seems to read tickets) that they’re not stupid and know what they’re talking about.
  6. Let’s say that the unlikely event has happened – a customer detects the problem, cares about it, comes to your support org about it, and actually makes it through front line support.  Great.  But have YOU heard about it, Supplier Guy Talking To Me?  Maybe you’re a sales guy or SE, in which case you have a 1% chance of having heard about it.  Maybe you’re an ops lead or escalation guy, in which case there’s a slight chance you may have heard of other tickets on this problem.  Your support system probably sucks, and searching it for incidents with the same symptoms is unlikely to work right. It’s happened to me many times that on the sixth conference call with the same vendor about an issue, some new guy is on the phone this time and says “Oh, yes, that’s right.  That’s the way it works, it won’t do that.” All the other “architects” and everyone look at each other in surprise.  But not me, I got over being surprised many iterations of this ago.

So let’s look at the Percentage Chance A Big Problem With Your Product Has Been Brought To Your Attention.

%CABPWYPHBBTYA =  % of users that detect the problem * % of users that don’t disregard it * % of users that care enough to report it * % of users that report it in the right channel * % of users that don’t get derailed by your support org * % of legit problems in support tickets about this same problem that you personally have seen or heard about.

This lets you calculate the Number of Customers Reporting A Real Problem With Your Product.



Let’s run those numbers for a sample case with reasonable guesses at real world percentages, starting with a nice big 100,000 customer user base (customers using the exact same product or service on the same platform, that is, not all your customers of everything).

NOCRARPWYP =100,000 * (5% of users that detect it * 20% of users don’t wander off * 20% care enough to report it to you * 20% bring it to your support channel you think is “right” * 20% brought to clear diagnosis by your support * 5% of tickets your support org lodges that you ever hear about) = 0.4 customers.  In other words, just us, and you should consider yourselves lucky to be getting it from us at that.

And frankly these percentage estimates are high.  Fill in percentages from your real operations. Tweak them if you don’t believe me. But the message here is that even an endemic issue with your product, even if it’s only mildly subtle,  is not going to come to your attention the majority of the time, and if it does it’ll only be from a couple people.
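
If you want to plug in your own numbers, the funnel is trivial to compute – here’s a quick sketch (the percentages are this post’s illustrative guesses, not measured values):

```python
# Estimate how many customers will actually report a subtle, endemic problem.
# The percentages are the post's illustrative guesses, not measured values.

def customers_reporting(user_base, funnel):
    """Multiply the user base through each leaky stage of the reporting funnel."""
    n = float(user_base)
    for _stage, pct in funnel:
        n *= pct
    return n

funnel = [
    ("detect the problem",                          0.05),
    ("don't dismiss it as transient",               0.20),
    ("care enough to report it",                    0.20),
    ("use the 'right' support channel",             0.20),
    ("survive front-line support",                  0.20),
    ("ticket reaches the person you're talking to", 0.05),
]

print(round(customers_reporting(100_000, funnel), 2))  # ~0.4 customers
```

Swap in percentages from your own operations; the point survives almost any plausible inputs.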

So listen to us for God’s sake!

We’ve had this happen to us time and time again – we’re pretty detail oriented here, and believe in instrumentation. Our cloud server misbehaving?  Well, we bring up another in the same data center, then copies in other data centers, then put monitoring against it from various international locations, and pore over the logs, and run Wireshark traces.  We rule out the variables and then spend a week trying to explain it to your support techs. The sales team and previous support guys we’ve worked with know that we know what we’re talking about – give us some cred based on that.

In the end, I’ve stopped being surprised when we’re the ones to detect a pretty fundamental issue in someone’s software or service, even when we’re not a lead user, even when it’s a big supplier. And to a supplier’s incredulous statement that “But none of our other customers [that I have heard of] are having this problem,” I can only reply:

But, you see, your other customers are dumber than we are.


Filed under DevOps

Our Cloud Products And How We Did It

Hey, I’m not a sales guy, and none of us spend a lot of time on this blog pimping our company’s products, but we’re pretty proud of our work on them and I figured I’d toss them out there as use cases of what an enterprise can do in terms of cloud products if they get their act together!

Some background.  Currently all the agile admins (myself, Peco, and James) work together in R&D at National Instruments.  It’s funny, we used to work together on the Web Systems team that ran the ni.com Web site, but then people went their own ways to different teams or even different companies. Then we decided to put the dream team back together to run our new SaaS products.

About NI

National Instruments (hereafter, NI) is a 5000+ person global company that makes hardware and software for test & measurement, industrial control, and graphical system design. Real Poindextery engineering stuff. Wireless sensors and data acquisition, embedded and real-time, simulation and modeling. Our stuff is used to program the Lego Mindstorms NXT robots as well as control CERN’s Large Hadron Collider. When a crazed highlander whacks a test dummy on Deadliest Warrior and Max the techie looks at readouts of the forces generated, we are there.

About LabVIEW

Our main software product is LabVIEW.  Despite being an electrical engineer by degree, I never used LabVIEW in school (this was a very long time ago, I’ll note; most programs use it nowadays), so it wasn’t till I joined NI that I saw it in action. It’s a graphical dataflow programming language. I assumed that was BS when I heard it. I had so many companies try to sell me “graphical” programming over the years, like all those crappy 4GLs back in the ’90s, that I figured it was just an unachieved myth. But no, it’s a real visual programming language that’s worked like a champ for more than 20 years. In certain ways it’s very bad ass – it does parallelism for you and can be compiled and dropped onto an FPGA. It’s remained niche-ey and hasn’t been widely adopted outside the engineering world, however, due to company focus more than anything else.

Anyway, we decided it was high time we started leveraging cloud technologies in our products, so we created a DevOps team here in NI’s LabVIEW R&D department with a bunch of people that know what they’re doing, and started cranking on some SaaS products for our customers! We’ve delivered two and have announced a third that’s in progress.

Cloud Product #1: LabVIEW Web UI Builder

First out of the gate – LabVIEW Web UI Builder. It went 1.0 late last year. Go try it for free! It’s a Silverlight-based RIA “light” version of LabVIEW – you can visually program, interface with hardware and/or Web services. As internal demos we even had people write things like “Duck Hunt” and “Frogger” in it – it’s like Flash programming but way less of a pain in the ass. You can run in browser or out of browser and save your apps to the cloud or to your local box. It’s a “freemium” model – totally free to code and run your apps, but you have to pay for a license to compile your apps for deployment somewhere else – and that somewhere else can be a Web server like Apache or IIS, or it can be an embedded hardware target like a sensor node. The RIA approach means the UI can be placed on a very low footprint target because it runs in the browser, it just has to get data/interface with the control API of whatever it’s on.

It’s pretty snazzy. If you are curious about “graphical programming” and think it is probably BS, give it a spin for a couple minutes and see what you can do without all that “typing.”

A different R&D team wrote the Silverlight code, we wrote the back end Web services, did the cloud infrastructure, ops support structure, authentication, security, etc. It runs on Amazon Web Services.

Cloud Product #2: LabVIEW FPGA Compile Cloud

This one’s still in beta, but it’s basically ready to roll. For non-engineers, a FPGA (field programmable gate array) is essentially a rewritable chip. You get the speed benefits of being on hardware – not as fast as an ASIC but way faster than running code on a general purpose computer – as well as being able to change the software later.

We have a version of LabVIEW, LabVIEW FPGA, used to target LabVIEW programs to an FPGA chip. Compilation of these programs can take a long time, usually a number of hours for complex designs. Furthermore the software required for the compilation is large and getting more diverse as there’s more and more chips out there (each pretty much has its own dedicated compiler).

So, cloud to the rescue. The FPGA Compile Cloud is a simple concept – when you hit ‘compile’ it just outsources the compile to a bunch of servers in the cloud instead of locking up your workstation for hours (assuming you’ve bought a subscription).  FPGA compilations carry everything they need with them – there are no unique compile environments to set up or anything – so the work is very commoditizable.

The back end for this isn’t as simple as the one for UI Builder, which is just cloud storage and load balanced compile servers – we had to implement custom scaling for the large and expensive compile workers, and it required more extensive monitoring, performance, and security work. It’s running on Amazon too. We got to reuse a large amount of the infrastructure we put in place for systems management and authentication for UI Builder.
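
The interesting part of that custom scaling is deciding how many of those expensive workers to keep running. Here’s a hypothetical sketch of a queue-depth-based scaling decision – the names and thresholds are invented for illustration, not our actual implementation:

```python
# Hypothetical queue-depth-based scaling decision for costly compile workers.
# Names and thresholds are invented; a real system would also consider
# billing-hour boundaries, instance spin-up time, and jobs already in flight.
import math

def desired_workers(queued_jobs, jobs_per_worker=2, min_workers=1, max_workers=20):
    """How many compile workers we want, given current queue depth."""
    needed = math.ceil(queued_jobs / jobs_per_worker)
    return max(min_workers, min(max_workers, needed))

print(desired_workers(30))    # 30 queued compiles -> scale to 15 workers
print(desired_workers(0))     # idle -> keep a warm minimum of 1
print(desired_workers(1000))  # spike -> capped at 20 for cost control
```

The min/max clamp is the important design choice: the floor keeps latency down for the first job in, and the ceiling keeps a runaway queue from running away with your Amazon bill.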

Cloud Product #3: Technical Data Cloud

It’s still in development, but we’ve announced it so I get to talk about it! The idea behind the Technical Data Cloud is that more and more people need to collect sensor data, but they don’t want to fool with the management of it. They want to plop some sensors down and have the acquired data “go to the cloud!” for storage, visualization, and later analysis. There are other folks doing this already, like the very cool Pachube (pronounced “patch-bay”, there’s a LabVIEW library for talking to it), and it seems everyone wants to take their sensors to the cloud, so we’re looking at making one that’s industrial strength.

For this one we are pulling out our big guns, our data specialist team in Aachen, Germany. We are also being careful to develop it in an open way – the primary interface will be RESTful HTTP Web services, though LabVIEW APIs and hardware links will of course be a priority.
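
To give a flavor of what a RESTful sensor-data interface typically looks like, here’s a hypothetical request – the endpoint and payload shape are invented (the real API hasn’t shipped), but the anatomy is the standard pattern:

```python
# Hypothetical example of pushing one sensor reading to a REST data service.
# The URL and JSON shape are invented for illustration only.
import json
import urllib.request

reading = {
    "sensor_id": "temp-01",
    "timestamp": "2011-06-01T12:00:00Z",
    "value": 23.7,
    "units": "degC",
}

req = urllib.request.Request(
    "https://data.example.com/api/streams/temp-01/values",  # invented endpoint
    data=json.dumps(reading).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req)  # uncomment to actually send it
print(req.get_method(), req.full_url)
```

Because it’s plain HTTP plus JSON, the same call works from LabVIEW, a microcontroller’s HTTP stack, or a shell script – which is the whole point of going RESTful.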

This one had a big technical twist for us – we’re implementing it on Microsoft Windows Azure, the MS guys’ cloud offering. Our org is doing a lot of .NET development and finding a lot of strategic alignment with Microsoft, so we thought we’d kick the tires on their cloud. I’m an old Linux/open source bigot and to be honest I didn’t expect it to make the grade, but once we got up to speed on it I found it was a pretty good bit of implementation. It did mean we had to significantly expand the underlying platform we are reusing for all these products – just supporting Linux and Windows instances in Amazon already made us toss a lot of insufficiently open solutions in the garbage bin, and these two cloud worlds are very different as well.

How We Did It

I find nothing more instructive than finding out the details – organizational, technical, etc. – of how people really implement solutions in their own shops.  So in the interests of openness and helping out others, I’m going to do a series on how we did it!  I figure it’ll be in about three parts, most likely:

  • How We Did It: People
  • How We Did It: Process
  • How We Did It: Tools and Technologies

If there’s something you want to hear about when I cover these areas, just ask in the comments!  I can’t share everything, especially for unreleased products, but promise to be as open as I can without someone from Legal coming down here and Tasering me.


Filed under Cloud, DevOps

Austin Cloud User Group Nov 17 Meeting Notes

This month’s ACUG meeting was cooler than usual – instead of having one speaker talk on a cloud-related topic, we had multiple group members do short presentations on what they’re actually doing in the cloud.  I love talks like that; it’s where you get real rubber-meets-the-road takeaways.

I thought I’d share my notes on the presentations.  I’ll write up the one we did separately, but I got a lot out of these:

  1. OData to the Cloud, by Craig Vyal from Pervasive Software
  2. Moving your SaaS from Colo to Cloud, by Josh Arnold from Arnold Marziani (previously of PeopleAdmin)
  3. DevOps and the Cloud, by Chris Hilton from Thoughtworks
  4. Moving Software from On Premise to SaaS, by John Mikula from Pervasive Software
  5. The Programmable Infrastructure Environment, by Peco Karayanev and Ernest Mueller from National Instruments (see next post!)

My editorial comments are in italics.  Slides are linked into the headers where available.

OData to the Cloud

OData was started by Microsoft (“but don’t hold that against it”) under the Open Specification Promise.  Craig did an implementation of it at Pervasive.

It’s a RESTful protocol for CRUDdy GET/POST/DELETE of data.  Uses AtomPub-based feeds and returns XML or JSON.  You get the schema and the data in the result.

You can create an OData producer of a data source, consume OData from places that support it, and view it via stuff like iPhone/Android apps.

Current producers – Sharepoint, SQL Azure, Netflix, eBay, twitpic, Open Gov’t Data Initiative, Stack Overflow

Current consumers – PowerPivot in Excel, Sesame, Tableau.  Libraries for Java (OData4J), .NET 4.0/Silverlight 4, OData SDK for PHP

It is easier for “business users” to consume than SOAP or plain REST.  Craig used OData4J to create a producer for the Pervasive product.

Questions from the crowd:

Compression/caching?  Nothing built in.  Though normal HTTP level compression would work I’d think. It does “page” long lists of results and can send a section of n results at a time.

Auth? Your problem.  Some people use Oauth.  He wrote a custom glassfish basic HTTP auth portal.

Competition?  Gdata is kinda like this.

Seems to me it’s one part REST, one part “making you have a DTD for your XML”.  Which is good!  We’re very interested in OData for our data centric services coming up.
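
If you want to poke at OData yourself, a query is just an HTTP GET with standardized $-prefixed options. Here’s a sketch against a hypothetical service – the URL is invented, but real producers like the ones listed above follow the same pattern:

```python
# OData queries are plain HTTP GETs plus a standard query convention:
# $filter, $top/$skip for paging, $format to request JSON instead of Atom.
# The service URL below is invented for illustration.
from urllib.parse import urlencode

def odata_url(service, entity_set, **options):
    """Build an OData query URL; keyword options become $-prefixed params."""
    query = urlencode({f"${k}": v for k, v in options.items()})
    return f"{service}/{entity_set}?{query}"

url = odata_url(
    "https://odata.example.com/service.svc", "Products",
    filter="Price gt 20", top=10, format="json",
)
print(url)  # the $-options arrive URL-encoded as %24filter, %24top, etc.
```

That "schema comes back with the data" property means a generic client can build this URL without knowing anything about the service ahead of time – which is exactly what makes it business-user friendly.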

Moving your SaaS from Colo to Cloud

Josh Arnold was from PeopleAdmin, now he’s a tech recruiter, but can speak to what they did before he left.  PeopleAdmin was a Sungard type colo setup.  Had a “rotting” out of country DR site.

They were rewriting their stack from java/mssql to ruby/linux.

At the time they were spending $15k/mo on the colo (not including the cost of their HW).  Amazon’s estimated cost was 1/3 of that, but after moving they found it’s really 1/2.  What was the surprise cost?  Lower than expected performance (disk I/O) forced more instances than physical boxes of equivalent “size.”

Flexible provisioning and autoscaling was great, the colo couldn’t scale fast enough.  How do you scale?

The cloud made having an out-of-country DR site easy, and kept it from rotting and getting old.

Question: What did you lose in the move?  We were prepared for mental “control issues” so didn’t have those.  There’s definitely advanced functionality (e.g. with firewalls) and native hardware performance you lose, but that wasn’t much.

They evalled Rackspace and Amazon (cursory eval).  They had some F5s they wanted to use and the ability to mix in real hardware was tempting but they mainly went straight to Amazon.  Drivers were the community around it and their leadership in the space.

Timeline was 2 years (rewrite app, slowly migrate customers).  It’ll be more like 3-4 before it’s done.  There were issues where they were glad they didn’t mass migrate everyone at once.

Technical challenges:

Performance was a little lax (disk performance, they think) and they ended up needing more servers.  Used tricks like RAIDed EBSes to try to get the most io they could (mainly for the databases).

Every customer had a SSL cert, and they had 600 of them to mess with.  That was a problem because of the 5 Elastic IP limit.  Went to certs that allow subsidiary domains – Digicert allowed 100 per cert (other CAs limit to much less) so they could get 100 per IP.

App servers made outbound LDAP connections to customer premises for auth integration, and customers usually allowed those in via IP rules in their corporate firewalls – but on Amazon, outbound IPs are dynamic.  They set up a proxy with a static (elastic) IP to route all that through.

Rightscale – they used it.  They like it.

They used nginx for the load balancing, SSL termination.  Was a single point of failure though.

Remember that many of the implementations you are hearing about now were started back before Rackspace had an API, before Amazon had load balancers, etc.

In discussion about hybrid clouds, the point was brought up a lot of providers talk about it – gogrid, opsource, rackspace – but often there are gotchas.

DevOps and the Cloud

Chris Hilton from Thoughtworks is all about the DevOps, and works on stuff like continuous deployment for a living.

DevOps is:

  • collaboration between devs and operations staff
  • agile sysadmin, using agile dev tools
  • dev/ops/qa integration to achieve business goals

Why DevOps?

Silos.  agile dev broke down the wall between dev/qa (and biz).

devs are usually incentivized for change, and ops are incentivized for stability, which creates an innate conflict.

but if both are incentivized to deliver business value instead…

DevOps Practices

  • version control!
  • automated provisioning and deployment (Puppet/chef/rPath)
  • self healing
  • monitoring infra and apps
  • identical environments dev/test/prod
  • automated db mgmt
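
The core idea behind the “automated provisioning” bullet – what tools like Puppet and Chef do under the hood – is declarative, idempotent convergence. A toy sketch of the pattern (this is not any real tool’s API, just the idempotency idea):

```python
# Toy illustration of the declare/check/converge pattern that configuration
# management tools like Puppet and Chef implement; not any real tool's API.
import os
import tempfile

def ensure_file(path, content):
    """Converge a file to the desired content; do nothing if already correct."""
    if os.path.exists(path):
        with open(path) as f:
            if f.read() == content:
                return "unchanged"
    with open(path, "w") as f:
        f.write(content)
    return "changed"

path = os.path.join(tempfile.mkdtemp(), "motd")
print(ensure_file(path, "welcome\n"))  # first run: changed
print(ensure_file(path, "welcome\n"))  # second run: unchanged, safe to re-run
```

Because every resource check is safe to re-run, you can apply the same manifest to a fresh cloud instance or a long-lived box and get identical environments – which is how the “identical environments dev/test/prod” bullet actually gets achieved.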

Why DevOps In The Cloud?

cloud requires automation, devops provides automation

Recommended reading:

  • “Continuous Delivery” Humble and Farley
  • Rapid and Reliable Releases InfoQ
  • Refactoring Databases by Ambler and Sadalage

Another tidbit: they’re writing puppet lite in powershell to fill the tool gap – some tool suppliers are starting, but the general degree of tool support for people who use both Windows and Linux is shameful.

Moving Software from On Premise to SaaS

John Mikula of Pervasive tells us about the Pervasive Data Cloud.  They wanted to take their on premise “Data Integrator” product, basically a command line tool ($, devs needed to implement), to a wider audience.

Started 4 years ago.  They realized that the data sources they’re connecting to and pumping to, like Quickbooks Online, Salesforce, etc are all SaaS from the get go.   “Well, let’s make our middle part the same!”

They wrote a Java EE wrapper, put it on Rackspace colo initally.

It gets a customer’s metadata, puts it on a queue, and another system takes it off and processes it.  A very scaling-friendly architecture.  And Rackspace (colo) wasn’t scaling fast enough, so they moved it to Amazon.
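
That metadata-on-a-queue design is the classic work-queue pattern. A minimal in-process sketch (the job fields are invented; their real system used SQS and later Zookeeper so front ends and workers can scale independently):

```python
# Minimal in-process sketch of the enqueue/dequeue pattern described above;
# job fields are invented. A real deployment puts the queue on a broker
# (SQS, Zookeeper, etc.) so front ends and workers scale independently.
import queue

jobs = queue.Queue()

# Front end: accept a customer request and enqueue just its metadata.
jobs.put({"customer": "acme", "source": "quickbooks", "dest": "salesforce"})

# Worker: pull jobs off and process them, decoupled from the front end.
def work_one(q):
    job = q.get()
    result = f"synced {job['source']} -> {job['dest']} for {job['customer']}"
    q.task_done()
    return result

print(work_one(jobs))  # synced quickbooks -> salesforce for acme
```

The decoupling is what makes it scaling-friendly: when the queue backs up you add workers, and the front ends never notice.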

Their initial system had 2 glassfish front ends, 25 workers

For queuing, they tried Amazon SQS but it was limited, then went to Apache Zookeeper

First effort was about “deploy a single app” – namely salesforce/quickbooks integration.  Then they made a domain specific model and refactored and made an API to manage the domain specific entities so new apps could be created easily.

Recommended approach – solve easy problems and work from there.  That’s more than enough for people to buy in.

Their core engine’s not designed for multitenancy – they have batches of workers dedicated to one customer’s code – so that code can be unsafe, but it’s in its own bucket and doesn’t mess up anyone else.

Changing internal business processes in a mature company was a challenge – moving from perm license model to per month just with accounting and whatnot was a big long hairy deal.

Making the API was rough.  His estimate of a couple months grew to 6.  Requirements gathering was a problem, very iterative.  They weren’t agile enough – they only had one interim release and it wasn’t really usable; if they did it again they’d do the agile ‘right thing’ of putting out usable milestones more frequently to see what worked and what people really needed.

In Closing

Whew!  I found all the presentations really engaging and thank everyone for sharing the nuts and bolts of how they did it!


Filed under Cloud

Beware the Deceptive SLA, My Friend

We’re trying to come to an agreement with a SaaS vendor about performance and availability service level agreements (SLAs).  I discussed this topic some in my previous “SaaS Headaches” post.  I thought it would be instructive to show people the standard kind of “defense in depth” that suppliers can have to protect against being held responsible for what they host for you.

We’ve been working on a deal with one specific supplier.  As part of it, they’ll be hosting some images for our site.  There’s a business team primarily responsible for evaluating their functionality etc., we’re just in the mix as the faithful watchdogs of performance and availability for our site.

Round 1 – “What are these SLAs you speak of?”  The vendor offers no SLA.  “Unacceptable,” we tell the project team.  They fret about having to worry about that along with the 100 other details of coming to an agreement with the supplier, but duly go back and squeeze them.  It takes a couple squeezes because the supplier likes to forget about this topic – send a list of five questions with one of them being “SLA,” you get four answers back, ignoring the SLA question.

Round 2 – “Oh, you said ‘SLA’!  Oh, sure, we have one of those.”  We read the SLA and it only commits to their main host being pingable.  Our service could be completely down, and it doesn’t speak to that.  Back to our project team, who now between the business users, procurement agent, and legal guy need more urging to lean on the supplier.  The supplier plays dumb for a while, and then…

Continue reading

1 Comment

Filed under Cloud, General

Cloud Headaches?

The industry is abuzz with people who are freaked out about the outages that Amazon and other cloud vendors have had.  “Amazon S3 Crash Raises Doubts Among Cloud Customers,” says InformationWeek!

This is because people are going into cloud computing with absurdly high expectations.  This year at Velocity, Interop, etc. I’ve seen people just totally in love with cloud computing – Amazon’s specifically but in general as well.  And it’s a good concept for certain applications.  However, it is a computing system just like every other computing system devised previously by man.  And it has, and will have, problems.

Whether you are using in house systems, or a SaaS vendor, or building “in the cloud,” you have the same general concerns.  Am I monitoring my systems?  What is my SLA?  What is my recourse if my system is not hitting it?  What’s my DR plan?

SaaS is a special case of cloud computing in general.  And if you’re a company relying on it, when you contract with a SaaS vendor you get SLAs established and figure out what the remedy is if they breach it.  If you are going into a relationship where you are just paying money for a cloud VM, storage, etc. and there is no enforceable SLA in the relationship, then you need to build the risk of likely and unremediable outages into your business plan.

I hate to break it to you, but the IT people working at Amazon, Google, etc. are not all that much smarter than the IT people working with you.  So an unjustified faith in a SaaS or cloud vendor – “Oh, it’s Amazon, I’m sure they’ll never have an outage of any sort – either across their entire system or localized to my part of it – and if they do, I’m sure the $100/month I’m paying them will make them give a damn about me” – is an unreasonable expectation on its face.

Clouds and cloud vendors are a good innovation.  But they’re like every other computing innovation and vendor selling it to you.  They’ll have bugs and failures.  But treating them as if they won’t is a failure on your part, not theirs.


Filed under Cloud, Uncategorized