Category Archives: DevOps

Pertaining to agile system administration concepts and techniques.

How We Do Cloud and DevOps: The Motion Picture

Our good friend Damon Edwards from dev2ops came by our Austin office and recorded a video of Peco and me explaining how we do what we do! Peco never blogs, so this is a rare opportunity to hear him talk about these topics, and he’s full of great sound bites. 🙂

I apologize in advance for how much I say “right.”

Leave a comment

Filed under Cloud, DevOps

Won’t somebody please think of the systems?


What is the goal of DevOps?  If you ask a lot of people, they say “Continuous integration!  Pushing functionality out faster!”  The first cut at a DevOps Wikipedia article pretty much only considered this goal. Unfortunately, this is a naive view largely popular among developers who don’t fully understand the problems of service management. More rapid delivery and the practice of continuous integration/deployment is cool and it’s part of the DevOps umbrella of concerns, but it is not the largest part.

Let us review the concepts behind IT Service Management. I don’t like ITIL as a prescriptive thing to implement, but as a cognitive framework for understanding IT work, it’s great. Anyway, depending on which version you are looking at, there are several phases to delivering a service to end users/customers.

1. Service Strategy (tie-guy stuff)
2. Service Design (including capacity, availability, risk management)
3. Service Transition (release and change management)
4. Service Operation (operations)
5. Continual Service Improvement (metrics)

Let’s concentrate on the middle three.  Service transition (release) is where CI fits in.  And that’s great.  But most of the point of DevOps is the need for ops to be involved in Service Design and for the developers to be involved in Service Operation!

Service Transition

Sure, in the old waterfall mindset, service transition is where “the work moves from dev to ops.” Dev guys do everything before that, ops guys do everything after that, we just need a more graceful handoff.  DevOps is not about trying to file some of the rough bits off the old way of doing things. It’s about improving service quality by more profoundly integrating the teams through the whole pipeline.

Here at NI, continuous integration was our lowest ranked DevOps priority. It’s a nice-to-have, while improving service design and operation was way, way more important to us. We’re starting work on it now, but we consider our DevOps implementation to be successful without it. If you don’t have service design and operation nailed, then focusing on service transition risks “delivering garbage more quickly to users, and having it be unsupportable.”

Service Design

Services will not be designed correctly without embedded operational experience in their design and creation. You can call this “systems engineering” and say it’s not ops… But it’s ops. Look at the career paths if that confuses you. Our #1 priority in our DevOps implementation was to avoid the pain and problems of developing services with only input from functional developers. A working service is, more than ever, a synthesis of systems and applications and needs to be designed as such. We required our systems architect and applications architect to work hand in hand, jointly design the team and tools and products, review changes from both devs and ops…

Service Operation

Services cannot be operated correctly if thrown over the wall to an ops team, and they will not be improved at the appropriate rate. Developers need to be on hand to help handle problems, and need to follow extremely closely what their application does in the users’ hands so they can make a better product. This was our #2 priority when implementing DevOps, and self service is a high implementation priority.  Developers should be able to see their logs and production metrics in realtime, so we put things like Splunk and Cloudkick in place with the goal of making them developer-facing, not just operations-facing, tools.

The Bottom Line

DevOps is not about just making the wall you throw stuff over shorter. With apologies to Dev2Ops,

Improvement? I think not!

To me the point of DevOps is to not have a wall – to have both development and operations involved in the service design, the transition, and the operation. Just implementing CI without doing that isn’t DevOps – it’s automating waterfall. Which is fine and all, but you’re missing a lot of the point and are not going to get all the benefits you could.

3 Comments

Filed under DevOps

Analysts on DevOps

DevOps is getting enough traction that there are papers coming out on it from the various analyst groups. I thought I’d spur a roundup of this research – it can be very valuable in helping your upper management types understand and see the value of DevOps.

Chime in below with more to add to the list!  Analyst stuff, not random blogs, please – something that you would put in front of upper management.

Leave a comment

Filed under DevOps

Addressing the IT Skeptic’s View on DevOps

A recent blog post on DevOps by the IT Skeptic entitled DevOps and traditional ITSM – why DevOps won’t change the world anytime soon got the community a’frothing. And sure, the article is a little simmered in anti-agile hate speech (apparently the Agilistas and cloud hypesters and cowboys are behind the whole DevOps thing and are leering at his wife and daughter and dropping his property values to boot), but I believe his critiques are in general very perceptive and that they are areas we, the DevOps movement, should work on.

Go read the article – it’s really long so I won’t sum the whole thing up here.

Here are the most germane critiques and what we need to do about them. He also has some poor, irrelevant, or misguided critiques, but why would I waste time on those?  Let’s take action on the good stuff that can make DevOps better!

Lack of a coherent definition

This is a very good point. I went to the first meeting of an Austin DevOps SIG recently and was treated to the usual debate about “the definition of DevOps” and all the varied viewpoints going into that.  We need to evolve a more structured definition that either includes and organizes or excludes the various memetic threads. It’s been done with Agile, and we can do it too. My imperfect definition of DevOps on this site tries to clarify this by showing there are different levels (principles, methods, and practices) that different thoughts about DevOps slot into.

Worry about cowboys

This is a valid concern, and one I share. Here at NI, back in the day programmers had production passwords, and they got taken away for real good reasons.  “Oh, let’s just give the programmers pagers and the root password” is not a responsible interpretation of DevOps but it’s one I’ve heard bandied about; it’s based on a false belief that as long as you have “really smart” developers they’ll never jack everything up.

Real DevOps shops that are uptaking practices that could be risky, like continuous deployment, are doing it with extreme safeguards put into place (automated testing, etc.).  This is similar to the overall problem in agile – some people say “agile? Great!  I’ll code at random,” whereas really you need to have a very high percentage of unit test coverage. And sure, when you confront people with this they say “Oh, sure, you need that,” but there is very little constructive discussion or tooling around it. How exactly do I build a good systems + app code integration/smoke test rig? “Uh, you could write a bunch of code hooked to Hudson…” This should be one of the most discussed and best understood parts of the chain, not one of the least, to do DevOps responsibly.

We’re writing our own framework for this right now – James is doing it in Ruby, it’s called Sparta, and devs (and system folks) provide test chunks that the framework runs and times in an automated fashion. It’s not a well-solved problem (and the big-dollar products that claim to do test automation are nightmares, and not really automated in the “devs easily contribute tests to integrate into a continuous deploy” sense).
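Since the question keeps coming up, here’s a minimal sketch of the kind of harness Sparta provides (the names and structure are illustrative, not Sparta’s actual API): contributors register small check functions, and the framework runs and times each one.

```python
import time

# A minimal sketch of a Sparta-style harness (all names here are
# illustrative, not the real Sparta API): devs and sysadmins contribute
# small check functions, and the framework runs and times each one.
def run_checks(checks):
    """Run each named check, timing it and recording pass/fail."""
    results = {}
    for name, check in checks.items():
        start = time.time()
        try:
            check()
            status = "pass"
        except AssertionError as err:
            status = "fail: %s" % err
        results[name] = (status, time.time() - start)
    return results

# Example "test chunks"; real ones would hit HTTP endpoints, grep logs,
# verify disk space, and so on.
def math_still_works():
    assert 2 + 2 == 4

def simulated_regression():
    assert False, "smoke test found a broken endpoint"

results = run_checks({"sanity": math_still_works,
                      "smoke": simulated_regression})
print(results["sanity"][0])  # → pass
print(results["smoke"][0])   # → fail: smoke test found a broken endpoint
```

The real work is in making the checks themselves meaningful and wiring the results into the deploy pipeline (Hudson or whatever) so a failed check blocks the push.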

Team size

Working at a large corporation, I also share his concern about people’s cunning DevOps schemes that don’t scale past a 12 person company.  “We’ll just hire 7 of the best and brightest and they’ll do everything, and be all crossfunctional, and write code and test and do systems and ops and write UIs and everything!” is only a legit plan for about 10 little hot VC funded Web 2.0 companies out there.  The rest of us have to scale, and doing things right means some specialization and risks siloization.

For example, performance testing.  When we had all our developers do their own performance testing, the limit of the sophistication of those tests was “I’ll run 1000 hits against it and time how long it takes to finish.  There, 4 seconds.  Done, that’s my performance testing!”  The only people who think Ops, QA, etc. are such minor skill sets that someone can just do them all are people who are frankly ignorant of those fields. Oh, P.S. The NoOps guys fall into this category, please don’t link them to DevOps.

We have struggled with this.  We’ve had to work out what testing our devs do versus how we closely align with external test teams.  Same with security, performance, etc.  The answer is not to completely generalize or completely silo – Yahoo! had a great model with their performance team, where there is a central team of super-experts but there are also embedded folks on each product team.

Hiring people

Very related to the previous point – again, unless you’re one of the 10 hottest Web 2.0 plays and you can really get the best of the best, you need to staff your organization with random folks who graduated from UT with a B average. You have to have and manage tiers as well as silos – some folks are only ready to be “level 1 support” and aren’t going to be reading some dev’s Java code.

Traditional organizations and those following ITIL very closely can definitely create structures that promote bad silos and bad tiering. But just assuming everyone will be of the same (high) skill level and be able to know everything is a fallacy that is easy to fall into, since it’s that sort of elite individual who is the leading uptaker of DevOps.  Maybe the book Gene Kim is working on (“Visible DevOps” or similar) will help with that.

Tools fixation

Definitely an issue.  An enhanced focus on automation is valuable.  Too many ops shops still just do the same crap by hand day after day, and should be challenged to automate and use tools.  But a lot of the DevOps discussions do become “cool tool litanies,” and that’s putting the cart before the horse.  In my terminology, you don’t want to derive the principles from the practices and methods – tooling is great, but it should serve the goals.

We had that problem on our team. I had to talk to our Ops team and say “Hey, why are we doing all these tool implementations?  What overall goal are they serving?”  Tools for the sake of tools are worse than pointless.

Process

It is true that with agile and with DevOps, some folks are using them as an excuse to toss out process.  It should simply be a different kind of process! And you need to take into account all the stuff that should be in there.

A great example is Michael Howard et al. at Microsoft with their Security Development Lifecycle.  The first version of it was waterfall.  But now they’ve revamped it to have an agile security development lifecycle, so you know when to do your threat modeling etc.

Build instead of buy

Well, there are definitely some open source zealots in most movements that have any sysadmins in them. We would like to buy instead of build, but the existing tools tend to either not solve today’s problems or have poor ROI.

In IT, we implemented some “ITIL compliant” HP tools for problem tracking, service desk, and software deployment. They suck, and are very rigid, and cost a lot of money, and required as much if not more implementation time than writing something from scratch that actually addressed our specific requirements. And in general that’s been everyone’s experience. The Ops world has learned to fear the HP/IBM/CA/etc systems management suites because it’s just one of those niches that is expensive and bad (like medical or legal software).

But having said that, we buy when we can! Splunk gave us a lot more than cobbling together our own open source thing.  Cloudkick did too. Sure, we tend to buy SaaS a lot more than on prem software now because of the velocity that gives us, but I agree that you need to analyze the hidden costs of building as part of a build/buy – you just need to also see the hidden costs and compromised benefits of a buy.

Risk Control

This simply goes back to the cowboy concern. It has been clearly shown that if you structure your process correctly, with the right testing and signoff gates, then agile/DevOps/rapid deploys are less risky.

We came to this conclusion independently as well.  In IT, we ran (still do) these Web go lives once a month.  Our Web site consists of 200+ applications  and we have 70 or so programmers, 7 Web ops, a whole Infrastructure department, a host of third party stuff (Oracle and many more)… Every release plan was 100 lines long and the process of planning them and executing on them was horrific. The system gets complex enough, both technically and organizationally, that rollbacks + dependencies + whatnot simply turn into paralysis, and you have to roll stuff out to make money.  When the IT apps director suggested “This is too painful – we should just do these quarterly instead, and tell the business they get to wait 2 more months to make their money,” the light went on in my mind. Slower and more rigorous is actually worse.  It’s not more efficient to put all the product you’re shipping for the month onto a huge ass warehouse on the back of a giant truck and drive it around doing deliveries, either; this should be obvious in retrospect. Distribution is a form of risk management. “All the eggs in one big basket that we’ll do all at one time” is the antithesis of that.

The Future

We started DevOps here at NI from the operations guys.  We’d been struggling for years to get the programmers to take production responsibility for their apps. We had struggled to get them access to their own logs, do their own deploys (to dev and test), let business users input Apache redirects into a Web UI rather than have us do it… We developed a whole process, the Systems Development Framework, that we used to engage with dev teams and make sure all the performance, reliability, security, manageability, provisioning, etc. stuff was getting taken care of… But it just wasn’t as successful as we felt like it could be.  Once we saw that a more integrated model was possible, we realized success was actually an option. Ask most sysadmin shops if they think success is actually a possible outcome of their work, and you’ll get a lot of hedging kinds of “well, success is not getting ruined today” kinds of responses.

By combining ops and devs onto one team, by embedding ops expertise onto other dev teams, by moving to using the same tools and tracking systems between devs and ops, and striving for profound automation and self service, we’ve achieved a super high level of throughput within a large organization. We have challenges (mostly when management decides to totally change track on a product, sigh) but from having done it both ways – OMG it’s a lot better. Everything has challenges and risks and there definitely needs to be some “big boy” compatible thinking on DevOps – but it’s like anything else, those who adopt early will reap the rewards and get competitive advantage over the others. And that’s why we’re all in. We could wait till it’s all worked out and drool-proof, but that’s a better fit for companies that don’t actually have to produce/achieve any more (government orgs, people with more money than God like oil and insurance…).

1 Comment

Filed under DevOps

DevOps at OSCON

I was watching the live stream from OSCON Ignite and saw a good presentation on DevOps Antipatterns by Aaron Blew.  Here’s the video link, it’s 29:15 in, go check it out.  And I was pleasantly surprised to see that he mentions us here at the agile admin in his links!  Hi Aaron and OSCONners!

1 Comment

Filed under DevOps

Velocity 2011: The Workshops

Peco and I split up to cover more ground.  I went to four workshops and here are the details… Peco will have to chime in on his.

First, Adrian Cockcroft, Director of Cloud Architecture for Netflix, spoke on Netflix in the Cloud. This session was excellent.  He talked about the importance of model driven architecture, a runtime registry, how too many of the monitoring etc. tools don’t do cloud worth a damn…  All great stuff.  Included a love letter to AppDynamics, a cool cloud-friendly app instrumentation tool similar to our beloved Opnet Panorama.

Next, I saw John Rauser of Amazon talk about Just Enough Statistics To Be Dangerous. He talked about basic probability stats and how to use them.  Pretty good, though could have used more “and here’s how this applies to WebOps” examples instead of “how many quarters are in this jar” examples. I missed a bit of this because I ran out to go to the head and Patrick Debois grabbed me to talk to a guy from Dell about DevOps, which was loads of fun!  I missed the part on Bayesian stats though, I’ll have to watch the session video once it’s available.

Over lunch we met up with all the other guys here from NI, and my college friend Jon Whitney! Woot!  Rice University in the house!

After lunch, it was John Allspaw talking about reliability engineering and Postmortems and Human Error. Root cause is a myth!  So is human error!  Mindbending stuff. You should read the “How Complex Systems Fail” chapter in the Web Ops book to lube you up first, then watch the video for this session. Very relevant to all ops folks. We were a little split, though, on how a militant no-blame philosophy jibes with places that aren’t hiring the absolute cream of the crop – if you don’t work at Etsy or a similar 3l33t place, you do have some folks that are… a disproportionate source of errors.

My last workshop was a little disappointing – Automating Web Performance Testing by 5 PM, by the Neustar crew. There was some good info in there – Selenium, proxies, HAR format – but the delivery was weak.  There was sample code, though – you can download some Python and Java automation examples. But “I can’t read text that small” combined with bad presentation technique (asking 5 times for “raise your hand if you don’t know X,” for example) made it a bit of a chore. Ah well.

Now it’s time for dinner and then the evening Ignite! sessions!

1 Comment

Filed under Conferences, DevOps

Velocity 2011 Kickoff!

Two of the agile admins, Ernest and Peco, are in Santa Clara this week for our fourth Velocity conference! We’ve been to all of them and always get a lot out of them.  It’s the first conference focused on Web performance and operations. Today is the day of workshops, then Wed-Thurs is normal sessions.  On Fri-Sat we’re going to DevOpsDays 2011 Mountain View. The third agile admin, James, is in Penang hanging out with our follow-the-sun WebOps staff!

If any of you are out in sunny CA for these events (or heck, if you’re in Penang and bored), ping us, we’d love to meet you! Tweet me at @ernestmueller for the hookup.

Now to our first workshops – I’m watching Adrian Cockcroft talk about Netflix’s use of the Amazon cloud and Peco is going to see the OpenStack workshop!

Leave a comment

Filed under Conferences, DevOps

But, You See, Your Other Customers Are Dumber Than We Are

Sorry it’s been so long between blog posts.  We’ve been caught in a month of Release Hell, where we haven’t been able to release because of supplier performance/availability problems. And I wanted to take a minute to vent about something I have heard way too much from suppliers over the last nine years working at NI, which is that “none of our other customers are having that problem.”

Because, you see, the vast majority of the time that ends up not being the case. And that’s a generous estimate.  It’s an understandable mistake on the part of the person we are talking to – we say “Hey, there’s a big underlying issue with your service or software.” The person we’re talking to hasn’t heard that before, so naturally assumes it’s “just us,” something specific to our implementation. Because they’re big, right, and have lots of customers, and surely if there was an underlying problem they’d already know from other people, right? And we move on, wasting lots of time digging into our use case rather than the supplier fixing their stuff.

But this reflects a fundamental misunderstanding of how problems get identified.  What has to happen to get an issue report to a state where the guy we’re talking to would have heard about it? If it’s a blatant problem – the service is down – then other people report it.  But what if it’s subtle?  What if it’s that 10% of requests fail? (That seems to happen a lot.)

Steps Required For You To Know About This Problem

  1. First, a customer has to detect the problem in the first place.  And most don’t have sufficient instrumentation to detect a sudden performance degradation, or an availability hit short of 100%. Usually they have synthetic monitoring at most, and synthetic monitors are usually poor substitutes for real traffic – plus, usually people don’t have them alert except on multiple failures, so any “33% of the time” problem will likely slip through.  No, most other customers do what the supplier is doing – relying on their customers in turn to report the problem.
  2. Second, let’s say a customer does think they see a problem.  Well, most of the time they think, “I’m sure that’s just one of those transient issues.” If it’s a SaaS service, it must be our network, or the Internet, or whatever.  If it’s software, it must be our network, or my PC, or cosmic rays. 50% of the time the person who detects the problem wanders off and probably never comes back to look for it again – if the way they detected it isn’t an obvious part of their use of the system, it’ll fly under the radar. Or they assume the supplier knows and is surely working on it.
  3. Third, the customer has to care enough to report it. Here’s a little thing you can try.  Count your users.  You have 100k users on your site? 20k visitors a day?  OK, take your site down. Or break something obvious.  Hell, break your home page lead graphic.  Now wait and see how many problem reports you get. If you’re lucky, a dozen. Out of THOUSANDS of users.  Do the math – what percentage is that?  Less than 1%.  OK, so even a brutally obvious problem gets less than 1% reports sent in. So now apply that percentage to the already-diminished percentages from the previous two steps. Many users, unless using your product is a huge part of their daily work, are going to go use something else, figuring someone else will take care of it eventually.
  4. Now, the user has to report the problem via a channel you are watching.  Many customers, and this happens to us too, have developed through long experience an aversion to your official support channel. Maybe it costs money and they aren’t paying you. Maybe they have gotten the “reboot your computer dance” from someone over VoIP in India too many times to bother with it.  Instead they report it on your forums, which you may have someone monitoring and responding to, but how many of those problems then become tickets or make their way to product development or someone important enough to hear/care?  Or they report it on Server Fault, or any number of places they’ve learned they get better support than your support channel. Or they email their sales rep or a technical guy they know at your company or accidentally go to your customer service email – all channels which may eventually lead to your support channel, but every one of those paths being lossy.
  5. So let’s say they do report it to your support channel.  Look, your front line support sucks. Yes, yours. Pretty much anyone’s.  Those same layers that are designed to appropriately escalate issues each have friction associated with them. As a customer you have to do a lot of work to get through level 1 to level 2 to level 3.  If the problem isn’t critical to you, often you’re just going to give up.  How many of your tickets are just abandoned by a customer without any meaningful explanation or fix provided by you?  Some percentage of those are people who have realized they were doing something wrong, but many just don’t have the time to mess with your support org any more – you’re probably one of 20 products they use/support. They usually have to invest days of dedicated effort to prove to each level of support (starting over each time, as no one seems to read tickets) that they’re not stupid and know what they’re talking about.
  6. Let’s say that the unlikely event has happened – a customer detects the problem, cares about it, comes to your support org about it, and actually makes  it through front line support.  Great.  But have YOU heard about it, Supplier Guy Talking To Me?  Maybe you’re a sales guy or SE, in which case you have a 1% chance of having heard about it.  Maybe you’re an ops lead or escalation guy, in which case there’s a slight chance you may have heard of other tickets on this problem.  Your support system probably sucks, and searching on it for incidents with the same  symptoms is unlikely to work right. It’s happened to me many times that on the sixth conference call with the same vendor about an issue, some new guy is on the phone this time and says “Oh, yes, that’s right.  That’s the way it works, it won’t do that.” All the other “architects” and everyone look at each other in surprise.  But not me, I got over being surprised many iterations of this ago.

So let’s look at the Percentage Chance A Big Problem With Your Product Has Been Brought To Your Attention.

%CABPWYPHBBTYA =  % of users that detect the problem * % of users that don’t disregard it * % of users that care enough to report it * % of users that report it in the right channel * % of users that don’t get derailed by your support org * % of legit problems in support tickets about this same problem that you personally have seen or heard about.

This lets you calculate the Number of Customers Reporting A Real Problem With Your Product.

NOCRARPWYP = Your User Base * %CABPWYPHBBTYA

Example

Let’s run those numbers for a sample case with reasonable guesses at real world percentages, starting with a nice big 100,000 customer user base (customers using the exact same product or service on the same platform, that is, not all your customers of everything).

NOCRARPWYP = 100,000 * (5% of users that detect it * 20% of users don’t wander off * 20% care enough to report it to you * 20% bring it to your support channel you think is “right” * 20% brought to clear diagnosis by your support * 5% of tickets your support org lodges that you ever hear about) = 0.4 customers.  In other words, just us, and you should consider yourselves lucky to be getting it from us at that.
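The arithmetic above is trivial to sketch in code if you want to plug in your own numbers (the percentages are the same illustrative guesses from the text, not measured data):

```python
# Multiply out the reporting funnel: each stage is the fraction of
# affected users who make it past that hurdle. These percentages are
# illustrative guesses, not measured data.
def reporting_customers(user_base, stages):
    n = float(user_base)
    for fraction in stages:
        n *= fraction
    return n

funnel = [
    0.05,  # detect the problem at all
    0.20,  # don't dismiss it as transient
    0.20,  # care enough to report it
    0.20,  # report it via the "right" support channel
    0.20,  # survive front-line support with a clear diagnosis
    0.05,  # ticket actually reaches the person you're talking to
]
print(round(reporting_customers(100_000, funnel), 1))  # → 0.4
```

Swap in percentages from your real operations and see how big a user base it takes before a subtle problem generates even one report.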

And frankly these percentage estimates are high.  Fill in percentages from your real operations. Tweak them if you don’t believe me. But the message here is that even an endemic issue with your product, even if it’s only mildly subtle,  is not going to come to your attention the majority of the time, and if it does it’ll only be from a couple people.

So listen to us for God’s sake!

We’ve had this happen to us time and time again – we’re pretty detail oriented here, and believe in instrumentation. Our cloud server misbehaving?  Well, we bring up another in the same data center, then copies in other data centers, then put monitoring against it from various international locations, and pore over the logs, and run Wireshark traces.  We rule out the variables and then spend a week trying to explain it to your support techs. The sales team and previous support guys we’ve worked with know that we know what we’re talking about – give us some cred based on that.

In the end, I’ve stopped being surprised when we’re the ones to detect a pretty fundamental issue in someone’s software or service, even when we’re not a lead user, even when it’s a big supplier. And to a supplier’s incredulous statement that “none of our other customers [that I have heard of] are having this problem,” I can only reply:

But, you see, your other customers are dumber than we are.

4 Comments

Filed under DevOps

The Real Lessons To Learn From The Amazon EC2 Outage

As I write this, our ops team is working furiously to bring up systems outside Amazon’s US East region to recover from the widespread outage they are having this morning. Naturally the Twitterverse, and in a day the blogosphere, and in a week the trade rag…axy? will be talking about how the cloud is unreliable and you can’t count on it for high availability.

And of course this is nonsense. These outages make the news because they hit everyone at once, but all the outages people are continually having at their own data centers are just as impactful, if less hit-generating in terms of news stories.

Sure, we’re not happy that some of our systems are down right now. But…

  1. The outage is only affecting one of our beta products, our production SaaS service is still running like a champ, it just can’t scale up right now.
  2. Our cloud product uptime is still way higher than our self hosted uptime. We have network outages, Web system outages, etc. all the time even though we have three data centers and millions of dollars in network gear and redundant Internet links and redundant servers.

People always assume they’d have zero downtime if they were hosting it. Or maybe that they could “make it get fixed” when it does go down. But that’s a false sense of security based on an archaic misconception that having things on premise gives you any more control over them. We run loads of on premise Web applications and a batch of cloud ones, and once we put some Keynote/Gomez/Alertsite monitoring against them we determined our cloud systems have much higher uptime.

Now, there are things that Amazon could do to make all this better for customers. In the Amazon SLAs, they say of course you can have super high uptime – if you are running redundantly across AZs and, in this case, regions. But Amazon makes it really unattractive and difficult to do this.

What Amazon Can Do Better

  1. We can work around this issue by bringing up instances in other regions.  Sadly, we didn’t already have our AMIs transferred into those regions, and you can only bring up instances off AMIs that are already in those regions. And transferring regions is a pain in the ass. There is absolutely zero reason Amazon doesn’t provide an API call to copy an AMI from region 1 to region 2. Bad on them. I emailed my Amazon account rep and just got back the top Google hits for “Amazon AMI region migrate”. Thanks, I did that already.
  2. We weren’t already running across multiple regions and AZs because of cost. Some of that is the cost of redundancy in and of itself, but more importantly is the hateful way Amazon does reserve pricing, which very much pushes you towards putting everything in one AZ.
  3. Also, redundancy only really works if you have everything, including data, in every AZ. If you are running redundant app servers across 4 AZs, but have your database in one of them – or have a database master in one and slaves in the others – you still get hosed by downtime in a particular AZ or region.
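A footnote on the AMI-copy gap in point 1: Amazon did later ship a CopyImage API, and with the modern AWS CLI the cross-region copy looks like this (the regions and AMI ID are placeholders, and this is a sketch of the later API, not what was available at the time):

```shell
# Copy an AMI from us-east-1 into us-west-1 so instances can be launched
# there during a regional outage. Requires AWS CLI credentials that can
# read the source image and register AMIs in the destination region.
aws ec2 copy-image \
    --region us-west-1 \
    --source-region us-east-1 \
    --source-image-id ami-12345678 \
    --name "base-image-us-west-1-copy"
```

Even with such an API, the larger point stands: copies have to exist before the outage, so cross-region AMI propagation belongs in your image build process.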

Amazon needs to have tools that inherently let you distribute your stuff across their systems and needs to make their pricing/reserve strategy friendlier to doing things in what they say is “the right way.”

What We Could Do Better

We weren’t completely prepared for this. Once that region was already borked, it was impossible to migrate AMIs out of it, and there are so many ticky little region-specific things all through the Amazon config – security groups, ELBs, etc. – that doing it on the fly is not possible unless you have specifically done it before, and we hadn’t.
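One way to make a cross-region rebuild scriptable rather than archaeological is to keep all those region-specific bits in data instead of in the console. This is just a sketch of the idea – every ID and name below is hypothetical:

```python
# Region-specific "ticky little things" captured as data, so standing up in
# another region is a lookup rather than a scramble. All values hypothetical.
REGION_CONFIG = {
    "us-east-1": {"ami": "ami-11111111", "security_group": "web-east", "elb": "web-lb-east"},
    "us-west-1": {"ami": "ami-22222222", "security_group": "web-west", "elb": "web-lb-west"},
}

def launch_params(region, instance_type="m1.large"):
    """Assemble region-appropriate launch parameters, failing loudly if a
    region was never provisioned instead of silently reusing another's IDs."""
    cfg = REGION_CONFIG[region]
    return {
        "image_id": cfg["ami"],
        "security_groups": [cfg["security_group"]],
        "instance_type": instance_type,
    }

print(launch_params("us-west-1")["image_id"])  # ami-22222222
```

The dictionary lookup raises a KeyError for any region you never provisioned, which is exactly the failure you want before an outage instead of during one.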

We have an automation solution (PIE) that will regen our entire cloud for us in a short amount of time, but it doesn’t handle the base images, some of which we modify and re-burn from the Amazon ones. We don’t have that process automated and the documentation was out of date since Fedora likes to move their crap around all the time.

In the end, Amazon started coming back just as we got new images done in us-west-1.  We’ll certainly work on automating that process, and hope that Amazon will also step up to making it easier for their customers to do so.

Leave a comment

Filed under Cloud, DevOps

Our Cloud Products And How We Did It

Hey, I’m not a sales guy, and none of us spend a lot of time on this blog pimping our company’s products, but we’re pretty proud of our work on them and I figured I’d toss them out there as use cases of what an enterprise can do in terms of cloud products if they get their act together!

Currently all the agile admins (myself, Peco, and James) work together in R&D at National Instruments.  It’s funny – we used to work together on the Web Systems team that ran the ni.com Web site, but then people went their own ways to different teams or even different companies. Then we decided to put the dream team back together to run our new SaaS products.

About NI

Some background.  National Instruments (hereafter, NI) is a 5000+ person global company that makes hardware and software for test & measurement, industrial control, and graphical system design. Real Poindextery engineering stuff. Wireless sensors and data acquisition, embedded and real-time, simulation and modeling. Our stuff is used to program the Lego Mindstorms NXT robots as well as control CERN’s Large Hadron Collider. When a crazed highlander whacks a test dummy on Deadliest Warrior and Max the techie looks at readouts of the forces generated, we are there.

About LabVIEW

Our main software product is LabVIEW.  Despite being an electrical engineer by degree, I never used LabVIEW in school (this was a very long time ago, I’ll note; most programs use it nowadays), so it wasn’t till I joined NI that I saw it in action. It’s a graphical dataflow programming language. I assumed that was BS when I heard it. I had so many companies try to sell me “graphical” programming over the years, like all those crappy 4GLs back in the ‘90s, that I figured it was just an unachieved myth. But no, it’s a real visual programming language that’s worked like a champ for more than 20 years. In certain ways it’s very bad ass – it does parallelism for you and can be compiled and dropped onto an FPGA. It’s remained niche-ey and hasn’t been widely adopted outside the engineering world, however, due to company focus more than anything else.

Anyway, we decided it was high time we started leveraging cloud technologies in our products, so we created a DevOps team here in NI’s LabVIEW R&D department with a bunch of people that know what they’re doing, and started cranking on some SaaS products for our customers! We’ve delivered two and have announced a third that’s in progress.

Cloud Product #1: LabVIEW Web UI Builder

First out of the gate – LabVIEW Web UI Builder. It went 1.0 late last year. Go try it for free! It’s a Silverlight-based RIA “light” version of LabVIEW – you can visually program, interface with hardware and/or Web services. As internal demos we even had people write things like “Duck Hunt” and “Frogger” in it – it’s like Flash programming but way less of a pain in the ass. You can run in browser or out of browser and save your apps to the cloud or to your local box. It’s a “freemium” model – totally free to code and run your apps, but you have to pay for a license to compile your apps for deployment somewhere else – and that somewhere else can be a Web server like Apache or IIS, or it can be an embedded hardware target like a sensor node. The RIA approach means the UI can be placed on a very low footprint target because it runs in the browser, it just has to get data/interface with the control API of whatever it’s on.

It’s pretty snazzy. If you are curious about “graphical programming” and think it is probably BS, give it a spin for a couple minutes and see what you can do without all that “typing.”

A different R&D team wrote the Silverlight code, we wrote the back end Web services, did the cloud infrastructure, ops support structure, authentication, security, etc. It runs on Amazon Web Services.

Cloud Product #2: LabVIEW FPGA Compile Cloud

This one’s still in beta, but it’s basically ready to roll. For non-engineers, an FPGA (field programmable gate array) is essentially a rewritable chip. You get the speed benefits of being on hardware – not as fast as an ASIC, but way faster than running code on a general purpose computer – as well as being able to change the software later.

We have a version of LabVIEW, LabVIEW FPGA, used to target LabVIEW programs to an FPGA chip. Compilation of these programs can take a long time, usually a number of hours for complex designs. Furthermore, the software required for the compilation is large and getting more diverse as there are more and more chips out there (each pretty much has its own dedicated compiler).

So, cloud to the rescue. The FPGA Compile Cloud is a simple concept – when you hit ‘compile’ it just outsources the compile to a bunch of servers in the cloud instead of locking up your workstation for hours (assuming you’ve bought a subscription).  FPGA compilations have everything they need with them; there are no unique compile environments to set up or anything, so it’s very commoditizable.
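From the client’s point of view, that flow is conceptually just submit-and-poll. Here’s a toy stand-in for the service to show the shape of it – none of this is the real API, and the class, method names, and file name are all made up:

```python
import uuid

class CompileCloud:
    """Toy stand-in for a compile service: accepts a job, reports its status."""
    def __init__(self):
        self.jobs = {}

    def submit(self, fpga_design):
        """Queue a design for compilation and hand back a job handle."""
        job_id = str(uuid.uuid4())
        self.jobs[job_id] = "queued"   # a real worker pool would pick this up
        return job_id

    def status(self, job_id):
        # A real service reflects worker progress over hours; this toy
        # just flips queued jobs to done on the first poll.
        if self.jobs.get(job_id) == "queued":
            self.jobs[job_id] = "done"
        return self.jobs[job_id]

cloud = CompileCloud()
job = cloud.submit("my_design.vi")
print(cloud.status(job))  # done -- and your workstation was free the whole time
```

The key property is that submit returns immediately with a handle, so the hours-long compile happens on someone else’s hardware while you keep working.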

The back end for this isn’t as simple as the one for UI Builder, which is just cloud storage and load balanced compile servers – we had to implement custom scaling for the large and expensive compile workers, and it required more extensive monitoring, performance, and security work. It’s running on Amazon too. We got to reuse a large amount of the infrastructure we put in place for systems management and authentication for UI Builder.
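Our actual scaling logic has more moving parts than I can share, but the core of any policy like this is a function from queue depth to worker count, clamped at both ends. A hedged sketch – the numbers are illustrative, not ours:

```python
import math

def desired_workers(queue_depth, jobs_per_worker=2, min_workers=1, max_workers=20):
    """Size the compile-worker pool to the queue, clamped so we neither scale
    to zero (cold starts hurt when compiles take hours) nor run away on cost."""
    wanted = math.ceil(queue_depth / jobs_per_worker)
    return max(min_workers, min(max_workers, wanted))

print(desired_workers(0))    # 1
print(desired_workers(7))    # 4
print(desired_workers(100))  # 20
```

A control loop evaluates this against the queue every few minutes and starts or stops expensive compile instances to close the gap; the clamps are what keep a burst of submissions from becoming a burst of spend.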

Cloud Product #3: Technical Data Cloud

It’s still in development, but we’ve announced it so I get to talk about it! The idea behind the Technical Data Cloud is that more and more people need to collect sensor data, but they don’t want to fool with the management of it. They want to plop some sensors down and have the acquired data “go to the cloud!” for storage, visualization, and later analysis. There are other folks doing this already, like the very cool Pachube (pronounced “patch-bay”, there’s a LabVIEW library for talking to it), and it seems everyone wants to take their sensors to the cloud, so we’re looking at making one that’s industrial strength.

For this one we are pulling out our big guns, our data specialist team in Aachen, Germany. We are also being careful to develop it in an open way – the primary interface will be RESTful HTTP Web services, though LabVIEW APIs and hardware links will of course be a priority.
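To make the RESTful idea concrete, a sensor push might look something like the following. The endpoint, field names, and units are purely illustrative – the real API isn’t public yet, so treat this as a sketch of the style, not the spec:

```python
import json

# Hypothetical payload a sensor node might POST to the data service.
reading = {
    "sensor_id": "thermocouple-42",
    "timestamp": "2011-05-01T12:00:00Z",
    "unit": "degC",
    "value": 23.7,
}
body = json.dumps(reading)

# A node would then POST `body` to something like
# https://data.example.com/v1/readings (URL is made up).
print(json.loads(body)["value"])  # 23.7
```

Because it’s plain HTTP and JSON, anything from a LabVIEW VI to a shell script can push readings – which is the whole point of leading with the REST interface.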

This one had a big technical twist for us – we’re implementing it on Microsoft Windows Azure, the MS guys’ cloud offering. Our org is doing a lot of .NET development and finding a lot of strategic alignment with Microsoft, so we thought we’d kick the tires on their cloud. I’m an old Linux/open source bigot and to be honest I didn’t expect it to make the grade, but once we got up to speed on it I found it was a pretty good bit of implementation. It did mean we had to significantly expand the underlying platform we are reusing for all these products – just supporting Linux and Windows instances in Amazon had already made us toss a lot of insufficiently open solutions in the garbage bin, and these two cloud worlds are very different as well.

How We Did It

I find nothing more instructive than finding out the details – organizational, technical, etc. – of how people really implement solutions in their own shops.  So in the interests of openness and helping out others, I’m going to do a series on how we did it!  I figure it’ll be in about three parts, most likely:

  • How We Did It: People
  • How We Did It: Process
  • How We Did It: Tools and Technologies

If there’s something you want to hear about when I cover these areas, just ask in the comments!  I can’t share everything, especially for unreleased products, but promise to be as open as I can without someone from Legal coming down here and Tasering me.

5 Comments

Filed under Cloud, DevOps