Category Archives: Cloud

Cloud computing and all its permutations.

The Real Lessons To Learn From The Amazon EC2 Outage

As I write this, our ops team is working furiously to bring up systems outside Amazon’s US East region to recover from the widespread outage they are having this morning. Naturally the Twitterverse, and in a day the blogosphere, and in a week the trade rag…axy? will be talking about how the cloud is unreliable and you can’t count on it for high availability.

And of course this is nonsense. These outages make the news because they hit everyone at once, but all the outages people are continually having at their own data centers are just as impactful, if less hit-generating in terms of news stories.

Sure, we’re not happy that some of our systems are down right now. But…

  1. The outage is only affecting one of our beta products; our production SaaS service is still running like a champ, it just can’t scale up right now.
  2. Our cloud product uptime is still way higher than our self-hosted uptime. We have network outages, Web system outages, etc. all the time even though we have three data centers and millions of dollars in network gear and redundant Internet links and redundant servers.

People always assume they’d have zero downtime if they were hosting it themselves. Or maybe that they could “make it get fixed” when it does go down. But that’s a false sense of security based on an archaic misconception that having things on premise gives you any more control over them. We run loads of on premise Web applications and a batch of cloud ones, and once we put some Keynote/Gomez/AlertSite monitoring against them we determined our cloud systems have much higher uptime.

Now, there are things that Amazon could do to make all this better for customers. In the Amazon SLAs, they say of course you can have super high uptime – if you are running redundantly across AZs and, in this case, regions. But Amazon makes it really unattractive and difficult to do this.

What Amazon Can Do Better

  1. We can work around this issue by bringing up instances in other regions (see the sketch after this list).  Sadly, we didn’t already have our AMIs transferred into those regions, and you can only bring up instances off AMIs that are already in those regions. And transferring AMIs between regions is a pain in the ass. There is absolutely zero reason Amazon doesn’t provide an API call to copy an AMI from region 1 to region 2. Bad on them. I emailed my Amazon account rep and just got back the top Google hits for “Amazon AMI region migrate”. Thanks, I did that already.
  2. We weren’t already running across multiple regions and AZs because of cost. Some of that is the cost of redundancy in and of itself, but more important is the hateful way Amazon does reserve pricing, which very much pushes you towards putting everything in one AZ.
  3. Also, redundancy only really works if you have everything, including data, in each AZ. If you are running redundant app servers across 4 AZs, but have your database in one of them – or have a database master in one and slaves in the others – you still get hosed when that one AZ goes down.
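
As a concrete example of the kind of cross-region preparedness we were missing, here is a minimal sketch of checking for and launching off an already-copied AMI in another region. It uses today’s boto3 library (which postdates this post), and the AMI naming pattern and instance type are made up for illustration.

```python
import boto3

TARGET_REGION = "us-west-1"
AMI_NAME_PATTERN = "our-appserver-base-*"   # hypothetical naming convention

ec2 = boto3.client("ec2", region_name=TARGET_REGION)

# Only AMIs already registered in the target region are launchable there,
# so this check needs to pass *before* the outage, not during it.
images = ec2.describe_images(
    Owners=["self"],
    Filters=[{"Name": "name", "Values": [AMI_NAME_PATTERN]}],
)["Images"]

if not images:
    raise SystemExit(f"No matching AMI in {TARGET_REGION}; copy one over ahead of time.")

# Launch a replacement instance off the newest matching image.
newest = max(images, key=lambda img: img["CreationDate"])
ec2.run_instances(
    ImageId=newest["ImageId"],
    InstanceType="m1.large",   # assumed; pick whatever the tier normally runs
    MinCount=1,
    MaxCount=1,
)
```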

Amazon needs to have tools that inherently let you distribute your stuff across their systems and needs to make their pricing/reserve strategy friendlier to doing things in what they say is “the right way.”

What We Could Do Better

We weren’t completely prepared for this. Once that region was already borked, it was impossible to migrate AMIs out of it, and there are so many ticky little region-specific things all through the Amazon config – security groups, ELBs, etc. – that doing it on the fly is not possible unless you have specifically done it before, and we hadn’t.

We have an automation solution (PIE) that will regen our entire cloud for us in a short amount of time, but it doesn’t handle the base images, some of which we modify and re-burn from the Amazon ones. We don’t have that process automated and the documentation was out of date since Fedora likes to move their crap around all the time.

In the end, Amazon started coming back just as we got new images done in us-west-1.  We’ll certainly work on automating that process, and hope that Amazon will also step up to making it easier for their customers to do so.

Leave a comment

Filed under Cloud, DevOps

Our Cloud Products And How We Did It

Hey, I’m not a sales guy, and none of us spend a lot of time on this blog pimping our company’s products, but we’re pretty proud of our work on them and I figured I’d toss them out there as use cases of what an enterprise can do in terms of cloud products if they get their act together!

Some background.  Currently all the agile admins (myself, Peco, and James) work together in R&D at National Instruments.  It’s funny, we used to work together on the Web Systems team that ran the ni.com Web site, but then people went their own ways to different teams or even different companies. Then we decided to put the dream team back together to run our new SaaS products.

About NI

National Instruments (hereafter, NI) is a 5000+ person global company that makes hardware and software for test & measurement, industrial control, and graphical system design. Real Poindextery engineering stuff. Wireless sensors and data acquisition, embedded and real-time, simulation and modeling. Our stuff is used to program the Lego Mindstorms NXT robots as well as control CERN’s Large Hadron Collider. When a crazed highlander whacks a test dummy on Deadliest Warrior and Max the techie looks at readouts of the forces generated, we are there.

About LabVIEW

Our main software product is LabVIEW.  Despite being an electrical engineer by degree, I never used LabVIEW in school (this was a very long time ago, I’ll note; most programs use it nowadays), so it wasn’t till I joined NI that I saw it in action. It’s a graphical dataflow programming language. I assumed that was BS when I heard it. I had so many companies try to sell me “graphical” programming over the years, like all those crappy 4GLs back in the ’90s, that I figured that was just an unachieved myth. But no, it’s a real visual programming language that’s worked like a champ for more than 20 years. In certain ways it’s very bad ass: it does parallelism for you and can be compiled and dropped onto an FPGA. It’s remained niche-ey and hasn’t been widely adopted outside the engineering world, however, due to company focus more than anything else.

Anyway, we decided it was high time we started leveraging cloud technologies in our products, so we created a DevOps team here in NI’s LabVIEW R&D department with a bunch of people that know what they’re doing, and started cranking on some SaaS products for our customers! We’ve delivered two and have announced a third that’s in progress.

Cloud Product #1: LabVIEW Web UI Builder

First out of the gate – LabVIEW Web UI Builder. It went 1.0 late last year. Go try it for free! It’s a Silverlight-based RIA “light” version of LabVIEW – you can visually program, interface with hardware and/or Web services. As internal demos we even had people write things like “Duck Hunt” and “Frogger” in it – it’s like Flash programming but way less of a pain in the ass. You can run in browser or out of browser and save your apps to the cloud or to your local box. It’s a “freemium” model – totally free to code and run your apps, but you have to pay for a license to compile your apps for deployment somewhere else – and that somewhere else can be a Web server like Apache or IIS, or it can be an embedded hardware target like a sensor node. The RIA approach means the UI can be placed on a very low footprint target because it runs in the browser, it just has to get data/interface with the control API of whatever it’s on.

It’s pretty snazzy. If you are curious about “graphical programming” and think it is probably BS, give it a spin for a couple minutes and see what you can do without all that “typing.”

A different R&D team wrote the Silverlight code, we wrote the back end Web services, did the cloud infrastructure, ops support structure, authentication, security, etc. It runs on Amazon Web Services.

Cloud Product #2: LabVIEW FPGA Compile Cloud

This one’s still in beta, but it’s basically ready to roll. For non-engineers, an FPGA (field programmable gate array) is essentially a rewritable chip. You get the speed benefits of being on hardware – not as fast as an ASIC but way faster than running code on a general purpose computer – as well as being able to change the software later.

We have a version of LabVIEW, LabVIEW FPGA, used to target LabVIEW programs to an FPGA chip. Compilation of these programs can take a long time, usually a number of hours for complex designs. Furthermore, the software required for the compilation is large and getting more diverse as there are more and more chips out there (each pretty much has its own dedicated compiler).

So, cloud to the rescue. The FPGA Compile Cloud is a simple concept – when you hit ‘compile’ it just outsources the compile to a bunch of servers in the cloud instead of locking up your workstation for hours (assuming you’ve bought a subscription).  FPGA compilations have everything they need with them; there are no unique compile environments to set up or anything, so it’s very commoditizable.

The back end for this isn’t as simple as the one for UI Builder, which is just cloud storage and load-balanced compile servers – we had to implement custom scaling for the large and expensive compile workers, and it required more extensive monitoring, performance, and security work. It’s running on Amazon too. We got to reuse a large amount of the infrastructure we put in place for systems management and authentication for UI Builder.
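
For flavor, here is a toy sketch of the kind of scaling decision involved for expensive compile workers – scale the fleet to the compile queue, within cost guardrails. All of the names and numbers are hypothetical; our actual logic lives in our internal tooling.

```python
def desired_compile_workers(queue_depth, jobs_per_worker=2,
                            min_workers=2, max_workers=40):
    """Pick a worker count proportional to queued compiles, clamped to guardrails."""
    needed = -(-queue_depth // jobs_per_worker)   # ceiling division
    return max(min_workers, min(max_workers, needed))

# 17 queued compiles, workers that each handle ~2 concurrently -> 9 workers
print(desired_compile_workers(17))
```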

Cloud Product #3: Technical Data Cloud

It’s still in development, but we’ve announced it so I get to talk about it! The idea behind the Technical Data Cloud is that more and more people need to collect sensor data, but they don’t want to fool with the management of it. They want to plop some sensors down and have the acquired data “go to the cloud!” for storage, visualization, and later analysis. There are other folks doing this already, like the very cool Pachube (pronounced “patch-bay”, there’s a LabVIEW library for talking to it), and it seems everyone wants to take their sensors to the cloud, so we’re looking at making one that’s industrial strength.

For this one we are pulling out our big guns, our data specialist team in Aachen, Germany. We are also being careful to develop it in an open way – the primary interface will be RESTful HTTP Web services, though LabVIEW APIs and hardware links will of course be a priority.

This one had a big technical twist for us – we’re implementing it on Microsoft Windows Azure, the MS guys’ cloud offering. Our org is doing a lot of .NET development and finding a lot of strategic alignment with Microsoft, so we thought we’d kick the tires on their cloud. I’m an old Linux/open source bigot and to be honest I didn’t expect it to make the grade, but once we got up to speed on it I found it was a pretty good bit of implementation. It did mean we had to significantly expand the underlying platform we are reusing for all these products – just supporting Linux and Windows instances in Amazon already made us toss a lot of insufficiently open solutions in the garbage bin, and these two cloud worlds are very different as well.

How We Did It

I find nothing more instructive than finding out the details – organizational, technical, etc. – of how people really implement solutions in their own shops.  So in the interests of openness and helping out others, I’m going to do a series on how we did it!  I figure it’ll be in about three parts, most likely:

  • How We Did It: People
  • How We Did It: Process
  • How We Did It: Tools and Technologies

If there’s something you want to hear about when I cover these areas, just ask in the comments!  I can’t share everything, especially for unreleased products, but promise to be as open as I can without someone from Legal coming down here and Tasering me.

5 Comments

Filed under Cloud, DevOps

Why Amazon Reserve Instances Torment Me

We’ve been using over 100 Amazon EC2 instances for a year now, but I’ve just now made my first reserve instance purchase. For the untutored, reserve instances are where you pay a yearly upfront fee per instance and you get a much, much lower hourly cost. On its face, it’s a good deal – take a normal Large instance you’d use for a database.  For a Linux one, it’s $0.34 per hour.  Or you can pay $910 up front for the year, and then it’s only $0.12 per hour. So theoretically, it takes your yearly cost from $2,978.40 to $1,961.20.  A great deal, right?

Well, not so much. The devil is in the details.

First of all, you have to make sure you’re actually running all those instances all the time.  If you buy a reserve instance and then don’t use it some of the time, you immediately start cutting into your savings.  The crossover is at 172 days – if you don’t run the instance at least 172 days out of the year then you are going upside down on the deal.
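
The arithmetic behind those two numbers, as a quick sanity check (prices are the Large/Linux figures quoted above):

```python
HOURS_PER_YEAR = 365 * 24                 # 8760
on_demand_hourly = 0.34                   # $/hr, Large Linux instance
reserved_hourly = 0.12                    # $/hr with a 1-year reservation
upfront = 910.00                          # 1-year reservation fee

year_on_demand = on_demand_hourly * HOURS_PER_YEAR              # $2,978.40
year_reserved = upfront + reserved_hourly * HOURS_PER_YEAR      # $1,961.20

# Break-even: hours of actual use before the $910 upfront pays for itself.
breakeven_hours = upfront / (on_demand_hourly - reserved_hourly)  # ~4136 hours
print(year_on_demand, year_reserved, breakeven_hours / 24)        # ~172 days
```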

But what’s the big deal, you ask?  Sure, in the cloud you are probably (and should be!) scaling up and down all the time, but as long as you reserve up to your low water mark it should work out, right?

So the big second problem is that when you reserve instances, you have to specify everything about that instance.  You aren’t reserving “10 instances”, or even “10 large instances” – you have to specify:

  • Platform (UNIX/Linux, UNIX/Linux VPC, SUSE Linux, Windows, Windows VPC, or Windows with SQL Server)
  • Instance Type (m1.small, etc.)
  • AZ (e.g. us-east-1b)

And tenancy and term. So you have to reserve “a small multitenant Linux instance in us-east-1b for one year.” But having to specify down to this level is really problematic in any kind of dynamic environment.
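
You can see every one of those dimensions baked into the reservation purchase in the EC2 API. A hedged sketch with today’s boto3 (which postdates this post); the instance type, AZ, and platform echo the examples in this post:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Every dimension here is locked in for the term of the reservation.
offerings = ec2.describe_reserved_instances_offerings(
    InstanceType="m1.large",
    AvailabilityZone="us-east-1b",
    ProductDescription="Linux/UNIX",
    InstanceTenancy="default",
)["ReservedInstancesOfferings"]

for o in offerings:
    print(o["ReservedInstancesOfferingId"], o["Duration"], o["FixedPrice"])

# Actually buying locks all of the above in for the term, e.g.:
# ec2.purchase_reserved_instances_offering(
#     ReservedInstancesOfferingId=offerings[0]["ReservedInstancesOfferingId"],
#     InstanceCount=10,
# )
```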

Let’s say you buy 10 m1.large instances for your databases, and then you realize later you really need to move up to an m1.xlarge.  Well, tough. You can, but if you don’t have 10 other things to run on those larges, you lose money. Or if you decide to change OS.  One of our biggest expenditures is our compile farm workers, and on those we hope to move from Windows to Linux once we get the software issues worked out, and we’re experimenting with best cost/performance on different instance sizes. I’m effectively blocked from buying reserve for those, since if I do it’ll put a stop to our ability to innovate.

And more subtly, let’s say you’re doing dynamic scaling and splitting across AZs like they always say you should do for availability purposes.  Well, if I’m running 20 instances, and scaling them across 1b and 1c, I am not guaranteed I’m always running 10 in 1b and 10 in 1c; it’s more random than that.  Instead of buying 20 reserve, you instead have to buy, say, 7 in 1b and 7 in 1c, to make sure you don’t end up losing money.
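
A back-of-the-envelope way to see why 7 per AZ rather than 10: if the split between the two AZs is roughly random, reserving the low-water mark per AZ keeps the reservations busy almost all the time, while reserving 10 each leaves something idle most of the time. This little simulation treats each instance as a coin flip, which is an oversimplification of any real scaling policy:

```python
import random

def fully_used_fraction(reserve_per_az, instances=20, trials=100_000):
    """Fraction of samples in which both AZs have at least `reserve_per_az` running."""
    hits = 0
    for _ in range(trials):
        in_1b = sum(random.random() < 0.5 for _ in range(instances))
        in_1c = instances - in_1b
        if in_1b >= reserve_per_az and in_1c >= reserve_per_az:
            hits += 1
    return hits / trials

print(fully_used_fraction(7))    # roughly 0.88 - reservations almost always covered
print(fully_used_fraction(10))   # roughly 0.18 - usually some reservation sits idle
```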

Heck, they even differentiate between Linux and SUSE and Linux VPC instances, which clearly crosses over into annoyingly picky territory.

As a result of all this, it is pretty undesirable to buy reserve instances unless you have a very stable environment, both technically and scale-wise. That sentence doesn’t describe the typical cloud use case in my opinion.

I understand, obviously, why they are doing this.  From a capacity planning standpoint, it’s best for them if they make you specify everything. But what I don’t think they understand is that this cuts into people willing to buy reserve, and reserve is not only upfront money but also a lockin period, which should be grotesquely attractive to a business. I put off buying reserve for a year because of this, and even now that I’ve done it I’m not buying near as many reserve as I could be because I have to hedge my bets against ANY changes to my service. It seems to me that this also degrades the alleged point of reserves, which is capacity planning – if you’re so picky about it that no one buys reserve and 95% of your instances are on demand, then you can’t plan real well can you?

What Amazon needs to do is meet customers halfway.  It’s all a probabilities game anyway. They lose specificity of each given reserve request, but get many more reserve requests (and all the benefits they convey – money, lockin, capacity planning info) in return.

Let’s look at each axis of inflexibility and analyze it.

  • Size.  Sure, they have to allocate machines, right?  But I assume they understand they are using this thing called “virtualization.”  If I want to trade in 20 reserved small instances for 5 large instances (each large is 4x a small), why not?  It loses them nothing to allow this. They just have to make the effort to allow it to happen in their console/APIs. I can understand needing to reserve a certain number of “units” but those should be flexible on exact instance types at a given time.
  • OS. Why on God’s green earth do I need to specify OS?  Again, virtualized right? Is it so they can buy enough Windows licenses from Microsoft?  Cry me a river.  This one needs to leave immediately and never come back.
  • AZ. This is annoying from the user POV but probably the most necessary from the Amazon POV, because they have to put enough hardware in each data center, right?  I do think they should try to make this a per-region and not a per-AZ limit, so I’m just reserving “us-east” in Virginia and not the specific AZ; that would accommodate all my use cases.

In the end, one of the reasons people move to the cloud in the first place is to get rid of the constraints of hardware.  When Amazon just puts those constraints back in place, it becomes undesirable. Frankly even now, I tried to just pay Amazon up front rather than actually buy reserve, but they’re not really enterprise friendly yet from a finance point of view so I couldn’t make that happen, so in the end I reluctantly bought reserve.  The analytics around it are lacking too – I can’t look in the Amazon console and see “Yes, you’re using all 9 of your large/linux/us-east-1b instances.”

Amazon keeps innovating greatly in the technology space but in terms of the customer interaction space, they need a lot of help – they’re not the only game in town, and people with less technically sophisticated but more aggressive customer service/support options will erode their market share. I can’t help but think that Rackspace’s play of backing OpenStack is the purest example of this – “Anyone can run the same cloud we do, but we are selling our ‘fanatical support’” is the message.

8 Comments

Filed under Cloud

DevOps In Action Podcast

While at SXSW, I recorded an episode of the IT Management and Cloud Podcast with Michael Coté and John Willis. You can find it on Coté’s People Over Process blog! Sorry about the background noise, we were doing it in the upstairs bar at the Driskill.

Leave a comment

Filed under Cloud, Conferences, DevOps

DevOps at CloudCamp at SXSWi!

Isn’t that some 3l33t jargon.  Anyway, Dave Nielsen is holding a CloudCamp here in Austin during SXSW Interactive on Monday, March 14 (followed by the Cloudy Awards on March 15) and John Willis of Opscode and I reached out to him, and he has generously offered to let us have a DevOps meetup in conjunction.

John’s putting together the details but is also traveling, so this is going to be kind of emergent.  If you’re in Austin (normally or just for SXSW) and are interested in cloud, DevOps, etc., come on out to the CloudCamp (it’s free and no SXSW badge required) and also participate in the DevOps meetup!

In fact, if anyone’s interested in doing short presentations or show-and-tells or whatnot, please ping John (@botchagalupe) or me (@ernestmueller)!

Leave a comment

Filed under Cloud, Conferences, DevOps

Amazon CloudFormation: Model Driven Automation For The Cloud

You may have heard about Amazon’s newest offering they announced today, CloudFormation.  It’s the new hotness, but I see a lot of confusion in the Twitterverse about what it is and how it fits into the landscape of IaaS/PaaS/Elastic Beanstalk/etc. Read what Werner Vogels says about CloudFormation and its uses first, but then come back here!

Allow me to break it down for you and explain why this is such a huge leverage point for cloud developers.

What Has Come Before

Up till now on Amazon you could configure up a single virtual image the way you wanted it, with an AMI. You could even kind of construct a scalable tier of similar systems using Auto Scaling, by defining Launch Configurations. But if you wanted to construct an entire multitier system it was a lot harder.  There are automated configuration management tools like Chef and Puppet out there, but their recipes/models tend to be oriented around getting a software loadout on an existing system, not the actual system provisioning – in general they come from the older assumption that you have someone doing that on probably-physical systems using bcfg2 or cobbler or vagrant or something.

So what were you to do if you wanted to bring up a simple three tier system, with a Web tier, app server tier, and database tier?  Either you had to set them up and start them manually, or you had to write code against the Amazon APIs to explicitly pull up what you wanted. Or you had to use a third party provisioning provider like RightScale or EngineYard that would let you define that kind of model in their Web consoles but not construct your own model programmatically and upload it. (I’d like my product functionality in my own source control and not your GUI, thanks.)

Now, recently Amazon launched Elastic Beanstalk, which is way over on the PaaS side of things, similar to Google App Engine.  “Just upload your application and we’ll run it and scale it, you don’t have to worry about the plumbing.” Of course this sharply limits what you can do, and doesn’t address the question of “what if my overall system consists of more than just one Java app running in Beanstalk?”

If your goal is full model driven automation to achieve “infrastructure as code,” none of these solutions are entirely satisfactory. I understand CloudFormation deeply because we went down that same path and developed our own system model ourselves as a response!

I’ll also note that this is very similar to what Microsoft Azure does.  Azure is a hybrid IaaS/PaaS solution – their marketing tries to say it’s more like Beanstalk or Google App Engine, but in reality it’s more like CloudFormation – you have an XML file that describes the different roles (tiers) in the system, defines what software should go on each, and lets you control the entire system as a unit.

So What Is CloudFormation?

Basically CloudFormation lets you model your Amazon cloud-based system in JSON and then provision and control it as a unit.  So in our use case of a three tier system, you would model it up in their JSON markup and then CloudFormation would understand that the whole thing is a unit.  See their sample template for a WordPress setup. (A mess more sample templates are here.)

Review the WordPress template; it lets you define the AMIs and instance types, what the security group and ELB setups should be, the RDS database back end, and feed in variables that’ll be used in the consuming software (like WordPress username/password in this case).

Once you have your template you can tell Amazon to start your “stack” in the console! It’ll even let you hook it up to an SNS notification that’ll let you know when it’s done. You name the whole stack, so you can distinguish between your “dev” environment and your “prod” environment for example, as opposed to the current state of the Amazon EC2 console where you get to see a big list of instance IDs – they added tagging that you can try to use for this, but it’s kinda wonky.
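
The console isn’t the only way in, either – the same template can be driven from code. Here’s a minimal sketch using today’s boto3 (which postdates this post): a toy two-resource template in the spirit of the WordPress sample, with the resource names, AMI ID, and stack name all made up for illustration.

```python
import json
import boto3

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Parameters": {
        "InstanceType": {"Type": "String", "Default": "m1.small"}
    },
    "Resources": {
        "WebSecurityGroup": {
            "Type": "AWS::EC2::SecurityGroup",
            "Properties": {
                "GroupDescription": "Web tier",
                "SecurityGroupIngress": [
                    {"IpProtocol": "tcp", "FromPort": 80, "ToPort": 80, "CidrIp": "0.0.0.0/0"}
                ],
            },
        },
        "WebServer": {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "ImageId": "ami-12345678",                     # hypothetical AMI
                "InstanceType": {"Ref": "InstanceType"},
                "SecurityGroups": [{"Ref": "WebSecurityGroup"}],
            },
        },
    },
}

cfn = boto3.client("cloudformation", region_name="us-east-1")
cfn.create_stack(
    StackName="dev-webstack",
    TemplateBody=json.dumps(template),
    Parameters=[{"ParameterKey": "InstanceType", "ParameterValue": "m1.small"}],
)
```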

Why Do I Want This Again?

Because a system model lets you do a number of clever automation things.

Standard Definition

If you’ve been doing Amazon yourself, you’re used to there being a lot of stuff you have to do manually.  From system build to system build even you do it differently each time, and God forbid you have multiple techies working on the same Amazon system. The basic value proposition of “don’t do things manually” is huge.  You configure the security groups ONCE and put it into the template, and then you’re not going to forget to open port 23 AGAIN next time you start a system. A core part of what DevOps is realizing as its value proposition is treating system configuration as code that you can source control, fix bugs in and have them stay fixed, etc.

And if you’ve been trying to automate your infrastructure with tools like Chef, Puppet, and ControlTier, you may have been frustrated in that they address single systems well, but do not really model “systems of systems” worth a damn.  Via new cloud support in knife and stuff you can execute raw “start me a cloud server” commands but all that nice recipe stuff stops at the box level and doesn’t really extend up to provisioning and tracking parts of your system.

With the CloudFormation template, you have an actual asset that defines your overall system.  This definition:

  • Can be controlled in source control
  • Can be reviewed by others
  • Is authoritative, not documentation that could differ from the reality
  • Can be automatically parsed/generated by your own tools (this is huge)

It’s also nicely transparent; when you go to the console and look at the stack it shows you the history of events, the template used to start it, the startup parameters it used… Moving away from the “mystery meat” style of system config.

Coordinated Control

With CloudFormation, you can start and stop an entire environment with one operation. You can say “this is the dev environment” and be able to control it as a unit. I assume at some point you’ll be able to visualize it as a unit, right now all the bits are still stashed in their own tabs (and I notice they don’t make any default use of their own tagging, which makes it annoying to pick out what parts are from that stack).

This is handy for not missing stuff on startup and teardown… A couple weeks ago I spent an hour deleting a couple hundred rogue EBSes we had left over after a load test.

And you get some status eventing – one of the most painful parts of trying to automate against Amazon is the whole “I started an instance, I guess I’ll sit around and poll and try to figure out when the damn thing has come up right.”  In CloudFormation you get events that tell you when each part and then the whole are up and ready for use.
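
In code terms, that’s the difference between a hand-rolled polling loop against EC2 and something like the following – a sketch using today’s boto3 (which postdates this post), with the stack name carried over from the earlier example:

```python
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

# Wait on the stack as a unit instead of polling individual instances.
cfn.get_waiter("stack_create_complete").wait(StackName="dev-webstack")

# And if something goes sideways, the per-resource event history is right there.
for event in cfn.describe_stack_events(StackName="dev-webstack")["StackEvents"]:
    print(event["Timestamp"], event["LogicalResourceId"], event["ResourceStatus"])
```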

What It Doesn’t Do

It’s not a config management tool like Chef or Puppet. Except for what you bake onto your AMI it has zero software config capabilities, or command dispatch capabilities like Rundeck or mcollective or Fabric. Although it should be a good integration point with those tools.

It’s not a PaaS solution like Beanstalk or GAE; you use those when you just have an app you want to deploy to something that’ll run it.  Now, it does erode some use cases – it makes a middle point between “run it all yourself and love the complexity” and “forget configurable system bits, just use PaaS.”  It allows easy reusability, say having a systems guy develop the template and then a dev use it over and over again to host their app, but with more customization than the pure-play PaaSes provide.

It’s not quite like OVF, which is more fiddly and about virtually defining the guts of a single machine than defining a set of systems with roles and connections.

Competitive Analysis

It’s very similar to Microsoft Azure’s approach with their .cscfg and .csdef files which are an analogous XML model – you really could fairly call this feature “Amazon implements Azure on Amazon” (just as you could fairly call Elastic Beanstalk “Amazon implements Google App Engine on Amazon”.) In fact, the Azure Fabric has a lot more functionality than the primitive Amazon events in this first release. Of course, CloudFormation doesn’t just work on Windows, so that’s a pretty good width vs depth tradeoff.

And it’s similar to something like a RightScale, and ideally will encourage them to let customers actually submit their own definition instead of the current clunky combo of ServerArrays and ServerTemplates (curl or Web console?  Really? Why not a model like this?). RightScale must be in a tizzy right now, though really just integrating with this model should be easy enough.

Where To From Here?

As I alluded, we actually wrote our own tool like this internally called PIE that we’re looking to open source because we were feeling this whole problem space keenly.  XML model of the whole system, Apache Zookeeper-based registry, kinda like CloudFormation and Azure. Does CloudFormation obsolete what we were doing?  No – we built it because we wanted a model that could describe cloud systems on multiple clouds and even on premise systems. The Amazon model will only help you define Amazon bits, but if you are running cross-cloud or hybrid it is of limited value. And I’m sure model visualization tools will come, and a better registry/eventing system will come, but we’re way farther down that path at least at the moment.

Also, the differentiation between “provisioning tools” that control and start systems like CloudFormation and bcfg2 and “configuration” tools that control and start software like Chef and Puppet (and some people even differentiate between those and “deploy” tools that control and start applications like Capistrano) is a false dichotomy. I’m all about the “toolchain” approach but at some point you need a toolbelt. This tool differentiation is one of the more harmful “Dev vs Ops” differentiations.

I hope that this move shows the value of system modeling and helps people understand we need an overarching model that can be used to define it all, not just “Amazon” vs “Azure” or “system packages” vs “developed applications” or “UNIX vs Windows…” True system automation will come from a UNIVERSAL model that can be used to reason about and program to your on premise systems, your Amazon systems, your Azure systems, your software, your apps, your data, your images and files…

Conclusion

You need to understand CloudFormation, because it is one of the most foundational, highest-leverage changes AWS has come out with in some time. I don’t bother to blog about most of the cool new AWS features – they’re cool and I enjoy them, but this one is part of a more revolutionary change in the way systems are managed, the whole DevOps thing.

7 Comments

Filed under Cloud, DevOps

Inside Microsoft Azure

Recently, I delivered a presentation at the Austin Cloud User Group introducing them to Microsoft Azure.  I’m a UNIX bigot and have been doing the Amazon Cloud and open source thing, but we are delivering a product via Azure next so our team is learning it.  It’s actually quite interesting and has a number of good points; it’s mainly hindered by the Microsoft marketing message trying to pretend it’s all magical fairy dust instead of clearly explaining what it is and what it can do. So if you want to hear what Azure is in straight shooting UNIX admin speak, check it out!

1 Comment

Filed under Cloud

Application Performance Management in the Cloud

Cloud computing has been the buzz and hype lately and everybody is trying to understand what it is and how to use it. In this post, I wanted to explore some of the properties of “the cloud” as they pertain to Application Performance Management. If you are new to cloud offerings, here are a few good materials to get started, but I will assume the reader of this post is somewhat familiar with cloud technology.

As Spiderman would put it, “With great power comes great responsibility” … Cloud abstracts some of the infrastructure parts of your system and gives you the ability to scale up resource on demand. This doesn’t mean that your applications will magically work better or understand when they need to scale, you still need to worry about measuring and managing the performance of your applications and providing quality service to your customers. In fact, I would argue APM is several times more important to nail down in a cloud environment for several reasons:

  1. The infrastructure your apps live in is more dynamic and volatile. For example, cloud servers are VMs, and the resources given to them by the hypervisor are not constant, which will impact your application. Also, VMs may crash or lock up and not leave much trace of what caused the issues with your application.
  2. Your applications can grow bigger and more complex as they have more resources available in the cloud. This is great but it will expose bottlenecks that you didn’t anticipate. If you can’t narrow down the root cause, you can’t fix it. For this, you need scalable APM instrumentation and precise analytics.
  3. On-demand scaling is a feature that is touted by all cloud providers, but it all hinges on one really hard problem that they won’t solve for you: understanding how your application performance and workload behave so that you can properly instruct the cloud scaling APIs what to do. If my performance drops, do I need more machines? How many? APM instrumentation and analytics will play a key part of how you solve the on-demand scaling problem (see the sketch after this list).
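
To make that last point concrete, here is a toy version of the decision the scaling APIs can’t make for you: turning an APM measurement (say, p95 latency) into a desired node count. The policy, thresholds, and numbers are all made up for illustration; a real one depends entirely on your workload.

```python
def desired_web_nodes(p95_latency_ms, current_nodes,
                      target_ms=300.0, min_nodes=2, max_nodes=50):
    """Naive proportional policy: grow in proportion to how far over target
    latency is, shrink slowly when comfortably under it."""
    if p95_latency_ms > target_ms:
        proposed = current_nodes * (p95_latency_ms / target_ms)
    elif p95_latency_ms < 0.5 * target_ms:
        proposed = current_nodes - 1
    else:
        proposed = current_nodes
    return int(max(min_nodes, min(max_nodes, round(proposed))))

print(desired_web_nodes(450, current_nodes=8))   # over target -> 12
print(desired_web_nodes(120, current_nodes=8))   # well under -> 7
```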

So where is APM for the cloud, you ask? It is in its very humble beginnings as APM solution providers have to solve the same gnarly problems application developers and operations teams are struggling with:

  1. Cloud is dynamic and there are no fixed resources. IP addresses, hostnames, and MAC addresses change, among other things. Properly addressing, naming and connecting the machines is a challenge – both for deep instrumentation agent technology and for the apps themselves.
  2. Most vendors’ licensing agreements are based on billing for a fixed count of resources rather than a utility model, and are therefore difficult to use in a cloud environment. Charging a markup on the utility bill seems to be a common approach among the cloud-literate, but it introduces a kink in the regular budgeting process, and no one wants to write a variable but sizable blank check.
  3. Well, the cloud scales… a lot. The instrumentation/monitoring technology has to have low enough overhead and be able to support hundreds of machines.

On the synthetic monitoring end there are plenty of options. We selected AlertSite, a distributed synthetic monitoring SaaS provider similar to Keynote and Gomez, for simplicity, but it only provides high-level performance and availability numbers for SLA management and “keep the lights on” alerting. We also rely on Cloudkick, another SaaS provider, for the system monitoring part. Here is a more detailed use case on the Cloudkick implementation.

For deeper instrumentation we have worked with OPNET to deploy their AppInternals Xpert (formerly OPNET Panorama) solution in the Amazon EC2 cloud environment. We have already successfully deployed AppInternals Xpert on our own server infrastructure to provide deep instrumentation and analysis of our applications. The cloud version looks very promising for tackling the technical challenges introduced by the cloud, and once fully deployed, we will have a lot of capability to harness cloud scale and performance. More on this as it unfolds…

In summary, as vendors sell you the cloud, be prepared to tackle the traditional infrastructure concerns and stick to your guns. No, the cloud is not fixing your performance problems and is not going to magically scale for you. You will need some form of APM to help you there. Be on the lookout for providers, and do challenge your existing partners to think of how they can help you. The APM space in the cloud is where traditional infrastructure was in 2005, but things are getting better.

To the cloud!!! (And do keep your performance engineers on staff, they will still be saving your bacon.)

Peco

Leave a comment

Filed under Cloud

salesforce.com – Time To Start Caring?

So this week, salesforce.com (the world’s #4 fastest growing company, says Fortune Magazine) bought Ruby PaaS shop Heroku, announced database.com as a cloud database solution, and announced remedyForce, IT config management from BMC.

That’s quite the hat trick. salesforce.com has been the 900 lb gorilla in the closet for a while now; they’ve been hugely successful and have put a lot of good innovation on their systems but so far force.com, their PaaS solution, has been militantly “for existing CRM customers.” This seems like an indication of preparation to move into the general PaaS market and if they do I think they’ll be a force to reckon with – experience, money, and a proven track record of innovation. NI doesn’t use salesforce.com (“Too expensive” I’m told) so I’ve kept them on the back burner in terms of my attention but I’m guessing in 2011 they will come into the pretty meager PaaS space and really kick some ass.

Because for PaaS – what do we have really? Google App Engine, and in traditional Google fashion they pooped it out, put some minimum functionality on it, and wandered off. (We tried it once, determined that it disallowed half the libraries we needed to use, and gave up.) Microsoft Azure, which is really a hybrid IaaS/PaaS – you don’t get to ignore the virtual server aspect, and you have to do infrastructure stuff like monitoring and scaling instances yourself. And of course Heroku. And VMWare’s Cloud Foundry thing for Java, but VMWare is having a bizarrely hard time doing anything right in the cloud – talk about parlaying leadership in a nearby sector unsuccessfully. I have no idea why they’re executing so slowly, but they are.  Even that, Salesforce seems to be doing better, with VMForce (a salesforce + vmware collaboration on Java PaaS).

In the end, most of us want PaaS – managing the plumbing is uninteresting, as long as it can manage itself well – but it’s hard, and no one has it nailed yet.  I hope salesforce.com does move into the general-use space; I’d hate for them to be buying up good players and only using them to advance their traditional business.

Leave a comment

Filed under Cloud

Austin Cloud User Group Nov 17 Meeting Notes

This month’s ACUG meeting was cooler than usual – instead of having one speaker talk on a cloud-related topic, we had multiple group members do short presentations on what they’re actually doing in the cloud.  I love talks like that; it’s where you get real rubber-meets-the-road takeaways.

I thought I’d share my notes on the presentations.  I’ll write up the one we did separately, but I got a lot out of these:

  1. OData to the Cloud, by Craig Vyal from Pervasive Software
  2. Moving your SaaS from Colo to Cloud, by Josh Arnold from Arnold Marziani (previously of PeopleAdmin)
  3. DevOps and the Cloud, by Chris Hilton from Thoughtworks
  4. Moving Software from On Premise to SaaS, by John Mikula from Pervasive Software
  5. The Programmable Infrastructure Environment, by Peco Karayanev and Ernest Mueller from National Instruments (see next post!)

My editorial comments are in italics.  Slides are linked into the headers where available.

OData to the Cloud

OData was started by Microsoft (“but don’t hold that against it”) under the Open Specification Promise.  Craig did an implementation of it at Pervasive.

It’s a RESTful protocol for CRUDdy GET/POST/DELETE of data.  Uses AtomPub-based feeds and returns XML or JSON.  You get the schema and the data in the result.

You can create an OData producer of a data source, consume OData from places that support it, and view it via stuff like iPhone/Android apps.

Current producers – SharePoint, SQL Azure, Netflix, eBay, twitpic, Open Gov’t Data Initiative, Stack Overflow

Current consumers – PowerPivot in Excel, Sesame, Tableau.  Libraries for Java (OData4J), .NET 4.0/Silverlight 4, OData SDK for PHP

It is easier for a “business user” to consume than SOAP or plain REST.  Craig used OData4J to create a producer for the Pervasive product.

Questions from the crowd:

Compression/caching?  Nothing built in.  Though normal HTTP level compression would work I’d think. It does “page” long lists of results and can send a section of n results at a time.
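
As a quick illustration of what consuming an OData feed with that paging looks like, here’s a hedged sketch in Python using the standard $top/$skip query options; the endpoint URL and the exact shape of the JSON envelope vary by producer, so both are assumptions here.

```python
import requests

BASE = "https://example.com/odata.svc/Readings"   # hypothetical OData producer

def fetch_all(page_size=50):
    """Page through an OData collection a section at a time."""
    skip = 0
    while True:
        resp = requests.get(
            BASE,
            params={"$format": "json", "$top": page_size, "$skip": skip},
            timeout=30,
        )
        resp.raise_for_status()
        payload = resp.json()
        # OData v2 JSON responses are wrapped in a "d" envelope; collections
        # are usually under "results", but producers differ.
        body = payload.get("d", payload)
        rows = body.get("results") if isinstance(body, dict) else body
        if not rows:
            return
        yield from rows
        skip += page_size

for row in fetch_all():
    print(row)
```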

Auth? Your problem.  Some people use Oauth.  He wrote a custom glassfish basic HTTP auth portal.

Competition?  Gdata is kinda like this.

Seems to me it’s one part REST, one part “making you have a DTD for your XML”.  Which is good!  We’re very interested in OData for our data centric services coming up.

Moving your SaaS from Colo to Cloud

Josh Arnold was from PeopleAdmin; now he’s a tech recruiter, but he can speak to what they did before he left.  PeopleAdmin was a SunGard-type colo setup.  Had a “rotting” out-of-country DR site.

They were rewriting their stack from java/mssql to ruby/linux.

At the time they were spending $15k/mo on the colo (not including the cost of their HW).  The Amazon estimated cost was 1/3 of that, but really they found out after moving it’s more like 1/2.  What was the surprise cost?  Lower-than-expected performance (disk I/O) forced more instances than physical boxes of equivalent “size.”

Flexible provisioning and autoscaling was great, the colo couldn’t scale fast enough.  How do you scale?

The cloud made having an out of country DR site easy, and not have it rot and get old.

Question: What did you lose in the move?  We were prepared for mental “control issues” so didn’t have those.  There’s definitely advanced functionality (e.g. with firewalls) and native hardware performance you lose, but that wasn’t much.

They evalled Rackspace and Amazon (cursory eval).  They had some F5s they wanted to use and the ability to mix in real hardware was tempting but they mainly went straight to Amazon.  Drivers were the community around it and their leadership in the space.

Timeline was 2 years (rewrite app, slowly migrate customers).  It’ll be more like 3-4 before it’s done.  There were issues where they were glad they didn’t mass migrate everyone at once.

Technical challenges:

Performance was a little lax (disk performance, they think) and they ended up needing more servers.  Used tricks like RAIDed EBSes to try to get the most I/O they could (mainly for the databases).

Every customer had an SSL cert, and they had 600 of them to mess with.  That was a problem because of the 5 Elastic IP limit.  Went to certs that allow subsidiary domains – Digicert allowed 100 per cert (other CAs limit you to much less) so they could get 100 customers per IP.

App servers did outbound LDAP connections to customer premises for auth integration, and customers usually allowed those in via IP rules in their corporate firewalls, but on Amazon outbound IPs are dynamic.  They set up a proxy with a static (elastic) IP to route all that through.

Rightscale – they used it.  They like it.

They used nginx for the load balancing, SSL termination.  Was a single point of failure though.

Remember that many of the implementations you are hearing about now were started back before Rackspace had an API, before Amazon had load balancers, etc.

In discussion about hybrid clouds, the point was brought up a lot of providers talk about it – gogrid, opsource, rackspace – but often there are gotchas.

DevOps and the Cloud

Chris Hilton from Thoughtworks is all about the DevOps, and works on stuff like continuous deployment for a living.

DevOps is:

  • collaboration between devs and operations staff
  • agile sysadmin, using agile dev tools
  • dev/ops/qa integration to achieve business goals

Why DevOps?

Silos.  Agile dev broke down the wall between dev and QA (and the biz).

Devs are usually incentivized for change, and ops are incentivized for stability, which creates an innate conflict.

But if both are incentivized to deliver business value instead…

DevOps Practices

  • version control!
  • automated provisioning and deployment (Puppet/chef/rPath)
  • self healing
  • monitoring infra and apps
  • identical environments dev/test/prod
  • automated db mgmt

Why DevOps In The Cloud?

Cloud requires automation; DevOps provides automation.

References

  • Continuous Delivery by Humble and Farley
  • “Rapid and Reliable Releases” on InfoQ
  • Refactoring Databases by Ambler and Sadalage

Another tidbit: they’re writing a Puppet-lite in PowerShell to fill the tool gap – some tool suppliers are starting to address this, but the general degree of tool support for people who use both Windows and Linux is shameful.

Moving Software from On Premise to SaaS

John Mikula of Pervasive tells us about the Pervasive Data Cloud.  They wanted to take their on premise “Data Integrator” product, basically a command line tool ($, devs needed to implement), to a wider audience.

Started 4 years ago.  They realized that the data sources they’re connecting to and pumping to, like QuickBooks Online, Salesforce, etc., are all SaaS from the get-go.   “Well, let’s make our middle part the same!”

They wrote a Java EE wrapper and put it on Rackspace colo initially.

It gets a customer’s metadata, puts it on a queue, and another system takes it off and processes it.  A very scaling-friendly architecture.  And Rackspace (colo) wasn’t scaling fast enough, so they moved it to Amazon.
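
That queue-worker shape is worth a sketch. This is not Pervasive’s implementation (they tried SQS, found it limiting, and moved to Apache ZooKeeper, as noted below) – just a minimal illustration of the pattern using boto3 and SQS, with a made-up queue name and job payload:

```python
import json
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = sqs.create_queue(QueueName="integration-jobs")["QueueUrl"]

# Front end: take the customer's job metadata and park it on the queue.
sqs.send_message(
    QueueUrl=queue_url,
    MessageBody=json.dumps(
        {"customer": "acme", "source": "quickbooks", "target": "salesforce"}
    ),
)

# Worker: pull jobs off and process them; add workers to add throughput.
while True:
    msgs = sqs.receive_message(
        QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=20
    )
    for msg in msgs.get("Messages", []):
        job = json.loads(msg["Body"])
        print("processing", job)          # the real integration work goes here
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```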

Their initial system had 2 Glassfish front ends and 25 workers.

For queuing, they tried Amazon SQS but found it limiting, then went to Apache ZooKeeper.

First effort was about “deploy a single app” – namely Salesforce/QuickBooks integration.  Then they made a domain-specific model, refactored, and made an API to manage the domain-specific entities so new apps could be created easily.

Recommended approach – solve easy problems and work from there.  That’s more than enough for people to buy in.

Their core engine’s not designed for multitenancy – they have batches of workers for one customer’s code – so that code can be unsafe, but it’s in its own bucket and doesn’t mess up anyone else.

Changing internal business processes in a mature company was a challenge – moving from a perpetual license model to per-month billing, just with accounting and whatnot, was a big long hairy deal.

Making the API was rough.  His estimate of a couple months grew to 6.  Requirements gathering was a problem, very iterative.  They weren’t agile enough – they only had one interim release and it wasn’t really usable; if they did it again they’d do the agile ‘right thing’ of putting out usable milestones more frequently to see what worked and what people really needed.

In Closing

Whew!  I found all the presentations really engaging and thank everyone for sharing the nuts and bolts of how they did it!

Leave a comment

Filed under Cloud