Tag Archives: amazon

Our Cloud Products And How We Did It

Hey, I’m not a sales guy, and none of us spend a lot of time on this blog pimping our company’s products, but we’re pretty proud of our work on them and I figured I’d toss them out there as use cases of what an enterprise can do in terms of cloud products if they get their act together!

Some background.  Currently all the agile admins (myself, Peco, and James) work together in R&D at National Instruments.  It’s funny, we used to work together on the Web Systems team that ran the ni.com Web site, but then people went their own ways to different teams or even different companies. Then we decided to put the dream team back together to run our new SaaS products.

About NI

Some background.  National Instruments (hereafter, NI) is a 5000+ person global company that makes hardware and software for test & measurement, industrial control, and graphical system design. Real Poindextery engineering stuff. Wireless sensors and data acquisition, embedded and real-time, simulation and modeling. Our stuff is used to program the Lego Mindstorms NXT robots as well as control CERN’s Large Hadron Collider. When a crazed highlander whacks a test dummy on Deadliest Warrior and Max the techie looks at readouts of the forces generated, we are there.

About LabVIEW

Our main software product is LabVIEW.  Despite being an electrical engineer by degree, we never used LabVIEW in school (this was a very long time ago, I’ll note, most programs use it nowadays), so it wasn’t till I joined NI I saw it in action. It’s a graphical dataflow programming language. I assumed that was BS when I heard it. I had so many companies try to sell be “graphical” programming over the years, like all those crappy 4GLs back in the ‘9o’s, that I figured that was just an unachieved myth. But no, it’s a real visual programming language that’s worked like a champ for more than 20 years. In certain ways it’s very bad ass, it does parallelism for you and can be compiled and dropped onto a FPGA. It’s remained niche-ey and hasn’t been widely adopted outside the engineering world, however, due to company focus more than anything else.

Anyway, we decided it was high time we started leveraging cloud technologies in our products, so we created a DevOps team here in NI’s LabVIEW R&D department with a bunch of people that know what they’re doing, and started cranking on some SaaS products for our customers! We’ve delivered two and have announced a third that’s in progress.

Cloud Product #1: LabVIEW Web UI Builder

First out of the gate – LabVIEW Web UI Builder. It went 1.0 late last year. Go try it for free! It’s a Silverlight-based RIA “light” version of LabVIEW – you can visually program, interface with hardware and/or Web services. As internal demos we even had people write things like “Duck Hunt” and “Frogger” in it – it’s like Flash programming but way less of a pain in the ass. You can run in browser or out of browser and save your apps to the cloud or to your local box. It’s a “freemium” model – totally free to code and run your apps, but you have to pay for a license to compile your apps for deployment somewhere else – and that somewhere else can be a Web server like Apache or IIS, or it can be an embedded hardware target like a sensor node. The RIA approach means the UI can be placed on a very low footprint target because it runs in the browser, it just has to get data/interface with the control API of whatever it’s on.

It’s pretty snazzy. If you are curious about “graphical programming” and think it is probably BS, give it a spin for a couple minutes and see what you can do without all that “typing.”

A different R&D team wrote the Silverlight code, we wrote the back end Web services, did the cloud infrastructure, ops support structure, authentication, security, etc. It runs on Amazon Web Services.

Cloud Product #2: LabVIEW FPGA Compile Cloud

This one’s still in beta, but it’s basically ready to roll. For non-engineers, a FPGA (field programmable gate array) is essentially a rewritable chip. You get the speed benefits of being on hardware – not as fast as an ASIC but way faster than running code on a general purpose computer – as well as being able to change the software later.

We have a version of LabVIEW, LabVIEW FPGA, used to target LabVIEW programs to an FPGA chip. Compilation of these programs can take a long time, usually a number of hours for complex designs. Furthermore the software required for the compilation is large and getting more diverse as there’s more and more chips out there (each pretty much has its own dedicated compiler).

So, cloud to the rescue. The FPGA Compile Cloud is a simple concept – when you hit ‘compile’ it just outsources the compile to a bunch of servers in the cloud instead of locking up your workstation for hours (assuming you’ve bought a subscription).  FPGA compilations have everything they need with them, there’s not unique compile environments to set up or anything, so it’s very commoditizable.

The back end for this isn’t as simple as the one for UI Builder, which is just cloud storage and load balanced compile servers – we had to implement custom scaling for the large and expensive compile workers, and it required more extensive monitoring, performance, and security work. It’s running on Amazon too. We got to reuse a large amount of the infrastructure we put in place for systems management and authentication for UI Builder.

Cloud Product #3: Technical Data Cloud

It’s still in development, but we’ve announced it so I get to talk about it! The idea behind the Technical Data Cloud is that more and more people need to collect sensor data, but they don’t want to fool with the management of it. They want to plop some sensors down and have the acquired data “go to the cloud!” for storage, visualization, and later analysis. There are other folks doing this already, like the very cool Pachube (pronounced “patch-bay”, there’s a LabVIEW library for talking to it), and it seems everyone wants to take their sensors to the cloud, so we’re looking at making one that’s industrial strength.

For this one we are pulling our our big guns, our data specialist team in Aachen, Germany. We are also being careful to develop it in an open way – the primary interface will be RESTful HTTP Web services, though LabVIEW APIs and hardware links will of course be a priority.

This one had a big technical twist for us – we’re implementing it on Microsoft Windows Azure, the MS guys’ cloud offering. Our org is doing a lot of .NET development and finding a lot of strategic alignment with Microsoft, so we thought we’d kick the tires on their cloud. I’m an old Linux/open source bigot and to be honest I didn’t expect it to make the grade, but once we got up to speed on it I found it was a pretty good bit of implementation. It did mean we had to do significant expansion of our underlying platform we are reusing for all these products – just supporting Linux and Windows instance in Amazon already made us toss a lot of insufficiently open solutions in the garbage bin, and these two cloud worlds are very different as well.

How We Did It

I find nothing more instructive than finding out the details – organizational, technical, etc. – of how people really implement solutions in their own shops.  So in the interests of openness and helping out others, I’m going to do a series on how we did it!  I figure it’ll be in about three parts, most likely:

  • How We Did It: People
  • How We Did It: Process
  • How We Did It: Tools and Technologies

If there’s something you want to hear about when I cover these areas, just ask in the comments!  I can’t share everything, especially for unreleased products, but promise to be as open as I can without someone from Legal coming down here and Tasering me.

5 Comments

Filed under Cloud, DevOps

Why Amazon Reserve Instances Torment Me

We’ve been using over 100 Amazon EC2 instances for a year now, but I’ve just now made my first reserve instance purchase. For the untutored, reserve instances are where you pay a yearly upfront per instance and you get a much, much lower hourly cost. On its face, it’s a good deal – take a normal Large instance you’d use for a database.  For a Linux one, it’s $0.34 per hour.  Or you can pay $910 up front for the year, and then it’s only $0.12 per hour. So theoretically, it takes your yearly cost from $2978.40 to $1961.2.  A great deal right?

Well, not so much. The devil is in the details.

First of all, you have to make sure and be running all those instances all the time.  If you buy a reserve instance and then don’t use it some of the time, you immediately start cutting into your savings.  The crossover is at 172 days – if you don’t run the instance at least 172 days out of the year then you are going upside down on the deal.

But what’s the big deal, you ask?  Sure, in the cloud you are probably (and should be!) scaling up and down all the time, but as long as you reserve up to your low water mark it should work out, right?

So the big second problem is that when you reserve instances, you have to specify everything about that instance.  You aren’t reserving “10 instances”, or even “10 large instances” – you have to specify:

  • Platform (UNIX/Linux, UNIX/Linux VPC, SUSE Linux, Windows, Windows VPC, or Windows with SQL Server)
  • Instance Type (m1.small, etc.)
  • AZ (e.g. us-east-1b)

And tenancy and term. So you have to reserve “a small multitenant Linux instance in us-east-1b for one year.” But having to specify down to this level is really problematic in any kind of dynamic environment.

Let’s say you buy 10 m1.large instances for your databases, and then you realize later you really need to move up to an m1.xlarge.  Well, tough. You can, but if you don’t have 10 other things to run on those larges, you lose money. Or if you decide to change OS.  One of our biggest expenditures is our compile farm workers, and on those we hope to move from Windows to Linux once we get the software issues worked out, and we’re experimenting with best cost/performance on different instance sizes. I’m effectively blocked from buying reserve for those, since if I do it’ll put a stop to our ability to innovate.

And more subtly, let’s say you’re doing dynamic scaling and splitting across AZs like they always say you should do for availability purposes.  Well, if I’m running 20 instances, and scaling them across 1b and 1c, I am not guaranteed I’m always running 10 in 1b and 10 in 1c, it’s more random than that.  Instead of buying 20 reserve, you instead have to buy say 7 in 1b and 7 in 1c, to make sure you don’t end up losing money.

Heck, they even differentiate between Linux and Suse and Linux VPC instances, which clearly crosses over into annoyingly picky territory.

As a result of all this, it is pretty undesirable to buy reserve instances unless you have a very stable environment, both technically and scale-wise. That sentence doesn’t describe the typical cloud use case in my opinion.

I understand, obviously, why they are doing this.  From a capacity planning standpoint, it’s best for them if they make you specify everything. But what I don’t think they understand is that this cuts into people willing to buy reserve, and reserve is not only upfront money but also a lockin period, which should be grotesquely attractive to a business. I put off buying reserve for a year because of this, and even now that I’ve done it I’m not buying near as many reserve as I could be because I have to hedge my bets against ANY changes to my service. It seems to me that this also degrades the alleged point of reserves, which is capacity planning – if you’re so picky about it that no one buys reserve and 95% of your instances are on demand, then you can’t plan real well can you?

What Amazon needs to do is meet customers halfway.  It’s all a probabilities game anyway. They lose specificity of each given reserve request, but get many more reserve requests (and all the benefits they convey – money, lockin, capacity planning info) in return.

Let’s look at each axis of inflexibility and analyze it.

  • Size.  Sure, they have to allocate machines, right?  But I assume they understand they are using this thing called “virtualization.”  If I want to trade in 20 reserved small instances for 5 large instances (each large is 4x a small), why not?  It loses them nothing to allow this. They just have to make the effort to allow it to happen in their console/APIS. I can understand needing to reserve a certain number of “units” but those should be flexible on exact instance types at a given time.
  • OS. Why on God’s green earth do I need to specify OS?  Again, virtualized right? Is it so they can buy enough Windows licenses from Microsoft?  Cry me a river.  This one needs to leave immediately and never come back.
  • AZ. This is annoying from the user POV but probably the most necessary from the Amazon POV because they have to put enough hardware in each data center, right?  I do think they should try to make this a per region and not a per AZ limit, so I’m just reserving “us-east” in Virginia and not the specific AZ, that would accommodate all my use cases.

In the end, one of the reasons people move to the cloud in the first place is to get rid of the constraints of hardware.  When Amazon just puts those constraints back in place, it becomes undesirable. Frankly even now, I tried to just pay Amazon up front rather than actually buy reserve, but they’re not really enterprise friendly yet from a finance point of view so I couldn’t make that happen, so in the end I reluctantly bought reserve.  The analytics around it are lacking too – I can’t look in the Amazon console and see “Yes, you’re using all 9 of your large/linux/us-east-1b instances.”

Amazon keeps innovating greatly in the technology space but in terms of the customer interaction space, they need a lot of help – they’re not the only game in town, and people with less technically sophisticated but more aggressive customer services/support options will erode their market share. I can’t help that Rackspace’s play with backing OpenStack seems to be the purest example of this – “Anyone can run the same cloud we do, but we are selling our ‘fanatical support'” is the message.

8 Comments

Filed under Cloud

Amazon CloudFormation: Model Driven Automation For The Cloud

You may have heard about Amazon’s newest offering they announced today, CloudFormation.  It’s the new hotness, but I see a lot of confusion in the Twitterverse about what it is and how it fits into the landscape of IaaS/PaaS/Elastic Beanstalk/etc. Read what Werner Vogels says about CloudFormation and its uses first, but then come back here!

Allow me to break it down for you and explain why this is such a huge leverage point for cloud developers.

What Has Come Before

Up till now on Amazon you could configure up a single virtual image the way you wanted it, with an AMI. You could even kind of construct a scalable tier of similar systems using Auto Scaling, by defining Launch Configurations. But if you wanted to construct an entire multitier system it was a lot harder.  There are automated configuration management tools like chef and puppet out there, but their recipes/models tend to be oriented around getting a software loadout on an existing system, not the actual system provisioning – in general they come from the older assumption you have someone doing that on probably-physical systems using bcfg2 or cobber or vagrant or something.

So what were you to do if you wanted to bring up a simple three tier system, with a Web tier, app server tier, and database tier?  Either you had to set them up and start them manually, or you had to write code against the Amazon APIs to explicitly pull up what you wanted. Or you had to use a third party provisioning provider like RightScale or EngineYard that would let you define that kind of model in their Web consoles but not construct your own model programmatically and upload it. (I’d like my product functionality in my own source control and not your GUI, thanks.)

Now, recently Amazon launched Elastic Beanstalk, which is more way over on the PaaS side of things, similar to Google App Engine.  “Just upload your application and we’ll run it and scale it, you don’t have to worry about the plumbing.” Of course this sharply limits what you can do, and doesn’t address the question of “what if my overall system consists of more than just one Java app running in Beanstalk?”

If your goal is full model driven automation to achieve “infrastructure as code,” none of these solutions are entirely satisfactory. I understand CloudFormation deeply because we went down that same path and developed our own system model ourselves as a response!

I’ll also note that this is very similar to what Microsoft Azure does.  Azure is a hybrid IaaS/PaaS solution – their marketing tries to say it’s more like Beanstalk or Google App Engine, but in reality it’s more like CloudFormation – you have an XML file that describes the different roles (tiers) in the system, defines what software should go on each, and lets you control the entire system as a unit.

So What Is CloudFormation?

Basically CloudFormation lets you model your Amazon cloud-based system in JSON and then provision and control it as a unit.  So in our use case of a three tier system, you would model it up in their JSON markup and then CloudFormation would understand that the whole thing is a unit.  See their sample template for a WordPress setup. (A mess more sample templates are here.)

Review the WordPress template; it lets you define the AMIs and instance types, what the security group and ELB setups should be, the RDS database back end, and feed in variables that’ll be used in the consuming software (like WordPress username/password in this case).

Once you have your template you can tell Amazon to start your “stack” in the console! It’ll even let you hook it up to a SNS notification that’ll let you know when it’s done. You name the whole stack, so you can distinguish between your “dev” environment and your “prod” environment for example, as opposed to the current state of the Amazon EC2 console where you get to see a big list of instance IDs – they added tagging that you can try to use for this, but it’s kinda wonky.

Why Do I Want This Again?

Because a system model lets you do a number of clever automation things.

Standard Definition

If you’ve been doing Amazon yourself, you’re used to there being a lot of stuff you have to do manually.  From system build to system build even you do it differently each time, and God forbid you have multiple techies working on the same Amazon system. The basic value proposition of “don’t do things manually” is huge.  You configure the security groups ONCE and put it into the template, and then you’re not going to forget to open port 23 AGAIN next time you start a system. A core part of what DevOps is realizing as its value proposition is treating system configuration as code that you can source control, fix bugs in and have them stay fixed, etc.

And if you’ve been trying to automate your infrastructure with tools like Chef, Puppet, and ControlTier, you may have been frustrated in that they address single systems well, but do not really model “systems of systems” worth a damn.  Via new cloud support in knife and stuff you can execute raw “start me a cloud server” commands but all that nice recipe stuff stops at the box level and doesn’t really extend up to provisioning and tracking parts of your system.

With the CloudFormation template, you have an actual asset that defines your overall system.  This definition:

  • Can be controlled in source control
  • Can be reviewed by others
  • Is authoritative, not documentation that could differ from the reality
  • Can be automatically parsed/generated by your own tools (this is huge)

It’s also nicely transparent; when you go to the console and look at the stack it shows you the history of events, the template used to start it, the startup parameters it used… Moving away from the “mystery meat” style of system config.

Coordinated Control

With CloudFormation, you can start and stop an entire environment with one operation. You can say “this is the dev environment” and be able to control it as a unit. I assume at some point you’ll be able to visualize it as a unit, right now all the bits are still stashed in their own tabs (and I notice they don’t make any default use of their own tagging, which makes it annoying to pick out what parts are from that stack).

This is handy for not missing stuff on startup and teardown… A couple weeks ago I spent an hour deleting a couple hundred rogue EBSes we had left over after a load test.

And you get some status eventing – one of the most painful parts of trying to automate against Amazon is the whole “I started an instance, I guess I’ll sit around and poll and try to figure out when the damn thing has come up right.”  In CloudFront you get events that tell you when each part and then the whole are up and ready for use.

What It Doesn’t Do

It’s not a config management tool like Chef or Puppet. Except for what you bake onto your AMI it has zero software config capabilities, or command dispatch capabilities like Rundeck or mcollective or Fabric. Although it should be a good integration point with those tools.

It’s not a PaaS solution like Beanstalk or GAE; you use those when you just have an app you want to deploy to something that’ll run it.  Now, it does erode some use cases – it makes a middle point between “run it all yourself and love the complexity” and “forget configurable system bits, just use PaaS.”  It allows easy reusability, say having a systems guy develop the template and then a dev use it over and over again to host their app, but with more customization than the pure-play PaaSes provide.

It’s not quite like OVF, which is more fiddly and about virtually defining the guts of a single machine than defining a set of systems with roles and connections.

Competitive Analysis

It’s very similar to Microsoft Azure’s approach with their .cscfg and .csdef files which are an analogous XML model – you really could fairly call this feature “Amazon implements Azure on Amazon” (just as you could fairly call Elastic Beanstalk “Amazon implements Google App Engine on Amazon”.) In fact, the Azure Fabric has a lot more functionality than the primitive Amazon events in this first release. Of course, CloudFormation doesn’t just work on Windows, so that’s a pretty good width vs depth tradeoff.

And it’s similar to something like a RightScale, and ideally will encourage them to let customers actually submit their own definition instead of the current clunky combo of ServerArrays and ServerTemplates (curl or Web console?  Really? Why not a model like this?). RightScale must be in a tizzy right now, though really just integrating with this model should be easy enough.

Where To From Here?

As I alluded, we actually wrote our own tool like this internally called PIE that we’re looking to open source because we were feeling this whole problem space keenly.  XML model of the whole system, Apache Zookeeper-based registry, kinda like CloudFormation and Azure. Does CloudFormation obsolete what we were doing?  No – we built it because we wanted a model that could describe cloud systems on multiple clouds and even on premise systems. The Amazon model will only help you define Amazon bits, but if you are running cross-cloud or hybrid it is of limited value. And I’m sure model visualization tools will come, and a better registry/eventing system will come, but we’re way farther down that path at least at the moment. Also, the differentiation between “provisioning tools” that control and start systems like CloudFormation and bcfg2 and “configuration” tools that control and start software like Chef and Puppet (and some people even differentiate between those and “deploy” tools that control and start applications like Capistrano) is a false dichotomy. I’m all about the “toolchain” approach but at some point you need a toolbelt. This tool differentiation is one of the more harmful “Dev vs Ops” differentiations.

I hope that this move shows the value of system modeling and helps people understand we need an overarching model that can be used to define it all, not just “Amazon” vs “Azure” or “system packages” vs “developed applications” or “UNIX vs Windows…” True system automation will come from a UNIVERSAL model that can be used to reason about and program to your on premise systems, your Amazon systems, your Azure systems, your software, your apps, your data, your images and files…

Conclusion

You need to understand CloudFormation, because it is one of the most foundational changes that will have a lot of leverage that AWS has come out with in some time. I don’t bother to blog about most of the cool new AWS features, because they are cool and I enjoy them but this is part of a more revolutionary change in the way systems are managed, the whole DevOps thing.

7 Comments

Filed under Cloud, DevOps

Our First Cloud Product Released!

Hey all, I just wanted to take a moment to share with you that our first cloud-based product just went live!  LabVIEW Web UI Builder is National Instruments’ first SaaS application.  It’s actually free to use, go to ni.com and “Try It Now”, all you have to do is make an account.  It’s a freemium model, so you can use it, save your code, run it, etc. all you want; we charge to get the “Build & Deploy” functionality that enables you to compile, download, and deploy the bundled app to an embedded device or whatnot.

Essentially it’s a Silverlight app (can be installed out of browser on your box or just launched off the site) that lets you graphically program test & measurement, control, and simulation type of programs.  You can save your programs to the cloud or locally to your own machine.  The programs can interact via Web services with anything, but in our case it’s especially interesting when they interact with data acquisition devices.  There’s some sample programs on that page that show what can be done, though those are definitely tuned to engineers…  We have apps internally that let you play frogger and duck hunt, or do the usual Web mashup kinds of things calling google maps apis.  So feel free and try out some graphical programming!

Cool technology we used to do this:

And it’s 100% DevOps powered.  Our implementation team consists of developers and sysadmins, and we built the whole thing using an agile development methodology.  All our systems are created by model-driven automation from assets and definitions in source control.  We’ll post more about the specifics now that we’ve gotten version 1 done!  (Of course, the next product is just about ready too…)

Leave a comment

Filed under Cloud, DevOps

Cloud Security: a chicken in every pot and a DMZ for every service

There are a couple military concepts that have bled into technology and in particular into IT security, a DMZ being one of them. A Demilitarized Zone (DMZ) is a concept where there is established control for what comes in and what goes out between two parties and in military terms you can establish a “line of control” by using a DMZ. In a human-based DMZ, controllers of the DMZ make ingress (incoming) and egress (outgoing) decisions based on an approved list–no one is allowed to pass unless they are on the list and have proper identification and approval.

In the technology world, the same thing is done based on traffic between computers. Decisions to allow or disallow the traffic can be made based on where the traffic came from (origination), where it is going (destination) or even dimensions of the traffic like size, length, or even time of day. The basic idea being that all traffic is analyzed and is either allowed or disallowed based on determined rules and just like in a military DMZ, there is a line of control where only approved traffic is allowed to ingress or egress. In many instances a DMZ will protect you from malicious activity like hackers and viruses but it also protects you from other configuration and developer errors and can guarantee that your production systems are not talking to test or development tiers.

Lets look at a basic web tiered architecture. In a corporation that hosts its own website they will more than likely have the following four components: incoming internet traffic, a web server, a database server and an internal network. To create a DMZ or multiple DMZ instances to handle their web traffic, they would want to make sure that someone from the internet could only talk to the web server, they would also want to verify that only the web server can only talk to the database server and they would want to make sure that their internal network is inaccessible by the the web server, database server and the internet traffic.

Using firewalls, you would need to set up at least the below three firewalls to adequately control the DMZ instances:

1. A firewall between the external internet and the web server
2. A firewall in front of the internal network
3. A firewall between your web servers and database server

Of course these firewalls need to be set so that they allow (or disallow) only certain types of traffic. Only traffic that meets certain rules based on its origination, destination, and its dimensions will be allowed.

Sounds great, right? The problem is that firewalls have become quite complicated and now sometimes aren’t even advertised as firewalls, but instead are billed as a network-device-that-does-anything-you-want-that-can-also-be-a-firewall-too. This is due in part to the hardware getting faster, IT budgets shrinking and scope creep. The “firewall” now handles VPN, traffic acceleration, IDS duties, deep packet inspection and making sure your employees aren’t watching YouTube videos when they should be working. All of those are great, but it causes firewalls to be expensive, fragile and difficult to configure.  And to have the firewall watching all ingress and egress points across your network, you usually have to buy several devices to scatter throughout your topology.

Additionally, another recurring problem is that most firewall analysis and implementation is done with an intent to secure the perimeter. Which make sense, but it often stops there and doesn’t protect the interior parts of your network. Most IT security firms that do consulting and penetration tests don’t generally come through the “front door” by that I mean, they don’t generally try to get in through the front-facing web servers but instead they will go through other channels such as dial-in, wireless, partner, third-party services, social engineering, FTP servers or that demo system that was setup back 5 years ago that no hasn’t been taken down–you know the one I am talking about. Once inside, if there are no well-defined DMZs, then it is pretty much game-over because at that point there are no additional security controls. A DMZ will not fix all your problems, but it will provide an extra layer of protection that could protect you from malicious activity. And like I mentioned earlier it can also help prevent configuration errors crossing from dev to test to prod.

In short, a DMZ is a really good idea and should be implemented for every system that you have. The most optimal DMZ would be a firewall in front of each service that applies rules to determine what traffic is allowed in and out. This, however, is expensive to set up and very rarely gets implemented. That was the old days, this is now and good news, the cloud has an answer.

I am most familiar with Amazon Web Services so we will include an example of how do this with security groups from an AWS perspective. The following code creates a web server group and a database server group and allows the web server to talk to the database on port 3306 only.

ec2-add-group web-server-sec-group -d "this is the group for web servers" #This creates the web server group with a description
ec2-add-group db-server-sec-group -d "this is the group for db server" #This creates the db server group with a description
ec2-authorize web-server-sec-group -P tcp -p 80 -s 0.0.0.0/0 #This allows external internet users to talk to the web servers on port 80 only
ec2-authorize db-server-sec-group --source-group web-server-sec-group --source-group-user AWS_ACCOUNT_NUMBER -P tcp -p 3306 #This allows only traffic from the web server group on port 3306 (mysql) to ingress

Under the above example the database server is in a DMZ and only traffic from the web servers are allowed to ingress into it. Additionally, the web server is in a DMZ in that it is protected from the internet on all other ports except for port 80. If you were to implement this for every role in your system, you would be in effect implementing a DMZ between each layer and would provide excellent security protection.

The cloud seems to get a bad rap in terms of security. But I would counter that in some ways the cloud is more secure since it lets you actually implement a DMZ for each service. Sure, they won’t do deep packet analysis or replace an intrusion detection system, but they will allow you to specifically define what ingresses and egresses are allowed on each instance.  We may not ever get a chicken in every pot, but with the cloud you can now put a DMZ on every service.

Leave a comment

Filed under Cloud, Security

Velocity 2010 – Day 3 Keynotes

Ohhh my aching head.  Apparently this is a commonly held problem, as the keynote hall is much more sparsely attended at 8:30 AM today than it was yesterday.  Some great fun last night, we hung with the Cloudkick and Turner Broadcasting guys, drinking and playing the “Can we name all the top level Apache projects” game.

It’s time for another morning of keynotes.  Presentations from yesterday should be appearing on their schedule pages (I link to the schedule page in each blog post, so they should only be a click away).  As always, my own comments below will be set off in italics to avoid libel suits.

First, we have John Rauser from Amazon on Creating Cultural Change.

Since we are technologists and problemsolvers, we of course tend to try to solve problems with technology – but many of our biggest, hardest problems are cultural.

It’s very true for Operations and performance because their goals are often in tension with other parts of the business.  Much like security, legal, and other “non-core” groups. And it’s easy to fall into an adversarial relationship when there are “two sides.”  Having a dedicated ops team is somewhat dangerous for this reason.  So you need to ingrain performance and ops into your org’s mentality.  Idyllic?  Maybe, but doable.

If you determine someone is a bad person, like the coffee “free riders” that take the last cup and don’t make more, you have a dilemma.  Complaining about “free riders” doesn’t work.  Nagging, shaming, etc. same deal  He had a friend that put in some humorous placards that marketed the “product” of making coffee.  And it worked.

Sasquatch dancing guy!  I don’t have the heart to go into it, just Google the video.  Anyway, people join in when there’s unabashed joy and social cover.  If you’re cranky, you add social cover for people to be cranky.  Welcome newcomers.  Lavish praise.  Help them succeed.  “Treat your beta testers as your most valuable resource and they will respond by becoming your most valuable resource.”

Shreddies vs Diamond Shreddies!  Rebranding and perception change.  Is DevOps our opportunity to turn “infrastructure people” into “agile admins?”

Anyway, be relentlessly happy and joyful.  I know I have positivity problems and I definitely look back and say “outcomes would have been better if I didn’t succumb to the temptation to bitch about those “dumbass programmers”…

1.  Try something new.  A little novelty gets through people’s mental filters.  If you’ve tried without metrics, try with.  If you’ve tried metrics, try business drivers.  If you tried that, pull out a single user session and simulate what it’s like.

2. Group identity.  Mark people as special.  Badges!  Authority!  Invite people to review their next project with an “ops expert” or “perf expert.”

3.  Be relentless.  Sending an email and waiting is a chump play.  And be relentlessly happy.

There was a lot of wisdom in this presentation.  As a former IT manager trying to run a team with complex relationships with other infrastructure teams, dev teams, and business teams, times where we hewed to this kind of theory things tended to work, and when we didn’t they tended not to.

Operations at Twitter by John Adams

What’s changed since they spoke last year?  They’ve made headway on Rails performance, more efficient use of Apache, many more servers, lbs, and people.  Up to 210 employees.

One of my questions about the devops plan is how to scale it – same problem agile development has.

More and more it’s API.  75% of the traffic to Twitter is API now.  160k registered apps, 100M searches a day, 65M tweets per day.

They’re trying to work on CM and other stuff.  Scaling doesn’t work the first time – you have to rebuild (refactor, in agile speak).  They’re doing that now.

Shortening Mean Time To Detect Problems drives shorter Mean Time To Recovery.

They continuously evaluate looking for bottlenecks – find the weakest part, fix it, and move on to the next in an iterative manner.

They are all about metrics using ganglia, and feed it to users on dev.twitter.com ans status.twitter.com.

Don’t be a “systems administrator” any more.  Combine statistical analysis and monitoring to produce meaningful results.  Make decisions based on data not gut instincts.

They’re working on low level profiling of Apache, Ruby, etc. Network –  Latency, network usage, memory leaks.  tcpdump + tcpdstat, yconalyze  Apps – Introspect with google perftools

Instrumenting the world pays off.  Data analysis and visualization are necessary skills nowadays.

Rails hasn’t really been their performance problem.  It’s front end problems like caching/cache invalidation problems, bad queries generated by ActiveQuery, garbage collection (20% of the issues!), and replication lag.

Analyze!  Turn data into information.  Understand where the code base is going.

Logging!  Syslog doesn’t work at scale.  No redundancy, no failure recovery.  And moving large files is painful.  They use scribe to HDFS/hadoop with LZO compression.

Dashboard – Theirs has “criticals” view (top 10 metrics), smokeping/mrtg, google analytics (not just for 200s!), XML feeds from managed services.

Whale Watcher – a shell script that looks for errors in the logs.

Change management is the stuff.  They use Reviewboard and puppet+svn.  Hundreds of modules, runs constantly.    It reuses tools that engineers use.

And Deploywatcher, another script that stops deploys if there’s system problems.  They work a lot on deploy.  Graph time of day next to CPU/latency.

They release features in “dark mode” and have on/off switches. Especially computationally/IO heavy stuff.  Changes are logged and reported to all teams (they have like 90 switches).  And they have a static/read-only mode and “emergency stop” button.

Subsystems!  Take a look at how we manage twitter.

loony – a central machine database in mySQL.  They use managed hosting so they’re always mapping names.  Python, django, paraminko SSH (twitter’s OSS SSH library).  Ties into LDAP.  When the data center sends mail, machine definitions are built in real time.  On demand changes with “run.”  Helps with deploy and querying.

murder – a bittorrent based deploy client, python+libtorrent.

memcached – network memory bus isn’t infinite.  Evictions make the cache unreliable for important configs.  They segment into pools for better performance.  Examine slab allocation and watch for high use/eviction with “peep. ”  Manage it like anything else.

Tiers – load balancer, apache, rails (unicorn), flock DB.

Unicorn is awesome and more awesome than mongrel for Rails guys.

Shift from Proxy Balancer to ProxyPass (the slide said PP to PB, but he spoke about it the other way, and putting our heads together we believe the latter more) Apache’s not better than nginx, it’s the proxy.

Aynchronous requests.  Do it.  Workers are expensive.  The request pipeline should not be used to handle third party communication or back end work.  Move long running work to daemons whenever possible.  They’re moving more parts to queuing

kestrel is their queuing server that looks like memcache.  set/get.

They have daemons, in fact have been doing consolidation (not one daemon per job, one per many jobs).

Flock DB shards their social graph through gizzard and stores it in mySQL.

Disk is the new tape

Caching – realtime but heck, 60 seconds is close.  Separate memcache pools for different types.  “Cache everything” is not the best policy – invalidation problems, cold memcache problems.  Use memcache to augment the database.  You don’t want to go down if you lose memcache, your db still needs to handle the load.

MySQL challenges – replication delay.  And social networks don’t fit RDBMS well. Kill long running SQL queries with mkill.  Fail fast!!!

He’s going over a lot of this too fast to write down but there’s GREAT info.  Look for the slides later.  It’s like drinking from a fire hose.

In closing –

  • Use CM early
  • log everything
  • plan to build everything more than once
  • instrument everything and use science

Tim Morrow from ShopZilla on Time Is Money

ShopZilla spoke previous years about their major performance redesign and the effects it had.  It opened their eyes to the $$ benefits and got them addicted to performance.

Performance is top in Fred Wilson’s 10 Golden Rules For Successful Web Apps.  And Google/Microsoft have put out great data this year on performance’s link to site metrics.

Performance can slip away if you take your eyes off the ball.  It doesn’t improve if left alone.

They took their eye off the ball because they were testing back end perf but not front end (and that’s where 80% of the time is spent).  Constant feature development runs it up.  A/B testing needs framework that adds JS and overhead.

It’s easier to attack performance early in the dev cycle, by infecting everyone with a performance mindset.

They put together a virtual performance team and went for more measurements.  Nightly testing using httpwatch and YSlow and stuff.  All these “one time” test tools really need to all be built into a regression rig.

They found they were doing some good things but found things they could fix.  Progressive rendering 8k chunks were too large and they set tomcat to smaller flush intervals.  Too many requests. Bandwidth contention with less critical page elements.

They wanted to focus on the perception of performance via progressive rendering, defer less important stuff.  Flushing faster got the header out quicker. They reordered stuff.   It improved their conversion rate by .4%.  Doesn’t’ sound like much but it’s comparable to a feature release, and they had a 2 month ROI given what they put into it.

They did an infrastructure audit looking for hotspots and underutilization, and saved a lot in future hardware costs ($480k).

Performance is an important feature, but isn’t free, but has a measurable value.  ROI!

Imad Mouline from Compuware on Performance Testing

Actually Gomez, which is now “the Web performance division of Compuware”.  He promises not to do a product pitch, but to share data.

Does better performance impact customer behavior and the bottom line?  They looked at their data.  Performance vs page abandonment.  If you improve your performance, abandon rate goes down by large percentages per second of speedup.

The Web is the new integration platform and the browser’s where the apps come together.  How many hosts are hit by the browser per user transaction on average?  8 or higher, across all industries and localities.

What percent of Web transactions touch Amazon EC2 for at least one object?  Like 20%.  It is going up hideously fast (like 4% in the last month).

Cloud performance concerns – loss of visibility and control especially because existing tools don’t work well.  And multitenant means “someone might jack me!”

They put the same app over many clouds and monitored it!  They have a nice graph that shows variance; I can’t find it with a quick Google search though, otherwise I’d link it here for you.  And availability across many clouds is about 99.5%.

How do you know if the problem is the cloud or you?  They put together cloudsleuth.net to show off these apps and performance.  You can put in your own URL and soon you’ll get data on “is the cloud messed up, or is it you?”

Domain sharding is a common performance optimization.  With S3 you get that “for free” using buckets’ DNS names and you can get a big performance speedup.

The cloud allows dynamic provisioning… Yeah we know.

But… Domain sharding fails to show a benefit on modern browsers.  In fact, it hurts.

At NI we recently did a whole analysis of various optimizations and found exactly this – domain sharding didn’t help and in fact hurt performance a bit.  We thought we might have been crazy or doing something wrong.  Apparently not.

They can see the significant performance differences among browsers/devices.

You have to test and validate your optimizations.  Older wisdom (like “shard domains”) doesn’t always hold any more.

Check some of this stuff out at gomez.com/velocity!

Coming up – lightning demos!

Leave a comment

Filed under Conferences, DevOps

Amazon Web Services – Convert To/From VMs?

In the recent Amazon AWS Newsletter, they asked the following:

Some customers have asked us about ways to easily convert virtual machines from VMware vSphere, Citrix Xen Server, and Microsoft Hyper-V to Amazon EC2 instances – and vice versa. If this is something that you’re interested in, we would like to hear from you. Please send an email to aws-vm@amazon.com describing your needs and use case.

I’ll share my reply here for comment!

This is a killer feature that allows a number of important activities.

1.  Product VMs.  Many suppliers are starting to provide third-party products in the form of VMs instead of software to ease install complexity, or in an attempt to move from a hardware appliance approach to a more-software approach.  This pretty much prevents their use in EC2.  <cue sad music>  As opposed to “Hey, if you can VM-ize your stuff then you’re pretty close to being able to offer it as an Amazon AMI or even SaaS offering.”  <schwing!>

2.  Leveraging VM Investments.  For any organization that already has a VM infrastructure, it allows for reduction of cost and complexity to be able to manage images in the same way.  It also allows for the much promised but under-delivered “cloud bursting” theory where you can run the same systems locally and use Amazon for excess capacity.  In the current scheme I could make some AMIs “mostly” like my local VMs – but “close” is not good enough to use in production.

3.  Local testing.  I’d love to be able to bring my AMIs “down to me” for rapid redeploy.  I often find myself having to transfer 2.5 gigs of software up to the cloud, install it, find a problem, have our devs fix it and cut another release, transfer it up again (2 hour wait time again, plus paying $$ for the transfer)…

4.  Local troubleshooting. We get an app installed up in the cloud and it’s not acting quite right and we need to instrument it somehow to debug.  This process is much easier on a local LAN with the developers’ PCs with all their stuff installed.

5.  Local development. A lot of our development exercises the Amazon APIs.  This is one area where Azure has a distinct advantage and can be a threat; in Visual Studio there is a “local Azure fabric” and a dev can write their app and have it running “in Azure” but on their machine, and then when they’re ready deploy it up.  This is slightly more than VM consumption, it’s VMs plus Eucalyptus or similar porting of the Amazon API to the client side, but it’s a killer feature.

Xen or VMWare would be fine – frankly this would be big enough for us I’d change virtualization solutions to the one that worked with EC2.

I just asked one of our developers for his take on value for being able to transition between VMs and EC2 to include in this email, and his response is “Well, it’s just a no-brainer, right?”  Right.

1 Comment

Filed under Cloud

Amazon EC2 EBS Instances and Ephemeral Storage

Here’s a couple tidbits I’ve gleaned that are useful.

When  you start an “instance-store” Amazon EC2 instance, you get a certain amount of ephemeral storage allocated and mounted automatically.  The amount of space varies by instance size and is defined here.  The storage location and format also varies by instance size and is defined here.

The upshot is that if you start an “instance-store” small Linux EC2 instance, it automagically has a free 150 GB /mnt disk and a 1 GB swap partition up and runnin’ for ya.  (mount points vary by image, but that’s where they are in the Amazon Fedora starter.)

[root@domU-12-31-39-00-B2-01 ~]# df -k
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda1             10321208   1636668   8160252  17% /
/dev/sda2            153899044    192072 145889348   1% /mnt
none                    873828         0    873828   0% /dev/shm
[root@domU-12-31-39-00-B2-01 ~]# free
total       used       free     shared    buffers     cached
Mem:       1747660      84560    1663100          0       4552      37356
-/+ buffers/cache:      42652    1705008
Swap:       917496          0     917496

But, you say, I am not old or insane!  I use EBS-backed images, just as God intended.  Well, that’s a good point.  But when you pull up an EBS image, these ephemeral disk areas are not available to you.  The good news is, that’s just by default.

The ephemeral storage is still available and can be used (for free!) by an EBS-backed image.  You just have to set the block devices up either explicitly when you run the instance or bake them into the image.

Runtime:

You refer to the ephemeral chunks as “ephemeral0”, “ephemeral1”, etc. – they don’t tell you explicitly which is which but basically you just count up based on your instance type (review the doc).  For a small image, it has an ephemeral0 (ext3, 15 GB) and an ephemeral1 (swap, 1 GB).  To add them to an EBS instance and mount them in the “normal” places, you do:

ec2-run-instances <ami id> -k <your key> --block-device-mapping '/dev/sda2=ephemeral0'
--block-device-mapping '/dev/sda3=ephemeral1'

On the instance you have to mount them – add these to /etc/fstab and mount -a or do whatever else it is you like to do:

/dev/sda3                 swap                    swap    defaults 0 0
/dev/sda2                 /mnt                    ext3    defaults 0 0

And if you want to turn the swap on immediately, “swapon /dev/sda3”.

Image:

You can also bake them into an image.  Add a fstab like the one above and when you create the image, do it like this, using the exact same –block-device-mapping flag:

ec2-register -n <ami id> -d "AMI Description" --block-device-mapping  /dev/sda2=ephemeral0
--block-device-mapping '/dev/sda3=ephemeral1' --snapshot your-snapname --architecture i386
--kernel<aki id>  --ramdisk <ari id>

Ta da. Free storage that doesn’t persist.  Very useful as /tmp space.  Opinion is split among the Linuxerati about whether you want swap space nowadays or not; some people say some mix of  “if you’re using more than 1.8 GB of RAM you’re doing it wrong” and “swapping is horrid, just let bad procs die due to lack of memory and fix them.”  YMMV.

Ephemeral EBS?

As another helpful tip, let’s say you’re adding an EBS to an image that you don’t want to be persistent when the instance dies.  By default, all EBSes are persistent and stick around muddying up your account till you clean them up.   If you don’t want certain EBS-backed drives to persist, what you do is of the form:

ec2-modify-instance-attribute --block-device-mapping "/dev/sdb=vol-f64c8e9f:true" i-e2a0b08a

Where ‘true’ means “yes, please, delete me when I’m done.”  This command throws a stack trace to the tune of

Unexpected error: java.lang.ClassCastException: com.amazon.aes.webservices.client.InstanceBlockDeviceMappingDescription
cannot be cast to com.amazon.aes.webservices.client.InstanceBlockDeviceMappingResponseDescription

But it works, that’s just a lame API tools bug.

8 Comments

Filed under Cloud, Uncategorized

Microsoft Azure for Dummies – or for Smarties?

What Is Microsoft Azure?

I’m going to attempt to explain Microsoft Azure in “normal Web person” language.  Like many of you, I am more familiar with Linux/open source type solutions, and like many of you, my first forays into cloud computing have been with Amazon Web Services.  It can often be hard for people not steeped in Redmondese to understand exactly what the heck they’re talking about when Microsoft people try to explain their offerings.  (I remember a time some years ago I was trying to get a guy to explain some new Microsoft data access thing with the usual three letter acronym name.  I asked, “Is it a library?  A language?  A protocol?  A daemon?  Branding?  What exactly is this thing you’re trying to get me to uptake?”  The reply was invariably “It’s an innovative new way to access data!”  Sigh.  I never did get an answer and concluded “Never mind.”)

Microsoft has released their new cloud offering, Azure.  Our company is a close Microsoft partner since we use a lot of their technologies in developing our company’s desktop software products, so as “cloud guy” I’ve gotten some in depth briefings and even went to PDC this year to learn more (some of my friends who have known me over the course of my 15 years of UNIX administration were horrified).  “Cloud computing” is an overloaded enough term that it’s not highly descriptive and it took a while to cut through the explanations to understand what Azure really is.  Let me break it down for you and explain the deal.

Point of Comparison: Amazon (IaaS)

In Amazon EC2, as hopefully everyone knows by now, you are basically given entire dynamically-provisioned, hourly-billed virtual machines that you load OSes on and install software and all that.  “Like servers, but somewhere out in the ether.”  Those kinds of cloud offerings (e.g. Amazon, Rackspace, most of them really) are called Infrastructure As A Service (IaaS).  You’re responsible for everything you normally would be, except for the data center work.  Azure is not an IaaS offering but still bears a lot of similarities to Amazon; I’ll get into details later.

Point of Comparison: Google App Engine (PaaS)

Take Google’s App Engine as another point of comparison.  There, you just upload your Python or Java application to their portal and “it runs on the Web.”  You don’t have access to the server or OS or disk or anything.  And it “magically” scales for you.  This approach is called Platform as a Service (PaaS).   They provide the full platform stack, you only provide the end application.  On the one hand, you don’t have to mess with OS level stuff – if you are just a Java programmer, you don’t have to know a single UNIX (or Windows) command to transition your app from “But it works in Eclipse!” to running on a Web server on the Internet.  On the other hand, that comes with a lot of limitations that the PaaS providers have to establish to make everything play together nicely.  One of our early App Engine experiences was sad – one of our developers wrote a Java app that used a free XML library to parse some XML.  Well, that library had functionality in it (that we weren’t using) that could write XML to disk.  You can’t write to disk in App Engine, so its response was to disallow the entire library.  The app didn’t work and had to be heavily rewritten.  So it’s pretty good for code that you are writing EVERY SINGLE LINE OF YOURSELF.  Azure isn’t quite as restrictive as App Engine, but it has some of that flavor.

Azure’s Model

Windows Azure falls between the two.  First of all, Azure is a real “hosted cloud” like Amazon Web Services, like most of us really think about when we think cloud computing; it’s not one of these on premise things that companies are branding as “cloud” just for kicks. That’s important to say because it seems like nowadays the larger the company, the more they are deliberately diluting the term “cloud” to stick their products under its aegis.  Microsoft isn’t doing that, this is a “cloud offering” in the classical (where classical means 2008, I guess) sense.

However, in a number of important ways it’s not like Amazon.  I’d definitely classify it as a PaaS offering.  You upload your code to “Roles” which are basically containers that run your application in a Windows 2008(ish) environment.  (There are two types – a “Web role” has a stripped down IIS provided on it, a “Worker role” doesn’t – the only real difference between the two.)  You do not have raw OS access, and cannot do things like write to the registry.  But, it is less restrictive than App Engine.  You can bundle up other stuff to run in Azure – even run Java apps using Apache Tomcat.  You have to be able to install whatever you want to run “xcopy only” – in other words, no fancy installers, it needs to be something you could just copy the files to a Windows PC, without administrative privilege, and run a command from the command line and have it work.  Luckily, Tomcat/Java fits that description. They have helper packs to facilitate doing this with Tomcat, memcached, and Apache/PHP/MediaWiki.  At PDC they demoed Domino’s Pizza running their Java order app on it and a WordPress blog running on it.  So it’s not only for .NET programmers.  Managed code is easier to deploy, but you can deploy and run about anything that fits the “copy and run command line” model.

I find this approach a little ironic actually.  It’s been a lot easier for us to get the Java and open source (well, the ones with Windows ports) parts of our infrastructure running on Azure than Windows parts!  Everybody provides Windows stuff with an installer, of course, and you can’t run installers on Azure.  Anyway, in its core computing model it’s like Google App Engine – it’s more flexible than that (good) but it doesn’t do automatic scaling (bad).  If it did autoscaling I’d be willing to say “It’s better than App Engine in every way.”

In other ways, it’s a lot like Amazon.  They offer a variety of storage options – blobs (like S3), tables (like SimpleDB), queues (like SQS), drives (like EBS), SQL Azure (like RDS).  They have an integral CDN.  They do hourly billing.  Pricing is pretty similar to Amazon – it’s hard to totally equate apples to apples, but Azure compute is $0.12/hr and an Amazon small Windows image compute is $0.12/hr (Coincidence?  I think not.).  And you have to figure out scaling and provisioning yourself on Amazon too – or pay a lot of scratch to one of the provisioning companies like RightScale.

What’s Unique and Different

Well, the largest thing that I’ve already mentioned is the PaaS approach.  If you need OS level access, you’re out of luck;  if you don’t want to have to mess with OS management, you’re in luck!  So to the first order of magnitude, you can think of Azure as “like Amazon Web Services, but the compute uses more of a Google App Engine model.”

But wait, there’s more!

One of the biggest things that Azure brings to the table is that, using Visual Studio, you can run a local Azure “fabric” on your PC, which means you can develop, test, and run cloud apps locally without having to upload to the cloud and incur usage charges.  This is HUGE.  One of the biggest pains about programming for Amazon, for instance, is that if you want to exercise any of their APIs, you have to do it “up there.”  Also, you can’t move images back and forth between Amazon and on premise.  Now, there are efforts like EUCALYPTUS that try to overcome some of this problem but in the end you pretty much just have to throw in the towel and do all dev and test up in the cloud.  Amazon and Eclipse (and maybe Xen) – get together and make it happen!!!!

Here’s something else interesting.  In a move that seems more like a decision from a typical cranky cult-of-personality open source project, they have decided that proper Web apps need to be asynchronous and message-driven, and by God that’s what you’re going to do.  Their load balancers won’t do sticky sessions (only round robin) and time out all connections between all tiers after 60 seconds without exception.  If you need more than that, tough – rewrite your app to use a multi-tier message queue/event listener model.  Now on the one hand, it’s hard for me to disagree with that – I’ve been sweating our developers, telling them that’s the correct best-practice model for scalability on the Web.  But again you’re faced with the “Well what if I’m using some preexisting software and that’s not how it’s architected?” problem.  This is the typical PaaS pattern of “it’s great, if you’re writing every line of code yourself.”

In many ways, Azure is meant to be very developer friendly.  In a lot of ways that’s good.  As a system admin, however, I wince every time they go on about “You can deploy your app to Azure just by right clicking in Visual Studio!!!”  Of course, that’s not how anyone with a responsibly controlled production environment would do it, but it certainly does make for fast easy adoption in development.   The curve for a developer who is “just” a C++/Java/.NET/whatever wrangler to get up and going on an IaaS solution like Amazon is pretty large comparatively; here, it’s “go sign up for an account and then click to deploy from your IDE, and voila it’s running on the Intertubes.”  So it’s a qualified good – it puts more pressure on you as an ops person to go get the developers to understand why they need to utilize your services.  (In a traditional server environment, they have to go through you to get their code deployed.)  Often, for good or ill, we use the release process as a touchstone to also engage developers on other aspects of their code that need to be systems engineered better.

Now, that’s my view of the major differences.  I think the usual Azure sales pitch would say something different – I’ve forgotten two of their huge differentiators, their service bus and access control components.  They are branded under the name “AppFabric,” which as usual is a name Microsoft is also using for something else completely different (a new true app server for Windows Server, including projects formerly code named Dublin and Velocity – think of it as a real WebLogic/WebSphere type app server plus memcache.)

Their service bus is an ESB.  As alluded to above, you’re going to want to use it to do messaging.   You can also use Azure Queues, which is a little confusing because the ESB is also a message queue – I’m not clear on their intended differentiation really.  You can of course just load up an ESB yourself in any other IaaS cloud solution too, so if you really want one you could do e.g. Apache ServiceMix hosted on Amazon.  But, they are managing this one for you which is a plus.  You will need to use it to do many of the common things you’d want to do.

Their access control – is a mess.  Sorry, Microsoft guys.  The whole rest of the thing, I’ve managed to cut through the “Microsoft acronyms versus the rest of the world’s terms and definitions” factor, but not here.   “You see, you use ACS’s WIF STS to generate a SWT,” says our Microsoft rep with a straight face.   They seem to be excited that it will use people’s Microsoft Live IDs, so if you want people to have logins to your site and you don’t want to manage any of that, it is probably nice.  It takes SAML tokens too, I think, though I’m not sure if the caveats around that end up equating to “Well, not really.”  Anyway, their explanations have been incoherent so far and I’m not smelling anything I’m really interested in behind it.  But there’s nothing to prevent you from just using LDAP and your own Internet SSO/federation solution.  I don’t count this against Microsoft because no one else provides anything like this, so even if I ignore the Azure one it doesn’t put it behind any other solution.

The Future

Microsoft has said they plan to add on some kind of VM/IaaS offering eventually because of the demand.  For us, the PaaS approach is a bit of a drawback – we want to do all kinds of things like “virus scan uploaded files,” “run a good load balancer,” “run an LDAP server”, and other things that basically require more full OS access.  I think we may have an LDAP direction with the all-Java OpenDS, but it’s a pain point in general.

I think a lot of their decisions that are a short term pain in the ass (no installs, no synchronous) are actually good in the long term.  If all developers knew how to develop async and did it by default, and if all software vendors, even Windows based ones, provided their product in a form that could just be “copy and run without admin privs” to install, the world would be a better place.  That’s interesting in that “Sure it’s hard to use now but it’ll make the world better eventually” is usually heard from the other side of the aisle.

Conclusion

Azure’s a pretty legit offering!  And I’m very impressed by their velocity.  I think it’s fair to say that overall Azure isn’t quite as good as Amazon except for specific use cases (you’re writing it all in .NET by hand in Visual Studio) – but no one else is as good as Amazon either (believe me, I evaluated them) and Amazon has years of head start; Azure is brand new but already at about 80%! That puts them into the top 5 out of the gate.

Without an IaaS component, you still can’t do everything under the sun in Azure.  But if you’re not depending on much in the way of big third party software chunks, it’s feasible; if you’re doing .NET programming, it’s very compelling.

Do note that I haven’t focused too much on the attributes and limitations of cloud computing in general here – that’s another topic – this article is meant to compare and contrast Azure to other cloud offerings so that people can understand its architecture.

I hope that was clear.  Feel free and ask questions in the comments and I’ll try to clarify!

Leave a comment

Filed under Cloud