Tag Archives: aws

by karthequian | November 19, 2013 · 1:57 pm

ReInvent 2013- Scaling on AWS for the First 10 Million Users

This was the first talk by @simon_elisha I went to at ReInvent, and was a packed room. It was targeted towards developers going from inception of an app to growing it to 10 million users. Following are the notes I took…

– We will need a bigger box is the first issue, when you start seeing traffic to an application. Single box is an anti pattern because of no failover etc. move out your db from the web server etc…you could use RDS or something too.

– SQL or NoSQL?
Not a binary decision; maybe use both? A blended approach can reduce technical debt. Maybe just start with SQL because it’s familiar and there are clear patterns for scalability. Nosql is great for super low latency apps, metadata data sets, fast lookups and rapid ingesting data.

So for 100 users…
You can get by using route53, ELB, multiple web instances.

For 10000 users…
– Use cloud front to cache any static assets.
– Get your session state out of the webservers. Session state could be stored in dynamo db because it’s just unrelated data.
– Also might be time for elastic cache now which is just hosted redis or memcached.

Auto scaling…
Min, max servers running in multiple az zones. AWS makes this really simple.

If you end up at the 500k users situation you probably really want:
– metrics and alarms
– automated builds and deploys
– centralized logging

must haves for log metrics to collect:
– host level metrics
– aggregate level metrics
– log analysis
– external site performance

Use a product for this, because there are plenty available, and you can focus on what you’re really trying to accomplish.

Create tools to automate so you save your time especially to manage your time. Some of the ones that you can use are: elastic beanstalk, aws opsworks more for developers and cloud formation and raw ec2 for ops. The key is to be able to repeat those deploys quickly. You probably will need to use puppet and chef to manage the actual ec2 instances..

Now you probably need to redesign your app when you’re at the million user mark. Think about using a service oriented architecture. Loose coupling for the win instead of tight coupling. You can probably put a queue between 2 pieces

Key tip: don’t reinvent the wheel.

Example of what to do when you have a user uploading a picture to a site.

Simple workflow service
– workers and deciders: provides orchestration for your code.

When your data tier starts to break down 5-10 mill users
– federation
Split by function or purpose
Gotcha- You will have issues with join queries
– sharding
This works well for one table with billions of rows.
Gotcha- operationally confusing to manage
– shift to nosql
Sorta similar to federation
Gotcha- crazy architecture change. Use dynamo db.

Final Tips

Leave a comment

Filed under Cloud, Conferences

Tagged as 2013, amazon, aws, Cloud, conference, reinvent

by karthequian | November 19, 2013 · 11:54 am

The AWS ReInvent Conference Recap

Last week I attended AWS ReInvent in Las Vegas. It was the largest conference I’ve been to with 9000 people, and a crazy number of sessions. When I was trying to decide what sessions to go to, I realized I had multiple conflicts at every slot (a good problem to have). It was also one of the funnest conferences I’ve been to, and I’ll be back next year (a bit more prepared next time around).

I’ll post about the sessions I went to, but the following are my favorite highlights from the conference:

Day 2 keynote with Werner Vogels: After a more marketing and “C” centric keynote on day 1, the day 2 keynote was tuned more to the large developer crowd in the audience, and I left inspired. Check out all my notes here.
Expo Hall: Holy cow! I got tired after walking around just 1/2 this hall. According to the booklet, there were maybe over 170 sponsors, and it took a while to walk through and check out what everyone was doing. The expo hall was also packed the first couple of days, so I went on the 3rd day when things were a lot quieter (pro tip: If you want the best swag, go the first day), but it also gave me a chance to talk to more of the folks in a leisurely manner! My 2 favorite highlights about the expo hall were Datadog (Most enthusiastic even on day 3), and Cloudability (who knew that I was a customer of theirs even though I didn’t realize another team at Mentor used the product; I thought that was pretty awesome!).
Crazy number of sessions: I’m glad these are all on YouTube now. I hear the slides are also going to be online in a bit. This will give me a way to catch up on the sessions that I missed out on.
AWS Hands on labs: This was pretty cool! You could skip a session or two and do a hands on lab on an AWS technology. I spent some time doing a hands on learning AWS beanstalk, and it was totally worthwhile.
Day 1 (Tuesday): I only got in on Tuesday, but next time I’ll need to register in time to be a part of the hackathon or gameday. I talked to a bunch of folks who attended these, and they ended up having a great time at both of these. The Gameday was pretty cool as well and was targeted at DevOps folks. You ended up forming a team with a bunch of other folks and had to build an application infrastructure that was resilient to any kind of breakage. Then, you’d swap your credentials with another team, and they would try to break your infrastructure; you can imagine how this would end up being entertaining!
Meeting up with folks, and catching up with people I hadn’t seen in a while.
VEGAS! It was good to not lose at the roulette tables this time around 🙂

A lot of developer friends commented that the talks were light on technical side of things, which I thought was true; the way I got more out of these was actually talking to the product managers and the customer at the end of the talk to ask and understand some of the more technical concepts. This is true for most conferences, but was especially true for this one.

Stay tuned for a bunch of post conference session updates!

Leave a comment

Filed under DevOps

Tagged as 2013, aws, Conferences, reinvent

by Ernest Mueller | October 26, 2013 · 11:02 am

LASCON 2013 Report – Second Afternoon

I’m afraid I only got to one session in the afternoon, but I have some good interviews coming your way in exchange!

User Authentication For Winners!

I didn’t get to attend but I know that Karthik’s talk on writing a user auth system was good, here are the slides. When we were at NI he had to write the login/password/reset system for our product and we were aghast that there was no project out there to use, you just had to roll your own in an area where there are so many lurking security flaws. He talks about his journey and you should read it!

AWS CloudHSM And Why It Can Revolutionize Cloud

Oleg Gryb (@oleggryb), security architect at Intuit, and Todd Cignettei, Sr. Product Manager with AWS Security.

Oleg says: There are commonly held concerns about cloud security – key management, legal liability, data sovereignty and access, unknown security policies and processes…

CloudHSM makes objects in partitions not accessible by the cloud provider. It provides multiple layers of security.

[Ed. What is HSM? I didn’t know and he didn’t say. Here’s what Wikipedia says.]

Luckily, Todd gets up and tells us about the HSM, or Hardware Security Module. It’s a purpose built appliance designed to protect key material and perform secure cryptographic operations. The SafeNet Luna SA HSM has different roles – appliance administrator, security officer. It’s all super certified and if tampered with blows up the keys.

AWS is providing dedicated access to SafeNet Luna SA HSM appliances. They are physically in AWS datacenters and in your VPC. You control the keys; they manage the hardware but they can’t see your goodies. And you do your crypto operations there. Here’s the AWS page on CloudHSM.

They are already integrated with various software and APIs like Java JCA/JCE.

It’s being used to encrypt digital content, DRM, securing financial transactions (root of trust for PKI), db encryption, digital signatures for real estate transactions, mobile payments.

Back to Oleg. With the HSM, there’s some manual steps you need to do, Initialize the HSM, configure a server and generate server side certs, generate a client cert on each client, scp the public portion to the server to register it.

Normal client cert generation requires an IP, which in the cloud is lame. You can isntead use a generic client name and use the same one on all systems.

You put their LunaProvider,jar in your Java CLASSPATH and add the provider to java/security and you’re good to go.

Making a Luna HA array is very important of course. If you get two you can group them up.

Suggested architecture – they ahve to run in a VPC. “You want to put on Internet? Is crazy idea! Never!”

Crypto doesn’t solve your problem, it just moves it to another place. How do you get the secrets onto your instances? When your instance starts, you don’t want those creds in S3 or the AMI…

So at instance bootstrap, send a request to a server in an internal DC with IP, instance ID, public and local hostanmes, reservation ID, instance type… Validate using the API including instance start time, validate role, etc. and then pass it back. Check for dupes. This isn’t perfect but what are ya gonna do? You can assign a policy to a role and have an instance profile it uses.

He has written a Python tool to help automate this, you can get it at http://sf.net/p/lunamech.

1 Comment

Filed under Conferences, Security

Tagged as appsec, auth, aws, Cloud, conference, lascon, owasp, password, Security, user management

by Ernest Mueller | June 24, 2013 · 2:51 pm

Crosspost: How Bazaarvoice Weathered The AWS Storm

For regular agile admin readers, I wanted to point out the post I did on the Bazaarvoice engineering blog, How Bazaarvoice Weathered The AWS Storm, on how we have designed for resiliency to the point where we had zero end user facing downtime during last year’s AWS meltdown and Leapocalypse. It’s a bit late, I wrote it like in July and then the BV engineering blog kinda fell dormant (guy who ran it left, etc.) and we’re just getting it reinvigorated. Anyway, go read the article and also watch that blog for more good stuff to come!

Leave a comment

Filed under Cloud, DevOps

Tagged as availability zone, aws, bazaarvoice, Cloud, DevOps, ec2, leapocalypse, outage, region, resiliency

by Ernest Mueller | June 18, 2013 · 4:54 pm

Velocity 2013 Day 1 Liveblog – Using Amazon Web Services for MySQL at Scale

Next up is Using Amazon Web Services for MySQL at Scale. I missed the first bit, on RDS vs EC2, because I tried to get into Choose Your Weapon: A Survey For Different Visualizations Of Performance Data but it was packed.

AWS Scaling Options

Aside: use boto

vertical scaling – tune, add hw.

table level partitioning – smaller indexes, etc. and can drop partitions instead of deleting

functional partitioning (move apps out)

need more reads? add replicas, cache tier, tune ORM

replication lag? see above, plus multiple schemas for parallel rep (5.6/tungsten). take some stuff out of db (timestamp updates, queues, nintrans reads), pre-warm caches, relax durability

writes? above plus sharding

sharding by row range requires frequent rebalancing

hash/modulus based- better distro but harder to rebalance; prebuilt shards

lookup table based

Availability

In EC2 you have regions and AZs. AZs are supposed to be “separate” but have some history of going down with each other.

A given region is about 99.2% up historically.

RDS has multi-AZ replica failover

Pure EC2 options:

master/replicas – async replication. but, data drift, fragile (need rapid rebuild). MySQL MHA for failover. haproxy (see palomino blog)
tungsten – replaced replication and cluster manager. good stuff.
galera – galera/xtradb/mariadb synchronous replication

Performance

io/storage: provisioned IOPS. Also, SSD for ephemeral power replicas

rds has better net perf, the block replication affects speed

instance types – gp, cpu op, memory op, storage op. Tend to use memory op, EBS op. cluster and dedicated also available.

EC2 storage – ephemeral, epehemeral SSD (superfast!), EBS slightly slower, EBS PIOPS faster/consistent/expensive/lower fail

Mitigating Failures

Local failures should not be a problem. AZs, run books, game days, monitoring.

Regional failures – if you have good replication and fast DNS flipping…

You may do master/master but active/active is a myth.

Backups – snap frequently, put some to S3/glacier for long term. Maybe copy them out of Amazon time to time to make your auditors happy.

Cost

Remember, you spend money every minute. There’s some tools out there to help with this (and Netflix released Ice today to this end).

Leave a comment

Filed under Cloud, Conferences, DevOps

Tagged as aws, ec2, mysql, velocity, velocityconf, velocityconf13

by Ernest Mueller | April 21, 2011 · 2:40 pm

The Real Lessons To Learn From The Amazon EC2 Outage

As I write this, our ops team is working furiously to bring up systems outside Amazon’s US East region to recover from the widespread outage they are having this morning. Naturally the Twitterverse, and in a day the blogosphere, and in a week the trade rag…axy? will be talking about how the cloud is unreliable and you can’t count on it for high availability.

And of course this is nonsense. These outages make the news because they hit everyone at once, but all the outages people are continually having at their own data centers are just as impactful, if less hit-generating in terms of news stories.

Sure, we’re not happy that some of our systems are down right now. But…

The outage is only affecting one of our beta products, our production SaaS service is still running like a champ, it just can’t scale up right now.
Our cloud product uptime is still way higher than our self hosted uptime. We have network outages, Web system outages, etc. all the time even though we have three data centers and millions of dollars in network gear and redundant Internet links and redundant servers.

People always assume they’d have zero downtime if they were hosting it. Or maybe that they could “make it get fixed” when it does go down. But that’s a false sense of security based on an archaic misconception that having things on premise gives you any more control over them.We run loads of on premise Web applications and a batch of cloud ones, and once we put some Keynote/Gomez/Alertsite against them we determined our cloud systems have much higher uptime.

Now, there are things that Amazon could do to make all this better on customers. In the Amazon SLAs, they say of course you can have super high uptime – if you are running redundantly across AZs and, in this case, regions. But Amazon makes it really unattractive and difficult to do this.

What Amazon Can Do Better

We can work around this issue by bringing up instances in other regions. Sadly, we didn’t already have our AMIs transferred into those regions, and you can only bring up instances off AMIs that are already in those regions. And transferring regions is a pain in the ass. There is absolutely zero reason Amazon doesn’t provide an API call to copy an AMI from region 1 to region 2. Bad on them. I emailed my Amazon account rep and just got back the top Google hits for “Amazon AMI region migrate”. Thanks, I did that already.
We weren’t already running across multiple regions and AZs because of cost. Some of that is the cost of redundancy in and of itself, but more importantly is the hateful way Amazon does reserve pricing, which very much pushes you towards putting everything in one AZ.
Also, redundancy only really works if you have everything, including data, in that AZ. If you are running redundant app servers across 4 AZs, but have your database in one of them – 0r have a database master in one and slaves in the others – you still get hosed by a particular region downtime.

Amazon needs to have tools that inherently let you distribute your stuff across their systems and needs to make their pricing/reserve strategy friendlier to doing things in what they say is “the right way.”

What We Could Do Better

We weren’t completely prepared for this. Once that region was already borked, it was impossible to migrate AMIs out of it, and there are so many ticky little region specific things all through the Amazon config – security groups, ELBs, etc – doing that on the fly is not possible unless you have specifically done it before, and we hadn’t.

We have an automation solution (PIE) that will regen our entire cloud for us in a short amount of time, but it doesn’t handle the base images, some of which we modify and re-burn from the Amazon ones. We don’t have that process automated and the documentation was out of date since Fedora likes to move their crap around all the time.

In the end, Amazon started coming back just as we got new images done in us-west-1. We’ll certainly work on automating that process, and hope that Amazon will also step up to making it easier for their customers to do so.

Leave a comment

Filed under Cloud, DevOps

Tagged as amazon, aws, ec2, outage

by Ernest Mueller | March 31, 2011 · 9:26 am

Why Amazon Reserve Instances Torment Me

We’ve been using over 100 Amazon EC2 instances for a year now, but I’ve just now made my first reserve instance purchase. For the untutored, reserve instances are where you pay a yearly upfront per instance and you get a much, much lower hourly cost. On its face, it’s a good deal – take a normal Large instance you’d use for a database. For a Linux one, it’s $0.34 per hour. Or you can pay $910 up front for the year, and then it’s only $0.12 per hour. So theoretically, it takes your yearly cost from $2978.40 to $1961.2. A great deal right?

Well, not so much. The devil is in the details.

First of all, you have to make sure and be running all those instances all the time. If you buy a reserve instance and then don’t use it some of the time, you immediately start cutting into your savings. The crossover is at 172 days – if you don’t run the instance at least 172 days out of the year then you are going upside down on the deal.

But what’s the big deal, you ask? Sure, in the cloud you are probably (and should be!) scaling up and down all the time, but as long as you reserve up to your low water mark it should work out, right?

So the big second problem is that when you reserve instances, you have to specify everything about that instance. You aren’t reserving “10 instances”, or even “10 large instances” – you have to specify:

Platform (UNIX/Linux, UNIX/Linux VPC, SUSE Linux, Windows, Windows VPC, or Windows with SQL Server)
Instance Type (m1.small, etc.)
AZ (e.g. us-east-1b)

And tenancy and term. So you have to reserve “a small multitenant Linux instance in us-east-1b for one year.” But having to specify down to this level is really problematic in any kind of dynamic environment.

Let’s say you buy 10 m1.large instances for your databases, and then you realize later you really need to move up to an m1.xlarge. Well, tough. You can, but if you don’t have 10 other things to run on those larges, you lose money. Or if you decide to change OS. One of our biggest expenditures is our compile farm workers, and on those we hope to move from Windows to Linux once we get the software issues worked out, and we’re experimenting with best cost/performance on different instance sizes. I’m effectively blocked from buying reserve for those, since if I do it’ll put a stop to our ability to innovate.

And more subtly, let’s say you’re doing dynamic scaling and splitting across AZs like they always say you should do for availability purposes. Well, if I’m running 20 instances, and scaling them across 1b and 1c, I am not guaranteed I’m always running 10 in 1b and 10 in 1c, it’s more random than that. Instead of buying 20 reserve, you instead have to buy say 7 in 1b and 7 in 1c, to make sure you don’t end up losing money.

Heck, they even differentiate between Linux and Suse and Linux VPC instances, which clearly crosses over into annoyingly picky territory.

As a result of all this, it is pretty undesirable to buy reserve instances unless you have a very stable environment, both technically and scale-wise. That sentence doesn’t describe the typical cloud use case in my opinion.

I understand, obviously, why they are doing this. From a capacity planning standpoint, it’s best for them if they make you specify everything. But what I don’t think they understand is that this cuts into people willing to buy reserve, and reserve is not only upfront money but also a lockin period, which should be grotesquely attractive to a business. I put off buying reserve for a year because of this, and even now that I’ve done it I’m not buying near as many reserve as I could be because I have to hedge my bets against ANY changes to my service. It seems to me that this also degrades the alleged point of reserves, which is capacity planning – if you’re so picky about it that no one buys reserve and 95% of your instances are on demand, then you can’t plan real well can you?

What Amazon needs to do is meet customers halfway. It’s all a probabilities game anyway. They lose specificity of each given reserve request, but get many more reserve requests (and all the benefits they convey – money, lockin, capacity planning info) in return.

Let’s look at each axis of inflexibility and analyze it.

Size. Sure, they have to allocate machines, right? But I assume they understand they are using this thing called “virtualization.” If I want to trade in 20 reserved small instances for 5 large instances (each large is 4x a small), why not? It loses them nothing to allow this. They just have to make the effort to allow it to happen in their console/APIS. I can understand needing to reserve a certain number of “units” but those should be flexible on exact instance types at a given time.
OS. Why on God’s green earth do I need to specify OS? Again, virtualized right? Is it so they can buy enough Windows licenses from Microsoft? Cry me a river. This one needs to leave immediately and never come back.
AZ. This is annoying from the user POV but probably the most necessary from the Amazon POV because they have to put enough hardware in each data center, right? I do think they should try to make this a per region and not a per AZ limit, so I’m just reserving “us-east” in Virginia and not the specific AZ, that would accommodate all my use cases.

In the end, one of the reasons people move to the cloud in the first place is to get rid of the constraints of hardware. When Amazon just puts those constraints back in place, it becomes undesirable. Frankly even now, I tried to just pay Amazon up front rather than actually buy reserve, but they’re not really enterprise friendly yet from a finance point of view so I couldn’t make that happen, so in the end I reluctantly bought reserve. The analytics around it are lacking too – I can’t look in the Amazon console and see “Yes, you’re using all 9 of your large/linux/us-east-1b instances.”

Amazon keeps innovating greatly in the technology space but in terms of the customer interaction space, they need a lot of help – they’re not the only game in town, and people with less technically sophisticated but more aggressive customer services/support options will erode their market share. I can’t help that Rackspace’s play with backing OpenStack seems to be the purest example of this – “Anyone can run the same cloud we do, but we are selling our ‘fanatical support'” is the message.

8 Comments

Filed under Cloud

Tagged as amazon, aws, Cloud, ec2, reserve

by Ernest Mueller | February 25, 2011 · 10:17 am

Amazon CloudFormation: Model Driven Automation For The Cloud

You may have heard about Amazon’s newest offering they announced today, CloudFormation. It’s the new hotness, but I see a lot of confusion in the Twitterverse about what it is and how it fits into the landscape of IaaS/PaaS/Elastic Beanstalk/etc. Read what Werner Vogels says about CloudFormation and its uses first, but then come back here!

Allow me to break it down for you and explain why this is such a huge leverage point for cloud developers.

What Has Come Before

Up till now on Amazon you could configure up a single virtual image the way you wanted it, with an AMI. You could even kind of construct a scalable tier of similar systems using Auto Scaling, by defining Launch Configurations. But if you wanted to construct an entire multitier system it was a lot harder. There are automated configuration management tools like chef and puppet out there, but their recipes/models tend to be oriented around getting a software loadout on an existing system, not the actual system provisioning – in general they come from the older assumption you have someone doing that on probably-physical systems using bcfg2 or cobber or vagrant or something.

So what were you to do if you wanted to bring up a simple three tier system, with a Web tier, app server tier, and database tier? Either you had to set them up and start them manually, or you had to write code against the Amazon APIs to explicitly pull up what you wanted. Or you had to use a third party provisioning provider like RightScale or EngineYard that would let you define that kind of model in their Web consoles but not construct your own model programmatically and upload it. (I’d like my product functionality in my own source control and not your GUI, thanks.)

Now, recently Amazon launched Elastic Beanstalk, which is more way over on the PaaS side of things, similar to Google App Engine. “Just upload your application and we’ll run it and scale it, you don’t have to worry about the plumbing.” Of course this sharply limits what you can do, and doesn’t address the question of “what if my overall system consists of more than just one Java app running in Beanstalk?”

If your goal is full model driven automation to achieve “infrastructure as code,” none of these solutions are entirely satisfactory. I understand CloudFormation deeply because we went down that same path and developed our own system model ourselves as a response!

I’ll also note that this is very similar to what Microsoft Azure does. Azure is a hybrid IaaS/PaaS solution – their marketing tries to say it’s more like Beanstalk or Google App Engine, but in reality it’s more like CloudFormation – you have an XML file that describes the different roles (tiers) in the system, defines what software should go on each, and lets you control the entire system as a unit.

So What Is CloudFormation?

Basically CloudFormation lets you model your Amazon cloud-based system in JSON and then provision and control it as a unit. So in our use case of a three tier system, you would model it up in their JSON markup and then CloudFormation would understand that the whole thing is a unit. See their sample template for a WordPress setup. (A mess more sample templates are here.)

Review the WordPress template; it lets you define the AMIs and instance types, what the security group and ELB setups should be, the RDS database back end, and feed in variables that’ll be used in the consuming software (like WordPress username/password in this case).

Once you have your template you can tell Amazon to start your “stack” in the console! It’ll even let you hook it up to a SNS notification that’ll let you know when it’s done. You name the whole stack, so you can distinguish between your “dev” environment and your “prod” environment for example, as opposed to the current state of the Amazon EC2 console where you get to see a big list of instance IDs – they added tagging that you can try to use for this, but it’s kinda wonky.

Why Do I Want This Again?

Because a system model lets you do a number of clever automation things.

Standard Definition

If you’ve been doing Amazon yourself, you’re used to there being a lot of stuff you have to do manually. From system build to system build even you do it differently each time, and God forbid you have multiple techies working on the same Amazon system. The basic value proposition of “don’t do things manually” is huge. You configure the security groups ONCE and put it into the template, and then you’re not going to forget to open port 23 AGAIN next time you start a system. A core part of what DevOps is realizing as its value proposition is treating system configuration as code that you can source control, fix bugs in and have them stay fixed, etc.

And if you’ve been trying to automate your infrastructure with tools like Chef, Puppet, and ControlTier, you may have been frustrated in that they address single systems well, but do not really model “systems of systems” worth a damn. Via new cloud support in knife and stuff you can execute raw “start me a cloud server” commands but all that nice recipe stuff stops at the box level and doesn’t really extend up to provisioning and tracking parts of your system.

With the CloudFormation template, you have an actual asset that defines your overall system. This definition:

Can be controlled in source control
Can be reviewed by others
Is authoritative, not documentation that could differ from the reality
Can be automatically parsed/generated by your own tools (this is huge)

It’s also nicely transparent; when you go to the console and look at the stack it shows you the history of events, the template used to start it, the startup parameters it used… Moving away from the “mystery meat” style of system config.

Coordinated Control

With CloudFormation, you can start and stop an entire environment with one operation. You can say “this is the dev environment” and be able to control it as a unit. I assume at some point you’ll be able to visualize it as a unit, right now all the bits are still stashed in their own tabs (and I notice they don’t make any default use of their own tagging, which makes it annoying to pick out what parts are from that stack).

This is handy for not missing stuff on startup and teardown… A couple weeks ago I spent an hour deleting a couple hundred rogue EBSes we had left over after a load test.

And you get some status eventing – one of the most painful parts of trying to automate against Amazon is the whole “I started an instance, I guess I’ll sit around and poll and try to figure out when the damn thing has come up right.” In CloudFront you get events that tell you when each part and then the whole are up and ready for use.

What It Doesn’t Do

It’s not a config management tool like Chef or Puppet. Except for what you bake onto your AMI it has zero software config capabilities, or command dispatch capabilities like Rundeck or mcollective or Fabric. Although it should be a good integration point with those tools.

It’s not a PaaS solution like Beanstalk or GAE; you use those when you just have an app you want to deploy to something that’ll run it. Now, it does erode some use cases – it makes a middle point between “run it all yourself and love the complexity” and “forget configurable system bits, just use PaaS.” It allows easy reusability, say having a systems guy develop the template and then a dev use it over and over again to host their app, but with more customization than the pure-play PaaSes provide.

It’s not quite like OVF, which is more fiddly and about virtually defining the guts of a single machine than defining a set of systems with roles and connections.

Competitive Analysis

It’s very similar to Microsoft Azure’s approach with their .cscfg and .csdef files which are an analogous XML model – you really could fairly call this feature “Amazon implements Azure on Amazon” (just as you could fairly call Elastic Beanstalk “Amazon implements Google App Engine on Amazon”.) In fact, the Azure Fabric has a lot more functionality than the primitive Amazon events in this first release. Of course, CloudFormation doesn’t just work on Windows, so that’s a pretty good width vs depth tradeoff.

And it’s similar to something like a RightScale, and ideally will encourage them to let customers actually submit their own definition instead of the current clunky combo of ServerArrays and ServerTemplates (curl or Web console? Really? Why not a model like this?). RightScale must be in a tizzy right now, though really just integrating with this model should be easy enough.

Where To From Here?

As I alluded, we actually wrote our own tool like this internally called PIE that we’re looking to open source because we were feeling this whole problem space keenly. XML model of the whole system, Apache Zookeeper-based registry, kinda like CloudFormation and Azure. Does CloudFormation obsolete what we were doing? No – we built it because we wanted a model that could describe cloud systems on multiple clouds and even on premise systems. The Amazon model will only help you define Amazon bits, but if you are running cross-cloud or hybrid it is of limited value. And I’m sure model visualization tools will come, and a better registry/eventing system will come, but we’re way farther down that path at least at the moment. Also, the differentiation between “provisioning tools” that control and start systems like CloudFormation and bcfg2 and “configuration” tools that control and start software like Chef and Puppet (and some people even differentiate between those and “deploy” tools that control and start applications like Capistrano) is a false dichotomy. I’m all about the “toolchain” approach but at some point you need a toolbelt. This tool differentiation is one of the more harmful “Dev vs Ops” differentiations.

I hope that this move shows the value of system modeling and helps people understand we need an overarching model that can be used to define it all, not just “Amazon” vs “Azure” or “system packages” vs “developed applications” or “UNIX vs Windows…” True system automation will come from a UNIVERSAL model that can be used to reason about and program to your on premise systems, your Amazon systems, your Azure systems, your software, your apps, your data, your images and files…

Conclusion

You need to understand CloudFormation, because it is one of the most foundational changes that will have a lot of leverage that AWS has come out with in some time. I don’t bother to blog about most of the cool new AWS features, because they are cool and I enjoy them but this is part of a more revolutionary change in the way systems are managed, the whole DevOps thing.

7 Comments

Filed under Cloud, DevOps

Tagged as amazon, aws, azure, Cloud, cloudformation, configuration management, ec2, IaaS, model, paas, provisioning

by Ernest Mueller | April 16, 2010 · 9:11 am

Amazon Web Services – Convert To/From VMs?

In the recent Amazon AWS Newsletter, they asked the following:

Some customers have asked us about ways to easily convert virtual machines from VMware vSphere, Citrix Xen Server, and Microsoft Hyper-V to Amazon EC2 instances – and vice versa. If this is something that you’re interested in, we would like to hear from you. Please send an email to aws-vm@amazon.com describing your needs and use case.

I’ll share my reply here for comment!

This is a killer feature that allows a number of important activities.

1. Product VMs. Many suppliers are starting to provide third-party products in the form of VMs instead of software to ease install complexity, or in an attempt to move from a hardware appliance approach to a more-software approach. This pretty much prevents their use in EC2. <cue sad music> As opposed to “Hey, if you can VM-ize your stuff then you’re pretty close to being able to offer it as an Amazon AMI or even SaaS offering.” <schwing!>

2. Leveraging VM Investments. For any organization that already has a VM infrastructure, it allows for reduction of cost and complexity to be able to manage images in the same way. It also allows for the much promised but under-delivered “cloud bursting” theory where you can run the same systems locally and use Amazon for excess capacity. In the current scheme I could make some AMIs “mostly” like my local VMs – but “close” is not good enough to use in production.

3. Local testing. I’d love to be able to bring my AMIs “down to me” for rapid redeploy. I often find myself having to transfer 2.5 gigs of software up to the cloud, install it, find a problem, have our devs fix it and cut another release, transfer it up again (2 hour wait time again, plus paying $$ for the transfer)…

4. Local troubleshooting. We get an app installed up in the cloud and it’s not acting quite right and we need to instrument it somehow to debug. This process is much easier on a local LAN with the developers’ PCs with all their stuff installed.

5. Local development. A lot of our development exercises the Amazon APIs. This is one area where Azure has a distinct advantage and can be a threat; in Visual Studio there is a “local Azure fabric” and a dev can write their app and have it running “in Azure” but on their machine, and then when they’re ready deploy it up. This is slightly more than VM consumption, it’s VMs plus Eucalyptus or similar porting of the Amazon API to the client side, but it’s a killer feature.

Xen or VMWare would be fine – frankly this would be big enough for us I’d change virtualization solutions to the one that worked with EC2.

I just asked one of our developers for his take on value for being able to transition between VMs and EC2 to include in this email, and his response is “Well, it’s just a no-brainer, right?” Right.

1 Comment

Filed under Cloud

Tagged as amazon, aws, Cloud, Cloud Computing, ec2, Virtualization

Tag Archives: aws

The AWS ReInvent Conference Recap

Velocity 2013 Day 1 Liveblog – Using Amazon Web Services for MySQL at Scale

AWS Scaling Options

Availability

Performance

Mitigating Failures

Cost

The Real Lessons To Learn From The Amazon EC2 Outage

What Amazon Can Do Better

What We Could Do Better

Subscribe

Recent Comments

Recent Posts

Austinites

Cloud

DevOps

Archives