Tag Archives: amazon

AWS re:Invent Keynote Day 2 Takeaways

TL;DR – performance improvements and two huge announcements: the Docker-based EC2 Container Service and the cloud-CEP-like AWS Lambda.

I was in a meeting for the first 45 minutes but I hear I didn’t miss much. Happy customer use cases.

The first big theme of this morning's keynote is "Containers" – often just shorthand for "Docker." I went to a previous event here in town where even large enterprises and government – the State of Texas, Microsoft, Dell, Red Hat – were all freaking out about Docker. Docker is similar to VMware or cloud in that it's a new technology that requires new monitoring and management built just for it. (Heck, Eric, the CopperEgg founder, is now running a startup around Docker container management, StackEngine.)

  1. Keynote from pristine.io about how they implemented their platform. Docker, the new low-overhead containerization technology, is a heavily cited part of their story (they actually used Flux7 as expert consultants – they're based here in Austin!).
  2. Keynote from Werner Vogels on the new "Amazon EC2 Container Service," announced to cheers and applause. It allows launching and terminating containers across sets of EC2 instances (a minimal API sketch follows below). Their PM did a demo where they had a big farm of r3 instances, deployed a Redis cluster and RabbitMQ across them, then front-end components on a farm of c3s, and then audio processing across all of them. If you're new to this, it's basically VMs within VMs but without noticeable overhead.
EC2 Container Service

  1. Next they had Docker co-founder and CEO Ben Golub. He mentioned that Docker is only 18 months old, and its huge success and ecosystem this early on are "surreal."
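To give a flavor of what "launch containers onto a set of instances" means in practice, here's a minimal sketch using the boto3 ECS client (the cluster name, image, and sizes are my placeholders, not the keynote demo):

```python
import boto3

ecs = boto3.client("ecs")

# Describe the container to run; "redis:2.8" and the sizes are placeholders.
ecs.register_task_definition(
    family="cache",
    containerDefinitions=[{
        "name": "redis",
        "image": "redis:2.8",
        "cpu": 512,
        "memory": 1024,
        "essential": True,
    }],
)

# Ask the ECS scheduler to place a few copies on the cluster's EC2 instances.
ecs.run_task(cluster="r3-farm", taskDefinition="cache", count=3)
```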

Next… Leapfrogging PaaS?

  1. Werner is back to announce AWS Lambda, available now in preview – an event-driven computing service for dynamic applications. No instance running/management required; events come in and "cloud functions" run on them. Holy shit, this replaces a large number of servers running semi-trivial apps. 20 cents per million requests, plus some more complex pricing for seconds of execution – free for 3.2M seconds/1M requests.

    Amazon Lambda

  2. Netflix's chief product officer came on to show how they're using Lambda as a higher-level abstraction and have eliminated a bunch of servers – no system monitoring/management, no inefficient polling, no gaps/opacity. They're using it to encode video, run backups, run security and compliance checks against instances, and for operational monitoring and dashboards. Replacing procedural control systems with event-driven services.
  3. AWS core innovations… New c4 instance, Haswell based (crazy fast processor, 36 vCPUs). Diane Bryant, SVP/GM Data Center Group from Intel, came on to go into the CPU specifically. Larger and faster EBS volumes, up to 20,000 IOPS. Enhanced and consistent networking speeds.

And this has been your cloud update! Also see Ben Kepes in Forbes for a similar summary.

The container engine is cool – it'll certainly remove a lot of instance gerrymandering and instance reservation pain if nothing else. But Lambda is the potential disruptor here. It's taking the idea of "bring your own algorithm" from MapReduce and saying "hmmm, you can probably replace your trivial web app with just this" – it's halfway between a PaaS and a SaaS, with none of the Beanstalk complexity, just "here, take this function and run it on stuff when it comes in." If a library of common lambdas becomes available, so much computing work done for trivial purposes becomes obsolete. Who hasn't seen a Web service to "upload a file here, then zip it or something, then store it…"? OK, no servers needed any more. Very interesting.
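For that last case, here's a minimal sketch of what such a "cloud function" could look like. The preview launched with Node.js; this is Python with boto3 purely to illustrate the shape of an S3-triggered handler, and the bucket name is a placeholder:

```python
import io
import zipfile

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    """Zip each newly uploaded S3 object and store the archive in another bucket."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Fetch the uploaded object.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

        # Zip it in memory.
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
            zf.writestr(key, body)

        # Store the archive; "my-archive-bucket" is a placeholder.
        s3.put_object(Bucket="my-archive-bucket", Key=key + ".zip", Body=buf.getvalue())
```

No instances, no polling – the event shows up and the function runs.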


Filed under Cloud, Conferences

AWS re:Invent Keynote Day 1 Takeaways

Sadly I couldn't attend this year, but heck, that's what the Internet is for. Here are the interesting bits from the AWS re:Invent Day 1 keynote (livestreamed here). Loads of interesting stuff.

  1. AWS is growing revenue >40% YOY, far outstripping other large IT companies – EC2 use grew 99% YOY and S3 usage 137%, they have 1M active customers now. (Microsoft cloud services report 128% YOY growth as well.)
  2. New product announcement for Aurora – a new commercial-grade database engine – fully MySQL compatible but with 5x the performance, available through Amazon RDS at 1/10 the cost of the commercial DB engines (starts at 29 cents an hour, ~$210/mo). Can do 6M inserts/minute and 30M selects/minute. Highly durable (11 9's), crash recovery in seconds with no data loss. Nice!
  3. SDLC stuff!
    1. CodeDeploy (formerly an internal tool called Apollo), a new code-deployment system that does rolling updates and rollbacks and tracks deployment health. It works for all languages and is free. They use it internally for 95 deploys/hour on their own stuff.
    2. In early 2015 will come some more software lifecycle management services – first is CodePipeline for continuous integration and deployment (also used internally)
    3. Second is CodeCommit as a managed code repository that can colocate with where you’re going to deploy and has no size limits of repos or files. These “integrate with” github, jenkins, chef, etc. though it’s not clear how they don’t cannibalize them.
  4. Security stuff! Big push to be able to say “we easily surpass the security you can do on premise.”
    1. FISMA, ITAR, FIPS, FedRAMP, HIPAA, ISO 9001
    2. The current encryption approach is either "let Amazon manage keys" or use their CloudHSM hosted key thing, both of which are still a pain. As a result they're launching AWS Key Management Service, an HA service that manages keys, provides one-click encryption, and does transparent key rotation (a quick API sketch appears after this list).
    3. AWS Config is a new-gen agile CMDB with full visibility into all your AWS resources. You can query it and see relationships and show scope of a config change. Streams all config changes out to you.
    4. A new-gen service catalog called AWS Service Catalog available early 2015. Create and share product portfolios, let internal people launch them, tracking and compliance.
  5. Enterprise Cloud Adoption Patterns
    1. The first wave of enterprise cloud adoption is usually moving dev and test environments to AWS – for flexibility and spin-up/spin-down cost savings – along with brand new apps custom-written for the cloud.
    2. The second wave is web sites and digital transformation (media, corp sites, ecommerce) and analytics, since mass processing and sharing is cheap in the cloud – data warehouses (like Pfizer's). And mobile app back ends – phone, tablet, GPS, more.
    3. Third wave is business critical applications.  Macmillan and Hoya run their SAP in AWS. Conde Nast runs HR and Legal there.
    4. The new wave – you're starting to see entire datacenter migration and consolidation as DCs come up for lease (Hess, Conde Nast, NewsCorp), with Suncorp, Time Inc., GPT, and Nippon Express moving "all in" to AWS – many ISVs as well. The CIA moved to AWS, and now Intuit is doing the same.
    5. Intuit moved their “TurboTax AnswerXchange” app there to deal with tax time peaks last year and the scales fell from their eyes when they did so – 6x cost cut, setup 1/5 of the time, faster development. They started doing more and realized the global datacenters, ease of integration with acquisitions, and dev recruiting benefits. They have 33 services on AWS now, and have moved mint.com there. They have decided to move everything else there now. Funny how once companies start looking at how much they accomplish instead of just the monthly cost the “cloud is more expensive at scale” argument gets dropped like a flaming bag of poo.
  6. Hybrid cloud
    1. Various pieces like directory service (AD in the cloud), identity federation, storage gateway, and System Center and vCenter integration already exist to power mixed shops.
    2. Johnson & Johnson went on for a while about their use of AWS. They are planning a 25,000-seat deployment of WorkSpaces (the virtual desktop offering, like Citrix).
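Since KMS is the piece most teams will touch first, here's a minimal sketch (boto3, with a placeholder key alias – my illustration, not from the keynote) of the one-call encrypt/decrypt flow it promises:

```python
import boto3

kms = boto3.client("kms")

# "alias/my-app-key" is a placeholder for a key you have created in KMS.
ciphertext = kms.encrypt(
    KeyId="alias/my-app-key",
    Plaintext=b"super secret config value",
)["CiphertextBlob"]

# Decryption doesn't need the key ID; KMS works it out from the blob
# and checks that your IAM principal is allowed to use that key.
plaintext = kms.decrypt(CiphertextBlob=ciphertext)["Plaintext"]
assert plaintext == b"super secret config value"
```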

Whew, that's the quick notes version. Aurora is obviously of interest – a lot of the fretting over whether to run MySQL yourself or use RDS that I've seen will get settled by this. Before, it was just "well, run the same thing yourself or have them do it…" and now it's "have them run something insanely better." The SDLC tools are also interesting – they made noise about how these "work with!" Ansible, Jenkins, Git, etc., but that seems mildly disingenuous; without having looked into them much yet, they sound more like direct competition. But the Config and Service Catalog pieces could be great extensions – yay for simple composable services, not huge painful "BSM/ITMOM suites".
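To make the "drop-in" point concrete: since Aurora speaks the MySQL protocol, an existing app shouldn't need anything beyond a new endpoint. A minimal sketch, assuming the PyMySQL client and placeholder endpoint/credentials/schema:

```python
import pymysql

# Endpoint, credentials, and schema are placeholders, not real values.
conn = pymysql.connect(
    host="mycluster.cluster-abc123.us-east-1.rds.amazonaws.com",
    user="admin",
    password="example-password",
    database="orders",
)

with conn.cursor() as cur:
    # Any query that works against stock MySQL should work unchanged here.
    cur.execute("SELECT id, status FROM orders WHERE status = %s", ("shipped",))
    for row in cur.fetchall():
        print(row)

conn.close()
```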

Feel free to share your thoughts on the announcements in the comments section!


Filed under Cloud, Conferences

AWS Dying! Rackspace Pulls Out Of Cloud! News At 11!

Boy, it's been quite a week for the cloud-schadenfreude crowd. If you listen to the various news outlets, apparently Rackspace has given up on cloud and Amazon is in free-fall. Here are some representative hack-job pieces:

More accurate are these:

Let’s look at what’s actually going on.

First, Rackspace.  I was on the Spiceworks forum yesterday and the news is definitely being interpreted as “Rackspace is getting out of cloud, don’t consider them any more.” Now, it is their own fault for bungling the messaging here, but if you actually go look at what they are doing, at its heart they are making this change:

Rackspace Cloud will be sold only with a support contract now.

Yes, that's it. That's the change. Now it's "managed cloud!" Which is fine – a heck of a lot of software I buy has mandatory maintenance contracts nowadays – but this doesn't mean "Rackspace is leaving the cloud business!" They just want to add their "Fanatical Support™" to the value proposition and not compete purely on a bare-metal (bare-API?) "how much does a 2-CPU 4 GB server cost" basis.

Rackspace has to get back out in front of this messaging hard – it’s definitely made its way to the practitioner trenches as “they’re pulling out.” I mean, I have to say Rackspace’s strategy is pretty opaque to most folks, but this message misstep has graduated from “muddled and unclear” to “actively harmful.”

Now, Amazon.  The real story is:

Amazon Web Services only grew 38.39% last quarter.

For a large company that's a pretty good growth rate – right? Is yours higher? The press likes to turn IaaS into a three-provider horse race. But so far – it's not. Check out this recent (March 2014) Synergy Research graph.

[Synergy Research Q4 2013 cloud infrastructure services market share chart]

The fact of the matter is that Amazon is beating the holy hell out of everyone else in IaaS. It’s more neck and neck in PaaS, but sadly the entire PaaS market is still low (due to Joe Average IT Shop basically interpreting PaaS pitches as someone standing up and screaming “I’m a sorcerer!!!”).

IBM, HP, etc. don't have credible offerings yet. I know they're investing, I know they have roped some random companies that love them into doing it, but they are just not there yet. IBM is not a commodity company; they're a "you have a billion dollar contract with us, we'll build out whatever we feel like with it" company.

Google, same thing. It’s cool, it’s well priced, it’s dev friendly – but at the big price cut announcement, we had a big get-together at Capital Factory here in town. I looked around at the crowd of 40 clouderati types and said “OK, so who is comfortable running production apps on Google cloud yet?” Result: zero. Google’s throwing money at it but as with most of Google’s new offerings, it’s hard to trust it’s not just going to dry up tomorrow and get cancelled because they are running after private spaceships or whatever now, and nothing makes them money like their ad business so “it’s revenue generating” won’t save it. And Google is so bad at enterprise support…

Microsoft Azure was really good. Better than it had a right to be!  I was very impressed with Azure in years 1 and 2. Execution was good (we used it for a SaaS service at National Instruments) and the vision was definitely “where the puck is going to be.” But post-Ozzie, it hasn’t exactly been shaking the sheets. At CloudAustin there was more Azure interest two years ago than there is now. They were going strong on dev friendliness and all, but trying to get into IaaS has been a distraction and they just aren’t keeping pace with Amazon’s rate of new features. Docker support, SSDs, new instances, vCenter integration, Dropbox competitor, desktop-as-a-service Citrix competitor…

Let me address the four big “why AWS is crashing and burning (despite being in an obvious position of market dominance)” points from the “Scorpion” article.

  1. AWS is not the low price provider.
    Eh. Not sure why this is relevant and also not sure it’s true for what you are getting… It’s like saying “there are books cheaper than that book you just bought.” Well sure there are, but do they have the information I want in them? See below for why not always having the lowest cent per minute under Google and Microsoft doesn’t really concern me.
  2. AWS is not the best product at anything – most of their features are mediocre knock offs of other products.
    This misses the point – their features are SIMPLER knockoffs of other products. That’s why it’s an accelerator. Dropbox and Salesforce and all the successful cloud entities have said “you know, some enterprise user left to their own devices is going to generate a list of 1000 requirements they don’t really need. Forget that. Let’s make the actual core functionality they need and leave off the rest so it’ll actually get used.” This is why they dominate the IaaS business. Many of their products are named to match. “SIMPLE email service.” “SIMPLE queue service.” “SIMPLE notification service.” This drives a new wave of architectural thought – instead of complicated services packed with stuff, what if instead I integrate simple, well-designed microservices? After doing a lot of cloud architecture work, those attributes are positives, not negatives.
  3. AWS is unbelievably lousy at support.
    I’m not sure I’d want to be in a race with Amazon, Microsoft, and Google to see who supports customers worse. I’m not sure I’ve ever been part of an enterprise happy with its Google support, and all experiences I’ve had with Microsoft support have been some Brazil-esque “you can’t actually ask them questions, only some VP is a designated contact on the corporate contract…”. Amazon is positioning themselves more like a hardware vendor, you don’t bother getting much support from them besides parts replacement, you get support from the managed hosting provider or whatnot that’s a MSP on top of them if you need it.
  4. Once you are at $200k / month of spend, it’s cheaper and much more effective to build your own infrastructure
    This is frequently untrue, and based on people not understanding the full costs of getting stuck in the infrastructure business. What's your cost of delay? Average enterprise "wait for servers" time is about 6 weeks; assuming you're not just using them for nothing, your ROI is delayed by that amount. And what about all the operation of those complex systems? You can't just stick in the salary of the developers and sysadmins you'd need – stick in your revenue per employee instead, because that headcount could be doing something useful for your company instead of plumbing. Not to mention the cascading percentage of each layer of management's time spent worrying about the plumbing and the plumbers instead of conducting the core business of the company. Cost of delay from lost agility and opportunity cost are never taken into account but definitely should be.
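Here's a back-of-the-envelope sketch of that cost-of-delay argument; every number below is made up purely for illustration:

```python
# Hypothetical numbers, purely illustrative.
project_annual_value = 2_000_000      # value/year the delayed project would generate
wait_for_servers_weeks = 6            # typical enterprise provisioning delay
cost_of_delay = project_annual_value * (wait_for_servers_weeks / 52)

plumbing_headcount = 4
revenue_per_employee = 400_000        # what that headcount could produce instead
opportunity_cost = plumbing_headcount * revenue_per_employee

print(f"Cost of a {wait_for_servers_weeks}-week delay: ${cost_of_delay:,.0f}")
print(f"Opportunity cost of dedicated plumbing headcount: ${opportunity_cost:,.0f}/year")
# Compare those against the raw infrastructure savings before deciding that
# $200k/month of cloud spend means you should build your own.
```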

I know a lot of the old guard want cloud to dry up and go away, it bothers their lovely datacenters.  And some of the very new guard resent it because Amazon continues to be so successful – they keep up a rate of innovation that new players can’t disrupt. But this whole week of “the cloud is falling” news is complete BS, and won’t amount to much.


Filed under Cloud

Amazon Cuts Prices Too

Well, if nothing else I'm happy to have Google Cloud around to provide some competition to push Amazon Web Services. Immediately after Google announced dramatic price drops, Amazon has responded by doing the same!

Now if they can also be shamed into dropping their whole crazy reserved instance scheme and going to progressive discounts like Google just did, the world will be better.


March 27, 2014 · 7:10 am

ReInvent – Fireside Chat: Part 1

One of the interesting sessions at ReInvent was a fireside chat with Werner Vogels, where CEOs and CTOs of different companies/startups who use AWS talked about their applications/platforms and what they liked and wanted from AWS. It was a 3-part series with different folks; I was able to attend the first one, but I'm guessing videos of the others are available online. Interesting session, giving the audience a window into the way C-level people think about problems and solutions…

First up, the CTO of MongoDB…

Lots of people use Mongo to store things like user profiles for their applications. Mongo performance has gotten a lot better because of SSDs.

They recently raised $150 million and want to build out a lot of tools to administer Mongo better.

Apparently being a MongoDB DBA is a really high-paying job these days!

User roles may be available in Mongo next year to add more security.

Werner and Eliot want to work together to bring out a hosted version of Mongo, like RDS.

Next up, Twilio's Jeff Lawson

Jeff is ex-Amazon.


Software people want building blocks, not some crazy monolithic thing, to solve a problem. Telecom had this issue, and that is why I started Twilio.

Everyone is agile! We don’t have answers up front, but we figure out these answers as we go.

Started with voice, then moved to SMS, followed by a global presence. Most of our customers didn't want boundaries; they just wanted an API to communicate with their customers.

Werner: It’s hard to run an API business. Tell us more…
Lawson: It is really hard. APIs are kind of like webapps when it comes to scaling. REST helps a lot from this perspective. Multi-tenancy issues get amplified when you have an API business.

Twilio apparently deploys 20 times a day. AWS really helps with deployment because you can bring up brand new environments that look exactly like prod and then tear them down when they aren't needed.

When it comes to APIs, we write the documentation first and show it to our customers before actually implementing the API. Then iterate, iterate, iterate on the development.

Jeff's ask: make it easier to get a VPC up and running.

Next up: Valentino from AdRoll (real-time bidding)


There's a data collection pipe that gets about 20 TB of data every day.

Latency is king: typical latency is between 50ms and 100ms. This is still a lot for us. I wish we had more transparency when it comes to latency inside AWS and outside…

Why DynamoDB? We didn't find anything simpler at the time, and it was nice to be able to scale something without having to worry about it. We had zero ops people to work on scaling at the time.

Read/write rates: 80k reads per second (not strongly consistent), 40k writes per second.

Werner: Why Erlang? You're a Python god.
Valentino: I started working in Python with the Twisted framework, but I realized Python didn't fit our use case well; the Twisted version worked, but it would have been complicated to manage and needed a bunch of hacks.

Today it would be hard to pick between Erlang and Go…


Filed under Cloud, Conferences

ReInvent 2013: Day 2 Keynote

I didn't cover the day 1 keynote, but fortunately it can be found here. The day 2 keynote was a lot more technical and interesting, though. Here are my notes from it:

First, we began with how AWS plans its projects.

Lots of updates every year!

Before any project is started, while teams are still in the brainstorming phase, a few key things are always done before any code is written:

  • Meeting minutes
  • FAQ
  • Figure out the UX

"2 Pizza Teams": small autonomous teams that have roadmap ownership with decoupled launch schedules.

Customer collaboration

Get the functionality in the hands of customers as soon as possible. It may be feature-limited, but it's in customers' hands so they can give feedback as soon as possible. Iterate, iterate, iterate based on feedback. Different from the old guard, where everything is engineering-driven and unnecessarily complex.

Netflix platform….

Netflix is on stage and we're talking about the Netflix Cloud Prize and the enhancements to the different tools… looks pretty cool, and I will need to check them out. There are 14 chaos monkey "tests" to run now instead of just 1 from before.

Cloud prize winners

Werner is back and breaking down the different facets that AWS focuses on:

  • Performance- measure everything; put performance data in log files that can be mined.
  • Security
  • Reliability
  • Cost
  • Scalability

Ilya Sukhar, CEO of Parse, is on stage now (platform for mobile apps)
– Parse Data: store data; it's 5 lines of code instead of a bunch of code.
– Push notifications

Parse started with 1 AWS instance and went from 0 to 180,000 apps.

180,000 collections in MongoDB; showed the differences between pre- and post-PIOPS (provisioned IOPS).

Security

IAM and IAM roles to set boundaries on who can access what.
How to do this from a DB perspective?
Apparently you can have fine-grained access controls on DynamoDB instead of writing your own code (see the policy sketch below).
Each data block is encrypted in Redshift.
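A hedged sketch of what those fine-grained DynamoDB controls look like – an IAM policy (expressed here as a Python dict, with a made-up account and table) that restricts each federated user to items whose hash key is their own user ID:

```python
# Placeholder account ID and table name; the condition key is the mechanism
# DynamoDB exposes for fine-grained access control.
fine_grained_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:Query"],
        "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/UserProfiles",
        "Condition": {
            "ForAllValues:StringEquals": {
                "dynamodb:LeadingKeys": ["${www.amazon.com:user_id}"]
            }
        },
    }],
}
```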
Cost:
Talking about how customers are using the spot instances to save $.

Scalability:
The WeTransfer use case – they take care of transferring large files.

Airbnb on stage with Mike Curtis, VP of Engineering
– 350k hosts around the world
– 4 million guests (Jan 2013)
– 9 million guests today.

They use a host of AWS services:
– 1k EC2 instances
– A million RDS rows
– 50 TB for photos in S3

All of this at Airbnb is run with a 5-person ops team.

That helps devote resources to the real problem.

AirBnB in 2011

AirBnB in 2012

Dropcam came on stage after that to talk about how they use the AWS platform. Nothing too crazy, but interestingly, more inbound video is uploaded to Dropcam than to YouTube!

Dropcam

The keynote ended with an Amazon Kinesis demo (and a deadmau5 announcement for the replay party), which from the outside looks like a streaming API with different ways to process data on the backend. A prototype that streamed data from Twitter and performed analytics on it was shown to demonstrate the service.

Announcements

  • RDS for PostgreSQL
  • New instance types – i2, for much better IO performance
  • DynamoDB – global secondary indexes!!
  • Federation with SAML 2.0 for IAM
  • Amazon RDS – cross-region read replicas!
  • G2 instances for media- and video-intensive applications
  • New C3 instances with the fastest processors – 2.8 GHz Intel E5 v2
  • Amazon Kinesis – real-time processing, fully managed. It looks like this will help you solve scalability issues when you're trying to build realtime streaming applications. It integrates with storage and processing services.


In case you want to watch it, the day 2 keynote is here: http://www.youtube.com/watch?v=Waq8Y6s1Cjs

And also, the day 1 keynote: http://www.youtube.com/watch?v=8ISQbdZ7WWc


Filed under Cloud, Conferences

ReInvent 2013- Scaling on AWS for the First 10 Million Users

This was the first talk I went to at ReInvent, by @simon_elisha, and it was a packed room. It was targeted at developers going from the inception of an app to growing it to 10 million users. Following are the notes I took…

- "We will need a bigger box" is the first issue when you start seeing traffic to an application. A single box is an anti-pattern because there's no failover, etc. Move your DB off the web server, and so on… you could use RDS or something similar too.

– SQL or NoSQL?
Not a binary decision; maybe use both? A blended approach can reduce technical debt. Maybe just start with SQL because it's familiar and there are clear patterns for scalability. NoSQL is great for super-low-latency apps, metadata data sets, fast lookups, and rapidly ingesting data.

So for 100 users…
You can get by using Route 53, an ELB, and multiple web instances.

For 10,000 users…
- Use CloudFront to cache any static assets.
- Get your session state out of the web servers. Session state could be stored in DynamoDB, since it's just non-relational data (a quick sketch follows this list).
- It also might be time for ElastiCache now, which is just hosted Redis or Memcached.
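A minimal sketch of the "session state in DynamoDB" idea (boto3, with a placeholder table – my example, not from the talk):

```python
import time
import uuid

import boto3

# "sessions" is a placeholder table with a string hash key "session_id".
sessions = boto3.resource("dynamodb").Table("sessions")

def create_session(user_id):
    """Keep session state out of the web tier so any instance can serve the user."""
    session_id = str(uuid.uuid4())
    sessions.put_item(Item={
        "session_id": session_id,
        "user_id": user_id,
        "expires_at": int(time.time()) + 3600,  # expire in an hour
    })
    return session_id

def load_session(session_id):
    return sessions.get_item(Key={"session_id": session_id}).get("Item")
```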

Auto scaling…
Min and max servers running in multiple AZs. AWS makes this really simple.
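For instance, a hedged sketch of an auto scaling group spanning AZs (boto3; the launch configuration name and AZs are placeholders):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Assumes a launch configuration named "web-lc" already exists.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-asg",
    LaunchConfigurationName="web-lc",
    MinSize=2,        # always keep a couple of instances up
    MaxSize=20,       # cap the spend
    DesiredCapacity=4,
    AvailabilityZones=["us-east-1a", "us-east-1b", "us-east-1c"],
)
```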

If you end up at the 500k users situation you probably really want:
– metrics and alarms
– automated builds and deploys
– centralized logging

must haves for log metrics to collect:
– host level metrics
– aggregate level metrics
– log analysis
– external site performance

Use a product for this, because there are plenty available, and you can focus on what you’re really trying to accomplish.

Create tools to automate, so you save your time. Some of the ones you can use are Elastic Beanstalk and AWS OpsWorks (more for developers), and CloudFormation and raw EC2 (more for ops). The key is to be able to repeat those deploys quickly. You will probably need to use Puppet or Chef to manage the actual EC2 instances.

Now you probably need to redesign your app when you're at the million-user mark. Think about using a service-oriented architecture: loose coupling for the win instead of tight coupling. You can probably put a queue between two pieces to decouple them.
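Something like this, as a minimal sketch (boto3 and a placeholder queue – my example, not from the talk):

```python
import json

import boto3

sqs = boto3.client("sqs")
queue_url = sqs.create_queue(QueueName="thumbnail-jobs")["QueueUrl"]

# The front end just enqueues work and returns immediately...
sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps({"photo_id": 42}))

# ...while a separate worker fleet drains the queue at its own pace.
messages = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10)
for msg in messages.get("Messages", []):
    job = json.loads(msg["Body"])
    print("processing photo", job["photo_id"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```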

Key tip: don’t reinvent the wheel.

Example of what to do when you have a user uploading a picture to a site.

Simple workflow service
– workers and deciders: provides orchestration for your code.

When your data tier starts to break down (5-10 million users):
- Federation: split by function or purpose. Gotcha – you will have issues with join queries.
- Sharding: works well for one table with billions of rows (a quick sketch follows this list). Gotcha – operationally confusing to manage.
- Shift to NoSQL: sorta similar to federation. Gotcha – a crazy architecture change. Use DynamoDB.
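To make the sharding idea concrete, here's a tiny sketch (hypothetical, not from the talk) of routing a huge table across shards by hashing the key:

```python
import hashlib

NUM_SHARDS = 16
# Placeholder connection strings, one per shard database.
SHARD_DSNS = [f"mysql://db-shard-{i:02d}/app" for i in range(NUM_SHARDS)]

def shard_for(user_id: str) -> str:
    """Pick a shard deterministically from the row's key."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return SHARD_DSNS[int(digest, 16) % NUM_SHARDS]

# The operational gotcha: every query now needs its key up front, and
# cross-shard joins have to happen in application code.
print(shard_for("user-12345"))
```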

Final Tips


Filed under Cloud, Conferences

Velocity 2012 Day Two

After John Allspaw and Steve Souders caper about in fake muscles, we get started with the keynotes.

Building for a Billion Users (Facebook)

Jay Parikh, Facebook, spoke about building for a billion users.  Several folks yesterday warned with phrases like “if you have a billion users and hundreds of engineers then this advice may be on point…”

Today in 30 minutes, Facebook did…

  • 10 TB log data into hadoop
  • Scan 105 TB in Hive
  • 6M photos
  • 160m newsfeed stories
  • 5bn realtime messages
  • 10bn profile pics
  • 108bn mysql queries
  • 3.8 tn cache ops

Principle 1: Focus on Impact

Day one code deploys. They open sourced Phabricator for code review. 30-day coder boot camp – Ops does the coder boot camp, then weeks of ops boot camp. Mentorship program.

Principle 2: Move Fast

  • Commits scale with people, but you have to be safe! Perflab does a performance test before every commit with log-replay. Also does checks for slow drift over time.
  • Gatekeeper is the feature flag tool, used for A/B testing – 500M checks/sec. We have one of these at Bazaarvoice and it's super helpful (a minimal sketch of the idea follows this list).
  • Claspin is a high density heat map viewer for large services
  • fast deployment technique – used shared memory segment to connect cache, so they can swap out binaries on top of it, took weeks of back end deploys down to days/hours
  • Built lots of ops tools – people | tools | process
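A minimal sketch of the gatekeeper/feature-flag idea (hypothetical, nothing to do with Facebook's actual implementation): hash the user into a bucket and compare against a rollout percentage, so checks are cheap and deterministic.

```python
import hashlib

# Rollout table; in a real system this lives in config or cache, not code.
FLAGS = {"new_newsfeed": 10, "dark_launch_search": 100}   # percent of users enabled

def is_enabled(flag: str, user_id: str) -> bool:
    """Deterministically bucket a user 0-99 and compare to the rollout percent."""
    pct = FLAGS.get(flag, 0)
    bucket = int(hashlib.sha1(f"{flag}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < pct

# A/B testing: the same user always lands in the same bucket for a given flag.
print(is_enabled("new_newsfeed", "user-42"))
```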

Random Bits:

  • Be bold
  • They use BGP in the data center
  • Did we mention how cool we are?
  • Capacity engineering isn’t just buying stuff, it’s APM to reduce usage
  • Massive fail from gatekeeper bug.  Had to pull dns to shut the site down
  • Fix more, whine less

Investigating Anomalies, Amazon

John Rauser, Amazon data scientist, on investigating anomalies.  He gave a well received talk on statistics for operations last year.
He used a very long “data in the time of cholera” example.  Watch the video for the long version.
You can’t just look at the summary, look at distributions, then look into the logs themselves.
Look at the extremes and you’ll find things that are broken.
Check your long tail, monitor percentiles long in the tail.

Building Resilient User Experiences

Mike Brittain, Etsy on Building Resilient User Experiences – here’s the video.

Don’t mess up a page just because one of the 14 lame back end services you compose into it is down. Distinguish critical back end service failures and just suppress other stuff when composing a product page. Consider blocking vs non-blocking ajax. Load critical stuff synchronously and noncritical async/not at all if it times out.
Google Apps has those nice "problem, retrying in an interval" messages (use exponential backoff in your UI!).
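The exponential backoff bit, as a tiny generic sketch (Python, purely illustrative):

```python
import random
import time

def fetch_with_backoff(fetch, max_attempts=5, base_delay=0.5):
    """Retry a non-critical fetch, doubling the wait (plus jitter) each time."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except IOError:
            if attempt == max_attempts - 1:
                return None   # give up and degrade gracefully instead of blocking the page
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```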

Planning for 100% availability is not enough; plan for failure.  Your UI should adapt to failure. That’s usually a joint dev+ops+product call.
Do operability reviews, postmortems. Watch your page views for your error template!  (use an error template)

Speeding up the web using prediction of user activity

Arvind Jain/Dominic Hamon from Google – see the video.

Why things would be so much faster if you just preload the next pages a user might go to.  Sure.  As long as you don’t mind accidentally triggering all kinds of shit.  It’s like cross browser request forgery as a feature!
<link rel=’prerender’> to provide guidance.

Annoyed by this talk, and since I've already read How Complex Systems Fail, I spent the rest of the morning plenary time wandering the vendor floor.

It’s All About Telemetry

From the famous and irascible Theo Schlossnagle (@postwait). Here’s the slides.
Monitor what matters! Most new big data problems are created by our own solutions in the first place, and are thus solvable despite their ROI. e.g. logs.

What’s the cost/benefit of your data?
Don’t erode granularity (as with RRD).  It controls your storage but your ability to, say, do YOY black friday compares sucks.
As you zoom in, you see major differences and patterns that were obscured.

There's a cost/benefit curve for monitoring that goes positive but eventually goes negative. Value = benefit – cost. So you don't go to the end of the curve; you want the max difference!

Technique 1: Text
Just store changes- be careful not to have too many changes and store them

Technique 2: Numeric
store rollups, over 1 minute: min/max/avg/stddev/covar/50/95/99%
store first order derivative and then derivative of that (jerkiness)
db replication – lag AND RATE OF LAG CHANGE
“It’s a lot easier to see change when you’re actually graphing change”
Project numbers out. Graph storage space going down!
With simple numeric data you can do prediction (Holt-Winters) even hacked into RRD

Technique 3: Histograms
about 2 KB for a 1-minute histogram (5 bytes for a single bucket)

Event correlation – change mgmt vs performance, is still human eye driven

Monitor everything.

  • the business – financials, marketing, support! problems, custsat, res time
  • operations! durr
  • system/db/yeah
  • middleware! messaging, apis, etc.

They use reconnoiter, statsd, d3, flot for graphing.

This was one of the best sessions of the day IMO.

RUM For Breakfast

Carlos from Facebook on routing users to the closest datacenter with Doppler. Normal DNS geo-routing depends on resolvers being near the user, etc.
They inject js into some browsers and log packet latency.  Map resolver IP to user IP, datacenter, latency. Then you can cluster users to nearest data centers. They use for planning and analysis – what countries have poor latency, where do we need peering agreements
Akamai said doing it was impractical, so they did it.
Amazon has “latency based routing” but no one knows how it works.
Google as part of their standard SOP has proposed a DNS extension that will never be adopted by enough people to work.

Looking at log-normal perf data – look at the whole spread but that can be huge, so filter then analyze.
Margin of error – 1.96*sd/sqrt(num), need less than 5% error.

How does performance influence human behavior?
Very fast sessions had high bounce rates (probably errors)
Strong bounce rate to load time correlation, and esp front end speed
Toxicology has the idea of a “median lethal dose”-  our LD50 is where we pass 50% bounce rate
These are: Back end 1.7s, dom load 1.8s, dom interactive 2.75s, front end 3.5s, dom complete 4.75s, load event 5.5s

Rollback, the Impossible Dream

by James Turnbull! Here’s the slides.

Rollback is BS.  You are insane if you rely on it. It’s theoretically possible, if you apply sufficient capital, and all apps are idempotent, resources…

  • Solve for availability, not rollback. Do small iterative changes instead.
  • Accept that failure happens, and prevent it from happening again.
  • Nobody gives a crap whose fault it was.
  • Assumption is the mother of all fuckups.
  • This can’t be upgraded like that because…  challenge it.

Stability Patterns

by Michael Nygard (@mtnygard) – Slides are here.   For more see his pragmatic programmers book Release It!

Failures come in patterns. Here’s some major ones!

Integrations are the #1 risk to stability, from both “in spec” and “out of spec” errors. Integrations are a necessary evil.  To debug you have to peel back the layers of abstraction; you have to diagnose a couple levels lower than the level an error manifests. Larger systems fail faster than small ones.

Chain Reactions – #2!
Failure moves horizontally across tiers, search engines and app servers get overloaded, esp. with connection pools. Look for resource leaks.

Cascading Failure
Failure moves vertically cross tiers. Common in SOA and enterprise services. Contain the damage – decouple higher tiers with timeouts, circuit breakers.

Blocked Threads
All threads blocked = "crash". Use java.util.concurrent or System.Threading (or Ruby/PHP, don't do it). Hung request handlers = less capacity and frustrated users. Scrutinize resource pools. Decompile that lame-ass third party code to see how it works; if you're using it, you're responsible for it.

Attacks of Self-Denial
Making your own DoS attack, often via mass unexpected promotions. Open lines of communication with marketers

Unbalanced Capacities
Your environment scaling ratios are different dev to qa to prod. Simulate back end failures or overloading during testing!

Unbounded result sets
Dev/test have smaller data volumes and unreal relationships (what’s the max in prod?). SOA with chatty providers – client can’t trust not being hurt. Don’t trust data producers, put limits in your APIs.

Stability patterns!

Circuit Breaker
Remote call wrapped with a retry loop (always 3!)
Immediate retries are very likely to fail again – TCP fixes packet drops for you man
Makes the user wait longer for their error, and floods the servers (cascading failure)
Count failures (leaky bucket) and stop calling the back end for a cool-off period
Can critical work be queued for later, or rejected, or what?
State of circuit breakers is a good dashboard!
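A minimal sketch of the pattern (hypothetical Python, not from the talk): count recent failures and short-circuit calls to the back end during a cool-off period.

```python
import time

class CircuitBreaker:
    """Stop hammering a failing back end: after too many errors, fail fast for a while."""

    def __init__(self, max_failures=5, cooloff_seconds=30):
        self.max_failures = max_failures
        self.cooloff_seconds = cooloff_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.cooloff_seconds:
                raise RuntimeError("circuit open: back end presumed down")
            # Cool-off over; let one request through to probe the back end.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()
            raise
        self.failures = 0
        return result
```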

Bulkheads
Partition the system and allow partial failure without losing service.
Classes of customer, etc.
If foo and bar are coupled with baz, then hitting baz can bork both. Make baz pools or whatever.
Less efficient resource use but important with shared-service models

Test Harness
Real-world failures are hard to create in QA
Integration tests don’t find those out-of-spec errors unless you force them
Service acting like a back end but doing crazy shit – slow, endless, send a binary
Supplement testing methods

The Colo or the Cloud?

Ken King/James Sheridan, Yammer

They started in a colo.
Dual network, racks have 4 2U quad and 10 1Us
EC2 provides you: network, cabling, simple load balancing, geo distribution, hardware repair, network and rack diagrams (?)
There are fixed and variable costs, assume 3 year depreciation
AWS gives you 20% off at $2M/year?
Three-year commit, reserved instances
one rack, one year – $384k colo, $378k EC2, $199k reserved
20 racks, one year – $105k colo, $158k EC2 even with reserve and the 20% discount
20 racks, 3 years – $50k colo, $105k EC2

But – speed (agility) of ramping up…

To me this is like cars.  You build your own car, you know it.  But for arbitrary small numbers of cars, you don’t have time for that crap.  If you run a taxi fleet or something, then you start moving back towards specialized needs and spec/build your own.

Colo benefits – ownership and knowledge.  You know it and control it.

  • Load balancing (loadbalancer.com or zeus/riverbed only cloud options)
  • IDS etc. (appliances)
  • Choose connectivity
  • know your IO, get more RAM, more cores
  • Throw money at vertical scalability
  • Perception of control = security

Cloud benefits – instant. Scaling.

  • Forces good architecture
  • Immediate replacement, no sparing etc.
  • Unlimited storage, snapshots (1 TB volume limit)
  • No long term commitment
  • Provisioning APIs, autoscaling, EU storage, geodist

Hybrid.

  • crocodoc and encoder yammer partners are good to be close to
  • need to burst work
  • dev servers, windows dev, demo servers banished to AWS
  • moving cross connects to ec2 with vpc

Whew!  Man I hope someone’s reading this because it’s a lot of work.  Next, day three and bonus LSPE meeting!


Filed under Conferences, DevOps

The Real Lessons To Learn From The Amazon EC2 Outage

As I write this, our ops team is working furiously to bring up systems outside Amazon’s US East region to recover from the widespread outage they are having this morning. Naturally the Twitterverse, and in a day the blogosphere, and in a week the trade rag…axy? will be talking about how the cloud is unreliable and you can’t count on it for high availability.

And of course this is nonsense. These outages make the news because they hit everyone at once, but all the outages people are continually having at their own data centers are just as impactful, if less hit-generating in terms of news stories.

Sure, we’re not happy that some of our systems are down right now. But…

  1. The outage is only affecting one of our beta products, our production SaaS service is still running like a champ, it just can’t scale up right now.
  2. Our cloud product uptime is still way higher than our self hosted uptime. We have network outages, Web system outages, etc. all the time even though we have three data centers and millions of dollars in network gear and redundant Internet links and redundant servers.

People always assume they'd have zero downtime if they were hosting it themselves. Or maybe that they could "make it get fixed" when it does go down. But that's a false sense of security based on an archaic misconception that having things on premise gives you any more control over them. We run loads of on-premise Web applications and a batch of cloud ones, and once we put some Keynote/Gomez/Alertsite monitoring against them, we determined our cloud systems have much higher uptime.

Now, there are things that Amazon could do to make all this better on customers. In the Amazon SLAs, they say of course you can have super high uptime – if you are running redundantly across AZs and, in this case, regions. But Amazon makes it really unattractive and difficult to do this.

What Amazon Can Do Better

  1. We can work around this issue by bringing up instances in other regions.  Sadly, we didn’t already have our AMIs transferred into those regions, and you can only bring up instances off AMIs that are already in those regions. And transferring regions is a pain in the ass. There is absolutely zero reason Amazon doesn’t provide an API call to copy an AMI from region 1 to region 2. Bad on them. I emailed my Amazon account rep and just got back the top Google hits for “Amazon AMI region migrate”. Thanks, I did that already.
  2. We weren’t already running across multiple regions and AZs because of cost. Some of that is the cost of redundancy in and of itself, but more importantly is the hateful way Amazon does reserve pricing, which very much pushes you towards putting everything in one AZ.
  3. Also, redundancy only really works if you have everything, including data, redundant across AZs. If you are running redundant app servers across 4 AZs but have your database in one of them – or have a database master in one and slaves in the others – you still get hosed by a particular AZ or region going down.

Amazon needs to have tools that inherently let you distribute your stuff across their systems and needs to make their pricing/reserve strategy friendlier to doing things in what they say is “the right way.”

What We Could Do Better

We weren’t completely prepared for this. Once that region was already borked, it was impossible to migrate AMIs out of it, and there are so many ticky little region specific things all through the Amazon config – security groups, ELBs, etc – doing that on the fly is not possible unless you have specifically done it before, and we hadn’t.

We have an automation solution (PIE) that will regen our entire cloud for us in a short amount of time, but it doesn’t handle the base images, some of which we modify and re-burn from the Amazon ones. We don’t have that process automated and the documentation was out of date since Fedora likes to move their crap around all the time.

In the end, Amazon started coming back just as we got new images done in us-west-1.  We’ll certainly work on automating that process, and hope that Amazon will also step up to making it easier for their customers to do so.


Filed under Cloud, DevOps