Tag Archives: Cloud

Google Cloud Update

We had a little get-together here in Austin today, sponsored by MomentumSI and hosted at Capital Factory (thanks to both!), to view the Google Cloud Platform newest product announcement webcast. About 24 local engineers showed up to watch.

You can view the whole thing yourself here, or just read my notes from the event.

Cloud Is Hard

Their thesis statement was that cloud, while cool, is still too hard for many people, hindering adoption or slowing innovation. So they’ve worked on making it easier.

Cost

Cost calculation is super complex (reserve, on demand, etc.). They talk about “other industry standard clouds” by which they mean Amazon Web Services. They note the drawbacks to reserved instances, which I am all totally in agreement on (see my earlier article Why Amazon Reserve Instances Torment Me for more on that). Specifically they note that reservations constrain your design choices, which is one of the great reasons to go to the cloud in the first place – Amen, brother!

Though cloud prices have been dropping 6-8% a year, hardware’s been dropping 20-30%. Why is Moore’s Law not translating into more sweet green in our pockets? It should, they contend. Thus, they are announcing on demand price drops:

  • GCE 32% price drop
  • Storage is now .026 cents/GB for any use
  • .02 c/GB for reduced durability storage
  • bigquery 85% reduction
  • can now purchase predictable throughput

Introducing sustained use discounts – no pre-plan or reserving ahead of time, instead prices automatically drop as VM usage is sustained over 25% of the month and then progressively from there. 100% use is a 53% discount over current (remember that includes the new 32% reduction, so another 21% from current for continued use). With linear machine cost scaling, makes it simple(r) to predict and calculate your costs.

Other Tradeoffs

Current cloud (hint: AWS) forces other tradeoffs: time to market vs scalability, flexibility (iaas) vs automatic management (paas), big data vs realtime data analysis.

But first, we interrupt our messaging to talk about other random new features based on customer feedback. To wit:

  • SuSE/Red Hat support
  • Windows Server 2008 R2 (preview) support
  • Cloud DNS service, accessible via API and console

The features are nice but even nicer was that they implemented these based on customer feedback, which means they consider this a real product with real customers and not just a fun tech thing for their own ends (which to be fair 80% of Google’s offerings are, and it can be hard to tell the difference).

Time to Market vs Scalability

So on scaling… You need deployment! Troubleshooting! Use tools you know!
They have a new “gcloud” command line tool
“gcloud init” pulls down the app via git, you can just edit, git commit, git push
They have a build service integrated – it spins up a jenkins/maven and builds, deploys – you can see release status in the console.
There’s also a new unified logs viewer with basic searching – like Splunk junior, with one cool dev feature. Click on the code in the stack trace and you’re put directly into the code in the console’s source view. Fix and commit, it auto-builds, bam you’re fixed.

IaaS vs PaaS

A new halfway state – “managed VMs.” It’s the normal PaaS, but in the config, you can tell it things to apt-get install onto the instances, so you can have more third party software than the PaaS previously allowed.
Also, you can “enable debugging” on an instance and then log in interactively.

Big Data vs Realtime Data Analysis

They’ve upped BigQuery to have 100k rows/sec ingest.
Example Demo: smart monitoring of 60 events/hour from 400k glen canyon power meters (17bn events/mo), with about 128k records. They did a visualization that is updating in near real time showing all those meters geolocated and you can go click on them to get realtime data.
He showed the complex BigQuery “bigjoin” to filter by meter lat/long from sep table and then by quartile across whole population. “Doing this in NoSQL would be impossible or very slow.”

They will be doing a Google Cloud roadshow soon – see cloud.google.com/roadshow – it looks like Austin will be on the list of cities!

Analysis

The good thing about getting a bunch of techies together to view this was the discussion afterwards.  The general sentiment was that:

1. The cost drops are nice and the approach to reserve/sustained use instances is much better. The reserve instance scheme is one of the worst things about AWS and if this drives them to adopt the same model, hooray!

2. The other new features (managed VMs, gcloud) are definitely nice. They are focusing on dev friendliness in their discussion but it’s a lot less clear how to operate this. If you’re really trying to stitch together a bunch of micro-services there’s not a lot of great support for that. They talk about using their PaaS and say “of course, if you use our PaaS you don’t need to carry a pager! You’d only need to do that if you’re doing IaaS and maintaining your own OSes.” That is dangerously naive and really made the whole group skittish. Most people there have done “play” things in Google’s cloud but are reticent to put mission critical items there, and this section of the presentation didn’t do a lot to improve that.

3. The BigQuery/realtime demo was impressive and multiple people would like to kick the tires on it.

Overall – it was a little light, but it was a keynote; the new features/pricing are all good; this shows more Google commitment to their cloud as a product but actual concerns still linger about maturity and suitability for realistically complex revenue-generating production applications.

 

Leave a comment

Filed under Cloud

ReInvent – Fireside Chat: Part 1

One of the interesting sessions at ReInvent was a fireside chat with Werner Vogels., where CEO’s or CTO’s of different companies/startups who use AWS talked about their applications/platforms and what they liked and wanted form AWS. It was a 3 part series with different folks, and I was able to attend the 1st one, but I’m guessing videos are available for the others online.  Interesting session, giving the audience a window into the way C level people think about problems and solutions…

First up, the CTO of mongodb…

Lots of people use mongo to store things like user profiles etc for their applications. Mongo performance has gotten a lot better because of ssd’s

Recently funded 150 million, and wanting to build out a lot of tools to be able to administer mongo better.

Apparently being a mongodb dba is a really high paying job these days!

User roles may be available in mongo next year to add more security.

Werner and Eliot want to work together to bring a hosted version of mongo like RDS.

Next up twilio’s Jeff Lawson

Jeff is ex amazon.

Untitled

Software people want building blocks and not some crazy monolithic thing to solve a problem. Telecom had this issue, and that is why I started Twilio.

Everyone is agile! We don’t have answers up front, but we figure out these answers as we go.

Started with voice, then moved to SMS followed by a global presence. Most customers of ours wanted something that didn’t want boundaries and just wanted an API to communicate with their customers.

Werner: It’s hard to run an API business. Tell us more…
Lawson: It is really hard. Apis are kinda like webapps when it comes to scaling. REST helps a lot from this perspective. Multi tenancy issues gets amplified when you have an API business.

Twilio apparently deploys 20 times a day. Aws really helps with deployment because you can bring brand new environments that look exactly like prod and then tear it down when things aren’t needed.

When it comes to api’s, we write the documentation first and show our customers first before actually implementing the API. Then iterate iterate iterate on the development.

Jeff asks: Make it easier to make vpc up and running.

Next up: Valentino with adroll (realtime bidding)

Untitled

There’s a data collection pipe which gets like 20 tb of data everyday.

Latency is king: Typically latency is like 50ms and 100ms. This is still a lot for us. I wish we had more transparency when it comes to latency inside aws and otherwise…

Why dynamo db? Didn’t find something simple at the time, and it was nice to be able to scale something without having to worry about it. We had 0 ops people at the time to work on scaling at the time.

Read write rates: 80k reads per second (not consistent), 40k writes per second.

Why erlang? You’re a python god.
I started working on Python with the twisted framework. But I realized that Python didn’t fit our use case well; the twisted system worked just as well but it would be complicated to manage it and needed a bit of hacks..

Today it would be hard to pick between erlang and go….

Leave a comment

Filed under Cloud, Conferences

ReInvent 2013: Day 2 Keynote

I didn’t cover the day 1 keynote, but fortunately it can be found here. The day 2 keynote was a lot more technical and interesting though. Here are my notes from it:

First, we began by talking about how aws plans its projects.

Lots of updates every year!

Before any project is started, and teams are in the brainstorming phase. A few key things are always done.

  • Meeting minutes
  • FAQ
  • Figure out the ux
  • Before any code is written

“2 Pizza Teams”: Small autonomous teams that had roadmap ownership with decoupled lauch schedules.

Customer collaboration

Get the functionality in the hands of customers as soon as possible. It may be feature limited, but it’s in the hands of customers so that they can get feedback as soon as possible. Iterate iterate iterate based on feedback. Different from the old guard where everything is engineering driven and it is unnecessarily complex.

Netflix platform….

Netflix is on stage and we’re taking about the Netflix cloud prizes and talking about the enhancements to the different tools…looks pretty cool, and will need to check them out. There are 14 chaos monkey “tests” to run now instead of just 1 from before.

Cloud prize winners

Werner is back is breaking down the different facets that AWS focuses on:

  • Performance- measure everything; put performance data in log files that can be mined.
  • Security
  • Reliability
  • Cost
  • Scalability

Illya sukhar CEO from Parse is on stage now (platform for mobile apps)
-parse data: store data; it’s 5 lines of code instead of a bunch of code.
-push notification

Parse started with 1 aws instance
From 0-180,000 apps

180,000 collections in mongodb; shows differences between pre and post piops

Security

IAM and IAM roles to set boundaries on who can access what.
How to do this from a db perspective?
Apparently you can have fine grained access controls on dynamodb instead of writing your own code.
Each data block is encrypted in redshift
Cost:
Talking about how customers are using the spot instances to save $.

Scalability:
We transfer usecase, who take care of transferring large files.

Airbnb on stage with mike curtis, VP of engineering
-350k hosts around the world
-4 millions guests (jan 2013)
-9 million guests today.

Host of aws services
1k ec2 instances
Million RDS rows
50tb for photos in s3

“The ops team at Airbnb is with a 5 person ops team.”

Helps devote resources to the real problem.

AirBnB in 2011

AirBnB in 2012

Dropcam came on stage after that to talk about how they use the AWS platform. Nothing too crazy, but interestingly more inbound videos are sent to dropcam than YouTube!

Dropcam

They keynote ended with an Amazon Kinesis demo (and a deadmau5 announcement for the replay party), which on the outside looks like a streaming API and different ways to process data on the backend. A prototype of streaming data from twitter and performing analytics was shown to demonstrate the service.

Announcements

  • RDS for PostgreSQL
  • New instance types-i2 for much better io performance
  • Dynamo db- global secondary indexes!!
  • Federation with saml 2.0 for IAM
  • Amazon RDS- cross region read replicas!
  • G2 instances for media and video intensive application
  • C3 instances are new with fastest processors- 2.8 gig intel e5 v2
  • Amazon kinesis- real time processing, fully managed. It looks like this will help you solve issues of scalability when you’re trying to build realtime streaming applications. It integrates with storage and processing services.

Announcements

Incase you want to watch it, the day 2 keynote is here: http://www.youtube.com/watch?v=Waq8Y6s1Cjs

And also, the day 1 keynote: http://www.youtube.com/watch?v=8ISQbdZ7WWc

2 Comments

Filed under Cloud, Conferences

ReInvent 2013- Scaling on AWS for the First 10 Million Users

This was the first talk by @simon_elisha I went to at ReInvent, and was a packed room. It was targeted towards developers going from inception of an app to growing it to 10 million users. Following are the notes I took…

– We will need a bigger box is the first issue, when you start seeing traffic to an application. Single box is an anti pattern because of no failover etc. move out your db from the web server etc…you could use RDS or something too.

– SQL or NoSQL?
Not a binary decision; maybe use both? A blended approach can reduce technical debt. Maybe just start with SQL because it’s familiar and there are clear patterns for scalability. Nosql is great for super low latency apps, metadata data sets, fast lookups and rapid ingesting data.

So for 100 users…
You can get by using route53, ELB, multiple web instances.

For 10000 users…
– Use cloud front to cache any static assets.
– Get your session state out of the webservers. Session state could be stored in dynamo db because it’s just unrelated data.
– Also might be time for elastic cache now which is just hosted redis or memcached.

Auto scaling…
Min, max servers running in multiple az zones. AWS makes this really simple.

If you end up at the 500k users situation you probably really want:
– metrics and alarms
– automated builds and deploys
– centralized logging

must haves for log metrics to collect:
– host level metrics
– aggregate level metrics
– log analysis
– external site performance

Use a product for this, because there are plenty available, and you can focus on what you’re really trying to accomplish.

Create tools to automate so you save your time especially to manage your time. Some of the ones that you can use are: elastic beanstalk, aws opsworks more for developers and cloud formation and raw ec2 for ops. The key is to be able to repeat those deploys quickly. You probably will need to use puppet and chef to manage the actual ec2 instances..

Now you probably need to redesign your app when you’re at the million user mark. Think about using a service oriented architecture. Loose coupling for the win instead of tight coupling. You can probably put a queue between 2 pieces

Key tip: don’t reinvent the wheel.

Example of what to do when you have a user uploading a picture to a site.

Simple workflow service
– workers and deciders: provides orchestration for your code.

When your data tier starts to break down 5-10 mill users
– federation
Split by function or purpose
Gotcha- You will have issues with join queries
– sharding
This  works well for one table with billions of rows.
Gotcha- operationally confusing to manage
– shift to nosql
Sorta similar to federation
Gotcha- crazy architecture change. Use dynamo db.

Final Tips

Leave a comment

Filed under Cloud, Conferences

LASCON 2013 Report – Second Afternoon

I’m afraid I only got to one session in the afternoon, but I have some good interviews coming your way in exchange!

User Authentication For Winners!

I didn’t get to attend but I know that Karthik’s talk on writing a user auth system was good, here are the slides. When we were at NI he had to write the login/password/reset system for our product and we were aghast that there was no project out there to use, you just had to roll your own in an area where there are so many lurking security flaws.  He talks about his journey and you should read it!

AWS CloudHSM And Why It Can Revolutionize Cloud

Oleg Gryb (@oleggryb), security architect at Intuit, and Todd Cignettei, Sr. Product Manager with AWS Security.

Oleg says: There are commonly held concerns about cloud security – key management, legal liability, data sovereignty and access, unknown security policies and processes…

CloudHSM makes objects in partitions not accessible by the cloud provider. It provides multiple layers of security.

[Ed. What is HSM?  I didn’t know and he didn’t say.  Here’s what Wikipedia says.]

Luckily, Todd gets up and tells us about the HSM, or Hardware Security Module. It’s a purpose built appliance designed to protect key material and perform secure cryptographic operations. The SafeNet Luna SA HSM has different roles – appliance administrator, security officer. It’s all super certified and if tampered with blows up the keys.

AWS is providing dedicated access to SafeNet Luna SA HSM appliances. They are physically in AWS datacenters and in your VPC. You control the keys; they manage the hardware but they can’t see your goodies. And you do your crypto operations there. Here’s the AWS page on CloudHSM.

They are already integrated with various software and APIs like Java JCA/JCE.

It’s being used to encrypt digital content, DRM, securing financial transactions (root of trust for PKI), db encryption, digital signatures for real estate transactions, mobile payments.

Back to Oleg. With the HSM, there’s some manual steps you need to do, Initialize the HSM, configure a server and generate server side certs, generate a client cert on each client, scp the public portion to the server to register it.

Normal client cert generation requires an IP, which in the cloud is lame. You can isntead use a generic client name and use the same one on all systems.

You put their LunaProvider,jar in your Java CLASSPATH and add the provider to java/security and you’re good to go.

Making a Luna HA array is very important of course. If you get two you can group them up.

Suggested architecture – they ahve to run in a VPC. “You want to put on Internet? Is crazy idea! Never!”

Crypto doesn’t solve your problem, it just moves it to another place. How do you get the secrets onto your instances? When your instance starts, you don’t want those creds in S3 or the AMI…

So at instance bootstrap, send a request to a server in an internal DC with IP, instance ID, public and local hostanmes, reservation ID, instance type… Validate using the API including instance start time, validate role, etc. and then pass it back. Check for dupes.  This isn’t perfect but what are ya gonna do?  You can assign a policy to a role and have an instance profile it uses.

He has written a Python tool to help automate this, you can get it at http://sf.net/p/lunamech.

1 Comment

Filed under Conferences, Security

Crosspost: How Bazaarvoice Weathered The AWS Storm

For regular agile admin readers, I wanted to point out the post I did on the Bazaarvoice engineering blog, How Bazaarvoice Weathered The AWS Storm, on how we have designed for resiliency to the point where we had zero end user facing downtime during last year’s AWS meltdown and Leapocalypse. It’s a bit late, I wrote it like in July and then the BV engineering blog kinda fell dormant (guy who ran it left, etc.) and we’re just getting it reinvigorated.  Anyway, go read the article and also watch that blog for more good stuff to come!

Leave a comment

Filed under Cloud, DevOps