Tag Archives: Cloud

ReInvent 2013: Day 2 Keynote

I didn’t cover the day 1 keynote, but fortunately it can be found here. The day 2 keynote was a lot more technical and interesting though. Here are my notes from it:

First, we began by talking about how AWS plans its projects.

Lots of updates every year!

Before any project is started, while teams are still in the brainstorming phase, a few key things are always done before any code is written:

  • Meeting minutes
  • FAQ
  • Figure out the UX

“2 Pizza Teams”: small autonomous teams that have roadmap ownership with decoupled launch schedules.

Customer collaboration

Get the functionality in the hands of customers as soon as possible. It may be feature limited, but it's in the hands of customers so that they can give feedback as soon as possible. Iterate, iterate, iterate based on that feedback. This is different from the old guard, where everything is engineering driven and unnecessarily complex.

Netflix platform….

Netflix is on stage and we're talking about the Netflix Cloud Prize and the enhancements to the different tools… looks pretty cool, and I'll need to check them out. There are 14 Chaos Monkey "tests" to run now instead of just 1 from before.

Cloud prize winners

Werner is back and is breaking down the different facets that AWS focuses on:

  • Performance: measure everything; put performance data in log files that can be mined.
  • Security
  • Reliability
  • Cost
  • Scalability

Ilya Sukhar, CEO of Parse, is on stage now (Parse is a platform for mobile apps).
- Parse Data: store data with about 5 lines of code instead of a bunch of boilerplate.
- Push notifications.

Parse started with 1 AWS instance
and has grown from 0 to 180,000 apps.

That's 180,000 collections in MongoDB; he showed the difference in performance before and after provisioned IOPS (PIOPS).

Security

IAM and IAM roles to set boundaries on who can access what.
How do you do this from a DB perspective?
You can have fine-grained access controls on DynamoDB instead of writing your own authorization code (see the sketch below).
Each data block is encrypted in Redshift.
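A side note from me: the fine-grained DynamoDB access is done with IAM policy conditions on the table's leading key. Here's a rough sketch of what that could look like with the Python SDK; the table name, policy name, and the federation variable are just illustrative, not from the keynote.

```python
# Sketch only: an IAM policy that limits DynamoDB access to items whose
# partition key matches the calling identity. Names/ARNs are made up.
import json
import boto3

policy_document = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["dynamodb:GetItem", "dynamodb:Query", "dynamodb:PutItem"],
        "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/UserData",
        "Condition": {
            # Only rows whose hash key equals the federated user's id are visible.
            "ForAllValues:StringEquals": {
                "dynamodb:LeadingKeys": ["${www.amazon.com:user_id}"]
            }
        }
    }]
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="UserScopedDynamoAccess",          # hypothetical name
    PolicyDocument=json.dumps(policy_document),
)
```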
Cost:
Talking about how customers are using Spot Instances to save money.

Scalability:
The WeTransfer use case: they take care of transferring large files.

Airbnb is on stage with Mike Curtis, VP of Engineering:
- 350k hosts around the world
- 4 million guests (Jan 2013)
- 9 million guests today

They use a whole host of AWS services:
- 1,000 EC2 instances
- millions of rows in RDS
- 50 TB of photos in S3

All of this is run by a 5-person ops team at Airbnb.

Helps devote resources to the real problem.

AirBnB in 2011

AirBnB in 2012

Dropcam came on stage after that to talk about how they use the AWS platform. Nothing too crazy, but interestingly, more inbound video is sent to Dropcam than to YouTube!

Dropcam

The keynote ended with an Amazon Kinesis demo (and a deadmau5 announcement for the replay party), which on the outside looks like a streaming API and different ways to process data on the backend. A prototype that streamed data from Twitter and performed analytics on it was shown to demonstrate the service.

Announcements

  • RDS for PostgreSQL
  • New I2 instance types for much better I/O performance
  • DynamoDB: global secondary indexes!!
  • Federation with SAML 2.0 for IAM
  • Amazon RDS: cross-region read replicas!
  • G2 instances for media- and video-intensive applications
  • New C3 instances with the fastest processors: 2.8 GHz Intel E5 v2
  • Amazon Kinesis: real-time processing, fully managed. It looks like this will help you solve scalability issues when you're trying to build real-time streaming applications. It integrates with storage and processing services (see the sketch below).
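To make the Kinesis bit concrete, here's a minimal sketch of what a producer looks like with the Python SDK; the stream name and event shape are my own placeholders, not from the demo.

```python
# Minimal sketch of pushing records into a Kinesis stream.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def publish(event):
    kinesis.put_record(
        StreamName="clickstream",            # hypothetical stream
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event["user_id"]),  # controls which shard the record lands on
    )

publish({"user_id": 42, "action": "page_view", "path": "/pricing"})
```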


In case you want to watch it, the day 2 keynote is here: http://www.youtube.com/watch?v=Waq8Y6s1Cjs

And also, the day 1 keynote: http://www.youtube.com/watch?v=8ISQbdZ7WWc

2 Comments

Filed under Cloud, Conferences

ReInvent 2013- Scaling on AWS for the First 10 Million Users

This was the first talk I went to at ReInvent, given by @simon_elisha, and it was a packed room. It was targeted towards developers going from inception of an app to growing it to 10 million users. Following are the notes I took…

– "We will need a bigger box" is the first issue when you start seeing traffic to an application. A single box is an anti-pattern because there's no failover. Move your DB off of the web server; you could use RDS or something similar too.

– SQL or NoSQL?
Not a binary decision; maybe use both? A blended approach can reduce technical debt. Maybe just start with SQL because it's familiar and there are clear patterns for scalability. NoSQL is great for super-low-latency apps, metadata data sets, fast lookups, and rapidly ingesting data.

So for 100 users…
You can get by using Route 53, ELB, and multiple web instances.

For 10,000 users…
– Use CloudFront to cache any static assets.
– Get your session state out of the web servers. Session state could be stored in DynamoDB because it's just simple, non-relational data (see the sketch below).
– It also might be time for ElastiCache now, which is just hosted Redis or Memcached.
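Here's a rough sketch of the session-state-in-DynamoDB idea with the Python SDK; the table name and key schema are my own assumptions, not from the talk.

```python
# Sketch: keep session state in DynamoDB instead of on the web servers.
import time
import uuid
import boto3

sessions = boto3.resource("dynamodb").Table("sessions")   # assumed table

def save_session(user_id, data, ttl_seconds=3600):
    session_id = str(uuid.uuid4())
    sessions.put_item(Item={
        "session_id": session_id,                        # partition key (assumed)
        "user_id": user_id,
        "data": data,
        "expires_at": int(time.time()) + ttl_seconds,    # pairs with a TTL attribute
    })
    return session_id

def load_session(session_id):
    return sessions.get_item(Key={"session_id": session_id}).get("Item")
```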

Auto Scaling…
Set min and max servers, running across multiple Availability Zones. AWS makes this really simple (sketch below).
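Something like this, sketched with the Python SDK; the launch configuration, group, and ELB names are placeholders.

```python
# Sketch of a min/max Auto Scaling group spread across multiple AZs.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-asg",
    LaunchConfigurationName="web-launch-config",    # assumed to exist already
    MinSize=2,                                      # always keep a floor running
    MaxSize=10,                                     # cap the spend
    AvailabilityZones=["us-east-1a", "us-east-1b"],
    LoadBalancerNames=["web-elb"],                  # the ELB from the earlier step
)
```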

If you end up at the 500k-user mark you probably really want:
– metrics and alarms
– automated builds and deploys
– centralized logging

Must-have metrics to collect:
– host level metrics
– aggregate level metrics
– log analysis
– external site performance

Use a product for this, because there are plenty available, and you can focus on what you’re really trying to accomplish.

Create tools to automate deployments so you save your time. Some of the ones you can use: Elastic Beanstalk and AWS OpsWorks are more for developers, while CloudFormation and raw EC2 are more for ops. The key is to be able to repeat those deploys quickly. You will probably need Puppet or Chef to manage the actual EC2 instances.

Now you probably need to redesign your app when you're at the million-user mark. Think about using a service-oriented architecture: loose coupling for the win instead of tight coupling. You can probably put a queue between two pieces (see the sketch below).
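For example, here's a sketch of the queue-between-two-pieces idea using SQS with the Python SDK; the queue name and message shape are made up.

```python
# Sketch: the web tier enqueues work, a separately scaled worker fleet drains it.
import json
import boto3

sqs = boto3.resource("sqs")
queue = sqs.get_queue_by_name(QueueName="image-resize-jobs")   # hypothetical queue

# Producer side (e.g. the web tier)
queue.send_message(MessageBody=json.dumps({"photo_id": 123, "size": "thumb"}))

# Consumer side (a worker that can scale independently of the web tier)
for message in queue.receive_messages(WaitTimeSeconds=20, MaxNumberOfMessages=10):
    job = json.loads(message.body)
    # ... do the resize work here ...
    message.delete()   # only delete once the work actually succeeded
```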

Key tip: don’t reinvent the wheel.

Example of what to do when you have a user uploading a picture to a site.

Simple Workflow Service (SWF)
– workers and deciders: provide orchestration for your code.

When your data tier starts to break down at 5-10 million users:
– Federation
Split by function or purpose.
Gotcha: you will have issues with join queries.
– Sharding (see the sketch below)
This works well for one table with billions of rows.
Gotcha: operationally confusing to manage.
– Shift to NoSQL
Sort of similar to federation.
Gotcha: a crazy architecture change. Use DynamoDB.
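To illustrate the sharding idea, here's a toy sketch of routing each user's rows to one of N databases by hashing the user ID; the connection strings are placeholders.

```python
# Sketch: pick a shard deterministically from the user id.
import hashlib

SHARDS = [
    "postgres://db-shard-0.internal/app",
    "postgres://db-shard-1.internal/app",
    "postgres://db-shard-2.internal/app",
    "postgres://db-shard-3.internal/app",
]

def shard_for(user_id):
    # Stable hash so the same user always lands on the same shard.
    digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# The gotcha from the talk: any query spanning users now has to fan out to every shard.
```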

Final Tips

Leave a comment

Filed under Cloud, Conferences

LASCON 2013 Report – Second Afternoon

I’m afraid I only got to one session in the afternoon, but I have some good interviews coming your way in exchange!

User Authentication For Winners!

I didn't get to attend, but I know that Karthik's talk on writing a user auth system was good; here are the slides. When we were at NI he had to write the login/password/reset system for our product, and we were aghast that there was no project out there to use; you just had to roll your own in an area with so many lurking security flaws. He talks about his journey and you should read it!

AWS CloudHSM And Why It Can Revolutionize Cloud

Oleg Gryb (@oleggryb), security architect at Intuit, and Todd Cignettei, Sr. Product Manager with AWS Security.

Oleg says: There are commonly held concerns about cloud security – key management, legal liability, data sovereignty and access, unknown security policies and processes…

CloudHSM makes objects in partitions not accessible by the cloud provider. It provides multiple layers of security.

[Ed. What is HSM?  I didn’t know and he didn’t say.  Here’s what Wikipedia says.]

Luckily, Todd gets up and tells us about the HSM, or Hardware Security Module. It’s a purpose built appliance designed to protect key material and perform secure cryptographic operations. The SafeNet Luna SA HSM has different roles – appliance administrator, security officer. It’s all super certified and if tampered with blows up the keys.

AWS is providing dedicated access to SafeNet Luna SA HSM appliances. They are physically in AWS datacenters and in your VPC. You control the keys; they manage the hardware but they can’t see your goodies. And you do your crypto operations there. Here’s the AWS page on CloudHSM.

They are already integrated with various software and APIs like Java JCA/JCE.

It’s being used to encrypt digital content, DRM, securing financial transactions (root of trust for PKI), db encryption, digital signatures for real estate transactions, mobile payments.

Back to Oleg. With the HSM, there are some manual steps you need to do: initialize the HSM, configure a server and generate server-side certs, generate a client cert on each client, and scp the public portion to the server to register it.

Normal client cert generation requires an IP, which in the cloud is lame. You can instead use a generic client name and use the same one on all systems.

You put their LunaProvider.jar in your Java CLASSPATH, add the provider to java.security, and you're good to go.

Making a Luna HA array is very important of course. If you get two you can group them up.

Suggested architecture: they have to run in a VPC. "You want to put on Internet? Is crazy idea! Never!"

Crypto doesn’t solve your problem, it just moves it to another place. How do you get the secrets onto your instances? When your instance starts, you don’t want those creds in S3 or the AMI…

So at instance bootstrap, send a request to a server in an internal DC with the IP, instance ID, public and local hostnames, reservation ID, instance type… Validate using the API, including instance start time, validate the role, etc., and then pass the secrets back. Check for dupes. This isn't perfect, but what are ya gonna do? You can assign a policy to a role and have an instance profile that uses it.
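Here's my own loose sketch of that bootstrap flow in Python, not Oleg's actual implementation; the secrets service endpoint and response shape are invented for illustration.

```python
# Sketch: at startup, the instance sends its identity to an internal secrets
# service, which validates it against the EC2 API before handing anything back.
import requests

METADATA = "http://169.254.169.254/latest/meta-data"

def fetch_credentials():
    instance_id = requests.get(f"{METADATA}/instance-id", timeout=2).text
    local_hostname = requests.get(f"{METADATA}/local-hostname", timeout=2).text

    # Server side should validate instance ID, start time, role, etc. via the
    # EC2 API and reject duplicate requests before returning secrets.
    resp = requests.post(
        "https://secrets.internal.example.com/bootstrap",   # hypothetical endpoint
        json={"instance_id": instance_id, "local_hostname": local_hostname},
        timeout=5,
    )
    resp.raise_for_status()
    return resp.json()
```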

He has written a Python tool to help automate this, you can get it at http://sf.net/p/lunamech.

1 Comment

Filed under Conferences, Security

Crosspost: How Bazaarvoice Weathered The AWS Storm

For regular agile admin readers, I wanted to point out the post I did on the Bazaarvoice engineering blog, How Bazaarvoice Weathered The AWS Storm, on how we have designed for resiliency to the point where we had zero end user facing downtime during last year’s AWS meltdown and Leapocalypse. It’s a bit late, I wrote it like in July and then the BV engineering blog kinda fell dormant (guy who ran it left, etc.) and we’re just getting it reinvigorated.  Anyway, go read the article and also watch that blog for more good stuff to come!

Leave a comment

Filed under Cloud, DevOps

Awesome Austin Events!

Besides the “big one,” DevOpsDays Austin 2013, there’s a bunch of great events going on in Austin for techies.

The Agile Austin DevOps SIG meets every last Wednesday over lunch at Bazaarvoice; lunch is provided.  This month’s meeting on January 30 is Breaking the Barriers.

The Austin Cloud User Group meets every third Tuesday in the evening at Pervasive; dinner is provided. This month’s meeting is January 15, sponsored by Canonical, and there is a talk on Openstack Quantum, the network virtualization platform.

South by Southwest Interactive is here of course on March 8-12.

Hip security event BSides Austin is on March 21-22.

Data Day Austin is on January 29.

Texas Linux Fest will be May 31 – June 1.

It’s never been a better time to be a techie in Austin!

3 Comments

Filed under Cloud, Conferences, DevOps

AppSec USA 2012 Is Here (in Austin)!

AppSec USA 2012, the big OWASP security convention, is here in Austin this year!  And the agile admin’s own @wickett is coordinating it.

“Why do I care if I’m not a security wonk,” you ask? Well, guess what, the security world is waking up and smelling the coffee – this isn’t like security conventions were even just a couple years ago.  There’s a Cloud track and a Rugged DevOps track.

We have like 20 people going from Bazaarvoice. It’s two days, Thursday and Friday (yes, tomorrow – I don’t know why James didn’t post this earlier, sorry) and just $500. So it’s cheap and low impact.

And who’s speaking?  Well, how about Douglas Crockford, inventor of JSON?  And Gene Kim, author of Visible Ops?  That’s not the usual infosec crowd, is it?  Also Michael Howard from Microsoft, Josh Corman from Akamai, a trio of Twitter engineers, Nick Galbreath (formerly of Etsy), Jason Chan from Netflix, Brendan Eich from Mozilla… This is a star-studded techie event that you want to be at!

I’ll be there and will report in…

1 Comment

Filed under Conferences, DevOps, Security

Awesome Austin Tech Meetups

Austin is such a great place to be a techie.

  • The Austin Cloud User Group (I help run it) meets every third Tuesday evening, and we've been having 50+ people come in to check out some awesome stuff.  Next meeting Feb 21 on Puppet, hosted by Pervasive.
  • The Agile Austin DevOps SIG meets fourth Wednesdays, we had our meeting today and had about 20 attendees, hosted by CA/Hyperformix. I also help run that one.
  • The Austin Big Data User Group is back meeting – next one is tomorrow night! Hosted by Bazaarvoice.
  • The Austin OWASP chapter is one of the biggest and most active in the country, and also meets monthly, hosted by National Instruments. Fellow Agile Admin James Wickett helps run that group.
  • The Cloud Security Alliance, Austin chapter is just getting started but has a lot of momentum and we’re coordinating with them from the ACUG and OWASP sides. Their first meeting is tonight, come out!

There are others but those are my favorites and therefore the coolest by definition.

There’s also cool events coming up you should keep an eye out for.

  • DevOpsDays Austin, Apr 2-3, hosted by National Instruments, and this’ll be big! Patrick Debois and the whole crew of DevOps illuminati will be here. Now taking sponsors and speakers! Register now!
  • AppSec USA 2012, Oct 23-26 – Austin OWASP kicks so much ass with LASCON that the annual OWASP convention is coming here to Austin this year!
  • South by Southwest Interactive, March 9-13 – quickly becoming the Web conference in the flyover states :-). Lots of stuff happens during it, like:
    • Austin Cloud/DevOps party courtesy GeekAustin (ACUG is a community sponsor). March 10.
    • CloudCamp – Dave Nielsen will be bringing a CloudCamp to Austin again this year during SXSWi. Details TBD, sounding like Mar 11 maybe.
  • The Cloud Security Alliance and ACUG are hoping to put together an Austin cloud conference, too. Maybe early 2013.

Leave a comment

Filed under Cloud, Conferences, DevOps

Cloud Security Is No Oxymoron

Fellow Agile Admin, James Wickett, just wrote an article for Control Engineering about cloud security. Cloud security is kinda funny; it's the biggest FUD attractor and "concern" of folks who don't really know how their on-premise security works either.  Anyway, read the article! We're putting some real security work into our cloud products at NI; can't speak for others but…

In related news, the excellent Cloud Security Alliance is starting an Austin chapter! Go check it out. The CSA got a start by being an organization that actually issues effective guidance on cloud security instead of being another vendor-haven or FUD collector.

Also see my older LASCON 2010 presentation on “Why The Cloud Is More Secure Than Your Existing Systems” (now a year later, the trade rags are stealing that headline for their own spam…  Woot!)

Leave a comment

Filed under Cloud, Security

Why Your Testing Is Lying To You

As a follow-on to Why Your Monitoring Is Lying To You.  How is it that you can have an application go through a whole test phase, with two-day-long load tests, and have surprising errors when you go to production?  Well, here’s how…  The same application I describe in the case study part of the monitoring article slipped through testing as well and almost went live with some issues. How, oh how could this happen…

I Didn’t See Any Errors!

Our developers quite reasonably said “But we’ve been developing and using this app in dev and test for months and haven’t seen this problem!” But consider the effects at work in But, You See, Your Other Customers Are Dumber Than We Are. There are a variety of levels of effect that prevent you from seeing intermittent problems, and confirmation bias ends up taking care of the rest.

The only fix here is rigor. If you hit your application and test and it errors, you can’t just ignore it. “I hit reload, it worked.  Maybe they were redeploying. On with life!” Maybe it’s your layer, maybe it’s another layer, it doesn’t matter, you have to log that as a bug and follow up and not just cancel the bug as “not reproducible” if you don’t see it yourself in 5 minutes of trying.  Devs sometimes get frustrated with us when we won’t let up on occurrences of transient errors, but if you don’t know why they happened and haven’t done anything to fix it, then it’s just a matter of time before it happens again, right?

We have a strict policy that every error is a bug, and if the error wasn’t detected it is multiple bugs – a bug with the monitoring, a bug with the testing, etc. If there was an error but “you don’t know why” – you aren’t logging enough or don’t have appropriate tools in place, and THAT’s a bug.

Our Load Test/Automated Tests Didn’t See Any Errors!

I’ll be honest, we don’t have much in the way of automated testing in place. So there’s that.  But we have long load tests we run.  “If there are intermittent failures they would have turned up over a two day load test right?” Well, not so fast. How confident are you this error is visible to and detected by your load test?  I have seen MANY load test results in my lifetime where someone was happily measuring the response time of what turned out to be 500 errors.  “Man, my app is a lot faster this time!  The numbers look great! Wait… It’s not even deployed. I hit it manually and I get a Tomcat page.”

Often we build deliberate "lies" into our software. We throw "pretty" error pages that aren't basic errors. We are trying not to leak information to customers, so we bowdlerize failures on the front end. We retry maniacally in the face of failed connections, but don't log it. We have to use constrained sets of return codes because the client consuming our services (like, say, Silverlight) is lobotomized and doesn't savvy HTTP 401 or other such fancy schmancy codes.

Be careful that load tests and automated tests are correctly interpreting responses.  Look at your responses in Fiddler – we had what looked to the eye to be a 401 page that was actually passing back a 200 HTTP return code.
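As an illustration (not our actual harness), a check like this makes a pretty error page fail loudly instead of being timed as a "fast" success; the URL and the tell-tale strings are placeholders.

```python
# Sketch: validate the response content, not just that something came back.
import requests

def check(url):
    resp = requests.get(url, timeout=10)
    assert resp.status_code == 200, f"got HTTP {resp.status_code} from {url}"
    assert "Sign in" not in resp.text, "login page returned with a 200"        # example tell
    assert "<title>Error" not in resp.text, "pretty error page returned with a 200"
    return resp.elapsed.total_seconds()   # only now is the timing worth recording

print(check("https://example.com/app/health"))
```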

The best fix to this is test-driven development.  Run the tests first so you see them fail, then write the code so you see them work!  Tests are code, and if you just write them against your already-working code then you're not really sure they'll fail when something's bad!

Fault Testing

Also, you need to perform positive and negative fault testing. Test failures end to end, including monitoring and logging and scaling and other management stuff. At the far end of this you have the cool if a little crazy Chaos Monkey.  Most of us aren’t ready or willing to jack up our production systems regularly, but you should at least do it in test and verify both that things work when they should and that they fail and you get proper notification and information if they do.

Try this.  Have someone Chaos Monkey you by turning off something random – a database, making a file system read only, a back end Web service call.  If you have redundancy built in to counter this, great, try it with one and see the failover, but then have them break “all of it” to provoke a failure.  Do you see the failure and get alerted? More importantly, do you have enough information to tell what they broke?  “One of the four databases we connect to” is NOT an adequate answer. Have someone break it, send you all the available logs and info, and if you can’t immediately pinpoint the problem, fix that.

How Complex Systems Fail, Invisibly

In the end, a lot of this boils down to How Complex Systems Fail. You can have failures at multiple levels – and not really failures, just assumptions – that stack on top of each other to both generate failures and prevent you from easily detecting those failures.

Also consider that you should be able to see those “short of failure” errors.  If you’re failing over, or retrying, or whatnot – well it’s great that you’re not down, but don’t you think you should know if you’re having to fail over 100x more this week?  Log it and turn it into a metric. On our corporate Web site, there’s hundreds of thousands of Web pages, so a certain level of 404s is expected.  We don’t alert anyone on a 404.  But we do metricize it and trend it and take notice if the number spikes up (or down – where’d all that bad content go?).
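For example, something like this statsd-style counter (metric names and host are placeholders) turns those non-fatal events into metrics you can trend and alert on when they spike.

```python
# Sketch: count "short of failure" events so a 100x jump in failovers is visible.
import statsd

metrics = statsd.StatsClient("statsd.internal", 8125, prefix="web")

def record_response(status_code, failed_over=False):
    metrics.incr(f"http.{status_code}")    # e.g. web.http.404: trended, not alerted
    if failed_over:
        metrics.incr("db.failover")        # a spike here means trouble even if users are fine
```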

Wholesale failures are easy to detect and mitigate.  It's the mini-failures, or things that someone would argue are not failures, on a given level that line up with the same kinds of things on all the other layers, and those lined-up holes start letting problems slip through.

http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-monkey.html

2 Comments

Filed under DevOps

DevOps Tip: Design for Failure

We have had some interesting  internal discussions lately about application reliability.  It’s probably not a surprise to many of you that the cloud is unreliable, on a small scale that is.  Sure, on the large scale you use the cloud to make highly resilient environments. But a certain percentage of calls to the cloud fail – whether it’s Amazon’s or Azure’s management APIs, or hitting Amazon or Azure storage, or going through an Amazon ELB, or hitting SQL Azure. Heck, on Azure they plainly state that they will pull your instances out from under you, restart them, and move them to other hardware without notice. If you’re running 2 or more, they won’t do them all at the same time – so again, you get large scale resilience but at the cost of some small scale unreliability.

The problem is that people sometimes start from the assumption that their application is always working fine, unless you can prove otherwise. This is fundamentally the wrong assumption. You have to assume your application has problems, unless you can prove it doesn't.  This changes your approach to testing, logging, and monitoring profoundly.

Take the all too common example of an app with intermittent failures. Let’s say it’s as bad as 1 in 20 times.  1 in 20 times a customer hits your application, it fails somehow. It is very likely you don’t know this. Because by default, you don’t know it. How would you? Well, obviously, by monitoring, logging, and testing. I’ll follow this up with a series of posts describing how and why those often fail to detect problems. The short form is that “ha ha, no they don’t.”

Here’s a bad story I’ll tell on myself.  Here at NI, we rolled out a PDF instant quote generation widget.  We have over 250 apps on ni.com, so we don’t put synthetic monitors on all of them (remind me to tell you about the time early at NI that I discovered synthetic monitoring was producing 30% of our site load). Apparently the logging wasn’t all that good either, it didn’t trigger any of our log monitoring heuristics. Anyway, come to find out later on that the app was failing in production about 75% of the time. This is an application on a “monitored” site, where a developer and a tester signed off on the app. Whoops.  If you do a cursory test and assume it’ll work – well you know what they say about assumptions – they make an ass out of “you” and “mption.” 🙂

Anyway, to me part of the good part about the cloud is that they come out and say “we’re going to fail 2-5% of the time, code for it.” Because before the cloud, there were failures all the time too, but people managed to delude themselves into thinking there weren’t; that an application (even a complex Internet-based application) should just work, hit after hit, day after day, on into the future. So by having handling failure built in – like a lesser version of the Chaos Monkey – you’re not really just making your app cloud friendly, you’re making it better.

Real engineers who make cars and whatnot know better. That’s why there was a big ol’ maintenance hatch on the side of the Hubble Space Telescope; if any of you have watched the Hubble 3D IMAX film you get to see them performing maintenance on it.  If a billion dollar telescope in fricking space has problems and needs to be maintainable, so does your little Web app.

But I see so many apps that don’t really take failure into account.  Oh, maybe they retry some connections if they fail. But what if you get to the end of your retries? What if the response you get back is an unexpected HTTP code or unexpected payload? You’d think in the age of try/catch and easily integrated logging frameworks you wouldn’t see this any more, but I see it all the time. It’s a combination of not realizing that failure is ubiquitous, and not thinking about the impact (especially the customer facing impact) of that failure.
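As a sketch of what I mean (endpoint and names are illustrative, not any particular app of ours): retries are fine, but the exhausted-retries and unexpected-response paths need to be explicit and logged.

```python
# Sketch: retry with backoff, but decide and log what happens when retries run
# out or the response isn't what you expected.
import logging
import time
import requests

log = logging.getLogger("payments-client")

def call_service(payload, attempts=3, backoff=0.5):
    for attempt in range(1, attempts + 1):
        try:
            resp = requests.post("https://payments.internal/charge", json=payload, timeout=5)
            if resp.status_code == 200:
                return resp.json()
            # Unexpected code: don't silently retry forever, make it visible.
            log.warning("charge attempt %d got HTTP %d", attempt, resp.status_code)
        except requests.RequestException as exc:
            log.warning("charge attempt %d failed: %s", attempt, exc)
        time.sleep(backoff * attempt)
    # Out of retries: surface it to the caller (and the customer-facing path) explicitly.
    raise RuntimeError(f"payment service unavailable after {attempts} attempts")
```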

This is one of the (many) great DevOps learning experiences – Ops helping teach Devs all the things that can go wrong that don’t really go wrong much in a “frictionless” lab environment.  “So, what do you do if your hard drive is suddenly not there?” (Common with Amazon EBS failures.)  “What do you do if you took data off that queue and then your instance restarts before you put it into the database?” (Hopefully a transaction.) “What do you do if you can’t make that network connection, are you retrying every 5 ms and then filling up the system’s TCP connections?” (True story.) “Hey, I’m sure your app is pure as the driven snow right now, but is it always going to work the same when the PaaS vendor changes the OS version under you?”

In all circumstances, you should

  • Plan for failure (understand failure modes, retry, design for it)
  • Detect failure (monitor, log, etc.)
  • Plan for and detect failure of your schemes to plan for and detect failure!

We do some security threat modeling here. I wonder if there’s not a lightweight methodology like that which could be readily adapted for reliability modeling of apps.  Seems like something someone would have done… But a simple one, not like lame complicated risk matrices. I’ll have to research that.

Leave a comment

Filed under DevOps