Category Archives: Cloud

Cloud computing and all its permutations.

by Ernest Mueller | March 25, 2014 · 3:22 pm

Google Cloud Update

We had a little get-together here in Austin today, sponsored by MomentumSI and hosted at Capital Factory (thanks to both!), to view the Google Cloud Platform newest product announcement webcast. About 24 local engineers showed up to watch.

You can view the whole thing yourself here, or just read my notes from the event.

Cloud Is Hard

Their thesis statement was that cloud, while cool, is still too hard for many people, hindering adoption or slowing innovation. So they’ve worked on making it easier.

Cost

Cost calculation is super complex (reserve, on demand, etc.). They talk about “other industry standard clouds” by which they mean Amazon Web Services. They note the drawbacks to reserved instances, which I am all totally in agreement on (see my earlier article Why Amazon Reserve Instances Torment Me for more on that). Specifically they note that reservations constrain your design choices, which is one of the great reasons to go to the cloud in the first place – Amen, brother!

Though cloud prices have been dropping 6-8% a year, hardware’s been dropping 20-30%. Why is Moore’s Law not translating into more sweet green in our pockets? It should, they contend. Thus, they are announcing on demand price drops:

GCE 32% price drop
Storage is now .026 cents/GB for any use
.02 c/GB for reduced durability storage
bigquery 85% reduction
can now purchase predictable throughput

Introducing sustained use discounts – no pre-plan or reserving ahead of time, instead prices automatically drop as VM usage is sustained over 25% of the month and then progressively from there. 100% use is a 53% discount over current (remember that includes the new 32% reduction, so another 21% from current for continued use). With linear machine cost scaling, makes it simple(r) to predict and calculate your costs.

Other Tradeoffs

Current cloud (hint: AWS) forces other tradeoffs: time to market vs scalability, flexibility (iaas) vs automatic management (paas), big data vs realtime data analysis.

But first, we interrupt our messaging to talk about other random new features based on customer feedback. To wit:

SuSE/Red Hat support
Windows Server 2008 R2 (preview) support
Cloud DNS service, accessible via API and console

The features are nice but even nicer was that they implemented these based on customer feedback, which means they consider this a real product with real customers and not just a fun tech thing for their own ends (which to be fair 80% of Google’s offerings are, and it can be hard to tell the difference).

Time to Market vs Scalability

So on scaling… You need deployment! Troubleshooting! Use tools you know!
They have a new “gcloud” command line tool
“gcloud init” pulls down the app via git, you can just edit, git commit, git push
They have a build service integrated – it spins up a jenkins/maven and builds, deploys – you can see release status in the console.
There’s also a new unified logs viewer with basic searching – like Splunk junior, with one cool dev feature. Click on the code in the stack trace and you’re put directly into the code in the console’s source view. Fix and commit, it auto-builds, bam you’re fixed.

IaaS vs PaaS

A new halfway state – “managed VMs.” It’s the normal PaaS, but in the config, you can tell it things to apt-get install onto the instances, so you can have more third party software than the PaaS previously allowed.
Also, you can “enable debugging” on an instance and then log in interactively.

Big Data vs Realtime Data Analysis

They’ve upped BigQuery to have 100k rows/sec ingest.
Example Demo: smart monitoring of 60 events/hour from 400k glen canyon power meters (17bn events/mo), with about 128k records. They did a visualization that is updating in near real time showing all those meters geolocated and you can go click on them to get realtime data.
He showed the complex BigQuery “bigjoin” to filter by meter lat/long from sep table and then by quartile across whole population. “Doing this in NoSQL would be impossible or very slow.”

They will be doing a Google Cloud roadshow soon – see cloud.google.com/roadshow – it looks like Austin will be on the list of cities!

Analysis

The good thing about getting a bunch of techies together to view this was the discussion afterwards. The general sentiment was that:

1. The cost drops are nice and the approach to reserve/sustained use instances is much better. The reserve instance scheme is one of the worst things about AWS and if this drives them to adopt the same model, hooray!

2. The other new features (managed VMs, gcloud) are definitely nice. They are focusing on dev friendliness in their discussion but it’s a lot less clear how to operate this. If you’re really trying to stitch together a bunch of micro-services there’s not a lot of great support for that. They talk about using their PaaS and say “of course, if you use our PaaS you don’t need to carry a pager! You’d only need to do that if you’re doing IaaS and maintaining your own OSes.” That is dangerously naive and really made the whole group skittish. Most people there have done “play” things in Google’s cloud but are reticent to put mission critical items there, and this section of the presentation didn’t do a lot to improve that.

3. The BigQuery/realtime demo was impressive and multiple people would like to kick the tires on it.

Overall – it was a little light, but it was a keynote; the new features/pricing are all good; this shows more Google commitment to their cloud as a product but actual concerns still linger about maturity and suitability for realistically complex revenue-generating production applications.

Leave a comment

Filed under Cloud

Tagged as app engine, bigdata, Cloud, gae, gce, google

by Ernest Mueller | March 16, 2014 · 10:20 pm

Getting Started On AWS – Securely

So you’ve decided to start playing around with Amazon Web Services and are worried about doing so securely. Here’s the basics to do when you set up to ensure you’re on sound footing. In fact, I’m going to use the free tier of all these items for this walkthrough, so feel free and do it yourself if you’ve never taken the plunge into AWS!

Account Setup

Signing up for Amazon Web Services is as simple as going to aws.amazon.com and clicking the “Sign Up” button.

It will want a password – choose a strong one, obviously – and some credit card info for if you exceed the free tier. It’ll want a phone number for a robot to call you – they show you a PIN, the robot calls you, you give the robot the PIN, and you’re good to go.

Multifactor Authentication Setup For Your Account

Next, set up multifactor authentication (MFA) for your Amazon account. You should see an option like this to go directly there immediately post signup, or you can pick the “IAM” section out of the main Amazon console.

When you go to the IAM console you’ll see two options under Security Status to turn on – Root Account MFA and Password Policy. I won’t talk much about the password policy except to say “go turn it on and check all the boxes to ensure strong passwords.” To turn on MFA, you need some kind of MFA device. The Amazon docs to walk you through the process are here. Unless you have a Gemalto hardware token already your best best is to just download Google Authenticator (GA) onto your iPhone/Android from the relevant app store (other choices here).

Once you’ve installed Google Authenticator, click on “Manage MFA Device” and choose virtual; it’ll show you a QR code that GA can scan to do the setup. Then you enter two tokens in a row from GA and it’s hooked up to the account. Now, to log into the console you need both your password and a MFA token. (You can also use GA for Dropbox, Evernote, Gmail, WordPress, etc. and is a good safeguard against the inevitable password losses these companies sometimes have.) Of course once you do this you need to be careful – if you lose the phone or device you can’t get in!

Now make sure and save all your credentials in a password vault. I prefer Keepass Password Safe along with MiniKeePass on the iPhone. Besides the password, you should go to your Security Credentials (off a popdown form your name in the top right) and store your AWS Account ID, Canonical User ID (used for S3), and any access keys or X.509 certificates you make – but it’s better not to make these for the main account, just for IAM accounts. Proceed onward to hear more about that!

Identity and Access Management Setup

All right – now your main login is secure. But you want to take another step past this, which is setting up an Identity and Access Management (IAM) account. IAM used to be a lot less robustly supported in AWS but they’ve gotten it to be pretty pervasive across all their services now. Here’s the grid of what services support IAM and how. You can think of this as the cloud analogy to UNIX’s “secure your root account, but still you shouldn’t log in as it but as a more restricted user.”

First, you have to set up a group. On the IAM dashboard click on the big ol’ “Create A New Group Of Users” button in the yellow box at the top.

Then make a group, call it “Admins” for the sake of argument.

Choose the “Administrator Access” policy template. If you know what you’re doing, you can change this up extensively.

Add a username for yourself (and whatever other people or entities you want to have full admin access, ideally not a long list) to the group.

For each user it’ll give an Access Key ID and Secret Access Key – these are the private credentials for that user, they should take them and put them in whatever password vault they use.

If you want to use that account to log into the console – and for this first one, you probably do – then once this is complete you go into Users, select the user, and under Sign-In Credentials it will say Password: No. Click Manage Password and set a password for that user; they can then log into the custom IAM login URL shown on the front page of the IAM dashboard (it’ll put it in the file when you Download Credentials, too).

MFA Setup for IAM

Just as with the main user, you can (and should) also set up MFA on IAM users. It’s the same process as with the main account so I won’t belabor it.

After you set up your IAM user and its MFA, you shouldn’t log in using your main AWS account credentials to do work – only if you need the enhanced access to mess with account credentials. Log out and then back in with your IAM account to proceed. If you want to take it a step farther and make an even less privileged account without admin rights, which you can use for everyday tasks like logging in and looking at state or just starting/stopping instances but not manipulating more sensitive functions, you can do that too.

More IAM

Of course if you are looking to manage multiple people, or separate apps that have access to your account (many SaaS solutions that integrate with your Amazon account will ask for IAM credentials with specific access), you can set up more groups with less access and have those entities use those. In general I’d suggest using a group+user and not just a user (different SaaS monitoring services recommend different approaches, but I think a plain user is less flexible). You can also get fancy with roles (used for app access from your instances) and identity providers. Remember the principle of least privilege – give things only as much access as they need, so that if those credentials are compromised there’s a limit to what they can do. There’s an AWS IAM best practices guide with more tips.

Turn On Accounting

Security folks know that nothing’s complete without the ability to audit it. Well, you can turn on logging of AWS security events using CloudTrail, the AWS logging service. This will basically dump IAM (and other Amazon API events) to a JSON file in S3. This is a whole can of whupass unto itself, but the short form is to follow this guide to set up your trail, making sure to say Yes to “Include global services?”.

You can also go into S3 and set your bucket (properties.. lifecycle) to expire (archive, delete, etc.) the logs after a certain time.

Then you can do something like set up SumoLogic to watch it and review/alert on your logs. If you want to try that, the short HOWTO is:

Sign up for Sumo free trial (need a non-gmail email account)
Add a sumo IAM group with permissions to get to your CloudTrail S3 bucket (there’s suggested JSON with the exact settings in the Sumo help) and a user in that group
Add a hosted collector
Add a S3 source to that collector, point it at your bucket, give it the sumo user’s AWS creds
Your data’s going to come in in big clumps of JSON though, which you can parse with some pain. Hint, your searches chould look like:

* | json “userIdentity.userName”, “eventTime”, “eventName” as username, time, action | sort by time

They also have an app specific to CloudTrail; you have to contact Sumo support to get it turned on though.

Network Setup

All of this, of course, is about access at the Amazon account and API level. For your actual instances, you’ll want to set up secure network access and then manage the SSH keys you use to log into them.

VPCs used to be a limited option, and mostly people just used security groups. Nowadays, VPCs are standard and an expected part of your setup. They’re like a private virtual network. When you create your account, it’ll actually create one default one for you automatically. You can see it by going to the VPC Console. This default VPC, though, is set up for convenience and back compatibility – instances you launch into it will get a public IP address, which may not be what you want, and the default security group allows all outbound traffic and all traffic from within the security group.

You should consider starting one instance this way and then using it as a bastion host to gateway into your other instances, which shouldn’t have public IPs unless you really want them to be publicly facing. It’s hard to prescribe other specifics here because it really depends on what you plan to do. At a bare minimum you need to add an inbound SSH rule from your location to the security group so you can log into your first instance when you start it below. (They have a neat new “My IP” choice that’ll detect where you’re coming from. Of course that won’t work when you drive to the Starbucks…) Consider removing the rule allowing all traffic from within a security group – even within a group it’s more secure to allow specific protocols instead of “everything from everywhere.”

Ideally, you’d set up a VPN to the VPC’s Internet Gateway – but this requires expensive hardware or setting up your own server and is way out of scope here.

System Setup

Then, of course, you finally get to starting instances! Each instance will start with a default root ssh key. Things you want to do here are:

You will want to use personal SSH keys to log in. Generate a public-private pair (using putty-keygen or whatever’s best for your client). This doc tells you how to upload the public key to AWS. This will start any instance with that public key as root (or ec2-user or other non-root username depending on the distro you’re using), so this is a pretty sensitive root credential. You can add more users and distribute more keys to the instances later either via your favorite CM tool, by using AWS OpsWorks which is based on Chef, or however else.
Start an instance in the EC2 section of the console into your default VPC/sec group/etc. using your uploaded public key. I’m not going to do a detailed HOWTO on this because it’s pretty well-trod ground. If you don’t have an opinion, start with Amazon Linux.
Log in using your private key. Check the SSH fingerprint the first time you log in; it’ll be in the console output of the instance which you can see through the console (Actions… Get System Log) or an ec2-get-console-output command line.
Patch the instance. The AMI you’re using may be more or less super old and a “sudo yum update” or similar is a really good idea.
Turn off passworded login if it’s not already, and the ability to directly log in as root if it’s not already.
If you’d like this to be your bastion host, then add other security groups for other instances to go to – don’t allow inbound SSH to them from anywhere, just from this security group.

Automate

The final step to doing all this securely is to not be making manual changes. Via the CLI or API you can automate a lot of this, but even better is using CloudFormation, maybe in conjunction with OpsWorks or another CM tool, to define in a readable config how you want your system to behave (VPCs, security groups, etc.) and instantiate off that. Nothing’s more auditable than a system that’s built automatically from a spec! You can cheat a little and set up your VPCs and all the way you want and use their CloudFormer tool to generate a CloudFormation template from your running system. Then you can edit that and tear down/restart from scratch.

The more you automate, the tighter you can make the security controls without inconveniencing yourself. A trivial example is you could have a script that uses the CLI to change the security group to allow SSH from wherever you are right now, and then close it afterwards – so there’s no SSH access from a location unless you allow it! In the same vein, allowing “all access” within a security group or from one group to another is usually done out of laziness and flexibility for manual changes – if you automate such that if you add a new set of servers, they also configure their connectivity needs specifically, you’re more secure. For defense in depth you could automatically configure the onboard firewalls on the boxes to mimic the security groups, just read the security group settings and transform into similar iptables (or whatever) settings. Voila, a HIPS. Pump those logs into Sumo too.

You could add tripwire or OSSEC for change detection, but also if you run your servers from trusted images and recreate them frequently, you can very much reduce the risk of compromise.

That’s my quick HOWTO on how to get servers running in a mode that’s likely way more secure than the average enterprise server unless you work for a bank or something, inside a couple hours. MFA, key based auth, all the network separation you could want, separation of privileges…

Leave a comment

Filed under Cloud, Security

Tagged as amazon, amazon web services, authentication, aws, Cloud, IAM, login, MFA, multifactor, Security

by Ernest Mueller | February 26, 2014 · 5:59 pm

Special CloudAustin SXSW Edition 3/6

There’s a special early CloudAustin user group this month on Thursday, March 6 out at Rackspace. We’re having some folks from West Coast startup Stormpath (http://stormpath.com/), API-driven user and group management for developers come and give two talks:

Cloud Marketing 101: How to Market Your Cloud Product

You pour blood, sweat and tears into your API, open source and weekend projects – let’s make sure they get the attention they deserve! We’ll go through real-world examples of tactics developers can do to attract attention to their work. Beyond growth hacking and that first post to Hacker News, we’ll look at high-value marketing maneuvers that will drive usage, but won’t make you feel like a dirty huckster.

To Infinity and Beyond! Scaling Your Stack with Service Oriented Architecture

Abstract: Service Oriented Architecture is a proven design pattern which allows you to simplify your codebase, seamlessly scale your service, reduce engineering frustrations — and even helps lessen hosting costs. Come learn what SOA is, why it’s useful, and take a look at an in-depth technical overview of SOA, and how it can help your organization. Delight your engineers (and business people!) by building your product on top of simple, REST API services.

Leave a comment

Filed under Cloud, Conferences

Tagged as cloudaustin

by karthequian | November 20, 2013 · 5:11 pm

ReInvent – Fireside Chat: Part 1

One of the interesting sessions at ReInvent was a fireside chat with Werner Vogels., where CEO’s or CTO’s of different companies/startups who use AWS talked about their applications/platforms and what they liked and wanted form AWS. It was a 3 part series with different folks, and I was able to attend the 1st one, but I’m guessing videos are available for the others online. Interesting session, giving the audience a window into the way C level people think about problems and solutions…

First up, the CTO of mongodb…

Lots of people use mongo to store things like user profiles etc for their applications. Mongo performance has gotten a lot better because of ssd’s

Recently funded 150 million, and wanting to build out a lot of tools to be able to administer mongo better.

Apparently being a mongodb dba is a really high paying job these days!

User roles may be available in mongo next year to add more security.

Werner and Eliot want to work together to bring a hosted version of mongo like RDS.

Next up twilio’s Jeff Lawson

Jeff is ex amazon.

Software people want building blocks and not some crazy monolithic thing to solve a problem. Telecom had this issue, and that is why I started Twilio.

Everyone is agile! We don’t have answers up front, but we figure out these answers as we go.

Started with voice, then moved to SMS followed by a global presence. Most customers of ours wanted something that didn’t want boundaries and just wanted an API to communicate with their customers.

Werner: It’s hard to run an API business. Tell us more…
Lawson: It is really hard. Apis are kinda like webapps when it comes to scaling. REST helps a lot from this perspective. Multi tenancy issues gets amplified when you have an API business.

Twilio apparently deploys 20 times a day. Aws really helps with deployment because you can bring brand new environments that look exactly like prod and then tear it down when things aren’t needed.

When it comes to api’s, we write the documentation first and show our customers first before actually implementing the API. Then iterate iterate iterate on the development.

Jeff asks: Make it easier to make vpc up and running.

Next up: Valentino with adroll (realtime bidding)

There’s a data collection pipe which gets like 20 tb of data everyday.

Latency is king: Typically latency is like 50ms and 100ms. This is still a lot for us. I wish we had more transparency when it comes to latency inside aws and otherwise…

Why dynamo db? Didn’t find something simple at the time, and it was nice to be able to scale something without having to worry about it. We had 0 ops people at the time to work on scaling at the time.

Read write rates: 80k reads per second (not consistent), 40k writes per second.

Why erlang? You’re a python god.
I started working on Python with the twisted framework. But I realized that Python didn’t fit our use case well; the twisted system worked just as well but it would be complicated to manage it and needed a bit of hacks..

Today it would be hard to pick between erlang and go….

Leave a comment

Filed under Cloud, Conferences

Tagged as 2013, amazon, aws, Cloud, conference, reinvent

by karthequian | November 20, 2013 · 1:34 am

ReInvent 2013: Day 2 Keynote

I didn’t cover the day 1 keynote, but fortunately it can be found here. The day 2 keynote was a lot more technical and interesting though. Here are my notes from it:

First, we began by talking about how aws plans its projects.

Lots of updates every year!

Before any project is started, and teams are in the brainstorming phase. A few key things are always done.

Meeting minutes
FAQ
Figure out the ux
Before any code is written

“2 Pizza Teams”: Small autonomous teams that had roadmap ownership with decoupled lauch schedules.

Customer collaboration

Get the functionality in the hands of customers as soon as possible. It may be feature limited, but it’s in the hands of customers so that they can get feedback as soon as possible. Iterate iterate iterate based on feedback. Different from the old guard where everything is engineering driven and it is unnecessarily complex.

Netflix platform….

Netflix is on stage and we’re taking about the Netflix cloud prizes and talking about the enhancements to the different tools…looks pretty cool, and will need to check them out. There are 14 chaos monkey “tests” to run now instead of just 1 from before.

Cloud prize winners

Werner is back is breaking down the different facets that AWS focuses on:

Performance- measure everything; put performance data in log files that can be mined.
Security
Reliability
Cost
Scalability

Illya sukhar CEO from Parse is on stage now (platform for mobile apps)
-parse data: store data; it’s 5 lines of code instead of a bunch of code.
-push notification

Parse started with 1 aws instance
From 0-180,000 apps

180,000 collections in mongodb; shows differences between pre and post piops

Security

IAM and IAM roles to set boundaries on who can access what.
How to do this from a db perspective?
Apparently you can have fine grained access controls on dynamodb instead of writing your own code.
Each data block is encrypted in redshift
Cost:
Talking about how customers are using the spot instances to save $.

Scalability:
We transfer usecase, who take care of transferring large files.

Airbnb on stage with mike curtis, VP of engineering
-350k hosts around the world
-4 millions guests (jan 2013)
-9 million guests today.

Host of aws services
1k ec2 instances
Million RDS rows
50tb for photos in s3

“The ops team at Airbnb is with a 5 person ops team.”

Helps devote resources to the real problem.

AirBnB in 2011

AirBnB in 2012

Dropcam came on stage after that to talk about how they use the AWS platform. Nothing too crazy, but interestingly more inbound videos are sent to dropcam than YouTube!

Dropcam

They keynote ended with an Amazon Kinesis demo (and a deadmau5 announcement for the replay party), which on the outside looks like a streaming API and different ways to process data on the backend. A prototype of streaming data from twitter and performing analytics was shown to demonstrate the service.

Announcements

RDS for PostgreSQL
New instance types-i2 for much better io performance
Dynamo db- global secondary indexes!!
Federation with saml 2.0 for IAM
Amazon RDS- cross region read replicas!
G2 instances for media and video intensive application
C3 instances are new with fastest processors- 2.8 gig intel e5 v2
Amazon kinesis- real time processing, fully managed. It looks like this will help you solve issues of scalability when you’re trying to build realtime streaming applications. It integrates with storage and processing services.

Announcements

Incase you want to watch it, the day 2 keynote is here: http://www.youtube.com/watch?v=Waq8Y6s1Cjs

And also, the day 1 keynote: http://www.youtube.com/watch?v=8ISQbdZ7WWc

2 Comments

Filed under Cloud, Conferences

Tagged as 2013, amazon, aws, Cloud, conference, reinvent

by karthequian | November 19, 2013 · 1:57 pm

ReInvent 2013- Scaling on AWS for the First 10 Million Users

This was the first talk by @simon_elisha I went to at ReInvent, and was a packed room. It was targeted towards developers going from inception of an app to growing it to 10 million users. Following are the notes I took…

– We will need a bigger box is the first issue, when you start seeing traffic to an application. Single box is an anti pattern because of no failover etc. move out your db from the web server etc…you could use RDS or something too.

– SQL or NoSQL?
Not a binary decision; maybe use both? A blended approach can reduce technical debt. Maybe just start with SQL because it’s familiar and there are clear patterns for scalability. Nosql is great for super low latency apps, metadata data sets, fast lookups and rapid ingesting data.

So for 100 users…
You can get by using route53, ELB, multiple web instances.

For 10000 users…
– Use cloud front to cache any static assets.
– Get your session state out of the webservers. Session state could be stored in dynamo db because it’s just unrelated data.
– Also might be time for elastic cache now which is just hosted redis or memcached.

Auto scaling…
Min, max servers running in multiple az zones. AWS makes this really simple.

If you end up at the 500k users situation you probably really want:
– metrics and alarms
– automated builds and deploys
– centralized logging

must haves for log metrics to collect:
– host level metrics
– aggregate level metrics
– log analysis
– external site performance

Use a product for this, because there are plenty available, and you can focus on what you’re really trying to accomplish.

Create tools to automate so you save your time especially to manage your time. Some of the ones that you can use are: elastic beanstalk, aws opsworks more for developers and cloud formation and raw ec2 for ops. The key is to be able to repeat those deploys quickly. You probably will need to use puppet and chef to manage the actual ec2 instances..

Now you probably need to redesign your app when you’re at the million user mark. Think about using a service oriented architecture. Loose coupling for the win instead of tight coupling. You can probably put a queue between 2 pieces

Key tip: don’t reinvent the wheel.

Example of what to do when you have a user uploading a picture to a site.

Simple workflow service
– workers and deciders: provides orchestration for your code.

When your data tier starts to break down 5-10 mill users
– federation
Split by function or purpose
Gotcha- You will have issues with join queries
– sharding
This works well for one table with billions of rows.
Gotcha- operationally confusing to manage
– shift to nosql
Sorta similar to federation
Gotcha- crazy architecture change. Use dynamo db.

Final Tips

Leave a comment

Filed under Cloud, Conferences

Tagged as 2013, amazon, aws, Cloud, conference, reinvent

by Ernest Mueller | October 29, 2013 · 11:26 am

LASCON Interview: Jason Chan

Jason Chan (@chanjbs) is an Engineering Director of the Cloud Security team at Netflix.

Tell me about your current gig!

I work on the Cloud Security team at Netflix, we’re responsible for the security of the streaming service at Netflix. We work with some other teams on platform and mobile security.

What are the biggest threats/challenges you face there?

Protecting the personal data of our members of course. Also we have content we want to protect – on the client side via DRM, but mainly the pipeline of how we receive the content from our studio partners. Also, due to the size of the infrastructure, its integrity – we don’t want to be a botnet or have things injected to our content that can our clients.

How does your team’s approach differ from other security teams out there?

We embody the corporate culture more, perhaps, than other security teams do. Our culture is a big differentiator between us and different companies. So it’s very important that people we hire match the culture. Some folks are more comfortable with strong processes and policies with black and white decisions, but here we can’t just say now, we have to help the business get things done safely.

You build a security team and you have certain expertise on it. It’s up to the company how you use that expertise. They don’t necessarily know where all the risk is, so we have to provide objective guidance and then mutually come to the right decision of what to do in a given situation.

Tell us about how you foster your focus on creating tools over process mandates?

We start with recruiting, to understand that policy and process isn’t the solution. Adrian [Cockroft] says process is usually organizational scar tissue. By doing it with tools and automation makes it more objective and less threatening to people. Turning things into metrics makes it less of an argument. There’s a weird dynamic in the culture that’s a form of peer pressure, where everyone’s trying to do the right thing and no one wants to be the one to negatively impact that. As a result people are willing to say “Yes we will” – like, you can opt out of Chaos Monkey, but people don’t because they don’t want to be “that guy.”

We’re starting to look at availability in a much more refined way. It’s not just “how long were you down.” We’re establishing metrics over real impact – how many streams did we miss? How many start clicks went unfulfilled. We can then assign rough values to each operation (it’s not perfect, but based on shared understanding) and then we can establish real impact and make tradeoffs. (It’s more story point-ish instead of hard ROI). But you can get what you need to do now vs what can wait.

Your work – how much is reactive versus roadmapped tool development?

It’s probably 50/50 on our team. We have some big work going on now that’s complex and has been roadmapped for a while. We need to have bandwidth as things pop up though, so we can’t commit everyone 100%. We have a roadmap we’ve committed to that we need to build, and we keep some resource free so that we can use our agile board to manage it. I try to build the culture of “let’s solve a problem once,” and share knowledge, so when it recurs we can handle it faster/better. I feel like we can be pretty responsive with the agile model, our two week sprints and quarterly planning give us flexibility. We get more cross-training too, when we do the mid-sprint statuses and sprint meetings. We use our JIRA board to manage our work and it’s been very successful for us.

What’s it like working at Netflix?

It’s great, I love it. It’s different because you’re given freedom to do the right thing, use your expertise, and be responsible for your decisions. Each individual engineer gets to have a lot of impact on a pretty large company. You get to work on challenging problems and work with good colleagues.

How do you conduct collaboration within your team and with other teams?

Inside the team, we instituted once a week or every other week “deep dives” lunch and learn presentation of what you’re working on for other team members. Cross-team collaboration is a challenge; we have so many tools internally no one knows what they all are!

You are blazing trails with your approach – where do you think the rest of the security field is going?

I don’t know if our approach will catch on, but I’ve spent a lot of my last year recruiting, and I see that the professionalization of the industry in general is improving. It’s being taught in school, there’s greater awareness of it. It’s going to be seen as less black magic, “I must be a hacker in my basement first” kind of job.

Development skills are mandatory for security here, and I see a move away from pure operators to people with CS degrees and developers and an acceleration in innovation. We’ve filed three patents on the things we’ve built. Security isn’t’ a solved problem and there’s a lot left to be done!

We’re working right now on a distributed scanning system that’s very AWS friendly, code named Monterey. We hope to be open sourcing it next year. How do you inventory and assess an environment that’s always changing? It’s a very asynchronous problem. We thought about it for a while and we’re very happy with the result – it’s really not much code, once you think the problem through properly your solution can be elegant.

1 Comment

Filed under Cloud, Conferences, Security

Tagged as conference, interview, lascon, netflix, Security

by Ernest Mueller | July 17, 2013 · 9:51 am

Cloud Austin Logging Tool Roundup Presentations

James, Karthik, and I run Cloud Austin, a technical user group for cloud computing types in Austin. Last night we broke new ground by videoing the presentations using Hangouts On Air, and the result is a cool bunch of 15 minute presentations on Splunk, Sumo Logic, Logstash, Greylog2 (including one from Lennat Koopmann, the maintainer) and the first public presentation of Project Meniscus, Rackspace’s new logging system.

You can go get slides and watch the 2+ hour long video on the Cloud Austin blog.

Leave a comment

Filed under Cloud, DevOps

Tagged as greylog2, logging, logstash, meniscus, splunk, sumo logic

by Ernest Mueller | June 24, 2013 · 2:51 pm

Crosspost: How Bazaarvoice Weathered The AWS Storm

For regular agile admin readers, I wanted to point out the post I did on the Bazaarvoice engineering blog, How Bazaarvoice Weathered The AWS Storm, on how we have designed for resiliency to the point where we had zero end user facing downtime during last year’s AWS meltdown and Leapocalypse. It’s a bit late, I wrote it like in July and then the BV engineering blog kinda fell dormant (guy who ran it left, etc.) and we’re just getting it reinvigorated. Anyway, go read the article and also watch that blog for more good stuff to come!

Leave a comment

Filed under Cloud, DevOps

Tagged as availability zone, aws, bazaarvoice, Cloud, DevOps, ec2, leapocalypse, outage, region, resiliency

by karthequian | June 20, 2013 · 4:40 pm

Velocity 2013 Day 3: benchmarking the new front end

By Emily Nakashima and Rachel Myers

bitly.com/ostrichandyak

Talking about their experiences at mod cloth…..

Better performance is more user engagement, page views etc…

Basically, we’re trying to improve performance because it improves user experience.

A quick timeline on standards and js mvc frameworks from 2008 till present.

NewRelic was instrumented to get an overview of performance and performance metrics; the execs asked for a dashboard!! Execs love dashboard 🙂

Step 1: add a cdn; it’s an easy win!
Step 2: The initial idea was to render the easy part of the site first- 90% render.
Step 3: changed this to a single page app

BackboneJS was used to redesign the app to a single page app from the way the app was structured before.

There aren’t great tools for Ajax enabled sites to figure out perf issues. Some of the ones that they used were:
– LogNormal: rebranded as Soasta mpulse
– newrelic
– yslow
– webpagetest
– google analytics (use for front end monitoring, check out user timings in ga)- good 1st step!
– circonus (which is the favorite tool of the presenters)

Category Archives: Cloud

Special CloudAustin SXSW Edition 3/6

LASCON Interview: Jason Chan

Cloud Austin Logging Tool Roundup Presentations

Velocity 2013 Day 3: benchmarking the new front end

Subscribe

Recent Comments

Recent Posts

Austinites

Cloud

DevOps

Archives