Tag Archives: netflix

LASCON Interview: Jason Chan

Jason Chan (@chanjbs) is an Engineering Director of the Cloud Security team at Netflix.

Tell me about your current gig!

I work on the Cloud Security team at Netflix; we’re responsible for the security of the streaming service. We also work with some other teams on platform and mobile security.

What are the biggest threats/challenges you face there?

Protecting the personal data of our members, of course. Also we have content we want to protect – on the client side via DRM, but mainly the pipeline of how we receive the content from our studio partners. Also, due to the size of the infrastructure, its integrity – we don’t want to be a botnet or have things injected into our content that can harm our clients.

How does your team’s approach differ from other security teams out there?

We embody the corporate culture more, perhaps, than other security teams do. Our culture is a big differentiator between us and other companies, so it’s very important that the people we hire match the culture. Some folks are more comfortable with strong processes and policies with black and white decisions, but here we can’t just say no; we have to help the business get things done safely.

You build a security team and you have certain expertise on it.  It’s up to the company how you use that expertise. They don’t necessarily know where all the risk is, so we have to provide objective guidance and then mutually come to the right decision of what to do in a given situation.

Tell us about how you foster your focus on creating tools over process mandates?

We start with recruiting, to make sure people understand that policy and process isn’t the solution. Adrian [Cockcroft] says process is usually organizational scar tissue. Doing it with tools and automation makes it more objective and less threatening to people. Turning things into metrics makes it less of an argument. There’s a weird dynamic in the culture that’s a form of peer pressure, where everyone’s trying to do the right thing and no one wants to be the one to negatively impact that. As a result people are willing to say “Yes we will” – like, you can opt out of Chaos Monkey, but people don’t because they don’t want to be “that guy.”

We’re starting to look at availability in a much more refined way. It’s not just “how long were you down.” We’re establishing metrics over real impact – how many streams did we miss? How many start clicks went unfulfilled? We can then assign rough values to each operation (it’s not perfect, but based on shared understanding) and then we can establish real impact and make tradeoffs. (It’s more story point-ish instead of hard ROI.) But you can work out what you need to do now vs. what can wait.
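
To make the "rough values per operation" idea concrete, here is a minimal sketch. The operation names and weights are invented for illustration; nothing here reflects Netflix's actual numbers or tooling.

```python
# Hypothetical per-operation weights; real values would come from the
# "shared understanding" mentioned above, not from this sketch.
OPERATION_WEIGHTS = {
    "start_click_unfulfilled": 5,  # a member pressed play and nothing started
    "stream_missed": 3,            # a stream that should have played did not
    "browse_failed": 1,            # a page or row failed to load
}

def impact_score(failed_counts):
    """Sum failed operations, weighted by their rough business value."""
    return sum(OPERATION_WEIGHTS[op] * count for op, count in failed_counts.items())

# Two made-up incidents: a short one that hit playback, a longer one that did not.
outage_a = {"start_click_unfulfilled": 100, "browse_failed": 50}
outage_b = {"browse_failed": 400}

print(impact_score(outage_a), impact_score(outage_b))
```

The shorter incident scores higher because it blocked playback, which is the kind of tradeoff a plain "minutes down" number hides.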

Your work – how much is reactive versus roadmapped tool development?

It’s probably 50/50 on our team.  We have some big work going on now that’s complex and has been roadmapped for a while.  We need to have bandwidth as things pop up though, so we can’t commit everyone 100%. We have a roadmap we’ve committed to that we need to build, and we keep some resource free so that we can use our agile board to manage it. I try to build the culture of “let’s solve a problem once,” and share knowledge, so when it recurs we can handle it faster/better.  I feel like we can be pretty responsive with the agile model, our two week sprints and quarterly planning give us flexibility. We get more cross-training too, when we do the mid-sprint statuses and sprint meetings. We use our JIRA board to manage our work and it’s been very successful for us.

What’s it like working at Netflix?

It’s great, I love it.  It’s different because you’re given freedom to do the right thing, use your expertise, and be responsible for your decisions. Each individual engineer gets to have a lot of impact on a pretty large company.  You get to work on challenging problems and work with good colleagues.

How do you conduct collaboration within your team and with other teams?

Inside the team, we instituted “deep dives” once a week or every other week: lunch-and-learn presentations of what you’re working on for the other team members. Cross-team collaboration is a challenge; we have so many tools internally that no one knows what they all are!

You are blazing trails with your approach – where do you think the rest of the security field is going?

I don’t know if our approach will catch on, but I’ve spent a lot of my last year recruiting, and I see that the professionalization of the industry in general is improving.  It’s being taught in school, there’s greater awareness of it. It’s going to be seen as less black magic, “I must be a hacker in my basement first” kind of job.

Development skills are mandatory for security here, and I see a move away from pure operators to people with CS degrees and developers, and an acceleration in innovation. We’ve filed three patents on the things we’ve built. Security isn’t a solved problem and there’s a lot left to be done!

We’re working right now on a distributed scanning system that’s very AWS friendly, code named Monterey. We hope to be open sourcing it next year.  How do you inventory and assess an environment that’s always changing? It’s a very asynchronous problem. We thought about it for a while and we’re very happy with the result – it’s really not much code, once you think the problem through properly your solution can be elegant.

Filed under Cloud, Conferences, Security

LASCON 2013 Report – First Morning

Arriving at #LASCON 2013, hosted as usual at the Norris Conference Center, the first thing you see is the vintage video games throughout the lobby! As usual it’s well run and you get your metal badge and other doodads without any folderol; volunteers packed the venue ready to help folks with anything. I got a lovely media badge since I’m on the hook to blog/tweet it up while I’m there! It’s in a nice central location on Anderson Lane so getting there took a lot less time than my normal commute to work did.

The MCs, James Wickett and David Hughes, got us kicked off. Thanks went out to the many LASCON sponsors!

  • White Hat
  • Qualys
  • Gemalto
  • Trustwave/Spider Labs
  • Critical Start
  • Sourcefire
  • SOS Security

Then everyone stood and raised their right hand to say the “LASCON pledge,” which consists of “I will not hack the Wi-fi,” “I will not social engineer other attendees and the nice Norris Conference Center staff who are hosting us,” and similar.

Then, the keynote!

Keynote – Nick Galbreath, The Origins of Insecurity

Nick Galbreath (@ngalbreath), VP of Engineering at Iponweb. He used to work for Etsy, now he works in Tokyo for a Russia-based ad infrastructure company.  Suck that, Edward Snowden.

Slides at speakerdeck.com/ngalbreath!

If you’re in security, you should be bringing someone else from dev or ops or something here! We can’t get much done by ourselves.

Crypto

There’s a lot of consternation about crypto and SSL and PKI lately. The math is sound!  See FP’s “The NSA’s New Code Breakers” – it’s way easier to get access other ways. I don’t know of any examples of brute forcing SSL keys – it’s attacking data at rest or bypassing it altogether.

But what about the android/bitcoin break and alleged fix re: Java SecureRandom PRNG? I can’t find the fix checked in anywhere.  Let’s look at SHA1PRNG. Where’s the spec? You’re forced to use it, where’s the open implementation, tests…

Basically everything went wrong in specification, implementation, testing, review, postmortem… Then there’s NIST’s Dual-EC-DRBG spec – slow and with a potential backdoor – but at least it’s not required by FIPS! It’s broken but not mandatory and we know it’s broken, so fair enough. It’s a “standard turd.” Standards aren’t a replacement for common sense. It was known to be turdy in 2007, so why are you just removing it now? TLS 1.2 was approved in 2008, so why don’t all browsers support it, and why do no browsers support GCM mode? Old standards need augmentation and updates.

Fixing the CA system – four great ways: certificate pinning, pruning, HTTP Strict Transport Security (HSTS), certificate-transparency.org.
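
As a rough illustration of the first item, certificate pinning boils down to "only trust certificates whose fingerprint I already know." The sketch below is not from the talk; it uses stand-in bytes instead of a real TLS handshake, and real deployments often pin the public key rather than the whole certificate.

```python
import hashlib

def fingerprint(der_cert: bytes) -> str:
    """SHA-256 fingerprint of a DER-encoded certificate, as hex."""
    return hashlib.sha256(der_cert).hexdigest()

def is_pinned(der_cert: bytes, pins: set) -> bool:
    """Accept the certificate only if its fingerprint is in the pin set."""
    return fingerprint(der_cert) in pins

# Stand-in bytes for illustration. A real client would take the DER bytes
# from the handshake, e.g. ssl.SSLSocket.getpeercert(binary_form=True).
known_cert = b"example certificate bytes"
pins = {fingerprint(known_cert)}

print(is_pinned(known_cert, pins))
print(is_pinned(b"certificate from some other CA", pins))
```

The point is that a certificate signed by any other CA, however valid the chain, fails the check.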

Everything Else

  • Network Security – stuff you didn’t write
  • App Security – stuff you did write
  • Endpoint Security – stuff you run

IT internal tech is mostly Windows/Mac CM and patching, 99% C-based stuff.

Tech Ops – Routers, Linux, Core server (all C too)

Dev:

  • Input validation – not hard
  • Configuration problems
  • Logical problems – more interesting
  • Language platform problems (most patches here also in C!)

Reactive work is patching, CM, fixing apps, patching infrastructure. You can focus your patching though – Win7 at current patches, Flash, Adobe, Java will get 99% of your problems, focus there – but it’s hard to do. But either you can do it trivially or it’s really hard.

Learn from the hardest apps to deploy.  The Chrome model of self updating gets 97% of people within a version in 4-6 weeks. Android, not so good – driven more by throwing out phones than any ability to upgrade. They’re chipping stuff away from the OS and making more into apps to speed it up. Apple/iOS just figured out app auto-update. Desktop lags though. WordPress is starting background updates. BSD is automatically installing security updates at first boot.

Releasing faster and safely is a competitive advantage AND makes you more secure.

For desktop upgrades, can’t we do something with containers? Why only one version installed? How can we find out about problems from users faster? How do we make patching and deployment easy for the dumbest users?

Even info on “How do I configure Apache securely” is wide and random on the Web. Silently breaks all the time, and it’s simple compared to firewalls, ssh, VPN, DNS… Rat’s nests full of crap, while it gets easier and easier to put servers on the internet. How can we make it safe to configure a server and keep it secure?

Can we do this for application development? Ruby’s Brakeman is great, it does static analysis on commit and sends you email about rookie mistakes. Why not for Apache config? (Where did chkconfig go?)
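
A "Brakeman for Apache config" could start as small as the sketch below. The three rules are only illustrative picks (the directives are real Apache ones, but this rule set and the `lint` function are made up here); a usable tool would need a proper config parser and a maintained rule list.

```python
import re

# Illustrative rules only: (pattern, message). Comment lines never match
# because each pattern is anchored to a directive at the start of the line.
RULES = [
    (re.compile(r"^\s*ServerTokens\s+(Full|OS)\b", re.I),
     "ServerTokens leaks server version details"),
    (re.compile(r"^\s*TraceEnable\s+On\b", re.I),
     "TRACE method is enabled"),
    (re.compile(r"^\s*Options\b.*(?<!-)\bIndexes\b", re.I),
     "directory listings are enabled"),
]

def lint(config_text):
    """Return (line_number, message) for every rule a line trips."""
    findings = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        for pattern, message in RULES:
            if pattern.search(line):
                findings.append((number, message))
    return findings

sample = """\
ServerTokens Full
TraceEnable Off
<Directory /var/www>
    Options Indexes FollowSymLinks
</Directory>
"""

for number, message in lint(sample):
    print(f"line {number}: {message}")
```

Hooked into a commit hook or CI job, something like this could send the same kind of "rookie mistake" email Brakeman does.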

PHP Crypt – great for legacy passwords and horrible for new ones. Approximately 0% chance of a dev getting its configuration right.

See @manicode’s best practices – have a business level API for that.

By default, every language has a non-crypto, insecure PRNG. So people use them. They are used for some science stuff, but seriously if you’re doing physics you’re going to link something else in. Being slightly slower for toy apps that don’t care about security isn’t a big deal. Make the default PRNG secure! And, there’s 100x more people interested in making things fast than making them secure, so make the default language PRNG secure and people will make it faster.
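
Python is a handy example of the split being described: the default `random` module is a fast, seedable Mersenne Twister, while the secure generator lives in a separate `secrets` module that people have to know to reach for. A short sketch:

```python
import random
import secrets

# The default generator is reproducible from its seed, which is great for
# simulations and exactly wrong for tokens, passwords, or keys.
random.seed(1234)
first = random.getrandbits(64)
random.seed(1234)
second = random.getrandbits(64)
print(first == second)  # same seed, same "random" value

# The secrets module draws from the operating system's CSPRNG instead.
session_token = secrets.token_hex(16)  # 16 random bytes as 32 hex characters
print(len(session_token))
```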

libinjection.client9.com to try to eliminate SQL injection! It’s C, fast, low false positives, plug in anywhere.

Products focus on blocking and offense/intrusion, but leave these areas (actual fixing) uncovered. Think globally, act locally. Even if you’re not a dev, most open source doesn’t have a security anything – join in! Write fuzzers, compile with different flags, etc.

So think big, get involved, bring your friends!

Malware Automation

By Christopher Elisan from RSA, aka @tophs.

Total discovered malware is growing geometrically year over year. There are a lot of “DIY malware creation kits” nowadays; SpyEye, Zeus… These are more oriented around online crime; the kits of yesteryear were more about pissing contests about “mine is better than yours” (VCL, PS-MPC). The variation they can create is larger as well.

Armoring tools exist now – PFE CX for example, claims to encrypt, compress, etc. your executable – but the functions don’t always all work and buyers don’t check. Indetectables.net is online and will do it! It was free but now it’s “hidden.”

Use a tool like ExeBundle to bundle up your malware and then share it out via whatever route (file sharing, google play, whatever). Or hacking and overwriting good wares – even those that bother publishing a hash to verify their software often keep it on the same Web site that is already getting hacked to change the executable, so the hash just gets changed too.
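
On the defender's side, the takeaway is that a published hash only helps if it reaches you through a different channel than the download. A minimal sketch of the check itself (the byte strings are stand-ins, and `verify_download` is a name made up here):

```python
import hashlib
import hmac

def verify_download(data: bytes, trusted_sha256_hex: str) -> bool:
    """Compare the download's SHA-256 against a hash obtained out of band."""
    actual = hashlib.sha256(data).hexdigest()
    return hmac.compare_digest(actual, trusted_sha256_hex.lower())

# Stand-in data. The trusted hash must come from somewhere other than the
# site serving the file (signed release notes, a package index, etc.).
genuine = b"genuine installer bytes"
trusted_hash = hashlib.sha256(genuine).hexdigest()

print(verify_download(genuine, trusted_hash))
print(verify_download(genuine + b" plus bundled malware", trusted_hash))
```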

So you make your malware with a kit, put it through a crypter, a realtime packer, an EXE binder, and other armoring tools, then run it through QA against on-premise and cloud AV, and then you’re ready to go.

Targeted vs opportunistic attacks… Delivery is a lot easier when you can target.

Anyway, many of those new malware samples are really just the same core malware run through a different variety of armoring tools. They’re counted as different malware but should get grouped into families; he’s working on that at RSA now.

Besides the variation in malware, domains serving malware can rotate in minutes. Since the malware can be created so quickly it effectively defeats AV by generating too many unique signatures. Reversing has to be done but it takes weeks/months.

Demo: Creating Malware in 2 Minutes!

ZeuS Builder – bang, bot.exe, one every couple seconds. Unique but not hash-unique at this point. They look different on disk and in memory. Then runs Saw Crypter, in seconds it creates multiple samples from one ZeuS sample. Bang, automated generation of billlllllyuns of armored samples.

There’s really just a handful of kits behind all the malware; we need new solutions that go after the tools and do signature-less detection.

From Gates to Guardians: Alternate Approaches to Product Security

Jason Chan, Director of Engineering from Netflix, in charge of security for the streaming product. Here are his slides on Slideshare!

Agile, cloud, continuous delivery, DevOps – traditional security doesn’t adapt well to these. We want to move fast and stay safe at Netflix.

The challenges are speed (rapid change) and scale. To address these…

  • Culture – If your culture has moved towards rapid delivery, it’s innovation first. Don’t be “Doctor No” and go against your company culture, you won’t be successful.  Adapt.
  • Visibility – you need to be able to see what’s going on in a big distributed system.
  • Automation – no checklists and spreadsheets

At Netflix they do 200+ pushes to production a day, with 40M subscribers and 1000+ supported devices.

Culture

We have a lot of stuff on our site about this, it’s a big differentiator.  “Freedom and responsibility” is the summary. No buck passing. Responsible disclosure program externally.

We’re moving towards “full stack engineers” that know some about appsec, online operations, monitoring and response, infrastructure/systems/cloud – that can write some kind of code. The security industry seems to be moving towards superspecialists, we don’t see that as successful.

2 week sprint model, JIRA Scrum workflow (CLDSEC project!). No standups, weekly midsprint meeting. Bullpen shared-space model.

Visibility

They use an internal security dashboard (VPC, crypto, and other services plug in and display their security metrics). Alerts send emails with descriptive subjects, the alert config, and instructions/links as to where to check/what to do. Chat integration.
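
The alert format described here (descriptive subject, the alert config, and what to do next) is easy to sketch with the standard library. The addresses, check name, and runbook URL below are all placeholders, not anything shown in the talk.

```python
from email.message import EmailMessage

def build_alert(check, resource, detail, alert_config, runbook_url):
    """Build an alert email that says what fired, why, and what to do."""
    msg = EmailMessage()
    msg["Subject"] = f"[security-alert] {check}: {resource} ({detail})"
    msg["From"] = "alerts@example.com"         # placeholder address
    msg["To"] = "security-team@example.com"    # placeholder address
    msg.set_content(
        f"What fired: {check}\n"
        f"Resource: {resource}\n"
        f"Alert config: {alert_config}\n"
        f"What to do: {runbook_url}\n"
    )
    return msg

alert = build_alert(
    check="open-ssh-ingress",
    resource="sg-web",
    detail="port 22 open to 0.0.0.0/0",
    alert_config="severity=high, notify=email+chat",
    runbook_url="https://wiki.example.com/runbooks/open-ssh",
)
print(alert["Subject"])
```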

NSA asks, how do you verify software integrity in production?  How do you know you’re not backdoored?

Their Mimir dashboard is a CI/CD dashboard that tracks source code to build to deploy to JIRA ticket. Traceability!

Canary testing because code reviews don’t catch much.  Deploy a new version and test it (regression, perf, security) and see if it’s OK. Automatic Canary Analyzer gets a confidence level – “99% GO!”
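
The talk didn't go into how the analyzer computes its confidence level, so the following is only a guess at the general shape: compare each canary metric to the baseline and report the percentage that stay within a tolerance. The metric names, tolerance, and threshold are all assumptions.

```python
def canary_score(baseline, canary, tolerance=0.10):
    """Percentage of metrics where the canary is no more than `tolerance`
    worse than the baseline. Every metric here is lower-is-better."""
    passed = 0
    for name, base_value in baseline.items():
        if canary[name] <= base_value * (1 + tolerance):
            passed += 1
    return 100.0 * passed / len(baseline)

def decide(score, threshold=95.0):
    """Turn a score into the go/no-go call."""
    return "GO" if score >= threshold else "NO-GO"

baseline = {"error_rate": 0.010, "p99_latency_ms": 250.0, "cpu_pct": 40.0, "mem_mb": 900.0}
healthy = {"error_rate": 0.010, "p99_latency_ms": 255.0, "cpu_pct": 41.0, "mem_mb": 910.0}
broken = {"error_rate": 0.050, "p99_latency_ms": 255.0, "cpu_pct": 41.0, "mem_mb": 910.0}

for name, canary in (("healthy", healthy), ("broken", broken)):
    score = canary_score(baseline, canary)
    print(name, score, decide(score))
```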

Simian Army does ongoing testing. Go to prod… Then the monkeys test it.

Security Monkey shows config change timestamps of security groups and stuff.

So they have Babou (the ocelot from Archer) that does file integrity monitoring. They use the immutable server pattern so checking is kinda easy, but you still can be running multiple canary versions at the same time so there’s not one “golden master.” This allows multiple baselines.
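
Here is a small sketch of what "multiple baselines" could mean for file integrity monitoring: hash every file on an instance and accept it if the result matches any currently deployed version. This is not Babou's implementation; the in-memory file dictionaries stand in for reading real files off an instance.

```python
import hashlib

def manifest(files):
    """Map each path to the SHA-256 of its contents."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}

def matching_baseline(observed, baselines):
    """Return the name of the baseline the instance matches, or None."""
    for name, expected in baselines.items():
        if observed == expected:
            return name
    return None

# Two versions are live at once (current build plus a canary), so both are valid.
v1_files = {"/app/server.jar": b"build 101", "/app/config.yml": b"mode: prod"}
v2_files = {"/app/server.jar": b"build 102", "/app/config.yml": b"mode: prod"}
baselines = {"v1": manifest(v1_files), "v2-canary": manifest(v2_files)}

tampered = dict(v1_files, **{"/app/server.jar": b"build 101 + backdoor"})

print(matching_baseline(manifest(v2_files), baselines))
print(matching_baseline(manifest(tampered), baselines))
```

With immutable servers, anything that matches no baseline is suspicious by definition, since nothing should change after the image is built.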

Q: How long did it take to make this change and implement? What were the triggers?
A: This push started when he started in 2011; previously IT security handled product security. He hired his first person last year and now they’re up to 10.

Q: What do you do earlier on in the lifecycle in arch and design (threat modeling etc.)?
A: Can’t be automated, the model here is optionally come engage us (with more aggressiveness for stuff that’s clearly sensitive/SOXey).

Q: So this finds problems but how do people know what to do in the first place, share mistakes cross teams?
A: As things happen, they add libraries along with training and documentation. But think of it as “libraries.”

Q: Competing with Amazon while renting their hardware? (Laaaaaame, the CEO has talked about this in multiple venues.)
A: AWS is the only real choice. Our CEOs talked.

Next – Lunch!  No liveblog of lunch, you foodie voyeurs!

Filed under Conferences, Security

Velocity 2013 Day 2 Liveblog – Application Resilience Engineering and Operations at Netflix

Application Resilience Engineering and Operations at Netflix by Ben Christensen (@benjchristensen)

Netflix and resilience.  We have all this infrastructure failover stuff, but once you get to the application each one has dozens of dependencies that can take them down.

Needed speed of iteration, to provide client libraries (just saying “here’s my REST service” isn’t good enough), and a mixed technical environment.

They like the Bulkheading pattern (read Michael Nygard’s Release It! to find out what that is). They want the app to degrade gracefully if one of its dozen dependencies fails. So they wrote Hystrix.

1. Use a tryable semaphore in front of every library they talk to. Use it to shed load. (circuit breaker)

2. Replace that with a thread pool, which adds the benefit of thread isolation and timeouts.
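
The two steps above can be sketched in a few lines. Hystrix itself is a Java library; this is a Python analog with made-up class names, meant only to show the difference between the two isolation styles.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

class SemaphoreBulkhead:
    """Step 1: a tryable semaphore. If no permit is free, shed the call."""
    def __init__(self, max_concurrent):
        self._permits = threading.Semaphore(max_concurrent)

    def call(self, func, fallback):
        if not self._permits.acquire(blocking=False):
            return fallback()          # shed load instead of queueing
        try:
            return func()
        finally:
            self._permits.release()

class ThreadPoolBulkhead:
    """Step 2: a dedicated pool per dependency, which also gives timeouts."""
    def __init__(self, max_workers, timeout_s):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._timeout_s = timeout_s

    def call(self, func, fallback):
        future = self._pool.submit(func)
        try:
            return future.result(timeout=self._timeout_s)
        except TimeoutError:
            # The caller stops waiting; the worker thread is not killed.
            return fallback()

# A nested call shows shedding: the outer call holds the only permit.
sem = SemaphoreBulkhead(max_concurrent=1)
nested = sem.call(lambda: sem.call(lambda: "inner ran", lambda: "inner shed"),
                  lambda: "outer shed")
print(nested)

pool = ThreadPoolBulkhead(max_workers=2, timeout_s=0.05)
slow = pool.call(lambda: time.sleep(0.2) or "finished", lambda: "timed out")
fast = pool.call(lambda: "finished", lambda: "timed out")
print(slow, fast)
```

The semaphore version is cheaper but cannot interrupt a slow dependency; the thread pool version costs a thread hop but lets the caller walk away after the timeout.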

A request gets created and goes through the circuit breaker, runs, then health gets fed back into the front. Errors all go back into the same channel.
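
That feedback loop can be sketched as a tiny circuit breaker. This simplified version trips on consecutive failures and retries after a cool-down; it is a stand-in to show the flow, not how Hystrix actually measures health, and the injectable clock is just there to make the example deterministic.

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None   # None means the circuit is closed

    def allow_request(self):
        if self._opened_at is None:
            return True
        # After the cool-down, let a trial request through (half-open).
        return self._clock() - self._opened_at >= self.reset_after_s

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()

    def call(self, func, fallback):
        if not self.allow_request():
            return fallback()            # fail fast while the circuit is open
        try:
            result = func()
        except Exception:
            self.record_failure()        # health fed back to the front
            return fallback()
        self.record_success()
        return result

def failing_dependency():
    raise RuntimeError("dependency down")

now = [0.0]                              # fake clock for the demo
breaker = CircuitBreaker(failure_threshold=2, reset_after_s=10.0, clock=lambda: now[0])
breaker.call(failing_dependency, lambda: "fallback")
breaker.call(failing_dependency, lambda: "fallback")
while_open = breaker.call(lambda: "real result", lambda: "fallback")
now[0] = 11.0                            # cool-down has passed
after_reset = breaker.call(lambda: "real result", lambda: "fallback")
print(while_open, after_reset)
```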

The “HystrixCommand” class provides fail fast, fail silent (intercept, especially for optional functionality, and replace with an appropriate null), stubbed fallback (try with the limited data you have – e.g. can’t fetch the video bookmark, send you to the start instead of breaking playback), fallback via network (like to a stale cache or whatever).
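
The bookmark example maps neatly onto a command object with a primary path and a fallback. In Hystrix the Java class exposes `run()` and `getFallback()`; the Python classes below are a loose analog with invented names, without any of the isolation or metrics.

```python
class Command:
    """Minimal command wrapper: try run(), fall back if it raises."""
    def run(self):
        raise NotImplementedError

    def get_fallback(self):
        raise NotImplementedError

    def execute(self):
        try:
            return self.run()
        except Exception:
            return self.get_fallback()

class GetBookmark(Command):
    """Fetch where the member left off in a video (hypothetical service)."""
    def __init__(self, fetch, video_id):
        self._fetch = fetch
        self._video_id = video_id

    def run(self):
        return self._fetch(self._video_id)

    def get_fallback(self):
        return 0   # stubbed fallback: start from the beginning, keep playback working

def healthy_service(video_id):
    return 1325    # seconds into the video

def broken_service(video_id):
    raise ConnectionError("bookmark service unavailable")

print(GetBookmark(healthy_service, "video-42").execute())
print(GetBookmark(broken_service, "video-42").execute())
```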

Moved into operational mode.  How do you know that failures are generating fallbacks?  Line graph syncup in this case (weird). But they use instrumentation of this as part of a purty dashboard. Get lots of low latency granular metrics about bulkhead/circuit breaker activity pushed into a stream.

Here’s where I got confused.  I guess we moved on from Hystrix and are just on to “random good practices.”

Make low latency config changes across a cluster too. Push across a cluster in seconds.

Auditing via simulation (you know, the monkeys).

When deploying, deploy to canary fleet first, and “Zuul” routing layer manages it. You know canarying.  But then…

“Squeeze” testing – every deploy is burned as an AMI, and they push it to its performance degradation point to learn the rps/load it can take.

Most recently they added “Coalmine” testing – finding the “unknown unknowns” – an environment on current code capturing all network traffic, especially traffic crossing bulkheads. For example, a new feature that’s feature flagged, so it isn’t caught in canary, suddenly starts getting traffic.

So when a problem happens – the failure is isolated by bulkheads and the cluster adapts by flipping its circuit breaker.

Distributed systems are complex.  Isolate relationships between them.

Auditing and operations are essential.

Filed under Conferences, DevOps