Application Resilience Engineering and Operations at Netflix by Ben Christensen (@benjchristensen)
Netflix and resilience. We have all this infrastructure failover stuff, but once you get to the application layer, each app has dozens of dependencies, any of which can take it down.
They needed speed of iteration, client libraries (just saying “here’s my REST service” isn’t good enough), and support for a mixed technical environment.
They like the Bulkheading pattern (read Michael Nygard’s Release it! to find out what that is). Want the app to degrade gracefully if one of its dozen dependencies fails. So they wrote Hystrix.
1. Use a tryable semaphore in front of every library they talk to. Use it to shed load. (circuit breaker)
2. Replace that with a thread pool, which adds the benefit of thread isolation and timeouts. (See the sketch just below.)
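Both isolation styles are exposed by the open-source Hystrix API. Here’s a minimal sketch of a command wrapping one dependency with thread isolation and a timeout – the names, timeout, and pool size are illustrative, and the remote call is simulated:

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;
import com.netflix.hystrix.HystrixThreadPoolProperties;

// One bulkhead per dependency: a small dedicated thread pool plus a timeout,
// with the circuit breaker fed by the command's success/failure/timeout outcomes.
public class VideoMetadataCommand extends HystrixCommand<String> {

    private final String videoId;

    public VideoMetadataCommand(String videoId) {
        super(Setter
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey("VideoMetadata"))
                .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                        // THREAD isolation = thread isolation + timeouts (step 2 above);
                        // SEMAPHORE is the lighter tryable-semaphore style (step 1).
                        .withExecutionIsolationStrategy(
                                HystrixCommandProperties.ExecutionIsolationStrategy.THREAD)
                        .withExecutionIsolationThreadTimeoutInMilliseconds(250))
                .andThreadPoolPropertiesDefaults(HystrixThreadPoolProperties.Setter()
                        .withCoreSize(10))); // bounded pool = the bulkhead
        this.videoId = videoId;
    }

    @Override
    protected String run() {
        // The real network call to the metadata dependency would go here;
        // a placeholder keeps the sketch self-contained.
        return "metadata-for-" + videoId;
    }
}
```

Callers just do `new VideoMetadataCommand("abc").execute()`; if the pool is saturated, the call times out, or the breaker is open, the caller is cut off quickly instead of blocking on the sick dependency.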
A request gets created and goes through the circuit breaker, runs, and then its health outcome gets fed back into the front. Errors all go back into the same channel.
The “HystrixCommand” class provides fail fast, fail silent (intercept the failure, especially for optional functionality, and replace it with an appropriate null), stubbed fallback (respond with the limited data you have – e.g. if you can’t fetch the video bookmark, send the viewer to the start instead of breaking playback), and fallback via network (like to a stale cache or whatever).
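As an illustration of the stubbed-fallback case from his bookmark example (a sketch – the command name and the simulated failure are mine, not from the talk):

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// Stubbed fallback: if the bookmark lookup fails or times out, resume at the
// start of the video instead of breaking playback.
public class BookmarkCommand extends HystrixCommand<Integer> {

    private final String customerId;
    private final String videoId;

    public BookmarkCommand(String customerId, String videoId) {
        super(HystrixCommandGroupKey.Factory.asKey("BookmarkService"));
        this.customerId = customerId;
        this.videoId = videoId;
    }

    @Override
    protected Integer run() {
        // The real call to the bookmark service would go here; simulated as a
        // failure so the sketch exercises the fallback path.
        throw new RuntimeException("bookmark service unavailable");
    }

    @Override
    protected Integer getFallback() {
        // Stubbed fallback: position 0 = start of the video.
        // "Fail silent" would return an empty value for optional features;
        // "fail fast" would simply not implement getFallback() at all.
        return 0;
    }

    public static void main(String[] args) {
        int position = new BookmarkCommand("user-123", "video-456").execute();
        System.out.println("resume at " + position + "s"); // prints 0 via the fallback
    }
}
```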
Then he moved into operational mode. How do you know that failures are generating fallbacks? Line-graph sync-up in this case (weird). But they use instrumentation of this as part of a purty dashboard, with lots of low-latency, granular metrics about bulkhead/circuit-breaker activity pushed into a stream.
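The stream he’s describing matches what the open-source hystrix-metrics-event-stream contrib module provides; assuming that module and a Servlet 3.0 container, wiring it up looks roughly like this (the dashboard then subscribes to /hystrix.stream):

```java
import com.netflix.hystrix.contrib.metrics.eventstream.HystrixMetricsStreamServlet;

import javax.servlet.ServletContext;
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import javax.servlet.annotation.WebListener;

// Registers the metrics stream endpoint at startup; command and thread-pool
// metrics (successes, timeouts, rejections, circuit-breaker state) are pushed
// out as a text/event-stream on /hystrix.stream for a dashboard to consume.
@WebListener
public class MetricsStreamBootstrap implements ServletContextListener {

    @Override
    public void contextInitialized(ServletContextEvent sce) {
        ServletContext ctx = sce.getServletContext();
        ctx.addServlet("HystrixMetricsStream", new HystrixMetricsStreamServlet())
           .addMapping("/hystrix.stream");
    }

    @Override
    public void contextDestroyed(ServletContextEvent sce) {
        // nothing to clean up
    }
}
```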
Here’s where I got confused. I guess we moved on from Hystrix and are just on to “random good practices.”
Make low-latency config changes across a cluster too – pushed out in seconds.
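He didn’t name the mechanism here; Netflix’s open-source Archaius library is the likely one, so treat this as a hedged sketch of a dynamic property whose pushed value is picked up without a redeploy (the property name is hypothetical):

```java
import com.netflix.config.DynamicIntProperty;
import com.netflix.config.DynamicPropertyFactory;

// A dynamic property: the default is compiled in, but a pushed config change
// takes effect the next time get() is called, with no redeploy or restart.
public class TimeoutConfig {

    private static final DynamicIntProperty METADATA_TIMEOUT_MS =
            DynamicPropertyFactory.getInstance()
                    .getIntProperty("video.metadata.timeout.ms", 250);

    public static int metadataTimeoutMs() {
        // Always returns the latest value the instance has been given,
        // so the whole cluster converges within seconds of a push.
        return METADATA_TIMEOUT_MS.get();
    }
}
```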
Auditing via simulation (you know, the monkeys).
When deploying, deploy to the canary fleet first, and the “Zuul” routing layer manages it. You know canarying. But then…
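He didn’t show how Zuul decides what goes to the canary; purely as an illustration, a Zuul 1.x routing filter could divert a slice of traffic like this (the filter, fraction, and canary host are all hypothetical):

```java
import com.netflix.zuul.ZuulFilter;
import com.netflix.zuul.context.RequestContext;

import java.net.URL;
import java.util.concurrent.ThreadLocalRandom;

// Diverts a small, random slice of requests to the canary fleet; everything
// else keeps going to the baseline cluster.
public class CanaryRoutingFilter extends ZuulFilter {

    private static final double CANARY_FRACTION = 0.01; // 1% of traffic

    @Override
    public String filterType() {
        return "route";
    }

    @Override
    public int filterOrder() {
        return 10;
    }

    @Override
    public boolean shouldFilter() {
        return ThreadLocalRandom.current().nextDouble() < CANARY_FRACTION;
    }

    @Override
    public Object run() {
        try {
            // Hypothetical host for the canary cluster.
            RequestContext.getCurrentContext()
                    .setRouteHost(new URL("http://canary.myservice.example.com"));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return null;
    }
}
```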
“Squeeze” testing – every deploy is burned as an AMI; they push it to the point of performance degradation to learn the rps/load it can take. Most recently, they added…
“Coalmine” testing – finding the “unknown unknowns” – an environment on current code capturing all network traffic, especially traffic crossing bulkheads. So e.g. a new feature that’s feature-flagged (and therefore not caught in the canary) suddenly starts getting traffic, and coalmine catches it.
So when a problem happens – the failure is isolated by bulkheads and the cluster adapts by flipping its circuit breaker.
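That per-instance adaptation is state you can inspect through Hystrix’s public lookups – a small sketch (the command name is whatever you registered above):

```java
import com.netflix.hystrix.HystrixCircuitBreaker;
import com.netflix.hystrix.HystrixCommandKey;
import com.netflix.hystrix.HystrixCommandMetrics;

// Prints the rolling error percentage that feeds the circuit breaker and
// whether the breaker for a given command is currently open on this instance.
public class BreakerStatus {

    public static void report(String commandName) {
        HystrixCommandKey key = HystrixCommandKey.Factory.asKey(commandName);

        HystrixCommandMetrics metrics = HystrixCommandMetrics.getInstance(key);
        HystrixCircuitBreaker breaker = HystrixCircuitBreaker.Factory.getInstance(key);

        // Both are null until the command has executed at least once on this instance.
        if (metrics != null && breaker != null) {
            System.out.println(commandName
                    + " error% = " + metrics.getHealthCounts().getErrorPercentage()
                    + ", circuit open = " + breaker.isOpen());
        }
    }
}
```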
Distributed systems are complex. Isolate relationships between them.
Auditing and operations are essential.