Don’t forget, this Friday 7/25 is the annual celebration of System Administrator Appreciation Day. Start dropping hints to your coworkers about your treat of choice now! “DevOps means you have to care!(tm)”
Monthly Archives: July 2014
Well, it was my first Velocity as a vendor! (I’ve been to every one, 2008 to present; you can read the previous reports here on the blog.) So that was different, and I split time between working the Copperegg booth and going to sessions. As a result I’m not going to do the extensive session-by-session notes I’ve done in the past. Two other Agile Admins, James and Karthik, were there; I’m hoping they do some writeups of the sessions they attended too!
Being a vendor was interesting; though standing at the booth made my dogs bark after the day was over, it was great to be able to talk to so many people. There were a lot of monitoring providers at the show (Copperegg (us), Compuware, New Relic, Datadog, many more). Pingdom was right across from us, with a slate of guys shipped in from Sweden, but they were generally grumpy – jet lag or their recent acquisition, perhaps. A new log management SaaS provider was there, logentries.com, and that was interesting – Sumo is the only real one in the space since Loggly and SplunkStorm borked it up and they’ve been getting a little… “Enterprise-y?” By that I mean having sales reps call you 5x/day and wanting near-Splunk prices. So yay to the newcomers, competition is always good. Other than that, it was mostly the same slate of Velocity-vendors as usual.
Well, let’s get it out of the way – there wasn’t all that much new this year. Karthik complained to me that “last year, Velocity was my favorite conference ever, and this year I didn’t get much out of it.” Not every year hosts a bunch of new techniques, sadly, but I thought there were some gems in there. Here are the four major new trends taking up speech-space:
Docker docker docker containers containers containers. Learn it now because in a year everything will be in containers – no, seriously. It’s the largest splash in computing since AWS. The hype is a little overexcited at times, but there’s a lot of new development going on here. On the one hand, not everyone needs new-box spinup in 5s instead of 5m, and the efficiency gains come at some cost in security. But to be blunt, people stopped well short of exercising the elasticity and ephemerality of cloud/virtualization solutions, instead going for the more comfortable “let’s deploy a three tier app manually like we did back in the day, but in the cloud.” So containers will be a disruption that pushes forward the concept of dynamic service orchestration etc., which is good.
There is starting to be buzz around Internet of Things. Mark Burgess (CFEngine, author of “In Search Of Certainty”) did a presentation on IoT and a more distributed model of monitoring and computation. Worth looking at, and it’s becoming more a part of mainstream computing (“engineering” tech and “IT” tech split off from each other 15 years ago for whatever reason and are just now joining forces again). Since we Agile Admins all had worked at National Instruments and had tried to get them onto the IoT bandwagon like 5 years ago, we grumped among each other about this.
There’s also strong interest in software defined networking (OpenDaylight, Cumulus). John Willis (@botchagalupe) waxed poetic on the topic and it fit into the general push towards making everything programmable.
There was strong and sustained interest (presentations, etc.) in STEM education and specifically in women in tech and getting more women into tech.
Video of the keynotes should be publicly available so you can watch them.
Jeff Dean of Google did a very interesting talk on making large scale services low latency that I recommend everyone view (video is at the link). Shared environments increase utilization but also congestion, exacerbated by large fanout systems – if only 1% of calls to any one service take a second or more, and you have to touch 100 services to finish your call, then 63% of calls take more than a second (1 − 0.99^100 ≈ 0.63). Traditional latency reduction uses techniques like differentiated service classes, breaking up large requests, and managing background activity (rate limit it, or wait till load is low). Tolerating faults is a lot like tolerating variability – extra resources make your system reliable, so do the same for variability, but on a much shorter timeframe. There are two ways to do that…
- Cross Request Adaptation – examine recent behavior and make changes (load balance, scale). This works on a slower timescale, so it tends to make the “next call” faster rather than the current one. Fine grained dynamic partitioning: a plain one-partition-per-machine scheme relies on equal sizes and constant load, but if you break the work up into 10-100 pieces per machine you can shed load more effectively. Selective replication: in a query system, for example, they make more copies of important docs. Latency-induced probation: have your load balancer take a slow box out of rotation, offload to other boxes, keep sending it a shadow stream of requests, and return it to service when it’s better.
- Within Request Adaptation – make the call faster within the single call! Basically this is a series of refinements on “send the request two places.” First he modeled sending the request again to another server if it didn’t return in an expected amount of time. You can get cuter, like by always sending to two destinations and having the one that starts working on it give a sideways “I’ve got it” to the other. His mathematical analysis says that you can cut latency dramatically for a very small increase in load, and not only that, but the response of a loaded cluster and an idle cluster become very similar (less dramatic spiking under load).
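The simplest refinement described above, re-sending a request to a second server when the first has not answered within an expected time, can be sketched in a few lines. This is only an illustrative sketch, not code from the talk: `hedged_call`, `fake_request`, the replica names, and the delay values are all hypothetical, and a real client would also need error handling and the “I’ve got it” cross-server cancellation he described.

```python
import asyncio


async def hedged_call(make_request, replicas, hedge_after):
    """Ask replicas[0]; if it is silent for hedge_after seconds, also ask replicas[1].

    Returns whichever answer arrives first and cancels the slower request.
    """
    primary = asyncio.create_task(make_request(replicas[0]))
    done, _ = await asyncio.wait({primary}, timeout=hedge_after)
    if done:
        # The common case: the first replica answered in time, no extra load.
        return primary.result()

    # The first replica is slow: send a backup request to a second replica.
    backup = asyncio.create_task(make_request(replicas[1]))
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop wasting work on the loser
    return done.pop().result()


async def fake_request(replica):
    # Hypothetical stand-in for a network call: the "slow" replica takes
    # 0.5 s, every other replica takes 0.01 s.
    await asyncio.sleep(0.5 if replica == "slow" else 0.01)
    return f"answer from {replica}"


def demo():
    # The primary is slow, so the backup is sent after 0.05 s and wins.
    return asyncio.run(hedged_call(fake_request, ["slow", "fast"], hedge_after=0.05))


if __name__ == "__main__":
    print(demo())
```

Because the backup is only sent when the primary is already late, the extra load stays small as long as `hedge_after` sits out in the tail of the normal latency distribution.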
And I did one! Just a 5 minute spot since Copperegg was a platinum sponsor; I talked about applying a Lean approach to implementing monitoring. It was called A 5 Minute Checklist For Application Monitoring and slides/video are at the link. I also wrote a white paper to expand on it that’s available for download here.
I went to a number of sessions that I enjoyed; here’s a quick breakdown of the ones I thought were winners. I’ll try to find slides and link them where they exist. O’Reilly charges for the videos though.
Vladimir Vuksan’s workshop on Ganglia. People like the gathering of mass metrics. They did rake him over the coals a bit on the 15s time resolution and the relatively primitive RRDTool graphs. He had some interesting bits like a “check that a value is the same everywhere” alert for consistency. He also summed up “why we monitor” well – MTTD, MTTR, trending, learning.
Theo Schlossnagle’s presentation on Understanding Slowness. He recommended a system map as step 1 – high level box and line but low level with all versions, locations, and service connections. He also talked about going to histograms but less sophisticated users find those hard to understand, so displaying quantiles can be a happy medium. He sees three different tool spaces: observational, synthetic, and manipulation.
There was a good presentation by Dan Slimmon (video of same talk from Monitorama) on the math around false alarms, using the “sensitivity” and “specificity” terms from medicine. Here’s a quick reference on those and how you calculate a positive predictive value. Undetected outages are embarrassing, so the response is to tighten the monitoring thresholds, but this just generates more false alerts, aka “pagerrhea.” This segued into a discussion of using better means to detect deviation – hysteresis, moving thresholds like Holt-Winters, cross-correlation of metrics, Fourier transforms. You should alert on whether work is getting done: not on CPU or swap, but on HTTP response time and requests per second. He wants “something like nagios but that separates detection from diagnosis.”
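To make the positive predictive value math concrete, here is a small sketch using the standard formula from medical testing. The function name and the example numbers are illustrative assumptions of mine, not figures from the talk.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability that a firing alert corresponds to a real problem.

    sensitivity: P(alert fires | something is really wrong)
    specificity: P(alert stays quiet | everything is fine)
    prevalence:  fraction of checks during which something is really wrong
    """
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


# Assumed example: a check that is 99% sensitive and 99% specific, watching
# a service that is actually broken during 0.1% of checks.
ppv = positive_predictive_value(sensitivity=0.99, specificity=0.99, prevalence=0.001)
print(f"{ppv:.1%} of alerts are real")  # roughly 9%
```

Even a check that sounds very accurate mostly pages you for nothing when real outages are rare, which is why tightening thresholds alone tends to make the pager noisier rather than more useful.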
I also really appreciated the LinkedIn talk on technical debt. They admitted that several years ago, they were trying to keep up in the social world and just ground to a halt because of accumulated technical debt. They had to stop and take a bunch of time to fix it before they could move forward. Important takeaways included:
- Technical debt comes small decision by small decision
- Don’t wait for version n+1, fix it now
- “One in a million” problems happen a lot at web scale
- Cancerous workarounds are no good
- Broken window syndrome – if things are broken, people will tend to leave things broken
- Zombie tech will eat you
- Use our cool rest.li REST framework!
- Employee engagement drains KPIs
- Strategies – recognize debt choices and decisions
- Use new eyes – consultants, interns – to identify the “bad parts”
- Measure the right things
- Technical debt you can see is only the tip of the iceberg
- Make active decisions otherwise in Soviet Russia, Decision Makes You! (well, I added that last part)
The last really good one was about confirmation bias and monitoring. When dealing with metrics there are a lot of cognitive illusions – the anchoring effect (whatever it was recently before it deviated must have been right), the validity effect (a couple people told me that so it must be true), illusory correlation (looks like those happened around the same time), attitude polarization (round up the usual suspects). The way to combat this is with analysis. Rethink your data flow, validate your stats. Use anomaly detection tools like the open source Skyline and Oculus to really detect correlations and deviations.
Though there weren’t as many breakthroughs this year, I appreciated the incremental uptick in wisdom about how to use what we have!
Much of the benefit of conferences isn’t the sessions, it’s the great people you meet and share experiences with. Once you’ve been a couple years, you get to see old friends. Sadly none of our compatriots from Agile Admin alumni companies were there (National Instruments, Bazaarvoice, PowerReviews), but we did get to see most of the “usual suspects” we see at these shows – we had the usual “hang out at the Hyatt bar fiesta” with Andrew Shafer, John Willis, Ben Rockwood, Cameron Haight and Jonah Kowall from Gartner, Gene Kim, and many more. Notable in his absence was Patrick Debois, who remained in Belgium; we all missed him.
If you went to Velocity this year, chime in below (especially if we met you there!).
@bridgetkromhout spoke on “how I learned to stop worrying and love devops” and @benzobot spoke on onboarding and mentoring apprentices. DoD SV certainly made a strong effort to get more female speakers this year! We tried in Austin (I personally wrote like every local techie woman group I could find) but we only had like one.
Then there were two super bad ass presentations back to back. I can’t find the slides online yet.
The Future of Configuration Management
Mark Burgess (@markburgess_osl), aka “The CFEngine Guy” and noted Promise Theory advocate, spoke. Chef and Puppet had eclipsed CFEngine for a while but it turns out as the Internet of Things and containers and stuff are arriving that maybe many of his design decisions were actually prescient and not retro. Here it is broken down into wise sayings.
- Why do we not have CAD for IT systems?
- Orchestration is not bricklaying.
- We need the equivalent of style sheets for servers.
- We are entering a world of decentralized smart infrastructure.
- Scale, complexity, and knowledge increase as our desire for flexibility increases.
- Separation of concerns adds complexity and fragility.
- To handle complexity – atomize and untether.
- 3D printed datacenters are coming.
DevOps as Relationship Management
James Urquhart (@jamesurquhart) spoke about the interconnectedness of our systems. The SEC, post flash crash, added circuit breakers, defined rollback protocols, inserted agents into the flow of the stock exchange trading systems to prevent uncontrolled cascading.
One simple rule – visualize the whole system (monitor your outside relationships) but take action at the agent level. “How are you doing today?” “Good.” Monitoring is going well, new approaches in the space look at policies and interactions and performance and business metrics – but need to differentiate reductionist vs expansionist approaches.
Michael Nygard’s book Release It! is full of great patterns, and Netflix’s open source Hystrix is an example of the kind of relational system safeguards you can build off it.
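For readers who haven’t met the circuit breaker idea mentioned above, here is a minimal sketch of the pattern. It is a hypothetical illustration of the general technique, not Hystrix’s API (Hystrix is a Java library with many more features); the class name, parameters, and defaults are all assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call a dependency that looks down."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before tripping
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so it can be tested
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: fail fast instead of piling load on a sick dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Each caller wraps its own outbound dependency in a breaker, which matches the “take action at the agent level” rule: no central controller has to notice the cascade before individual agents stop feeding it.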
- Tips for Introverts (at Conventions) by Tom Duffield – They include find a role, don’t fear failure, attend preconference activities, go to lunch early and sit, engage, share interests, find a comfortable setting, take time to recharge. As someone initially introverted myself (no one believes that now) I like that this has actual tips to get past it; in some circles “introversion” has become the new “Asperger’s” as a blanket excuse for not wanting to bother to relate to people.
- Mike Place on scalable container management – Google’s Kubernetes is an example. Don’t just provision your systems, you need to manage them too. Images came and went and have come back now, but you also can’t ignore what’s onboard the image. It’s time to join image and config management.
This was really good and the world should listen. On the one hand, conducting CM operations on 1000 servers in parallel is contributing unnecessarily to the heat death of the universe. On the other hand, you need to build those images in a non-manual way in the first place! And too many systems worry about the configuration but not the runtime operation. Amen brother!
- Finally (well, there were two more, but I didn’t care for them so took no notes), John Willis (@botchagalupe) did [Darwin to] Deming to DevOps, a burst-fire reading list of nondeterminism tracing from Darwin through various scientists to the Deming/TPS stuff through into the DevOps world with Gene Kim and Patrick Debois. It was pimp. Here it is when he gave it at another venue:
Here are some big themes from the week.
- Deterministic, reductionist, and centralized are for suckers.
- Complexity is the enemy. Systems thinking is necessary.
- We love continuous deployment. But DevOps is not just about delivering code to production.
- Women exist in DevOps and are cool. More would be great.
- Most vendors have figured out to just relax and talk to techies in a way they might listen to. Some haven’t.
It was a great event, kudos to Marius and the other organizers who put in a lot of work to wrangle 500 people, nearly 30 sponsors, food, venue, and the like. If you haven’t been to a DevOpsDays, look around, there may be one near you! I help organize DevOpsDays Austin (just had our third annual) and there are more coming this year from Tel Aviv to Minneapolis.
If you went to DoD SV, feel free to comment below with your thoughts (linking any posts you’ve made, slides, etc. is welcome too)!