Author Archives: Ernest Mueller

Ernest Mueller's avatar

About Ernest Mueller

Ernest is the VP of Engineering at the cloud and DevOps consulting firm Nextira in Austin, TX. More...

Velocity 2013 Day 1 Liveblog – Bringing the Noise

Next up it’s the Etsy Crew!  A great bunch of guys.  Rembetsy is cutely nervous and proud about his guys presenting. Slides are available here!

And the topic is Bring the Noise: Making Effective Use of a Quarter Million Metrics by @abestanway and @jonlives. Anomaly detection is hard…

At Etsy we want to deploy lots – we have 250 committers, everyone has to deploy code, coder or not. Big “deploy to production” button. 30 deploys/day.

How can we control that kind of pace? Instead of fearing error, we put in the means to detect and recover quickly.

They use ganglia, graphite, and nagios – and they wrote statsd, supergrep, skyline, and oculus as well.

First line of defense – node daemon tailing log files and looking for errors using supergrep.

But not everything throws errors. 😦

So they use statsd to collect zillions of metrics and put them onto dashboards. But dashboards are manually curated “what’s important” – and if you have .25M metrics you just can’t do that.  So the dashboard approach has fallen over here.  And if no one’s watching the graph, why do you have it?

So that’s why Satan invented Nagios, to alert when you can’t look at a graph, but again it breaks down at scale.

Basically you have unknown anomalies and unknown correlations.

They have “kale,” their monitoring stack to try to solve this – skyline solves anomaly detection and oculus solves metrics correlation.

Skyline

A realtime anomaly detection system (where realtime means ~90s). They have a 10s flush on statsd and a 1 min res on ganglia so that’s still fast.

They had to do this in memory and not disk, using Redis. But how to stream them all in?  They looked around and realized that the carbon-relay on graphite could be used to fork into it by pretending it’s another backup graphite destination.

They import from ganglia too via graphite reading its RRDs. Skyline also has other listeners.

To store time-series data in Redis, minimizing I/O and memory… redis.append() is constant time.

Tried to store in JSON but that was slow (half the CPU time was decoding JSON).

Found Messagepack, a binary-based serialization protocol. Much faster.

So they keep appending, but had to have a process go through and clean up old data past the defined duration. Hence “roomba.py.” Python because of all the good stats libraries. They just keep 24 hours of operational data.

But so what is an anomaly and how do you detect it?

Skyline uses the consensus model. [Ed. This is a common way of distinguishing sensor faults from process faults in real-world engineering.]

Using statistical process control – a metric is anomanouls if its latest datapoint is over three standard deviations above its moving average.

Use “Grubb’s test” and “ordinary least squares”… OK, most of the crowd is lost now. Histogram binning.

Problems – seasonality, spike influence (big spike biases average masking smaller spikes), normality (stddev is for normal distributions, and most data isn’t normal), and parameters. They are trying to further their algorithms.

OK, how about correlations?

Oculus does this.  Can we just compare the graphs? Image comparison is expensive and slow. Numerical comparison is not a hard problem.

“Euclidean Distance” is the most basic comparison of two time series. Dynamic Time Warping helps with phase shifts from time. But that’s expensive – O(n^2).

So how can we discard “obviously” dissimilar data?  Use a shape description alphabet – “basically flat, sharp increment,” etc.  Apply to graphs, cluster using elasticsearch, run dynamic time algorithm on that smaller sample size to polish it. But that’s still slow.  Luckily there’s a fast DTW variant that’s O(n).

So they do an elastic search phrase query with a high slop against the shape fingerprints.
Populate elastic search from redis using resque workers, but it makes it slow to update and search. Solved with rotating pool of elastic search servers – new index/last index. Allows you to purge the index and reindex. They cron-rotate every 2 min. Takes 25s to import, but queries take a while and you don’t want to rotate out from under it.
Sinatra frontend to query ES and render results off the live ES index.

Save collections of interesting correlations and then index those, so that later searches match against current data but also old fingerprints.

Devops is the key to us being able to do this. Abe the dev and Jon the ops guy managed to work all this out in a pretty timely manner.

Demo: Draw your query! He schetched a waveform and it found matching metrics -nice.

Leave a comment

Filed under Conferences, DevOps

Velocity 2013 Day 1 Liveblog – Monitoring and Observability

I’m in San Jose, California for this year’s Velocity Conference! James, Karthik and I flew in on the same flight last night.  I gave them a ride in my sweet rental minivan – a quick In-n-Out run, then to the hotel where we ended up drinking and chatting with Gene Kim, James Turnbull, Marcus, Rembetsy, and some other Etsyers, and even someone from our client Nordstrom’s.

Check out our coverage of previous Velocity events – Peco and I have been to every single one.

I always take notes but then don’t have time to go back and clean them up and post them all – so this time I’m just going to liveblog and you get what you get!

Theo Schlossnagle of OmniTI, getting back to his roots by rocking a psycho hillbilly hairstyle, kicked off the first workshop of the day on Monitoring and Observability. The slides are on Slideshare.

Theo Schlossnagle

The talk starts with a bunch of basic term definitions.

  • Observability is about measuring “things” or state changes and not alter things too bad while observing them.
  • A measurement is a single value from a point in time that you can perform operations upon.

“JSON makes all this worse, being the worst encoding format ever.” JSON lets you describe for example arbitrarily large numbers but the implementations that read/write it are inconsistent.

  • A metric is the thing you are measuring.  Version, cost, # executed, # bugs, whatever.

Basic engineering rule – Never store the “rate” of something.  Collect a measurement/timestamp for a given metric and calculate a rate over time.  Direct measurement of rates generates data loss and ignorance.

  • Measurement velocity is the rate at which new measurements are taken
  • Perspective is from where you’re taking the measurement
  • Trending means understanding the direction/pattern of your measurements on a metric
  • Alerting, durr
  • Anomaly detection is determining that a specific measurement is not within reason

All this is monitoring.  Management is different, we’re just going to talk about observation. Most people suck at monitoring and monitor the wrong things and miss the real important things.

Prefer high level telemetry of business KPIs, team KPIs, staff KPIs… Not to say don’t measure the CPU, but it’s more important to measure “was I at work on time?” not “what’s my engine tach?” That’s not “someone else’s job.”

He wrote reconnoiter (open source) and runs Circonus (service) to try to fix deficiencies.

“Push vs pull” is a dumb question, both have their uses. [Ed. In monitoring, most “X vs Y” debates are stupid because both give you different valid intel.]

Why pull?

  • Synthesized obervations desirable (e.g. “URL monitor”)
  • Observable activity infrequent
  • Alterations in observation/frequency are useful

Why push?

  • Direct observation is desirable
  • Discrete observed actions are useful (e.g. real user monitoring)
  • Discrete observed actions are frequent

“Polling doesn’t scale” – false. This is the age where Google scrapes every Web site in the world, you can poll 10,000 servers from a small VM just fine.

So many protocols to use…

  • SNMP can push (trap) and pull (query)
  • collectd v5/v5 push only
  • statsd push only
  • JMX, etc etc etc.

Do it RESTy. Use JSON now.  XML is better but now people stop listening to you when you say “XML” – they may be dolts but I got tired of swimming upstream. PUT/POST for push and GET for pull.

nad – Node Agent Daemon, new open source widget Theo wrote, use this if you’re trying to escape from the SNMP helhole.  Runs scripts, hands back in JSON. Can push or pull. Does SSL. Tiny.

But that’s not methodology, it’s technology. Just wanted to get “but how?” out of the way. The more interesting question is “what should I be monitoring?”  You should ask yourself this before, during, and after implementing your software. If you could only monitor one thing, what would it be?  Hint: probably not CPU. Sure, “monitor all the things” but you need to understand what your company does and what you really need to watch.

So let’s take an example of an ecomm site.  You could monitor if customers can buy stuff from your site (probably synthetic) or if they are buying stuff from your site (probably RUM). No one right answer, has to do with velocity.  1 sale/day for $600k per order – synthetic, want to know capability. 10 sales/minute with smooth trends – RUM, want to know velocity.

We have this whole new field of “data science” because most of us don’t do math well.

Tenet: Always synthesize, and additionally observe real data when possible.

Synthesizing a GET with curl gets you all kinds of stuff – code, timings (first byte, full…), SSL info, etc.

You can curl but you could also use a browser – so try phantomjs. It’s more representative, you see things that block users that curl doesn’t interpret.

Demo of nad to phantomjs running a local check with start and end of load timings.

Passive… Google Analytics, Omniture.  Statsd and Metrics are a mediocre approach here. But if you have lots of observable data, e.g. the average of N over the last X time is not useful. NO RATES I TOLD YOU DON’T MAKE ME STICK YOU! At least add stddev, cardinality, min/max/95th/99th… But these things don’t follow standard distributions so e.g. stddev is deceptive.  If you take 60k API hits and boil it down to 8 metrics you lose a lot.

How do you get more richness out of that data? We use statsd to store all the data and shows histograms. Oh look, it’s a 3-mode distribution, who knew.

A heat map of histograms doesn’t take any more space than a line graph of averages and is a billion times more useful.  Can use some tools, or build in R.

Now we’ll talk about dtrace… Stop having to “wonder” if X is true about your software in production right now. “Is that queue backed up? Is my btree imbalanced?” Instrument your software. It’s easy with DTrace but only a bit more work otherwise.

Use case – they wrote a db called Sauna that’s a metrics db. They can just hit and get a big JSON telemetry exposure with all the current info, rollups, etc.

Monitoring everything is good but make sure you get the good stuff first and then don’t alert on things without specific remediation requirements.

Collect once and then split streams – if you collect and alert in Zabbix but graph in graphite it’s just confusing and crappy.

Tenet: Never make an alert without a failure condition in plain English, the business impact of the failure condition, a concise and repeatable remediation procedure, and an escalation path. That doesn’t have to all be “in the alert” but linking to a wiki or whatever is good.

How to get there? Do alerting postmortems. Understand why it alerted, what was done to fix, bring in stakeholders, have the stakeholder speak to the business impact. [Ed. We have super awful alerting right now and this is a good playbook to get started!]

Q: How do you handle alerts/oncall?  Well, the person oncall is on call during the day too, so they handle 24×7. [Ed. We do that too…]

Q: How does your monitoring system identify the root cause of an issue?  That’s BS, it can’t without AI.  Human mind is required for causation.  A monitoring system can show you highly correlated behavior to guide that determination. Statistical data around a window.

Q: How to set thresholds?  We use lots. Some stock, some Holt-Winters, starting into some Markov… Human train on which algorithms are “less crappy.”

Q: Metrics db? We use a commercial one called Snowth that is cool, but others use cassandra successfully.

Q: How much system performance compromise is OK to get the data? I hate sampling because you lose stuff, and dropping 12 bytes into UDP never hurt anyone… Log to the network, transmit everything, then decide later how to store/sample.

Don’t forget to check out his conference, SURGE.

2 Comments

Filed under Conferences, DevOps

Welcome to Velocity 2013!

All the agile admins – James (@wickett), Peco (@bproverb), and Ernest (@ernestmueller) are reunited at the Web performance and operations conference Velocity again this year!  And with us we have newer cool guys, dev extraordinaire Karthik (@iteration1) from Mentor Graphics and operations muscleman Bryan (@bryguy1211) from Bazaarvoice, and some of our old NI colleagues, Eric and Matt.  Then some of us are staying over for DevOpsDays Mountain View.

So buckle in and experience one of the handiest Web/DevOps conferences by proxy! We’re encouraging everyone to liveblog along here on the agile admin.  I always take notes but always run out of time to prettify and post them after so I’m trying liveblogging in hopes of staying caught up. Comment if you are getting value out of it to encourage us to keep it up!

I hear the Velocity marketing stuff is using one of my quotes from the blog, which is cool; it credits the old defunct webadminblog.com, but we’ve moved here now!

Leave a comment

Filed under Conferences, DevOps

DevOpsDays Austin 2013 Is Upon Us

Well, I hope you got a ticket for next week’s event because we’re sold out, sponsor slots sold out, all filled up with volunteers, the train has left the station.  It’s going to be a sweet ride.  Check out the program – Patrick Debois, John Willis, Gene Kim, Nick Galbreath, and many more will be speaking.

We have a sweet venue, the Marchesa; and on the first evening, Tuesday, you should free up your night.  We’ve got a happy hour from Dell, the band Lord Buffalo is playing, and then the Austin Film Society will be doing a private screening of Office Space for us! Then at 10 if you’re still rarin’ to go we can hook you up. We’re providing breakfast and lunch both days; expect breakfast tacos, barbecue, all the Texas standbys.

Of the agile admins, James and I have been working (along with the many other great volunteers who put many hours into putting together the event) to make it a fun and informative time for everyone who’s signed up!  Come early, leave late!

Leave a comment

Filed under Conferences, DevOps

Must Read: The Phoenix Project

The Phoenix ProjectHave you read the famous systems-management novel The Goal?  No, I know you haven’t, don’t feel bad, I only got to it this year myself.

Well, Gene Kim, entrepreneur, consultant, founder of Tripwire, and general insatiable Tweeter, has written a sequel of sorts his Visible Ops coauthors Kevin Behr and George Spafford.  Bearing the tongue-twisting title The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win, it’s like The Goal in that it’s written as a novel about the people in a large IT shop and how they’re faced will all the usual soul-crushing BS that we all get faced with, but use Lean and DevOps and pluck and courage to overcome it.

I got to be a pre-reader on large chunks of the book and I like it; it definitely has characters and situations directly torn from your IT department. (Man, the security guy… Just like every security guy…) And I’ve seen these techniques work, so I know it’s not just a wish-fulfillment novel.

If you’re wondering how DevOps can help you because “you live in the real world, man,” this is a good read that’ll give you some ideas along those lines! I’ve seen Gene speak at everywhere from AppSec USA to South by Southwest Interactive to DevOpsDays to Velocity… Go read the testimonials from everyone from Cockroft to Humble to even yours truly and then buy the book!

3 Comments

Filed under DevOps, Security

Awesome Austin Events!

Besides the “big one,” DevOpsDays Austin 2013, there’s a bunch of great events going on in Austin for techies.

The Agile Austin DevOps SIG meets every last Wednesday over lunch at Bazaarvoice; lunch is provided.  This month’s meeting on January 30 is Breaking the Barriers.

The Austin Cloud User Group meets every third Tuesday in the evening at Pervasive; dinner is provided. This month’s meeting is January 15, sponsored by Canonical, and there is a talk on Openstack Quantum, the network virtualization platform.

South by Southwest Interactive is here of course on March 8-12.

Hip security event BSides Austin is on March 21-22.

Data Day Austin is on January 29.

Texas Linux Fest will be May 31 – June 1.

It’s never been a better time to be a techie in Austin!

3 Comments

Filed under Cloud, Conferences, DevOps

DevOpsDays Austin 2013 Is Coming!

devopsdaysaustinIt’s been quiet here on the blog but we’ve been busy… And one of those items of busy-ness is setting up DevOpsDays Austin 2013!

Many of you came to DevOpsDays Austin 2012, the largest super-awesome Austin DevOps event ever!  Well, it’s even bigger this year.  Registration is open for DevOpsDays Austin 2013! Sign up quick to be assured a spot.

It’ll be April 30th and March 1st at the Marchesa in the middle of Austin. Because we had to pay for a venue this time to let more people attend, and to prevent no-shows from taking up the limited slots, we’ve instituted a $120 early bird fee for the event, which covers the food/venue/shirts for both days.

Proposals are open too, as are opportunities for companies to sponsor – the more sponsorships, the more cool activities and goodies for everyone involved!

I had a blast at DevOpsDays Austin in 2012 and this year stands to be huge.  Come on out and share expert tips with other elite DevOps practitioners from around the world! Patrick Debois and a growing list of other respected DevOps ninjas will be in attendance.

Email organizers-austin-2013@devopsdays.org with questions!

Leave a comment

Filed under Conferences, DevOps

AppSec USA 2012 Is Here (in Austin)!

AppSec USA 2012, the big OWASP security convention, is here in Austin this year!  And the agile admin’s own @wickett is coordinating it.

“Why do I care if I’m not a security wonk,” you ask? Well, guess what, the security world is waking up and smelling the coffee – this isn’t like security conventions were even just a couple years ago.  There’s a Cloud track and a Rugged DevOps track.

We have like 20 people going from Bazaarvoice. It’s two days, Thursday and Friday (yes, tomorrow – I don’t know why James didn’t post this earlier, sorry) and just $500. So it’s cheap and low impact.

And who’s speaking?  Well, how about Douglas Crockford, inventor of JSON?  And Gene Kim, author of Visible Ops?  That’s not the usual infosec crowd, is it?  Also Michael Howard from Microsoft, Josh Corman from Akamai, a trio of Twitter engineers, Nick Galbreath (formerly of Etsy), Jason Chan from Netflix, Brendan Eich from Mozilla… This is a star-studded techie event that you want to be at!

I’ll be there and will report in…

1 Comment

Filed under Conferences, DevOps, Security

Speeding Up Releases

Hi all!  My new job’s been affording me few opportunities for blogging, but I’m getting into the groove, so you should see more of me now.

Releasing All The Time!

Continuous integration is the bomb.  We can all generally agree on that.  But my life has become one of halfway steps that I think will be familiar to many of you, and I don’t believe in hiding the real world that’s not all case study perfect out there.  So rather than give you the standard theory-list of “what you should do for nice futuristic DevOps releases,” let me tell you of our march from a 10 week to 2 week to 1 week release tempo at Bazaarvoice.

Biweekly Releases!

I started with BV at the start of February of this year. They said, “Our new release manager!  We’ve been waiting for you!  We adopted agile and then tried to move from our big-bang 10 week release cycle to 2 weeks and it blew up like you wouldn’t believe.  Get us to two week releases.  You’ve got a month. Go!”  The product management team really needed us to be able to roll out features more quickly, do piloting and A/B testing, and generally be way more agile in delivery to the customer and not just in dev-land.

Background – our primary application is for the collection and display of user generated content – for example, ratings and reviews – and a lot of the biggest Internet retailers use our solution for that purpose. The codebase started seven years ago and grew monolithically for much of that time . (“The monolith” was the semi-affectionate code name for the stack when I started, as in “is your app’s code on the monolith?”) The app is running across multiple physical and cloud based datacenters and pushing out billions of hits a day, so there’s a low tolerance window for errors – our end user facing display apps have to have zero downtime for releases, though we can do up to two hours of downtime in a 3-5 AM window for customer administrative systems. Stack is Java, Linux, mySQL, Solr, et al. Extremely complex, just like any app added on to for years.

There had been a SWAT team formed after the semi-disastrous 2 week release that identified the main problems.  Just like everywhere else, the main impediments were:

  • Lack of automation in testing
  • Poor SCM code discipline

Our CTO was very invested in solving the problem, so he supported the solution to #1 – the QA team hired up and got some automation folks in, and the product teams were told they had to stop feature development and do several sprints of writing and automating tests until they could sustain the biweekly cadence.

The solution to #2 had two parts.  One was a feature flagging system so we could launch code “dark.” We had a crack team of devs crank this one out. I won’t belabor it because Facebook etc. love to do DevOps presentations on the approach and its benefits, but it’s true.  Now we release all code dark first, and can enable it for certain clients or other segments.

Two was process – a new branching process where a single release branch comes off trunk every two weeks several days before release, and changes aren’t allowed to it except to fix issues found in QA, and those are approved and labeled into discrete release candidates. The dev environment gets trunk twice a day, the QA environment gets branch every time a new release candidate is labeled. Full product CIT must be passing to get a release candidate. As always, process steps like this sound like common sense but when you need 100 developers in 10 teams to uptake them immediately, the little issues come out and play.

There were a couple issues we couldn’t fix in the time allotted.  One was that our Solr indexes are Godawful huge.  Like 20 GB huge.  JVM GC tuning is a popular hobby with us. To make changes, reindex, and distribute the indexes in time to perform a zero-downtime deployment, with replication lag nipping at our heels, was a bigger deal.  The other was that our build and deploy pipeline was pretty bad.  All the keywords you want to hear are there – Puppet, TeamCity, Rundeck, svn, noah, maven/Nexus, yum…  But they are inconsistently implemented, embedded in a huge crufty bash script framework and parts have gone largely untended.

The timeframe was extremely aggressive.  I project managed the hell out of it and all the teams were very on board and helpful, and management was very supportive.  I actually got a slight delay, and was grateful for it, because our IPO date came up on the same date when we were supposed to start biweekly releases, and even the extremely ambitious were taken aback by the risk of cocking up the service on that day. We did our first biweekly release on March 6th and then every two weeks thereafter.  We had a couple rough patches, but they were good learning experiences.

For example, as our first biweekly release day approached, tests just weren’t passing. I brought all the dev managers to the go/no-go meeting (another new institution) and asked them, “are we go?” (The release manager role had been set up by upper management as more prescriptive, with the thought I’d be sitting there yelling at them “It’s no-go,” but that’s really not an effective long term strategy).  They all kinda shuffled, and hemmed, and hawed (a lot of pressure from internal stakeholders wanted this release to go out NOW), but then one manager said “No, we’re no go.  It’s just not safe.” Once she said that everyone else got over that initial taboo of saying “no go” and concurred that some of their areas were no go.  The release went out 5 calendar days late but a lot more smoothly than the last release did (44 major issues then, 5 this time).

The next release, though, was the real make-or-break.  On the one hand everyone had a first real pass through the process and so some of the “but I didn’t know I needed to have testing signoff by that day and time” breaking-in static was gone, but on the other hand they’d had 2 months between the previous two releases to test and plan, and this one allowed only two weeks.  It went off with no delay and only 1 issue.

Of course, we had deliberately sandbagged that a little because it coincided a with ‘test development only” sprint.  But anyone who thinks a complex release in a large scale environment will go smoothly just because you’re deploying code with no functional changes has clearly never been closer than a 10-foot pole to real world Web operations. As we ramped back up on feature development, the process was also becoming more ingrained and testing better, so it went well.

We had one release go bad in May, and when we looked at it we realized a lot of changes weren’t being sufficiently QA’ed.  So what we did was simply add a set of fields to all JIRA tickets for the team to specify who tested the change, and we wrote a script to parse our Subversion commit comments and label JIRA tickets with the appropriate release (trying to get people to actually fill out tickets correctly is pain and usually doomed to failure, so we made an end run with automation).  So then as a release came up, on a wiki page is a list of all the tickets in the release and who tested them and how (automatic, manual, did not test). We actually did this for two releases with paper printouts and physical signoffs to develop the process before we automated it.  This corrected the issue and we ran from then on with very low problem rates. As advertised, releasing fewer changes more frequently allows us to get both a higher throughput of changes and, paradoxically, higher quality with them.

Weekly Releases!

The process worked great through the summer. In the biweekly release communication and presentations, I had explained we’d be moving to weekly and then to continuous deployment as soon as we could make it happen. Well, the solr index distribution problem took a while – two reorgs kicked it around and it was an ambitious “use bittorrent to distribute the index to all the servers in our various DCs” pretty propellerhead kind of thing that had to happen. It took the summer to get that squared away. In the meantime I also conducted a project internally called “Neverland” to fix some of the most egregious technical debt in our TeamCity and Nexus setup and deployment scripts.

The real testament to the culture change that happened as part of the biweekly release project is that while that project was a “big deal” – I had stakeholders from all over the business, big all hands presentations, project plans out the yin-yang, the entire technical leadership team sweating the details – moving from biweekly to weekly releases was largely a non-event.

The QA team worked in the background leading up to it to push test automation levels up higher. Then we basically just said “Hey, you guys want to release faster don’t you?” “Well sure!” “OK. we’re going weekly in two weeks. Check out the updated process docs.” “All right.” And we did, starting the first release in September.  The Solr index got reindexed and redistributed (and man, it had been a while – it compacted down nicely) and deployment ran great. No change in error rate at all. We’ve been weekly since then, the only change  is when we don’t release during critical change freeze windows around Black Friday/Cyber Monday and other holiday prime times. We think our setup is robust enough that it’s safe to release even then, but, heck, no one’s perfect so it’s probably prudent to pause, and many of our clients are really adamant about holiday change freezes to us and to all their suppliers.

The one concern voiced by engineers about the overhead of the release process was addressed by automating it more and by educating.  For example, the go/no-go meeting was, at times, a little messy. Some of the other teams (especially ones not located in Austin) wouldn’t show up, or test signoffs wouldn’t be ready, and it would turn into delays and running around. The opportunity to do it more quickly actually helped a lot! Whereas the meeting had been 30 minutes if we were lucky when we started, now the meeting is taking 5 minutes, and only longer when someone screws around and doesn’t dial into the Webex on time.

“If it’s painful, do it more often” is a message that some folks still balk at when confronted with, but it is absolutely true.

Now, the path wasn’t easy and I was blessed with a very high caliber of people at Bazaarvoice – Dev, Ops, and QA. Everyone was always very focused on “how do we make this work” or “how do we improve this” with very little of the turf warring, blocking, and politics that I sadly have come to expect in a corporate environment. The mindset is very much “if we come up with a new way that’s better and we all agree on that, we will change to do that thing TOMORROW and not spend months dithering about it,” which is awesome and helped drive these changes through much faster than, honestly, I initially estimated it would take.

Releasing All The Time!

Continuous integration on “the monolith” was a distant myth initially, but now we’re seeing how we can get there and the benefits we’ll reap from doing so. Our main impediments remaining are:

1. CIT not passing.  We don’t have a rule where if CIT is failing checkins are blocked, mainly because there’s a bunch of old legacy tests that are flaky. This often results in release milestones being delayed because CIT isn’t passing and there’s 6 devs’ checkins in the last failing build. Step 1 is fix the flaky tests and step 2 is declare work stoppage when CIT is failing. The senior developers see the wisdom in this so I expect it to go down without much friction. Again, the culture is very much about ruthlessly adopting an innovation if the key players agree it will be beneficial.

2. Builds, CIT, and deployment are slow as molasses in January. Build 1 hour, CIT 40 minutes, deploy 3 hours. Why? Various legacy reasons that give me a headache when I have to listen to them. Basically “that’s how it is now, and complete rewrite is potentially beyond any one person’s ability and definitely would take multiple man-months.” We’re analyzing what to do here. We also have a “staging” environment customers use for integration, and so currently we have to deploy to dev, test, deploy to QA, test, deploy to staging (hitting the downtime window), test, deploy to production (hitting the downtime window), test. So basically 2 days minimum. However, staging is really production and step one is release them at the same time.  There’s a couple “but I can only test this kind of change in staging” items left that basically just require telling someone “Figure out how to test it in QA now.” Going to “always release trunk” will remove the whole branch deployment and separate dev and QA environments. So that’s 2 of 4 deployments removed, but then it’s a matter of figuring out cost vs benefit of smashing down parts of that 4:40. I have one proposal in front of me for chucking all the current deploy infrastructure for a Jenkins-driven one, I need to figure out if it is complete enough…

Am I Doing It Wrong?

Chime in in the comments below with questions or if there’s some way I could have cut the Gordian knot better.  I think we’ve moved about as fast as you can given a lot of legacy code and technical debt (and having a lot of other stuff people need to be working on to keep a service up and get out new functionality).   The three step process I used that works, as it does so often, was:

  1. Communicate a clear vision
  2. Drive execution relentlessly
  3. Keep metrics and continually improve

Thanks for reading, and happy releasing!

1 Comment

Filed under DevOps

Velocity 2012 Day Two

After John Allspaw and Steve Souders caper about in fake muscles, we get started with the keynotes.

Building for a Billion Users (Facebook)

Jay Parikh, Facebook, spoke about building for a billion users.  Several folks yesterday warned with phrases like “if you have a billion users and hundreds of engineers then this advice may be on point…”

Today in 30 minutes, Facebook did…

  • 10 TB log data into hadoop
  • Scan 105 TB in Hive
  • 6M photos
  • 160m newsfeed stories
  • 5bn realtime messages
  • 10bn profile pics
  • 108bn mysql queries
  • 3.8 tn cache ops

Principle 1: Focus on Impact

Day one code deploys. They open sourced fabricator for code review. 30 day coder boot camp – Ops does coder boot camp, then weeks of ops boot camp. Mentorship program.

Principle 2: Move Fast

  • Commits scale with people, but you have to be safe! Perflab does a performance test before every commit with log-replay. Also does checks for slow drift over time.
  • Gatekeeper is the feature flag tool, A/B testing – 500M checks/sec. We have one of these at Bazaarvoice and it’s super helpful.
  • Claspin is a high density heat map viewer for large services
  • fast deployment technique – used shared memory segment to connect cache, so they can swap out binaries on top of it, took weeks of back end deploys down to days/hours
  • Built lots of ops tools – people | tools | process

Random Bits:

  • Be bold
  • They use BGP in the data center
  • Did we mention how cool we are?
  • Capacity engineering isn’t just buying stuff, it’s APM to reduce usage
  • Massive fail from gatekeeper bug.  Had to pull dns to shut the site down
  • Fix more, whine less

Investigating Anomalies, Amazon

John Rauser, Amazon data scientist, on investigating anomalies.  He gave a well received talk on statistics for operations last year.
He used a very long “data in the time of cholera” example.  Watch the video for the long version.
You can’t just look at the summary, look at distributions, then look into the logs themselves.
Look at the extremes and you’ll find things that are broken.
Check your long tail, monitor percentiles long in the tail.

Building Resilient User Experiences

Mike Brittain, Etsy on Building Resilient User Experiences – here’s the video.

Don’t mess up a page just because one of the 14 lame back end services you compose into it is down. Distinguish critical back end service failures and just suppress other stuff when composing a product page. Consider blocking vs non-blocking ajax. Load critical stuff synchronously and noncritical async/not at all if it times out.
Google apps has the nice problem/retrying in an interval messages (use exponential back off in your ui!)

Planning for 100% availability is not enough; plan for failure.  Your UI should adapt to failure. That’s usually a joint dev+ops+product call.
Do operability reviews, postmortems. Watch your page views for your error template!  (use an error template)

Speeding up the web using prediction of user activity

Arvind Jain/Dominic Hamon from Google – see the video.

Why things would be so much faster if you just preload the next pages a user might go to.  Sure.  As long as you don’t mind accidentally triggering all kinds of shit.  It’s like cross browser request forgery as a feature!
<link rel=’prerender’> to provide guidance.

Annoyed by this talk, and since I’ve read How Complex Systems Fail already, I spent rest of the morning plenary time wandering the vendor floor

It’s All About Telemetry

From the famous and irascible Theo Schlossnagle (@postwait). Here’s the slides.
Monitor what matters! Most new big data problems are created by our own solutions in the first place, and are thus solvable despite their ROI. e.g. logs.

What’s the cost/benefit of your data?
Don’t erode granularity (as with RRD).  It controls your storage but your ability to, say, do YOY black friday compares sucks.
As you zoom in you see obscured major differences and patterns.

There’s a cost/benefit curve for monitoring that goes positive but eventually goes negative.Value=benefit-cost.  So you don’t go to the end of the curve, you want the max difference!

Technique 1: Text
Just store changes- be careful not to have too many changes and store them

Technique 2: Numeric
store rollups, over 1 minute min/max/avg/sttdev/covar/50/95/99%
store first order derivative and then derivative of that (jerkiness)
db replication – lag AND RATE OF LAG CHANGE
“It’s a lot easier to see change when you’re actually graphing change”
Project numbers out.   Graph storage space doing down!
With simple numeric data you can do prediction (Holt-Winters) even hacked into RRD

Technique 3: Histograms
about 2k for a 1 minute histogram (5b for a single bucket)

Event correlation – change mgmt vs performance, is still human eye driven

Monitor everything.

  • the business – financials, marketing, support! problems, custsat, res time
  • operations! durr
  • system/db/yeah
  • middleware! messaging, apis, etc.

They use reconnoiter, statsd, d3, flot for graphing.

This was one of the best sessions of the day IMO.

RUM For Breakfast

Carlos from Facebook on routing users to the closest datacenter with Doppler. Normal DNS geo-routing depends on resolvers being near the user, etc.
They inject js into some browsers and log packet latency.  Map resolver IP to user IP, datacenter, latency. Then you can cluster users to nearest data centers. They use for planning and analysis – what countries have poor latency, where do we need peering agreements
Akamai said doing it was impractical, so they did it.
Amazon has “latency based routing” but no one knows how it works.
Google as part of their standard SOP has proposed a DNS extension that will never be adopted by enough people to work.

Looking at log-normal perf data – look at the whole spread but that can be huge, so filter then analyze.
Margin of error – 1.96*sd/sqrt(num), need less than 5% error.

How does performance influence human behavior?
Very fast sessions had high bounce rates (probably errors)
Strong bounce rate to load time correlation, and esp front end speed
Toxicology has the idea of a “median lethal dose”-  our LD50 is where we pass 50% bounce rate
These are: Back end 1.7s, dom load 1.8s, dom interactive 2.75s, front end 3.5s, dom complete 4.75s, load event 5.5s

Rollback, the Impossible Dream

by James Turnbull! Here’s the slides.

Rollback is BS.  You are insane if you rely on it. It’s theoretically possible, if you apply sufficient capital, and all apps are idempotent, resources…

  • Solve for availability, not rollback. Do small iterative changes instead.
  • Accept that failure happens, and prevent it from happening again.
  • Nobody gives a crap whose fault it was.
  • Assumption is the mother of all fuckups.
  • This can’t be upgraded like that because…  challenge it.

Stability Patterns

by Michael Nygard (@mtnygard) – Slides are here.   For more see his pragmatic programmers book Release It!

Failures come in patterns. Here’s some major ones!

Integrations are the #1 risk to stability, from both “in spec” and “out of spec” errors. Integrations are a necessary evil.  To debug you have to peel back the layers of abstraction; you have to diagnose a couple levels lower than the level an error manifests. Larger systems fail faster than small ones.

Chain Reactions – #2!
Failure moves horizontally across tiers, search engines and app servers get overloaded, esp. with connection pools. Look for resource leaks.

Cascading Failure
Failure moves vertically cross tiers. Common in SOA and enterprise services. Contain the damage – decouple higher tiers with timeouts, circuit breakers.

Blocked Threads
All threads blocked = “crash”. Use java.util/concurrent or system.threading (or ruby/php, don’t do it). Hung request handlers = less capacity and frustrated users. Scrutinize resource pools. Decompile that lame ass third party code to see how it works, if you’re using it you’re responsible for it.

Attacks of Self-Denial
Making your own DoS attack, often via mass unexpected promotions. Open lines of communication with marketers

Unbalanced Capacities
Your environment scaling ratios are different dev to qa to prod. Simulate back end failures or overloading during testing!

Unbounded result sets
Dev/test have smaller data volumes and unreal relationships (what’s the max in prod?). SOA with chatty providers – client can’t trust not being hurt. Don’t trust data producers, put limits in your APIs.

Stability patterns!

Circuit Breaker
Remote call wrapped with a retry loop (always 3!)
Immediate retries are very likely to fail again – TCP fixes packet drops for you man
Makes the user wait longer for their error, and floods the servers (cascading failure)
Count failures (leaky bucket) and stop calling the back end for a cool-off period
Can critical work be queued for later, or rejected, or what?
State of circuit breakers is a good dashboard!

Bulkheads
Partition the system and allow partial failure without losing service.
Classes of customer, etc.
If foo and bar are coupled with baz, then hitting baz can bork both. Make baz pools or whatever.
Less efficient resource use but important with shared-service models

Test Harness
Real-world failures are hard to create in QA
Integration tests don’t find those out-of-spec errors unless you force them
Service acting like a back end but doing crazy shit – slow, endless, send a binary
Supplement testing methods

The Colo or the Cloud?

Ken King/James Sheridan, Yammer

They started in a colo.
Dual network, racks have 4 2U quad and 10 1Us
EC2 provides you: network, cabling, simple load balancing, geo distribution, hardware repair, network and rack diagrams (?)
There are fixed and variable costs, assume 3 year depreciation
AWS gives you 20% off for 2M/year?
Three year commit, reserve instances
one rack one year – $384k cool, $378k ec2, $199k reserved
20 racks one year – $105k colo, $158 ec2 even with reserve and 20% discount
20 racks 3 years – $50k colo, $105k ec2

But – speed (agility) of ramping up…

To me this is like cars.  You build your own car, you know it.  But for arbitrary small numbers of cars, you don’t have time for that crap.  If you run a taxi fleet or something, then you start moving back towards specialized needs and spec/build your own.

Colo benefits – ownership and knowledge.  You know it and control it.

  • Load balancing (loadbalancer.com or zeus/riverbed only cloud options)
  • IDS etc. (appliances)
  • Choose connectivity
  • know your IO, get more RAM, more cores
  • Throw money at vertical scalability
  • Perception of control = security

Cloud benefits – instant. Scaling.

  • Forces good architecture
  • Immediate replacement, no sparing etc.
  • Unlimited storage, snapshots (1 TB volume limit)
  • No long term commitment
  • Provisioning APIs, autoscaling, EU storage, geodist

Hybrid.

  • crocodoc and encoder yammer partners are good to be close to
  • need to burst work
  • dev servers, windows dev, demo servers banished to AWS
  • moving cross connects to ec2 with vpc

Whew!  Man I hope someone’s reading this because it’s a lot of work.  Next, day three and bonus LSPE meeting!

2 Comments

Filed under Conferences, DevOps