Hi all! My new job’s been affording me few opportunities for blogging, but I’m getting into the groove, so you should see more of me now.
Releasing All The Time!
Continuous integration is the bomb. We can all generally agree on that. But my life has become one of halfway steps that I think will be familiar to many of you, and I don’t believe in hiding the real world that’s not all case study perfect out there. So rather than give you the standard theory-list of “what you should do for nice futuristic DevOps releases,” let me tell you of our march from a 10 week to 2 week to 1 week release tempo at Bazaarvoice.
I started with BV at the start of February of this year. They said, “Our new release manager! We’ve been waiting for you! We adopted agile and then tried to move from our big-bang 10 week release cycle to 2 weeks and it blew up like you wouldn’t believe. Get us to two week releases. You’ve got a month. Go!” The product management team really needed us to be able to roll out features more quickly, do piloting and A/B testing, and generally be way more agile in delivery to the customer and not just in dev-land.
Background – our primary application is for the collection and display of user generated content – for example, ratings and reviews – and a lot of the biggest Internet retailers use our solution for that purpose. The codebase started seven years ago and grew monolithically for much of that time. (“The monolith” was the semi-affectionate code name for the stack when I started, as in “is your app’s code on the monolith?”) The app is running across multiple physical and cloud based datacenters and pushing out billions of hits a day, so there’s a low tolerance window for errors – our end user facing display apps have to have zero downtime for releases, though we can do up to two hours of downtime in a 3-5 AM window for customer administrative systems. The stack is Java, Linux, MySQL, Solr, et al. It’s extremely complex, just like any app that has been added on to for years.
There had been a SWAT team formed after the semi-disastrous 2 week release that identified the main problems. Just like everywhere else, the main impediments were:
1. Lack of automation in testing
2. Poor SCM code discipline
Our CTO was very invested in solving the problem, so he supported the solution to #1 – the QA team hired up and got some automation folks in, and the product teams were told they had to stop feature development and do several sprints of writing and automating tests until they could sustain the biweekly cadence.
The solution to #2 had two parts. One was a feature flagging system so we could launch code “dark.” We had a crack team of devs crank this one out. I won’t belabor it because Facebook etc. love to do DevOps presentations on the approach and its benefits, but it’s true. Now we release all code dark first, and can enable it for certain clients or other segments.
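For anyone who hasn’t seen the pattern, here is a minimal sketch of what a “dark launch” check can look like. It is not the system our devs built; `FeatureFlag`, `FlagRegistry`, and the flag and client names are all hypothetical, made up for illustration. The only idea it borrows from the above is that code ships switched off and is then enabled for specific clients.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureFlag:
    """A flag that stays off ("dark") unless enabled globally or per client."""
    name: str
    enabled_globally: bool = False
    enabled_clients: set = field(default_factory=set)

    def is_enabled(self, client_id: str) -> bool:
        return self.enabled_globally or client_id in self.enabled_clients


class FlagRegistry:
    """Hypothetical in-memory registry; a real one would be backed by config or a DB."""

    def __init__(self):
        self._flags = {}

    def register(self, name: str) -> FeatureFlag:
        # New flags start dark: the code is deployed but nobody sees it.
        flag = FeatureFlag(name)
        self._flags[name] = flag
        return flag

    def enable_for_client(self, name: str, client_id: str) -> None:
        self._flags[name].enabled_clients.add(client_id)

    def is_enabled(self, name: str, client_id: str) -> bool:
        flag = self._flags.get(name)
        # Unknown flags are treated as dark rather than raising.
        return flag is not None and flag.is_enabled(client_id)
```

Calling code would wrap the new behavior in something like `if flags.is_enabled("new_review_widget", client_id): ...` and fall through to the old path otherwise, which is what lets a release go out without changing what any client sees.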
Two was process – a new branching process where a single release branch comes off trunk every two weeks several days before release, and changes aren’t allowed to it except to fix issues found in QA, and those are approved and labeled into discrete release candidates. The dev environment gets trunk twice a day, the QA environment gets branch every time a new release candidate is labeled. Full product CIT must be passing to get a release candidate. As always, process steps like this sound like common sense but when you need 100 developers in 10 teams to uptake them immediately, the little issues come out and play.
There were a couple issues we couldn’t fix in the time allotted. One was that our Solr indexes are Godawful huge. Like 20 GB huge. JVM GC tuning is a popular hobby with us. To make changes, reindex, and distribute the indexes in time to perform a zero-downtime deployment, with replication lag nipping at our heels, was a bigger deal. The other was that our build and deploy pipeline was pretty bad. All the keywords you want to hear are there – Puppet, TeamCity, Rundeck, svn, noah, maven/Nexus, yum… But they are inconsistently implemented, embedded in a huge crufty bash script framework and parts have gone largely untended.
The timeframe was extremely aggressive. I project managed the hell out of it and all the teams were very on board and helpful, and management was very supportive. I actually got a slight delay, and was grateful for it, because our IPO date came up on the same date when we were supposed to start biweekly releases, and even the extremely ambitious were taken aback by the risk of cocking up the service on that day. We did our first biweekly release on March 6th and then every two weeks thereafter. We had a couple rough patches, but they were good learning experiences.
For example, as our first biweekly release day approached, tests just weren’t passing. I brought all the dev managers to the go/no-go meeting (another new institution) and asked them, “Are we go?” (The release manager role had been set up by upper management as more prescriptive, with the thought I’d be sitting there yelling at them “It’s no-go,” but that’s really not an effective long term strategy.) They all kinda shuffled, and hemmed, and hawed (a lot of internal stakeholders were pushing hard for this release to go out NOW), but then one manager said “No, we’re no go. It’s just not safe.” Once she said that, everyone else got over that initial taboo of saying “no go” and concurred that some of their areas were no go. The release went out 5 calendar days late but a lot more smoothly than the last release did (44 major issues then, 5 this time).
The next release, though, was the real make-or-break. On the one hand everyone had a first real pass through the process and so some of the “but I didn’t know I needed to have testing signoff by that day and time” breaking-in static was gone, but on the other hand they’d had 2 months between the previous two releases to test and plan, and this one allowed only two weeks. It went off with no delay and only 1 issue.
Of course, we had deliberately sandbagged that a little because it coincided with a “test development only” sprint. But anyone who thinks a complex release in a large scale environment will go smoothly just because you’re deploying code with no functional changes has clearly never been within a 10-foot pole of real world Web operations. As we ramped back up on feature development, the process was also becoming more ingrained and testing better, so it went well.
We had one release go bad in May, and when we looked at it we realized a lot of changes weren’t being sufficiently QA’ed. So what we did was simply add a set of fields to all JIRA tickets for the team to specify who tested the change, and we wrote a script to parse our Subversion commit comments and label JIRA tickets with the appropriate release (trying to get people to actually fill out tickets correctly is a pain and usually doomed to failure, so we made an end run with automation). So then as a release comes up, a wiki page lists all the tickets in the release, who tested them, and how (automatic, manual, did not test). We actually did this for two releases with paper printouts and physical signoffs to develop the process before we automated it. This corrected the issue and we ran from then on with very low problem rates. As advertised, releasing fewer changes more frequently allows us to get both a higher throughput of changes and, paradoxically, higher quality with them.
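The core of that kind of automation is small. Below is a rough sketch, not our actual script: it assumes commit comments mention tickets using the default JIRA key shape (like `BV-101`, a made-up key), and the function names and the shape of `test_records` are hypothetical. It skips the parts that talk to Subversion and JIRA and only shows the parsing and the signoff table.

```python
import re

# Assumption: ticket keys follow the default JIRA shape, e.g. "BV-101".
TICKET_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def tickets_in_release(commit_messages):
    """Collect the distinct ticket keys mentioned in a release's commit comments."""
    found = set()
    for message in commit_messages:
        found.update(TICKET_KEY.findall(message))
    return sorted(found)


def signoff_rows(ticket_keys, test_records):
    """Build (ticket, tester, method) rows for the release signoff page.

    test_records maps ticket key -> (tester, method), where method is one of
    "automatic", "manual", or "did not test". Tickets with no record are
    listed as untested so they stand out at go/no-go.
    """
    rows = []
    for key in ticket_keys:
        tester, method = test_records.get(key, ("(nobody)", "did not test"))
        rows.append((key, tester, method))
    return rows
```

The design choice worth copying is the default in `signoff_rows`: a ticket nobody claimed shows up as “did not test” instead of silently dropping off the list.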
The process worked great through the summer. In the biweekly release communication and presentations, I had explained we’d be moving to weekly and then to continuous deployment as soon as we could make it happen. Well, the Solr index distribution problem took a while – two reorgs kicked it around, and the fix was an ambitious, pretty propellerhead “use BitTorrent to distribute the index to all the servers in our various DCs” kind of thing. It took the summer to get that squared away. In the meantime I also conducted a project internally called “Neverland” to fix some of the most egregious technical debt in our TeamCity and Nexus setup and deployment scripts.
The real testament to the culture change that happened as part of the biweekly release project is that while that project was a “big deal” – I had stakeholders from all over the business, big all hands presentations, project plans out the yin-yang, the entire technical leadership team sweating the details – moving from biweekly to weekly releases was largely a non-event.
The QA team worked in the background leading up to it to push test automation levels up higher. Then we basically just said “Hey, you guys want to release faster don’t you?” “Well sure!” “OK, we’re going weekly in two weeks. Check out the updated process docs.” “All right.” And we did, starting the first release in September. The Solr index got reindexed and redistributed (and man, it had been a while – it compacted down nicely) and deployment ran great. No change in error rate at all. We’ve been weekly since then; the only exception is that we don’t release during critical change freeze windows around Black Friday/Cyber Monday and other holiday prime times. We think our setup is robust enough that it’s safe to release even then, but, heck, no one’s perfect so it’s probably prudent to pause, and many of our clients are really adamant about holiday change freezes for us and for all their suppliers.
The one concern voiced by engineers about the overhead of the release process was addressed by automating it more and by educating. For example, the go/no-go meeting was, at times, a little messy. Some of the other teams (especially ones not located in Austin) wouldn’t show up, or test signoffs wouldn’t be ready, and it would turn into delays and running around. The opportunity to do it more quickly actually helped a lot! Whereas the meeting had been 30 minutes if we were lucky when we started, now the meeting is taking 5 minutes, and only longer when someone screws around and doesn’t dial into the Webex on time.
“If it’s painful, do it more often” is a message that some folks still balk at when confronted with, but it is absolutely true.
Now, the path wasn’t easy and I was blessed with a very high caliber of people at Bazaarvoice – Dev, Ops, and QA. Everyone was always very focused on “how do we make this work” or “how do we improve this” with very little of the turf warring, blocking, and politics that I sadly have come to expect in a corporate environment. The mindset is very much “if we come up with a new way that’s better and we all agree on that, we will change to do that thing TOMORROW and not spend months dithering about it,” which is awesome and helped drive these changes through much faster than, honestly, I initially estimated it would take.
Releasing All The Time!
Continuous integration on “the monolith” was a distant myth initially, but now we’re seeing how we can get there and the benefits we’ll reap from doing so. Our main impediments remaining are:
1. CIT not passing. We don’t have a rule where if CIT is failing checkins are blocked, mainly because there’s a bunch of old legacy tests that are flaky. This often results in release milestones being delayed because CIT isn’t passing and there’s 6 devs’ checkins in the last failing build. Step 1 is fix the flaky tests and step 2 is declare work stoppage when CIT is failing. The senior developers see the wisdom in this so I expect it to go down without much friction. Again, the culture is very much about ruthlessly adopting an innovation if the key players agree it will be beneficial.
2. Builds, CIT, and deployment are slow as molasses in January. Build 1 hour, CIT 40 minutes, deploy 3 hours. Why? Various legacy reasons that give me a headache when I have to listen to them. Basically “that’s how it is now, and a complete rewrite is potentially beyond any one person’s ability and definitely would take multiple man-months.” We’re analyzing what to do here. We also have a “staging” environment customers use for integration, and so currently we have to deploy to dev, test, deploy to QA, test, deploy to staging (hitting the downtime window), test, deploy to production (hitting the downtime window), test. So basically 2 days minimum. However, staging is really production, and step one is to release to both at the same time. There’s a couple “but I can only test this kind of change in staging” items left that basically just require telling someone “Figure out how to test it in QA now.” Going to “always release trunk” will remove the whole branch deployment and separate dev and QA environments. So that’s 2 of 4 deployments removed, but then it’s a matter of figuring out cost vs benefit of smashing down parts of that 4:40 (1 hour build, 40 minutes CIT, 3 hours deploy). I have one proposal in front of me for chucking all the current deploy infrastructure for a Jenkins-driven one; I need to figure out if it is complete enough…
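To spell out the arithmetic behind those numbers, here is a quick back-of-the-envelope sketch. The durations are the ones quoted above; treating every environment’s deploy as the same 3 hours, and ignoring test time and downtime windows, are simplifying assumptions of mine, so read the totals as a floor rather than a schedule.

```python
from datetime import timedelta

# Durations quoted in the post.
BUILD = timedelta(hours=1)
CIT = timedelta(minutes=40)
DEPLOY = timedelta(hours=3)


def single_pass():
    """One build, one CIT run, and one deployment, back to back."""
    return BUILD + CIT + DEPLOY


def deploy_time(environments):
    """Total deploy time, assuming each environment takes the quoted 3 hours."""
    return DEPLOY * len(environments)


current = ["dev", "QA", "staging", "production"]
# Assumption: "always release trunk" merges dev and QA, and staging ships with production.
proposed = ["QA", "production"]

print("single pass:", single_pass())
print("deploy time now:", deploy_time(current))
print("deploy time proposed:", deploy_time(proposed))
```

Even with half the deployments gone, the deploy step is still the biggest chunk of a single pass, which is why it’s the first place to look when deciding what to smash down.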
Am I Doing It Wrong?
Chime in in the comments below with questions or if there’s some way I could have cut the Gordian knot better. I think we’ve moved about as fast as you can given a lot of legacy code and technical debt (and having a lot of other stuff people need to be working on to keep a service up and get out new functionality). The three step process I used that works, as it does so often, was:
- Communicate a clear vision
- Drive execution relentlessly
- Keep metrics and continually improve
Thanks for reading, and happy releasing!