Check out agile admin James Wickett’s talk from DeliveryConf last month on adding security into your continuous software delivery pipeline!
Check out agile admin James Wickett’s talk from DeliveryConf last month on adding security into your continuous software delivery pipeline!
Woooo! Last day of a week of conferencing. DevOpsDays Day 1 was good and I have even more openspace topics I plan to propose next time. As usual this is being livestreamed and will be viewable later as well at bmc.com/devops.
Sponsor Watch… Got to talk to our friends at PagerDuty (alert management) and Datadog (monitoring/dashboarding), we use them and love them. And I got to see Stormpath again, they first showed up at last DevOpsDays with a SaaS hosted auth solution (not like PingIdentity and Okta, they actually store the usernames/passwords for you, Les Hazlewood the Apache Shiro guy started it) and they’re growing quickly. Also talked to SaltStack which does salt, a remote command execution framework. 10gen was here with a MongoDB SaaS backup solution (nice!) and monitoring solution.
By Damon Edwards (@damonedwards) from DTO and now #SimplifyOps.
How to spread DevOps in enterprises. There’s silos you know. The term DevOps may work against you – it’s evangelical and being overused/washed already.
There is no ‘why’ other than the why of the business. Read your Deming/Collins/Four Steps to the Epiphany/etc.
Go ask people… Something.
Develop a common DevOps vision. Not a process because they’ll get blinders on. [Ed: I believe this is a false dichotomy – you should teach both. Vision without process lacks focus and process without vision lacks direction. It’s like accuracy and precision.]
Do a value stream mapping – read Learning to See. OK, this is the meat of the preso – very hard to read though.
Take your information flow and turn it into an artifact flow
Do a timeline analysis, find waste
Metrics. Establish the metric chain of what matters to the business, driven down to a capability which influences what matters to the business, and driven down to an activity over which an infividual can cause/influence outcomes.
Doesn’t require saying “devops.”
Only takes like 3 days to bootcamp it. Then put in continuous improvement loops.
You can only break silos by brute-force being the boss, but misalignment will reassert itself. Have to change the alignment.
Q&A: Do it with everyone in the same room with whiteboards/postits, it works better than getting fancy
Toufic Boubez from Metafor. Cofounded Layer 7 and escaped when CA acquired them.
Came from a popular DOD Austin Openspace – see the blog post!
How much data do you need? No more res that twice your highest frequency (Nyquist-Shanon). Most algorithms will smooth/average/etc.
Q&A: Are control systems more appropriate for small not large systems? No – just like in industry, as long as you design for that then it’s not just for toys.
And now I step in for the vendor pitch for Riverbed. Agile Admin Peco left yesterday and the other Riverbed booth guys made themselves scarce, so I did their shout-out for them. They have Zeus EC2 LBs, Aptimize web front end optimizer, and Opnet Appinternals Xpert APM tool! Very cool.
Scott Turnquest from Thoughtworks
Tools: Value stream mappings, fishbone analysis, “5 Whys”
So how do we do that value stream mapping? Here we go! [Ed: Oh, this is nice, I was sad that in the DTO presentation they mentioned them and threw some up but didn’t really dive down into one.]
A day of analysis of one small feature – a day of wait, 4 days of dev, 2 mins of wait, 1 hour of acceptance tests, 4 hours of deploy, 1 day in staging, 4 hours to deploy to prod. Note the waste areas – “4 days in dev? Really?” and the long ass deploy windows [Ed: Our value stream looks depressingly like this.] Process cycle efficiency of 75% (value creation time/total time)
So to determine the source of those waste areas, use the fishbone diagram. Had long feedback cycles from structure of code and build/deploy pipelines. Couldn’t test w/o AWS and can’t test individual components, provisioning was serial and repos were flaky.
Fix underlying cause (most impact first) – deploy pipelines. Reduce failure rate of deployments. Half were failing, and failing slow. Moved to AMI baking for reliability. [Ed: They said I was crazy a couple years ago when I said this, “no it’s a foil ball…” Bake when you can!] So this got them from 4 hours to 2 hours, and then parallelized and got down to 25 minutes. This cut down the staging and prod deploys but also the dev time. Process cycle efficiency up to 83%.
5 Whys root cause analysis method. Figured out manual hard to automate deployments were at the root, automated them – don’t be afraid to restructure/redesign when complexity gets in the ways.
Analysis techniques are not just for analysts!
Read Jez Humble’s “Continuous Delivery”, Poppendieck’s “Lean Soft Dev”/”Implementing Lean Soft Dev” , Derby/Larsen “Agile Retrospectives”
Antoni Batchelli of PalletOps. Complexity, essential and accidental. Building a system is simple but the systems are complex at runtime, and “complexity of a system is the degree of difficulty in predicting the properties of the system given the properties of the system’s parts.”
In DevOps we see infrastructure-aware software and concepts moving up into dev processes.
Devs want to run “their own” cluster with all the setups they need – productionlike, but with specific versions/timings/data/code/etc. Don’t care about infra details but want consistent envs/code.
Software has to be infrastructure aware now to autoscale, self-heal, etc. The app is the best informed actor to make/orchestrate infra decisions.
[Ed: This late into a conference week, I get a little irritated about presentations that are not really clear *why* they are telling you what they’re telling you.]
He hates incidental complexity. Me too.
OK, maybe we’re getting to a thesis. Let people solve problems where they are less complex: at the right level of abstraction. Build layers of abstraction – infrastructure, OS, services, actions. Make them into modules, make them functional and polymorphic.
James Wickett (@wickett) on Rugged DevOps and gauntlt for security + DevOps. gauntlt is a gem for continuous security testing as part of your build cycle. BDD your app’s security! Knock Out!!! go to gauntlt.org to get started.
Karthik Gaekwad (@iteration1) on DevOps Culture in the CIA. Devops is culture/automation/measurement/sharing. Seen Zero Dark Thirty? Well, the true story behind that details the COA’s transformation from a split between analysts and operatives especially using Sisterhood, a group of female analysis tracking Bin Laden since 1980. Post 9/11 there was a mass reorg to become more tactical – analysts became Targeters and worked with Operatives hand in hand. Same kind of silo busting. The Phoenix Project is Zero Dark Thirty for DevOps!
Dave Mangot (@davemengot) for DevOps Do’s and Don’ts from Salesforce. Do give everyone the tools they need to do their jobs. Don’t make ops the constraint, Do lots of communicating. Don’t forget to include everyone. Do get ops involved early. Don’t create a front door (loaded) process. Do have integration environments, Don’t forget config management. Do have blameless post-mortems. Don’t use the Phoenix Project as a bludgeon. Do use Agile as a cultural tool. Don’t rely on tools to change culture. Do get executive sponsorship. Don’t do shadow IT. Do use Damon Edward’s levers. Don’t just lecture, it’s a participation sport. Do structure the org around delivery. Don’t make separate DevOps teams or jackets. Do get the whole company involved, DevOps is for everyone.
Jonathan Thorpe – Preventing DevOps success. Not planning for scale. Not having unit tests. Not designing automated tests to scale. Not managing your capacity. Not using your resources effectively. Not using same deployment process for all environments. Not knowing what/where/when/who (activity tracking). Getting covered in ants.
DevOps is the future – John Esser from ancestry.com. What keeps CIOs up at night? Besides ants? IT. Need time to value. Transform mindset/processes/tools/etc. Strangler pattern.
DevOps productivity survey by Oliver White from ZeroTurnaround. DevOps oriented teams spend more time on infrastructure improvements and less on firefighting and support. Problem recoveries are shorter. Release software faster. use more custom tools. Make love for longer time. @rebel_labs
Nathan Harvey on leveling up your skills. Quit! Go to a conference. Try new things. Do a project somewhere. Always be interviewing.
Hi all! My new job’s been affording me few opportunities for blogging, but I’m getting into the groove, so you should see more of me now.
Continuous integration is the bomb. We can all generally agree on that. But my life has become one of halfway steps that I think will be familiar to many of you, and I don’t believe in hiding the real world that’s not all case study perfect out there. So rather than give you the standard theory-list of “what you should do for nice futuristic DevOps releases,” let me tell you of our march from a 10 week to 2 week to 1 week release tempo at Bazaarvoice.
I started with BV at the start of February of this year. They said, “Our new release manager! We’ve been waiting for you! We adopted agile and then tried to move from our big-bang 10 week release cycle to 2 weeks and it blew up like you wouldn’t believe. Get us to two week releases. You’ve got a month. Go!” The product management team really needed us to be able to roll out features more quickly, do piloting and A/B testing, and generally be way more agile in delivery to the customer and not just in dev-land.
Background – our primary application is for the collection and display of user generated content – for example, ratings and reviews – and a lot of the biggest Internet retailers use our solution for that purpose. The codebase started seven years ago and grew monolithically for much of that time . (“The monolith” was the semi-affectionate code name for the stack when I started, as in “is your app’s code on the monolith?”) The app is running across multiple physical and cloud based datacenters and pushing out billions of hits a day, so there’s a low tolerance window for errors – our end user facing display apps have to have zero downtime for releases, though we can do up to two hours of downtime in a 3-5 AM window for customer administrative systems. Stack is Java, Linux, mySQL, Solr, et al. Extremely complex, just like any app added on to for years.
There had been a SWAT team formed after the semi-disastrous 2 week release that identified the main problems. Just like everywhere else, the main impediments were:
Our CTO was very invested in solving the problem, so he supported the solution to #1 – the QA team hired up and got some automation folks in, and the product teams were told they had to stop feature development and do several sprints of writing and automating tests until they could sustain the biweekly cadence.
The solution to #2 had two parts. One was a feature flagging system so we could launch code “dark.” We had a crack team of devs crank this one out. I won’t belabor it because Facebook etc. love to do DevOps presentations on the approach and its benefits, but it’s true. Now we release all code dark first, and can enable it for certain clients or other segments.
Two was process – a new branching process where a single release branch comes off trunk every two weeks several days before release, and changes aren’t allowed to it except to fix issues found in QA, and those are approved and labeled into discrete release candidates. The dev environment gets trunk twice a day, the QA environment gets branch every time a new release candidate is labeled. Full product CIT must be passing to get a release candidate. As always, process steps like this sound like common sense but when you need 100 developers in 10 teams to uptake them immediately, the little issues come out and play.
There were a couple issues we couldn’t fix in the time allotted. One was that our Solr indexes are Godawful huge. Like 20 GB huge. JVM GC tuning is a popular hobby with us. To make changes, reindex, and distribute the indexes in time to perform a zero-downtime deployment, with replication lag nipping at our heels, was a bigger deal. The other was that our build and deploy pipeline was pretty bad. All the keywords you want to hear are there – Puppet, TeamCity, Rundeck, svn, noah, maven/Nexus, yum… But they are inconsistently implemented, embedded in a huge crufty bash script framework and parts have gone largely untended.
The timeframe was extremely aggressive. I project managed the hell out of it and all the teams were very on board and helpful, and management was very supportive. I actually got a slight delay, and was grateful for it, because our IPO date came up on the same date when we were supposed to start biweekly releases, and even the extremely ambitious were taken aback by the risk of cocking up the service on that day. We did our first biweekly release on March 6th and then every two weeks thereafter. We had a couple rough patches, but they were good learning experiences.
For example, as our first biweekly release day approached, tests just weren’t passing. I brought all the dev managers to the go/no-go meeting (another new institution) and asked them, “are we go?” (The release manager role had been set up by upper management as more prescriptive, with the thought I’d be sitting there yelling at them “It’s no-go,” but that’s really not an effective long term strategy). They all kinda shuffled, and hemmed, and hawed (a lot of pressure from internal stakeholders wanted this release to go out NOW), but then one manager said “No, we’re no go. It’s just not safe.” Once she said that everyone else got over that initial taboo of saying “no go” and concurred that some of their areas were no go. The release went out 5 calendar days late but a lot more smoothly than the last release did (44 major issues then, 5 this time).
The next release, though, was the real make-or-break. On the one hand everyone had a first real pass through the process and so some of the “but I didn’t know I needed to have testing signoff by that day and time” breaking-in static was gone, but on the other hand they’d had 2 months between the previous two releases to test and plan, and this one allowed only two weeks. It went off with no delay and only 1 issue.
Of course, we had deliberately sandbagged that a little because it coincided a with ‘test development only” sprint. But anyone who thinks a complex release in a large scale environment will go smoothly just because you’re deploying code with no functional changes has clearly never been closer than a 10-foot pole to real world Web operations. As we ramped back up on feature development, the process was also becoming more ingrained and testing better, so it went well.
We had one release go bad in May, and when we looked at it we realized a lot of changes weren’t being sufficiently QA’ed. So what we did was simply add a set of fields to all JIRA tickets for the team to specify who tested the change, and we wrote a script to parse our Subversion commit comments and label JIRA tickets with the appropriate release (trying to get people to actually fill out tickets correctly is pain and usually doomed to failure, so we made an end run with automation). So then as a release came up, on a wiki page is a list of all the tickets in the release and who tested them and how (automatic, manual, did not test). We actually did this for two releases with paper printouts and physical signoffs to develop the process before we automated it. This corrected the issue and we ran from then on with very low problem rates. As advertised, releasing fewer changes more frequently allows us to get both a higher throughput of changes and, paradoxically, higher quality with them.
The process worked great through the summer. In the biweekly release communication and presentations, I had explained we’d be moving to weekly and then to continuous deployment as soon as we could make it happen. Well, the solr index distribution problem took a while – two reorgs kicked it around and it was an ambitious “use bittorrent to distribute the index to all the servers in our various DCs” pretty propellerhead kind of thing that had to happen. It took the summer to get that squared away. In the meantime I also conducted a project internally called “Neverland” to fix some of the most egregious technical debt in our TeamCity and Nexus setup and deployment scripts.
The real testament to the culture change that happened as part of the biweekly release project is that while that project was a “big deal” – I had stakeholders from all over the business, big all hands presentations, project plans out the yin-yang, the entire technical leadership team sweating the details – moving from biweekly to weekly releases was largely a non-event.
The QA team worked in the background leading up to it to push test automation levels up higher. Then we basically just said “Hey, you guys want to release faster don’t you?” “Well sure!” “OK. we’re going weekly in two weeks. Check out the updated process docs.” “All right.” And we did, starting the first release in September. The Solr index got reindexed and redistributed (and man, it had been a while – it compacted down nicely) and deployment ran great. No change in error rate at all. We’ve been weekly since then, the only change is when we don’t release during critical change freeze windows around Black Friday/Cyber Monday and other holiday prime times. We think our setup is robust enough that it’s safe to release even then, but, heck, no one’s perfect so it’s probably prudent to pause, and many of our clients are really adamant about holiday change freezes to us and to all their suppliers.
The one concern voiced by engineers about the overhead of the release process was addressed by automating it more and by educating. For example, the go/no-go meeting was, at times, a little messy. Some of the other teams (especially ones not located in Austin) wouldn’t show up, or test signoffs wouldn’t be ready, and it would turn into delays and running around. The opportunity to do it more quickly actually helped a lot! Whereas the meeting had been 30 minutes if we were lucky when we started, now the meeting is taking 5 minutes, and only longer when someone screws around and doesn’t dial into the Webex on time.
“If it’s painful, do it more often” is a message that some folks still balk at when confronted with, but it is absolutely true.
Now, the path wasn’t easy and I was blessed with a very high caliber of people at Bazaarvoice – Dev, Ops, and QA. Everyone was always very focused on “how do we make this work” or “how do we improve this” with very little of the turf warring, blocking, and politics that I sadly have come to expect in a corporate environment. The mindset is very much “if we come up with a new way that’s better and we all agree on that, we will change to do that thing TOMORROW and not spend months dithering about it,” which is awesome and helped drive these changes through much faster than, honestly, I initially estimated it would take.
Continuous integration on “the monolith” was a distant myth initially, but now we’re seeing how we can get there and the benefits we’ll reap from doing so. Our main impediments remaining are:
1. CIT not passing. We don’t have a rule where if CIT is failing checkins are blocked, mainly because there’s a bunch of old legacy tests that are flaky. This often results in release milestones being delayed because CIT isn’t passing and there’s 6 devs’ checkins in the last failing build. Step 1 is fix the flaky tests and step 2 is declare work stoppage when CIT is failing. The senior developers see the wisdom in this so I expect it to go down without much friction. Again, the culture is very much about ruthlessly adopting an innovation if the key players agree it will be beneficial.
2. Builds, CIT, and deployment are slow as molasses in January. Build 1 hour, CIT 40 minutes, deploy 3 hours. Why? Various legacy reasons that give me a headache when I have to listen to them. Basically “that’s how it is now, and complete rewrite is potentially beyond any one person’s ability and definitely would take multiple man-months.” We’re analyzing what to do here. We also have a “staging” environment customers use for integration, and so currently we have to deploy to dev, test, deploy to QA, test, deploy to staging (hitting the downtime window), test, deploy to production (hitting the downtime window), test. So basically 2 days minimum. However, staging is really production and step one is release them at the same time. There’s a couple “but I can only test this kind of change in staging” items left that basically just require telling someone “Figure out how to test it in QA now.” Going to “always release trunk” will remove the whole branch deployment and separate dev and QA environments. So that’s 2 of 4 deployments removed, but then it’s a matter of figuring out cost vs benefit of smashing down parts of that 4:40. I have one proposal in front of me for chucking all the current deploy infrastructure for a Jenkins-driven one, I need to figure out if it is complete enough…
Chime in in the comments below with questions or if there’s some way I could have cut the Gordian knot better. I think we’ve moved about as fast as you can given a lot of legacy code and technical debt (and having a lot of other stuff people need to be working on to keep a service up and get out new functionality). The three step process I used that works, as it does so often, was:
Thanks for reading, and happy releasing!