Although we’re currently engaged in a more radical agile infrastructure implementation, I thought I’d share our previous evolutionary DevOps implementation here (way before the term was coined, but in retrospect I think it hits a lot of the same notes) and what we learned along the way.
Here at NI we did what I’ll call a larval DevOps implementation starting about seven years ago when I came and took over our Web Systems team, essentially an applications administration/operations team for our Web site and other Web-related technologies. There was zero automation and the model was very much “some developers show up with code and we had to put it in production and somehow deal with the many crashes per day resulting from it.” We would get 100-200 on-call pages a week from things going wrong in production. We had especially entertaining weeks where Belgian hackers would replace pages on our site with French translations of the Hacker’s Manifesto. You know, standard Wild West stuff. You’ve been there.
Step One: Partner With The Business
First thing I did (remember this is 2002), I partnered with the business leaders to get a “seat at the table” along with the development managers. It turned out that our director of Web marketing was very open to the message of performance, availability, and security and gave us a lot of support.
This is an area where I think we’re still ahead of even a lot of the DevOps message. Agile development carries a huge tenet about developers partnering side-by-side with “the business” (end users, domain experts, and whatnot). DevOps is now talking about Ops partnering with developers, but in reality that’s a stab at the overall more successful model of “biz, dev, and ops all working together at once.”
Of course we had ongoing relationships with the development managers, but it was obvious within a very short timeframe that wasn’t going to be the end solution. Sure, there was a little ossified opinion syndrome about the role of Ops (they are the blue collar “dirty people,” as we all know) in relation to the developers – “You just move files, how hard is it?” – but mostly it was that our dev groups take a huge amount of very direct guidance and project management from our biz teams, and they would always be telling us “I don’t have time to load test and my business analyst doesn’t think it’s important and says we have to release this week and I just do what they tell me.” The devs weren’t empowered in and of themselves to do a lot of the stuff we needed them to do. That made it critical to get direct business engagement. Some environments (like my new group) have extremely empowered dev teams that are basically writing specs themselves; in that case probably starting with dev partnership would work well.
I’ll also note that being aligned with the business helped us know what we should work on. One year they’d be more interested in performance, another they’d be more interested in availability, etc. That gave us the opportunity to be heroes – for example, when we switched over from our old load balancing software to our Citrix Netscalers and our global Web site performance improved by 50%. We demonstrated “hey, some things Ops can give you, no programmers needed.” So then to a business guy you’re just as valuable as a programmer. They just want people to give them what they want, and if that’s you then you’re sitting pretty.
Step Two: Automate Some Stuff
Here’s where I have to give huge credit to my partner in crime Peco. I was making sure we were fixing root causes of problems and focusing on metrics to get us more people hired/permission to spend group time on improvement projects to ease the hundred pages a week problem. Peco’s a code hacker at heart so he quickly hit the two biggest pain points we needed to automate: Java application deployment and automated response to pages. Our app deployment was manual up to that point – devs would hand over long incoherent install docs and we’d go perform them manually on multiple redundant servers. He quickly wrote some scripts and an SSH framework and we made the developers structure their artifacts and fill out a config file that allowed us to take their apps from revision control and deploy them in an automated fashion to redundant dev, test, or production systems. The devs loved it – we had a couple yap for the first week about “but but I like .wars not .ears” but the immense benefits became immediately apparent to them, largely thanks to our already pretty mature change control/release process. Only ops could do deploys to test and we had big monthly Friday night releases for production code that devs had to attend, so dev time spent messing with dev deploys, waiting for test deploys, and staying up all night for production deploys plummeted. A month later it was one of those “it’s always been this way” steps.
After that we always wished we had time to go to a fuller automated CM solution, and as things like Puppet and Chef appeared we eyed them with interest, but we stemmed a huge part of the tide with that one Perl program.
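To make the config-driven deploy idea concrete, here is a minimal sketch in Python rather than the original Perl. It is not the actual program: the `AppConfig` fields, the `plan_deploy` function, and the `/opt/deploy/bin/fetch` and `/opt/deploy/bin/install` helper scripts are all hypothetical names invented for illustration. It only builds the list of `ssh` commands for every redundant host in an environment, so it can be read and tested without touching a server.

```python
import shlex
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AppConfig:
    """What a dev team fills in once per application (all fields hypothetical)."""
    name: str
    artifact: str                 # packaged app, e.g. "orders.ear"
    repo_url: str                 # where the artifact lives in revision control
    revision: str                 # revision or tag to deploy
    hosts: Dict[str, List[str]]   # environment name -> redundant servers


def plan_deploy(cfg: AppConfig, env: str) -> List[List[str]]:
    """Return the ssh commands needed to deploy cfg to every host in env."""
    if env not in cfg.hosts:
        raise ValueError(f"unknown environment: {env}")
    steps = []
    for host in cfg.hosts[env]:
        # Hypothetical helper scripts assumed to be installed on each server.
        fetch = ["/opt/deploy/bin/fetch", cfg.repo_url, cfg.revision, cfg.artifact]
        install = ["/opt/deploy/bin/install", cfg.name, cfg.artifact]
        for remote in (fetch, install):
            # Quote each argument so the remote shell sees it as one word.
            remote_cmd = " ".join(shlex.quote(arg) for arg in remote)
            steps.append(["ssh", host, remote_cmd])
    return steps


if __name__ == "__main__":
    example = AppConfig(
        name="orders",
        artifact="orders.ear",
        repo_url="https://vcs.example.com/orders",
        revision="1042",
        hosts={"test": ["test1", "test2"], "prod": ["prod1", "prod2", "prod3"]},
    )
    for step in plan_deploy(example, "test"):
        print(" ".join(step))
```

Separating "plan" from "execute" is a design choice of this sketch: the same plan can be printed for review in a change control meeting or handed to `subprocess.run` one step at a time.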
Peco also wrote a framework to integrate with our monitoring solution and go do things like ‘restart the server’ without operator intervention. Sure, there are downsides to that, but when you have people quitting your team because their on-call week involves restarting app servers all night every night, it serves a clear purpose.
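A rough sketch of what automated response to pages can look like, again in Python and again hypothetical: the `AutoResponder` class, the check names, and the action callables are invented for illustration and do not describe the framework above. The cap on automatic actions per host is an assumption added here as one way to limit the downsides: if the same server needs restarting too often inside a time window, the alert goes to a human instead.

```python
from collections import defaultdict, deque
from typing import Callable, Dict


class AutoResponder:
    """Maps monitoring checks to automatic actions, with a per-host cap."""

    def __init__(self, actions: Dict[str, Callable[[str], None]],
                 max_auto: int = 3, window_s: float = 3600.0):
        self.actions = actions      # check name -> function taking a host
        self.max_auto = max_auto    # automatic actions allowed per window
        self.window_s = window_s    # window length in seconds
        self.history = defaultdict(deque)  # (check, host) -> action times

    def handle(self, check: str, host: str, now: float) -> str:
        """Return "auto" if an action was run, "page" to wake a human."""
        action = self.actions.get(check)
        if action is None:
            return "page"           # no known automatic fix
        recent = self.history[(check, host)]
        # Forget actions that have aged out of the window.
        while recent and now - recent[0] > self.window_s:
            recent.popleft()
        if len(recent) >= self.max_auto:
            return "page"           # flapping: stop restarting, escalate
        recent.append(now)
        action(host)
        return "auto"


if __name__ == "__main__":
    responder = AutoResponder(
        {"appserver_down": lambda host: print(f"restarting app server on {host}")},
        max_auto=2, window_s=600,
    )
    for t in (0, 10, 20):
        print(t, responder.handle("appserver_down", "web1", now=t))
```

The caller passes `now` explicitly (for example `time.time()`), which keeps the logic easy to test without waiting on a real clock.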
These targeted automation steps moved us from chronic firefighting mode and pushed us over the hump to where we could actually try to make some more powerful changes.
Step Three: Value Added Services
Now that we had a handle on operations, we started to look for areas that no one was really addressing that needed help. Performance/availability and security were clear weak spots and we had people with expertise and energy in both. So we started what we called ‘practice areas’ and started in on tools and training in those areas. First we focused on detection and metric collection – showing the real performance of our Web site and individual apps and scanning for security holes.
The partnering with the business and dev managers was bearing fruit; in fact, a Web steering project was established that had as its primary business goals for IT: performance, availability, efficiency, TCO, and agility. This created pressure on individual app teams to address the problems we could now detect, and we were ready with tools to help them fix them – for example, a shout out to Opnet’s Panorama, an excellent application performance management (APM) tool that collects OS, software, and deep-dive Java metrics from across a distributed system and correlates those metrics. We showed them “look, your app running in production isn’t a mystery – you can get to it and look into it with this tool, and when there’s a slowdown or failure you can get to where it is.” Similarly, we implemented Splunk and gave developers constant access to their own log files.
From one point of view, it was us “passing the buck.” “Here, look at your own log files.” But the truth is that the developers are the ones that wrote those cryptic error messages and know what they mean, and which errors are “normal” (to this day, whenever a developer tells me an error in their log file is “normal” I have to restrain the urge to punch them in the face). By taking the tack of empowering devs to inspect their apps in production, in conjunction with producing pressure for them to improve operational aspects of their software, we were able to get the right people to work on the problems to really fix them.
Step Four: A Process Emerges
By this time the group had grown from the 2-3 people it started with to 7 people, including an operations guy in Hungary for follow the sun support. In general we hired only high quality people; some were more experienced than others but we had high skills and esprit de corps.
There started to be a lot of pressure from production support duties interrupting project work. Once we got large enough we split into a project subteam and an ops subteam to try to prevent that (what O’Reilly’s Time Management for System Administrators book calls “interrupt shielding”). We did it as a “rotation” where admins would go onto “tech rotation” for 3 months and then back into projects, so that we could keep developing everyone and keep it “one team” and not “two teams” (no conceptual barrier causing finger pointing, communication hassle…). We had talked about this when we were 4 people strong, and that wasn’t enough people to make it tenable – and if we had grown to 10+ people I’d have split it into two permanent groups, but for that middle area this model worked for us.
From the beginning, we had been advocating the idea of “you need to come and get an admin assigned to your project at kickoff, who will work with you to get all the systems stuff worked out.” This worked OK but was hit or miss – there was no formal project initiation process to hook into and so sometimes it happened, sometimes it didn’t, sometimes it happened late. But on projects where it happened, we had a proto-DevOps style ops person working with devs on a specific project. Sadly though since there were so many projects, usually it wasn’t embedded enough to work truly hand in hand (one ops person would be working on 5 projects at a time…).
Over time we did this enough that patterns began to emerge – what things need to happen in different kinds of engagements. We then created a systems development process where each project needs to engage a Web Admin as part of their project, and it has a detailed playbook on what both the systems engineer and the developers need to do to make sure all the performance/availability/deployment/security/etc “nonfunctional requirements” are addressed.
Complications: Things That Didn’t Work Well
Keep in mind we didn’t have the full DevOps vision in mind when we started; we were evolving in the direction that seemed right to us as we got the opportunity over time. So this all happened over years, whereas now I think with this blueprint in mind it could be implemented much more quickly even in an existing org. These were the most significant issues that recurred while we were trying to do all this.
- The Web Admins are a Web application administration group, but have to interact with other teams in our pretty large infrastructure group (UNIX admins, Windows admins, DBAs, Notes admins, network admins…). We tried to represent all the infrastructure groups to the programmers but, you know ops groups, there were constant outbreaks of squabbling. I talk about this a bit in “Before DevOps, Don’t You Need OpsOps?” For a variety of reasons our relationships with the business (usually very good) and the devs (good and rocky alternating) were always better than our relationships with the other Infrastructure groups (always grumpy).
- Our software dev process hasn’t been very mature traditionally and has tended to scope projects down to single programmer size. This has been changing over the last 2 years as they have embraced agile but there’s still a tendency to have “many small projects.” As there were about 50 programmers and about 7 Web Admins (of which only 5 were available for projects), that caused a significant problem with trying to really partner with them on projects. We tried focusing on “only large projects”, but still you’d have projects where the ops person just didn’t give the project enough time and attention. And we took on a lot of major complex platforms (Vignette VCM, Oracle SOA, etc.) that ended up leeching off a pretty constant amount of admin time to master and support (0.5 person per platform in the steady state). So the desire to do full DevOps partnering was always crushed under the wheels of a 1:10 ops:dev ratio.
- It was also extremely harsh at a management level – our corporate culture, at least in IT, is very “relationship based” and not strongly management-driven. I was used to “transactional based” organizations, where you go to a group and say “I need something that is your group’s job, so here’s the request” and they do it. Here you kinda have to go ask/sweet-talk/convince other groups to perform tasks. Which is fine, it’s an approach that other “consensus-driven” companies have scaled up with (HP, for example). But in this role, we had to maintain relationships with a truly huge number of teams – a bunch of infrastructure teams, a bunch of Web programmer teams, and a bunch of Web business teams. That level of serving/being a customer of 15 or so teams was just barely sustainable, but when we got tapped with running some internal systems like our company’s SOA suite and internal collaboration system, and ten other customer and programming teams got dropped into the mix, it broke down (or at least that’s where I reached the limit of my modest management skills).
I think this is a place agile often hits rough patches – you bring in people from various disciplines to get the job done, but if the company’s inherent organizational structure requires many separate org structures to do negotiation and resource management to make that work, you hit management paralysis (since the number of relationships to maintain grows roughly with the square of the number of teams involved). To scale you have to put some kind of lubricating process in place to enable that to happen, and we didn’t have that.
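A quick back-of-the-envelope check on how fast those relationships pile up. The sketch below assumes, as a simplification, that every pair of teams has to maintain its own relationship, which gives n(n-1)/2 links for n teams. The function name and the team counts are illustrative, not measurements.

```python
def pairwise_links(teams: int) -> int:
    """Number of distinct team-to-team relationships among `teams` teams."""
    return teams * (teams - 1) // 2


if __name__ == "__main__":
    for n in (5, 15, 25):
        print(n, pairwise_links(n))
    # prints:
    # 5 10
    # 15 105
    # 25 300
```

Going from 15 teams to 25 under this assumption nearly triples the number of relationships, which is the kind of jump that turns "barely sustainable" into "broken" unless some process absorbs the coordination load.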
We weren’t operating in an agile process (iterations), but we were pretty successful at several important aspects of agile, especially those people usually mean when they say DevOps – partnering with business and developers, automating routine tasks, etc. And we definitely built a large, skilled, and respected team that continually brought new tools and practices into the organization. We learned a lot of lessons that tell me that agile operations/DevOps is definitely the way to go, and now that we’re getting to implement systems and processes from scratch (more on that later!) we have a lot of understanding of what we need to do/not do.