Tag Archives: agile

Agile Organization: Separate Teams By Discipline

This is the first in the series of deeper-dive articles that are part of Agile Organization Incorporating Various Disciplines. It’s very easy to keep reorganizing and trying different models without actually learning from the process. I’ve worked with all of these so am trying to condense the pros and cons to help people understand the implications of the type of organizational model they choose.

The Separate Team By Discipline Model

Separate teams striated by discipline is the traditional method of organizing technical teams – segmented horizontally by technical skill.  You have one or more development teams, one or more operations teams, one or more QA teams.  In larger shops you have even more horizontal subdivisions – like in an enterprise IT shop, under the banner of Infrastructure you might have a data center team, a UNIX admin team, a SAN team, a Windows admin team, a networking team, a DBA team, a telecom team, applications administration team(s), and so on. It’s more unusual to have the dev side specifically segmented horizontally by tech as well (“Java programmers,” “COBOL programmers,” “Javascript programmers”) but not unheard of; it is more commonly seen as “UX team, services team, backend team…”

In this setup in its purest form, each team takes on tasks for a given product or project inside their team, works on them, and either returns them to the requester or passes them through to yet another team. Team members are not dedicated to that product or effort in any way except for the hours they spend working on the to-do(s). Usually this is manifested as a waterfall approach, as a product or feature is conceived, handed to developers to develop, handed to QA to test, and finally handed to Operations to deploy and maintain.

This model dates back to the mainframe days, where it works pretty well – you’re not innovating on the infrastructure side, the building’s been built, you’re moving your apps into the pre-built apartment units. It also works OK when you have heavy regulation requirements or are constrained to extensively documenting requirements, then design, etc. (government contracts, for example).

It works a lot less well when you need to move quickly or need any kind of change or innovation from the other teams to achieve your goal. Linking up prioritization across teams is always hard but that’s the least of the issues. Teams all have their own goals, their own cadences, and even their own cultures and languages. The oft-repeated warning that “devs are motivated to make changes and ops is motivated by system stability” is a trivial example of this mismatch of goals. If the shared teams are supporting a limited number of products it can work. When there are competing priorities, I’ve seen it be extremely painful.  I worked in a shop where the multiple separate dev teams were vertical (line of business organized) but the operations teams were horizontal (technical specialty organized) – and frankly, trying to produce results despite the impedance mismatch generated by that setup was the nightmare that sent me down the Agile and DevOps path initially.

Benefits of Disciplinary Teams

The primary benefit of this approach is that you tend to get stable teams of like individuals, which allows them to bond with those of similar skills and experience organizational stability and esprit de corps. Providing this sense of comfort ends up being the key challenge of the other organizational approaches.

The second benefit is that it provides a good degree of standardization across products – if one ops team is creating the infrastructure for various applications, then there will be some efficiencies there.  This is almost always at least partially, and sometimes more than entirely, counteracted in value by the fact that not all apps need the same thing and that centralized teams bottleneck delivery and reduce velocity. I remember the UNIX team that would only provide my Web team expensive servers, even though we were highly horizontally scaled and told them we’d rather have twice as many $3500 servers instead of half as many $7000 servers as it would serve uptime, performance, etc. much better. But progress on our product was offered up upon the altar of nominal cost savings from homogeneity.

The third benefit is that if the horizontal teams are correctly cross-trained, it is easier to avoid having single points of failure; by collecting the workers skilled in something into one group, losses are more easily picked up by others in the group. I have to say though, that this benefit is often more honored in the breach in my experience – teams tend to naturally divide up until there’s one expert on each thing and managers who actively maintain a portfolio and drive crosstraining are sadly rare.

Drawbacks of Disciplinary Teams

Conway’s Law is usually invoked to worry about vertical divisions in a product – one part of the UI written by one team, another by another, such that it looks like a Frankenstein’s monster of a product to the end user. However, the principle applies to horizontal divisions as well – these produce more of a Human Centipede, with the issue of one phase becoming the input of the next. The front end may not show any clear sign of division, but the seams in quality, reliability, and agility of the system can grow like a cancer underneath, which users certainly discover over time.

This approach promotes a host of bad behaviors. Pushing work to other people is always tempting, as is taking shortcuts if the results of those shortcuts fall on another’s shoulders. With no end to end ownership of a product, you get finger pointing and no one taking responsibility for driving the excellence of the service – and without an overall system thinking perspective, attempts by one of the teams in that value chain to drive an improvement in their domain often have unintended effects on the other teams in that chain that may or may not result in overall improvement. If engineers don’t eat their own dog food, but pass it on to someone else, then chronic quality problems often result. I personally spent years trying to build process and/or relationships to try to mitigate the dev->QA->ops passing of issues downstream with only mixed success.

Another way of stating this is that shared services teams always provide a route to the tragedy of the commons. Competing demands from multiple customers and the need for “nonfunctional” requirements (performance, availability, etc.) could potentially be all reconciled in priority via a strong product organization – but in my experience this is uncommon; product orgs tend to not care about prioritization of back end concerns and are more feature driven. Most product orgs I have dealt with have been more or less resistant to taking on platform teams, managing nonfunctional requirements, and otherwise interacting with that half of demands on the product. Without consistent prioritization, shared teams then become the focus of a lot of lobbying by all their internal customers trying to get resource. These teams are frequently understaffed and thus a bottleneck to overall velocity.

Ironically, in some cases this can be beneficial – technically, focusing on cost efficiency over delivering new value is always a losing game, but some organizations are self-unaware enough that they have teams continuing to churn out “stuff” without real ROI associated (our team exists therefore we must make more), in which case a bottleneck is actually helpful.

Mitigations for the weaknesses of this approach (abdication of responsibility and bottlenecking constraints) include:

  1. Very strong process guidance. “If only every process interface is 100% defined, then this will work”, the theory goes, just as it works on a manufacturing line.  Most software creation, however, is not similar to piecing components together to make an iPod. In one shop we worked for years on making a system development process that was up to this task, but it was an elusive goal. This is how, for example, Microsoft makes the various Office products look the same in partial defiance of Conway’s Law – books and books of standards.
  2. Individuals on shared teams with functional team affinities. Though not going as far as embedding into a product team, you can have people in the shared teams who are the designated reps for various client teams. Again, this works better when there is a few-to-one instead of a many-to-one relationship.  I had an ops team try this, but it was a many-to-one environment and each individual engineer ended up with three different ownership areas, which was overwhelming. In addition, you have to be careful not to simply dedicate all of one sort of work to one person, as you then create many single points of failure.
  3. Org variation: Add additional crossfunctional teams that try to bridge the gap.  At one place I worked, the organization had accepted that trying to have the systems needs of their Web site fulfilled by six separate infrastructure teams was not working well, so they created a “Web systems” team designed to sit astride those, take primary responsibility, and then broker needs to the other infrastructure teams. This was an improvement, and led to the addition of a parallel team responsible for internal apps, but never really got to the level of being highly effective. In addition those were extremely high-stress roles, as they bore responsibility but not control of all the results.

Conclusion

Though this is the most typical organization of technology teams historically, that history comes from a place much different than many situations we find ourselves in today. The rapid collaboration approach that Agile has brought us, and the additional understanding that Lean has given us in the software space, tells us that though this approach has its merits it is much overused and other approaches may be more effective, especially for product development.

Next, we’ll look at embedded crossfunctional service teams!

Filed under Agile, DevOps

Agile Austin asked me to help re-launch their blog, so I’ve contributed a piece on “What Is DevOps?” for them!

by | March 16, 2014 · 9:58 am

Agile Organization Incorporating Various Disciplines

I started thinking about this recently because there was an Agile Austin QA SIG meeting that I sadly couldn’t attend entitled “How does a QA manager fit into an Agile organization?” which wondered about how to fit members of other disciplines (in this case QA) in with agile teams. Over the last couple years I’ve tried this, and seen it tried, in several ways with DevOps, QA, Product Management, and other disciplines, and I thought I’d elaborate on the pros and cons of some of these approaches.

Two Fundamental Discoveries

There are two things I’ve learned from this process that are pretty universal in terms of their truth.

1. Conway’s Law is true. To summarize, it states that a product will tend to reflect the structure of the organization that produces it. The corollary is that if your organization has divisions which are of no practical value to the product’s consumer, you will be creating striations within your product that impact client satisfaction. Hence basic ITSM and Agile doctrines on creating teams around owning a service/product.

2. People want to form teams and stay with them. This should be obvious from basic psychology/sociology, but if you set up an organization that is too flexible it strongly degrades the morale of your workers. In my previous role we conducted frequent engineer satisfaction surveys and the most prominent truth drawn from them is that the more frequently people are asked to change roles, reorg, or move to different teams, the less happy they are. Even people who want to move around to new challenges frequently are happier if they are moving to those new challenges with a team they’ve had an opportunity to move through Tuckman’s stages of group development with.

I have seen enough real-world quantified proof of both of these assertions that I will treat them as assumptions going forward.

Organizational Options

We tried out all four of these models within the same organization of high performing engineers and thus had a great opportunity to compare their results.

Separate Teams

When we started, we had the traditional model of separate teams which would hand off work to each other.  “Dev,” “Ops,” “QA,” “Product” were all under separate management up through several levels and operated as independent teams; individual affinities with specific products were emergent and simply matters of convenience (e.g. “Oh, he knows a lot about that BI stuff, let him handle that request”) and not a matter of being dedicated to specific product(s).

Embedded Crossfunctional Service Teams

Our first step away from the pure separate team model was to take those separate teams and embed specific members from them into service oriented teams, while still having them report to a manager or director representing their discipline. In some cases, the disciplinary teams would reserve some number of staff for tool development or other cross-cutting concerns. So QA, for example, had several engineers assigned to each product team, even though they were regarded as part of the permanent QA org primarily.

We (very loosely) considered our approach to be decentralized and microservice based; Martin Fowler is doing a good article in installments on Microservice Architecture if you want more on that topic.

Fully Integrated Service Teams

With our operations staff we went one step farther and simply permanently assigned them to product teams and removed the separate layer of management entirely. Dev and “DevOps” engineers reported to the same engineering manager and were a permanent part of a given product team. Any common tooling needed was created by a separate “platform” engineering team which was similarly integrated.

Project Based Organization

Due to the need to surge effort at times, we also had some organizations that were project, not product, based. Either individual engineers would be pulled from existing teams or entire teams would be pulled into a short term (1, 2, 3 month) effort to try to make significant headway across multiple products, and the project organization would then dissolve afterwards.

Hmm, this looks like it’s getting big (and I need to do some diagrams).  I’ll break it up into separate articles for each type of org and its pros and cons, and then a conclusion.

Filed under Agile, DevOps

Is It a Bug Or A Feature? Who Cares?

Today I’ve been treated to the about 1000th hour of my life debating whether something someone wants is a “bug” or a “feature.”  This is especially aggravating because in most of these contexts where it’s being debated, there is no meaningful difference.

A feature, or bug, or, God forbid, an “enhancement” or other middle road option, is simply a difference between the product you have and the product you want. People try to declare something a “bug” because they think that should justify a faster fix, but it doesn’t and it shouldn’t. I’ve seen so many gyrations of trying to qualify something as a bug. Is it a bug because the implementation differs from the (likely quite limited and incomplete) spec or requirements presented?  Is it a bug because it doesn’t meet client expectation?

In a backlog, work items should be prioritized based on their value.  There’s bugs that are important to fix first and bugs it’s important to fix never.  There’s features it’s important to have soon and features it’s important to have never.  You (and your product people) need to be able to reconcile the cost/benefit/risk/etc across any needed change and to single stack-rank prioritize them for work in that order regardless of the imputed “type” of work it is.  This is Lean/Agile 101.
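To make the single stack rank concrete, here is a minimal sketch in Python. The `WorkItem` fields and the value-per-effort score are my assumptions for illustration; a real ranking would also fold in risk and whatever else your product people weigh. The point is only that the `kind` label never enters the sort.

```python
from dataclasses import dataclass


@dataclass
class WorkItem:
    title: str
    kind: str      # "bug", "feature", "chore": kept for reporting, never for ranking
    value: float   # hypothetical estimate of benefit to users
    effort: float  # hypothetical estimate in engineer-days, must be > 0


def stack_rank(items):
    """Return items ordered by value per unit of effort, highest first."""
    return sorted(items, key=lambda item: item.value / item.effort, reverse=True)


# Made-up backlog entries, purely illustrative.
backlog = [
    WorkItem("Typo on admin page", "bug", value=1, effort=1),
    WorkItem("Checkout fails on mobile", "bug", value=40, effort=5),
    WorkItem("Export reviews to CSV", "feature", value=30, effort=10),
    WorkItem("Automate weekly data fix", "chore", value=12, effort=2),
]

ranked = stack_rank(backlog)
for position, item in enumerate(ranked, start=1):
    print(position, item.title, f"({item.kind})")
```

With these numbers one bug lands at the top and another at the bottom, with a chore and a feature in between, which is exactly how it should be able to come out.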

Now, something being a bug is important from an internal point of view, because it exposes issues you may have with your problem definition, or coding, or QA processes. But from a “when do we fix it” point of view, it should have absolutely no relation. Fixing a bug first because it’s “wrong” is some kind of confused version of punishment theory. If you’re distinguishing between the two meaningfully in prioritization, it’s just a fancy way of saying you like to throw good money after bad without analysis.

So stop wasting your life arguing and philosophizing about whether something in your backlog is a bug or enhancement or feature.  It’s a meaningless distinction; what matters is the value that change will convey to your users and the effort it will take to perform it.

I’m not saying one shouldn’t fix bugs – no one likes a buggy product.  But you should always clearly align on doing the highest leverage work first, and if that’s a bug that’s great but if it’s not, that’s great too.  What label you hang on the work doesn’t alter the value of the work, and you should be able to evaluate value, or else what are you even doing?

We have a process for my product team – if you want something that’s going to take more than a week of engineer time, it needs justification and to be prioritized amongst all the other things the other stakeholders want.  Is it a feature?  A bug?  A week worth of manual labor shepherding some manual process? It doesn’t matter.  It’s all work consuming my high value engineers, and we should be doing the highest value work first.  It’s a simple principle, but one that people manage to obscure all too often.

Filed under General

Scrum for Operations: How We Got Started

Welcome to the newest article in Scrum for Operations. I started this series when I was working for NI. But now I’m going through the same process at BV so time to pick it back up again! Like my previous post on Speeding Up Releases, I’m going to go light on theory and heavy on the details, good and bad, of how exactly we implemented Agile and DevOps and where we are with it.

Here at BV (Bazaarvoice), the org had adopted Agile wholesale just a couple months before I started. We also adopted DevOps shortly after I joined by embedding ops folks in the product teams.  Before the Agile/DevOps implementation there was a traditional organization consisting of many dev teams and one ops team, with all the bottlenecking and siloing and stuff that’s traditional in that kind of setup.  Newer teams (often made up of newly hired engineers, since we were growing quickly) that started out on the new DevOps model picked it up fine, but in at least one case we had a lot of culture change to do with an existing team.

Our primary large legacy team is called the PRR team (Product Ratings and Reviews) after the name of their product, which now does lots more than just ratings and reviews, but naturally marketing rebranding does little to change what all the engineers know an app is called. Many of the teams working on emerging greenfield products that were still in development had just one embedded ops engineer, but on our primary production software stack, we had a bunch. PRR serves content into many Internet retailers’ pages; 450 million people see our reviews and such. So for us scalability, performance, monitoring, etc. aren’t a sideline, they’re at least half of the work of an engineering team!

This had previously been cast as “a separate PRR operations team.” The devs were used to tossing things over the wall to ops and punting on the responsibility even if it was their product, and the ops were used to a mix of firefighting and doing whatever they wanted when not doing manual work the devs probably should have automated.

I started at BV as Release Manager, but after we got our releases in hand, I was asked to move over to lead the PRR team and take all these guys and achieve a couple major goals, so I dug in.

Moving Ops to Agile

I actually started implementing Agile with the PRR Ops team because I managed just them for a couple months before being given ownership of the whole department. I had worked closely with many of them before in my release manager role so I knew how they did things already. The Ops team consisted of 15 engineers, two-thirds of whom were in Ukraine, outsourced contractors from our partner Softserve.

At start, there was no real team process.  There were tickets in JIRA and some bigger things that were lightly project managed. There was frustration with Austin management about the outsourcers’ performance, but I saw that there was not a lot of communication between the two parts of the team. “A lot of what’s going bad is our fault and we can fix it,” I told my boss.

Standups

As the first process improvement step, I introduced daily standups (in Sep 2012). These were made more complicated by the fact that we have half of our large team in Ukraine; as a result we used Webex to conduct them. “Let’s do one Austin standup and one Ukraine standup” was suggested – but I vetoed that since many of the key problems we were facing were specifically because of poor communication between Austin and Ukraine. After the initial adjustment period, everyone saw the value of the visibility it was giving us and it became self-sustaining.  (With any of these changes, you just have to explain the value and then make them do it a little while “as a pilot” to get it rolling. Once they see how it works in practice and realize the value it’s bringing, the team internalizes it and you’re ready for the next step.) Also because of the large size and international distribution I did the “no-no” of writing up the standup and sending the notes out in email.  It wasn’t really that hard; here’s an example standup email from early on:

Subject: PRR Infrastructure Daily Standup Notes 11/05/2012

Individual Standups
(what you did since last standup, what you will do by the next standup, blockers if any)

Alexander C – did: AVI-93 dev deploy testing of c2, release activity training; will do: finish dev c2, start other clusters
Anton P – did: review AVI-271 sharded solr in AWS proxy, AVI-282 migrating AWS to solr sharding; will do: finish and test
Bryan D – did: Hosted SEO 2.0 discussion may require Akamai SSL, Tim’s puppet/vserver training, DOS-2149 BA upgrade problems, document surgical deploy safety, HOST-71 lab2 ssh timeout, AVI-790, 793 lab monitoring, nexus errors; will do: finish prep Magpie retro, PRR sprint planning, Akamai tickets for hosted SEO, backlog task creation.
Larry B – did: MONO-107,109 7.3 release branch cut, release training; will do: AVI-311 dereg in DNS (maybe monitoring too?)
Lev P – did: deploy script change testing AVI-771; will do: more of that
Oleg K – did: review AVI-676 changes, investigate deployment runbooks/scripts for solr sharding AVI-773; to do: testing that, AVI-774 new solr slaves
Oleksandr M – did: out Friday after taking solr sharding live; will do: prod cleanup AVI-768, search_engine.xml AVI-594
Oleksandr Y – did: AVI-789 BF monitoring, had to fix PDX2 zabbix; will do: finish it and move to AVI-585 visualization
Robby M – did: testing AVI-676 and communicating about AWS sharding; will do: work with Alex and do AVI-698 c7 db patches for solr sharding
Sergii V – did: AVI-703 histograms, AVI-763 combining graphs; will do: continue that and close AVI-781 metrics deck
Serhiy S – did: tested aws solr puppet config AVI-271, CMOD stuff AVI-798, AVI-234
Taras U – did: tested BVC-126599 data deletion. Will do: pick up more tickets for testing
Taras Y – did: AVI-776 black Friday scale up plan, AVI-762 testing BF scale up; will do: more scale up testing
Vasyl B – did: MONO-94 GTM automation to test; will do: AVI-770 ftp/zabbix thing
Artur P – did: AVI-234 remove altstg environment, AVI-86 zabbix monitoring of db performance “mikumi”; will do: more on those

For context, while this was going on we were planning for Black Friday (BF) and executing on a large project to shard our Solr indexes for scaling purposes. The standup itself brought loads of visibility to both sides of the team and having the emails brought a lot of visibility to managers and stakeholders too. It also helped us manage what all the outsourcers were doing (I’ll be honest, before that sometimes we didn’t know what a given guy in Ukraine was doing that week – we’d get reports in code later on, but…).

I took the notes in the standup straight into an email and it didn’t really slow us down (I cheated by having the JIRA project up so I could copy over the ticket numbers). Because of the number of people, the Webex, and the language barrier the standups took 30 minutes. Not the fastest, but not bad.
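I typed those notes by hand, but the format is regular enough that it could be generated. Here is a rough sketch, assuming a hypothetical `StandupEntry` record and `format_standup` helper (neither is something we actually used), that reproduces the layout of the email above:

```python
from dataclasses import dataclass


@dataclass
class StandupEntry:
    name: str
    did: str
    will_do: str = ""
    blockers: str = ""


def format_standup(team, date, entries):
    """Render standup notes in the same layout as the example email."""
    lines = [
        f"Subject: {team} Daily Standup Notes {date}",
        "",
        "Individual Standups",
        "(what you did since last standup, what you will do by the next standup, blockers if any)",
        "",
    ]
    for entry in entries:
        parts = [f"did: {entry.did}"]
        # Only mention future work and blockers when there is something to say.
        if entry.will_do:
            parts.append(f"will do: {entry.will_do}")
        if entry.blockers:
            parts.append(f"blockers: {entry.blockers}")
        lines.append(f"{entry.name} – " + "; ".join(parts))
    return "\n".join(lines)


notes = format_standup(
    "PRR Infrastructure",
    "11/05/2012",
    [StandupEntry("Lev P", "deploy script change testing AVI-771", "more of that")],
)
print(notes)
```

The value is in the shared did / will do / blockers shape, not the tooling; a text editor and a JIRA tab worked fine for us.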

Backlog

After everyone got used to the standups, I introduced a backlog (maybe 2 weeks after we started the standups). We had JIRA tickets with priorities before, but I added a Greenhopper Scrum style backlog. Everyone got the value of that immediately, since “we have 200 P2 tickets!” is obviously Orwellian at best. When stakeholders (my boss, other folks) had opinions on priorities we were able to bring up the stack-ranked backlog and have a very clear discussion about which items their request was more or less urgent/important than. (Yes, there were a couple yelling matches about “it’s meaningless to have five ‘top priorities’!” before we had this.) Interrupt tickets would just come in at the top.

Here’s a clip of our backlog just to get the gist of it…

[Screenshot: a clip of the backlog]

All the usual work… just in a list.  “Work it from the top!” We still had people cherry-picking things from farther down because “I usually work on builds” or “I usually work on metrics” but I evangelized not doing that.

Swimlanes

Using this format also gave me insight into who was doing what via the swimlanes view in JIRA.  When we’d do the standup we started going down in swimlane order and I could ask “why don’t I see that ticket?” or see other warning signs like lots of work in progress.  An example swimlane:

[Screenshot: an example swimlane]

This helped engineers focus on what they were supposed to be doing, and encouraged them to put interrupts into the queue instead of thrashing around.
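The "lots of work in progress" warning sign is easy to express in code. The following is only a sketch: the ticket tuples, the names, and the limit of two are all made up for illustration, and it does not call any JIRA API. I read this straight off the board rather than scripting it.

```python
from collections import defaultdict

WIP_LIMIT = 2  # hypothetical threshold; pick whatever suits your team


def swimlanes(tickets):
    """Group (key, assignee, status) tuples into one lane per assignee."""
    lanes = defaultdict(list)
    for key, assignee, status in tickets:
        lanes[assignee].append((key, status))
    return dict(lanes)


def too_much_wip(lanes, limit=WIP_LIMIT):
    """Return the assignees with more than `limit` tickets in progress."""
    flagged = []
    for assignee, items in lanes.items():
        in_progress = sum(1 for _, status in items if status == "In Progress")
        if in_progress > limit:
            flagged.append(assignee)
    return sorted(flagged)


# Made-up tickets and people, purely illustrative.
tickets = [
    ("OPS-1", "Engineer A", "In Progress"),
    ("OPS-2", "Engineer A", "In Progress"),
    ("OPS-3", "Engineer A", "In Progress"),
    ("OPS-4", "Engineer B", "In Progress"),
    ("OPS-5", "Engineer B", "To Do"),
]

lanes = swimlanes(tickets)
print(too_much_wip(lanes))
```

Anyone who shows up in that list is a good person to start the standup conversation with.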

Sprints

Once we had the backlog, it was time to start sprinting! We had our first sprint planning meeting in October and I explained the process. They actually wanted to start with one week sprints, which was interesting – in the dev world often times you start in with really long (4-6 week) sprints and try to get them shorter as you get more mature.  In the ops world, since things are faster paced, we actually started at a week and then lengthened it out later once we got used to it.

The main issue that troubled people was the conjunction of “interrupt” tickets with proactive implementation tickets.  This kind of work is why lots of people like to say “Ops should use kanban.”

However, I learned three things doing all this.  The first is that for our team at least, the lion’s share of the work was proactive, not reactive, especially if you use a 1-2 week lookahead. “Do they really need that NOW, or just by next sprint?” Work that used to look interrupt driven under a “chaos plus big projects” process started to look plannable. That helped us control the thrash of “here’s a new urgent request” and resist it breaking the current sprint where possible.

Also, the amount of interrupt work varies from day to day but not significantly for a large team over a 1-2 week period.  This means that after a couple sprints, people could reliably predict how many points of stories they could pull because they knew how much time got pulled to interrupt work on average. This was the biggest fear of the team in doing sprint planning – that interrupt work would make it impossible to plan – and there was no way to bust through it except for me to insist that we do a couple sprints and reevaluate.  Once we’d done some, and people learned to estimate, they got comfortable with it and we’ve been scrumming away since.
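The arithmetic behind that prediction can be sketched in a few lines. Everything here is an assumption for illustration: the `forecast_points` helper is hypothetical, the sprint history is invented, and in practice the team did this by feel after a couple of sprints rather than with a script.

```python
from statistics import mean


def forecast_points(history, next_sprint_hours):
    """Estimate how many story points a team can pull into the next sprint.

    history: list of (available_hours, interrupt_hours, points_completed),
    one tuple per past sprint.
    """
    # Average share of the team's time that interrupts consumed.
    interrupt_share = mean(interrupts / available for available, interrupts, _ in history)
    # Points delivered per hour that was actually left for planned work.
    planned_hours = sum(available - interrupts for available, interrupts, _ in history)
    points_per_hour = sum(points for _, _, points in history) / planned_hours
    return round(next_sprint_hours * (1 - interrupt_share) * points_per_hour)


# Made-up history: interrupts swing between 20% and 30% of the team's time.
history = [(400, 120, 56), (400, 80, 64), (400, 100, 60)]
print(forecast_points(history, next_sprint_hours=400))
```

The day-to-day swings wash out: even though interrupts vary sprint to sprint in this invented history, the forecast lands on a stable number the team can commit against.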

And the third thing – kanban is harder to do correctly than Scrum.  Scrum enforces some structure. I’ve seen a lot of teams that “use kanban” and by that they mean “they do whatever comes to mind, in a completely uncontrolled manner,” indistinguishable from how Ops used to do things. Real kanban is subtle and powerful, and requires a good bit of high level understanding to do correctly. Having a structure helped teach my team how to be agile – they may be ready for kanban in another 6 months or so, perhaps, but right now some guard rails while they learn a lot of other best practices are serving us well.

Poker Planning

After the traditional explanation (several times) about what story points are, people started to get it. We used planningpoker.com for the actual voting – it’s a bit buggy but free, and since sprint planning was also 15 people on both (or more) sides of a Webex, it was invaluable.

Velocity

It’s hard to argue with success.  We watched the team velocity, and it basically doubled sprint to sprint for the first 4 sprints; by the end of November we were hitting 150 story points per sprint. I wish I had a screen cap of the velocity from those original sprints; Greenhopper is a little cussed and refuses to show you more than 7 sprints back, but it was impressive and everyone got to see how much more work they were completing (as opposed to ‘doing’).  I do have one interesting one though:

[Screenshot: velocity chart]

This is our 6th and following sprints; you see how our average velocity was still increasing (a bit spikily) but in that last sprint we finally got to where we weren’t overpromising and underdelivering, which was an important milestone I congratulated the team on. Nowadays their committed/completed numbers are always very close, which is a sign of maturity.
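If you want to track the committed/completed gap yourself, a minimal sketch looks like this. The sprint numbers are invented (they are not our real velocities), and the 10% tolerance is an arbitrary assumption about what counts as "very close."

```python
def completion_ratio(committed, completed):
    """Fraction of the committed story points that were actually completed."""
    if committed <= 0:
        raise ValueError("committed points must be positive")
    return completed / committed


def overpromised(committed, completed, tolerance=0.10):
    """True when the team finished less than (1 - tolerance) of its commitment."""
    return completion_ratio(committed, completed) < 1 - tolerance


# Made-up sprints: (name, committed points, completed points).
sprints = [("Sprint 6", 200, 140), ("Sprint 7", 180, 150), ("Sprint 8", 160, 156)]

for name, committed, completed in sprints:
    ratio = completion_ratio(committed, completed)
    label = "overpromised" if overpromised(committed, completed) else "on target"
    print(f"{name}: {ratio:.0%} of commitment completed ({label})")
```

A ratio that settles near 100% is the maturity signal; raw velocity going up while the ratio stays low just means the team is still guessing.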

Just Add Devs – False Start!

After the holiday rush, they asked me and another manager, Kelly, to take over the dev side of PRR as well, so we had the whole ball of wax (doubling the number of people I was managing). We tried to move them straight to full Scrum and also DevOps the team up using the embedded ops engineer model we were using on the other 2.0 teams.  PRR is big enough there were enough people for four subteams, so we divided up the group into four sprint teams, assigned a couple ops engineers to each one, and said “Go!”.

This went over pretty much like a lead balloon. It was too much change too fast.  Most of the developers were not used to Agile, and trying to mentor four teams at once was difficult. On top of that, most of the ops staff was remote in Ukraine, so each Austin-BV-employee-led team didn’t really consider “those ops guys” part of their team (I look around from my desk and see four other devs but don’t see the ops people… Therefore they’re not on my team.)  And that ops team was used to working as one team and they didn’t really segment themselves along those lines meaningfully either.  Since they were mostly remote, it was hard to break that habit. We tried to manage that for a little while, but finally we had to step back and try again.

Check back soon for Scrum for Operations: Just Add DevOps, where I reveal how we got Agile and DevOps to work for us after all!

Filed under Agile, DevOps

Scrum for Operations: Fitting In As An Ops Engineer

So far in this series, I’ve introduced the basics of Scrum as it generally is used and explained the practices that make it extremely successful. But that’s for developers, right? If you are in operations, what does this mean to you? How do you fit in? For an ops person, the major challenges are mental – you have to reorient your way of thinking, and then things drop into place very well.

I’m writing from the perspective of a Web operations guy, though I’ve done more traditional sysadmin work and managed infrastructure (and dev) teams over time (and started off as a dev, many years ago). Some of my terminology is oriented towards creating a product and keeping a Web site up, but you should be able to conceptually substitute your own kind of system, just as all different kinds of developers, not just Web developers, use and benefit from agile.

The Team

First, “DevOps.” Get an Ops person assigned to the dev team. This is fundamental – if it’s an externalized relationship, where the dev team is making requests of your “Infrastructure org”, you will not be seen as part of the team and your effectiveness will be extremely diminished. You need to be more or less dedicated to this project, not handling it from some shared work queue. This reinforces the fundamental values of Agile. You join the team, and you dedicate yourself to the overall success of the product you are working on. It is this integration, and the trust that arises from shared goals, that will remove a lot of the traditional roadblocks you are used to facing when dealing with a dev team. A real agile team should have similarly embedded product, QA, and UX folks; it’s not a new idea.

You are not “a UNIX guy” or “a DBA” any more.  You are “a member of the Ratings and Reviews team,” and you happen to have a technical specialty. This may seem like sophistry but it’s actually one of the most critical parts of this cultural transformation.

The Backlog

Start thinking of tasks in a customer-feature-facing kind of way for the backlog. For example, no one but you wants to hear about “configuring the SAN,” they want to know that at the end of the sprint “customers will be able to save files to persistent storage.” If what you’re doing doesn’t have any benefit to the end customer – why are you doing it again? You shouldn’t be.

Figure out how to state operational concerns like performance, maintainability, and availability as benefits in the backlog. Some infrastructure stuff belongs in the backlog, other parts of it belong more in standards (e.g. the team Definition of Done now states you have to have monitoring on a new service…). The product manager and dev team aren’t dumb, they will understand that performance, availability, security, ability to release their software, etc. are important goals that have merit in the backlog. The typical story-lingo is “As an X, I want Y so I can do Z.” “As a client, I want my data backed up so that in the case of a disaster, I am minimally affected.” “As an engineer, I want the uptime state of my services monitored so I can ensure customers are being served.”

You will be challenged (and this is good) on items that are “monkey work.”  “I need to go delete log files off that server, so it doesn’t crash.” Hey, why are we doing that?  Why is it manual? Should we have a story for proper log rotation? Need a developer to help? You will see a virtuous cycle develop to “fix things right.” Most of the devs haven’t seen a lot of the demeaning stuff you’re asked to do, and they’ll try to help fix it.
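As a rough illustration of what a “proper log rotation” story might deliver, here is a minimal sketch using Python’s standard logging.handlers.RotatingFileHandler. The file name, size limit, and backup count are assumptions made up for this example; in many shops the same job is done by a system tool such as logrotate instead.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler


def build_rotating_logger(log_path, max_bytes=1024, backup_count=3):
    """Return a logger that caps disk usage instead of growing forever.

    max_bytes and backup_count are illustrative values, not recommendations.
    """
    logger = logging.getLogger("rotation_demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    # Drop handlers left over from earlier calls so lines are not duplicated.
    for old in list(logger.handlers):
        logger.removeHandler(old)
        old.close()
    handler = RotatingFileHandler(log_path, maxBytes=max_bytes, backupCount=backup_count)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


# Hypothetical demo: write far more than max_bytes and see what is left on disk.
log_dir = tempfile.mkdtemp()
demo_logger = build_rotating_logger(os.path.join(log_dir, "service.log"))
for i in range(500):
    demo_logger.info("request %d handled", i)

# Only the live file plus at most backup_count rotated files remain.
log_files = sorted(os.listdir(log_dir))
```

The point is that the cleanup becomes a property of the service rather than a chore someone has to remember before the disk fills up.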

The Sprint

I’ll be honest, the first time I was confronted with the prospect of breaking up systems work into sprints I thought it was very unlikely it could be done. “Things are either short interrupts or long projects, right, that doesn’t make any sense.” And then I did it, and the scales dropped from my eyes. Remember refactoring. Developers doing agile are used to refactoring, while we are used to only having “one bite at the apple” – if we don’t get the systems all 100% right before we unleash the developers on them, then we won’t be able to change them later, right?  Wrong!

In a certain sense, sprint planning is a big load off from traditional planning. Infrastructure folks are used to being asked to provide a granular task breakdown and timeline of 6 months worth of work for some big-bang implementation. Then when reality causes the plan to deviate from that, everyone freaks.  Agile takes horizon planning and institutionalizes it – you only need to be able to specifically plan your next 2 (or so) weeks, and if you can’t do that you need to try harder. What can you implement in 2 weeks that has some kind of value? Get a Tomcat running sprint 1, then tune it sprint 2, then monitor it sprint 3 – don’t bundle everything up into one huge mass.

Testing

Figure out what unit tests mean to you for things you are implementing.  “Nothing” is the wrong answer.  If you’re making a network change, for example, there is something you can do to test that short of “waiting for people to complain.” If you are installing Tomcat on a server and you’re using a framework like Chef or Puppet, they’ll have testing options built in, but even if not, there are certain things you can do to ensure its functionality instead of passing it on and causing lost time and rework when someone else finds out it’s not working right.
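For instance, a freshly installed app server can get a quick smoke test before anyone else touches it. The sketch below is only an illustration and uses nothing beyond the Python standard library; the function names are made up here, and “answers its root URL with a 200” is an assumed definition of “working” that you would replace with whatever your service actually promises.

```python
import socket
import urllib.error
import urllib.request


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_status(url, timeout=2.0):
    """Return the HTTP status code for url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, just not with success
    except OSError:
        return None  # connection refused, timeout, DNS failure, etc.


def smoke_test(host, port):
    """Check that the service listens and answers its root URL with 200."""
    if not port_is_open(host, port):
        return "port closed"
    status = http_status(f"http://{host}:{port}/")
    if status != 200:
        return f"unexpected status: {status}"
    return "ok"
```

Calling smoke_test("localhost", 8080) right after the install (8080 being Tomcat’s default HTTP port) gives a yes/no answer in seconds, which beats hearing about it from the dev who tries to deploy there tomorrow.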

More to come, meditate upon those truths for a bit – ask questions in the comments!

Filed under DevOps

Service Delivery Best Practices?

So… DevOps.  DevOps vs NoOps has been making the rounds lately. At Bazaarvoice we are spawning a bunch of decentralized teams not using that nasty centralized ops team, but wanting to do it all themselves.  This led me to contemplate how to express the things that Operations does in a way that turns them into developer requirements.

Because to be honest a lot of our communication in the Ops world is incestuous.  If one were to point a developer (or worse, a product manager) at the venerable infrastructures.org site, they’d immediately poop themselves and flee.  It’s a laundry list of crap for neckbeards, but spends no time on “why do you want this?” “What value will this bring to your customer?”

So for the “NoOps” crowd who want to do ops themselves – what’s an effective way to state these needs?  I took previous work – ideas from the pretty comprehensive but waterfally “Systems Development Framework” Peco and I did for NI, Chris from Bazaarvoice’s existing DevOps work – and here’s a cut.  I’d love feedback. The goal is to have a small discrete set of areas (5-7), with a small discrete set (5-7) of “most important” items – most importantly, stated in a way so that people understand that these aren’t “annoying things some ops guy wants me to do instead of writing 3l33t code” but instead stated as they should be, customer facing requirements like any other.  They could then be each broken into subsidiary more specific user stories (probably a whole lot of them, to be honest) but these could stand in as “epics” or “pillars” for making your Web thing work right.

Service Delivery Best Practices

Availability

  • I will make my service highly available, so that customers can use it constantly without issues
  • I will know about any problems with my service’s delivery to customers
  • I can restore my service quickly from any disaster or loss
  • I can make my data resistant to corruption from runtime issues
  • I will test my service under routine disruption to make it rugged and understand its failure modes

Performance

  • I know how all parts of the product perform and can measure performance against a customer SLA
  • I can run my application and its data globally, to serve our global customer base
  • I will test my application’s performance and I know my app’s limitations before it reaches them

Support

  • I will make my application supportable by the entire organization now and in the future
  • I know what every user hitting my service did
  • I know who has access to my code and data and release mechanism and what they did
  • I can account for all my servers, apps, users, and data
  • I can determine the root cause of issues with my service quickly
  • I can predict the costs and requirements of my service ahead of time enough that it is never a limiter
  • I will understand the communication needs of customers and the organization and will make them aware of all upgrades and downtime
  • I can handle incidents large and small with my services effectively

Security

  • I will make every service secure
  • I understand Web application vulnerabilities and will design and code my services to not be subject to them
  • I will test my services to be able to prove to customers that they are secure
  • I will make my data appropriately private and secure against theft and tampering
  • I will understand the requirements of security auditors and expectations of customers regarding the security of our services

Deployment

  • I can get code to production quickly and safely
  • I can deploy code in the middle of the day with no downtime or customer impact
  • I can pilot functionality to limited sets of users
  • I will make it easy for other teams to develop apps integrated with mine

We also need to capture that everyone starting out can’t and shouldn’t do all of this… We struggled over whether these were “Goals” or “Best Practices” or what… On the one hand you should only put in e.g. “as much security as you need” but on the other hand there’s value in understanding the optimal condition.
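To make one of the Deployment items concrete: “pilot functionality to limited sets of users” is commonly done with a percentage rollout. Here’s a minimal sketch of one way to do that, under the assumption that you have a stable user ID to hash; the function name, feature name, and user IDs are invented for the example.

```python
import hashlib


def in_pilot(user_id, feature, percent):
    """Return True if user_id falls inside the pilot group for feature.

    Hashing keeps the decision stable: the same user always gets the same
    answer for a given feature, and raising percent only ever adds users.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # bucket is in 0..99
    return bucket < percent


# Hypothetical users and feature name, just to exercise the function.
users = [f"user-{n}" for n in range(1000)]
pilot_10 = {u for u in users if in_pilot(u, "new-review-form", 10)}
pilot_50 = {u for u in users if in_pilot(u, "new-review-form", 50)}
```

Because the bucket comes from a hash rather than a random draw, everyone in the 10% pilot is still in it when you widen to 50%, so nobody sees the feature flicker on and off.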

Filed under DevOps

Addressing the IT Skeptic’s View on DevOps

A recent blog post on DevOps by the IT Skeptic entitled DevOps and traditional ITSM – why DevOps won’t change the world anytime soon got the community a’frothing. And sure, the article is a little simmered in anti-agile hate speech (apparently the Agilistas and cloud hypesters and cowboys are behind the whole DevOps thing and are leering at his wife and daughter and dropping his property values to boot) but I believe his critiques are in general very perceptive and that they are areas we, the DevOps movement, should work on.

Go read the article – it’s really long so I won’t sum the whole thing up here.

Here are the most germane critiques and what we need to do about them. He also has some poor and irrelevant or misguided critiques, but why would I waste time on those?  Let’s take action on the good stuff that can make DevOps better!

Lack of a coherent definition

This is a very good point. I went to the first meeting of an Austin DevOps SIG lately and was treated to the usual debate about “the definition of DevOps” and all the varied viewpoints going into that.  We need a more structured definition to emerge, one that either includes and organizes or excludes the various memetic threads. It’s been done with Agile, and we can do it too. My imperfect definition of DevOps on this site tries to clarify this by showing there are different levels (principles, methods, and practices) that different thoughts about DevOps slot into.

Worry about cowboys

This is a valid concern, and one I share. Here at NI, back in the day programmers had production passwords, and they got taken away for real good reasons.  “Oh, let’s just give the programmers pagers and the root password” is not a responsible interpretation of DevOps but it’s one I’ve heard bandied about; it’s based on a false belief that as long as you have “really smart” developers they’ll never jack everything up.

Real DevOps shops that are uptaking practices that could be risky, like continuous deployment, are doing it with extreme levels of safeguard put into place (automated testing, etc.).  This is similar to the overall problem in agile – some people say “agile? Great!  I’ll code at random,” whereas really you need to have a very high percentage of unit test coverage. And sure, when you confront people with this they say “Oh, sure, you need that” but there is very little constructive discussion or tooling around it. How exactly do I build a good systems + app code integration/smoke test rig? “Uh you could write a bunch of code hooked to Hudson…” This should be one of the most discussed and best understood parts of the chain, not one of the least, to do DevOps responsibly.

We’re writing our own framework for this right now – James is doing it in Ruby, it’s called Sparta, and devs (and system folks) provide test chunks that the framework runs and times in an automated fashion. It’s not a well solved problem (and the big-dollar products that claim to do test automation are nightmares and not really automated in the “devs easily contribute tests to integrate into a continuous deploy” sense).
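To give a flavor of the “contribute a test chunk, have it run and timed” idea, here’s a tiny sketch. To be clear, this is not Sparta (which is Ruby and isn’t shown here); it’s a made-up Python illustration, and run_checks and the two sample checks are hypothetical names.

```python
import time


def run_checks(checks):
    """Run each named check, recording pass/fail and elapsed seconds.

    checks maps a name to a zero-argument callable that raises on failure.
    """
    results = []
    for name, check in checks.items():
        started = time.perf_counter()
        try:
            check()
            outcome, detail = "pass", ""
        except Exception as exc:  # one failing chunk must not stop the run
            outcome, detail = "fail", repr(exc)
        elapsed = time.perf_counter() - started
        results.append(
            {"name": name, "outcome": outcome, "seconds": elapsed, "detail": detail}
        )
    return results


def check_arithmetic():
    # Stand-in for a dev-contributed app check.
    assert 2 + 2 == 4


def check_always_fails():
    # Stand-in for a systems check that finds a problem.
    raise RuntimeError("service did not answer")


results = run_checks(
    {"arithmetic": check_arithmetic, "always_fails": check_always_fails}
)
```

The interesting design choice is the low bar for contributing: a check is just a function that raises when something is wrong, so devs and systems folks can both add them without learning a big tool.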

Team size

Working at a large corporation, I also share his concern about people’s cunning DevOps schemes that don’t scale past a 12 person company.  “We’ll just hire 7 of the best and brightest and they’ll do everything, and be all crossfunctional, and write code and test and do systems and ops and write UIs and everything!” is only a legit plan for about 10 little hot VC funded Web 2.0 companies out there.  The rest of us have to scale, and doing things right means some specialization and risks siloization.

For example, performance testing.  When we had all our developers do their own performance testing, the limit of the sophistication of those tests was “I’ll run 1000 hits against it and time how long it takes to finish.  There, 4 seconds.  Done, that’s my performance testing!”  The only people who think Ops, QA, etc. are such minor skill sets that someone can just do them all are people who are frankly ignorant of those fields. Oh, P.S. The NoOps guys fall into this category, please don’t link them to DevOps.

We have struggled with this.  We’ve had to work out what testing our devs do versus how we closely align with external test teams.  Same with security, performance, etc.  The answer is not to completely generalize or completely silo – Yahoo! had a great model with their performance team, where there is a central team of super-experts but there are also embedded folks on each product team.
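To show why “1000 hits, 4 seconds, done” is a thin test, here’s a small sketch that times each hit and looks at the distribution instead of only the total. The latency numbers are synthetic, invented for the example, and the function names are hypothetical; a real test would pass an actual request function to measure_latencies.

```python
import statistics
import time


def measure_latencies(call, hits=1000):
    """Time each call individually rather than only the whole batch."""
    latencies = []
    for _ in range(hits):
        started = time.perf_counter()
        call()
        latencies.append(time.perf_counter() - started)
    return latencies


def summarize(latencies):
    """Return the total, the median, and tail percentiles, all in seconds."""
    cuts = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
    return {
        "total": sum(latencies),
        "p50": statistics.median(latencies),
        "p95": cuts[94],
        "p99": cuts[98],
        "max": max(latencies),
    }


# Synthetic data: 990 quick responses and 10 very slow ones.
synthetic = [0.002] * 990 + [0.2] * 10
summary = summarize(synthetic)
```

With this made-up data the total still comes to roughly four seconds, which looks fine, while the 99th percentile and the max show that some users are waiting about a hundred times longer than the median. And that is still only the first thing a performance specialist would check.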

Hiring people

Very related to the previous point – again unless you’re one of the 10 hottest Web 2.0 plays and you can really get the best of the best, you are needing to staff your organization with random folks who graduated from UT with a B average. You have to have and manage tiers as well as silos – some folks are only ready to be “level 1 support” and aren’t going to be reading some dev’s Java code.

Traditional organizations and those following ITIL very closely can definitely create structures that promote bad silos and bad tiering. But just assuming everyone will be of the same (high) skill level and be able to know everything is a fallacy that is easy to fall into, since it’s those sorts of elite individuals who are the leading uptakers of DevOps.  Maybe Gene Kim’s book he’s working on (“Visible DevOps” or similar) will help with that.

Tools fixation

Definitely an issue.  An enhanced focus on automation is valuable.  Too many ops shops still just do the same crap by hand day after day, and should be challenged to automate and use tools.  But a lot of the DevOps discussions do become “cool tool litanies” and that’s cart before the horse.  In my terminology, you don’t want to drive the principles out of the practices and methods – tooling is great but it should serve the goals.

We had that problem on our team. I had to talk to our Ops team and say “Hey, why are we doing all these tool implementations?  What overall goal are they serving?”  Tools for the sake of tools are worse than pointless.

Process

It is true that with agile and with DevOps, some folks are using them as an excuse to toss out process.  It should simply be a different kind of process! And you need to take into account all the stuff that should be in there.

A great example is Michael Howard et al. at Microsoft with their Security Development Lifecycle.  The first version of it was waterfall.  But now they’ve revamped it to have an agile security development lifecycle, so you know when to do your threat modeling etc.

Build instead of buy

Well, there are definitely some open source zealots involved with most movements that have any sysadmins involved. We would like to buy instead of build, but the existing tools tend to either not solve today’s problems or have poor ROI.

In IT, we implemented some “ITIL compliant” HP tools for problem tracking, service desk, and software deployment. They suck, and are very rigid, and cost a lot of money, and required as much if not more implementation time than writing something from scratch that actually addressed our specific requirements. And in general that’s been everyone’s experience. The Ops world has learned to fear the HP/IBM/CA/etc systems management suites because it’s just one of those niches that is expensive and bad (like medical or legal software).

But having said that, we buy when we can! Splunk gave us a lot more than cobbling together our own open source thing.  Cloudkick did too. Sure, we tend to buy SaaS a lot more than on prem software now because of the velocity that gives us, but I agree that you need to analyze the hidden costs of building as part of a build/buy – you just need to also see the hidden costs and compromised benefits of a buy.

Risk Control

This simply goes back to the cowboy concern. It’s clearly shown that if you structure your process correctly, with the right testing and signoff gates, then agile/devops/rapid deploys are less risky.

We came to this conclusion independently as well.  In IT, we ran (still do) these Web go lives once a month.  Our Web site consists of 200+ applications  and we have 70 or so programmers, 7 Web ops, a whole Infrastructure department, a host of third party stuff (Oracle and many more)… Every release plan was 100 lines long and the process of planning them and executing on them was horrific. The system gets complex enough, both technically and organizationally, that rollbacks + dependencies + whatnot simply turn into paralysis, and you have to roll stuff out to make money.  When the IT apps director suggested “This is too painful – we should just do these quarterly instead, and tell the business they get to wait 2 more months to make their money,” the light went on in my mind. Slower and more rigorous is actually worse.  It’s not more efficient to put all the product you’re shipping for the month onto a huge ass warehouse on the back of a giant truck and drive it around doing deliveries, either; this should be obvious in retrospect. Distribution is a form of risk management. “All the eggs in one big basket that we’ll do all at one time” is the antithesis of that.

The Future

We started DevOps here at NI from the operations guys.  We’d been struggling for years to get the programmers to take production responsibility for their apps. We had struggled to get them access to their own logs, do their own deploys (to dev and test), let business users input Apache redirects into a Web UI rather than have us do it… We developed a whole process, the Systems Development Framework, that we used to engage with dev teams and make sure all the performance, reliability, security, manageability, provisioning, etc. stuff was getting taken care of… But it just wasn’t as successful as we felt like it could be.  Once we saw that a more integrated model was possible, we realized success was actually an option. Ask most sysadmin shops if they think success is actually a possible outcome of their work, and you’ll get a lot of hedging “well, success is not getting ruined today” kinds of responses.

By combining ops and devs onto one team, by embedding ops expertise onto other dev teams, by moving to using the same tools and tracking systems between devs and ops, and striving for profound automation and self service, we’ve achieved a super high level of throughput within a large organization. We have challenges (mostly when management decides to totally change track on a product, sigh) but from having done it both ways – OMG it’s a lot better. Everything has challenges and risks and there definitely needs to be some “big boy” compatible thinking on DevOps – but it’s like anything else, those who adopt early will reap the rewards and get competitive advantage on the others. And that’s why we’re all in. We can wait till it’s all worked out and drool-proof, but that’s a better fit for companies that don’t actually have to produce/achieve any more (government orgs, people with more money than God like oil and insurance…).

Filed under DevOps

Our Cloud Products And How We Did It

Hey, I’m not a sales guy, and none of us spend a lot of time on this blog pimping our company’s products, but we’re pretty proud of our work on them and I figured I’d toss them out there as use cases of what an enterprise can do in terms of cloud products if they get their act together!

Some background.  Currently all the agile admins (myself, Peco, and James) work together in R&D at National Instruments.  It’s funny, we used to work together on the Web Systems team that ran the ni.com Web site, but then people went their own ways to different teams or even different companies. Then we decided to put the dream team back together to run our new SaaS products.

About NI

National Instruments (hereafter, NI) is a 5000+ person global company that makes hardware and software for test & measurement, industrial control, and graphical system design. Real Poindextery engineering stuff. Wireless sensors and data acquisition, embedded and real-time, simulation and modeling. Our stuff is used to program the Lego Mindstorms NXT robots as well as control CERN’s Large Hadron Collider. When a crazed highlander whacks a test dummy on Deadliest Warrior and Max the techie looks at readouts of the forces generated, we are there.

About LabVIEW

Our main software product is LabVIEW.  Despite my being an electrical engineer by degree, we never used LabVIEW in school (this was a very long time ago, I’ll note; most programs use it nowadays), so it wasn’t till I joined NI that I saw it in action. It’s a graphical dataflow programming language. I assumed that was BS when I heard it. I had so many companies try to sell me “graphical” programming over the years, like all those crappy 4GLs back in the ’90s, that I figured that was just an unachieved myth. But no, it’s a real visual programming language that’s worked like a champ for more than 20 years. In certain ways it’s very bad ass: it does parallelism for you and can be compiled and dropped onto an FPGA. It’s remained niche-ey and hasn’t been widely adopted outside the engineering world, however, due to company focus more than anything else.

Anyway, we decided it was high time we started leveraging cloud technologies in our products, so we created a DevOps team here in NI’s LabVIEW R&D department with a bunch of people that know what they’re doing, and started cranking on some SaaS products for our customers! We’ve delivered two and have announced a third that’s in progress.

Cloud Product #1: LabVIEW Web UI Builder

First out of the gate – LabVIEW Web UI Builder. It went 1.0 late last year. Go try it for free! It’s a Silverlight-based RIA “light” version of LabVIEW – you can visually program, interface with hardware and/or Web services. As internal demos we even had people write things like “Duck Hunt” and “Frogger” in it – it’s like Flash programming but way less of a pain in the ass. You can run in browser or out of browser and save your apps to the cloud or to your local box. It’s a “freemium” model – totally free to code and run your apps, but you have to pay for a license to compile your apps for deployment somewhere else – and that somewhere else can be a Web server like Apache or IIS, or it can be an embedded hardware target like a sensor node. The RIA approach means the UI can be placed on a very low footprint target because it runs in the browser, it just has to get data/interface with the control API of whatever it’s on.

It’s pretty snazzy. If you are curious about “graphical programming” and think it is probably BS, give it a spin for a couple minutes and see what you can do without all that “typing.”

A different R&D team wrote the Silverlight code, we wrote the back end Web services, did the cloud infrastructure, ops support structure, authentication, security, etc. It runs on Amazon Web Services.

Cloud Product #2: LabVIEW FPGA Compile Cloud

This one’s still in beta, but it’s basically ready to roll. For non-engineers, an FPGA (field programmable gate array) is essentially a rewritable chip. You get the speed benefits of being on hardware – not as fast as an ASIC but way faster than running code on a general purpose computer – as well as being able to change the software later.

We have a version of LabVIEW, LabVIEW FPGA, used to target LabVIEW programs to an FPGA chip. Compilation of these programs can take a long time, usually a number of hours for complex designs. Furthermore the software required for the compilation is large and getting more diverse as there are more and more chips out there (each pretty much has its own dedicated compiler).

So, cloud to the rescue. The FPGA Compile Cloud is a simple concept – when you hit ‘compile’ it just outsources the compile to a bunch of servers in the cloud instead of locking up your workstation for hours (assuming you’ve bought a subscription).  FPGA compilations have everything they need with them, there are no unique compile environments to set up or anything, so it’s very commoditizable.

The back end for this isn’t as simple as the one for UI Builder, which is just cloud storage and load balanced compile servers – we had to implement custom scaling for the large and expensive compile workers, and it required more extensive monitoring, performance, and security work. It’s running on Amazon too. We got to reuse a large amount of the infrastructure we put in place for systems management and authentication for UI Builder.
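I can’t share our actual scaling logic, but the general shape of the problem is easy to sketch: decide how many expensive workers to run based on how much work is waiting. The code below is purely an illustrative assumption, not what we run; desired_workers and all of its parameters and default limits are made up for the example.

```python
import math


def desired_workers(queued_jobs, running_jobs, jobs_per_worker=1,
                    min_workers=0, max_workers=20):
    """Return how many workers to run for the current load, within bounds.

    jobs_per_worker defaults to 1 on the assumption that a long compile
    occupies a whole worker. max_workers caps spend on expensive instances.
    """
    needed = math.ceil((queued_jobs + running_jobs) / jobs_per_worker)
    return max(min_workers, min(max_workers, needed))


# A few hypothetical load levels.
idle = desired_workers(queued_jobs=0, running_jobs=0)
busy = desired_workers(queued_jobs=5, running_jobs=2)
flooded = desired_workers(queued_jobs=100, running_jobs=0)
```

A real implementation has more to worry about than this, such as how long a worker takes to boot and not killing a worker that is hours into a compile, which is part of why off-the-shelf scaling doesn’t fit well.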

Cloud Product #3: Technical Data Cloud

It’s still in development, but we’ve announced it so I get to talk about it! The idea behind the Technical Data Cloud is that more and more people need to collect sensor data, but they don’t want to fool with the management of it. They want to plop some sensors down and have the acquired data “go to the cloud!” for storage, visualization, and later analysis. There are other folks doing this already, like the very cool Pachube (pronounced “patch-bay”, there’s a LabVIEW library for talking to it), and it seems everyone wants to take their sensors to the cloud, so we’re looking at making one that’s industrial strength.

For this one we are pulling out our big guns, our data specialist team in Aachen, Germany. We are also being careful to develop it in an open way – the primary interface will be RESTful HTTP Web services, though LabVIEW APIs and hardware links will of course be a priority.

This one had a big technical twist for us – we’re implementing it on Microsoft Windows Azure, the MS guys’ cloud offering. Our org is doing a lot of .NET development and finding a lot of strategic alignment with Microsoft, so we thought we’d kick the tires on their cloud. I’m an old Linux/open source bigot and to be honest I didn’t expect it to make the grade, but once we got up to speed on it I found it was a pretty good bit of implementation. It did mean we had to do significant expansion of our underlying platform we are reusing for all these products – just supporting Linux and Windows instances in Amazon already made us toss a lot of insufficiently open solutions in the garbage bin, and these two cloud worlds are very different as well.

How We Did It

I find nothing more instructive than finding out the details – organizational, technical, etc. – of how people really implement solutions in their own shops.  So in the interests of openness and helping out others, I’m going to do a series on how we did it!  I figure it’ll be in about three parts, most likely:

  • How We Did It: People
  • How We Did It: Process
  • How We Did It: Tools and Technologies

If there’s something you want to hear about when I cover these areas, just ask in the comments!  I can’t share everything, especially for unreleased products, but promise to be as open as I can without someone from Legal coming down here and Tasering me.

Filed under Cloud, DevOps

Security and the Rise (and Fall?) of DevOps

As I’ve been involved with DevOps and its approach of blending development and operations staff together to create better products, I’ve started to see similar trends develop in the security space. I think there are some informative parallels where both can learn from each other and perhaps avoid some pitfalls.

Here’s a recent article entitled “Agile: Most security guys are useless” that states the problem succinctly. In successful and agile orgs, the predominant mindset is that if you’re not touching the product, you are semi-useless overhead. And there’s some truth to that. When people are segregated into other “service” orgs – like operations or security – the us vs. them mindset predominates and strangles innovation in its crib.

The main initial drive of agile was to break down that wall between the devs and the “business”, but walls remain that need similar breaking down. With DevOps, operations organizations faced with this same problem are innovating new approaches; a collaborative approach with developers and operations staff working together on the product as part of the same team. It’s working great for those who are trying it, from the big Web shops like Facebook to the enterprise guys like us here at NI. The movement is gathering steam and it seems clear to those of us doing it this way that it’s going to be a successful and disruptive pattern for adopters.

But let’s not pat ourselves on the back too much just yet. We still have a lot of opportunity to screw it up. Let’s review an example from another area.

In the security world, there is a whole organization, OWASP (the Open Web Application Security Project) whose goal is to promote and enable application security. Security people and developers, working together!  Dev+Sec already exists! Or so the plan was.

However, recently there have been some “shots across the bow” in the OWASP community.  Read Security People vs Developers and especially OWASP: Has It Reached A Tipping Point? The latter is by Mark Curphey, who started OWASP. He basically says OWASP is becoming irrelevant because it’s leaving developers behind. It’s becoming about “security professionals” selling tools and there are few developers to be found in the community any more.

And this is absolutely true.  We host the Austin OWASP chapter here at NI’s Austin campus, and two of the officers are NI employees. We make sure and invite NI developers to come to OWASP. Few do, at least not after the first couple times.  I asked some of the devs on our team why not, and here’s some answers I got.

  • I want to leave sessions by saying, “I need to think about this the next time I code”. I leave sessions by saying, “that was cool, I can talk about this at a happy hour”. If I could do the former, I’d probably attend most/all the sessions.
  • A lot of the sessions don’t seem business focused and it is hard to relate. Demos are nicer; but a lot of times they are so specific (for example: specific OS + specific Java Version + specific javascript library = hackable) that it’s not actionable.
  • “Security people” think “developers” don’t know what they are doing and don’t care about security. Which to developers is offensive. We like to write secure applications; sometimes we just find the bugs too late….
  • I’ve gone to, I think, 4 OWASP meetings.  Of those, I probably would only have recommended one of them to others – Michael Howard’s.  I think it helped that he was a well-known speaker and seemed to have a developer focus. So, well-known speakers, with a compelling and relevant subject.  Even then, the time has to be weighed against other priorities.  For example, today’s meeting sounds interesting, but not particularly relevant.  I’ll probably skip it.

In the end, the content at these meetings is more for security pros like pen testers, or for tool buyers in security or sysadmin groups. “How do I code more securely” is the alleged point of the group but frankly 90% of the activity is around scanners and crackers and all kinds of stuff that is fine but should be simple testing steps after the code’s written securely in the first place.

As a result there have been interesting ideas coming from the security community that are reminiscent of DevOps concepts. Pen tester @atdre did a talk here to the Austin OWASP chapter about how security testers engaging with agile teams “from the outside” are failing, and shouldn’t we instead embed them on the team as their “security buddy.” (I love that term.  Security buddy. I hate my “compliance auditor” without even meeting the poor bastard, but I like my security buddy already.) At the OWASP convention LASCON, Matt Tesauro delivered a great keynote similarly trying to refocus the group back on the core problem of developing secure software; in fact, they’re co-sponsoring a movement called “Rugged” that has a manifesto similar to the Agile Manifesto but is focused on security, availability, reliability, et cetera. (As a result it’s of interest to us sysadmin types, who are often saddled with somehow applying those attributes in production to someone else’s code…)

The DevOps community is already running the risk of “leaving the devs behind” too.  I love all my buddies at Opscode and DTO and Puppet Labs and Thoughtworks and all. But a lot of DevOps discussions have started to be completely sysadmin focused as well; a litany of tools you can use for provisioning or monitoring or CI. And that wouldn’t be so bad if there were a real entry point for developers – “Here’s how you as a developer interact with Chef to deploy your code,” “Here’s how you make your code monitorable”. But those are often fringe discussions around the core content, which mostly warms the cockles of a UNIX sysadmin’s heart. Why would any of my devs want to see a presentation on how to install Puppet?  Well, that’s what they got at a recent Austin Cloud User Group meeting.

As a result, my devs have stopped coming to DevOps events.  When I ask them why, I get answers similar to the ones above for why they’re not attending OWASP events any more. They’re just not hearing anything that is actionable from the developer point of view. It’s not worth the two hours of their valuable time to come to something that’s not at all targeted at them.

And that’s eventually going to scuttle DevOps if we let it happen, just as it’ll scuttle OWASP if it continues there. The core value of agile is PEOPLE over processes and tools, COLLABORATION over negotiation. If you are leaving the collaboration behind and just focusing on tools, you will eventually fail, just in a more spectacular and automated fashion.

The focus at DevOpsDays US 2010 was great: it was all about culture, nothing about tools. But that culture talk hasn’t driven down to anything more actionable, so tools are just rising up to fill the gap.

In my talk at that DevOpsDays I likened these new tools and techniques to the introduction of the Minié ball to rifles during the Civil War. In that war, armies adopted the new weapon but kept their same old tactics, walking up close in lines designed for weapons with much shorter ranges and much lower accuracy – and the slaughter was profound.

All our new DevOps tools are great, but in the same way, if we don’t adapt our way of thinking to them, they will make our lives worse, not better, for all their vaunted efficiency. You can do the wrong thing en masse and more quickly. The slaughter will similarly be profound.

A sysadmin suddenly deciding to code his own tools isn’t really the heart of DevOps.  It’s fine and good, and I like seeing more tools created by domain experts. But the heart of DevOps, where you will really see the benefits in hard ROI, is developers and operations folks collaborating on real end-consumer products.

If you are doing anything DevOpsey, please think about “Why would a developer care about this?” How is it actionable to them, how does it make their lives easier?  I’m a sysadmin primarily, so I love stuff that makes my job easier, but I’ve learned over the years that when I can get our devs to leverage something, that’s when it really takes off and gives value.

The same thing applies to the people on the security side.  Why do we have this huge set of tools and techniques, of OWASP Top 10s and Live CDs and Metasploits and about a thousand wonderful little gadgets, but code is pretty much as shitty and insecure as it was 20 years ago? Because all those things try to solve the problem from the outside, instead of targeting the core of the matter, which is developers developing secure code in the first place.  And to do that, it’s more of a hearts-and-minds problem than a tools-and-processes problem.

That’s a core realization that operations folks, and security folks, and testing folks, and probably a bunch of other folks need to reach, deeply internalize, and allow to change the way they look at the world and how they conduct their work.

Filed under DevOps, Security