Tag Archives: agile
I started thinking about this recently because there was an Agile Austin QA SIG meeting that I sadly couldn’t attend entitled “How does a QA manager fit into an Agile organization?” which wondered about how to fit members of other disciplines (in this case QA) in with agile teams. Over the last couple years I’ve tried this, and seen it tried, in several ways with DevOps, QA, Product Management, and other disciplines, and I thought I’d elaborate on the pros and cons of some of these approaches.
Two Fundamental Discoveries
There are two things I’ve learned from this process that are pretty universal in terms of their truth.
1. Conway’s Law is true. To summarize, it states that a product will tend to reflect the structure of the organization that produces it. The corollary is that if your organization has divisions which are of no practical value to the product’s consumer, you will be creating striations within your product that impact client satisfaction. Hence basic ITSM and Agile doctrines on creating teams around owning a service/product.
2. People want to form teams and stay with them. This should be obvious from basic psychology/sociology, but if you set up an organization that is too flexible it strongly degrades the morale of your workers. In my previous role we conducted frequent engineer satisfaction surveys and the most prominent truth drawn from them is the more frequently people are asked to change roles, reorg, move to different teams, the less happy they are. Even people that want to move around to new challenges frequently are happier if they are moving to those new challenges with a team they’ve had an opportunity to move through Tuckman’s stages of group development with.
I have seen enough real-world quantified proof of both of these assertions that I will treat them as assumptions going forward.
We tried out all four of these models within the same organization of high performing engineers and thus had a great opportunity to compare their results.
When we started, we had the traditional model of separate teams which would hand off work to each other. “Dev,” “Ops,” “QA,” “Product” were all under separate management up through several levels and operated as independent teams; individual affinities with specific products were emergent and simply matters of convenience (e.g. “Oh, he knows a lot about that BI stuff, let him handle that request”) and not a matter of being dedicated to specific product(s).
Embedded Crossfunctional Service Teams
Our first step away from the pure separate team model was to take those separate teams and embed specific members from them into service oriented teams, while still having them report to a manager or director representing their discipline. In some cases, the disciplinary teams would reserve some number of staff for tool development or other cross-cutting concerns. So QA, for example, had several engineers assigned to each product team, even though they were regarded as part of the permanent QA org primarily.
We (very loosely) considered our approach to be decentralized and microservice based; Martin Fowler is doing a good article in installments on Microservice Architecture if you want more on that topic.
Fully Integrated Service Teams
With our operations staff we went one step farther and simply permanently assigned them to product teams and removed the separate layer of management entirely. Dev and “DevOps” engineers reported to the same engineering manager and were a permanent part of a given product team. Any common tooling needed was created by a separate “platform” engineering team which was similarly integrated.
Project Based Organization
Due to the need to surge effort at times, we also had some organizations that were project, not product, based. Engineers would be pulled either from existing teams or entire teams would additionally be pulled into a short term (1, 2, 3 month) effort to try to make significant headway across multiple products, and then dissolve afterwards.
Hmm, this looks like it’s getting big (and I need to do some diagrams). I’ll break it up into separate articles for each type of org and its pros and cons, and then a conclusion.
Today I’ve been treated to the about 1000th hour of my life debating whether something someone wants is a “bug” or a “feature.” This is especially aggravating because in most of these contexts where it’s being debated, there is no meaningful difference.
A feature, or bug, or, God forbid, an “enhancement” or other middle road option, is simply a difference between the product you have and the product you want. People try to declare something a “bug” because they think that should justify a faster fix, but it doesn’t and it shouldn’t. I’ve seen so many gyrations of trying to qualify something as a bug. Is it a bug because the implementation differs from the (likely quite limited and incomplete) spec or requirements presented? Is it a bug because it doesn’t meet client expectation?
In a backlog, work items should be prioritized based on their value. There’s bugs that are important to fix first and bugs it’s important to fix never. There’s features it’s important to have soon and features it’s important to have never. You need (and your product people) need to be able to reconcile the cost/benefit/risk/etc across any needed change and to single stack-rank prioritize them for work in that order regardless of the imputed “type” of work it is. This is Lean/Agile 101.
Now, something being a bug is important from an internal point of view, because it exposes issues you may have with your problem definition, or coding, or QA processes. But from a “when do we fix it” point of view, it should have absolutely no relation. Fixing a bug first because it’s “wrong” is some kind of confused version of punishment theory. If you’re distinguishing between the two meaningfully in prioritization, it’s just a fancy way of saying you like to throw good money after bad without analysis.
So stop wasting your life arguing and philosophizing about whether something in your backlog is a bug or enhancement or feature. It’s a meaningless distinction, what matters is the value that change will convey to your users and the effort it will take to perform it.
I’m not saying one shouldn’t fix bugs – no one likes a buggy product. But you should always clearly align on doing the highest leverage work first, and if that’s a bug that’s great but if it’s not, that’s great too. What label you hang on the work doesn’t alter the value of the work, and you should be able to evaluate value, or else what are you even doing?
We have a process for my product team – if you want something that’s going to take more than a week of engineer time, it needs justification and to be prioritized amongst all the other things the other stakeholders want. Is it a feature? A bug? A week worth of manual labor shepherding some manual process? It doesn’t matter. It’s all work consuming my high value engineers, and we should be doing the highest value work first. It’s a simple principle, but one that people manage to obscure all too often.
Welcome to the newest article in Scrum for Operations. I started this series when I was working for NI. But now I’m going through the same process at BV so time to pick it back up again! Like my previous post on Speeding Up Releases, I’m going to go light on theory and heavy on the details, good and bad, of how exactly we implemented Agile and DevOps and where we are with it.
Here at BV (Bazaarvoice), the org had adopted Agile wholesale just a couple months before I started. We also adopted DevOps shortly after I joined by embedding ops folks in the product teams. Before the Agile/DevOps implementation there was a traditional organization consisting of many dev teams and one ops team, with all the bottlenecking and siloing and stuff that’s traditional in that kind of setup. Newer teams (often made up of newly hired engineers, since we were growing quickly) that started out on the new DevOps model picked it up fine, but in at least one case we had a lot of culture change to do with an existing team.
Our primary large legacy team is called the PRR team (Product Ratings and Reviews) after the name of their product, which now does lots more than just ratings and reviews, but naturally marketing rebranding does little to change what all the engineers know an app is called. Many of the teams working on emerging greenfield products that were still in development had just one embedded ops engineer, but on our primary production software stack, we had a bunch. PRR serves content into many Internet retailer’s pages; 450 million people see our reviews and such. So for us scalability, performance, monitoring, etc. aren’t a sideline, they’re at least half of the work of an engineering team!
This had previously been cast as “a separate PRR operations team.” The devs were used to tossing things over the wall to ops and punting on the responsibility even if it was their product, and the ops were used to a mix of firefighting and doing whatever they wanted when not doing manual work the devs probably should have automated.
I started at BV as Release Manager, but after we got our releases in hand, I was asked to move over to lead the PRR team and take all these guys and achieve a couple major goals, so I dug in.
Moving Ops to Agile
I actually started implementing Agile with the PRR Ops team because I managed just them for a couple months before being given ownership of the whole department. I had worked closely with many of them before in my release manager role so I knew how they did things already. The Ops team consisted of 15 engineers, 2/3 of which were in Ukraine, outsourced contractors from our partner Softserve.
At start, there was no real team process. There were tickets in JIRA and some bigger things that were lightly project managed. There was frustration with Austin management about the outsourcers’ performance, but I saw that there was not a lot of communication between the two parts of the team. “A lot of what’s going bad is our fault and we can fix it,” I told my boss.
A the first process improvement step, I introduced daily standups (in Sep 2012). These were made more complicated by the fact that we have half of our large team in Ukraine; as a result we used Webex to conduct them. “Let’s do one Austin standup and one Ukraine standup” was suggested – but I vetoed that since many of the key problems we were facing were specifically because of poor communication between Austin and Ukraine. After the initial adjustment period, everyone saw the value of the visibility it was giving us and it became self-sustaining. (With any of these changes, you just have to explain the value and then make them do it a little while “as a pilot” to get it rolling. Once they see how it works in practice and realize the value it’s bringing, the team internalizes it and you’re ready for the next step.) Also because of the large size and international distribution I did the “no-no” of writing up the standup and sending the notes out in email. It wasn’t really that hard, here’s an example standup email from early on:
Subject: PRR Infrastructure Daily Standup Notes 11/05/2012
(what you did since last standup, what you will do by the next standup, blockers if any)
Alexander C – did: AVI-93 dev deploy testing of c2, release activity training; will do: finish dev c2, start other clusters
Anton P – did: review AVI-271 sharded solr in AWS proxy, AVI-282 migrating AWS to solr sharding; will do: finish and test
Bryan D – did: Hosted SEO 2.0 discussion may require Akamai SSL, Tim’s puppet/vserver training, DOS-2149 BA upgrade problems, document surgical deploy safety, HOST-71 lab2 ssh timeout, AVI-790, 793 lab monitoring, nexus errors; will do: finish prep Magpie retro, PRR sprint planning, Akamai tickets for hosted SEO, backlog task creation.
Larry B – did: MONO-107,109 7.3 release branch cut, release training; will do: AVI-311 dereg in DNS (maybe monitoring too?)
Lev P – did: deploy script change testing AVI-771; will do: more of that
Oleg K – did: review AVI-676 changes, investigate deployment runbooks/scripts for solr sharding AVI-773; to do: testing that, AVI-774 new solr slaves
Oleksandr M – did: out Friday after taking solr sharding live; will do: prod cleanup AVI-768, search_engine.xml AVI-594
Oleksandr Y – did: AVI-789 BF monitoring, had to fix PDX2 zabbix; will do: finish it and move to AVI-585 visualization
Robby M – did: testing AVI-676 and communicating about AWS sharding; will do: work with Alex and do AVI-698 c7 db patches for solr sharding
Sergii V – did: AVI-703 histograms, AVI-763 combining graphs; will do: continue that and close AVI-781 metrics deck
Serhiy S – did: tested aws solr puppet config AVI-271, CMOD stuff AVI-798, AVI-234
Taras U – did: tested BVC-126599 data deletion. Will do: pick up more tickets for testing
Taras Y – did: AVI-776 black Friday scale up plan, AVI-762 testing BF scale up; will do: more scale up testing
Vasyl B – did: MONO-94 GTM automation to test; will do: AVI-770 ftp/zabbix thing
Artur P – did: AVI-234 remove altstg environment, AVI-86 zabbix monitoring of db performance “mikumi”; will do: more on those
For context, while this was going on we were planning for Black Friday (BF) and executing on a large project to shard our Solr indexes for scaling purposes. The standup itself brought loads of visibility to both sides of the team and having the emails brought a lot of visibility to managers and stakeholders too. It also helped us manage what all the outsourcers were doing (I’ll be honest, before that sometimes we didn’t know what a given guy in Ukraine was doing that week – we’d get reports in code later on, but…).
I took the notes in the standup straight into an email and it didn’t really slow us down (I cheated by having the JIRA project up so I could copy over the ticket numbers). Because of the number of people, the Webex, and the language barrier the standups took 30 minutes. Not the fastest, but not bad.
After everyone got used to the standups, I introduced a backlog (maybe 2 weeks after we started the standups). We had JIRA tickets with priorities before, but I added a Greenhopper Scrum style backlog. Everyone got the value of that immediately, since “we have 200 P2 tickets!” is obviously Orwellian at best. When stakeholders (my boss, other folks) had opinions on priorities we were able to bring up the stack-ranked backlog and have a very clear discussion about what it was more or less urgent/important than. (Yes, there were a couple yelling matches about “it’s meaningless to have five ‘top priorities!'” before we had this.) Interrupt tickets would just come in at the top.
Here’s a clip of our backlog just to get the gist of it…
All the usual work… just in a list. “Work it from the top!” We still had people cherry-picking things from farther down because “I usually work on builds” or “I usually work on metrics” but I evangelized not doing that.
Using this format also gave me insight into who was doing what via the swimlanes view in JIRA. When we’d do the standup we started going down in swimlane order and I could ask “why I don’t see that ticket” or see other warning signs like lots of work in progress. An example swimlane:
This helped engineers focus on what they were supposed to be doing, and encouraged them to put interrupts into the queue instead of thrashing around.
Once we had the backlog, it was time to start sprinting! We had our first sprint planning meeting in October and I explained the process. They actually wanted to start with one week sprints, which was interesting – in the dev world often times you start in with really long (4-6 week) sprints and try to get them shorter as you get more mature. In the ops world, since things are faster paced, we actually started at a week and then lengthened it out later once we got used to it.
The main issue that troubled people was the conjunction of “interrupt” tickets with proactive implementation tickets. This kind of work is why lots of people like to say “Ops should use kanban.”
However, I learned two things doing all this. The first is that for our team at least, the lion’s share of the work was proactive, not reactive, especially if you use a 1-2 week lookahead. “Do they really need that NOW, or just by next sprint?” Work that used to look interrupt driven under a “chaos plus big projects” process started to look plannable. That helped us control the thrash of “here’s a new urgent request” and resist it breaking the current sprint where possible.
Also, the amount of interrupt work varies from day to day but not significantly for a large team over a 1-2 week period. This means that after a couple sprints, people could reliably predict how many points of stories they could pull because they knew how much time got pulled to interrupt work on average. This was the biggest fear of the team in doing sprint planning – that interrupt work would make it impossible to plan – and there was no way to bust through it except for me to insist that we do a couple sprints and reevaluate. Once we’d done some, and people learned to estimate, they got comfortable with it and we’ve been scrumming away since.
And the third thing – kanban is harder to do correctly than Scrum. Scrum enforces some structure. I’ve seen a lot of teams that “use kanban” and by that they mean “they do whatever comes to mind, in a completely uncontrolled manner,” indistinguishable from how Ops used to do things. Real kanban is subtle and powerful, and requires a good bit of high level understanding to do correctly. Having a structure helped teach my team how to be agile – they may be ready for kanban in another 6 months or so, perhaps, but right now some guard rails while they learn a lot of other best practices are serving us well.
After the traditional explanation (several times) about what story points are, people started to get it. We used planningpoker.com for the actual voting – it’s a bit buggy but free, and since sprint planning was also 15 people on both (or more) sides of a Webex, it was invaluable.
It’s hard to argue with success. We watched the team velocity, and it basically doubled sprint to sprint for the first 4 sprints; by the end of November we were hitting 150 story points per sprint. I wish I had a screen cap of the velocity from those original sprints; Greenhopper is a little cussed and refuses to show you more than 7 sprints back, but it was impressive and everyone got to see how much more work they were completing (as opposed to ‘doing’). I do have one interesting one though:
This is our 6th and following sprints; you see how our average velocity was still increasing (a bit spikily) but in that last sprint we finally got to where we weren’t overpromising and underdelivering, which was an important milestone I congratulated the team on. Nowadays their committed/completed numbers are always very close, which is a sign of maturity.
Just Add Devs – False Start!
After the holiday rush, they asked me and another manager, Kelly, to take over the dev side of PRR as well, so we had the whole ball of wax (doubling the number of people I was managing). We tried to move them straight to full Scrum and also DevOps the team up using the embedded ops engineer model we were using on the other 2.0 teams. PRR is big enough there were enough people for four subteams, so we divided up the group into four sprint teams, assigned a couple ops engineers to each one, and said “Go!”.
This went over pretty much like a lead balloon. It was too much change too fast. Most of the developers were not used to Agile, and trying to mentor four teams at once was difficult. Combined with that was the fact that most of the ops staff was remote in Ukraine, what happened was each Austin-BV-employee-led team didn’t really consider “those ops guys” part of their team (I look around from my desk and see four other devs but don’t see the ops people… Therefore they’re not on my team.) And that ops team was used to working as one team and they didn’t really segment themselves along those lines meaningfully either. Since they were mostly remote, it was hard to break that habit. We tried to manage that for a little while, but finally we had to step back and try again.
Check back soon for Scrum for Operations: Just Add DevOps, where I reveal how we got Agile and DevOps to work for us after all!
So far in this series, I’ve introduced the basics of Scrum as it generally is used and explained the practices that make it extremely successful. But that’s for developers, right? If you are in operations, what does this mean to you? How do you fit in? For an ops person, the major challenges are mental – you have to reorient your way of thinking, and then things drop into place very well.
I’m writing from the perspective of a Web operations guy, though I’ve done more traditional sysadmin work and managed infrastructure (and dev) teams over time (and started off as a dev, many years ago). Some of my terminology is oriented towards creating a product and keeping a Web site up, but you should be able to conceptually substitute your own kind of system, just as all different kinds of developers, not just Web developers, use and benefit from agile.
First, “DevOps.” Get an Ops person assigned to the dev team. This is fundamental – if it’s an externalized relationship, where the dev team is making requests of your “Infrastructure org”, you will not be seen as part of the team and your effectiveness will be extremely diminished. You need to be more or less dedicated to this project, not handling it from some shared work queue. This reinforces the fundamental values of Agile. You join the team, and you dedicate yourself to the overall success of the product you are working on. It is this integration, and the trust that arises from shared goals, that will remove a lot of the traditional roadblocks you are used to facing when dealing with a dev team. A real agile team should have similarly embedded product, QA, and UX folks, it’s not a new idea.
You are not “a UNIX guy” or “A DBA” any more. You are “a member of the Ratings and Reviews team,” and you happen to have a technical specialty. This may seem like sophistry but it’s actually one of the most critical parts of this cultural transformation.
Start thinking of tasks in a customer-feature-facing kind of way for the backlog. For example, no one but you wants to hear about “configuring the SAN,” they want to know that at the end of the sprint “customers will be able to save files to persistent storage.” If what you’re doing doesn’t have any benefit to the end customer – why are you doing it again? You shouldn’t be.
Figure out how to state operational concerns like performance, maintainability, and availability as benefits in the backlog. Some infrastructure stuff belongs in the backlog, other parts of it belong more in standards (e.g. the team Definition of Done now states you have to have monitoring on a new service…). The product manager and dev team aren’t dumb, they will understand that performance, availability, security, ability to release their software, etc. are important goals that have merit in the backlog. The typical story-lingo is “As an X, I want Y so I can do Z.” “As a client, I want my data backed up so that in the case of a disaster, I am minimally affected.” “As an engineer, I want the uptime state of my services monitored so I can ensure customers are being served.”
You will be challenged (and this is good) on items that are “monkey work.” “I need to go delete log files off that server, so it doesn’t crash.” Hey, why are we doing that? Why is it manual? Should we have a story for proper log rotation? Need a developer to help? You will see a virtuous cycle develop to “fix things right.” Most of the devs haven’t seen a lot of the demeaning stuff you’re asked to do, and they’ll try to help fix it.
I’ll be honest, the first time I was confronted with the prospect of breaking up systems work into sprints I thought it was very unlikely it could be done. “Things are either short interrupts or long projects, right, that doesn’t make any sense.” And then I did it, and the scales dropped from my eyes. Remember refactoring. Developers doing agile are used to refactoring, while we are used to only having “one bite at the apple” – if we don’t get the systems all 100% right before we unleash the developers on them, then we won’t be able to change them later right? Wrong!
In a certain sense, sprint planning is a big load off from traditional planning. Infrastructure folks are used to being asked to provide a granular task breakdown and timeline of 6 months worth of work for some big-bang implementation. Then when reality causes the plan to deviate from that, everyone freaks. Agile takes horizon planning and institutionalizes it – you only need to be able to specifically plan your next 2 (or so) weeks, and if you can’t do that you need to try harder. What can you implement in 2 weeks that has some kind of value? Get a Tomcat running sprint 1, then tune it sprint 2, then monitor it sprint 3 – don’t bundle everything up into one huge mass.
Figure out what unit tests mean to you for things you are implementing. “Nothing” is the wrong answer. If you’re making a network change, for example, there is something you can do to test that short of “waiting for people to complain.” If you are installing tomcat on a server – if you’re using a framework like chef or puppet they’ll have testing options built in, but even if not there’s certain things you can do to ensure its functionality instead of passing it on and causing lost time and rework when someone else finds out it’s not working right.
More to come, meditate upon those truths for a bit – ask questions in the comments!
So… DevOps. DevOps vs NoOps has been making the rounds lately. At Bazaarvoice we are spawning a bunch of decentralized teams not using that nasty centralized ops team, but wanting to do it all themselves. This led me to contemplate how to express the things that Operations does in a way that turns them into developer requirements?
Because to be honest a lot of our communication in the Ops world is incestuous. If one were to point a developer (or worse, a product manager) at the venerable infrastructures.org site, they’d immediately poop themselves and flee. It’s a laundry list of crap for neckbeards, but spends no time on “why do you want this?” “What value will this bring to your customer?”
So for the “NoOps” crowd who want to do ops themselves – what’s an effective way to state these needs? I took previous work – ideas from the pretty comprehensive but waterfally “Systems Development Framework” Peco and I did for NI, Chris from Bazaarvoice’s existing DevOps work – and here’s a cut. I’d love feedback. The goal is to have a small discrete set of areas (5-7), with a small discrete set (5-7) of “most important” items – most importantly, stated in a way so that people understand that these aren’t “annoying things some ops guy wants me to do instead of writing 3l33t code” but instead stated as they should be, customer facing requirements like any other. They could then be each broken into subsidiary more specific user stories (probably a whole lot of them, to be honest) but these could stand in as “epics” or “pillars” for making your Web thing work right.
Service Delivery Best Practices
- I will make my service highly available, so that customers can use it constantly without issues
- I will know about any problems with my service’s delivery to customers
- I can restore my service quickly from any disaster or loss
- I can make my data resistant to corruption from runtime issues
- I will test my service under routine disruption to make it rugged and understand its failure modes
- I know how all parts of the product perform and can measure performance against a customer SLA
- I can run my application and its data globally, to serve our global customer base
- I will test my application’s performance and I know my app’s limitations before it reaches them
- I will make my application supportable by the entire organization now and in the future
- I know what every user hitting my service did
- I know who has access to my code and data and release mechanism and what they did
- I can account for all my servers, apps, users, and data
- I can determine the root cause of issues with my service quickly
- I can predict the costs and requirements of my service ahead of time enough that it is never a limiter
- I will understand the communication needs of customers and the organization and will make them aware of all upgrades and downtime
- I can handle incidents large and small with my services effectively
- I will make every service secure
- I understand Web application vulnerabilities and will design and code my services to not be subject to them
- I will test my services to be able to prove to customers that they are secure
- I will make my data appropriately private and secure against theft and tampering
- I will understand the requirements of security auditors and expectations of customers regarding the security of our services
- I can get code to production quickly and safely
- I can deploy code in the middle of the day with no downtime or customer impact
- I can pilot functionality to limited sets of users
- I will make it easy for other teams to develop apps integrated with mine
Also to capture that everyone starting out can’t and shouldn’t do all of this… We struggled over whether these were “Goals” or “Best Practices” or what… On the one hand you should only put in e.g. “as much security as you need” but on the other hand there’s value in understanding the optimal condition.
A recent blog post on DevOps by the IT Skeptic entitled DevOps and traditional ITSM – why DevOps won’t change the world anytime soon got the community a’frothing. And sure, the article is a little simmered in anti-agile hate speech (apparently the Agilistias and cloud hypesters and cowboys are behind the whole DevOps thing and are leering at his wife and daughter and dropping his property values to boot) but I believe his critiques are in general very perceptive and that they are areas we, the DevOps movement, should work on.
Go read the article – it’s really long so I won’t sum the whole thing up here.
Here’s the most germane critiques and what we need to do about them. He also has some poor and irrelevant or misguided critiques, but why would I waste time on those? Let’s take and action on the good stuff that can make DevOps better!
Lack of a coherent definition
This is a very good point. I went to the first meeting of an Austin DevOps SIG lately and was treated to the usual debate about “the definition of DevOps” and all the varied viewpoints going into that. We need to emerge more of a structured definition that either includes and organizes or excludes the various memetic threads. It’s been done with Agile, and we can do it too. My imperfect definition of DevOps on this site tries to clarify this by showing there are different levels (principles, methods, and practices) that different thoughts about DevOps slot into.
Worry about cowboys
This is a valid concern, and one I share. Here at NI, back in the day programmers had production passwords, and they got taken away for real good reasons. “Oh, let’s just give the programmers pagers and the root password” is not a responsible interpretation of DevOps but it’s one I’ve heard bandied about; it’s based on a false belief that as long as you have “really smart” developers they’ll never jack everything up.
Real DevOps shops that are uptaking practices that could be risky, like continuous deployment, are doing it with extreme levels of safeguard put into place (automated testing, etc.). This is similar to the overall problem in agile – some people say “agile? Great! I’ll code at random,” whereas really you need to have a very high percentage of unit test coverage. And sure, when you confront people with this they say “Oh, sure, you need that” but there is very little constructive discussion or tooling around it. How exactly do I build a good systems + app code integration/smoke test rig? “Uh you could write a bunch of code hooked to Hudson…” This should be one of the most discussed and best understood parts of the chain, not one of the least, to do DevOps responsibly.
We’re writing our own framework for this right now – James is doing it in Ruby, it’s called Sparta, and devs (and system folks) provide test chunks that the framework runs and times in an automated fashion. It’s not a well solved problem (and the big-dollar products that claim to do test automation are nightmares and not really automated in the “devs easily contribute tests to integrate into a continuous deploy” sense.
Working at a large corporation, I also share his concern about people’s cunning DevOps schemes that don’t scale past a 12 person company. “We’ll just hire 7 of the best and brightest and they’ll do everything, and be all crossfunctional, and write code and test and do systems and ops and write UIs and everything!” is only a legit plan for about 10 little hot VC funded Web 2.0 companies out there. The rest of us have to scale, and doing things right means some specialization and risks siloization.
For example, performance testing. When we had all our developers do their own performance testing, the limit of the sophistication of those tests was “I’ll run 1000 hits against it and time how long it takes to finish. There, 4 seconds. Done, that’s my performance testing!” The only people who think Ops, QA, etc. are such minor skill sets that someone can just do them all is someone who is frankly ignorant of those fields. Oh, P.S. The NoOps guys fall into this category, please don’t link them to DevOps.
We have struggled with this. We’ve had to work out what testing our devs do versus how we closely align with external test teams. Same with security, performace, etc. The answer is not to completely generalize or completely silo – Yahoo! had a great model with their performance team, where their is a central team of super-experts but there are also embedded folks on each product team.
Very related to the previous point – again unless you’re one of the 10 hottest Web 2.0 plays and you can really get the best of the best, you are needing to staff your organization with random folks who graduated from UT with a B average. You have to have and manage tiers as well as silos – some folks are only ready to be “level 1 support” and aren’t going to be reading some dev’s Java code.
Traditional organizations and those following ITIL very closely can definitely create structures that promote bad silos and bad tiering. But just assuming everyone will be of the same (high) skill level and be able to know everything is a fallacy that is easy to fall into, since it’s those sort of elite individuals who are the leading uptakers of DevOps. Maybe Gene Kim’s book he’s working on (“Visible DevOps” or similar) will help with that.
Definitely an issue. An enhanced focus on automation is valuable. Too many ops shops still just do the same crap by hand day after day, and should be challenged to automate and use tools. But a lot of the DevOps discussions do become “cool tool litanies” and that’s cart before the horse. In my terminology, you don’t want to drive the principles out of the practices and methods – tooling is great but it should serve the goals.
We had that problem on our team. I had to talk to our Ops team and say “Hey, why are we doing all these tool implementations? What overall goal are they serving? ” Tools for the sake of tools are worse than pointless.
It is true that with agile and with DevOps that some folks are using it as an excuse to toss out process. It should simply be a different kind of process! And you need to take into account all the stuff that should be in there.
A great example is Michael Howard et al. at Microsoft with their Security Development Lifecycle. The first version of it was waterfall. But now they’ve revamped it to have an agile security development lifecycle, so you know when to do your threat modeling etc.
Build instead of buy
Well, there are definitely some open source zealots involved with most movements that have any sysadmins involved. We would like to buy instead of build, but the existing tools tend to either not solve today’s problems or have poor ROI.
In IT, we implemented some “ITIL compliant” HP tools for problem tracking, service desk, and software deployment. They suck, and are very rigid, and cost a lot of money, and required as much if not more implementation time than writing something from scratch that actually addressed our specific requirements. And in general that’s been everyone’s experience. The Ops world has learned to fear the HP/IBM/CA/etc systems management suites because it’s just one of those niches that is expensive and bad (like medical or legal software).
But having said that, we buy when we can! Splunk gave us a lot more than cobbling together our own open source thing. Cloudkick did too. Sure, we tend to buy SaaS a lot more than on prem software now because of the velocity that gives us, but I agree that you need to analyze the hidden costs of building as part of a build/buy – you just need to also see the hidden costs and compromised benefits of a buy.
This simply goes back to the cowboy concern. It’s clearly shown that if you structure your process correctly, with the right testing and signoff gates, then agile/devops/rapid deploys are less risky.
We came to this conclusion independently as well. In IT, we ran (still do) these Web go lives once a month. Our Web site consists of 200+ applications and we have 70 or so programmers, 7 Web ops, a whole Infrastructure department, a host of third party stuff (Oracle and many more)… Every release plan was 100 lines long and the process of planning them and executing on them was horrific. The system gets complex enough, both technically and organizationally, that rollbacks + dependencies + whatnot simply turn into paralysis, and you have to roll stuff out to make money. When the IT apps director suggested “This is too painful – we should just do these quarterly instead, and tell the business they get to wait 2 more months to make their money,” the light went on in my mind. Slower and more rigorous is actually worse. It’s not more efficient to put all the product you’re shipping for the month onto a huge ass warehouse on the back of a giant truck and drive it around doing deliveries, either; this should be obvious in retrospect. Distribution is a form of risk management. “All the eggs in one big basket that we’ll do all at one time” is the antithesis of that.
We started DevOps here at NI from the operations guys. We’d been struggling for years to get the programmers to take production responsibility for their apps. We had struggled to get them access to their own logs, do their own deploys (to dev and test), let business users input Apache redirects into a Web UI rather than have us do it… We developed a whole process, the Systems Development Framework, that we used to engage with dev teams and make sure all the performance, reliability, security, manageability, provisioning, etc. stuff was getting taken care of… But it just wasn’t as successful as we felt like it could be. Realizing that a more integrated model was possible, we realized success was actually an option. Ask most sysadmin shows if they think success is actually a possible outcome of their work, and you’ll get a lot of hedging kinds of “well success is not getting ruined today” kinds of responses.
By combining ops and devs onto one team, by embedding ops expertise onto other dev teams, by moving to using the same tools and tracking systems between devs and ops, and striving for profound automation and self service, we’ve achieved a super high level of throughput within a large organization. We have challenges (mostly when management decides to totally change track on a product, sigh) but from having done it both ways – OMG it’s a lot better. Everything has challenges and risks and there definitely needs to to be some “big boy” compatible thinking on DevOps – but it’s like anything else, those who adopt early will reap the rewards and get competitive advantage on the others. And that’s why we’re all in. We can wait till it’s all worked out and drool-proof, but that’s a better fit for companies that don’t actually have to produce/achieve any more (government orgs, people with more money than God like oil and insurance…).
Hey, I’m not a sales guy, and none of us spend a lot of time on this blog pimping our company’s products, but we’re pretty proud of our work on them and I figured I’d toss them out there as use cases of what an enterprise can do in terms of cloud products if they get their act together!
Some background. Currently all the agile admins (myself, Peco, and James) work together in R&D at National Instruments. It’s funny, we used to work together on the Web Systems team that ran the ni.com Web site, but then people went their own ways to different teams or even different companies. Then we decided to put the dream team back together to run our new SaaS products.
Some background. National Instruments (hereafter, NI) is a 5000+ person global company that makes hardware and software for test & measurement, industrial control, and graphical system design. Real Poindextery engineering stuff. Wireless sensors and data acquisition, embedded and real-time, simulation and modeling. Our stuff is used to program the Lego Mindstorms NXT robots as well as control CERN’s Large Hadron Collider. When a crazed highlander whacks a test dummy on Deadliest Warrior and Max the techie looks at readouts of the forces generated, we are there.
Our main software product is LabVIEW. Despite being an electrical engineer by degree, we never used LabVIEW in school (this was a very long time ago, I’ll note, most programs use it nowadays), so it wasn’t till I joined NI I saw it in action. It’s a graphical dataflow programming language. I assumed that was BS when I heard it. I had so many companies try to sell be “graphical” programming over the years, like all those crappy 4GLs back in the ‘9o’s, that I figured that was just an unachieved myth. But no, it’s a real visual programming language that’s worked like a champ for more than 20 years. In certain ways it’s very bad ass, it does parallelism for you and can be compiled and dropped onto a FPGA. It’s remained niche-ey and hasn’t been widely adopted outside the engineering world, however, due to company focus more than anything else.
Anyway, we decided it was high time we started leveraging cloud technologies in our products, so we created a DevOps team here in NI’s LabVIEW R&D department with a bunch of people that know what they’re doing, and started cranking on some SaaS products for our customers! We’ve delivered two and have announced a third that’s in progress.
Cloud Product #1: LabVIEW Web UI Builder
First out of the gate – LabVIEW Web UI Builder. It went 1.0 late last year. Go try it for free! It’s a Silverlight-based RIA “light” version of LabVIEW – you can visually program, interface with hardware and/or Web services. As internal demos we even had people write things like “Duck Hunt” and “Frogger” in it – it’s like Flash programming but way less of a pain in the ass. You can run in browser or out of browser and save your apps to the cloud or to your local box. It’s a “freemium” model – totally free to code and run your apps, but you have to pay for a license to compile your apps for deployment somewhere else – and that somewhere else can be a Web server like Apache or IIS, or it can be an embedded hardware target like a sensor node. The RIA approach means the UI can be placed on a very low footprint target because it runs in the browser, it just has to get data/interface with the control API of whatever it’s on.
It’s pretty snazzy. If you are curious about “graphical programming” and think it is probably BS, give it a spin for a couple minutes and see what you can do without all that “typing.”
A different R&D team wrote the Silverlight code, we wrote the back end Web services, did the cloud infrastructure, ops support structure, authentication, security, etc. It runs on Amazon Web Services.
Cloud Product #2: LabVIEW FPGA Compile Cloud
This one’s still in beta, but it’s basically ready to roll. For non-engineers, a FPGA (field programmable gate array) is essentially a rewritable chip. You get the speed benefits of being on hardware – not as fast as an ASIC but way faster than running code on a general purpose computer – as well as being able to change the software later.
We have a version of LabVIEW, LabVIEW FPGA, used to target LabVIEW programs to an FPGA chip. Compilation of these programs can take a long time, usually a number of hours for complex designs. Furthermore the software required for the compilation is large and getting more diverse as there’s more and more chips out there (each pretty much has its own dedicated compiler).
So, cloud to the rescue. The FPGA Compile Cloud is a simple concept – when you hit ‘compile’ it just outsources the compile to a bunch of servers in the cloud instead of locking up your workstation for hours (assuming you’ve bought a subscription). FPGA compilations have everything they need with them, there’s not unique compile environments to set up or anything, so it’s very commoditizable.
The back end for this isn’t as simple as the one for UI Builder, which is just cloud storage and load balanced compile servers – we had to implement custom scaling for the large and expensive compile workers, and it required more extensive monitoring, performance, and security work. It’s running on Amazon too. We got to reuse a large amount of the infrastructure we put in place for systems management and authentication for UI Builder.
Cloud Product #3: Technical Data Cloud
It’s still in development, but we’ve announced it so I get to talk about it! The idea behind the Technical Data Cloud is that more and more people need to collect sensor data, but they don’t want to fool with the management of it. They want to plop some sensors down and have the acquired data “go to the cloud!” for storage, visualization, and later analysis. There are other folks doing this already, like the very cool Pachube (pronounced “patch-bay”, there’s a LabVIEW library for talking to it), and it seems everyone wants to take their sensors to the cloud, so we’re looking at making one that’s industrial strength.
For this one we are pulling our our big guns, our data specialist team in Aachen, Germany. We are also being careful to develop it in an open way – the primary interface will be RESTful HTTP Web services, though LabVIEW APIs and hardware links will of course be a priority.
This one had a big technical twist for us – we’re implementing it on Microsoft Windows Azure, the MS guys’ cloud offering. Our org is doing a lot of .NET development and finding a lot of strategic alignment with Microsoft, so we thought we’d kick the tires on their cloud. I’m an old Linux/open source bigot and to be honest I didn’t expect it to make the grade, but once we got up to speed on it I found it was a pretty good bit of implementation. It did mean we had to do significant expansion of our underlying platform we are reusing for all these products – just supporting Linux and Windows instance in Amazon already made us toss a lot of insufficiently open solutions in the garbage bin, and these two cloud worlds are very different as well.
How We Did It
I find nothing more instructive than finding out the details – organizational, technical, etc. – of how people really implement solutions in their own shops. So in the interests of openness and helping out others, I’m going to do a series on how we did it! I figure it’ll be in about three parts, most likely:
- How We Did It: People
- How We Did It: Process
- How We Did It: Tools and Technologies
If there’s something you want to hear about when I cover these areas, just ask in the comments! I can’t share everything, especially for unreleased products, but promise to be as open as I can without someone from Legal coming down here and Tasering me.
As I’ve been involved with DevOps and its approach of blending development and operations staff together to create better products, I’ve started to see similar trends develop in the security space. I think there’s some informative parallels where both can learn from each other and perhaps avoid some pitfalls.
Here’s a recent article entitled “Agile: Most security guys are useless” that states the problem succinctly. In successful and agile orgs, the predominant mindset is that if you’re not touching the product, you are semi-useless overhead. And there’s some truth to that. When people are segregated into other “service” orgs – like operations or security – the us vs. them mindset predominates and strangles innovation in its crib.
The main initial drive of agile was to break down that wall between the devs and the “business”, but walls remain that need similar breaking down. With DevOps, operations organizations faced with this same problem are innovating new approaches; a collaborative approach with developers and operations staff working together on the product as part of the same team. It’s working great for those who are trying it, from the big Web shops like Facebook to the enterprise guys like us here at NI. The movement is gathering steam and it seems clear to those of us doing it this way that it’s going to be a successful and disruptive pattern for adopters.
But let’s not pat ourselves on the back too much just yet. We still have a lot of opportunity to screw it up. Let’s review an example from another area.
In the security world, there is a whole organization, OWASP (the Open Web Application Security Project) whose goal is to promote and enable application security. Security people and developers, working together! Dev+Sec already exists! Or so the plan was.
However, recently there have been some “shots across the bow” in the OWASP community. Read Security People vs Developers and especially OWASP: Has It Reached A Tipping Point? The latter is by Mark Curphey, who started OWASP. He basically says OWASP is becoming irrelevant because it’s leaving developers behind. It’s becoming about “security professionals” selling tools and there’s few developers to be found in the community any more.
And this is absolutely true. We host the Austin OWASP chapter here at NI’s Austin campus, and two of the officers are NI employees. We make sure and invite NI developers to come to OWASP. Few do, at least not after the first couple times. I asked some of the devs on our team why not, and here’s some answers I got.
- I want to leave sessions by saying, “I need to think about this the next time I code”. I leave sessions by saying, “that was cool, I can talk about this at a happy hour”. If I could do the former, I’d probably attend most/all the sessions.
- “Security people” think “developers” don’t know what they are doing and don’t care about security. Which to developers is offensive. We like to write secure applications; sometimes we just find the bugs too late….
- I’ve gone to, I think, 4 OWASP meetings. Of those, I probably would only have recommended one of them to others – Michael Howard’s. I think it helped that he was a well-known speaker and seemed to have a developer focus. So, well-known speakers, with a compelling and relevant subject.. Even then, the time has to be weighed against other priorities. For example, today’s meeting sounds interesting, but not particularly relevant. I’ll probably skip it.
In the end, the content at these meetings is more for security pros like pen testers, or for tool buyers in security or sysadmin groups. “How do I code more securely” is the alleged point of the group but frankly 90% of the activity is around scanners and crackers and all kinds of stuff that is fine but should be simple testing steps after the code’s written securely in the first place.
As a result there have been interesting ideas coming from the security community that are reminiscent of DevOps concepts. Pen tester @atdre did a talk here to the Austin OWASP chapter about how security testers engaging with agile teams “from the outside” are failing, and shouldn’t we instead embed them on the team as their “security buddy.” (I love that term. Security buddy. I hate my “compliance auditor” without even meeting the poor bastard, but I like my security buddy already.) At the OWASP convention LASCON, Matt Tesauro delivered a great keynote similarly trying to refocus the group back on the core problem of developing secure software; in fact, they’re co-sponsoring a movement called “Rugged” that has a manifesto similar to the Agile Manifesto but is focused on security, availability, reliability, et cetera. (As a result it’s of interest to us sysadmin types, who are often saddled with somehow applying those attributes in production to someone else’s code…)
The DevOps community is already running the risk of “leaving the devs behind” too. I love all my buddies at Opscode and DTO and Puppet Labs and Thoughtworks and all. But a lot of DevOps discussions have started to be completely sysadmin focused as well; a litany of tools you can use for provisioning or monitoring or CI. And that wouldn’t be so bad if there was a real entry point for developers – “Here’s how you as a developer interact with chef to deploy your code,” “Here’s how you make your code monitorable”. But those are often fringe discussions around the core content which often mainly warms the cockles of a UNIX sysadmin’s heart. Why do any of my devs want to see a presentation on how to install Puppet? Well, that’s what they got at a recent Austin Cloud User Group meeting.
As a result, my devs have stopped coming to DevOps events. When I ask them why, I get answers similar to the ones above for why they’re not attending OWASP events any more. They’re just not hearing anything that is actionable from the developer point of view. It’s not worth the two hours of their valuable time to come to something that’s not at all targeted at them.
And that’s eventually going to scuttle DevOps if we let it happen, just as it’ll scuttle OWASP if it continues there. The core value of agile is PEOPLE over processes and tools, COLLABORATION over negotiation. If you are leaving the collaboration behind and just focusing on tools, you will eventually fail, just in a more spectacular and automated fashion.
The focus at DevOpsDays US 2010 was great, it was all about culture, nothing about tools. But that culture talk hasn’t driven down to anything more actionable, so tools are just rising up to fill the gap.
In my talk at that DevOpsDays I likened these new tools and techniques to the introduction of the Minie ball to rifles during the Civil War. In that war, they adopted new tools and then retained their same old tactics, walking up close in lines designed for weapons with much shorter ranges and much lower accuracy – and the slaughter was profound.
All our new DevOps tools are great, but in the same way, if we don’t adapt our way of thinking to them, they will make our lives worse, not better, for all their vaunted efficiency. You can do the wrong thing en masse and more quickly. The slaughter will similarly be profound.
A sysadmin suddenly deciding to code his own tools isn’t really the heart of DevOps. It’s fine and good, and I like seeing more tools created by domain experts. But the heart of DevOps, where you will really see the benefits in hard ROI, is developers and operations folks collaborating on real end-consumer products.
If you are doing anything DevOpsey, please think about “Why would a developer care about this?” How is it actionable to them, how does it make their lives easier? I’m a sysadmin primarily, so I love stuff that makes my job easier, but I’ve learned over the years that when I can get our devs to leverage something, that’s when it really takes off and gives value.
The same thing applies to the people on the security side. Why do we have this huge set of tools and techniques, of OWASP Top 10s and Live CDs and Metasploits and about a thousand wonderful little gadgets, but code is pretty much as shitty and insecure as it was 20 years ago? Because all those things try to solve the problem from the outside, instead of targeting the core of the matter, which is developers developing secure code in the first place. And to do that, it’s more of a hearts-and-minds problem than a tools-and-processes problem.
That’s a core realization that operations folks, and security folks, and testing folks, and probably a bunch of other folks need to realize, deeply internalize, and let it change the way they look at the world and how they conduct their work.
Welcome to the second installment in Scrum for Operations, a series where I talk about (and go through) the process of doing systems work as part of a DevOps team according to the Scrum methodology. Last time, I introduced the basics of Scrum as it generally is used, and its key benefit of frequently delivering useful functionality. But I already hear the objections – “How can that turn out all right?” It is so light on process that one’s initial inclination is to dismiss it as “cowboy coding”, and we already know not to be “cowboy sysadmins,” right? One’s intuition might be (and mine was in the beginning, I’ll be honest) that this would lead to a metastable process that could not be sustainable without fundamental fatal flaws overtaking it.
Well, as I learned after trying to learn more and kicking the tires with our dev team here, there are several core disciplines that are agile’s saving graces.
We ops guys are used to testing being a neglected afterthought in the development process, often tossed over the wall to a QA team that isn’t well integrated into the product. Therefore we have a hard time trusting code that’s being handed over to us given our experience – we get it handed to us and it doesn’t work!
Well, agile pretty much understands that without pervasive testing, this kind of fast cycle process is doomed. At its extreme, some practitioners use Test Driven, aka Test First, development where failing tests must be written first and then code filled in behind it till the test passes. This creates a large inherent test framework.
Even agile groups that don’t do this almost always have metrics on unit test coverage and a required bar devs must hit. Here, our desktop software group that’s newly using Scrum has the mandate that “there must be XX% unit test code coverage or you’re not ready to ship.”
Similarly, acceptance testing (automated continuous testing of user stories vs the code) is a common part of agile. Continuous ongoing testing ensures quality through the dev cycle and reduces the need for time-intensive, and mysteriously always insufficient, big-cycle regression testing.
This is a great culture. And there’s all kinds of different tests – unit test, integration/functional/regression testing, performance testing, fault testing… Starting to get interesting to you? How about monitoring? In reality application monitoring is a special case of testing – it’s a “lightweight integration regression test.” Our initial approach to DevOps includes making test coverage goals for things like monitors and performance testing, because that plugs into the existing agile mindset well.
Bonus new terminology thing – the quick acceptance test you do upon release, which we always called “critical path testing,” is now being called “smoke testing” by the hip. Update your dictionaries!
A side note on formal QA groups. Just as we are working on DevOps, there has been previous work on how QA teams interact with agile dev teams, and there are a variety of different doctrines on how to split the work – often, it’s devs that are responsible for a lot of the testing. It’s a hard balance – you want the devs to be responsible for some of the testing because the best testing is “close to the code,” but just like with Ops, a real QA team has expertise beyond what a developer can just bolt in with 10% of their attention. Here, we have a dev team and also a remote QA team; devs test their own code on the daily build and then there’s a weekly push to a more stable environment where the QA team does acceptance testing and is moving into performance testing and the like.
Anyway, this endemic focus on testing and automation of testing and testing metrics is the pin that makes this agile flywheel actually turn without just flying off. (You are correct, some agile teams don’t do this – we call those “the unsuccessful ones.”)
And this is for you to do as well! There’s a whole post or series of posts in the topic “What does a unit test mean for something infrastructurey” – it is incumbent on you to figure it out and also have high test coverage with your work.
In general, agile dev is the epitome of horizon planning. You know you can’t get all the requirements ahead of time (or if you do get them all ahead of time, what you come out with won’t serve any real human’s need) and similarly preplanned architecture and design often doesn’t survive contact with the scrum. So it’s not “don’t plan or design,” but it’s “plan and design in an ongoing manner.”
This is one of the scariest parts for an ops person – we assume that we get one “bite at the apple”, and once we’ve set up the systems and let in the developers, we’ll never be allowed to change anything without a fight. But developers have this problem internally all the time – one dev is working on a core library or API that other developers are using, and they don’t wait for core guy to get done before they start. Instead, they have adopted a concept they call refactoring. Refactoring just means that each sprint, you are open to redoing fundamental stuff that needs to change (or that you realize you did kinda ghetto in the first place).
Because this is an accepted part of the iterative approach, you get to leverage this as well.First iteration they get the basic Tomcat and mySQL install out of the repo, and they can get started – and then in the second iteration you front it with Apache, or tune the DB for security, or whatnot and they have to make some changes to fit. I’m not promising no one will ever cry about this, but it’s a part of what makes the culture successful so it’s there for you to leverage.
And for you to adhere to! Be open to refactoring your infrastructure based on the emerging project needs.
A developer might not even mention this, and most books on agile don’t, because to them it’s so fundamental a discipline that it’s like breathing air. Sadly the same can not be said of Ops folks, so I’m mentioning it. When code is changed, it is in a shared source control repository – which gives other people on the group visibility into it (a collaboration touchpoint), is a common place to source it from (a deployment touchpoint), and can be used to easily manage multiple versions, even experimental ones, and merge or roll back changes.
This is the most fundamental empowering technology of modern software development (not just agile) and you must uptake it immediately or you have lost. Fair warning. It is the stepping stone that will allow subsequent Cool DevOps Automation to happen.
These three disciplines convert agile/Scrum from dangerous free-for-all to a new technique that gets your product done both more quickly and with higher quality than a waterfall method. I’ll talk further next time about how Ops slots into all this, and how you can fit your systems admin work into a Scrum mindset.
Also note, there’s some other agile disciplines surrounding agile design and encapsulation and patterns and whatnot, which I don’t understand well enough yet to speak authoritatively on. Feel free and chime in with other core disciplines if you are!