Category Archives: Agile

Keep Austin Agile 2018 Trip Report

This Thursday, both myself and my boss (the SVP of Engineering at Alienvault) went to Keep Austin Agile, the annual conference that Agile Austin, the local Austin agile user group network, puts on!  I used to run the Agile Austin DevOps SIG till I just ran out of time to do all the community stuff I was doing and had to cut it out.

Logo-Tagline.2376.v2017.08.16

It’s super professional for a practitioner conference, and was at the JW Marriott in downtown Austin one day only.  It was sold out at 750 people. I figured I’d share my notes in case anyone’s interested.  All the presentations are online here and video is coming soon.

DevOps Archaeology

My first session was DevOps Archaeology by Lee Fox (@foxinatx), the cloud architect for Infor. The premise is that it’s an unfortunately common task in the industry to have to “go find out how that old thing works,” whether it’s code or systems or, of course, the hybrid of the two.  So he has tips and tools to help with that process.  Super practical.  Several of my engineers at work are working on projects that are exactly this. “Hey that critical old system someone pooped out 3 years ago and then moved on – go figure it out and operationalize it.”

I basically wrote down the list of cool tools that help with this process…
  • Codecity – visualizes your code as a city
  • Gource  – visualizes the evolution of your codebase over time
  • Signaturesurvey – scan for patterns in code
  • Logstalgia – visualizes historical traffic to a Web endpoint
  • Proxies – setting up proxies helps understand what’s going on, at an even deeper level than flow logs.
  • Monitoring – you know, all the usual monitoring tools.
  • Logs – you know, all the usual log aggregation tools.
Then he had a bunch of AWS-specific tools too.  All our stuff is in AWS, so super useful.
  • Cloudtrail – AWS API logs, yeah.  We pump our cloudtrail into our own USM Anywhere instance to report on weirdness.
  • Config – new service, have it report on things not tagged right, if volumes are encrypted, whatever kind of rules you want to set up.  Nice!
  • Trusted Advisor – well, don’t trust it too much, I’ve learned the hard way there’s lots of limits and stuff it doesn’t know about.  But useful.
  • Macie – “machine learning” (I always put that in scare quotes nowadays because of its overuse) to identify weirdness in your environment. Detect high risk cloudtrail events, unusual locations of activity, and so on.

And, some discussion of testing, config management, and so on.  Great talk, I will look into some of these tools!

Brewing Great Agile Team Dynamics

This talk, by Allison Pollard (@allison_pollard) and Barry Forrest (@bforrest30), wasn’t really my cup of tea. It did a basic 4-quadrant personality survey to break us up into 4 categories of Compliance, Dominance, Steadiness, or Influencer.  Then we spent most of the time wandering the room in a giant circle doing activities that each took 10 minutes longer than they needed to.

So I’m fine with the 4 quadrant thing – but I got taught a similar thing back when starting my first job at FedEx back in 1993, so it wasn’t exactly late breaking news.  (Driver, Analytical, Amiable, and Expressive were the four, IIRC.)  As a new person it was illuminating and made me realize you have to think about different personalities’ approaches and not consider other approaches automatically “bad.” So yay for the concept.

But I’m not big on the time consuming agile game thing that is at lots of these conferences. “What might turn you off about a Dominant person?  That they can be rude?” Ok, good mini-wisdom, should it take 10 minutes to get it? Maybe it’s just because I’m a Driver, but I get extremely restless in formats like this. A lot of people must like them because agile conferences have them a lot, but they’re not for me.

Modern Lean Leadership

Next up was Modern Lean Leadership by Mark Spitzer (@mspitzer), an agile coach. I love me some Deming and also am always looking to improve my leadership, so this drew me in this time slot.

First, he quoted Deming’s 14 points for total quality management.  For the record (quoted from asq.org:

  1. Create constancy of purpose for improving products and services.
  2. Adopt the new philosophy.
  3. Cease dependence on inspection to achieve quality.
  4. End the practice of awarding business on price alone; instead, minimize total cost by working with a single supplier.
  5. Improve constantly and forever every process for planning, production and service.
  6. Institute training on the job.
  7. Adopt and institute leadership.
  8. Drive out fear.
  9. Break down barriers between staff areas.
  10. Eliminate slogans, exhortations and targets for the workforce.
  11. Eliminate numerical quotas for the workforce and numerical goals for management.
  12. Remove barriers that rob people of pride of workmanship, and eliminate the annual rating or merit system.
  13. Institute a vigorous program of education and self-improvement for everyone.
  14. Put everybody in the company to work accomplishing the transformation.

His talk focused on #7 and #8 – instituting leadership and driving out fear.

Many organizations are fear driven. Even if it’s more subtle than the fear of being fired, the fear of being proven wrong, losing face, etc. is a very real inhibitor.  Moving the organization from fear to safety to awesome is the desired trajectory.

He uses “Modern Agile” (Modernagile.org) which I hadn’t heard of before, but its principles are aligned with this:

  • Make People Awesome
  • Make Safety a Prerequisite
  • Experiment & Learn Rapidly
  • Deliver Value Continuously

So how do we create safety? There’s a lot to that, but he presented a quality tool to analyze fear and its sources – who cares and why – to help.

Then the next step is to determine mitigations, and how to measure their success and timebox them. I’m a big fan of timeboxing, it is critical to making deeper improvement without being stuck down the rabbit hole.  I tell my engineers all the time when asked “well but how much do I go improve this code/process” to pick a reasonable time box and then do what you can in that window.

OK, but once you have safety, how do you make people awesome? Well, what is awesome about a job?  Focus on those things.  You can use the usual Lean techniques, like stop-work authority, making progress visible (e.g. days without an incident), using the Toyota kata for continuous improvement, using Plan-Do-Check-Act…

In terms of tangible places to start, he focused on things that disrupt people’s sleep at night, doing retros for fear/safety, and establishing metric indicators as targets for improvement.

Another good talk!

How The Marine Corps Creates High-Performing Teams

Andy McKnight gave this interesting talk – explaining how the Marines build a culture and teamwork, so that we might adapt their approach to our organizations.  I do like yelling at people, so I am all in!

Marine boot camp is partially about technical excellence, but also about steeping recruits in their organizational culture. (In business, new hire orientations have been shown to give strong benefits… And mentoring after the fact.)

What is culture?  It is the shared values, beliefs, assumptions that govern how people behave.

Most organizations have microcultures at the team level.  But how do you make a macroculture?  Culture comes first, teambuilding second.

  1. shift your org structure to align with the value stream instead of functional silos
  2. measure as a team

The 11 Marine Corps Leadership Principles:

  1. Know yourself and seek self-improvement.
  2. Be technically and tactically proficient.
  3. Develop a sense of responsibility among your subordinates.
  4. Make sound and timely decisions.
  5. Set an example.
  6. Know your people and look out for their welfare.
  7. Keep your people informed.
  8. Seek responsibility and take responsibility for your actions.
  9. Ensure assigned tasks are understood, supervised, and accomplished.
  10. Train your people as a team.
  11. Employ your team in accordance with its capabilities.

On the scrum team – those necessary to get the work done

The two Leadership Objectives – mission accomplishment and team welfare, a balance.

Discussion of Commanders Intent and delegating decisions down to the lowest effective level.

Good discussion, loads of takeaways. At my work I would say we are working on developing a macroculture but don’t currently have one, so I’ll be interested to put some of this into practice.

Agile for Distributed Teams

And finally, Agile for Distributed Teams by Paul Brownell (@paulbaustin). At my work we have distributed teams and it’s a challenge. Lots of stuff in the slides, my takeaways are:

  • People’s biggest concern – not understanding enough context, not sharing values
  • Use multiple communication channels – video, chat, email.
  • Get F2F time.  Quarterly.  Make it happen. Use ambassadors.
  • Expose the team to Other parts of the org, get users involved
  • Establish rules of engagement – hours, channels, etc. for clarity.
  • Teams will have local subcultures – make a space for shared learning, encourage lateral communication, emphasize early progress.
  • Use icebreakers in standups etc – something about your week
  • Teambuilding- slack channels, scavenger hunts
  • Sprint planning – one or two meetings? Involve the team.
  • Standups – try all on headsets to level the playing field for in room/out of room.
  • Online whiteboards
  • Retros – be creative, get written feedback ahead of time

All right!  4 of 5 sessions made me happy, which is a good ratio. Check out these talks and more on the Keep Austin Agile 2018 Web site!  It’s a large and well run conference; consider attending it even if you’re not an “agile coach”!

 

2 Comments

Filed under Agile, Conferences, Security

Long live ChatOps, RIP AOL IM!

I grew up in Muscat, Oman, and it was an exciting time when we got Internet at home in 1996. By 1998, all of my friends who had Internet at home were first on ICQ and then on AOL IM. AOL IM was huge when I went to college in the early 2000’s and was the primary way to connect friends together to chat. Back then, it was rare to have chat rooms, and the rooms that existed were usually long-running things set up to talk about general topics.

The first time I saw value in a chat room in a professional setting was when I got invited to a Basecamp “deploy room” by fellow Agile Admin Peco (or was it Ernest?) at NI when our quarterly release cycle was going super poorly, and all of us (100 other people) were waiting around at hour #34 trying to figure out why some random enterprise application was holding up the rest of the release process. Post invitation to the room, I was able to look at the past messages between the ops team about application failures, and then realized pretty quickly that our databases weren’t actually responding like they should. It took all of 10 minutes to ask someone on the ops side with credentials to run a database query, and figure out that the db creds were all wrong. 2 hours later, the release was all done…

That moment made me realize that 1×1 chats were great, but having a persistent chat rooms with teams of people added value to an organization.

Recently, a colleague asked me a simple question that made me reflect. He asked, “What’s the big deal about Slack?”. At work, there’s been a big push to move towards Slack, when we’ve had 1×1 chat forever. Here are my 5 most compelling reasons for doing so:

1) Collaboration++: 15 years ago, software was a simpler, and there was no cloud/microservices. You’d have 1 large binary to deploy for a platform, and typically have a few folks who understood the overall workings of platform. Today, with microservices, you require a bunch of applications to deploy, and each of these have specific owners who understand specifics. Thus, you’re going to have to have conversations with multiple folks to figure out any issues. Having this in a room setting versus a 1×1 setting gets you to a resolution faster.

2) Chat metadata: Chat is less about words, and more about conversations that include images, links, slash commands, workflows etc. Chatops tools make pasting these much easier than before, and looking at formatted code in Slack is so much easier to read than looking at the same in pidgin.

3) Chat History: Chat apps now give you history – even from when you were not online or in the chat room. This is valuable from the perspective that you can see everything from when you weren’t around, and don’t have to ask someone to keep repeating the problem over and over again. You can just scroll up, read the context, and be ready to help if you can. This is my one knock against IRC (or at least the implementation of IRC at a company I worked at); it was nice to have everyone in a spot, but it only worked when we were VPN’ed in, and had no history.

4) Pipelining with chatbots: Continuous Integration/Delivery is all the rage these days! Having a chat system that allows for your devops systems to push data is a primary requirement in order to build a pipeline of this sort. Responses to broken builds, tests, alerts are quicker when the data associated with these are transmitted to a chatroom that you’re looking at, than having to look at Jenkins all the time. Chatbots are invaluable in this scenario, and help you with information flow.

5) The new normal: A new generation of engineers already do this. It’s already part of the culture for the next generation of engineers who work on open source (for example, kubernetes slack) and there’s even chatter about slack at Universities now. The world is evolving towards broader conversation, and not having chatops tools will hurt your company in terms of hiring and retention.

 

Agree/Disagree, or have a different perspective? Let me know by commenting below!

4 Comments

Filed under Agile, DevOps

DevOps Foundations: Lean and Agile

Well you’re in for a treat – we’re getting all of the Agile Admins in on making DevOps courses, and Karthik and I did a course that’s just released today – DevOps Foundations: Lean and Agile.

It’s available both on LinkedIn Learning:
https://www.linkedin.com/learning/devops-foundations-lean-and-agile

and on Lynda.com:
https://www.lynda.com/JIRA-tutorials/DevOps-Foundations-Lean-Agile/622078-2.html

After James and I did DevOps Foundations, the “101” course, we were focused on building out courses for the three major practice areas of DevOps – Continuous Deployment, Infrastructure Automation, and Site Reliability Engineering (in progress now). But our lynda.com content manager said there was interest in us also expanding on the use of Agile and Lean especially as it relates to DevOps.

Karthik is our agile admin Agile expert; he’s presented at several Agile conferences and the like, so he and I decided to take it on.  But how would we bring a DevOps specific take to it?  We started outlining a course and realized it could turn into a giant boring encyclopedia of every Lean and Agile term ever. Most of what we have to add isn’t reading definitions, it’s sharing our experiences actually doing this (my Scrum for Operations series on this blog is perennially popular).

So we decided to take a tip from both Eliahu Goldratt’s The Goal and Gene Kim’s The Phoenix Project by framing the course as a fictional story!  By stitching a narrative together of a Lean, Agile, DevOps transformation of a hypothetical company out of our real world stories from a variety of implementations, we figured we could explain the concepts in context and make them more interesting.  Let us know what you think!

Lynda Course Description:

By applying lean and agile principles, engineering teams can deliver better systems and better business outcomes—both of which are crucial to the success of DevOps. In this course, instructors Ernest Mueller and Karthik Gaekwad discuss the theories, techniques, and benefits of agile and lean. Learn how they can be applied to operations teams to create a more effective flow from development into operations and accelerate your path of “concept to cash.” In addition to key concepts, you can hear in-the-trenches examples of implementing lean and agile in real-world software organizations.

Topics Include:
What is agile?
What is lean?
Measuring success
Learning and adapting
Building a culture of metrics
Continuous learning
Advanced concepts

Duration:
1h 26m

Leave a comment

Filed under Agile

Agile Organization: Project Based Organization

This is the fourth in the series of deeper-dive articles that are part of Agile Organization Incorporating Various Disciplines.

The Project Based Organization Model

You all know this one.  You pick all the resources needed to accomplish a project (phase of development on a product), have them do it, then reassign them!

project

Benefits Of The Project Based Model

  • The beancounters love it. You are assigning the minimum needed resource to something “only as long as it’s needed.”

Drawbacks Of The Project Based Model

Where to even begin?

  • First of all, teams don’t do well if not allowed to go through the Tuckman’s stages of development (forming/storming/norming/performing); engineer satisfaction plummets.
  • Long term ownership isn’t the team’s responsibility so there is a tendency to make decisions that have long term consequences – especially bearing on stability, performance, scalability – because it’s clear that will be someone else’s problem. Even when there is a “handoff” planned, it’s usually rushed as the project team tries to ‘get out of there’ from due date or expenditure pressures. More often there is a massive generation of “orphans” – services no one owns. This is immensely toxic – it’s a problem with shipping software, but with a live service it’s awful, as even if there’s some “NOC” type ops org somewhere that can get it running again if there’s an issue, chronic issues can’t be fixed and problems cause more and more load on service consumers, support, and NOC staff.
  • Mentoring, personnel development, etc. are hard and tend to just be informal (e.g. “take more classes from our LMS”).

Experience With The Project Based Model

At Bazaarvoice, we got to where we were getting close to doing this with continued reorganization to gerrymander just the right number of people onto the projects with need every month. Engineer satisfaction tanked to the degree that it became an internal management crisis we had to do all kinds of stuff to dig ourselves back out of.

Of course, many consulting relationships work this way. It’s the core behind many of the issues people have with outsourced development. There are a lot of mitigations for this, most of which are “try not to do it” – like I’ve worked with outsourcers trying to ensure lower churn on outsource teams, try to keep teams stable and working on the same thing longer.

It does have the merit of working if you just don’t care about long term viability.  Developing free giveaway tools, for example – as long as they’re not so bad they reflect poorly on your company, they can be problematic and unowned in the long term.

Otherwise, this model is pretty terrible from a quality results perspective and it’s really only useful when there’s hard financial limitations in place and not a strong culture of responsibility otherwise. It’s not very friendly to agile concepts or devops, but I am including it here because it’s a prevalent model.

1 Comment

Filed under Agile, DevOps

Agile Organization: Fully Integrated Service Teams

This is the third article in the series of deeper-dive articles that are part of Agile Organization Incorporating Various Disciplines.

The Fully Integrated Service Team Model

The next step along the continuum of decentralization is complete integration of the disciplines into one service team. You simply have an engineering manager, and devs, operations staff, QA engineers, etc. all report to them. It’s similar to the Embedded Crossfunctional Team model but you do away with the per-discipline reporting structure altogether.

integrated

Benefits Of Integrated Service Teams

This has the distinct benefit of end to end ownership. Engineers of every discipline have ownership for the overall product. It allows them to break out of their single-discipline shell, as well – if you are good at regression testing but also can code, or are a developer but strong in operations, great!  There’s no fence saying whose job is whose, you all pull tasks off the same backlog. In general you get the same benefits as the Crossfunctional Team model.

Drawbacks of Integrated Service Teams

This is theoretical nirvana, but has a number of challenges.

First, a given team manager may not have the knowledge or experience in each of those areas. While you don’t need deep expertise in every area to manage a team, it can be easy to not understand how to evaluate or develop people from another discipline. I have seen dev managers, having been handed ops engineers, fail to understand what they really do or they value, and lose them as a result.

Even more dangerous is when that happens and the manager figures they didn’t need that discipline in the first place and just backfills with what they are comfortable with. For a team to really own a service from initiation to maintenance, the rest of the team has to understand what is involved. It’s very easy to slip back into the old habits of considering different teams first class vs second class vs third class citizens, just making classes of engineer within your team. And obviously, disenfranchising people works directly against energizing them and giving them ownership and responsibility.

Mitigations for that include:

  1. Time – over time, a team learns the basics of the other branches and what is required of them.
  2. Discipline “user groups” (aka “guilds”) – having a venue for people from a horizontal discipline to meet and share best practices and support each other. When we did this with our ops team we always intended to set up a “DevOps user group” but between turnover and competing priorities, it never happened – which reduced the level of success.

A second issue is scaling. Moving from “zone” to “man” coverage, as this demands, is more resource intensive. If you have nine product teams but five operations engineers, then it seems like either you can’t do this or you can but have to “share” between several teams.  Such sharing works but directly degrades the benefits of ownership and impedance matching that you intend to gain from this scheme. In fact, if you want to take the prudent step of having more than one person on a team know how to do something – which you probably should – then you’d need 18 and not just nine ops engineers.

Mitigations for this include:

  1. Do the math again. If the lack of close integration with that discipline is holding back your rate of progress, then you’re losing profits to reduce expenditures – a bad bet for all but the most late-stage companies.
  2. Crosstraining. You may have one ops, or QA, or security expert, but that doesn’t (and, to be opinionated, shouldn’t) mean that they are the only ones who know how to perform that function.  When doing this I always used the rule “if you know how to do it, you’re one of the people that should pull that task – and you should learn how to do it.” This can be as simple as when someone wants the QA or ops or whatever engineer to do something, to instead walk the requestor through how to do it.

Experience with Integrated Service Teams

Our SaaS team at NI was fully integrated. That worked great, with experienced and motivated people in a single team, and multiple representatives of each discipline to help reinforce each other and keep developing.

We also fully integrated DevOps into the engineering teams at Bazaarvoice.  That didn’t work as well, we saw attrition from those ops engineers from the drawbacks I went over above (managers not knowing what to do with/how to recruit, retain, develop ops engineers). In retrospect we should not have done it and should have stayed with an embedded crossfunctional team in that environment – the QA team did so and while collaboration on the team was slightly impeded they didn’t see the losses the ops side did.

Leave a comment

Filed under Agile, DevOps

Agile Organization: Embedded Crossfunctional Service Teams

This is the second in the series of deeper-dive articles that are part of Agile Organization Incorporating Various Disciplines. Previously I wrote about the “traditional” siloed model as Agile Organization: Separate Teams By Discipline.

Basic agile doctrine, combined with ITSM thinking, strongly promulgate a model where a team owns a business service or product.  I’ll call this a “service team” for the sake of argument, but it applies to products as well.

The Embedded Crossfunctional Team Model

Simply enough, a service team has specialists from other groups – product, QA, Ops, whatever – assigned to it on an exclusive basis. They go sit with (where possible) and participate with the group in their process. The team then contains all the resources required to bring their service or product to completion (in production, for services). The mix of required skills may vary – in this diagram, product #4 is probably a pure back end service and product #3 is probably a very JavaScriptey UI product, for example.

embedded

Benefits of Embedded Crossfunctional Teams

This model has a lot of benefits.

  • Since team members are semi-permanently assigned into the team, they don’t have conflicting work and priorities. They learn to understand what the needs are of that specific product and collaborate much more effectively with all the others working on it.
  • This leads to removal of many bottlenecks and delays as the team, if it has all the components it needs to deliver its service, can organically assign work in an optimal way.
  • It is amazing how much faster this model is in practice than the separate teams by discipline model in terms of time to delivery.
  • Production uptime and performance are improved because the team is “eating its own dog food” and is aware of production issues directly, not as some ticket from some other team that gets routed, ignored, finger-pointed…

Drawbacks of Embedded Crossfunctional Teams

  • Multiple masters is always an issue for an engineer.  Are you trying to please the service team’s manager or your “real” manager, especially when their values seem to conflict? You have to do some political problem-solving then, and engineers hate doing that. It also provides some temptation to double-resource or otherwise have the embedded engineer “do something else” the ops team needs done, violating the single service focus.
  • It’s more expensive.  Yes, if you do this, you need at least one specialist per service team. You have to play man-to-man, not zone, to get the benefits of the approach. This should make you think about how you create your service teams; if you define a service as one specific microservice and you have teams with e.g. 2 devs on them, then embedding specialists is way more expensive. Consider basing things on the business concept of a service and having those devs working on more than one widget, targeting the 2-pizza team size (also those devs will be supporting/sustaining whichever services aren’t brand new). (Note that this is only “more expensive” in the sense that doesn’t bother with ROI and just takes raw costs as king – you get stuff out and start making revenue faster, so it’s really less expensive from a whole-company POV, but from the IT beancounter perspective “what’s profit?”)
  • While you get benefits of crosstraining across the disciplines, the e.g. ops folks don’t have regular all-day contact with other ops folks and so you need to take care to set up opportunities for the “ops team” to get together, share info, mentor each other, etc. as well. Folks call these “tribes” or “guilds” or similar.

Experience with Crossfunctional Teams

It can be hard at the beginning, when teams don’t understand each others’ discipline or even language yet.  I had a lengthy discussion with an application architect on one team – he felt that having ops people in the design reviews he was holding was confusing and derailing. The ops people spoke some weird moon-man language to him and it made the reviews go longer and require a lot more explanation.  I said “Yes, they do.  But we have two choices – keep doing it together, have people learn each others’ concerns and language, and start a virtuous cycle of collaboration, or split them apart and propagate the vicious cycle we know all too well where we have difficulty working together.” So we powered through that and stayed together, and it all worked out well in the end.

When we piloted this at Bazaarvoice, one of the first ops to embed got in there, worked with the devs, and put all his work into their JIRA project.  The devs got sticker shock very quickly once they saw how much work there was in delivering a reliable service – and when they dug into those tickets, they realized that they weren’t BS or busywork, but that when they thought about it they said “yeah… We certainly do need that, all this is for real!” The devs then started pulling tickets on monitoring, backups, provisioning, etc. because they realized all that workload on one person would put their delivery date behind. It was nice to see devs realize all the work that really went into doing ops – “out of sight, out of mind,” and too often devs assume ops don’t do anything except move their files to production on occasion. The embedding allowed them to rally to control their own delivery date instead of just “be blocked on the ops team.”

No one approach is “best,” but in general my experiences so far lead me to consider this one of the better models to use if you can get the organizational buy-in to the fundamental “you built it, you run it” concept.

Leave a comment

Filed under Agile, DevOps

Agile Austin DevOps SIG June Meeting – Change Patterns

aalogoAgile Austin has a DevOps SIG that meets monthly over lunch at BancVue, and I help Dan Zentgraf out with it. This month’s meeting is on Wednesday June 24 and is called:

Change Patterns – Don’t Get Mowed Down with the Grassroots

Come on out!  RSVP on Eventbrite; lunch is provided.

Leave a comment

Filed under Agile, DevOps

Scrum for Operations: Just Add DevOps

OK, so it’s been a while since the last installment in this series. I had talked about how we’d brought Scrum to our operations team, which worked fine, and then added in the developers as well, which didn’t.  Our first attempt at dividing the whole product up into four integrated DevOps service teams collapsed quickly. Here’s how we got past that to do it again and succeed, with a fully integrated DevOps setup managing both proactive/project/feature work and reactive/support/tactical work in an effective way.

The first challenge we had was just that I couldn’t manage four new scrum teams totally on my own. We had gotten the ops team working well on Scrum but the development team hadn’t been. We didn’t have any scrum masters other than me, and were low on people designated as tech leads as well. So step one, we just mushed everyone into a 30+-person Scrum team, which sucked but was the least of the evils, and I immediately worked on mentoring people up as both Scrum masters and tech leads. I basically led by example and asked for volunteers for who was interested in Scrum mastering and/or being a tech lead for each of the planned sub-teams.

This was interesting – people self-selected in ways I would not have predicted. Some of the willing were employees and others were contractors – fine, as far as I’m concerned. “From each according to his ability, to each according to his need.” I then maximally had them start to lead standups and take ownership over sprints and such, coaching them in the background.  In that way, it only took a little while to re-form up into those four teams. This gave some organic burn-in time as well, so that the ops folks got more used to not being “the ops team” and leaning on people that were supposed to be on other sprint teams. As I empowered the new leaders this became self-correcting (“Stop working with that other guy on their stuff, we have things that need doing for our service!”).

The second and largest problem was managing the work.

Part 1: “Planned” Work

Product Management

We had practically zero product manager support at the beginning, so it was up to us to try to balance planned work from many sectors – feature requests from PM, tooling requests from Customer Service and Technical Support, security stuff from Security, random requests from Legal and Compliance, brainwaves from other products’ engineering managers. It quickly began to take a full person worth of time to handle, because if not managed proactively it turned into attending meetings with people complaining as a reactive result. I went to the PM team and said “come on man, give us more PM support,” and once we got one, I worked with her on being able to manage the whole overall package.

One of the chronic product manager problems in a SaaS environment is the “not my domain” syndrome.  PMs want to talk about features, but when it comes to balancing stakeholders like Security and internal operational projects, they wash their hands of it. At best you get a “Well… You guys get 50% of your time to figure out what to do with that stuff and then 50% of your bandwidth will be my features, OK?”

For the record, that’s not OK.  As a product manager, you need to be able to balance all the work against all the work. Maybe you don’t have an ops background, that’s fine – you probably didn’t have a <business domain here> background when you came to work either.  Learn.  A lot of the success of a SaaS product is in the balancing of features against stability/scalability work against compliance work… If you want to take the “I’m the CEO of the product” role, then you need to step up and own all of it, otherwise you’re just that product’s Director of Wishful Thinking.

Anyway, luckily our PM, even though she didn’t have experience doing that, was willing to learn and take it on. So we could reason about spending a month putting in a new cluster versus adding a new display feature – they all went into the same roadmap.

Work Management in JIRA

We managed that in the standard, simple way of a quarterly spreadsheet roadmap correlating to epics in an Agile rapid board in JIRA; we’d do stakeholder meetings and stack rank them and then move them into the team backlogs and flesh them out when they were ready for execution. (It was important to keep the clutter out of the backlogs till then – even if it’s supposed to be future stuff, the more items sitting in a team’s backlog, the worse their focus is.)

We kept each service as one JIRA project but combined them into rapid boards as necessary – a given team might own a couple (like the “Workbench and CMS Team” had those two and a smaller tooling one). This way when we transferred a piece of tech around we could just move the JIRA project as well, and incorporate it into the target team’s rapid board.

Portfolio Management

Some people say that the ideal world is each team owning one microservice. I don’t agree with this – we had a number of teams under other parts of the org that were only like 2 people because they owned some little service. This was difficult to sustain and transition; when things like Black Friday came up that required 24×7 support from each team for a week it was brutal on them, and even worse once development eased up on a service it just got orphaned.

If you don’t keep a service portfolio and tightly manage who is tasked with supporting each one, you are opening yourself up to a world of hurt.  And that’s where we started. Departmentwide, we’d have teams work on something and then wander off and if there was a problem they’d call on the dev who was somewhere working on some other deadline now. This worked terribly. I got the brunt of this earlier when I had joined the company as a release manager and asked “So, those core services, who owns those?” “Well… No one.  I mean everyone!”

So for my teams, we put together a tracking list, used both as a service registry but also for cross-training. It was a simple three-level outline, with major services, service components, and low level elements. We had the whole team work together on filling out the list – since we were managing a 7-year-old legacy system we ended up with a list of 275 or so leaf items.  Every one had to have an owning team (team, not individual), and unless you retired a service, there was no dropping it on the floor. You owned it, you retired or transitioned it, and I didn’t care if “the dev that worked on that moved to some other team.” Everything was included, from end user facing services like “the user portal” to internal services like our CI servers.

Team Management

This transitions into how we managed the teams. Teams were standard “two-pizza size” – a team of 5-7 people is optimal and we would combine services up till there was enough work for a team of that size.  This avoided the poor coverage of mini-teams for micro-services.

Knowledge Management

Then we also used the service registry as the “merit badge system.” We had a simple qualification procedure – to get qualified in an element, you had to roll out code on it and you could sign off for yourself to say that you were qualified on one of those leaf elements.  To get your “merit badge” in a service component, you needed an existing subject matter expert (SME) to sign off that you knew what you were doing, and you needed to understand how the service was written, deployed, tested, monitored. To become a SME in a service, the existing SMEs would have a board of review. SMEs were then understood to be qualified to make architectural changes on their own initiative, those with merit badges were understood to be qualified to make code changes with nothing more than the usual code review and CI testing.

This was very important for us because we were starting in a place where people had been allowed to specialize too much.  Only one person was the SME on any given service and if the person who didn’t understand the Workbench was out, or quit, or whatever, suddenly no one knew a whole lot about it.  That’s a terrible velocity burden even without attrition and it’s a clear and present danger to your service’s viability when there is.  We started tracking the “merit badges” and had engineers set goals around how many they’d earn (or how many they’d review, for the more experienced). We used a lot of contract programmers and I told the contractor manager that I wanted to use that to rate the expertise of people on our account and that I wanted to see the numbers rise on each person over time.

Part 2 – “Unplanned” Work

Our team was only doing planned work 40% of the time, however.  Since we were integrated DevOps teams working on a service with thousands of paying end customers, and that service was custom-integrated with customer Web sites and import/export of customer data, there was a continuous load of work coming in from Customer Support and from our own operations (alerts, etc.). All this “flow” work was interrupt-driven, and urgent to one degree or another.

The usual techniques for handling this have a lot of problems.  “Let’s make a separate sustaining team!” Well, this turns into a set of second class developers.  Thus you get worse devs there on that team. And those devs are more utilized one week, less the next; when things are quiet they get seconded to other efforts by people that haven’t read the Principles of Product Development Flow and think having everyone highly utilized is valuable, and then the load ramps up and something gives… Plans emerge to make it a training ground for new devs, till you realize that even new devs don’t want to put up with that and just quit… I’ve seen this happen in several places. I am a firm believer in dog fooding – if you are writing a service, you need to handle its problems.  If there are a lot of problems, fix it!!!  If you are writing the next version of the service – well, doing that in isolation from the previous service dooms you to making the same errors, adding in some new ones as well. No sustaining teams, only evergreen teams.

So we had the rule that each integrated team of devs and ops handled their own dirty laundry. And at 60% of the workload, that was a lot of laundry.  How did we do it? Everyone was worried about how we could deliver decent feature velocity while overwhelmed by flow work.  Therefore…

Triage

In addition to the teams’ daily standups, we had a daily triage meeting where the engineering manager, team leads, PM, and representatives from Support, Sales, and whoever had a crisis item that they felt needed to be handled on an expedited basis (not waiting on the next sprint, in other words) would come to. Each new intake would be reviewed. In the beginning we had to kick a lot of requests back for insufficient detail, but that corrected itself fast. We’d all agree on priority – “that’s affecting one customer but they’re one of our top customers, we’ll work it as P2” or the like.

For customer reported issues, we had SLAs we committed to in contracts (1 day for P1, etc). Ops issues could be super urgent – “P1, one of our services is down!” – or doable later on. So what we did was to create a separate Kanban board in JIRA for all these kinds of issues. Anything that really could wait or was large/long term would get migrated into the Scrum backlogs, but anything where “we need someone to do this stat” went in here.  It served the same purpose as an “expedite lane,” it was just a jumbo four-lane highway of an expedite lane because so much of the work was interrupt driven.

But does this mean “give up on Scrum?”  No. Without the sprint cadence it was hard to hit a release cadence, and also it was easy for engineers to just get lost in the soup and stop delivering at a good rate, frankly. So we still ran sprints, and here was the rules of engagement we provided the engineers.

Need More Work?

  1. Pull things for your team/service off the triage queue
  2. If there’s nothing in the triage queue, pull the next item from the sprint backlog
  3. If there’s nothing in the sprint backlog or the triage queue, pull something off the top of the backlog. Or relax, either one.

Then for standups, we had a master Agile board that contained everything – all projects, the triage board, everything.  So when you looked at a given engineer’s swimlane, you could see the sprint work they had, the flow work they had, and anything they were working on from someone else’s project (“Hey, why are you doing that?”). Again, via JIRA agile board composition that’s easy to do. Sometimes teams would try to do standups just looking at swimlanes containing “their” projects and it always ended up with things falling in the gap, so each time that happened I reiterated the value of seeing everything that person is assigned, not just what you think/hope they are assigned, since they are exclusively attached to that one team.

At first, everyone fretted about the conflict between the flow work and the sprint work.  “What if there’s a lot of support work today?!?” But as we went on sprint by sprint, it became evident that over the course of a two week sprint, the amount of support and operations work evened out.  Sprint velocity was regular despite the team working flow work as well.  Having full-sized sprint teams and two-week iterations meant that even if one day there was a big production issue and whoever grabbed it was slammed – over the rest of the time and people it evened out. This shouldn’t be too surprising, it’s called “flow” for a reason – there are small ebbs and surges but in general the amount over time at scale is the same. Was it as perfectly “efficient” as having some people working flow and others working sprints?  No, there is definitely a % overhead that incurs. But maximum utilization of each person’s time, as I mentioned before, is a golden calf. Lean principles show us that we want the best overall outcome – and this worked.

Flow work was addressed quickly, and our customer ticket SLA attainment percentage whipped up from a starting level of less than 50% over several quarters to sit at 100%, meaning every single support ticket that came in was addressed within its advertised SLA. Once that number hit 100% the support ticket time to live started to fall as well.

At the same time, sprint velocity for each of the four sprint teams went up over time – that is, just the story points they were delivering out of the feature backlog improved, in addition to the improvements in flow work. We’d modify the process based on engineer feedback and try alterations for a sprint to see how they panned out, but in general by keeping the overall frame in place long enough that people got used to it, they became more and more productive. Flow work and planned work both improved at the same time, not at each others’ expense. 

The Dynamic Duo

This scheme had two issues.  One was that engineers were sometimes confused about what they should be working on, flow tickets with SLAs or their sprint tasks. Our Scrum masters were engineers on the teams too, they weren’t full time PMs that could afford to be manually managing everyone’s work. The second was that operational issues that came in and required sub-day response time couldn’t wait for triage and ended up with either frantic searches for anyone who could help (which often became “everyone sitting in the team area”) or missed items.

I have always been inspired by Tom Limoncelli’s Time Management for System Administrators. He advocates an “interrupt shield” where someone is designated as the person to handle walkups and crises.  At NI I had instituted this in process, at BV in the previous Ops team there had been a “The Dude” role (complete with Jeff Bridges bobblehead) that had been that. Thus the Dynamic Duo were born.

The teams each had one dev and one DevOps on call at a given time; we managed the schedule in PagerDuty.  Whoever was on call that week became “on the Dynamic Duo” during the days as well.  When we went into a sprint, they would not pull sprint tasks and would be dedicated to operational and urgent support issues. It was the Dynamic Duo because we needed someone with ops and someone with dev expertise on this – one without the other meant problems were not effectively solved. I even made a cool wiki page with Batman and Robin stuff all over it and we got Bat-phones painted red. I evangelized this inside the company.  “Don’t walk up and grab a dev with your question. Don’t chat your favorite engineer to get them to work on something. Come turn on the Bat-signal, or call the Bat-phone, and the Dynamic Duo will come help you.”

This was good because it blunted the sharp tip of very urgent requests.  The remaining flow work (2 people couldn’t handle the 60% of the load that was interrupt driven) was easier for the sprinting devs to pull in as they finished sprint tasks without worrying about timeliness – the real crises were handled by the Dynamic Duo. The Duo also ran the triage meeting and even if they weren’t working on all the triage work, they bird dogged it as a kind of scrum/kanban/project manager over the flow work. In the rare case there wasn’t anything for them to jump on, they could slack – time to read, learn, etc. both as compensation for the oncall and adrenaline rushes that week but also because it’s hard to fit time for that into sprints… And as we know from the Principles of Product Development Flow, running teams at too high of a utilization is actually counterproductive.

Conclusion

That’s the short form of how we did it – I wanted to do a lot more but I realized that since it’s been a year since I intended to write this, I’d better shake and bake and get it out and then I’m happy to follow up with stuff any of you are curious about!

This got us to a pretty happy place.  As time went on we tweaked the sprint process, the triage process, and the oncall/Duo process but for a set of teams of our size with our kind of workload it was close to an optimal solution. With largely the same team on the same product, the results of these process changes were:

  • Flow work improved as measured by customer ticket SLA attainment and other metrics
  • Sprint work velocity improved as measured by JIRA reports
  • Engineering satisfaction improved as measured by internal NPS surveys

Improvement of all these factors was not slight, but was instead 50% or more in all cases.

Feel free and ask me about parts of this you find interesting and I’ll try to expand on them. It wasn’t as simple as “add Agile” or “add DevOps,” it definitely took some custom wrangling to balance our specific SaaS service’s needs in the best manner.

3 Comments

Filed under Agile, DevOps

Agile Organization: Separate Teams By Discipline

This is the first in the series of deeper-dive articles that are part of Agile Organization Incorporating Various Disciplines. It’s very easy to keep reorganizing and trying different models without actually learning from the process. I’ve worked with all of these so am trying to condense the pros and cons to help people understand the implications of the type of organizational model they choose.

The Separate Team By Discipline Model

Separate teams striated by discipline is the traditional method of organizing technical teams – segmented horizontally by technical skill.  You have one or more development teams, one or more operations teams, one or more QA teams.  In larger shops you have even more horizontal subdivisions – like in an enterprise IT shop, under the banner of Infrastructure you might have a data center team, a UNIX admin team, a SAN team, a Windows admin team, a networking team, a DBA team, a telecom team, applications administration team(s), and so on. It’s more unusual to have the dev side specifically segmented horizontally by tech as well (“Java programmers,” “COBOL programmers,” “Javascript programmers”) but not unheard of; it is more commonly seen as “UX team, services team, backend team…”

separateteamsIn this setup in its purest form, each team takes on tasks for a given product or project inside their team, works on them, and either returns them to the requester or passes them through to yet another team. Team members are not dedicated to that product or effort in any way except inasmuch as the hours they spend working on the to-do(s). Usually this is manifested as a waterfall approach, as a product or feature is conceived, handed to developers to develop, handed to QA to test, and finally handed to Operations to deploy and maintain.

This model dates back to the mainframe days, where it works pretty well – you’re not innovating on the infrastructure side, the building’s been built, you’re moving your apps into the pre-built apartment units. It also works OK when you have heavy regulation requirements or are constrained to extensively documenting requirements, then design, etc. (government contracts, for example).

It works a lot less well when you need to move quickly or need any kind of change or innovation from the other teams to achieve your goal. Linking up prioritization across teams is always hard but that’s the least of the issues. Teams all have their own goals, their own cadences, and even their own cultures and languages. The oft-repeated warning that “devs are motivated to make changes and ops is motivated by system stability” is a trivial example of this mismatch of goals. If the shared teams are supporting a limited number of products it can work. When there are competing priorities, I’ve seen it be extremely painful.  I worked in a shop where the multiple separate dev teams were vertical (line of business organized) but the operations teams were horizontal (technical specialty organized) – and frankly, the results of trying to produce results with the impedance mismatch generated by that setup was the nightmare that sent me down the Agile and DevOps path initially.

Benefits of Disciplinary Teams

The primary benefit of this approach is that you tend to get stable teams of like individuals, which allows them to bond with those of similar skills and experience organizational stability and esprit de corps. Providing this sense of comfort ends up being the key challenge of the other organizational approaches.

The second benefit is that it provides a good degree of standardization across products – if one ops team is creating the infrastructure for various applications, then there will be some efficiencies there.  This is almost always at least partially, and sometimes more than entirely, counteracted in value by the fact that not all apps need the same thing and that centralized teams bottleneck delivery and reduce velocity. I remember the UNIX team that would only provide my Web team expensive servers, even though we were highly horizontally scaled and told them we’d rather have twice as many $3500 servers instead of half as many $7000 servers as it would serve uptime, performance, etc. much better. But progress on our product was offered up upon the altar of nominal cost savings from homogeneity.

The third benefit is that if the horizontal teams are correctly cross-trained, it is easier avoid having single points of failure; by collecting the workers skilled in something into one group, losses are more easily picked up by others in the group. I have to say though, that this benefit is often more honored in the breach in my experience – teams tend to naturally divide up until there’s one expert on each thing and managers who actively maintain a portfolio and drive crosstraining are sadly rare.

Drawbacks of Disciplinary Teams

Conway’s Law is usually invoked to worry about vertical divisions in a product – one part of the UI written by one team, another by another, such that it looks like a Frankenstein’s monster of a product to the end user. However, the principle applies to horizontal divisions as well – these produce more of a Human Centipede, with the issue of one phase becoming the input of the next. The front end may not show any clear sign of division, but the seams in quality, reliability, and agility of the system can grow like a cancer underneath, which users certainly discover over time.

This approach promotes a host of bad behaviors. Pushing work to other people is always tempting, as is taking shortcuts if the results of those shortcuts fall on another’s shoulders. With no end to end ownership of a product, you get finger pointing and no one taking responsibility for driving the excellence of the service – and without an overall system thinking perspective, attempts at one of the teams in that value chain to drive an improvement in their domain often has unintended effects on the other teams in that chain that may or may not result in overall improvement. If engineers don’t eat their own dog food, but pass it on to someone else, then chronic quality problems often result. I personally spent years trying to build process and/or relationships to try to mitigate the dev->QA->ops passing of issues downstream with only mixed success.

Another way of stating this is that shared services teams always provide a route to the tragedy of the commons. Competing demands from multiple customers and the need for “nonfunctional” requirements (performance, availability, etc.) could potentially be all reconciled in priority via a strong product organization – but in my experience this is uncommon; product orgs tend to not care about prioritization of back end concerns and are more feature driven. Most product orgs I have dealt with have been more or less resistant to taking on platform teams, managing nonfunctional requirements, and otherwise interacting with that half of demands on the product. Without consistent prioritization, shared teams then become the focus of a lot of lobbying by all their internal customers trying to get resource. These teams are frequently understaffed and thus a bottleneck to overall velocity.

Ironically, in some cases this can be beneficial – technically, focusing on cost efficiency over delivering new value is always a losing game, but some organizations are self-unaware enough that they have teams continuing to churn out “stuff” without real ROI associated (our team exists therefore we must make more), in which case a bottleneck is actually helpful.

Mitigations for the weaknesses of this approach (abdication of responsibility and bottlenecking constraints) include:

  1. Very strong process guidance. “If only every process interface is 100% defined, then this will work”, the theory goes, just as it works on a manufacturing line.  Most software creation, however, is not similar to piecing components together to make an iPod. In one shop we worked for years on making a system development process that was up to this task, but it was an elusive goal. This is how, for example, Microsoft makes the various Office products look the same in partial defiance of Conway’s Law – books and books of standards.
  2. Individuals on shared teams with functional team affinities. Though not going as far as embedding into a product team, you can have people in the shared teams who are the designated reps for various client teams. Again, this works better when there is a few-to-one instead of a many-to-one relationship.  I had an ops team try this, but it was a many-to-one environment and each individual engineer ended up with three different ownership areas, which was overwhelming. In addition, you have to be careful not to simply dedicate all of one sort of work to one person, as you then create many single points of failure.
  3. Org variation: Add additional crossfunctional teams that try to bridge the gap.  At one place I worked, the organization had accepted that trying to have the systems needs of their Web site fulfilled by six separate infrastructure teams was not working well, so they created a “Web systems” team designed to sit astride those, take primary responsibility, and then broker needs to the other infrastructure teams. This was an improvement, and led to the addition of a parallel team responsible for internal apps, but never really got to the level of being highly effective. In addition those were extremely high-stress roles, as they bore responsibility but not control of all the results.

Conclusion

Though this is the most typical organization of technology teams historically, that history comes from a place much different than many situations we find ourselves in today. The rapid collaboration approach that Agile has brought us, and the additional understanding that Lean has given us in the software space, tells us that though this approach has its merits it is much overused and other approaches may be more effective, especially for product development.

Next, we’ll look at embedded crossfunctional service teams!

2 Comments

Filed under Agile, DevOps

Agile Austin asked me to help re-launch their blog, so I’ve contributed a piece on “What Is DevOps?” for them!

Leave a comment

by | March 16, 2014 · 9:58 am