Tag Archives: Management

by Ernest Mueller | January 29, 2025 · 5:20 pm

The Right Way To Use Tagging In The Cloud

This came up today at work and I realized that over my now-decades of cloud engineering, I have developed a very specific way of using tags that sets both infra dev teams and SRE teams up for success, and I wanted to share it.

Who cares about tags? I do. They are the only persistent source of information you can trust (as much as you can trust anything in this fallen world) to communicate information about an infrastructure asset beyond what the cloud or virtualization fabric it’s running in knows. You may have a terraform state, you may have a database or etcd or something that knows what things are – but those systems can go down or get corrupted. Tags are the one thing that if someone can see the infrastructure – via console or CLI or API or integrated tool – that they can always see. Server names are notoriously unreliable – ideally in a modern infrastructure you don’t reuse servers from one task to another or put multiple workloads on one, but that’s a historical practice that pops up all to often, and server names have character limits (even if they don’t, the management systems around them usually enforce one).

Many powerful tools like Datadog work by exclusively relying on tags. It simplifies operation and prevents errors if, when you add a new production app server, that automatically gets pulled into the right monitoring dashboards and alerting schemes because it is tagged right.

I’ve run very large complex cloud environments using this scheme as the primary means to drive operations.

Top level tag rules:

Tag everything. Tagging’s not just for servers. Every cloud element that can take a tag, tag. Network, disk images, snapshots, lambdas, cloud services, weird little cloud widgets (“S3 VPC endpoint!”).
Use uniform tags. It’s best to specify “all lower case, no spaces” and so on. If people decide to word a tag slightly differently in two places, the value is lost. Both the key and the value, but especially the key – teach people that if you say “owner” that means “owner” not “Owner” and “owning party” and whatever else.
Don’t overtag with attributes you can easily see. Instance size, what AZ it’s in, and so on is already part of the cloud metadata so it’s inefficient to add tags for it.
Use standard tags. This is what I’ll cover in the rest of this article.

At the risk of oversimplifying, you need two things out of your systems environment – compliance and management. And tags are a great way to get it.

Compliance

Attribution! Cost! Security! You need to know where infrastructure came from, who owns it, who’s paying for it, and if it’s even supposed to be there in the first place.

Who owns it?

Tag all cloud assets with an owner (email address) basically whatever is required to uniquely identify who owns an asset. Should be a team email for persistent assets, if it’s a personal email then the assumption should be if that person leaves the company those assets get deleted (good for sandboxes etc).

The amount of highly paid engineer time I’ve seen wasted over the last decade of people having to go out and do cattle calls of “Hey who owns these… we need to turn some off for cost or patch them for security or explain them for compliance… No really, who owns these…” is shocking.

owner:myteam@mycompany.com

Who’s paying for it

This varies but it’s important. “Owner” might not be sufficient in an environment – often some kind of cost allocation code is required based on how your company does finances. Is it a centralized expense or does it get allocated to a client? Is it a production or development expense, those are often handled differently from a finance perspective. At scale you may need a several-parter – in my current consulting job there’s a contract number but also a specific cost code inside that contract number that we need all expenses divvied up between.

billing:CUCT30001

Where did it come from

Traceability both “up” and “down” the chain. When you go look at a random cloud instance, even if you know who it belongs to you can’t tell how it got there. Was it created by Terraform? If so where’s the state file? Was it created via some other automation system you have? Github? Rundeck? Custom python app #25?

Some tools like Cloudformation do this automatically. Otherwise, consider adding a source tag or set of tags with sufficient information to trace the live system back to the automation. Developers love tagging git commits and branches with versions and JIRA tickets and release dates and such, same concept applies here. Different things make sense depending on your tech stack – if you GitOps everything then the source might be a specific build, or you want to say which s3 bucket your tfstate is in… Here as an example, I’m working with a system that is terraform instantiated from a gitops pipeline so I’ve made a source tag that says github and then the repo name and then the action name. And for the tfstate I have it saved in an s3 bucket named “mystatebucket.”

source:github/myapp/deploy-action
sourcestate:s3/mystatebucket

When does it go

OK, I know the last two sound like the lyrics to “Cotton-Eyed Joe”, which is a bonus. But a major source of cost creep is infrastructure that was intended to be there for a short time – a demo, a dev cycle – that ends up just living forever. And sure, you can just send nag-o-grams to the owner list, but it’s better to tag systems with an expires tag in date format (ideally YYYY-MM-DD-HH-MM as God intended). “expires:never” is acceptable for production infrastructure, though I’ve even used it on autoscaling prod infrastructure to make sure systems get turned over and don’t live too long.

expires:2025-02-01-00-00-00
or
expires:never

Management

Operations! Incidents! Cost and security again! Keep the entire operational cycle, including “during bad production incidents”, in mind when designing tags. People tear down stacks/clusters, or go into the console and “kill servers”, and accidentally leave other infrastructure – you need to be able to identify and clean up orphaned assets. Hackers get your AWS key and spin up a huge volume of bitcoin miners. Identifying and actioning on infrastructure accurately and efficiently is the goal.

As in any healthy system, the “compliance” tags above aren’t just useful to the beancounters, they’re helpful to you as a cloud engineer or SRE. But beyond that, you want a taxonomy of your systems to use to manage them by grouping operations, monitoring, and so on.

This scheme may differ based on your system’s needs, but I’ve found a general formula that fits in most cases I come across. Again, it assumes virtual systems where servers have one purpose – that’s modern best practice. “Sharing is the devil.”

EARFI

I like to pronounce this “errr-feee.” It’s a hierarchy to group your systems.

environment – What environment does this represent to you, e.g. dev, test, production, as this is usually the primary element of concern to an operator. “environment:uat” vs “environment:prod”.
application – What application or system is this hosting? The online banking app? The reporting system? The security monitoring server? The mobile game backend? GenAI training? “application:banking”.
role – What function does this specific server perform? Webserver dbserver, appserver, kafka – systems in an identical role should have identical loadouts. “role:apiserver” vs “role:dbserver”. Keep in mind this is a hierarchy and you won’t have guaranteed uniqueness across it – for example, “application:banking,role:dbserver” may be quite different from “application:mobilegame,role:dbserver” so you would usually never refer to just “role:dbserver.”
flavor – Optional, but useful in case you need to differentiate something special in your org that is a primary lever of operation (Windows vs Linux? CPU vs GPU nodes in the same k8s cluster? v2 vs v2?). I usually find there’s only one of these (besides of course region and things you shouldn’t tag because they are in other metadata). For our apiserver example, consider that maybe we have the same code running on all our api servers but via load balancer we send REST queries to one set and SOAP queries to another set for caching and performance reasons. “flavor:rest” vs “flavor:soap”.
instance – A unique identifier among identical boxes in a specific EARF set, most commonly just an integer. “instance:2”. You could use a GUID if you really need it but that’s a pain to type for an operator.

This then allows you to target specific groups of your infrastructure, down to a single element or up to entire products.

“Run this week’s security patches on all the environment:uat, application:banking, role:apiserver, flavor:rest servers.” Once you verify, you can do the same on environment:prod.”
“The second of the three servers in that autoscaling group is locked up. Terminate environment:uat, application:banking, role:apiserver, flavor:rest, instance:2“
“We seem to be having memory problems on the apiservers. Is it one or all of the boxes? Check the average of environment:prod, application:banking, role:apiserver, flavor:rest and then also show it broken down by instance tag. It’s high on just some of the servers but not all? Try flavor:rest vs flavor:soap to see if it’s dependent on that functionality. Is it load do you think? Compare to the aggregate of environment:uat to see if it’s the same in an idle system.”
“Set up an alert for any environment:prod server that goes down. And one for any environment:prod, application:banking, role:apiserver that throws 500 errors.”
“Security demands we check all our DB servers for a new vulnerability. Try sending this curl payload to all role:dbservers, doesn’t matter what application. They say it won’t hurt anything but do it to environment:uat before environment:prod for safety.”

So now a random new operator gets an alert about a system outage and logs into the AWS console and sees not just “i-123456 started 2 days ago,” they see

owner:myteam@mycompany.com
billing:CUCT30001
source:github/myapp/deploy-action
sourcestate:s3/mystatebucket
expires:never
environment:prod
application:mobilegame
role:dbserver
flavor:read-only
instance:2

That operator now has a huge amount of information to contextualize their work, that at best they’d have to go look up in docs or systems and at worst they’d have to just start serially spamming. They know who owns it, what generates it, what it does and has hints at how important it is. (prod – probably important. A duplicate read secondary – could be worse.) And then runbooks can be very crisp about what to do in what situation by also using the tags. “If the server is environment:prod then you must initiate an incident <here>… If the server is a role:dbserver and a role:read-only it is OK to terminate it and bring up a new one but then you have to go run runbook <X> and run job <y> to set it up as a read secondary…”

Feel free and let me know how you use tags and what you can’t live without!

Leave a comment

Filed under Cloud, DevOps, Monitoring

Tagged as aws, azure, Cloud, DevOps, labels, Management, Operations, SRE, tagging, technology, terraform

by Ernest Mueller | May 6, 2023 · 12:34 pm

The Value Of Gratefulness

I have been in technology management for more than 20 years now and have worked in a wide variety of shops, and I think I’ve identified a key element that creates a good leader, and that is gratefulness.

Gratefulness Empowers Recognition

Everyone knows that “employee recognition” is important for morale; any company cites it as a priority whether they are really doing it or not. Sometimes it just gets forgotten – but sometimes there’s excuses given not to do it, concerns that it “sounds artifical” or that “they get thanks in form of their salary” or “people will be uncomfortable or jealous.” And some people honestly have a hard time doing it.

I’ve found that those that cultivate an actual spirit of gratefulness within them for other peoples’ work, especially for those who work for you and the sweat of their brow contributes to your success and growth, have an easier time of it.

This shouldn’t be a surprise. The classic Dale Carnegie book How To Win Friends and Influence People is often categorized as a “sales book.” It’s not, it’s way more profound than that and deserves a place in any leader’s library. In its introduction there’s explicitly a story of a man with 314 employees who did nothing but criticize them, then studied the book’s principles, and subsequently turned around his management strategy so he had 314 friends and not 314 enemies, leading to both increased happiness and increased profitability. And Part 2 of the book quickly gets to the “how” – it starts with “Become genuinely interested in other people” and ends with “Make the other person feel important – and do it sincerely.”

I don’t think it’s a shocking revelation that gratefulness leads to better recognition and therefor to better morale, but what I want to get across here is that even if you’re not good at that out of the gate, it can be learned.

And once you learn it, you get more help from other people.

I personally grew up as a very introverted person who was happy alone and on the computer, and not being very interested in others. But in my early career I quickly saw that was holding me back. I wanted to change it so I read How To Win Friends and Influence People and tried to put it into action. Awkwarly and self-consciously at first, of course.

Then something strange started happening to me. People I didn’t know would turn and talk to me in the elevator! I was, frankly, shocked. Generally in my life up to that point, in public I left people alone and they left me alone. I came to the realization that even my demeanor had changed and was more open somehow, and it was causing people I didn’t even know and wasn’t intending to interact with to feel like they could interact with me. And not to hassle me, but to help me.

Ungratefulness Leads To Bad Decisions

For many years I thought that gratefulness was just something that made you friendlier and made recognition easier and so was good in the long term. But then I worked at a startup where the CEO had a deep, fundamental lack of gratefulness, and I saw how that leads to critically bad decisonmaking.

Because people, and people’s work, have value – not in some hug-filled hippie sense, but in a very tangible sense. At the company in question the CEO came to me several times wanting to fire an engineer who had legit written 80% of the working product code in the shop “because he doesn’t think architect level.” He ousted a co-founder who was the only person who had actually brought in sales for the company. So years later it was a startup that had trouble even creating a shipping product and certainly wasn’t growing revenue, and had – seriously estimating – about 300% employee turnover in its lifetime. He sabotaged his own company because he couldn’t look at even objective value creation (working code! shipping product! sales revenue!) and value those who generate it at all.

That really made me stop and think. The stereotype of the ungrateful leader is one that only values hard objective results and “is mean” to people otherwise, but my experience has led me to the conclusion that’s a false dichotomy – if you are unable to see value you’re going to be unable to see it whether it’s in a person or in github or on a ledger book. Especially in a sector where that value is being created by the skilled workers!

Instead, you want to train yourself to see value so that you can gather more of it and help it grow! It’s not just being a kind leader because that’s “in” this decade, gratefulness is actually a strength you can develop that helps you make effective decisions.

Leave a comment

Filed under Management

Tagged as gratefulness, Management, recognition

by Ernest Mueller | October 18, 2019 · 1:53 pm

Ask A Tech Manager

We had a really interesting joint CloudAustin/Austin DevOps meeting this week – a tech manager panel! Ask them anything! We had 92 people attend and we had to herd them all out of the building with sticks at the end because everyone had so many questions that even at two hours it was still going strong.

My takeaways:

Understanding managers’ viewpoint and goals is critical to an individual engineer when looking to get hired or promoted or whatever.
Managers want you to be successful – you being successful vs you not being successful (and/or leaving or being let go) is a huge win for them in all ways.
Looking for a job?
- Your resume should be tuned to a job and either through its top section or a cover letter tell a story. Hiring managers get 100-200 applications/resumes per job. They don’t expect you to have everything on the job application – “50% is fine” was the consensus. But they’re doing a first cut before they talk to you, and a lot of people spam resumes, so if your resume doesn’t clearly say why you, add something that does.
- For a Cloud Engineer position, for example, there’s a big difference between a random UNIX admin resume and a random UNIX admin resume that says “I’m excited about cloud and working towards an AWS certification on the side and really want to find a new job I can do cloud in.” The first one is discarded without comment, the second one can move ahead if they’re willing for people to learn – and given the “50%” thing, they are generally willing for people to lean, if those people are willing to learn!
Interviewing for a job?
- Everyone hates every different interviewing tactic (see Hiring is Broken, and we have the Ultimate Fix) – individual interviews, panel interviews. whiteboard design, online coding, takehome projects, poring over your resume, checking your github, looking at your social meda/blog/whatever. But in the end hiring managers are just trying a handful of things to see if they can figure out if you know what you need to know to do the job.
- They can’t just take your word for it – their reputation is on the line when they bring people in, and you reflect on them. They want to understand if you’ll be successful. Recruiting, hiring, onboarding cost a lot of money so they can only take so much of a risk – they’re not real particular what the form “proof you can do the job” takes, they’re just fishing for something.
Performance issues?
- If you’re having some outside problem, talk to your manager. People we work with are a cross-section of just plain people, and we expect every medical, psychological, marital, criminal, etc. issue to show up at some point.
- Communicate. Again, it’s in their best interest you succeed. No one “wants to get rid of you.”
- Unreasonable demands? Communicate. Lots of technical staff work long hours or do the wrong things because they don’t “manage up” well and communicate. “Hey, with this change I have way too much to get done in 40-50 hours, this is what I think my priority list is, this is what will fall off, what can we do about it?” Bosses don’t know what you’re doing every minute of the day and can’t read your mind. I’ve personally worked with engineers burning themselves out while their same-team colleagues aren’t because they aren’t managing themselves – though they think it’s outside forces bullying them.
How to get ahead?
- Understand the options. What are career paths there? How do raises and promotions work? What are the cycles, pools, etc – you can only work the system if you understand the system. Managers are happy to explain.
- Communicate. No one knows if you want to move into management or not, or feel like you’re due a promotion or not, if you don’t talk about it with them over time. You have to take charge of your own development – companies want to help you develop but a manager has N reports and a lot of things to worry about, they aren’t going to drive it for you.
- Listen. What is needed to get to that Lead Engineer position? “Slingin’ code” and “being here for 3 years” are, I guarantee, not much of that list of requirements. Usually things like leadership, image, communication, and exposure have a role. Read “How To Win Friends And Influence People” and stuff if you need to. “I just grind code by myself all day” is fine but there’s a max level to which you will rise doing so.

Anyway, thanks to all the managers that participated and all the attendees that grilled them! I hope it helped people understand better how to guide their own careers.

Leave a comment

Filed under Management

Tagged as cloudaustin, hiring, interviews, Management, promotion, raises, resume

by Ernest Mueller | October 22, 2015 · 10:00 am

Agile Organization: Project Based Organization

This is the fourth in the series of deeper-dive articles that are part of Agile Organization Incorporating Various Disciplines.

The Project Based Organization Model

You all know this one. You pick all the resources needed to accomplish a project (phase of development on a product), have them do it, then reassign them!

Benefits Of The Project Based Model

The beancounters love it. You are assigning the minimum needed resource to something “only as long as it’s needed.”

Drawbacks Of The Project Based Model

Where to even begin?

First of all, teams don’t do well if not allowed to go through the Tuckman’s stages of development (forming/storming/norming/performing); engineer satisfaction plummets.
Long term ownership isn’t the team’s responsibility so there is a tendency to make decisions that have long term consequences – especially bearing on stability, performance, scalability – because it’s clear that will be someone else’s problem. Even when there is a “handoff” planned, it’s usually rushed as the project team tries to ‘get out of there’ from due date or expenditure pressures. More often there is a massive generation of “orphans” – services no one owns. This is immensely toxic – it’s a problem with shipping software, but with a live service it’s awful, as even if there’s some “NOC” type ops org somewhere that can get it running again if there’s an issue, chronic issues can’t be fixed and problems cause more and more load on service consumers, support, and NOC staff.
Mentoring, personnel development, etc. are hard and tend to just be informal (e.g. “take more classes from our LMS”).

Experience With The Project Based Model

At Bazaarvoice, we got to where we were getting close to doing this with continued reorganization to gerrymander just the right number of people onto the projects with need every month. Engineer satisfaction tanked to the degree that it became an internal management crisis we had to do all kinds of stuff to dig ourselves back out of.

Of course, many consulting relationships work this way. It’s the core behind many of the issues people have with outsourced development. There are a lot of mitigations for this, most of which are “try not to do it” – like I’ve worked with outsourcers trying to ensure lower churn on outsource teams, try to keep teams stable and working on the same thing longer.

It does have the merit of working if you just don’t care about long term viability. Developing free giveaway tools, for example – as long as they’re not so bad they reflect poorly on your company, they can be problematic and unowned in the long term.

Otherwise, this model is pretty terrible from a quality results perspective and it’s really only useful when there’s hard financial limitations in place and not a strong culture of responsibility otherwise. It’s not very friendly to agile concepts or devops, but I am including it here because it’s a prevalent model.

1 Comment

Filed under Agile, DevOps

Tagged as agile, DevOps, formation, Management, org, organization, structure, team

by Ernest Mueller | October 20, 2015 · 10:00 am

Agile Organization: Fully Integrated Service Teams

This is the third article in the series of deeper-dive articles that are part of Agile Organization Incorporating Various Disciplines.

The Fully Integrated Service Team Model

The next step along the continuum of decentralization is complete integration of the disciplines into one service team. You simply have an engineering manager, and devs, operations staff, QA engineers, etc. all report to them. It’s similar to the Embedded Crossfunctional Team model but you do away with the per-discipline reporting structure altogether.

Benefits Of Integrated Service Teams

This has the distinct benefit of end to end ownership. Engineers of every discipline have ownership for the overall product. It allows them to break out of their single-discipline shell, as well – if you are good at regression testing but also can code, or are a developer but strong in operations, great! There’s no fence saying whose job is whose, you all pull tasks off the same backlog. In general you get the same benefits as the Crossfunctional Team model.

Drawbacks of Integrated Service Teams

This is theoretical nirvana, but has a number of challenges.

First, a given team manager may not have the knowledge or experience in each of those areas. While you don’t need deep expertise in every area to manage a team, it can be easy to not understand how to evaluate or develop people from another discipline. I have seen dev managers, having been handed ops engineers, fail to understand what they really do or they value, and lose them as a result.

Even more dangerous is when that happens and the manager figures they didn’t need that discipline in the first place and just backfills with what they are comfortable with. For a team to really own a service from initiation to maintenance, the rest of the team has to understand what is involved. It’s very easy to slip back into the old habits of considering different teams first class vs second class vs third class citizens, just making classes of engineer within your team. And obviously, disenfranchising people works directly against energizing them and giving them ownership and responsibility.

Mitigations for that include:

Time – over time, a team learns the basics of the other branches and what is required of them.
Discipline “user groups” (aka “guilds”) – having a venue for people from a horizontal discipline to meet and share best practices and support each other. When we did this with our ops team we always intended to set up a “DevOps user group” but between turnover and competing priorities, it never happened – which reduced the level of success.

A second issue is scaling. Moving from “zone” to “man” coverage, as this demands, is more resource intensive. If you have nine product teams but five operations engineers, then it seems like either you can’t do this or you can but have to “share” between several teams. Such sharing works but directly degrades the benefits of ownership and impedance matching that you intend to gain from this scheme. In fact, if you want to take the prudent step of having more than one person on a team know how to do something – which you probably should – then you’d need 18 and not just nine ops engineers.

Mitigations for this include:

Do the math again. If the lack of close integration with that discipline is holding back your rate of progress, then you’re losing profits to reduce expenditures – a bad bet for all but the most late-stage companies.
Crosstraining. You may have one ops, or QA, or security expert, but that doesn’t (and, to be opinionated, shouldn’t) mean that they are the only ones who know how to perform that function. When doing this I always used the rule “if you know how to do it, you’re one of the people that should pull that task – and you should learn how to do it.” This can be as simple as when someone wants the QA or ops or whatever engineer to do something, to instead walk the requestor through how to do it.

Experience with Integrated Service Teams

Our SaaS team at NI was fully integrated. That worked great, with experienced and motivated people in a single team, and multiple representatives of each discipline to help reinforce each other and keep developing.

We also fully integrated DevOps into the engineering teams at Bazaarvoice. That didn’t work as well, we saw attrition from those ops engineers from the drawbacks I went over above (managers not knowing what to do with/how to recruit, retain, develop ops engineers). In retrospect we should not have done it and should have stayed with an embedded crossfunctional team in that environment – the QA team did so and while collaboration on the team was slightly impeded they didn’t see the losses the ops side did.

Leave a comment

Filed under Agile, DevOps

Tagged as agile, DevOps, formation, Management, org, organization, structure, team

by Ernest Mueller | October 15, 2015 · 10:00 am

Agile Organization: Embedded Crossfunctional Service Teams

This is the second in the series of deeper-dive articles that are part of Agile Organization Incorporating Various Disciplines. Previously I wrote about the “traditional” siloed model as Agile Organization: Separate Teams By Discipline.

Basic agile doctrine, combined with ITSM thinking, strongly promulgate a model where a team owns a business service or product. I’ll call this a “service team” for the sake of argument, but it applies to products as well.

The Embedded Crossfunctional Team Model

Simply enough, a service team has specialists from other groups – product, QA, Ops, whatever – assigned to it on an exclusive basis. They go sit with (where possible) and participate with the group in their process. The team then contains all the resources required to bring their service or product to completion (in production, for services). The mix of required skills may vary – in this diagram, product #4 is probably a pure back end service and product #3 is probably a very JavaScriptey UI product, for example.

Benefits of Embedded Crossfunctional Teams

This model has a lot of benefits.

Since team members are semi-permanently assigned into the team, they don’t have conflicting work and priorities. They learn to understand what the needs are of that specific product and collaborate much more effectively with all the others working on it.
This leads to removal of many bottlenecks and delays as the team, if it has all the components it needs to deliver its service, can organically assign work in an optimal way.
It is amazing how much faster this model is in practice than the separate teams by discipline model in terms of time to delivery.
Production uptime and performance are improved because the team is “eating its own dog food” and is aware of production issues directly, not as some ticket from some other team that gets routed, ignored, finger-pointed…

Drawbacks of Embedded Crossfunctional Teams

Multiple masters is always an issue for an engineer. Are you trying to please the service team’s manager or your “real” manager, especially when their values seem to conflict? You have to do some political problem-solving then, and engineers hate doing that. It also provides some temptation to double-resource or otherwise have the embedded engineer “do something else” the ops team needs done, violating the single service focus.
It’s more expensive. Yes, if you do this, you need at least one specialist per service team. You have to play man-to-man, not zone, to get the benefits of the approach. This should make you think about how you create your service teams; if you define a service as one specific microservice and you have teams with e.g. 2 devs on them, then embedding specialists is way more expensive. Consider basing things on the business concept of a service and having those devs working on more than one widget, targeting the 2-pizza team size (also those devs will be supporting/sustaining whichever services aren’t brand new). (Note that this is only “more expensive” in the sense that doesn’t bother with ROI and just takes raw costs as king – you get stuff out and start making revenue faster, so it’s really less expensive from a whole-company POV, but from the IT beancounter perspective “what’s profit?”)
While you get benefits of crosstraining across the disciplines, the e.g. ops folks don’t have regular all-day contact with other ops folks and so you need to take care to set up opportunities for the “ops team” to get together, share info, mentor each other, etc. as well. Folks call these “tribes” or “guilds” or similar.

Experience with Crossfunctional Teams

It can be hard at the beginning, when teams don’t understand each others’ discipline or even language yet. I had a lengthy discussion with an application architect on one team – he felt that having ops people in the design reviews he was holding was confusing and derailing. The ops people spoke some weird moon-man language to him and it made the reviews go longer and require a lot more explanation. I said “Yes, they do. But we have two choices – keep doing it together, have people learn each others’ concerns and language, and start a virtuous cycle of collaboration, or split them apart and propagate the vicious cycle we know all too well where we have difficulty working together.” So we powered through that and stayed together, and it all worked out well in the end.

When we piloted this at Bazaarvoice, one of the first ops to embed got in there, worked with the devs, and put all his work into their JIRA project. The devs got sticker shock very quickly once they saw how much work there was in delivering a reliable service – and when they dug into those tickets, they realized that they weren’t BS or busywork, but that when they thought about it they said “yeah… We certainly do need that, all this is for real!” The devs then started pulling tickets on monitoring, backups, provisioning, etc. because they realized all that workload on one person would put their delivery date behind. It was nice to see devs realize all the work that really went into doing ops – “out of sight, out of mind,” and too often devs assume ops don’t do anything except move their files to production on occasion. The embedding allowed them to rally to control their own delivery date instead of just “be blocked on the ops team.”

No one approach is “best,” but in general my experiences so far lead me to consider this one of the better models to use if you can get the organizational buy-in to the fundamental “you built it, you run it” concept.

Leave a comment

Filed under Agile, DevOps

Tagged as agile, DevOps, formation, Management, org, organization, structure, team

by Ernest Mueller | September 30, 2014 · 4:16 pm

Scrum for Operations: Just Add DevOps

OK, so it’s been a while since the last installment in this series. I had talked about how we’d brought Scrum to our operations team, which worked fine, and then added in the developers as well, which didn’t. Our first attempt at dividing the whole product up into four integrated DevOps service teams collapsed quickly. Here’s how we got past that to do it again and succeed, with a fully integrated DevOps setup managing both proactive/project/feature work and reactive/support/tactical work in an effective way.

The first challenge we had was just that I couldn’t manage four new scrum teams totally on my own. We had gotten the ops team working well on Scrum but the development team hadn’t been. We didn’t have any scrum masters other than me, and were low on people designated as tech leads as well. So step one, we just mushed everyone into a 30+-person Scrum team, which sucked but was the least of the evils, and I immediately worked on mentoring people up as both Scrum masters and tech leads. I basically led by example and asked for volunteers for who was interested in Scrum mastering and/or being a tech lead for each of the planned sub-teams.

This was interesting – people self-selected in ways I would not have predicted. Some of the willing were employees and others were contractors – fine, as far as I’m concerned. “From each according to his ability, to each according to his need.” I then maximally had them start to lead standups and take ownership over sprints and such, coaching them in the background. In that way, it only took a little while to re-form up into those four teams. This gave some organic burn-in time as well, so that the ops folks got more used to not being “the ops team” and leaning on people that were supposed to be on other sprint teams. As I empowered the new leaders this became self-correcting (“Stop working with that other guy on their stuff, we have things that need doing for our service!”).

The second and largest problem was managing the work.

Part 1: “Planned” Work

Product Management

We had practically zero product manager support at the beginning, so it was up to us to try to balance planned work from many sectors – feature requests from PM, tooling requests from Customer Service and Technical Support, security stuff from Security, random requests from Legal and Compliance, brainwaves from other products’ engineering managers. It quickly began to take a full person worth of time to handle, because if not managed proactively it turned into attending meetings with people complaining as a reactive result. I went to the PM team and said “come on man, give us more PM support,” and once we got one, I worked with her on being able to manage the whole overall package.

One of the chronic product manager problems in a SaaS environment is the “not my domain” syndrome. PMs want to talk about features, but when it comes to balancing stakeholders like Security and internal operational projects, they wash their hands of it. At best you get a “Well… You guys get 50% of your time to figure out what to do with that stuff and then 50% of your bandwidth will be my features, OK?”

For the record, that’s not OK. As a product manager, you need to be able to balance all the work against all the work. Maybe you don’t have an ops background, that’s fine – you probably didn’t have a <business domain here> background when you came to work either. Learn. A lot of the success of a SaaS product is in the balancing of features against stability/scalability work against compliance work… If you want to take the “I’m the CEO of the product” role, then you need to step up and own all of it, otherwise you’re just that product’s Director of Wishful Thinking.

Anyway, luckily our PM, even though she didn’t have experience doing that, was willing to learn and take it on. So we could reason about spending a month putting in a new cluster versus adding a new display feature – they all went into the same roadmap.

Work Management in JIRA

We managed that in the standard, simple way of a quarterly spreadsheet roadmap correlating to epics in an Agile rapid board in JIRA; we’d do stakeholder meetings and stack rank them and then move them into the team backlogs and flesh them out when they were ready for execution. (It was important to keep the clutter out of the backlogs till then – even if it’s supposed to be future stuff, the more items sitting in a team’s backlog, the worse their focus is.)

We kept each service as one JIRA project but combined them into rapid boards as necessary – a given team might own a couple (like the “Workbench and CMS Team” had those two and a smaller tooling one). This way when we transferred a piece of tech around we could just move the JIRA project as well, and incorporate it into the target team’s rapid board.

Portfolio Management

Some people say that the ideal world is each team owning one microservice. I don’t agree with this – we had a number of teams under other parts of the org that were only like 2 people because they owned some little service. This was difficult to sustain and transition; when things like Black Friday came up that required 24×7 support from each team for a week it was brutal on them, and even worse once development eased up on a service it just got orphaned.

If you don’t keep a service portfolio and tightly manage who is tasked with supporting each one, you are opening yourself up to a world of hurt. And that’s where we started. Departmentwide, we’d have teams work on something and then wander off and if there was a problem they’d call on the dev who was somewhere working on some other deadline now. This worked terribly. I got the brunt of this earlier when I had joined the company as a release manager and asked “So, those core services, who owns those?” “Well… No one. I mean everyone!”

So for my teams, we put together a tracking list, used both as a service registry but also for cross-training. It was a simple three-level outline, with major services, service components, and low level elements. We had the whole team work together on filling out the list – since we were managing a 7-year-old legacy system we ended up with a list of 275 or so leaf items. Every one had to have an owning team (team, not individual), and unless you retired a service, there was no dropping it on the floor. You owned it, you retired or transitioned it, and I didn’t care if “the dev that worked on that moved to some other team.” Everything was included, from end user facing services like “the user portal” to internal services like our CI servers.

Team Management

This transitions into how we managed the teams. Teams were standard “two-pizza size” – a team of 5-7 people is optimal and we would combine services up till there was enough work for a team of that size. This avoided the poor coverage of mini-teams for micro-services.

Knowledge Management

Then we also used the service registry as the “merit badge system.” We had a simple qualification procedure – to get qualified in an element, you had to roll out code on it and you could sign off for yourself to say that you were qualified on one of those leaf elements. To get your “merit badge” in a service component, you needed an existing subject matter expert (SME) to sign off that you knew what you were doing, and you needed to understand how the service was written, deployed, tested, monitored. To become a SME in a service, the existing SMEs would have a board of review. SMEs were then understood to be qualified to make architectural changes on their own initiative, those with merit badges were understood to be qualified to make code changes with nothing more than the usual code review and CI testing.

This was very important for us because we were starting in a place where people had been allowed to specialize too much. Only one person was the SME on any given service and if the person who didn’t understand the Workbench was out, or quit, or whatever, suddenly no one knew a whole lot about it. That’s a terrible velocity burden even without attrition and it’s a clear and present danger to your service’s viability when there is. We started tracking the “merit badges” and had engineers set goals around how many they’d earn (or how many they’d review, for the more experienced). We used a lot of contract programmers and I told the contractor manager that I wanted to use that to rate the expertise of people on our account and that I wanted to see the numbers rise on each person over time.

Part 2 – “Unplanned” Work

Our team was only doing planned work 40% of the time, however. Since we were integrated DevOps teams working on a service with thousands of paying end customers, and that service was custom-integrated with customer Web sites and import/export of customer data, there was a continuous load of work coming in from Customer Support and from our own operations (alerts, etc.). All this “flow” work was interrupt-driven, and urgent to one degree or another.

The usual techniques for handling this have a lot of problems. “Let’s make a separate sustaining team!” Well, this turns into a set of second class developers. Thus you get worse devs there on that team. And those devs are more utilized one week, less the next; when things are quiet they get seconded to other efforts by people that haven’t read the Principles of Product Development Flow and think having everyone highly utilized is valuable, and then the load ramps up and something gives… Plans emerge to make it a training ground for new devs, till you realize that even new devs don’t want to put up with that and just quit… I’ve seen this happen in several places. I am a firm believer in dog fooding – if you are writing a service, you need to handle its problems. If there are a lot of problems, fix it!!! If you are writing the next version of the service – well, doing that in isolation from the previous service dooms you to making the same errors, adding in some new ones as well. No sustaining teams, only evergreen teams.

So we had the rule that each integrated team of devs and ops handled their own dirty laundry. And at 60% of the workload, that was a lot of laundry. How did we do it? Everyone was worried about how we could deliver decent feature velocity while overwhelmed by flow work. Therefore…

Triage

In addition to the teams’ daily standups, we had a daily triage meeting where the engineering manager, team leads, PM, and representatives from Support, Sales, and whoever had a crisis item that they felt needed to be handled on an expedited basis (not waiting on the next sprint, in other words) would come to. Each new intake would be reviewed. In the beginning we had to kick a lot of requests back for insufficient detail, but that corrected itself fast. We’d all agree on priority – “that’s affecting one customer but they’re one of our top customers, we’ll work it as P2” or the like.

For customer reported issues, we had SLAs we committed to in contracts (1 day for P1, etc). Ops issues could be super urgent – “P1, one of our services is down!” – or doable later on. So what we did was to create a separate Kanban board in JIRA for all these kinds of issues. Anything that really could wait or was large/long term would get migrated into the Scrum backlogs, but anything where “we need someone to do this stat” went in here. It served the same purpose as an “expedite lane,” it was just a jumbo four-lane highway of an expedite lane because so much of the work was interrupt driven.

But does this mean “give up on Scrum?” No. Without the sprint cadence it was hard to hit a release cadence, and also it was easy for engineers to just get lost in the soup and stop delivering at a good rate, frankly. So we still ran sprints, and here was the rules of engagement we provided the engineers.

Need More Work?

Pull things for your team/service off the triage queue
If there’s nothing in the triage queue, pull the next item from the sprint backlog
If there’s nothing in the sprint backlog or the triage queue, pull something off the top of the backlog. Or relax, either one.

Then for standups, we had a master Agile board that contained everything – all projects, the triage board, everything. So when you looked at a given engineer’s swimlane, you could see the sprint work they had, the flow work they had, and anything they were working on from someone else’s project (“Hey, why are you doing that?”). Again, via JIRA agile board composition that’s easy to do. Sometimes teams would try to do standups just looking at swimlanes containing “their” projects and it always ended up with things falling in the gap, so each time that happened I reiterated the value of seeing everything that person is assigned, not just what you think/hope they are assigned, since they are exclusively attached to that one team.

At first, everyone fretted about the conflict between the flow work and the sprint work. “What if there’s a lot of support work today?!?” But as we went on sprint by sprint, it became evident that over the course of a two week sprint, the amount of support and operations work evened out. Sprint velocity was regular despite the team working flow work as well. Having full-sized sprint teams and two-week iterations meant that even if one day there was a big production issue and whoever grabbed it was slammed – over the rest of the time and people it evened out. This shouldn’t be too surprising, it’s called “flow” for a reason – there are small ebbs and surges but in general the amount over time at scale is the same. Was it as perfectly “efficient” as having some people working flow and others working sprints? No, there is definitely a % overhead that incurs. But maximum utilization of each person’s time, as I mentioned before, is a golden calf. Lean principles show us that we want the best overall outcome – and this worked.

Flow work was addressed quickly, and our customer ticket SLA attainment percentage whipped up from a starting level of less than 50% over several quarters to sit at 100%, meaning every single support ticket that came in was addressed within its advertised SLA. Once that number hit 100% the support ticket time to live started to fall as well.

At the same time, sprint velocity for each of the four sprint teams went up over time – that is, just the story points they were delivering out of the feature backlog improved, in addition to the improvements in flow work. We’d modify the process based on engineer feedback and try alterations for a sprint to see how they panned out, but in general by keeping the overall frame in place long enough that people got used to it, they became more and more productive. Flow work and planned work both improved at the same time, not at each others’ expense.

The Dynamic Duo

This scheme had two issues. One was that engineers were sometimes confused about what they should be working on, flow tickets with SLAs or their sprint tasks. Our Scrum masters were engineers on the teams too, they weren’t full time PMs that could afford to be manually managing everyone’s work. The second was that operational issues that came in and required sub-day response time couldn’t wait for triage and ended up with either frantic searches for anyone who could help (which often became “everyone sitting in the team area”) or missed items.

I have always been inspired by Tom Limoncelli’s Time Management for System Administrators. He advocates an “interrupt shield” where someone is designated as the person to handle walkups and crises. At NI I had instituted this in process, at BV in the previous Ops team there had been a “The Dude” role (complete with Jeff Bridges bobblehead) that had been that. Thus the Dynamic Duo were born.

The teams each had one dev and one DevOps on call at a given time; we managed the schedule in PagerDuty. Whoever was on call that week became “on the Dynamic Duo” during the days as well. When we went into a sprint, they would not pull sprint tasks and would be dedicated to operational and urgent support issues. It was the Dynamic Duo because we needed someone with ops and someone with dev expertise on this – one without the other meant problems were not effectively solved. I even made a cool wiki page with Batman and Robin stuff all over it and we got Bat-phones painted red. I evangelized this inside the company. “Don’t walk up and grab a dev with your question. Don’t chat your favorite engineer to get them to work on something. Come turn on the Bat-signal, or call the Bat-phone, and the Dynamic Duo will come help you.”

This was good because it blunted the sharp tip of very urgent requests. The remaining flow work (2 people couldn’t handle the 60% of the load that was interrupt driven) was easier for the sprinting devs to pull in as they finished sprint tasks without worrying about timeliness – the real crises were handled by the Dynamic Duo. The Duo also ran the triage meeting and even if they weren’t working on all the triage work, they bird dogged it as a kind of scrum/kanban/project manager over the flow work. In the rare case there wasn’t anything for them to jump on, they could slack – time to read, learn, etc. both as compensation for the oncall and adrenaline rushes that week but also because it’s hard to fit time for that into sprints… And as we know from the Principles of Product Development Flow, running teams at too high of a utilization is actually counterproductive.

Conclusion

That’s the short form of how we did it – I wanted to do a lot more but I realized that since it’s been a year since I intended to write this, I’d better shake and bake and get it out and then I’m happy to follow up with stuff any of you are curious about!

This got us to a pretty happy place. As time went on we tweaked the sprint process, the triage process, and the oncall/Duo process but for a set of teams of our size with our kind of workload it was close to an optimal solution. With largely the same team on the same product, the results of these process changes were:

Flow work improved as measured by customer ticket SLA attainment and other metrics
Sprint work velocity improved as measured by JIRA reports
Engineering satisfaction improved as measured by internal NPS surveys

Improvement of all these factors was not slight, but was instead 50% or more in all cases.

Feel free and ask me about parts of this you find interesting and I’ll try to expand on them. It wasn’t as simple as “add Agile” or “add DevOps,” it definitely took some custom wrangling to balance our specific SaaS service’s needs in the best manner.

3 Comments

Filed under Agile, DevOps

Tagged as agile, development, DevOps, Management, Operations, scrum, scrum4ops, system administration, systems

by Ernest Mueller | August 29, 2014 · 4:02 pm

The Cloud Procurement Pecking Order

I was planning to go to this meeting here in town about “Preparing for the post-IaaS phase of cloud adoption” and it brought home to me how backwards many organizations are when they start thinking about cloud options. So now you get Ernest’s Cloud Procurement Pecking Order.

What many people are doing is moving in order of comfort, basically, as they start moving from old school on prem into the cloud. “I’ll start with private cloud… Then maybe public IaaS… Eventually we’ll look at that other whizbang stuff.” But here’s what your decision path should be instead. It’s the logical extension of the basic buy vs build strategy decision you’re used to doing.

Cloud Procurement Flowchart

Look at the functionality you are trying to fulfull. Now ask in order:

Is it available as a SaaS solution? If so, use that. You shouldn’t need to host servers or write code for many of your needs – everything from email to ERP is commoditized nowadays. This is the modern equivalent of “buy, don’t build.” You don’t get 100% control over the functionality if you buy it, but unless the function is super core to your business you should simply get over that.
[Optional] Does it fit the functional profile to do it serverless? Serverless is basically “second gen PaaS with less fiddly IaaS in it” so this would be your second step. Amazon has Lambda and Azure and Google have shipped competitors already. Right this moment serverless tech is still pretty bleeding edge, so you’d be forgiven for skipping this step if you don’t have pretty high caliber techies on staff.
Can I do it in a public PaaS? Then use a public PaaS (Heroku/Beanstalk/Google App Engine/Azure), unless you have some real (not FUD) requirements to do it in house.
Can I do it in a private PaaS? Then use Cloudfoundry or similar. Or do you really (for non-FUD reasons) need access to the hardware?
Can I do it in public IaaS? Then use Amazon, or Azure. Or do you really (for non-FUD reasons) need it “on premise” (probably not really on premise, but in some datacenter you’re leasing – which is different from being outsourced in the cloud why)? Even hardcore hardware render is done in the cloud nowadays (you can get GPU driven instances, SSDs, etc.)
Can I do it in a private cloud? Use VMWare Cloud or Openstack. This is your final recourse before doing it the old fashioned way – unless you have extremely unique hardware requirements, you probably can. Also, you can do hybrid cloud – basically private cloud plus public cloud (IaaS only really). This gets you some of the IaaS benefits while complicating your architecture.

What About Compliance?

Very few compliance requirements exist that cannot be satisfied in the cloud. There are large financials operating in the cloud, people with SOX and PCI and FISMA and NIST and ISO compliance needs… If your reason for running on prem is “but compliance” there’s a 90% chance you are just plain wrong, and coasting on decade-old received wisdom instead of being well informed about the modern state of cloud technology and security and compliance. I’ve personally helped pure-cloud solutions hit ISO and TUV and various other compliance goals.

What About The Cost?

This ordering seems to be inverted from how people are inching into the cloud. But the lower on this list you are, the less additional value you are getting from the solution (assuming the same price point). You should instead be reluctantly dragged into the lower levels on this list – which require more effort and often (though not always) more expense. A higher level needs to be a lot more expensive to justify the additional complexity and lag of doing more of the work yourself.

“But what about the cost,” you say, “the cloud gets more expensive than me running a couple servers?” It’s easy to be penny wise but pound foolish when making cloud cost decisions.

You need to keep in mind the real costs of your infrastructure when you do this – I see a lot of people spending a lot of work on private cloud that they really shouldn’t be. If you simply compare “buying servers” with “cost per month in Amazon” it can seem, using a naive analysis, like you need to go hybrid on prem after a couple hundred thousand dollars appear on your bill. But:

1. Make sure you are taking into account your fully loaded cost (includes data center, power cooling, etc.) of all assets (servers, storage, network…) you are using to do this private. Use the real numbers, not the “funny money” numbers – at a previous company we allocated network and other shared costs across the entire company, while “our IT budget” had to pay for servers, so that was the only number used in a comparison since it was our own department’s costs only that were considered – don’t be a goon (technical term for a local optimizer), you should consider what it’s costing your entire company. Storage especially is way cheaper in the cloud versus enterprise SANs.

2. Make sure you are taking into account the cost of the manpower to run it. And that’s not just the techies’ salary (fully loaded with benefits/bonuses), and the proportion of each layer of management going up that has to deal with their concerns (Even if the director only has to spend 30% of his time messing with the data center team, and the VP 10%, and the CTO 5%, and the CEO 1% – that’s a lot of freaking money you need to account for). It’s also the opportunity cost of having people (smart technical people) doing your plumbing instead of doing things to forward your company. I would argue that instead of putting in the employee’s salary in this calculation, you’d do better to put in your revenue per employee! Why? Because for that same money you could have someone improving product, making sales, etc. and making you additional revenue. If all you are looking at is “cost reduction” you are probably divorced enough from the business goals of your organization that you are not making good decisions. This isn’t to say you don’t need any of that manpower, but ideally with more plumbing being outsourced you can turn their technical skills to something of more productive use.

3. Make sure you are taking into account the additional lag time and the cost of that time to market delay from DIYing. Some people couch this as just for purposes of innovation – “well, if you’re a small, quick moving, innovative firm or startup, then this velocity matters to you – if you’re a larger enterprise, with yearly budget cycles, not so much.” That’s not true. Assuming you are implementing all this stuff with some end goal in mind, you are burning value along with time the longer it takes you to deliver it – we like to call that cost of delay. Heck, just plain cost of money over that period is significant – I’ve seen companies go through quite a set of gyrations to be able to bill 30 days earlier to get that additional benefit; if you can deliver projects a month earlier from leveraging reusable work (which is all that SaaS/PaaS/IaaS solutions are) then you accelerate your cashflow. If you have to wait 12 months for the IT group to get a private cloud working, you are effectively losing the benefit of your deliverable * 12 months. “We saved $10k/year on hosting costs!” “Great, can we deliver our product that will make us $10k/month now, or do we get to continue to put ourselves out of business with cost cutting?”

4. Account for complexity. The problem with “hybrid cloud,” in most implementations, is that it’s not seamless from on prem to public, and therefore your app architecture has to be doubly complicated. In a previous position where I ran a large SaaS service, we were spread across AWS (virtual everything) and Rackspace (vserver, F5 LBs, etc.) and it was a total nightmare – we were trying to migrate all the way out to the cloud just so we could delete half of the cruft in all our code that touched the infrastructure – complexity that caused production issues (frequently) and slowed our rate of delivering new functionality. The KISS principle is wrathful when ignored.

I’m not saying hybrid cloud, private cloud, etc. are never the answer – but I would say that on average they are usually not the right answer, and if you are using them as your default approach then it’s better than even money you’re being inefficient. Furthermore, using SaaS and PaaS requires less expertise (and thus money) than IaaS which uses less than private cloud – people justify “starting with private” because you are “leveraging skill sets” or whatever – and then 6 months later you have a whole team still trying to bake off OpenStack vs Eucalyptus when you could have had your app (you know, the thing you actually need to fulfill a business goal) already running in a public PaaS. I’m not sure why I need to say out loud “delivering the most amount of value with the least amount of effort, time, and expenditure is good” – but apparently I do. Just because you *can* do something does not mean you *should* do it. You need to carefully shepherd your time to delivery and your costs, and not just let things float in a morass of IT because “these things take time…”

5 Comments

Filed under Cloud

Tagged as Cloud, cost, decision, IaaS, infrastructure, IT, Management, paas, SaaS, value

by Ernest Mueller | April 21, 2014 · 2:58 pm

Agile Organization: Separate Teams By Discipline

This is the first in the series of deeper-dive articles that are part of Agile Organization Incorporating Various Disciplines. It’s very easy to keep reorganizing and trying different models without actually learning from the process. I’ve worked with all of these so am trying to condense the pros and cons to help people understand the implications of the type of organizational model they choose.

The Separate Team By Discipline Model

Separate teams striated by discipline is the traditional method of organizing technical teams – segmented horizontally by technical skill. You have one or more development teams, one or more operations teams, one or more QA teams. In larger shops you have even more horizontal subdivisions – like in an enterprise IT shop, under the banner of Infrastructure you might have a data center team, a UNIX admin team, a SAN team, a Windows admin team, a networking team, a DBA team, a telecom team, applications administration team(s), and so on. It’s more unusual to have the dev side specifically segmented horizontally by tech as well (“Java programmers,” “COBOL programmers,” “Javascript programmers”) but not unheard of; it is more commonly seen as “UX team, services team, backend team…”

In this setup in its purest form, each team takes on tasks for a given product or project inside their team, works on them, and either returns them to the requester or passes them through to yet another team. Team members are not dedicated to that product or effort in any way except inasmuch as the hours they spend working on the to-do(s). Usually this is manifested as a waterfall approach, as a product or feature is conceived, handed to developers to develop, handed to QA to test, and finally handed to Operations to deploy and maintain.

This model dates back to the mainframe days, where it works pretty well – you’re not innovating on the infrastructure side, the building’s been built, you’re moving your apps into the pre-built apartment units. It also works OK when you have heavy regulation requirements or are constrained to extensively documenting requirements, then design, etc. (government contracts, for example).

It works a lot less well when you need to move quickly or need any kind of change or innovation from the other teams to achieve your goal. Linking up prioritization across teams is always hard but that’s the least of the issues. Teams all have their own goals, their own cadences, and even their own cultures and languages. The oft-repeated warning that “devs are motivated to make changes and ops is motivated by system stability” is a trivial example of this mismatch of goals. If the shared teams are supporting a limited number of products it can work. When there are competing priorities, I’ve seen it be extremely painful. I worked in a shop where the multiple separate dev teams were vertical (line of business organized) but the operations teams were horizontal (technical specialty organized) – and frankly, the results of trying to produce results with the impedance mismatch generated by that setup was the nightmare that sent me down the Agile and DevOps path initially.

Benefits of Disciplinary Teams

The primary benefit of this approach is that you tend to get stable teams of like individuals, which allows them to bond with those of similar skills and experience organizational stability and esprit de corps. Providing this sense of comfort ends up being the key challenge of the other organizational approaches.

The second benefit is that it provides a good degree of standardization across products – if one ops team is creating the infrastructure for various applications, then there will be some efficiencies there. This is almost always at least partially, and sometimes more than entirely, counteracted in value by the fact that not all apps need the same thing and that centralized teams bottleneck delivery and reduce velocity. I remember the UNIX team that would only provide my Web team expensive servers, even though we were highly horizontally scaled and told them we’d rather have twice as many $3500 servers instead of half as many $7000 servers as it would serve uptime, performance, etc. much better. But progress on our product was offered up upon the altar of nominal cost savings from homogeneity.

The third benefit is that if the horizontal teams are correctly cross-trained, it is easier avoid having single points of failure; by collecting the workers skilled in something into one group, losses are more easily picked up by others in the group. I have to say though, that this benefit is often more honored in the breach in my experience – teams tend to naturally divide up until there’s one expert on each thing and managers who actively maintain a portfolio and drive crosstraining are sadly rare.

Drawbacks of Disciplinary Teams

Conway’s Law is usually invoked to worry about vertical divisions in a product – one part of the UI written by one team, another by another, such that it looks like a Frankenstein’s monster of a product to the end user. However, the principle applies to horizontal divisions as well – these produce more of a Human Centipede, with the issue of one phase becoming the input of the next. The front end may not show any clear sign of division, but the seams in quality, reliability, and agility of the system can grow like a cancer underneath, which users certainly discover over time.

This approach promotes a host of bad behaviors. Pushing work to other people is always tempting, as is taking shortcuts if the results of those shortcuts fall on another’s shoulders. With no end to end ownership of a product, you get finger pointing and no one taking responsibility for driving the excellence of the service – and without an overall system thinking perspective, attempts at one of the teams in that value chain to drive an improvement in their domain often has unintended effects on the other teams in that chain that may or may not result in overall improvement. If engineers don’t eat their own dog food, but pass it on to someone else, then chronic quality problems often result. I personally spent years trying to build process and/or relationships to try to mitigate the dev->QA->ops passing of issues downstream with only mixed success.

Another way of stating this is that shared services teams always provide a route to the tragedy of the commons. Competing demands from multiple customers and the need for “nonfunctional” requirements (performance, availability, etc.) could potentially be all reconciled in priority via a strong product organization – but in my experience this is uncommon; product orgs tend to not care about prioritization of back end concerns and are more feature driven. Most product orgs I have dealt with have been more or less resistant to taking on platform teams, managing nonfunctional requirements, and otherwise interacting with that half of demands on the product. Without consistent prioritization, shared teams then become the focus of a lot of lobbying by all their internal customers trying to get resource. These teams are frequently understaffed and thus a bottleneck to overall velocity.

Ironically, in some cases this can be beneficial – technically, focusing on cost efficiency over delivering new value is always a losing game, but some organizations are self-unaware enough that they have teams continuing to churn out “stuff” without real ROI associated (our team exists therefore we must make more), in which case a bottleneck is actually helpful.

Mitigations for the weaknesses of this approach (abdication of responsibility and bottlenecking constraints) include:

Very strong process guidance. “If only every process interface is 100% defined, then this will work”, the theory goes, just as it works on a manufacturing line. Most software creation, however, is not similar to piecing components together to make an iPod. In one shop we worked for years on making a system development process that was up to this task, but it was an elusive goal. This is how, for example, Microsoft makes the various Office products look the same in partial defiance of Conway’s Law – books and books of standards.
Individuals on shared teams with functional team affinities. Though not going as far as embedding into a product team, you can have people in the shared teams who are the designated reps for various client teams. Again, this works better when there is a few-to-one instead of a many-to-one relationship. I had an ops team try this, but it was a many-to-one environment and each individual engineer ended up with three different ownership areas, which was overwhelming. In addition, you have to be careful not to simply dedicate all of one sort of work to one person, as you then create many single points of failure.
Org variation: Add additional crossfunctional teams that try to bridge the gap. At one place I worked, the organization had accepted that trying to have the systems needs of their Web site fulfilled by six separate infrastructure teams was not working well, so they created a “Web systems” team designed to sit astride those, take primary responsibility, and then broker needs to the other infrastructure teams. This was an improvement, and led to the addition of a parallel team responsible for internal apps, but never really got to the level of being highly effective. In addition those were extremely high-stress roles, as they bore responsibility but not control of all the results.

Conclusion

Though this is the most typical organization of technology teams historically, that history comes from a place much different than many situations we find ourselves in today. The rapid collaboration approach that Agile has brought us, and the additional understanding that Lean has given us in the software space, tells us that though this approach has its merits it is much overused and other approaches may be more effective, especially for product development.

Next, we’ll look at embedded crossfunctional service teams!

2 Comments

Filed under Agile, DevOps

Tagged as agile, DevOps, formation, Management, org, organization, structure, team

by Ernest Mueller | June 14, 2010 · 2:00 pm

Log Management Tools

We’re researching all kinds of tools as we set up our new cloud environment, I figure I may as well share for the benefit of the public…

Most recently, we’re looking at log management. That is, a tool to aggregate and analyze your log files from across your infrastructure. We love Splunk and it’s been our tool of choice in the past, but it has two major drawbacks. One, it’s quite expensive. In our new environment where we’re using a lot of open source and other new-format vendors, Splunk is a comparatively big line item for a comparatively small part of an overall systems management portfolio.

Two, which is somewhat related, it’s licensed by amount of logs it processes per day. Which is a problem because when something goes wrong in our systems, it tends to cause logging levels to spike up. In our old environment, we keep having to play this game where an app will get rolled to production with debug on (accidentally or deliberately) or just be logging too much or be having a problem causing it to log too much, and then we have to blacklist it in Splunk so it doesn’t run us over our license and cause the whole damn installation to shut off. It took an annoying amount of micromanagement for this reason.

Other than that, Splunk is the gold standard; it pulls anything in, graphs it, has Google-like search, dashboards, reports, alerts, and even crazier capabilities.

Now on the “low end” there are really simple log watchers like swatch or logwatch. But we’d really like something that will aggregate ALL our logs (not just syslog stuff using syslog-ng – app server logs, application logs, etc.), ideally from both UNIX and Windows systems, and make them usefully searchable. Trying to make everything and everyone log using syslog is an ever receding goal. It’s a fool’s paradise.

There’s the big appliance vendors on the “high end” like LogLogic and LogRhythm, but we looked at them when we looked at Splunk and they are not only expensive but also seem to be “write only solutions” – they aggregate your logs to meet compliance requirements, and do some limited pattern matching, but they don’t put your logs to work to help you in your actual work of application administration the dozen ways Splunk does. At best they are “SIEM”s – security information and event managers – that alert on naughty intruders. But with Splunk I can do everything from generate a report of 404s to send to our designers to fix their bad links/missing images to graph site traffic to make dashboards for specific applications for their developers to review. Plus, as we’re doing this in the cloud, appliances need not apply. (Ooo, that’s a catchy phrase, I’ll have to use that for a separate post!)

I came across three other tools that seem promising:

Logscape from Liquidlabs – does graphing and dashboards like Splunk does. And “live tail” – Splunk mysteriously took this out when they revved from version 3 to 4! Internet rumor is that it’s a lot cheaper. Seems like a smaller, less expensive Splunk, which is a nice thing to be, all considered.
Octopussy – open source and Perl based (might work on Windows but I wouldn’t put money on it). Does alerting and reporting. Much more basic, but you can’t beat the price. Don’t think it’ll meet our needs though.
Xpolog – seems nice and kinda like Splunk. Most of the info I can find on it, though, are “What about xpolog, is good!” comments appended to every forum thread/blog post about Splunk I can find, which is usually a warning sign – that kind of guerrilla marketing gets old quick IMO. One article mentions looking into it and finding it more expensive but with some nice features like autodiscovery, but not as open as Splunk.

Anyone have anything to add? Used any of these? We’ve gotten kind of addicted to having our logs be immediately accessible, converted into metrics, etc. I probably wouldn’t even begrudge Splunk the money if it weren’t for all the micromanagement you have to put into running it. It’s like telling the fire department “you’re licensed for a maximum of three fires at a time” – it verges on irresponsible.

21 Comments

Filed under DevOps

Tagged as aggregation, analysis, log, logfile, Management, splunk, tools

Tag Archives: Management

Compliance

Management

Gratefulness Empowers Recognition

Ungratefulness Leads To Bad Decisions

The Project Based Organization Model

Benefits Of The Project Based Model

Drawbacks Of The Project Based Model

Experience With The Project Based Model

The Fully Integrated Service Team Model

Benefits Of Integrated Service Teams

Drawbacks of Integrated Service Teams

Experience with Integrated Service Teams

The Embedded Crossfunctional Team Model

Benefits of Embedded Crossfunctional Teams

Drawbacks of Embedded Crossfunctional Teams

Experience with Crossfunctional Teams

Part 1: “Planned” Work

Product Management

Work Management in JIRA

Portfolio Management

Team Management

Knowledge Management

Part 2 – “Unplanned” Work

Triage

The Dynamic Duo

Conclusion

Cloud Procurement Flowchart

What About Compliance?

What About The Cost?

The Separate Team By Discipline Model

Benefits of Disciplinary Teams

Drawbacks of Disciplinary Teams

Conclusion

Subscribe

Recent Comments

Recent Posts

Austinites

Cloud

DevOps

Archives