Tag Archives: SRE

The Right Way To Use Tagging In The Cloud

This came up today at work and I realized that over my now-decades of cloud engineering, I have developed a very specific way of using tags that sets both infra dev teams and SRE teams up for success, and I wanted to share it.

Who cares about tags? I do. They are the only persistent source of information you can trust (as much as you can trust anything in this fallen world) to communicate information about an infrastructure asset beyond what the cloud or virtualization fabric it’s running in knows. You may have a Terraform state, you may have a database or etcd or something that knows what things are – but those systems can go down or get corrupted. Tags are the one thing that anyone who can see the infrastructure – via console or CLI or API or integrated tool – can always see. Server names are notoriously unreliable – ideally in a modern infrastructure you don’t reuse servers from one task to another or put multiple workloads on one, but that’s a historical practice that pops up all too often, and server names have character limits (and even if they don’t, the management systems around them usually enforce one).

Many powerful tools like Datadog are driven almost entirely by tags. It simplifies operation and prevents errors if, when you add a new production app server, it automatically gets pulled into the right monitoring dashboards and alerting schemes because it is tagged right.

I’ve run very large complex cloud environments using this scheme as the primary means to drive operations.

Top level tag rules:

  1. Tag everything. Tagging’s not just for servers. Every cloud element that can take a tag, tag it. Network, disk images, snapshots, lambdas, cloud services, weird little cloud widgets (“S3 VPC endpoint!”).
  2. Use uniform tags. It’s best to specify “all lower case, no spaces” and so on. If people word a tag slightly differently in two places, the value is lost. This goes for both the key and the value, but especially the key – teach people that if you say “owner” that means “owner”, not “Owner” or “owning party” or whatever else. (See the sketch after this list.)
  3. Don’t overtag with attributes you can easily see. Instance size, what AZ it’s in, and so on are already part of the cloud metadata, so it’s inefficient to add tags for them.
  4. Use standard tags. This is what I’ll cover in the rest of this article.
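
As a minimal sketch of what enforcing rule 2 can look like in practice – the exact normalization rules here are an illustrative assumption, not a standard – a small pre-flight step keeps “Owner” and “owner” from ever diverging:

# Minimal sketch: lowercase tag keys and values and strip spaces from keys
# before anything gets tagged, so variant spellings can't creep in.
import re

def normalize_tags(raw_tags: dict) -> dict:
    clean = {}
    for key, value in raw_tags.items():
        clean[re.sub(r"\s+", "", key.strip().lower())] = str(value).strip().lower()
    return clean

# normalize_tags({"Owner ": "MyTeam@MyCompany.com"}) -> {"owner": "myteam@mycompany.com"}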

At the risk of oversimplifying, you need two things out of your systems environment – compliance and management. And tags are a great way to get both.

Compliance

Attribution! Cost! Security! You need to know where infrastructure came from, who owns it, who’s paying for it, and if it’s even supposed to be there in the first place.

Who owns it?

Tag all cloud assets with an owner – an email address, or basically whatever is required to uniquely identify who owns an asset. It should be a team email for persistent assets; if it’s a personal email, the assumption should be that if that person leaves the company, those assets get deleted (good for sandboxes etc.).

The amount of highly paid engineer time I’ve seen wasted over the last decade on cattle calls of “Hey, who owns these… we need to turn some off for cost or patch them for security or explain them for compliance… No really, who owns these…” is shocking.

owner:myteam@mycompany.com

Who’s paying for it

This varies but it’s important. “Owner” might not be sufficient in an environment – often some kind of cost allocation code is required based on how your company does finances. Is it a centralized expense or does it get allocated to a client? Is it a production or development expense? Those are often handled differently from a finance perspective. At scale you may need a several-parter – in my current consulting job there’s a contract number but also a specific cost code inside that contract number that all expenses need to be divvied up between.

billing:CUCT30001

Where did it come from

Traceability both “up” and “down” the chain. When you go look at a random cloud instance, even if you know who it belongs to you can’t tell how it got there. Was it created by Terraform? If so where’s the state file? Was it created via some other automation system you have? Github? Rundeck? Custom python app #25?

Some tools like CloudFormation do this automatically. Otherwise, consider adding a source tag or set of tags with sufficient information to trace the live system back to the automation. Developers love tagging git commits and branches with versions and JIRA tickets and release dates and such; the same concept applies here. Different things make sense depending on your tech stack – if you GitOps everything then the source might be a specific build, or you might want to say which S3 bucket your tfstate is in… As an example, I’m working with a system that is Terraform-instantiated from a GitOps pipeline, so I’ve made a source tag that says github, then the repo name, then the action name. And for the tfstate, I have it saved in an S3 bucket named “mystatebucket.”

source:github/myapp/deploy-action
sourcestate:s3/mystatebucket
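
As a sketch of stamping those on at create time from the pipeline: GITHUB_REPOSITORY and GITHUB_WORKFLOW are real GitHub Actions environment variables, boto3’s create_tags is the real EC2 call, and the instance ID, owner, billing code, and bucket name are placeholders for illustration.

# Sketch: build the compliance tag set from the GitOps pipeline's environment
# and stamp it onto a freshly created instance. Everything hardcoded here
# (owner, billing code, bucket, instance ID) is a placeholder.
import os
import boto3

def compliance_tags() -> list:
    repo = os.environ.get("GITHUB_REPOSITORY", "myorg/myapp").split("/")[-1]
    workflow = os.environ.get("GITHUB_WORKFLOW", "deploy-action")
    return [
        {"Key": "owner", "Value": "myteam@mycompany.com"},
        {"Key": "billing", "Value": "CUCT30001"},
        {"Key": "source", "Value": f"github/{repo}/{workflow}"},
        {"Key": "sourcestate", "Value": "s3/mystatebucket"},
    ]

ec2 = boto3.client("ec2")
ec2.create_tags(Resources=["i-0123456789abcdef0"], Tags=compliance_tags())

If the stack is pure Terraform, the AWS provider’s default_tags block can get you the same effect without any custom code.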

When does it go

OK, I know the last two sound like the lyrics to “Cotton-Eyed Joe”, which is a bonus. But a major source of cost creep is infrastructure that was intended to be there for a short time – a demo, a dev cycle – that ends up just living forever. And sure, you can just send nag-o-grams to the owner list, but it’s better to tag systems with an expires tag in date format (ideally YYYY-MM-DD-HH-MM-SS as God intended). “expires:never” is acceptable for production infrastructure, though I’ve even put real expiration dates on autoscaling prod infrastructure to make sure systems get turned over and don’t live too long.

expires:2025-02-01-00-00-00
or
expires:never
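
Then a scheduled reaper can walk the estate and find anything past its date. Here’s a minimal sketch using boto3’s describe_instances – what you do with the hits (nag the owner tag, stop, terminate) is policy, so this just reports them:

# Sketch: find EC2 instances whose expires tag is in the past. Assumes the
# YYYY-MM-DD-HH-MM-SS format above; "never" is skipped, unparseable values
# get flagged for a human to look at.
from datetime import datetime, timezone
import boto3

ec2 = boto3.client("ec2")

def expired_instances() -> list:
    now = datetime.now(timezone.utc)
    expired = []
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate(Filters=[{"Name": "tag-key", "Values": ["expires"]}]):
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
                value = tags.get("expires", "never")
                if value == "never":
                    continue
                try:
                    when = datetime.strptime(value, "%Y-%m-%d-%H-%M-%S").replace(tzinfo=timezone.utc)
                except ValueError:
                    expired.append(instance["InstanceId"])  # unparseable: flag it
                    continue
                if when < now:
                    expired.append(instance["InstanceId"])
    return expired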

Management

Operations! Incidents! Cost and security again! Keep the entire operational cycle, including “during bad production incidents”, in mind when designing tags. People tear down stacks/clusters, or go into the console and “kill servers”, and accidentally leave other infrastructure – you need to be able to identify and clean up orphaned assets. Hackers get your AWS key and spin up a huge volume of bitcoin miners. Identifying and actioning on infrastructure accurately and efficiently is the goal.

As in any healthy system, the “compliance” tags above aren’t just useful to the beancounters, they’re helpful to you as a cloud engineer or SRE. But beyond that, you want a taxonomy of your systems to use to manage them by grouping operations, monitoring, and so on.

This scheme may differ based on your system’s needs, but I’ve found a general formula that fits in most cases I come across. Again, it assumes virtual systems where servers have one purpose – that’s modern best practice. “Sharing is the devil.”

EARFI

I like to pronounce this “errr-feee.” It’s a hierarchy to group your systems.

  • environment – What environment does this represent to you, e.g. dev, test, production, as this is usually the primary element of concern to an operator. “environment:uat” vs “environment:prod”.
  • application – What application or system is this hosting? The online banking app? The reporting system? The security monitoring server? The mobile game backend? GenAI training? “application:banking”.
  • role – What function does this specific server perform? Webserver, dbserver, appserver, kafka – systems in an identical role should have identical loadouts. “role:apiserver” vs “role:dbserver”. Keep in mind this is a hierarchy and you won’t have guaranteed uniqueness across it – for example, “application:banking,role:dbserver” may be quite different from “application:mobilegame,role:dbserver”, so you would usually never refer to just “role:dbserver.”
  • flavor – Optional, but useful in case you need to differentiate something special in your org that is a primary lever of operation (Windows vs Linux? CPU vs GPU nodes in the same k8s cluster? v1 vs v2?). I usually find there’s only one of these (besides of course region and things you shouldn’t tag because they are in other metadata). For our apiserver example, consider that maybe we have the same code running on all our api servers, but via load balancer we send REST queries to one set and SOAP queries to another set for caching and performance reasons. “flavor:rest” vs “flavor:soap”.
  • instance – A unique identifier among identical boxes in a specific EARF set, most commonly just an integer. “instance:2”. You could use a GUID if you really need it but that’s a pain to type for an operator.

This then allows you to target specific groups of your infrastructure, down to a single element or up to entire products.  

  • “Run this week’s security patches on all the environment:uat, application:banking, role:apiserver, flavor:rest servers. Once you verify, you can do the same on environment:prod.”
  • “The second of the three servers in that autoscaling group is locked up. Terminate environment:uat, application:banking, role:apiserver, flavor:rest, instance:2.”
  • “We seem to be having memory problems on the apiservers. Is it one or all of the boxes? Check the average of environment:prod, application:banking, role:apiserver, flavor:rest and then also show it broken down by instance tag. It’s high on just some of the servers but not all? Try flavor:rest vs flavor:soap to see if it’s dependent on that functionality. Is it load do you think? Compare to the aggregate of environment:uat to see if it’s the same in an idle system.”
  • “Set up an alert for any environment:prod server that goes down. And one for any environment:prod, application:banking, role:apiserver that throws 500 errors.”
  • “Security demands we check all our DB servers for a new vulnerability. Try sending this curl payload to all role:dbservers, doesn’t matter what application. They say it won’t hurt anything but do it to environment:uat before environment:prod for safety.”
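
In tooling terms, each of those is just a tag filter. Here’s a minimal sketch against EC2 – the tag scheme is the one above, and the handoff to whatever actually patches, terminates, or graphs the targets is left out:

# Sketch: resolve an EARFI query to instance IDs via tag filters, then hand the
# list to whatever does the actual work (SSM, Ansible, a dashboard, etc.).
import boto3

ec2 = boto3.client("ec2")

def instances_matching(**tags) -> list:
    filters = [{"Name": f"tag:{key}", "Values": [value]} for key, value in tags.items()]
    ids = []
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate(Filters=filters):
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                ids.append(instance["InstanceId"])
    return ids

# "Patch all the UAT banking REST apiservers" becomes:
targets = instances_matching(environment="uat", application="banking",
                             role="apiserver", flavor="rest")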

So now a random new operator gets an alert about a system outage, logs into the AWS console, and sees not just “i-123456 started 2 days ago” – they see:

owner:myteam@mycompany.com
billing:CUCT30001
source:github/myapp/deploy-action
sourcestate:s3/mystatebucket
expires:never
environment:prod
application:mobilegame
role:dbserver
flavor:read-only
instance:2

That operator now has a huge amount of information to contextualize their work – information that at best they’d have to go look up in docs or systems, and at worst they’d have to just start serially spamming people for. They know who owns it, what generates it, what it does, and get hints at how important it is. (prod – probably important. A duplicate read secondary – could be worse.) And then runbooks can be very crisp about what to do in what situation by also using the tags. “If the server is environment:prod then you must initiate an incident <here>… If the server is a role:dbserver and flavor:read-only it is OK to terminate it and bring up a new one, but then you have to go run runbook <X> and run job <y> to set it up as a read secondary…”
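
A sketch of what that tag-driven runbook logic can look like in code – the specific rules and return strings here are hypothetical, only the tag scheme comes from above:

# Sketch: triage decisions keyed off tags instead of tribal knowledge.
# The rules themselves are made up for illustration.
def triage(tags: dict) -> str:
    env = tags.get("environment", "unknown")
    if env == "prod" and tags.get("role") == "dbserver" and tags.get("flavor") == "read-only":
        return "open an incident; OK to terminate and rebuild, then run the read-secondary setup runbook"
    if env == "prod":
        return "open an incident and escalate to the owner tag"
    return "non-prod: notify the owner tag and handle during business hours"

# Using the tag set shown above:
print(triage({"environment": "prod", "application": "mobilegame",
              "role": "dbserver", "flavor": "read-only", "instance": "2"}))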

Feel free to let me know how you use tags and what you can’t live without!

Filed under Cloud, DevOps, Monitoring

SRE: The Biggest Lie Since Kanban

There is a lot of discussion lately about how SRE fits into or competes with or whatever-s with DevOps.  I’m scheduled to speak on an “SRE vs DevOps Smackdown” panel today here at Innotech Austin, and at the exact same time I see Bridget tweeting Liz Fong-Jones’ slides from Velocity on using SRE to implement DevOps. And the more I think about it, and see what people are doing, the more I’m getting worried.

The Big Lie

Just to get the easily provoked to put up their pitchforks, I don’t dislike SRE and I don’t dislike Kanban.  The reason I call Kanban a “Big Lie” is because really doing Kanban correctly and getting the value out of it requires even more discipline than doing something like Scrum.  But it looks so close to doing nothing new that many lazy teams out there say they’re “doing Kanban” and by that they mean they’re doing nothing, but they’ve turned on the Kanban view in JIRA for your convenience.  They have no predictability, they’re not managing WIP, they’re not identifying bottlenecks – they just have a visible board now and that’s it. I strongly believe from my experience that most teams “doing Kanban” are really doing mostly nothing.  There are articles on this blog about how I make teams I’m teaching Agile do Scrum first, before they get to Kanban, to build up the required discipline.  And I’m not just a crank – David Hawks from Agile Velocity just told our management team the same thing yesterday, which brought this back to mind for me and spurred this article.

Because I’m starting to see the same thing with SRE.  It’s not surprising – there was and is plenty of “DevOps-washing” of existing teams out there.  Rename your ops team DevOps, done. Well, at least DevOps was able to say “it’s a methodology, not a job description or group name – stop it” to force deeper thought; it’s why my team at work is the “Engineering Operations” team and not “DevOps” – Lee Thompson insisted on that when he set it up! But SRE – yeah, it’s a team, just like your own ops team; from an “org chart” viewpoint it looks the same. So doing SRE can – and in many shops does – mean doing nothing new. You just call your existing ops team SRE and figure you’re done.

A brief personal history lesson – my last job before DevOps hit was running the Web Systems team at National Instruments, an ops team.  That’s where we agile admins met – Peco and James were both ops engineers on that team! (Karthik was a dev we worked with.) We had smart people and did ops all right.  We had automation, monitoring, we had “definition of done” standards for new services. You wouldn’t have to squint too hard to just call that team an SRE team and call it a day. But I wouldn’t wish that job on my worst enemy. It was brutal trying to do ops for just 4-5 dev teams, and that’s with business support, some shared goals, and so on. Our quality of life was terrible, we weren’t empowered, and no matter how hard we tried, success was always just out of our grasp. When we actually started a team using DevOps thinking at NI after that, the difference was night and day, and we actually began to enjoy our jobs as ops engineers. I would hate for anyone to deceive themselves into thinking they’re getting the goodness they should be able to get from a DevOps/”real” SRE approach while still just doing it the way we were doing it.

I have a friend at a local legal software firm who told me they’re going through and just renaming all the QA folks to SWET (Software Engineer in Test), whether they can code or not, and all the Ops folks to SREs in this manner. One might be charitable and say they’re leaning forward and intend to loop around and back that up with retraining or something, but… will they? Probably not – it’s just a rename to the hot new term without any of the changes to help those engineers succeed more in their jobs!

SRE isn’t “an implementation of DevOps” if you just apply it as a name for a hopped-up ops team.  Properly understood, it can be an implementation of one of the three parts of DevOps – infrastructure as code, continuous integration/deployment, and site reliability engineering. But note that reliability engineering doesn’t start at deploy to production; so much of it is Michael Nygard-esque techniques to write your app reliably in the first place. Reliability engineering, in usual DevOps fashion, requires dev and ops work both way back in the dev cycle and out in production to work right. It doesn’t need to be a different team.  If it is, and that team doesn’t get to decide if it takes over ops for a given app, and it’s not allowed to spend 50% of its time on reducing toil, and you’re not comping SREs like you do dev engineers – it’s not SRE and you’re a liar for calling it SRE. If you don’t keep DevOps principles in mind, you’re just going to get your old ops team with its old problems again.

That’s why SRE is a Big Lie – because it enables people to say they’re doing a thing that could help their organization succeed, and their dev and ops engineers to have a better career and life while doing so – but not really do it.  Yes, there have been Big Lies before, which is why I cite Kanban as another example – but even if the new criminal is pretty much like the old criminal, you still put their picture up on the post office wall.

Frankly, anyone pushing SRE who doesn’t put warning labels on it is contributing to the problem.  “Well, but it’s mentioned in chapter 20 of the second book,” said someone responding to the first version of this article on Twitter.  Not good enough. If something you’re selling is profoundly misused, it’s your responsibility to be more up front about the issues.

The Little Problems

Now there are legitimate issues to have even with the “real SRE” model, at least the way it’s usually being described.  The Google books kinda try to have it both ways, describing it as an engineering practice (how I describe it above and in the SRE course I did for LinkedIn) and describing it as “a team that works this way.”  Even among those not SRE-washing classical ops, the generally understood model is that SRE is an org/job title for a production operations team.

There’s an issue here: the problem of specialization.  If you are Google scale, well, then you’re going to have to specialize, and a separate ops team makes sense.  But – first of all, you are not Google scale.  In my opinion, if you are under 100 engineers, you are committing an error by having a separate ops team. You need your product teams to own their products. Second of all – I don’t want to make an enemy of all the lovely Google engineers out there, but is your experience with Google services that they evolve quickly and get better once they go to wide release?  It’s not mine.  They rot.  Have you used Google Hangouts lately without it ending up with cursing and moving off to someone’s Zoom? That kind of specialization still has its downsides in terms of hindering the feedback loops that let you improve (the Second Way). Is SRE just Google-ese for “sustaining”?

I get that the Google folks say they still get feedback and innovation using the SRE model – I’m sure they do, and they work hard at that – but that doesn’t change the fact that running a separate ops team is making a deliberate tradeoff between innovation and efficiency. There is no way you get as much feedback or improve as quickly with a separate team; you can compensate for it, but you’re still saying “look… Not as important.” Which is fine if that’s your situation – I worked at many companies with 200 abandoned apps in production, and you had to do something.  But “not getting there in the first place” is better.

Some of the draw of the model, and why Google is highly aligned with it, is Kubernetes itself. k8s is so complex to run that it drives people back a little bit to the old priest-in-the-tabernacle model of “someone maintains the infrastructure and you write the app and then you have them deploy it,” but now there are some standards (like deploying as a container) that make that OK – I guess? But if you think reliability and observability are the primary responsibility of an ops team that is not involved in constructing the application, you either have deep and profound company standards that allow seamless plugging of the one into the other or you’re fooling yourself. 90% of you are fooling yourself.

At this conference I heard “Service meshes!  They get you observability so your devs don’t have to think about it.” Do you not see how dangerous that mindset is?

SRE, as interpreted as “a separate newfangled ops team,” may work for some, but you need to be realistic about the issues and tradeoffs you’re making.  Consider whether product teams supporting their own products – maybe with aid from a platform team making tooling and an enabling/consulting/center-of-excellence team that can give expert advice – would serve you better.  DevOps helped us see how the “throw it over the wall from dev to ops” model was profoundly harming our industry.  Throwing over the wall from dev to SRE doesn’t improve that; it’s profoundly regressive. Doing SRE “right” to compensate for this, like doing Kanban right, requires more skill and discipline, not less – be realistic about whether you have Google levels of skill and discipline in your org, eh?

Conclusion

SRE (and Kanban) aren’t bad, they have their pros and cons, but they are easy to “pretend to do” in some minimal, cargo cult-ey way that gets you little of the benefits. And if you think spinning up an ops team and calling it SRE is “an implementation of DevOps” you’ve swallowed the worst poison pill the DevOps talk circuit can deal to you.

Filed under DevOps