There is a lot of discussion lately about how SRE fits into or competes with or whatever-s with DevOps. I’m scheduled to speak on an “SRE vs DevOps Smackdown” panel today here at Innotech Austin, and at the exact same time I see Bridget tweeting Liz Fong-Jones’ slides from Velocity on using SRE to implement DevOps. And the more I think about it, and see what people are doing, the more worried I get.
The Big Lie
Just to get the easily provoked to put down their pitchforks, I don’t dislike SRE and I don’t dislike Kanban. The reason I call Kanban a “Big Lie” is that really doing Kanban correctly and getting the value out of it requires even more discipline than doing something like Scrum. But it looks so close to doing nothing new that many lazy teams out there say they’re “doing Kanban” and by that they mean they’re doing nothing, but they’ve turned on the Kanban view in JIRA for your convenience. They have no predictability, they’re not managing WIP, they’re not identifying bottlenecks – they just have a visible board now and that’s it. I strongly believe from my experience that most teams “doing Kanban” are really doing mostly nothing. There are articles on this blog about how I make the teams I’m teaching Agile do Scrum first if they want to get to Kanban, to build up the required discipline. And I’m not just a crank; David Hawks from Agile Velocity told our management team the same thing yesterday, which brought this back to mind for me and spurred this article.
Because I’m starting to see the same thing with SRE. It’s not surprising – there was and is plenty of “DevOps-washing” of existing teams out there. Rename your ops team DevOps, done. Well, at least DevOps was able to say “it’s a methodology not a job description or group name stop it” to force deeper thought – it’s why my team at work is the “Engineering Operations” team not “DevOps”, Lee Thompson insisted on that when he set it up! But SRE – yeah, it’s a team just like your own ops team, from an “org chart” viewpoint it looks the same. So doing SRE can – and in many shops does – mean doing nothing new. You just call your existing ops team SRE and figure you’re done.
A brief personal history lesson – my last job before DevOps hit was running the Web Systems team at National Instruments, an ops team. That’s where we agile admins met; Peco and James were both ops engineers on that team! (Karthik was a dev we worked with.) We had smart people and did ops all right. We had automation, monitoring, we had “definition of done” standards for new services. You wouldn’t have to squint too hard to just call that team an SRE team and call it a day. But I wouldn’t wish that job on my worst enemy. It was brutal trying to do ops for just 4-5 dev teams, and that’s with business support, some shared goals, and so on. Our quality of life was terrible, we weren’t empowered, and no matter how hard we tried, success was always just out of our grasp. When we actually started a team using DevOps thinking at NI after that, the difference was night and day, and we actually began to enjoy our jobs as ops engineers. I would hate for anyone to deceive themselves into thinking they’re getting the goodness they should be able to get from a DevOps/“real” SRE approach while still just doing it the way we were doing it.
I have a friend at a local legal software firm, who told me they’re going through and just renaming all the QA folks to SWET (Software Engineer In Test), whether they can code or not, and all the Ops folks to SREs in this manner. One might be charitable and say they’re leaning forward and they intend to loop around and back that up with retraining or something, but… will they? Probably not, it’s just a rename to the hot new term without any of the changes to help those engineers succeed more in their jobs!
SRE isn’t “an implementation of DevOps” if you just apply it as a name for a hopped-up ops team. Properly understood, it can be an implementation of one of the three parts of DevOps: Infrastructure As Code, Continuous Integration/Deployment, and Site Reliability Engineering. But note that reliability engineering doesn’t start with deploy to production. So much of it is Michael Nygard-esque techniques to write your app reliably in the first place. Reliability engineering, in usual DevOps fashion, requires dev and ops work both way back in the dev cycle and out in production to work right. It doesn’t need to be a different team. If it is, and that team doesn’t get to decide if it takes over ops for a given app, and it’s not allowed to spend 50% of its time on reducing toil, and you’re not comping SREs like you do dev engineers – it’s not SRE and you’re a liar for calling it SRE. If you don’t keep DevOps principles in mind, you’re just going to get your old ops team with its old problems again.
That’s why SRE is a Big Lie – because it enables people to say they’re doing a thing that could help their organization succeed, and their dev and ops engineers to have a better career and life while doing so – but not really do it. Yes, there have been Big Lies before, which is why I cite Kanban as another example – but even if the new criminal is pretty much like the old criminal, you still put their picture up on the post office wall.
Frankly, anyone pushing SRE that doesn’t put warning labels on it is contributing to the problem. “Well but it mentions in chapter 20 of the second book,” said someone responding to the first version of this article on Twitter. Not good enough. If something you’re selling is profoundly misused it’s your responsibility to be more up front about the issues.
The Little Problems
Now there are legitimate issues to have even with the “real SRE” model, at least the way that it’s usually being described. The Google books kinda try to have it both ways, describing it as an engineering practice (how I describe it above and in the SRE course I did for LinkedIn) and describing it as “a team that works this way.” Even among those not SRE-washing classical ops, the generally understood model is that SRE is an org/job title for a production operations team.
There’s an issue here, the problem of specialization. If you are Google scale, well, then you’re going to have to specialize and a separate ops team makes sense. But – first of all, you are not Google scale. In my opinion, if you are under 100 engineers, you are committing an error by having a separate ops team. You need your product teams to own their products. Second of all – I don’t want to make an enemy of all the lovely Google engineers out there, but is your experience with Google services that they evolve quickly and get better once they go to wide release? It’s not mine. They rot. Have you used Google Hangouts lately without it ending up with cursing and moving off to someone’s Zoom? That kind of specialization still has its downsides in terms of hindering your feedback loops that let you improve (the Second Way). Is SRE just Google-ese for “sustaining?”
I get that the Google folks say they still get feedback and innovation using the SRE model. I’m sure they do and they work hard at that, but that doesn’t change the fact that running a separate ops team is making a deliberate tradeoff between innovation and efficiency. There is no way that you get as much feedback or improve as quickly with a separate team. You can compensate for it, but you’re still saying “look… Not as important.” Which is fine if that’s your situation; I worked at many companies with 200 abandoned apps in production and you had to do something. But “not getting there in the first place” is better.
Some of the draw of the model, and why Google is highly aligned with it, is Kubernetes itself. k8s is very complex to run, which drives people back a little bit to the old priest-in-the-tabernacle model of “someone maintains the infrastructure and you write the app and then you have them deploy it,” but now there are some standards (like deploying as a container) that make that OK – I guess? But if you think reliability and observability are the primary responsibility of an ops team that is not involved in constructing the application, you either have deep and profound company standards that allow seamless plugging of the one into the other or you’re fooling yourself. 90% of you are fooling yourself.
At this conference I heard “Service meshes! They get you observability so your devs don’t have to think about it.” Do you not see how dangerous that mindset is?
SRE, interpreted as “a separate newfangled ops team,” may work for some, but you need to be realistic about the issues and tradeoffs you’re making. Consider whether you’d be better served by product teams supporting their own product, maybe with aid from a platform team making tooling and an enabling/consulting/center of excellence team that can give expert advice. DevOps helped us see how the “throw it over the wall from dev to ops” model was profoundly harming our industry. Throwing over the wall from dev to SRE doesn’t improve that; it’s profoundly regressive. Doing SRE “right” to compensate for this, like doing Kanban right, requires more skill and discipline, not less – be realistic about whether you have Google levels of skill and discipline in your org, eh?
SRE (and Kanban) aren’t bad, they have their pros and cons, but they are easy to “pretend to do” in some minimal, cargo cult-ey way that gets you little of the benefits. And if you think spinning up an ops team and calling it SRE is “an implementation of DevOps” you’ve swallowed the worst poison pill the DevOps talk circuit can deal to you.
15 responses to “SRE: The Biggest Lie Since Kanban”
So, what does SRE stand for?
I could have sworn I linked the online google SRE book with my first mention of SRE. Hmm. Added it now! “Site Reliability Engineering/Engineers”, see link for more.
I may never have read so true an article.
So true it depressed some of my friends. They were thinking their experience here in France was unique… sad news if it’s everywhere…
I’m very interested in your experience with “DevOps thinking”. Do you have any link / article / talk I can use as reference? 🙂
Well, we try to define DevOps here on the blog: https://theagileadmin.com/what-is-devops/. Also the DevOps Handbook is a good book on the subject. And I and the other Agile Admins have a series of DevOps courses on lynda.com/LinkedIn Learning, you can start with DevOps Foundations and go from there: https://www.linkedin.com/learning/devops-foundations – it will point you at CALMS, the Three Ways, and other foundational DevOps work.
Between me, Peco, Karthik, and James we have probably a dozen courses in the DevOps 101/201 kind of space.
Thank you for expressing this so cogently. The habit of using attractive terms (e.g. ‘resilience’) without understanding them and without any strong commitment to the principles that underlie them is widespread. Adopting language without fully embracing the duties and responsibilities that it entails is what Bonhoeffer called “cheap grace”.
To describe this as a ‘big lie’ is, however, hyperbole, and hyperbolic blogging potentially undermines the importance of your observations. The people who are behaving this way are not necessarily lying. To lie requires knowledge and the desire to mislead. Neither of these is prominent in your description. It is likely that the behaviors you so rightly condemn arise not from an intention to deceive but from powerlessness and confusion.
Just hearing that $NEWTERM is important or even crucial to success can lead management to insist that the organization must embrace $NEWTERM and pressure the technical staff to adopt it, often (usually?) without either a clear understanding of what $NEWTERM actually requires or the resources and commitment needed to achieve it. Instead of empowering reform and innovation, programmatic approaches generate yet-another-set-of-demands for the technical workforce — turning up the heat inside the cooker that so many devops live within.
Of course this is bad management but not unusual and many orgs are fairly well defended against it. As you point out, relabeling is a common response. [It is possible to imagine the organization as a vast associative array that simply substitutes $NEWTERM for $OLDTERM. But I digress.] Doing this allows everyone to imagine progress and recovers some of the organization’s ‘face’ and does so at a low cost. Cheap grace indeed.
My concern with this is two-fold. First, the approach generates cynicism. All organizations are learning organizations; every person incorporates new experiences into her frame of reference; learning is continuous and unstoppable. But the learning from experiences like those you have observed is mostly about how stupid and uninterested most of the organization is about the real work of making and running systems and how unlikely it is that any new program will produce anything of value.
Second, the approach makes it even harder to gain the value that the $NEWTERM approaches and ideas actually have. Labeling $OLDTERM as $NEWTERM, giving lip service to the substance, robs $NEWTERM of its deeper meaning. The meaning of devops and SRE may not be entirely clear: they are both relatively ill-defined collections of observations about how people work to make systems and make them work. That fragility of meaning makes them susceptible to the renaming sleight-of-hand trick you so rightly condemn. But the sleight-of-hand trick itself makes it even harder to discern the important substantive features of these nascent phenomena.
The incentive to engage in such shenanigans is amplified by the unwarranted claims of so many that they have been able to achieve astonishing results from $NEWTERM. Conference presentations routinely take the form of “$COMPANY made enormous gains by adopting $NEWTERM”, implying that those who do not adopt $NEWTERM will quickly fall behind. I am sure that you have attended such presentations and from your independent experience with $COMPANY recognize that these claims are exaggerated or even, in a few cases, fabricated. There is a small industry around ginning up such claims and presenting them as ‘evidence’ that this phenomenal performance is available to your org if only you will adopt $NEWTERM or, as a cynic would describe it, drink the kool aid.
I am confident that there is much to be gained from $NEWTERM just as there is from kanban and just as there was from structured walkthroughs (Yourdon, 1977). Devops and SRE are powerful ideas, albeit difficult to grasp. You are absolutely correct that a lot of what is being labeled as SRE, devops, kanban, scrum, standup, etc. is “old wine in new bottles”. But the eagerness to adopt, the pent up desire to step away from brittle ways of doing things and always-teetering-on-the-brink-of-failure systems are both quite real. The unscrupulous prey on that desire by encouraging cheap grace. Your understanding of how difficult real grace is to achieve is deep and powerful. But the difficulty is not, I think, that we are being lied to but that we are so reluctant to give up the idea that cheap grace will get us to where we would like to be.
Thank you for writing this interesting post!
Thanks for the response Dr. Cook!
Here are my thoughts. While I agree that the misuse of SRE isn’t born entirely out of a desire to mislead, I deliberately used the “big lie” term to highlight what I frankly think is both active and passive deception around the topic that should be confronted.
Not having the resources or know-how to do SRE – ok, fair enough.
However, the relabeling and saying “we’re doing SRE!” when you’re not – it takes a mix of some people actively and knowingly being misleading (usually in management – I’ve worked in places where “lean forward” is code for “lie about it”), and other people going along with it, for all the reasons you cite. And while we can understand why each of those actors does it – it’s that intersection of deliberate misuse, lack of understanding, and passive acceptance because it’s easier that characterizes any real-world Big Lie, right? It’s not “all the people, deliberately lying all the time,” it’s a mix of those factors that does the trick.
And I also wanted to be clear that while those teaching about SRE aren’t lying per se – they are being negligent, at least, in my opinion, in that they see the same thing I do about organizations deceiving themselves about what they’re doing but content themselves with saying “well, but we say in chapter 20 of the second SRE book that you shouldn’t do that…” instead of perhaps being a lot more forward with the “warning label.” There’s always a subjective discussion about how much responsibility a party holds for this kind of user misuse, but I’m drawing a line in the sand and saying it’s gone past “none” to “some” in this particular case, for sure. It’s here I think your “cheap grace” analogy works best, because there’s a natural tension between wanting to spread the word, get more folks on the SRE bandwagon, and lionize the good work Google’s done, and confronting new adherents with the harder truths.
With SRE in particular, I think that the “stones to make men stumble,” to torture this analogy, are that
1. If your SREs are not spending ~50% of their time automating to reduce toil,
2. If they are not able to accept or reject services to support of their own free will, and
3. If there is not an error budget that, if exceeded, compels the development team to cease feature dev and release to instead work on bugs/reliability
Then you are not doing SRE, you are doing standard old school ops the way we used to do it and hoping that applying the SRE label is making it more DevOpsey somehow. The whole big ol’ SRE book (2 books, now) talks about a bunch of other stuff which is all fine but isn’t the really transformative part. So if you read those books and, understandably, can’t do all of it, you’re at real risk of just doing the bits that are frankly not transformative (J. Paul is right to point out the Release Management chapter on this point) and ignoring the couple pieces that *are* the great part.
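A minimal sketch of those three tests, under stated assumptions. None of this comes from the Google books: the names `SreSetup`, `is_really_sre`, and `release_decision` are hypothetical, the 0.5 cutoff simply treats the "~50%" figure above as a hard threshold, and counting failed requests against an availability target is only one of several ways an org might define its error budget.

```python
from dataclasses import dataclass


@dataclass
class SreSetup:
    """Hypothetical description of how an 'SRE' team is actually set up."""
    toil_reduction_fraction: float  # share of SRE time spent automating away toil
    can_reject_services: bool       # SREs may decline (or hand back) a service
    error_budget_enforced: bool     # a blown budget really halts feature releases


def is_really_sre(setup: SreSetup, min_toil_fraction: float = 0.5) -> bool:
    """True only if all three conditions from the list above hold."""
    return (
        setup.toil_reduction_fraction >= min_toil_fraction
        and setup.can_reject_services
        and setup.error_budget_enforced
    )


def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Failures still allowed in this window (negative means the budget is blown)."""
    allowed_failures = (1.0 - slo_target) * total_requests
    return allowed_failures - failed_requests


def release_decision(slo_target: float, total_requests: int,
                     failed_requests: int) -> str:
    """Condition 3: an exceeded budget compels reliability work over features."""
    if error_budget_remaining(slo_target, total_requests, failed_requests) < 0:
        return "freeze features"
    return "ship features"


# A renamed ops team: little automation time, no say in what it supports.
renamed_ops = SreSetup(toil_reduction_fraction=0.1,
                       can_reject_services=False,
                       error_budget_enforced=False)
print(is_really_sre(renamed_ops))

# A 99.9% target over a million requests allows roughly 1,000 failures.
print(release_decision(0.999, 1_000_000, 1_500))
```

The point of writing it down is that all three fields are about how the org is set up, not about the skills of the people in it; a team can be excellent and still fail every check.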
Oh, a related thought of Google doing it right – the “Measure What Matters” book on OKRs. On page 9 of that book, right up front John Doerr cites the Harvard Business School “Goals Gone Wild” article about the harm using goals badly can have on organizations and even reproduces a big “WARNING!” label graphic about how they can be misused. This understands, acknowledges, and educates people about how the OKR technique can be misused as well as used from the very beginning. If he didn’t include that in his book, would it be as honest a book? No it wouldn’t.
The SRE book (and as a result the industry) could benefit a lot from having the same warning up front, and focusing on the few elements that make SRE an actual advancement and not just a regressive step back to siloed ops. As with OKRs, people will still misuse it, but at least then, as Doerr did, you can say “Hey, I told you right up front how not to misuse this.”
By the way, I’ve been very impressed with the thoughtful responses here and on Twitter, from people who have clearly bothered to read the article and understand it and not just tweet responses to the title; it heartens me that deeper discourse is still possible (some weeks I doubt, to be honest).
I take your point. “Don’t drink the same old wine in a brand new bottle.” However my Kanban experience is to break out of Sprint and transition to continuous delivery. As a “Developer Advocate” formerly known as DevOpsian, I do want metrics, not toil. Automate all the things! I also work every ticket I’m assigned (or create) as its own “Sprint.” I am a toolsmith. As soon as I test my solution, I deliver it to production, which in my case is frequently Jenkins. That Jenkins instance is a production server to the developers I support. It also happens to be my dev environment. 😉
We’re all learners. If a team says they’re doing SRE, they might have just begun, they might not be doing it properly at first, and it might take them months if not a year to put all the practices in place. I’m someone who brought SRE into my organization; we might not have Google-like skills, but that doesn’t mean we can’t learn or hire the talent.
Right, but my whole point is that it’s not about the skills, it’s about how you set up the “SRE”-ing. The secret sauce isn’t that Google hires mysterious special people. It’s that they set up their SRE org to not work like an old school ops org. But most people miss this and set it up like an old school ops org. You don’t hire your way out of that (all the same people in SRE orgs now used to work in older orgs, except the kids), and it’s hard to “learn your way out” of a defective setup.