We talk a lot about communication in the DevOps space. Often the discussion is about how to “talk nice” and be collaborative, but I wanted to take a minute to talk about something even more profound – the precision of your communication in the first place.
I was reminded of this because over the holidays I’ve been doing some nonfiction reading – first, Thomas J. Cutler’s The Battle of Leyte Gulf, and after that, Cornelius Ryan’s classic A Bridge Too Far.
Both books are rife with examples of how imprecise language – often at the highest levels – caused terrible errors and loss of life.
The example that I’ll unpack is the infamous (to WWII wonks, at least) example of Admiral “Bull” Halsey’s naval charge northward leaving the landing forces at Leyte Gulf completely unprotected. He had previously sent a message indicating his intent to split his forces from three into four task forces. When he later sent a message that he was proceeding north with “three task forces” to pursue Japanese aircraft carriers, Admiral Nimitz and other commanders assumed that the fourth was being left behind to defend the landing forces. But this was not the case – he had meant “I plan to reform into four groups, you know, sometime in the future, but not now.” So he went north with his entire fleet, with no one behind to defend the landing forces.
By the time the others began to suss out that something was wrong, it was too late – Halsey was too far north to return in time when a large Japanese fleet came upon the much lighter landing forces and forced them into a lengthy fighting retreat. Many American ships sacrificed themselves against an overwhelming opponent to buy time for the doomed task force, until, for reasons still not clear to posterity, Japanese Admiral Kurita broke off the attack. (Extensive historical analysis hasn’t clearly determined whether he had bad intel, suffered from battle fatigue, or simply lost his nerve. The after action report of Admiral Sprague of the landing forces simply stated that “The failure of the enemy… to wipe out all vessels of this task unit can be attributed to our successful smoke-screen, our torpedo counterattack, continuous harassment of the enemy by bomb, torpedo, and strafing air attacks, timely maneuvers, and the definite partiality of Almighty God.”) A happier ending than it could have been, but still, ships sank and men died due to a lack of clear and precise communication.
The same kind of miscommunication happens at work all the time. A recent real example – I was incident commander on a production incident where a support server used to perform interactive customer debugging sessions couldn’t connect to a customer-facing server. It was being coordinated through HipChat. One engineer was looking at the problem from the customer-facing server. One of the engineers who administers the support server came on to chat and a discussion ensued. The support engineer indicated that there were hundreds of interactive customer support sessions in progress on that support server. Another senior engineer, an expert on the customer servers, declared, “Well, let’s try rebooting the server. It’s the only thing to do.” “Really?” says the support engineer. “Yes, let’s do it,” says the senior engineer.
Of course he meant the customer-facing server, which we knew didn’t have any traffic at the moment, and not the support server. But the support engineer assumed he meant the support server, and proceeded to begin to reboot it – at this point I intervened and had to say “STOP, HE IS NOT TALKING ABOUT YOUR SERVER.” But it was a reasonable mistake – if you talk about “the server,” then everyone is going to interpret that as “the server most important to me at this time.” Anyway, we stopped the support engineer from disrupting hundreds of customer support sessions and we restarted the other server instead. (Which didn’t help, of course; “reboot the server” is something that went out with Windows NT, and it’s a desperate attempt to not perform quality troubleshooting – but that’s material for another post.)
If at any time you find yourself talking to someone else about something there’s clearly more than one of:
- “the” server – or even “the” support server or “the” bastion server
- “the” incident/problem
- “that” ticket
- “that” email I sent you
- “the” bug
- “the” build
You are being a dumbass. I can’t be blunt enough about this. You are deliberately courting miscommunication that in the best case adds time and friction and in the worst case causes major errors to be made. I have seen costly errors happen because of real world exchanges just like this:
“Are you working on the problem?” “Yes.” (And by this they mean the other problem the asker doesn’t know about.)
“It’s critical we fix that big bug from this morning!” “Yes, someone’s on it!” (And by this they mean the bug that got ranked as high criticality in the backlog, but the asker means their own special favorite bug.)
“Is the server back up?” “Yes.” “OK I will now begin a lengthy operation… Hey what the hell?!?” (Oh, you meant another one of the 5 servers being restarted this hour.)
Everyone in an environment should know, in a sober moment, that there are often multiple incidents going on in a day (and even at the same time), that there are multiple servers and multiple tickets (and apps and environments and builds and developers and customers…).
If you hit me in chat and say “the build broke what could be wrong” – and our build server has somewhere around 100 builds in it, all of which are breaking and being fixed continuously by a large group of developers – at best you are being super inconsiderate and forcing me to do diagnostic work that could be forestalled by you cut-and-pasting a URL or whatnot from your browser. At worst, I’m going to go look at some other build and you get help much later than you wanted it.
And if you’ve sent me 10 emails that morning, asking me whether I “got that last email I sent you” is pointless verbal masturbation of an excessively time-wasting variety. Either I’m going to assume and potentially give you the wrong answer, or I have to spend time trying to elicit identifying information from you. Don’t do it.
I expect crisp and exact communication out of my engineers, and it’s as simple as hearkening back to grade school and using proper nouns.
You are working on bug BUG-123, “Subject line of the bug.” You are logged into server i-123456 as ec2-user. You are working on incident I-23, “Customer X’s system is down.” You sent me an email around 10 AM with the subject line “Emergency audit response needed.” SAY THAT.
Being in chat vs email vs in person/phone/video call is not an excuse. This is not a “tone formality” issue, it is a precision and correctness issue.
Having people with different primary languages is not an excuse. Unless you’re from El-Adrel, your language has proper nouns and you know how to use them, and you know which server you’re logged into. (Well, OK, you might not, but that’s a different problem.) Slang is a language problem, other things are a language problem, but being able to clearly state the proper noun at hand is NOT a language problem. Proper nouns are the one thing that never need translating!
Detect Fuzzy Language And Make It Specific
But what if you’re not the one initiating the discussion? What if someone else is already giving you unclear direction or questions? Well, I’m going to let you in on a little secret – managers are often terrible at this. Like with Admiral Halsey, sometimes being “hard charging” gets confused with “lack of clarity.” Sometimes it seems like the higher level of management someone’s in, the more they think that the general rules of discourse and process don’t apply to them. You must not be “in tune with the business needs” if you aren’t sure exactly which thing they are vaguely referring to. Frustrating, but you just need to manage up by starting to be precise yourself. Don’t compound the possible error by making an assumption; if there’s not a proper noun in play you need to state one yourself to ensure precision.
“Customer X is angry, are you working on that bug?!?” Obviously “yes” or “no” is probably an answer that, if you’re not on the same page already, will continue to cause confusion. Reply “I am working on bug BUG-123, ‘Subject line of the bug.’ Is that the one you’re talking about?”
“Did you get that last email, I expect a response!” says the call or text. Well, just saying “yes” because I saw an email from that guy an hour ago and responded to it isn’t right – it might not be the email he means. Get clarification, for example by quoting the subject line and time of the email you did see.
Whatever happens, try to verify information you’re being given. “Reboot the server” – “To confirm, you want me to reboot server i-123436, the customer support server, right?”
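If your team lives in chat, you could even have a bot nag people about this. Here is a minimal sketch of the idea in Python, not a real tool: the list of vague nouns comes from the list earlier in this post, the identifier shapes are assumed from my examples (BUG-123, I-23, i-123456), and the function names `vague_references` and `clarifying_question` are made up for illustration. Your ticket and server naming will differ.

```python
import re
from typing import List, Optional

# Vague phrases from this post's list: "the server", "that ticket", "the build"...
# One optional word is allowed in between, so "the support server" is caught too.
VAGUE = re.compile(
    r"\b(?:the|that)\s+(?:\w+\s+)?(server|incident|problem|ticket|email|bug|build)\b",
    re.IGNORECASE,
)

# Identifier shapes assumed from the examples in this post: BUG-123, I-23, i-123456.
IDENTIFIER = re.compile(r"\b(?:[A-Z]+-\d+|i-[0-9a-f]+)\b")


def vague_references(message: str) -> List[str]:
    """Return the vague nouns in a message, or [] if it names any identifier."""
    if IDENTIFIER.search(message):
        return []
    return [match.group(1).lower() for match in VAGUE.finditer(message)]


def clarifying_question(message: str) -> Optional[str]:
    """Build a follow-up question for a vague message, or None if it is specific."""
    nouns = vague_references(message)
    if not nouns:
        return None
    return f"To confirm, which {nouns[0]} do you mean? Please give its ID or name."


if __name__ == "__main__":
    print(clarifying_question("Well, let's try rebooting the server."))
    print(clarifying_question("I restarted server i-123456 at 10:50 AM"))
```

This is deliberately crude. Any identifier anywhere in the message counts as “specific,” and something like “UTF-8” would pass as an identifier, so treat it as a nudge to name the thing, not as proof that anyone did.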
Ask fast. In the example above, Nimitz and the other commanders just let the imprecision slide, thinking “Well… I’m sure he meant the fourth task group is staying… I don’t want to look dumb by asking or make him look bad by having to clarify. It’ll all be fine.” And by the time they got nervous enough to ask him directly it was too late for him to return in time and people died as a result. Now sometimes issues are time-sensitive and the person sending the request/demand doesn’t respond to your request for clarification in a timely manner, even if you respond immediately. What do you do? Well, you have to do what you think is best/safest, but try to get things back on the right path ASAP by using all the communication paths at hand to clarify. If the request was unclear, do not let your response be unclear. “Restart the server!” “Done.” You’re both culpable at this point. “I restarted server i-123456 at 10:50 AM” puts you in the right.
Then afterwards, send the person this article so they can understand how important precision is. Luckily, in IT it is usually less likely to cause deaths than in the military, but poor communication can and has lost many dollars, hours, and jobs.