Sorry it’s been so long between blog posts. We’ve been caught in a month of Release Hell, where we haven’t been able to release because of supplier performance/availability problems. And I wanted to take a minute to vent about something I have heard way too much from suppliers over the last nine years working at NI, which is that “none of our other customers are having that problem.”
Because, you see, the vast majority of the time that ends up not being the case – and that’s putting it generously. It’s an understandable mistake on the part of the person we are talking to. We say, “Hey, there’s a big underlying issue with your service or software.” The person we’re talking to hasn’t heard that before, so they naturally assume it’s “just us,” something specific to our implementation. Because they’re big, right, and have lots of customers, and surely if there were an underlying problem they’d already know about it from other people, right? And we move on, wasting lots of time digging into our use case rather than the supplier fixing their stuff.
But this rests on a fundamental misunderstanding of how problems get identified. What has to happen for an issue report to reach a state where the guy we’re talking to would have heard about it? If it’s a blatant problem – the service is down – then other people report it. But what if it’s subtle? What if it’s that 10% of requests fail? (That seems to happen a lot.)
Steps Required For You To Know About This Problem
- First, a customer has to detect the problem in the first place. Most don’t have sufficient instrumentation to detect a sudden performance degradation, or an availability hit short of a total outage. Usually they have synthetic monitoring at most, and synthetic monitors are a poor substitute for real traffic – plus, people usually don’t have them alert except on multiple consecutive failures, so any “33% of the time” problem will likely slip through (a check that alerts only after three consecutive failures has roughly a 0.33³ ≈ 4% chance of firing on any given pass). No, most other customers do what the supplier is doing – relying on their customers in turn to report the problem.
- Second, let’s say a customer does think they see a problem. Well, most of the time they think, “I’m sure that’s just one of those transient issues.” If it’s a SaaS service, it must be our network, or the Internet, or whatever. If it’s software, it must be our network, or my PC, or cosmic rays. 50% of the time the person who detects the problem wanders off and probably never comes back to look for it again – if the way they detected it isn’t an obvious part of their use of the system, it’ll fly under the radar. Or they assume the supplier knows and is surely working on it.
- Third, the customer has to care enough to report it. Here’s a little experiment you can try. Count your users. You have 100k users on your site? 20k visitors a day? OK, take your site down. Or break something obvious. Hell, break your home page lead graphic. Now wait and see how many problem reports you get. If you’re lucky, a dozen. Out of THOUSANDS of users. Do the math – what percentage is that? A dozen out of 20,000 daily visitors is about 0.06%; call it less than 1%. So even a brutally obvious problem gets reported by less than 1% of users. Now apply that percentage to the already-diminished percentages from the previous two steps. For many users, unless your product is a huge part of their daily work, they’re just going to go use something else, figuring someone else will take care of it eventually.
- Fourth, the user has to report the problem via a channel you are watching. Many customers – and this happens to us too – have developed, through long experience, an aversion to your official support channel. Maybe it costs money and they aren’t paying you. Maybe they have gotten the “reboot your computer” dance from someone over VoIP in India too many times to bother with it. Instead they report it on your forums, which you may have someone monitoring and responding to – but how many of those posts then become tickets, or make their way to product development or to someone important enough to hear about and care about them? Or they report it on Server Fault, or any number of other places where they’ve learned they get better support than from your support channel. Or they email their sales rep, or a technical guy they know at your company, or accidentally write to your customer service address – all channels that may eventually lead to your support channel, but every one of those paths is lossy.
- Fifth, let’s say they do report it to your support channel. Look, your front-line support sucks. Yes, yours. Pretty much everyone’s does. Those same layers that are designed to appropriately escalate issues each add friction. As a customer, you have to do a lot of work to get from level 1 to level 2 to level 3. If the problem isn’t critical to you, often you’re just going to give up. How many of your tickets are abandoned by the customer without any meaningful explanation or fix provided by you? Some percentage of those are people who realized they were doing something wrong, but many just don’t have the time to mess with your support org any more – yours is probably one of 20 products they use or support. They usually have to invest days of dedicated effort to prove to each level of support (starting over each time, since no one seems to read the ticket) that they’re not stupid and know what they’re talking about.
- Sixth, let’s say the unlikely thing has happened – a customer detected the problem, cared about it, came to your support org, and actually made it through front-line support. Great. But have YOU heard about it, Supplier Guy Talking To Me? Maybe you’re a sales guy or an SE, in which case you have maybe a 1% chance of having heard about it. Maybe you’re an ops lead or an escalation engineer, in which case there’s a slight chance you may have seen other tickets on this problem. Your support system probably sucks too, and searching it for incidents with the same symptoms is unlikely to work right. It’s happened to me many times that on the sixth conference call with the same vendor about an issue, some new guy is on the phone this time and says, “Oh, yes, that’s right. That’s the way it works; it won’t do that.” All the other “architects” look at each other in surprise. But not me – I got over being surprised many iterations of this ago.
So let’s look at the Percentage Chance A Big Problem With Your Product Has Been Brought To Your Attention.
%CABPWYPHBBTYA = % of users that detect the problem * % of those that don’t disregard it * % of those that care enough to report it * % of those that report it in the right channel * % of those that don’t get derailed by your support org * % of the resulting legitimate tickets about this problem that you personally have seen or heard about.
This lets you calculate the Number of Customers Reporting A Real Problem With Your Product.
NOCRARPWYP = Your User Base * %CABPWYPHBBTYA
Example
Let’s run those numbers for a sample case, with reasonable guesses at real-world percentages, starting with a nice big 100,000-customer user base (customers using the exact same product or service on the same platform, that is – not all your customers of everything).
NOCRARPWYP = 100,000 * (5% of users detect it * 20% don’t wander off * 20% care enough to report it * 20% bring it to the support channel you think is “right” * 20% get brought to a clear diagnosis by your support * 5% of the tickets your support org lodges that you ever hear about) = 0.4 customers. In other words, just us – and you should consider yourselves lucky to be getting it from us at that.
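If you’d rather not do the multiplication by hand, here’s a minimal back-of-the-envelope sketch of the same funnel in Python (the step names and percentages are just the illustrative guesses above, not measurements):

```python
import math

# The reporting funnel from the example above. Every name and number here
# is an illustrative guess, not data -- substitute your own.
user_base = 100_000  # customers on the exact same product/service and platform

funnel = {
    "detect the problem": 0.05,
    "don't write it off as transient": 0.20,
    "care enough to report it": 0.20,
    "report it via the support channel you consider 'right'": 0.20,
    "make it through front-line support to a clear diagnosis": 0.20,
    "ticket actually reaches you personally": 0.05,
}

# %CABPWYPHBBTYA: multiply the whole chain of conditional percentages.
chance_you_heard = math.prod(funnel.values())

# NOCRARPWYP: how many reporting customers that works out to.
reporting_customers = user_base * chance_you_heard

print(f"%CABPWYPHBBTYA: {chance_you_heard:.4%}")                      # 0.0004%
print(f"Customers reporting the problem: {reporting_customers:.1f}")  # 0.4
```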
And frankly, these percentage estimates are high. Fill in percentages from your real operations; tweak them if you don’t believe me. But the message here is that even an endemic issue with your product, even one that’s only mildly subtle, is not going to come to your attention the majority of the time – and when it does, it’ll only be from a couple of people.
So listen to us for God’s sake!
We’ve had this happen to us time and time again – we’re pretty detail-oriented here, and we believe in instrumentation. Our cloud server misbehaving? Well, we bring up another in the same data center, then copies in other data centers, then put monitoring against it from various international locations, and pore over the logs, and run Wireshark traces. We rule out the variables and then spend a week trying to explain it all to your support techs. The sales team and the previous support guys we’ve worked with know that we know what we’re talking about – give us some cred based on that.
In the end, I’ve stopped being surprised when we’re the ones to detect a pretty fundamental issue in someone’s software or service, even when we’re not a lead user, even when it’s a big supplier. And to a supplier’s incredulous statement that “none of our other customers [that I have heard of] are having this problem,” I can only reply:
But, you see, your other customers are dumber than we are.
100% agree on this. It is a classic. The customer is always right, even if you don’t know why. If they are reporting an issue, look into it; don’t assume they are crazy and that it’s an issue with their client. On the web, customers’ sentiment and perception of your brand may change based on how your site behaves, but people generally won’t take action. Instead, they will look for an alternative, so having good ways to capture and act on these issues is imperative.
So, at the end of the day, whose burden is the burden of proof? The customer’s or the service provider’s? And how much $$$ does the customer need to pay to shift the burden of proof back to the service provider?
Exactly! There have been many times when we’ve spent man-months and money out of pocket to “prove to the supplier” that there’s a problem. I’m bringing our expensive instrumentation tools to the table because the people who make the software or run the cloud services apparently don’t buy them themselves.
Funny story: in my old team we used Opnet Panorama for deep-dive Java application diagnostics. We were installing an early version of Vignette V7 and it was a pig – just a crushingly poorly performing piece of software. Getting no traction via support, we analyzed their app ourselves, diagnosed the problem down to the method level, and sent it in. Vignette was so impressed that they became Opnet customers themselves! And that made us happy – sure, we had a problem, but the supplier recognized that such problems are legit and took active steps to square their shit away and improve going forward. That’s all I could ask for. Sadly, many companies never seem to learn.
This continues at my new Bazaarvoice gig… We just had a developer log a Tomcat 7 bug ticket for a common use case (your servlet can load with the wrong classloader if you poll JMX stats while a Spring WebMVC app is initializing). They fixed it fast, which is great, but with a dismissive “Given that the code concerned hasn’t changed in quite some time, if this were a common use case then you wouldn’t be the first person to report it. There is no need to try and make a bug sound more important than it really is.”
The ticket only got filed after one of our best developers stayed on top of it tenaciously for a while. He writes, “Even the rest of the team was trying to come up with ways to work around the issue without actually understanding it. So, had I not been dependent on their team, the issue would probably have been resolved with ‘oh, well, it seems like there is some problem with Tomcat, or OpenJDK, or Spring – we don’t know what it is… we’ll just split up our servers since it seems to occur less in those cases.’ And they wouldn’t have had a bug report.”
Hey ASF – read the above. Just say “thanks” for well-characterized bugs; don’t fall into the classic “well, if it were bad someone would have reported it already” error.