
Is It a Bug Or A Feature? Who Cares?

Today I’ve been treated to roughly the thousandth hour of my life spent debating whether something someone wants is a “bug” or a “feature.”  This is especially aggravating because in most of the contexts where it’s debated, there is no meaningful difference.

A feature, or a bug, or, God forbid, an “enhancement” or other middle-road option, is simply a difference between the product you have and the product you want. People try to declare something a “bug” because they think that should justify a faster fix, but it doesn’t and it shouldn’t. I’ve seen so many gyrations of trying to qualify something as a bug. Is it a bug because the implementation differs from the (likely quite limited and incomplete) spec or requirements presented?  Is it a bug because it doesn’t meet client expectations?

In a backlog, work items should be prioritized based on their value.  There are bugs it’s important to fix first and bugs it’s important to fix never.  There are features it’s important to have soon and features it’s important to have never.  You (and your product people) need to be able to reconcile the cost, benefit, and risk of any needed change and stack-rank everything into a single prioritized list, then work in that order regardless of the imputed “type” of work it is.  This is Lean/Agile 101.
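
To make that concrete, here’s a minimal sketch of a single stack-ranked backlog in code.  The items, fields, and value-per-effort scoring are made up purely for illustration; the only point is that the “kind” label plays no part in the ordering.

```python
from dataclasses import dataclass

@dataclass
class WorkItem:
    title: str
    kind: str      # "bug", "feature", "enhancement"... recorded, but never used for ordering
    value: float   # estimated benefit to users, in whatever unit you like
    effort: float  # estimated engineer-weeks to deliver it

def stack_rank(backlog):
    """Order the whole backlog by value per unit of effort, ignoring the label."""
    return sorted(backlog, key=lambda item: item.value / item.effort, reverse=True)

backlog = [
    WorkItem("Fix rounding error in invoices", "bug", value=3.0, effort=0.5),
    WorkItem("Add CSV export", "feature", value=8.0, effort=2.0),
    WorkItem("Fix typo on settings page", "bug", value=0.5, effort=0.25),
]

for item in stack_rank(backlog):
    print(f"{item.title} [{item.kind}]: value/effort = {item.value / item.effort:.1f}")
```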

Now, something being a bug is important from an internal point of view, because it exposes issues you may have with your problem definition, or coding, or QA processes. But from a “when do we fix it” point of view, it should have absolutely no bearing. Fixing a bug first because it’s “wrong” is some kind of confused version of punishment theory. If you’re meaningfully distinguishing between the two in prioritization, that’s just a fancy way of saying you like to throw good money after bad without analysis.

So stop wasting your life arguing and philosophizing about whether something in your backlog is a bug or an enhancement or a feature.  It’s a meaningless distinction; what matters is the value that change will convey to your users and the effort it will take to make it.

I’m not saying one shouldn’t fix bugs – no one likes a buggy product.  But you should always clearly align on doing the highest-leverage work first, and if that’s a bug, great – and if it’s not, that’s great too.  The label you hang on the work doesn’t alter its value, and you should be able to evaluate value, or else what are you even doing?

We have a process for my product team – if you want something that’s going to take more than a week of engineer time, it needs justification, and it needs to be prioritized amongst all the other things the other stakeholders want.  Is it a feature?  A bug?  A week’s worth of manual labor shepherding some manual process?  It doesn’t matter.  It’s all work consuming my high-value engineers, and we should be doing the highest-value work first.  It’s a simple principle, but one that people manage to obscure all too often.


But, You See, Your Other Customers Are Dumber Than We Are

Sorry it’s been so long between blog posts.  We’ve been caught in a month of Release Hell, where we haven’t been able to release because of supplier performance/availability problems.  And I wanted to take a minute to vent about something I’ve heard way too much from suppliers over the last nine years working at NI: “none of our other customers are having that problem.”

Because, you see, the vast majority of the time that ends up not being the case – and that’s a generous estimate.  It’s an understandable mistake on the part of the person we’re talking to.  We say, “Hey, there’s a big underlying issue with your service or software.”  The person we’re talking to hasn’t heard that before, so they naturally assume it’s “just us,” something specific to our implementation.  Because they’re big, right, and have lots of customers, and surely if there were an underlying problem they’d already know from other people, right?  And we move on, wasting lots of time digging into our use case rather than the supplier fixing their stuff.

But this rests on a fundamental misunderstanding of how problems get identified.  What has to happen for an issue to reach a state where the guy we’re talking to would have heard about it?  If it’s a blatant problem – the service is down – then other people report it.  But what if it’s subtle?  What if 10% of requests fail?  (That seems to happen a lot.)

Steps Required For You To Know About This Problem

  1. First, a customer has to detect the problem at all.  Most don’t have sufficient instrumentation to detect a sudden performance degradation, or an availability hit short of a total outage.  Usually they have synthetic monitoring at most, and synthetic monitors are a poor substitute for real traffic – plus, people usually don’t have them alert except on multiple failures, so any “33% of the time” problem will likely slip through.  No, most other customers do what the supplier is doing – relying on their customers in turn to report the problem.
  2. Second, let’s say a customer does think they see a problem.  Well, most of the time they think, “I’m sure that’s just one of those transient issues.”  If it’s a SaaS service, it must be our network, or the Internet, or whatever.  If it’s software, it must be our network, or my PC, or cosmic rays.  Half the time the person who detects the problem wanders off and probably never comes back to look for it again – if the way they detected it isn’t an obvious part of their use of the system, it’ll fly under the radar.  Or they assume the supplier knows and is surely working on it.
  3. Third, the customer has to care enough to report it.  Here’s a little thing you can try.  Count your users.  You have 100k users on your site?  20k visitors a day?  OK, take your site down.  Or break something obvious.  Hell, break your home page lead graphic.  Now wait and see how many problem reports you get.  If you’re lucky, a dozen.  Out of THOUSANDS of users.  Do the math – what percentage is that?  Well under 1%.  So even a brutally obvious problem gets reported by less than 1% of users.  Now apply that percentage to the already-diminished percentages from the previous two steps.  For many users, unless using your product is a huge part of their daily work, they’re going to go use something else, figuring someone else will take care of it eventually.
  4. Now, the user has to report the problem via a channel you are watching.  Many customers – and this happens to us too – have developed through long experience an aversion to your official support channel.  Maybe it costs money and they aren’t paying you.  Maybe they have gotten the “reboot your computer” dance from someone over VoIP in India too many times to bother with it.  Instead they report it on your forums, which you may have someone monitoring and responding to – but how many of those problems then become tickets or make their way to product development or someone important enough to hear and care?  Or they report it on Server Fault, or any number of places where they’ve learned they get better support than your support channel.  Or they email their sales rep, or a technical guy they know at your company, or accidentally write to your customer service address – all channels which may eventually lead to your support channel, but every one of those paths is lossy.
  5. So let’s say they do report it to your support channel.  Look, your front-line support sucks.  Yes, yours.  Pretty much everyone’s.  Those same layers that are designed to appropriately escalate issues each add friction.  As a customer you have to do a lot of work to get from level 1 to level 2 to level 3.  If the problem isn’t critical to you, often you’re just going to give up.  How many of your tickets are abandoned by the customer without any meaningful explanation or fix provided by you?  Some percentage of those are people who realized they were doing something wrong, but many just don’t have the time to mess with your support org any more – you’re probably one of 20 products they use or support.  They usually have to invest days of dedicated effort to prove to each level of support (starting over each time, since no one seems to read tickets) that they’re not stupid and know what they’re talking about.
  6. Let’s say the unlikely event has happened – a customer detects the problem, cares about it, comes to your support org about it, and actually makes it through front-line support.  Great.  But have YOU heard about it, Supplier Guy Talking To Me?  Maybe you’re a sales guy or an SE, in which case you have a 1% chance of having heard about it.  Maybe you’re an ops lead or escalation guy, in which case there’s a slight chance you may have heard of other tickets on this problem.  Your support system probably sucks, and searching it for incidents with the same symptoms is unlikely to work right.  It’s happened to me many times that on the sixth conference call with the same vendor about an issue, some new guy is on the phone this time and says, “Oh, yes, that’s right.  That’s the way it works; it won’t do that.”  All the other “architects” and everyone look at each other in surprise.  But not me – I got over being surprised many iterations of this ago.

So let’s look at the Percentage Chance A Big Problem With Your Product Has Been Brought To Your Attention.

%CABPWYPHBBTYA = % of users that detect the problem * % of users that don’t disregard it * % of users that care enough to report it * % of users that report it in the right channel * % of users that don’t get derailed by your support org * % of legit problems in support tickets about this same problem that you personally have seen or heard about.

This lets you calculate the Number of Customers Reporting A Real Problem With Your Product.

NOCRARPWYP = Your User Base * %CABPWYPHBBTYA

Example

Let’s run those numbers for a sample case with reasonable guesses at real-world percentages, starting with a nice big 100,000-customer user base (customers using the exact same product or service on the same platform, that is – not all your customers of everything).

NOCRARPWYP = 100,000 * (5% of users that detect it * 20% of users that don’t wander off * 20% that care enough to report it * 20% that bring it to the support channel you think is “right” * 20% brought to clear diagnosis by your support * 5% of tickets your support org lodges that you ever hear about) = 0.4 customers.  In other words, just us – and you should consider yourselves lucky to be getting it from us at that.
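
If it helps to see the arithmetic spelled out, here’s the example worked as a few lines of throwaway Python.  The function and argument names are just shorthand for the percentages above, not anything from a real system.

```python
def customers_reporting(user_base, detect, not_disregard, report,
                        right_channel, survives_support, reaches_you):
    """NOCRARPWYP: multiply each step's survival rate together, then scale by the user base."""
    pct_heard = (detect * not_disregard * report *
                 right_channel * survives_support * reaches_you)  # %CABPWYPHBBTYA
    return user_base * pct_heard

# The example's (generous) numbers
n = customers_reporting(100_000, detect=0.05, not_disregard=0.20, report=0.20,
                        right_channel=0.20, survives_support=0.20, reaches_you=0.05)
print(f"{n:.1f} customers")  # prints "0.4 customers"
```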

And frankly these percentage estimates are high.  Fill in percentages from your real operations.  Tweak them if you don’t believe me.  But the message here is that even an endemic issue with your product, even if it’s only mildly subtle, is not going to come to your attention the majority of the time – and if it does, it’ll only be from a couple of people.

So listen to us, for God’s sake!

We’ve had this happen to us time and time again – we’re pretty detail-oriented here, and we believe in instrumentation.  Our cloud server misbehaving?  Well, we bring up another in the same data center, then copies in other data centers, then point monitoring at it from various international locations, pore over the logs, and run Wireshark traces.  We rule out the variables and then spend a week trying to explain it to your support techs.  The sales team and previous support guys we’ve worked with know that we know what we’re talking about – give us some cred based on that.

In the end, I’ve stopped being surprised when we’re the ones to detect a pretty fundamental issue in someone’s software or service, even when we’re not a lead user, even when it’s a big supplier.  And to a supplier’s incredulous statement that “none of our other customers [that I have heard of] are having this problem,” I can only reply:

But, you see, your other customers are dumber than we are.


Google Chrome Hates You (Error 320)

The 1.0 release of Google Chrome has everyone abuzz.  Here at NI, loads of people are adopting it.  Shortly after it went gold, we started to hear from users that they were having problems with our internal collaboration solution, based on the Atlassian Confluence wiki product.  They’d hit a page and get a terse error, which, if you clicked on “More Details,” became the slightly more helpful, or at least Googleable, string “Error 320 (net::ERR_INVALID_RESPONSE): Unknown error.”

At first, it seemed like reloading or clearing the cache made the problem go away.  It turned out that wasn’t true – we have two load-balanced servers in a cluster serving this site.  One server worked in Chrome and the other didn’t; reloading or otherwise breaking persistence just got you the working server for a while.  But both servers worked perfectly in IE and Firefox (every version we have lying around).

So we started researching.  Both servers were as identical as we could make them.  Was it a Confluence bug?  No – we have phpBB on both servers and it showed the same behavior, so it looked like an Apache-level problem.

Sure enough, I looked in the logs.  The error didn’t generate an Apache error – it was still logged as a 200 OK response – but when I compared the log strings, the box Chrome was erroring on showed that the cookie wasn’t being passed up; that field was blank (it was populated with the cookie value on the other box, and on both boxes when hit from IE/Firefox).  Both boxes had an identically compiled Apache 2.0.61.  I diffed all the config files – except for boxname and IP, no difference.  The problem persisted for more than a week.
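
If you want to run the same comparison yourself, here’s a rough sketch of one way to do it.  It assumes your LogFormat appends the Cookie header as the last quoted field (e.g. "%{Cookie}i"); the file paths and parsing are assumptions, so adjust them to your own log layout.

```python
import sys

# Assumes the Cookie header is logged as the LAST quoted field, e.g.:
#   LogFormat "%h %l %u %t \"%r\" %>s %b \"%{User-Agent}i\" \"%{Cookie}i\"" withcookie
# Apache logs "-" for the field when the client sent no Cookie header at all.

def count_blank_cookies(path):
    total = blank = 0
    with open(path, errors="replace") as log:
        for line in log:
            parts = line.rstrip("\n").split('"')
            if len(parts) < 3:          # not a line in the expected format
                continue
            cookie = parts[-2]          # contents of the last quoted field
            total += 1
            if cookie in ("", "-"):
                blank += 1
    return total, blank

# Usage: python check_cookies.py access_log.box1 access_log.box2
for path in sys.argv[1:]:
    total, blank = count_blank_cookies(path)
    print(f"{path}: {blank} of {total} requests had no Cookie header")
```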

We did a graceful Apache restart for kicks – no effect.  Desperate, we did a full Apache stop/start – and the problem disappeared!  Not sure for how long.  If it recurs, I’ll take a packet trace and see if Chrome is just not sending the cookie, or sending it partially, or sending it fine and Apache jacking it up…  But it’s strange there would be an Apache-end problem that only Chrome would experience.
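
If it does recur, something along these lines would show whether the Cookie header actually arrives at the server – a quick sketch using scapy rather than a full Wireshark session, with the obvious caveats: it assumes plain HTTP on port 80 and naively assumes the request headers fit in a single TCP segment.

```python
from scapy.all import sniff, IP, Raw

def inspect_request(pkt):
    """Print whether each inbound HTTP request carries a Cookie header."""
    if not (pkt.haslayer(IP) and pkt.haslayer(Raw)):
        return
    payload = bytes(pkt[Raw].load)
    if not (payload.startswith(b"GET ") or payload.startswith(b"POST ")):
        return
    has_cookie = b"\r\nCookie:" in payload
    print(f"{pkt[IP].src} -> Cookie header present: {has_cookie}")

# Capture requests headed for the web server (run as root on the server itself)
sniff(filter="tcp dst port 80", prn=inspect_request, store=False)
```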

I see a number of posts out there in the wide world about this issue; people have seen this Chrome behavior on YouTube, Lycos, etc.  Mostly they think that reloading/clearing the cache fixes it, but I suspect those services also have large load-balanced clusters, and by luck of the draw they’re just getting a “good” server.

Any other server admins out there having Chrome issues, and can you confirm this?  I’d be really interested in knowing which Web servers/versions it affects.  A packet trace of a “bad” hit would probably show the root cause; I suspect Chrome is for some reason sending only part of the cookie or whatnot, choking the hit.
