Test Metrics Debunked – Defect Density (3/5)

This post is the third in our series on metrics in software testing. So far we’ve looked at residual risk (here), coverage (here), and this time it’s defect density.

The following is taken from the post that sparked the series…

3.  Defect density is another metric that matters. It translates into: where are the defects, and how many are there? Identify each of the parts of the solution that you care about (front end, back end, service layer), or user type, or functional area, or scenario, then make sure everyone knows these identifiers and uses them whenever a defect is raised.
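For readers unfamiliar with the conventional calculation, defect density is usually defined as defect count normalised by size, typically per thousand lines of code (KLOC) or per function point. A minimal sketch of that textbook formula (the module names and counts are hypothetical illustrations, and nothing below endorses the metric):

```python
# Conventional defect density: defects per thousand lines of code (KLOC).
# All figures here are made-up examples for illustration only.

def defect_density(defect_count, lines_of_code):
    """Return defects per KLOC."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return defect_count / (lines_of_code / 1000)

# Hypothetical per-area figures: (defects raised, lines of code)
modules = {
    "front end": (14, 12_000),
    "service layer": (5, 8_000),
    "back end": (9, 20_000),
}

for name, (defects, loc) in modules.items():
    print(f"{name}: {defect_density(defects, loc):.2f} defects/KLOC")
```

Note how the output already depends entirely on two inputs — who counted the defects and how size was measured — which is exactly where the trouble below begins.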

From a bird’s-eye view the idea of defect density is a good one, but as testers we know the devil is in the detail. Knowing where the defects are located in a particular product sounds like a powerful risk-evaluation technique. However, the value stops at that illusion: it is about as useful as asking where the developer hid all the defects.

There are just too many confounding factors that make this measure invalid. Examples include:

  1. Tester skill – a more highly skilled tester, or perhaps one that is more suited to that particular context, should find more defects.
  2. Time – the longer you test something, the more things you notice potentially wrong with it.
  3. Developer skill – the more experience a developer has with the particular coding language, interfaces, development frameworks, and domain, the fewer defects there should be (that’s a large number of variables in just one sentence!).
  4. Function definition – a larger and/or more complex function would normally have more defects.
  5. Defect definition – if you include UI or usability defects which may have more interpretive/subjective elements, then these areas could have more defects.
  6. Test risks – the more focus there is on one area or function, the more defects should be found within it.
  7. Gaming – if testers are rewarded by defect count, then they can game the system by raising many ‘like’ defects to increase their score.
  8. Time scale – a shorter metric capture time will present more highly volatile numbers.
  9. Lifecycle stage – there is increasing pressure at the end of a test cycle to close it out and not raise defects, and to deliver something and reach milestones.
  10. Marketing – metrics are seldom reported from their raw capture. Instead, managers modify them to better suit the political climate and intention. Unethical or just a fact of any contract with huge financial payoff?
  11. Addition fallacy – all defects are of different sizes. One complex defect may not hold the same significance and risk as a usability defect. Therefore counting them together would be invalid.
  12. Severity evaluation – defect density depends on how bug severity is judged. Different people at different times can judge defects differently, or not judge them as defects at all.

The above is just a short list; we’re confident you can come up with even more!

So really, the only logical way to get a close-to-true measure of defect density is to do complete, exhaustive testing. Sadly, this is impossible, and traditional testing (and so-called “best practice”) is designed to touch only a shallow set of tests based on a specification. We will never get an accurate measure, and will always fall victim to the invalidating factors above (even exhaustive testing would fall foul of many of them).

Why are managers using invalid metrics?

Just because something is invalid doesn’t necessarily mean it is of no use. You only have to look at aspects of the world economy to know that. Let’s look at how this metric can appear to have value.

Some managers use this metric for the following reasons:

  1. They were told to.
  2. It is considered “best practice”.
  3. Certification boards include it in their syllabi.
  4. To appear to be finding potential risk sources and fixing them.
  5. To be used with other metrics to spot patterns.
  6. To support a story or summary of defects and perceived quality of the product or code.
  7. To self-audit as we progress.

As you can see, there could be a number of reasons to use this metric. But most of them seem weak, almost unprofessional in some cases, and perhaps unethical in others.

The most compelling and believable reason I see from above is pattern spotting for risk reduction.

Potential risk can be evaluated by pattern spotting on reports. Metrics show patterns and outliers. Groups of patterns can be interpreted together to draw a conclusion. Then a question can be asked, a risk or issue can be announced, and hopefully treated. The result is a perceived improvement in process or product quality.
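That pattern-spotting step can be sketched in code. A minimal illustration (the area names, counts, and the factor-of-two threshold are all hypothetical assumptions, not a prescribed method — the flag is a prompt for a question, never a conclusion):

```python
from statistics import median

def flag_outliers(counts, factor=2.0):
    """Flag areas whose defect count exceeds `factor` times the median
    count across all areas. A flag is a prompt to investigate, nothing more."""
    m = median(counts.values())
    return [area for area, n in counts.items() if n > factor * m]

# Hypothetical per-feature defect counts from a sprint
counts = {"login": 4, "search": 6, "reports": 5, "new feature": 31}
print(flag_outliers(counts))  # → ['new feature']
```

The median-based threshold is deliberately crude: it only surfaces what looks unusual, and says nothing about why — which is precisely the point of the example that follows.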

An example:

An abnormally high number of defects were found in a large new feature that was developed in the shorter sprint before Christmas by the new developer, Walter. Walter is a graduate programmer who is working on his first job. The tester was previously in the development team (which he loved) but was moved to testing as he was replaced by Walter.

How would you deal with this situation if you were the test manager (TM)?

  • Feature progress and completion is slow due to Walter having to fix the defects and redeploy to test.
  • The developer and tester do not seem to be getting along at the daily stand ups.
  • Other developers are taking early leave for the break.
  • The requirements for the feature keep having changes made to them.
  • There is an important release coming up that will use this functionality to reach a milestone and subsequent contract payment.

Like most people, I doubt your first response would be to send the report up to the manager. I’m sure you would first seek to address the possible issues that appear to be at play, for example:

  1. Feature is large, complex, and will have more defects.
  2. Development time is shorter due to Christmas, leading to more mistakes.
  3. There is resentment between the tester and Walter.
  4. Walter is new and still learning about the organisation, and the product.

So now you have a bunch of things to get on with and resolve for the sake of the milestone payment and your company’s success.

But now how would you report this information to the manager?

At this point fear can take over from rationality. If a defect report shows that bugs are not evenly distributed across functions, the TM will ask the tester and developer a question, and will use their answers to inform his own manager when he is asked the same question in turn. This is also called story-telling.

It’s a charade, as with most metrics. Numbers are thrown up at a manager, who questions them and can then explain the patterns to others. It’s also a big waste of time.

It’s marketing and salesmanship. Reports are discoverable by audit and are scrutinised by all levels of management. They are instruments of a controlled, centralised message to stakeholders. That is why they are manipulated.

Often reports are altered to remove outliers, smoothing the patterns towards what the TM and other managers believe is a more realistic situation.

So how would you report this?

  1. Remove the outlier as you are managing it?
  2. Include the outlier with a commentary of how you are managing it?
  3. Standardise the outlier in line with the others to smooth the results and avoid questions, since you are resolving the issues anyway?

Hopefully, you are with me and see little value in even reporting this metric. Not all metrics are useful, and this one is not worth management’s scrutiny or your time.

Here is the point – in practice, what you are actually doing is relying on metrics to show you are attempting to monitor, control, and evaluate the test process. It’s auditable, and looks great on paper, right?

The problem is that as soon as you scratch beneath the surface of these metrics, the whole house comes tumbling down. You discover they are invalid because of too many confounding factors, each of which diminishes the power of the measure to an unknown extent. The metric is untrustworthy, unprofessional, and a waste of your time. Using metrics like this makes testing slower, more costly, and worse off.

And what about product quality? What are we left with?

Product quality has been replaced with a poor surrogate measure that confuses management and the customer just enough to make them believe a high-quality test process is being run.

Product quality has been absorbed into a process, and the process is being measured instead. It becomes more about reading and questioning the report than asking how the product is really doing.

But I have been told this is “best practice”. Is there another way?

Yes, there are ways you can safeguard your test reporting:

  1. Do not report defect density; it’s invalid.
  2. Report the actual top defects by describing them.
  3. Let the managers manage their team and the issues.
  4. Do not report team issues through invalid metrics.

So next time you are presented with a defect density metric, or asked to produce one… what will you do?

This is the 3rd article in a series of five on software testing metrics and was written by Richard Robinson & David Greenlees.

2 thoughts on “Test Metrics Debunked – Defect Density (3/5)”

  1. I’ve seen defect density used on several of the larger projects I’ve been on. I’ve never seen anyone use that as the only metric nor have I really seen anyone use it for any other reason than as an indicator. If defect density indicates a skewed report, it simply acts as a red flag. This flag would typically trigger us to look into the other considerations you listed above. Considering that the calculation of defect density doesn’t cost anything as most modern test management tools offer this freely, why would you not look at it? It’s information, if even a small piece so why would you ignore any information available to you as a test manager?

    I’ve never seen anyone rewarded or punished because of this number. If that is the case, I’d say the issue resides with the test manager and not necessarily with the metric. Seems like people are looking at individual metrics and saying because these numbers cannot indicate the overall quality of a project/product/system they’re meaningless. That’s like saying you cannot base the ability of a baseball player solely on his batting average so calculating a batting average is useless, right?

  2. I agree completely with John. DDD is a result. It shows the number of defects normalized by size of delivery (FPs or KLOCs). What you point out as a rationale for not putting any value in the metric is precisely its value. If DDD is unusually high, it’s a red flag that says it’s time to perform a root cause analysis to see WHY it’s high. That’s what points you to developer/tester competency as a contributing cause, or as you mention developer-tester conflict. Or any number of other things. Without the metric to guide you, you might never know that there are underlying issues that need to be addressed.
