Software Test & Performance Collaborative

Community, Resources & Knowledge Sharing for Test & QA Professionals

Matt Heusser’s Blog

Testing at the Edge of Chaos

Naïve Metrics

By Matthew Heusser on January 20, 2010 | 5 comments.

Let’s say you don’t know anything about software testing. It makes sense that, to figure out status, you should be able to reduce the work to a set of basic primitives – maybe test cases. Then you can count the total number of test cases, the number to go, do some simple arithmetic, and come up with the percentage complete of the project, right?

A little more arithmetic, and we can come up with a predicted end-date of the project.
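To make that arithmetic concrete, here is a minimal Python sketch. The function names and the sample numbers are invented for illustration, and the sketch bakes in the naive assumptions that every test case takes the same effort and that the pace so far will hold.

```python
import math
from datetime import date, timedelta


def percent_complete(total_cases: int, remaining_cases: int) -> float:
    """Share of test cases already executed, as a percentage."""
    if total_cases <= 0:
        raise ValueError("total_cases must be positive")
    return 100.0 * (total_cases - remaining_cases) / total_cases


def predicted_end_date(start: date, today: date,
                       total_cases: int, remaining_cases: int) -> date:
    """Extrapolate the average pace so far to the remaining test cases."""
    done = total_cases - remaining_cases
    elapsed_days = (today - start).days
    if done <= 0 or elapsed_days <= 0:
        raise ValueError("no progress to extrapolate from")
    cases_per_day = done / elapsed_days
    days_left = math.ceil(remaining_cases / cases_per_day)
    return today + timedelta(days=days_left)


# Hypothetical project: 200 test cases, 50 still to run, 15 days in.
print(percent_complete(200, 50))  # 75.0
print(predicted_end_date(date(2010, 1, 4), date(2010, 1, 19), 200, 50))  # 2010-01-24
```

Notice that nothing in the calculation knows whether the remaining fifty test cases are the easy ones or the hard ones.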

Likewise, if we have another primitive – say a bug report – and a field in the bug tracker to indicate whether the bug was discovered pre- or post-release, we can calculate our defect detection percentage (DDP).
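As commonly defined, DDP is the number of defects found before release divided by all defects found, before and after. A minimal sketch, with a made-up function name and made-up counts:

```python
def defect_detection_percentage(found_before_release: int,
                                found_after_release: int) -> float:
    """Pre-release defects as a percentage of all known defects."""
    total = found_before_release + found_after_release
    if total == 0:
        raise ValueError("no defects recorded, DDP is undefined")
    return 100.0 * found_before_release / total


# Hypothetical counts: 99 bugs logged by QA, 1 reported from the field.
print(defect_detection_percentage(99, 1))  # 99.0
```

Every bug report counts the same here, whatever it describes.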

With just a few of these “hard” measurements in place, we can manage our test projects – even whole development projects – quantitatively, accomplishing the goals of CMMI level four, possibly five.

No, really, you can. I interviewed the CMMI product manager, J. Michael Phillips, in December of 2009, and he said so. The core metrics he recommended were something like time, features, productivity (source lines of code), and defects. If you want advice on how to perform this kind of best practice, you only need look to the latest issue of Professional Tester Magazine.

What follows is a true story.

The Great CD Debacle
Not too many years ago I was hired into a remote office of a Fortune 200 company we’ll call BigCo. BigCo has many businesses, but the purchasing organization created a sort of yellow pages for construction materials – big, thick, multi-volume books that had every product you could think of, and how to buy them. BigCo bought a much smaller product company, which I will call TinyDiv, in order to get technology to create and compile electronic catalogs onto a CD. The general idea was to ship compact disks instead of its multi-volume (thick) sales catalogs. (Yes, folks, this was before the web was popular.)

The project was quantitatively managed, at least in that the director of software development had goals for budget and schedule.

Then Microsoft did something funny; they released Windows 98. (It might have been NT; it was over a decade ago.) Now the outsourced testing company was happy to put Windows 98 on the list of operating systems to be tested – but this destroyed the schedule and the budget, costing the executives a few bonuses.

The Director of Development took a pen, marked a line through Windows 98, and said “I will have our team QA that operating system.”

Only, he didn’t tell anyone that they, specifically, were responsible for testing it.

The CD did not just fail to install under the newest Windows.

It did not just cause a blue screen.

It corrupted the hard drive such that the machine was unable to boot.

The result? The software was recalled. Within a year, corporate HQ moved the work that the CD-product team did to other offices. By the time I was hired, all that remained was the original TinyDiv, still working on that original product that worked and made money.

In my first week, two of my teammates took me upstairs, to show me an entire floor of the Comerica building in downtown Grand Rapids, empty. “The CD people used to work here,” they said. “If you see anything you want, take it. The lease is up in a month and it’s going to the scrap yard.”

How do I know about the project? Because I overheard people talking to the lawyers from corporate who wanted to sue the QA company, and I heard the developers saying things the lawyers did not want to hear.

But gosh, what great defect detection percentage! Look at how many bugs QA found versus how many were found in the field! We only found one serious bug in the field! (Well, we might have found more, but we had to recall the product.)

Naïve Metrics

Hopefully, it’s obvious in the story above that DDP just doesn’t make much sense; in order to quantitatively measure, you’ve got to be comparing apples to apples. At the very least, you’d have to factor in some kind of defect severity, possibly including how often we expect the users to encounter the defect. And what these really are, are guesses that we’ll plug into a formula. Even with severity, it’s unlikely that a “sev one” is exactly five times as bad as a “sev five”, or that five “sev fives” equal a “sev one” – but a simplistic formula will come to that conclusion. And, just as obviously, in the example from Professional Tester Magazine (on page seven), if more bug reports are good, then we’re likely to get more bug reports. Some of these could have been handled by talking directly to the developers, or we may get a different bug report for every typo out of a list of 100 on the help screen. And YES, I once knew a developer who was evaluated by the number of change controls he put in, so in order to move fifteen files that were essentially one change, he put in – you guessed it – fifteen change control requests.
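Here is what such a simplistic severity formula looks like as a small Python sketch. The linear weights (five points for a “sev one” down to one point for a “sev five”) and the function name are assumptions made up for illustration; they are not taken from any particular tool or from the magazine article.

```python
from typing import Mapping

# Hypothetical linear weights: severity 1 is worst, severity 5 is cosmetic.
SEVERITY_WEIGHTS = {1: 5, 2: 4, 3: 3, 4: 2, 5: 1}


def weighted_defect_score(counts_by_severity: Mapping[int, int]) -> int:
    """Sum of (weight x count) over every severity level present."""
    return sum(SEVERITY_WEIGHTS[severity] * count
               for severity, count in counts_by_severity.items())


one_disk_corruptor = {1: 1}     # a single severity-one defect
five_typos = {5: 5}             # five severity-five defects
fifteen_typo_reports = {5: 15}  # one report per typo on the help screen

print(weighted_defect_score(one_disk_corruptor))    # 5
print(weighted_defect_score(five_typos))            # 5
print(weighted_defect_score(fifteen_typo_reports))  # 15
```

By this formula, five typos are exactly as bad as a bug that makes the machine unbootable, and fifteen typo reports are three times worse.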

But where does this desire for metrics come from?

1) Lack of trust. The manager says “I need a week”, and his stakeholders do not believe him. So he pulls out metrics. Or perhaps he needs three weeks and his stakeholders want it in one. Or perhaps he thinks the test team is doing well and wants bigger raises: “Look, we found more bugs this year.” Or, just perhaps, the outsourcing company wants to attempt to prove its value.

2) Desire for control. Simplistic measures promise to make management easy. After all, all a manager should have to do is look at a spreadsheet every once in a while, and if he sees green, everything is fine. If there’s yellow or red, call the direct reports, demand status, ask what they are doing about it, and check back in a few days. The problem is, the measures can’t deliver on that promise. The organization might be spiraling out of control, but the report is all green. (Anyone who worked at a big organization in the 1990s, consider my servant the Gantt chart. That was all about creating the illusion of control now, wasn’t it?)

3) Lack of understanding. Have you ever wondered what the purpose of grades is in school? The teacher doesn’t need them; he knows how well the students are doing. The grade “B+” is actually a lossy abstraction – it lumps the student who has mastered the material but never does homework in with the one who tries hard but always misses the harder problems. It turns out that grades are a benefit to the parents, administration, and college entrance people, who aren’t in the classroom and need some advice on what is going on. In our home school, we don’t give out grades to our students; we actually know how well they are doing.

A few alternatives I have tried and had success with:

1) If you don’t know what’s going on in your organization – find out – by actually being involved. The Scrum and XP folks suggest that customers attend the daily stand-up meeting, or that the customer be embedded in the team, both of which I have had success with. Another option is management by walking around.

2) Note my concern is the use of naive metrics as a sum total strategy to figure out “how we are doing”. Thus, you look at your metrics for the week, and if things are green, you breathe a sigh of relief and go play golf – or, if they aren’t, you call your direct reports, yell and scream, then check back again next week. I do not have a problem with digging into the numbers as an investigatory process, as part of a balanced breakfast.

3) Likewise, I expect that individuals are using metrics every day, in order to figure out dynamics and make plans. These numbers are part of a one-time problem solving strategy, often thrown away after the fact. DDP used one time – say, comparing this year’s numbers to last year’s – as part of a balanced breakfast – might not be that terrible. It’s when it is used repeatedly that the act of observing tends to skew behavior, and we begin to see dysfunction.

4) Earlier I mentioned dysfunction. Keep in mind, you’ll tend to get exactly what you measure. If you measure test cases, you’ll likely get lots of test cases, and even some productivity – at first. But, eventually, the team will realize that test cases and productivity are two different things, and find the shortest possible way to get you the test cases. As the team exploits this difference, each individual test case will likely provide less value – thus, the assumption that “counting test cases” is roughly equivalent to productivity becomes less and less valid over time. There’s a gentleman named Robert Austin, who earned his PhD at Carnegie Mellon, who studied dysfunction. He concluded that since projects are multi-dimensional, any single metric (even a handful) is likely to leave things un-measured, and teams are likely to steal from those “peters” to pay the “paul” that is measured. (The classic example is that, if you are measured by time, features, AND defects, you take on technical debt.) Austin’s book, Measuring and Managing Performance in Organizations, is a classic in its field. His recommendation is to pick a small percentage of projects and do a thorough after-action review, or retrospective, that takes everything into account, and try to take home real lessons from that.

5) When evaluating quality, consider qualitative metrics and rules of thumb, as opposed to hard numbers. This can be as simple as a thumbs up or down “should we ship” crowd-source decision by the team – or at least using that as input for a decision maker. For a detailed analysis of software engineering metrics, consider the classic paper by Kaner and Bond, which ultimately recommends qualitative evaluation.

Putting it together
That leaves us with a small handful of tools – get actively involved with the project, manage by walking around, conduct detailed retrospectives, or use metrics as part of a balanced breakfast to inform, not to convince, evaluate, or control. But what if you really want to use metrics for lightweight control, with integrity – say you have an organizational mandate?

Well, first, you could get a better job. No, really. I’m serious. I’m reluctant to offer advice on how to make a bad idea work. That said …

I do think organizations can use metrics in a mature and sophisticated way. To do that, I would introduce the metrics in context, as part of a story, including the limits and weaknesses in the approach. For example, I would explain when a particular idea should not work, how to figure out why it worked in this instance, and the complex dance of experience and research we used to validate our opinion.

When I see these naive metrics, I smell a theory and ideology that has never been tested; yet another Pied Piper of Hamelin, telling people what they want to hear.

We can, and should, demand better.

The tyranny of the deadline

By Matthew Heusser on October 6, 2009 | 8 comments.

Hang around software development or testing long enough, read enough blogs and books, and sooner or later, you are likely to hear the great crisis talk. It typically goes something like this:

(A) Software Development is in a state of crisis!
(B) (Quote the Chaos Report)
(C) Something must be done!
(D) Therefore – whatever the presenter is selling

The presenter could be selling anything – from process improvement, to ‘getting the requirements right’, to ‘prevention’, or ‘inspection’ – it doesn’t really matter.

The point is, software development is in a state of crisis and we must fix it!

Wait a minute. Let’s hold our horses. What is this Chaos report, exactly?

It turns out that a business think-tank called the Standish Group conducts a series of surveys to determine the overall state of the health of IT; the 1995 report is available on-line for free. At the very least, it is an interesting read.

Standish’s report (the “Chaos Report”) takes IT projects and divides them into three buckets:

Three Project Outcomes
Resolution Type 1, or project success: The project is completed on-time and on-budget, with all features and functions as initially specified.

Resolution Type 2, or project challenged: The project is completed and operational but over-budget, over the time estimate, and offers fewer features and functions than originally specified.

Resolution Type 3, or project impaired: The project is canceled at some point during the development cycle.
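Read as a decision procedure, the three definitions boil down to a few lines. The sketch below is my own reading, not Standish’s code: the `Project` fields and the function name are made up for illustration, and it assumes that a finished project missing on any one of time, budget, or features lands in “challenged”.

```python
from dataclasses import dataclass


@dataclass
class Project:
    # Hypothetical fields; the report does not define a data format.
    canceled: bool
    on_time: bool
    on_budget: bool
    all_features: bool


def chaos_resolution(project: Project) -> str:
    """Sort a project into one of the three resolution types."""
    if project.canceled:
        return "impaired"    # Resolution Type 3
    if project.on_time and project.on_budget and project.all_features:
        return "success"     # Resolution Type 1
    return "challenged"      # Resolution Type 2 (assumed: any miss counts)


# A project that shipped every feature on budget, but late.
late_but_complete = Project(canceled=False, on_time=False,
                            on_budget=True, all_features=True)
print(chaos_resolution(late_but_complete))  # challenged
```

Nothing in the inputs records whether anyone was glad the software got built.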

Three buckets. Really?
Imagine, for a moment, we used the same criteria to evaluate movies. Did you know that the movie Titanic had its original release date of July 1997 pushed back to December? Under these definitions it would be “challenged”, even though it was the single highest-grossing film in the history of cinema.

Forget Titanic; Star Wars was six months late! That’s right kids; the single largest iconic geek movie of the 20th century – the one that launched five sequels, countless action figures, lunch boxes, breakfast cereals, toy light sabres that make whooshing sounds and its own version of “Monopoly” – should be chalked up as a failure. (It did run $3 million over budget, after all, costing $11 million instead of $8 million.)

Oh, bull pucky.

Even worse than late or over-budget, the Chaos report considers projects that were “canceled early” a failure, when in my career I have worked on projects that delivered 90% of the value in 50% of the allotted time. Likewise, some projects should be canceled if the competitive environment changes and the project becomes irrelevant.

What all this means is the report doesn’t allow for the possibility of change – positive or negative. Instead of measuring the execution of the software team, it actually measures the ability of the estimators at the beginning of the project; not only on their estimation, but also on their ability to predict the future (changes to the environment) that will happen before the project finishes.

Pictured this way, the ‘shocking’ and ‘abysmal’ ‘failure’ rates actually mean that, as an industry, we aren’t so good at predicting the future.

I don’t think this is all that shocking, nor would I call it a crisis.

Where did we go wrong?
The biggest problem I see is that the Chaos report is only looking at a single dimension – “did the project hit the deadline?” – instead of asking a more reasonable question, like, for example, “did the delivered software have a good return on investment?”

Or, perhaps, one single question just isn’t enough. It’s like asking someone “what’s your weight?”; it’s hard to draw any conclusions without knowing their height. Even then, a weightlifter can come off as “overweight”; you need body fat index to really tell. And even that might not tell you if the person is healthy; you’d need to know cholesterol … you get the point. Drawing conclusions with a single dimension can be dangerous.

How can we do it one better? Well, the usual. Develop in small chunks. Adapt to change. Plan to re-plan. Think while we do. Limit our work in progress, so that if a new opportunity arises, we have to throw away less work. Plan phase gates that release working software – not documents that may or may not be useful.

And, as testers, try not to be fooled by happy-talk – or even crisis-talk.

Meanwhile if you want to build a new framework to analyze projects in our industry – one with depth, that, to paraphrase Gandhi, is the change we want to see in the world – please, drop me a line; I’d be interested.

Philosophy – I

By Matthew Heusser on January 15, 2008 | 2 comments.

I am giving a talk in April on “Evolution, Revolution, and Test Automation.”

Here’s the abstract:

How do we know what works in software testing? And how do we prove it?

In this class, you’ll hear a brief discussion of the evolution of scientific knowledge, which leads into the evolution of software testing and test automation. We’ll discuss the different ways to evaluate statements about software testing, and then apply those to common testing challenges. Starting with the “test triangle analogy,” Matt will discuss how the concept of testing has changed over the years, moving quickly from system testing to unit, acceptance, performance, and even mock-based testing, the pros and cons of each, and how to identify them.

Finally, Matt will make some predictions about where testing is going. Not magical, visionary predictions, but instead practical suggestions to take your organization to the next level.

You may not agree with what Matt has to say, but he offers three guarantees:
• You will leave the room thinking
• You will be armed with tangible techniques to evaluate the myriad of “best practices”
• You will not be bored

That’s right folks – I’m going to cover the history of scientific thought and apply it to software testing, all in one hour!

… Or, then again, maybe not. It would probably be more accurate to say that I will “Try to hit the high notes.”

Which brings me to an interesting problem.

The talk involves a good amount of discussion of the nature of knowledge. To do that, I’ve got to cover a little bit of philosophy.

After the last time I gave the talk, someone actually came up to me afterward and said “Matt, I really appreciate your point about the Hegelian synthesis of thesis and antithesis, but if you are going to have academics in your audience, you’ve got to use the correct terminology.”

I have no idea what that means.

So I went home to my wife, who has a degree in Philosophy, and asked her about it. She replied something like this:

“Matt, there are two kinds of people in your audience. Academics who care about terminology, and do-ers who care about getting things done. You cannot please both. Which group is more common among your attendees?”

When I told her the crowd would be do-ers, she replied “Well, that’s easy. You don’t have to sound smart to impress do-ers – you just have to be smart and get things done.”

Come to think of it, that’s just good advice in general.


If you’ve got a snowball’s chance of making it out to ST&PCon, drop me a line. If not, but you’ve attended in the past, there is a little website with forums and stuff where you can participate anyway …

Models – II

By Matthew Heusser on August 9, 2007 | 0 comments.

(These aren’t the Requirements Models you are looking for …)

Sometimes, the models we use are predictive (like yesterday) or are analogies. For example, e-mail is a pretty straightforward analogy to snail mail; you have a sender, receiver, addresses, transfer mechanism, routers, and so on. In many cases, the analogy (or prediction) is helpful; you know what to expect and how to behave.

Then again, like the famous outside-the-box problem, analogies can limit our thinking. For example, Karl Wiegers was just interviewed for Dr Dobb’s Magazine on Software Requirements. The interview is interesting. Karl’s model is requirements as a contract – customers tell the technicians what they want, then we haggle over the price (schedule), sign the contract, and develop.

Except in software, the contract is not static – it’s fluid. Oh, and English is a vague and ambiguous language, so when we communicate solely in written words (like a typical contract), we are losing information. Plus, with this model, if the developers have a better idea – “gosh, if we knew zip code, we could call out to the MapQuest web service and …” – they do not have a mechanism to contribute the idea to the software. (And don’t even think about testers contributing ideas!)

The contract model can work for certain kinds of software – for example, avionics software that has to go on a new helicopter for the Department of Defense. Given an incremental improvement to a well-defined, state-based system, you can both state requirements precisely and limit change. That is not software development; it is software engineering.

… And most of my projects just aren’t like that. On most of my projects, the contract model is a joke. Instead, I prefer the defense-lawyer model, which works like this:

The customer has a problem. He does not understand the technical jargon and doesn’t have the education or tools to solve the problem. So he hires a lawyer. The lawyer makes the plan, and the customer can over-rule the lawyer on key decisions (how to plead), or, if he is not happy, fire the lawyer. Ideally, the two work together in collaboration, admitting that if they do not, the project will have much less chance of success.

For commercial software with at least hundreds, if not tens of thousands of customers, you don’t even have a customer to write the “contract.” Instead, you have to invent roles like business analyst or product owner. That’s fine, but sometimes the limits imposed by the model aren’t real – they are simply part of the model.

Now, on a real contract, all elements of the contract can be negotiated – not just price and schedule. By not mentioning that earlier, I’ve limited the model by omission, making it worse.

We humans do this all the time. As much as this is a problem to avoid, it’s also a huge opportunity. When your competitors get trapped in an analogy, *you* can break out of the analogy.

This may not sound like much, but think of what it could mean for version control software, email, file systems, or any other piece of software trapped in a metaphor. Look at what Google did with search, using back-links instead of a Dewey Decimal system metaphor. Think of what they did with email with “search, don’t sort”. Think of what developers are doing with AJAX and mashups – taking two different web services and combining them in interesting ways – or what testers are doing with random, automated testing instead of a traditional list-checklist-verify model.

Sometimes, people get it wrong. Sun spent years trying to change the model from PC-centric back to the mainframe using the term “thin client.” (Eventually, that happened anyway, due to the web browser, not Sun.)

As I mentioned above, models can have a lot of value. But, often models can be like the stock market: By the time everybody is buying, it’s probably time to sell and find something else …

Real Science

By Matthew Heusser on August 6, 2007 | 2 comments.

Does anybody else remember the Middle School Science Fair? The idea is to apply the scientific method to something practical, then show off all the budding scientists.

If I remember the scientific method as I was taught it, it was essentially this:

- Collect some documented observations of your subject
- Make a hypothesis – a theory that should predict behavior (Make sure you pick the right one!)
- Conduct an experiment – Conduct it properly, so that it is easily repeatable.
- Evaluate the results of the experiment – do they support your hypothesis?
- Draw Conclusions

When done, you should have some artifacts that document the experiment. This is usually a big poster-board with a page for each step in the method, plus a cutesy visual display. We were then evaluated on our displays, documentation, and how valid our experiments were.

… and I suspect that the entire approach is wrong. Wrong. Wrong.

What if we were not taught to do one experiment meticulously and correctly, but instead taught how to do lots of experiments – dozens, hundreds – very quickly and very sloppily? Then, if we find something interesting, we can come back and do the experiment again. If the results are interesting, then play the documentation game.

Done this way, our students actually have a shot at real innovation, instead of pseudo-science. Thus we can judge competitors not by the quality of the documentation, but by whether or not they actually found anything interesting.

This test early, test often, test always idea of science is not new; actually, it’s the basic idea that Thomas Edison used to bring the light bulb to market. In fact, when I looked up the Scientific Method in Wikipedia this morning, it had a large section on “Evaluation and Iteration” and “Testing and Improvement.” It turns out, that’s the real scientific method after all.

The problem is that we are so steeped in this culture that we confuse the accidental things like documentation with the essence – discovery and experiment.

We have this problem in software testing.

We call it the test case.