Matt Heusser’s Blog
Testing at the Edge of Chaos
Naïve Metrics
Let’s say you don’t know anything about software testing. It makes sense that, to figure out status, you should be able to reduce the work to a set of basic primitives – maybe test cases. Then you can count the total number of test cases, the number to go, do some simple arithmetic, and come up with the percentage complete of the project, right?
A little more arithmetic, and we can come up with a predicted end-date of the project.
Likewise, if we have another primitive – say a bug report, and a field in the bug tracker to indicate whether the bug was discovered pre- or post-release – we can calculate our defect detection percentage, or DDP.
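The arithmetic described so far fits in a few lines of Python. This is only a sketch: the function names and the sample numbers are made up for illustration and do not come from any real project.

```python
from datetime import date, timedelta


def percent_complete(total_cases: int, remaining_cases: int) -> float:
    """Share of test cases already executed, as a percentage."""
    return 100.0 * (total_cases - remaining_cases) / total_cases


def predicted_end_date(start: date, today: date,
                       total_cases: int, remaining_cases: int) -> date:
    """Naive linear extrapolation: assume the average pace so far continues."""
    done = total_cases - remaining_cases
    days_elapsed = (today - start).days
    cases_per_day = done / days_elapsed  # fails if nothing is done yet
    return today + timedelta(days=remaining_cases / cases_per_day)


def defect_detection_percentage(found_pre_release: int,
                                found_post_release: int) -> float:
    """DDP: bugs found before release as a percentage of all known bugs."""
    return 100.0 * found_pre_release / (found_pre_release + found_post_release)


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
print(percent_complete(200, 50))
print(predicted_end_date(date(2010, 1, 4), date(2010, 1, 19), 200, 50))
print(defect_detection_percentage(95, 5))
```

Notice what the sketch quietly assumes: every test case is the same size, the pace never changes, and every bug counts the same.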
With just a few of these “hard” measurements in place, we can manage our test – even whole development projects – quantitatively, accomplishing the goals of CMMI level four, possibly five.
No, really, you can. I interviewed the CMMI product manager, J. Michael Phillips, in December of 2009, and he said so. The core metrics he recommended were something like time, features, productivity (source lines of code), and defects. If you want advice on how to perform this kind of best practice, you only need look to the latest issue of Professional Tester Magazine.
What follows is a true story.
The Great CD Debacle
Not too many years ago I was hired into a remote office of a Fortune 200 company we’ll call BigCo. BigCo has many businesses, but the purchasing organization created a sort of yellow pages for construction materials – big, thick, multi-volume books that had every product you could think of, and how to buy them. BigCo bought a much smaller product company, which I will call TinyDiv, in order to get technology to create and compile electronic catalogs onto a CD. The general idea was to ship compact disks instead of its multi-volume (thick) sales catalogs. (Yes, folks, this was before the web was popular.)
The project was quantitatively managed, at least in that the director of software development had goals for budget and schedule.
Then Microsoft did something funny; they released Windows 98. (It might have been NT; it was over a decade ago.) Now the outsourced testing company was happy to put Windows 98 on the list of operating systems to be tested – but this destroyed the schedule and the budget, costing the executives a few bonuses.
The Director of Development took a pen, marked a line through Windows 98, and said “I will have our team QA that operating system.”
Only, he didn’t tell anyone that they, specifically, were responsible for testing it.
The CD did not just fail to install under the newest Windows.
It did not just cause a blue screen.
It corrupted the hard drive such that the machine was unable to boot.
The result? The software was recalled. Within a year, corporate HQ moved the work that the CD-product team did to other offices. By the time I was hired, all that remained was the original TinyDiv, still working on that original product that worked and made money.
In my first week, two of my teammates took me upstairs, to show me an entire floor of the Comerica building in downtown Grand Rapids, empty. “The CD people used to work here”, they said, “If you see anything you want, take it. The lease is up in a month and it’s going to the scrap yard.”
How do I know about the project? Because I overheard people talking to the lawyers from corporate who wanted to sue the QA company, and I heard the developers saying things the lawyers did not want to hear.
But gosh, what great defect detection percentage! Look at how many bugs QA found versus how many were found in the field! We only found one serious bug in the field! (Well, we might have found more, but we had to recall the product.)
Naïve Metrics
Hopefully, it’s obvious in the story above that DDP just doesn’t make much sense; in order to quantitatively measure, you’ve got to be comparing apples to apples. At the very least, you’d have to factor in some kind of defect severity, possibly including how often we expect the users to encounter the defect. And what these really are, are guesses that we’ll plug into a formula. Even with severity, it’s unlikely that a “sev one” is exactly five times as bad as a “sev five”, or that five “sev fives” equal a “sev one” – but a simplistic formula will come to that conclusion. And, just as obviously, in the example from Professional Tester Magazine (on page seven), if more bug reports are good, then we’re likely to get more bug reports. Some of these could have been handled by talking directly to the developers, or we may get a different bug report for every typo out of a list of 100 on the help screen. And YES, I once knew a developer who was evaluated by the number of change controls he put in, so in order to move fifteen files that were essentially one change, he put in – you guessed it – fifteen change control requests.
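To see the severity problem concretely, here is a minimal sketch of a severity-weighted DDP. The linear weights (a sev one counts 5, a sev five counts 1) are an assumption for illustration, exactly the kind of guess described above, and the bug counts are invented.

```python
# Hypothetical linear weights: a guess plugged into a formula.
SEVERITY_WEIGHT = {1: 5, 2: 4, 3: 3, 4: 2, 5: 1}


def weighted_score(severities):
    """Sum the weights of a list of bug severities (1 = worst, 5 = mildest)."""
    return sum(SEVERITY_WEIGHT[s] for s in severities)


def weighted_ddp(pre_release, post_release):
    """DDP computed on severity weights instead of raw bug counts."""
    pre = weighted_score(pre_release)
    post = weighted_score(post_release)
    return 100.0 * pre / (pre + post)


# Under this formula, five typo-level bugs score the same as one show-stopper.
print(weighted_score([5, 5, 5, 5, 5]) == weighted_score([1]))

# Invented release: 40 sev-five bugs caught in test, one sev-one bug escaped.
print(weighted_ddp([5] * 40, [1]))
```

With these made-up numbers the weighted DDP comes out at roughly 89 percent, which still looks healthy even if the one escaped bug is the kind that leaves a machine unable to boot.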
But where does this desire for metrics come from?
1) Lack of trust. The Manager says “I need a week”, and his stakeholders do not believe him. So he pulls out metrics. Or perhaps he needs three weeks and his stakeholders want it in one. Or perhaps he thinks the test team is doing well and wants bigger raises: “Look, we found more bugs this year.” Or, just perhaps, the outsourcing company wants to attempt to prove its value.
2) Desire for control. Simplistic measures promise to make management easy. After all, all a manager should have to do is look at a spreadsheet every once in a while, and if he sees green, everything is fine. If there’s yellow or red, call the direct reports, demand status, ask what they are doing about it, and check back in a few days. The problem is, the measures can’t deliver. The organization might be spiraling out of control, but the report is all green. (Anyone who worked at a big organization in the 1990s, consider my servant the Gantt chart. That was all about creating the illusion of control now, wasn’t it?)
3) Lack of understanding. Have you ever wondered what the purpose of grades is in school? The teacher doesn’t need them; he knows how well the students are doing. The grade “B+” is actually a lossy abstraction – it lumps the student who has mastered the material but never does homework in with the one who tries hard but always misses the harder problems. It turns out that grades are a benefit to the parents, administration, and college entrance people, who aren’t in the classroom and need some advice on what is going on. In our home school, we don’t give out grades to our students; we actually know how well they are doing.
A few alternatives I have tried and had success with:
1) If you don’t know what’s going on in your organization – find out – by actually being involved. The Scrum and XP folks suggest that customers attend the daily stand-up meeting, or that the customer be embedded in the team, both of which I have had success with. Another option is management by walking around.
2) Note my concern is the use of naive metrics as a sum total strategy to figure out “how we are doing”. Thus, you look at your metrics for the week, and if things are green, you breathe a sigh of relief and go play golf – or, if they aren’t, you call your direct reports, yell and scream, then check back again next week. I do not have a problem with digging into the numbers as an investigatory process, as part of a balanced breakfast.
3) Likewise, I expect that individuals are using metrics every day, in order to figure out dynamics and make plans. These numbers are part of a one-time problem solving strategy, often thrown away after the fact. DDP used one-time – say, comparing the numbers from this year to last year – as part of a balanced breakfast – might not be that terrible. It’s when it is used repeatedly that the act of observing tends to skew behavior, and we begin to see dysfunction.
4) Earlier I mentioned dysfunction. Keep in mind, you’ll tend to get exactly what you measure. If you measure test cases, you’ll likely get lots of test cases, and even some productivity – at first. But, eventually, the team will realize that test cases and productivity are two different things, and find the shortest possible way to get you the test cases. As the team exploits this difference, each individual test case will likely provide less value – thus, the assumption that “counting test cases” is roughly equivalent to productivity becomes less and less valid over time. There’s a gentleman named Robert Austin, who earned his PhD at Carnegie Mellon, who studied dysfunction. He concluded that since projects are multi-dimensional, any single metric (even a handful) is likely to leave things un-measured, and teams are likely to steal from those “peters” to pay the “paul” that is measured. (The classic example is that, if you are measured by time, features, AND defects, you take on technical debt.) Austin’s book, Measuring and Managing Performance in Organizations, is a classic in its field. His recommendation is to pick a small percentage of projects and do a thorough after-action review, or retrospective, that takes everything into account, and try to take home real lessons from that.
5) When evaluating quality, consider qualitative metrics and rules of thumb, as opposed to hard numbers. This can be as simple as a thumbs up or down “should we ship” crowd-source decision by the team – or at least using that as input for a decision maker. For a detailed analysis of software engineering metrics, consider the classic paper by Kaner and Bond, which ultimately recommends qualitative evaluation.
Putting it together
That leaves us with a small handful of tools – get actively involved with the project, manage by walking around, conduct detailed retrospectives, or use metrics as part of a balanced breakfast to inform, not to convince, evaluate, or control. But what if you really want to use metrics to do lightweight control with integrity – say you have an organizational mandate?
Well, first, you could get a better job. No, really. I’m serious. I’m reluctant to offer advice on how to make a bad idea work. That said …
I do think organizations can use metrics in a mature and sophisticated way. To do that, I would introduce the metrics in context, as part of a story, including the limits and weaknesses in the approach. For example, when a particular idea should not work, and how to figure out why it worked in this instance, and the complex dance of experience and research we used to validate our opinion.
When I see these naive metrics, I smell a theory and ideology that has never been tested; yet another Pied Piper of Hamelin, telling people what they want to hear.
We can, and should, demand better.
Code Coverage Steel Cage Knife Fight!
The November Software Test&Performance Magazine had a little article on Code Coverage Metrics by myself and Chris McMahon. Our conclusion was highly skeptical of code coverage as a tool for testers; Alan Page took that and started a … dialogue with me over twitter.
Some things are just hard to discuss in 140 characters or less. We took our discussion offline, and we’ll be publishing the results as an interview for a weekly STPCollaborative newsletter called “Test&QA Report.” The interview will appear on Tuesday, December 22nd. We’ve asked Marlena Compton to referee … I mean, um, do the introduction. It’s going to be awesome.
And you can be certain to get it in your in-box by creating an account with STPCollaborative and changing your settings (this will get you the download of the November Magazine too, and it’s free).
UPDATE: If registering is too intrusive, you can sign up for just the report by email here.
On Management And Metrics
Speaking of Management: Peter Drucker
Peter Drucker analyzed how General Motors was run in the 1940’s, essentially defining management as a term. He wrote a number of books, published an incredible number of articles in the Harvard Business Review, and has a Graduate School of Management named in his honor. More importantly, Drucker defined the term “Knowledge Worker”, and made it clear that the knowledge worker of tomorrow would have more authority and scope of responsibility than the factory foreman of yesterday. Every MBA I have ever met has had to read something by Peter Drucker at some point in graduate school. I read The Practice Of Management several years ago (and have re-read it quite a bit) and, while it’s a little thick and a little dry, I found it absolutely wonderful.
So last week I started reading Drucker’s Magnum Opus – Management: Tasks, Responsibilities, Practices weighing in at 819 pages. You’ll never guess what I found about thirty pages in:
The measurements which give us productivity for the manual worker, such as number of pieces turned out per hour or per dollar of wage, are irrelevant if applied to the knowledge worker. There are few things as useless and unproductive as the engineering department which with great dispatch, industry, and elegance turns out the drawings for an unsalable product. Productivity with respect to the knowledge worker is, in other words, primarily quality.
So there you have it. The very inventor of our modern concepts of management and knowledge work writes that metrics can miss the whole point – in 1973!
Go figure.
Four Testing Strategies
I’ve spent a good deal of time lately thinking about how we frame the problem of software testing – and how we solve it. It impacts how we see the world, and how we treat each other. Over the weekend, I came up with four fundamental strategies in software testing, which I considered writing up as a blog post.
The thing is, blog posts are one-way; I dump a bunch of stuff at the end. Sometimes, if a comment is particularly insightful, it goes in the UPDATE section at the bottom. Or, if a magazine will let me, I might put a first draft of an article here and incorporate your comments.
What if we all could contribute to such an article? What if we could add, remove, update, delete – all version controlled, with possibly even a back story?
There’s a tool for this – it’s called a wiki. Oh, sure, Wikipedia is incredibly popular, but there are many, many wikis that delve deeply into a specific content area. Rahul Verma, an Indian tester, has even made a free wiki for software testers – the Testing Perspective Wiki.
So, instead of beating myself up over a perfect article, I put up a short piece on the Testing Perspective Wiki, titled Fundamental Strategies in Software Testing.
Is it perfect? Certainly not! Why, it doesn’t even have references yet. Have I missed a few common strategies? Probably. That is where you come in – you can create an account on the wiki, sign in, and add or change content to make this a stronger article. Or add new articles to make it a stronger reference.
In fact, the Testing Perspective Wiki is basically a clean slate. Besides a couple of articles, the whole wiki is open.
So please, check out my little piece on Fundamental Strategies … then leave your own.
If the 1,000-odd monthly readers of this site were to each write one page of text on testing and review two more, we’d essentially have created a book for the community.
Wouldn’t that be a nice thing to do for the world?
GQM – Yellow Brick Road – II
This is the best response I have read to my earlier GQM Request.
It is from James Bach, originally posted on the Software-Testing Discussion List.
Enjoy:
I don’t like GQM. Here’s why:
It assumes you have nothing to learn!
GQM is appropriate for situations where you understand the precise workings of the system you wish to control.
Example:
GOAL: I wish to back out of my driveway.
QUESTION: Is anyone behind me?
METRIC: turn around and look behind me.
But this is rarely the case with software development projects. In projects, we are trying to learn how things work. A development project is a social system. We don’t just monitor social systems, we must study them. Is X doing testing well? I’ll have to observe his testing and learn how he is doing it. I must be open to surprises while doing so.
My version is called OIM:
1. Observe: What is happening?
2. Inquire: Why is THIS pattern happening? Let’s study that.
3. Model: Here’s my theory about how this project works.
This is not only a cycle, each task is simultaneous. During any of these tasks, at any time, you take action based on your current model of the project. OIM is consistent with social research methods such as Grounded Theory.
This is how we used metrics at Borland, and everywhere I’ve worked since then.
GQM makes it sound like metrics are easy to use and interpret. OIM is all about coping with murky and changing reality.
Goal Question Metric – The Yellow Brick Road?
I posted this on the software-testing email group yesterday.
The replies have been fascinating; but I’m curious what you think:
Hello Folks.
Many people here have heard my own life stories about programming and testing; how, essentially, I kept getting “patted on the head” and told that I “Didn’t Get It” because I opposed big extensible designs, rituals, signoffs and handoffs in the development process, and expensive, heavyweight test case programs.
I stopped worrying about it when I realized that my projects were far more successful than those of my peers. Eventually, I started talking about it openly.
Metrics are currently on that list. I have in my study the Handbook of Software Quality Assurance, 3rd edition, which contains a list of about 150 qualities (like scalability, security, etc.) that can be measured. Then it tells you that one or two metrics will cause dysfunction, so you need a balanced scorecard. And that the easy-to-gather metrics are also easy to game and bad, but that the good metrics are expensive to measure. Oh, and be careful, because your engineering staff will rebel if they have to spend too much time gathering metrics instead of doing work.
To summarize, this is what the book has to say about metrics:
“Good Luck.”
Which brings me to my next sacred cow: Goal Question Metric.
GQM is a framework written by Victor Basili; you can google it. The basic idea is that instead of gathering a bunch of metrics, you actually figure out your goal (like “faster production”), ask a question that will help measure that goal, and turn that into a metric.
I have to grant that this is an intellectually valid framework, and it beats the pants off of mindless gathering of numbers. For software testing, the idea has been endorsed by people I respect like Cem Kaner and Lee Copeland.
Here’s my problem: This idea has been around for a long time. When it comes to software testing, I’ve read a great deal of the literature, been to the conferences, read a lot of blogs.
Except for a few examples from people like Lee Copeland and James Bach here’s what I always see: “If you want metrics, use GQM. Since all contexts are different, I can’t give you an example.”
Pshaw. Is it too much for me to ask for a case study before I invest time, energy, and effort into a metrics program? One with positive ROI? Enough positive ROI that I wouldn’t be better off working on other projects, or sticking the money I would have spent in a CD?
It’s been 13 years since the first GQM paper was published. I haven’t seen GQM prove its value in a software testing context.(*)
Have you? I would be really interested in success stories, please.
Regards,
–heusser
(*) – Please don’t say NASA. They work under an entirely different set of constraints than commercial software development. And even then, the business case is shaky.
UPDATE: Dr. Kaner replied that he doesn’t really ‘endorse’ GQM as much as he simply mentions it during talks. His overall comments are along the lines of “GQM looks interesting, it’s more grounded than nothing – if it works for you, good for you.”
Metrics
I’m overweight.
That means: 210 Lbs, 5′11″. Lots of driving, airplanes, lots of coffee, lots of typing (coding, testing, writing), and raising kids will do that to you.
Yes, I coach soccer, which is an hour of light exercise a few times a month in-season.
I suppose I could look at someone else more overweight and say “hey, I’m not that bad off.”
I could make a New Year’s commitment to exercise and eat better; but I gave that one up this year in May for STAREast and never made it back.
The problems? First, over-eating isn’t *visible* to me – or to anyone else – except in a gradual form that is hard to notice. I don’t have much energy and my clothes don’t fit … but I haven’t had much energy in a while, and I can always buy larger clothes. So I am trapped in an addiction cycle; I feel bad, so I eat, and feel better for a short while, but worse in the long term. So, the next day, I feel bad, so …
Second, exercise is not convenient. Especially in the winter.
Now, if I could just make diet and exercise something that was visible to my friends, family and colleagues – something I could bask in glory for success, and something they could hound and decry me for failure – then I might have a chance to break this addiction cycle. I think the key is to make it public.
So, here are a few things I’m going to do:
1) I purchased an elliptical trainer, so “It’s too cold” is not a good excuse.
2) I created an account on Traineo.com.
Traineo is a metric maniac’s dream website.
You enter your weight as often as you check it, and it creates pretty graphs.
You enter your amount of exercise, and it creates pretty graphs – even calculating the calories you burned based on the type of exercise, time you spend, and intensity.
You can create a goal, and it will show how far you are from that goal, and how much time you have remaining.
And you can create custom metrics. Here’s my site.
Here’s my basic strategy:
If I work out three to four times a week on the elliptical, that should be enough to maintain, but not lose, weight.
So I want to do something else to lose the weight. How about give up Mountain Dew during the work week? A 20 Oz, twice a day, five days a week, that’s 200 Oz less Mountain Dew.
I also often eat junk for breakfast. So I created a score for my breakfast eating; 0 is Cheerios or fruit; 5 is super sized McDonalds.
What this has to do with software development
I’ve been using traineo for four days now. Just four days into the system, I found that I was eating candy bars and other snacks.
That doesn’t show up in the metrics!
So the metrics have some value, but are imperfect. Once you realize how to game them, it’s pretty easy to abuse them and remove the value, if not make them downright misleading and harmful.
Does that sound familiar?
Over the holidays, I may not have time to blog, but I’ll try to keep the traineo site up. I believe there will be additional insights into software that we can mine from it.
And, if you’d like to encourage me, please feel free to check out the site and see how I’m doin’. There is even a role called a traineo “motivator” where traineo emails you my stats weekly. If we’ve actually met in real life, and you’re interested in being a motivator, let me know.
–heusser
PS – If you can’t see the tie-in between technical debt, metrics, and weight yet – don’t worry, it’s coming …
Metrics Madness
A Cautionary Tale
Years ago I worked for an established Fortune 500 company. At the beginning of each year, executives set goals by which they would be evaluated. These objectives were numerical and SMART – Specific, Measurable, Achievable, Relevant, and Time-Boxed. In order to make sure that the person didn’t do long-term damage to meet a specific goal, the company required at least two goals, or a “balanced score card.” They were also somewhat enlightened in that, within ethical limits, how you met the goal mattered much less than whether or not the goal was met. The company was split into independent sub-units, each with its own profit and loss statement – a wonderful source of hard metrics.
If you think about it, the way that company was run, metrics and all, is very similar to one particular point of view for IS Management. The argument goes that if we could only take those ideas and adapt them to software engineering, all would be salt and light.
To answer that argument, I would like to tell you a story.
In the late 1990’s, one of those independent units had a Vice President of Sales, who I will call Joe. Well-deserving of the job, the man was seriously brilliant. Taking over from the last sales VP, Joe reorganized the way sales was done, focusing on selling things which cost less to produce and sold for more money. He also expanded the client base, selling into markets that could see more value in the product or with larger purse strings – thus making sales easier.
By the tenth month of the fiscal year, Joe’s sales team had booked more profit than the “exceeds expectations” goal set at the beginning of the year. Why, with the incentives complete and no further profit possible, the team would be just as well off to take two months off work and start again in January, right?
Well, of course not. The company could always use more money, and besides, Joe was measured with a balanced score card. Because he sold more profitable items, he had met his profit target but not his gross sales target. Unless he could hit the gross sales goal, there would be no bonus and no big raise. Yet, late in the year, most customers had already spent the available budget; what remained was very little.
Dilemma. What’s an ambitious business genius to do?
You probably guessed it – Joe had his team go back to the old products and sell them at a loss in order to hit the gross sales goal. This is dysfunction; having all the metrics right but missing the point. As the anonymous philosopher once said “Be careful what you measure, because you are going to get it.”
In this story we have an established, mature organization trying to do metrics right and do the right metrics. They used established accounting and business administration principles, which, compared to software engineering, seem wise and established. Why, Mark Twain once remarked that there are three good ways to mislead: Lies, Damn Lies, and Statistics. If metrics dysfunction can find its way into a mature field like business administration, we must realize it is a very real risk for software engineering.
Playing pick a number at the beginning of the year and “managing to the numbers” may be easy, but that doesn’t make it right. Numbers can provide information or evidence to help lead to a conclusion, but without the context, we’re likely to make a mistake. We might either abuse the metrics – like the example above, or misinterpret them – such as the new defects trend line that seems to shrink right around spring break.
Twenty years ago, Tom DeMarco wrote that “You can’t control what you can not measure.” Jorge Aranda points out on his blog [1] that without measurements it would be very hard to control weight loss, blood pressure, or cholesterol. Yet he also begs this question:
When was the last time you measured the length of your hair? So how can you control the length of your hair?
Now, take that observation and general systems thinking and apply it to software, and I’ve got two wonderful words that describe it:
Management and Leadership.
References:
1. http://catenary.wordpress.com/2007/01/11/controlling-what-you-cant-measure/
Agile Metrics
Someone on the Agile-Testing list asked about test metrics for his SEI/CMM compliance effort. Here’s my reply:
I have one graph, which is a stacked-line graph. On the X axis I have time. On the Y axis I have deliverables.
Each deliverable has phases – need requirements, in dev, in software engineering test, in customer acceptance test, waiting for prod, and in production.
I update the chart every Monday. Of course, I am an agile guy, so I dev as I test, so spending a lot of time in SE test tells me something. (If I move to SE test on Tuesday and promote to CA test the next day, it never shows up on the spreadsheet. That’s good.)
Looking at this sheet, I should see the size of the delivered features go up regularly. Now that is what I care about. It also shows the relative size of the work-in-progress inventory.
It helps the devs. The testers. The requirements people … and it’s holistic.
Now, if I see things start to stack up at a specific point, (especially testing), I know something is going on, and I put more effort in eliminating the bottleneck; that’s basic constraint theory.
Of course, it’s a first-order approximation. Some deliverables are done in a few days, others are a few weeks. I could weight them, but that seems like ‘good enough’ for now. Of course, the graph tells a story, so I show it to people in context only.
That has nothing to do with the CMM(I) – it’s what I actually do to make my life easier. It has pros and cons, but it gives me and my management visibility into what I am producing, and opportunities for feedback.
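For anyone who wants to build the same chart, here is a rough sketch of the data behind it. The phase names come from the list above; the deliverable names, the sample snapshot, and the function names are invented for illustration, and the plotting itself is left to whatever spreadsheet or charting tool is handy.

```python
from collections import Counter

# Phases in pipeline order, as listed above.
PHASES = ["need requirements", "in dev", "in software engineering test",
          "in customer acceptance test", "waiting for prod", "in production"]


def weekly_counts(snapshot):
    """Count deliverables per phase for one Monday snapshot.

    snapshot maps a deliverable name to its current phase.
    """
    counts = Counter(snapshot.values())
    return [counts[phase] for phase in PHASES]


def stacked_series(counts):
    """Running totals, so each line in the chart sits on top of the previous one."""
    totals, running = [], 0
    for n in counts:
        running += n
        totals.append(running)
    return totals


def bottleneck(counts):
    """Phase holding the most work in progress (everything not yet in production)."""
    in_progress = counts[:-1]
    return PHASES[in_progress.index(max(in_progress))]


# Hypothetical snapshot for one Monday.
snapshot = {
    "login page": "in production",
    "invoice export": "in software engineering test",
    "search filter": "in software engineering test",
    "audit log": "in dev",
    "bulk import": "need requirements",
}
counts = weekly_counts(snapshot)
print(counts)
print(stacked_series(counts))
print(bottleneck(counts))
```

Collect one row of counts per week and the stacked lines fall out of the running totals. The unweighted count treats a two-day deliverable the same as a three-week one, which is the first-order approximation mentioned above.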
As for the CMM(I):
I just spent a considerable amount of time reviewing the CMM(I) Integrated Version 1.1 for Systems Engineering and Software Engineering, looking for a tie between metrics and testing.
I couldn’t find it. Of course, the thing is so poorly written that it’s probably in there.
The one thing I did find was that for level 4, it could be argued that you need to measure your adherence to the defined process. (“Quantitative Project Management”). Now that is relatively easy, and counting test cases doesn’t help you get there, unless you have a standard policy of X test cases per 100 lines of code, or something like that.
I don’t know your environment, but in mine, I would want a CMM(I) assessor who believed that our environment changes so rapidly that common approaches to test metrics would be naive and premature, and that we could get all of the level 4 goals accomplished without them. (Ref: Handbook of SQA, 3rd ed, Schulmeyer/McManus)
But, to be honest, I have fundamental problems with the CMMI. I suspect that you might be better off reading “Quality is Free” for yourself.
So please take this with a grain of salt. I did a best-effort attempt at answering your question, but my head hurts now.
Good Luck.
Metrics Madness – II
UPDATE: Mark Waite is quick to point out this article by Cem Kaner on Metrics Dysfunction, which predates Joel by years. The style of the two articles is very different; Joel uses a little bit of logic, a little bit of generalization, some common sense and emotion to make his point, where Dr. Kaner wrote a legal brief.
Of course, Dr. Kaner is a lawyer.
The great thing about the article (which I have printed off to explore in depth) is that it is comprehensive. Perhaps when the issue comes up again, I can find a way to politely ask “Have you read the Kaner paper on the subject?” Of course, he has published others.
This brings up an interesting question: So far, I’ve been listing interesting resources here as I find them. It would be neat to have some sort of categorization scheme to make them available quickly; something like Bret Pettichord’s Software Testing Hotlist.
