Matt Heusser’s Blog
Testing at the Edge of Chaos
Test This – Finale
It’s down to the entries and the grading for the test challenge. I would like to start with some of the actual bugs found in test; then we can ask “would the test plans have found these?”
Activities widget: cannot select Page Tags in My Conversations
Activities Widget: my conversations ignores event filter
IE: JS error during Activities widget testing
Activities widget: Showing menu cut off by widget boundary
IE6: Activities Widget: Cutoff slightly in left col at 1024×768
Activities widget does not maintain “Post Signals to” value
IE: Activities Widget: Vertical padding in comment activities
IE: Activities Widget: Edits show up in my conversations/signals
Private messages are not all grey in activities widget
Activities Widget: Edits and Comments for People I follow are the same thing
Activities Widget: Unable to see follow/unfollow events in ‘all events’
Activities widget does not display edit summary signals
Activities widget does not display My Signals
Humongous stacktrace upon adding Activities widget
--> Yes, we could do a whole blog series on predicting bugs and test strategy. For now, as you can see, most of these errors are with either Internet Explorer, the displaying of specific events, or with signals.
The Entries
To grade the entries, I used three measurements: How thorough was the test plan, how well-written was it, and did I think it could be executed reasonably fast – after all, one of the requirements was that it work in an agile shop. Now, some of the test plans looked like they took longer to write than I actually had to test the widget, so I tried to give credit for thoroughness but not speed.
Two of the entries, Joe Harter’s and Justin Rohrman’s, were unique in that they did not actually define test cases – instead, they laid out high-level functionality and how much time they would devote to each element.
Two other entrants – Marlena Compton and Joe Harter again – tried to engage me in a discussion of what the functionality was and how it would be used. They used this both to fill out and to ‘check’ their mental model of what the functionality was before designing tests around it. This was what I was looking for; in addition, I would have given out bonus points if they had tried to negotiate how long testing should take. (Justin’s and Joe’s proposals to test were great, but Joe wanted about twice as long to do the testing as I could reasonably give him, and Justin wanted a more modest, but still too large, percentage.)
I was also looking for other ways to mitigate risk, like having the company actually use the software prior to its release to production.
There were other ideas that impressed me. Markus Gaertner pointed out that Socialtext offers a ‘free 50’ program, and that he could actually test the software, then work backward to define his plan. (Speaking of which, his plan is pretty good.) That was a novel approach, and I give him a ‘tester reputation point’, as James Bach would say, for the idea.
Ajay Balamurugadas, Parimala Shankaraiah, and Sharath Byregowda used a similar, traditional, test strategy template with details filled in. I was impressed by the thoroughness, especially Ajay’s, which scored highest of all entries in that category. What struck me as I read the entries is that I believe such an approach would slow down the team.
It’s hard for me to explain why, but I’m left with a feeling that a team attempting such a strategy would become bound by checklists, documents and process. I certainly don’t see anyone creating a twelve-page document, then turning around and executing the tests in the three to six hours I had to do the work.
Competition Results:
First Place: Marlena Compton
Second Place: Joe Harter
Third Place: Ajay Balamurugadas
To the winners: drop me an email (matt.heusser@gmail.com) with your home address, and I’ll get the prizes in the mail.
To everyone else: I tried very hard to grade fairly based on those three categories. You’ll notice, however, that the first and second place awards went to the people who actively engaged me in a dialogue about my expectations. There’s a testing lesson in there; I’ll leave it to you.
This completes my current challenge; what are some of yours? Please feel free to write them up and leave a link in the comments. If I like one, I’ll endorse it …
The Boutique Tester Goes Big-Time
The fine folks at SQE just published my article, “The Boutique Tester“, on Stickyminds.com.
The article was listed as a thought experiment; a “what if.” A few people have asked me for more meat on the bones on the business model, but to do that, I have to ask – what do you want to talk about?
Mind the Gap
(was: Metrics and Process and Templates, oh my!)
I’m back from STPCon. On the last night, I had dinner with some of the attendees, and, as much as I like to talk, I tried to at least listen a little. What I heard struck me.
First I heard from a tester. She said that her main goal for the conference was to get more test ideas – examples, “how I would test this”, test exercises combined with direct coaching. In the past, I’ve noticed a lack of this at test conferences, but was struck to hear it about a conference that I associate with on a very personal level.
Yes, we had some exercises and examples (Testing Outside The Bach’s comes to mind), and yes, we could have more. I heard you, Bobbie, and I’m going to work on this.
The second person I met was a director of test, and he pointed out something interesting. He said that yes, deep down, we all know that metrics and templates and process are horse-pucky. Yet senior management – people who have a cursory understanding of what we do – keep asking for them. So, as a test executive, he picks the battles he can win, and tries hard to build relationships.
According to this director, whom I will call Rob, test leaders don’t need to decide what to do; they need help answering executives who want those “traditional status tools.”
But why do executives need those ‘tools’? Why do they want them?
Those tools promise to bridge the gap created by a lack of trust.
I once knew a project manager I refer to as Biff, who slipped a project by four months at a time, every four months, for about a year and a half. He had no metrics, no documentation, no evidence at all that he could bring the project in on time. In fact, the evidence was stacked against him.
Yet it did not matter. The executives trusted Biff. Biff could hand in the same natty old Gantt chart with the dates pushed out by four more months – again – and would face no consequences.
Biff didn’t use evidence. He did … other things to build trust. I would argue that those things can be done better or worse, consciously or unconsciously, with integrity or without integrity.
So, in the next few months, I’d like to try to do a little more writing about what we actually do when we test, and also how we create and build trust in all directions – with our superiors, customers, peers and reports – in software development. I may also do some writing about Miagi-Do Testing, my formal but non-commercial and zero-profit software test training program.
What do you want to hear about?
On Testing
I’m reading Stephen King’s “On Writing”. Amazing stuff; I’d recommend it for anyone who has to do any kind of technical communication, be it writing business email or corresponding with the parents of your Sunday-School class. Suddenly I run into this paragraph and I am struck:
I’m not asking you to come reverently or unquestioningly; I’m not asking you to be politically correct or cast aside your sense of humor (please God you have one). This isn’t a popularity contest, it’s not the moral Olympics, and it’s not church. But it’s writing, damn it, not washing the car or putting on eyeliner. If you can take it seriously, we can do business. If you can’t, or won’t, it’s time for you to close the book and do something else.
Wash the car, maybe.
No, I’m not Stephen King, and the stuff we are testing is not ‘Carrie’ or ‘The Shining’. The odds that the tales of the software we test will echo down the corridors of history are vanishingly slim.
But still, when I think of testing and new product development, somewhere inside, I feel something of the same emotion King has about writing.
This is testing, dang it. Take it seriously.
I suspect, if you’re bothering to read this blog, likely on your lunch hour, or your own time at night, you already do.
For that, I thank you.
New Article – Accelerate your Agile Testin’
The kind folks at SearchSoftwareQuality.com just published my latest article – on techniques to compress test time in order to enable Agile Software Development.
The article is called “Accelerating your Agile Software Testing” and it’s available now.
It’s very hard to cover a wide subject like Agile testing in a thousand words or so; this article is just the tip of the iceberg. Look for follow-ups to come, especially about self-organized teams and the team mindset. In the meantime, you can rate the article at SearchSoftwareQuality or leave comments here.
Still more to come: Building trust with stakeholders, the test challenge awards, and, possibly, my notes from the Software Test and Performance Conference, this week in Cambridge, Massachusetts.
Is something wrong here?
Yesterday I sent out the invitations for the next GRTesters Meetup. Now, we have a lot of different people, with a lot of different email and calendaring systems.
I find that emails tend to be forgotten, so you need to follow up, and that’s annoying to people who actually plan ahead, who should be rewarded. So I send invites via Google Calendar. These invitations tend to work with every popular email system – Lotus Notes, Outlook, GroupWise … it “just works.” Turns out, there is an interoperability standard for these things. Woot.
So we decide to have meetings on the first Monday of every other month, and yes, Google can set up recurring invitations. Excellent, right?
So I set up the recurring invitations, and get a few acceptances. Then I get this in my inbox:

Something is wrong here - 51 acceptances from Ross Wagner
That’s right. Apparently, whatever Calendar system Ross uses realizes the recurring appointment into a series of appointments, ranging from today to 2017. When he clicks “accept”, it sends me dozens of acceptance emails.
Shortly after I got the dozens of acceptances from Ross, I got a bunch (about sixty) from Daniel Mish. It turns out they both work at the same company and, surprise, surprise, they both use GroupWise.
What’s interesting to me is – arguably – both systems are doing rational interpretations of the interoperability standard, according to their specifications. If I called GroupWise, I’d bet they’d say “well, yeah, it’s supposed to work like that.”
At Socialtext, we support an interoperability standard called “Open Social“; thanks to Open Social, we can embed other 3rd Party apps directly in our software. (And yes, this is why you’ll occasionally read that I am testing duck hunt; duck hunt is a very common Flash-based Open Social Application, and I am testing flash. Mostly.)
When we have an issue with duck hunt, or any other 3rd party app, my first action is to try to test the gadget in google’s own open social container – www.google.com/ig. If the gadget doesn’t work in /ig, it’s a problem with the 3rd party code, not Socialtext. Yes, we try to be nice people and email the original authors; if the code is licensed under creative commons or GPL, we’ve even offered patches to fix it. It improves our customer experience and allows us to be good citizens; everybody wins.
This “check the reference implementation” game, as I see it, is fundamental to interoperability testing. It also explains why “Standards” that don’t offer a reference implementation are such a bugbear. (I’m looking at YOU, Object Management Group and Software Engineering Institute.)
Bottom line: If you’ve got interoperability problems, go to the source. If you have no source, try to be the loudest voice in the room.
Otherwise, you’ll be stuck with 68 emails from Ross Wagner. Aside from his wife, I doubt anybody else wants that.
Test This – My Strategy
Earlier I introduced the test challenge, along with this picture:
A simple look at the potential test cases:
7 basic events x 3 types (all, followed, conversations) x 2 networks (all or a specific one) = 42 test cases
Yes, 42 different cases. (Apparently, 42 really is the answer to life, the universe, and everything.)
But that’s just a start; that is just verifying that when I do things, they show up, and they also show up when filtered. It does not include events done by others (+42) – and then events that should /not/ show up because they do not fit the filter. For example, when filtering by my conversations, we would want to check that all event types that are not conversations do not appear. Depending on how you count them, I would say that is at least 84 more test cases. So currently, 168 potential test ideas.
Then you have different browsers – IE8, IE7, IE6, FF3.5, FF3, Safari3, and Safari4. Depending on how much risk you are willing to take (hey, are FireFox 3.5 and 3.0 really all that different?) we’ve got somewhere between four and seven browser variants.
That is something between 600 and 1,000 test ideas.
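The counting above can be reproduced as a rough sketch. The event names below are hypothetical placeholders (the post only gives the counts 7, 3 and 2), and the 84 "should not appear" checks are taken as the estimate stated above rather than derived.

```python
from itertools import product

# Hypothetical placeholder names; only the counts (7 x 3 x 2) come from the post.
EVENTS = [f"event_{i}" for i in range(1, 8)]
VIEWS = ["all", "followed", "conversations"]
NETWORKS = ["all networks", "one network"]


def confirmatory_cases():
    """Every event/view/network combination for actions I perform myself."""
    return list(product(EVENTS, VIEWS, NETWORKS))


def total_ideas(browser_count=1):
    """Add up the estimates from the post, then multiply by browser variants."""
    own = len(confirmatory_cases())   # my own actions
    others = own                      # the same checks for events done by others
    negative = 84                     # the post's estimate for "must not appear" checks
    return (own + others + negative) * browser_count


if __name__ == "__main__":
    print(len(confirmatory_cases()))        # 42
    print(total_ideas())                    # 168
    print(total_ideas(4), total_ideas(7))   # 672 1176
```

Multiplying out exactly gives 672 to 1,176 ideas for four to seven browsers, the same order of magnitude as the rounded range quoted above.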
But we are just getting started.
The tests above are basic confirmatory tests. They don’t deal with special characters, internationalization, or boundary conditions. For example – say I send a message that is 140 characters of pure text with no spaces in it; I suspect the browser will not be able to chop off the word and that it will “spill over” the right side of the widget. (Turns out, it did.) Or can the widget handle UTF-8 – extended character set characters? Turns out, it could not. Also – what if someone else sends a signal or updates a page while I am looking at the widget? Do I see the change? (Turns out – eventually. The widget polls the server every 30 seconds. This turned into a load on the server for multiple users, and we had to increase the delay between polls.)
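As an illustration of the boundary inputs described above, here is a small sketch that builds them. The function name and the particular non-ASCII characters are my own choices for the example; only the 140-character, no-spaces message and the UTF-8 question come from the post.

```python
MESSAGE_LENGTH = 140  # the message length mentioned in the post


def boundary_inputs(length=MESSAGE_LENGTH):
    """Build a few awkward messages to paste into the widget by hand or from a script."""
    return {
        # One long "word": nothing for the browser to wrap on.
        "unbroken": "x" * length,
        # Same character count, but every character is two bytes in UTF-8.
        "non_ascii": "\u00e9" * length,
        # A mix of scripts, trimmed to the same character count.
        "mixed": ("\u65e5\u672c\u8a9e \u00f1 \u00fc " * length)[:length],
    }


if __name__ == "__main__":
    for name, text in boundary_inputs().items():
        print(name, len(text), len(text.encode("utf-8")))
```

Printing both the character count and the byte count is a cheap reminder that "140 characters" and "140 bytes" are different limits, which is one place such bugs tend to hide.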
So, as of now, I’m only testing the basic functionality – not looking at the widget settings, or trying to move the widget within the container, or next/previous links – and I have a tremendous number of potential tests. Far more than I could realistically test in the three to six hours for which I am expected to test this widget.
And automation? Any time spent writing computer-assisted tests with something like Selenium is time not spent actively testing. In three to six hours I could only test a tiny fraction of those 168 test ideas. Yet if I don’t write any automation, in two weeks, I’ll have to test it manually again, plus have new features to test.
Bummer.
The next option is to go to management and say “We need to peel off three developers for the next two weeks to write automation. I know, I know, that will mean they don’t produce working software next iteration, but I read in a book that the iteration isn’t done until testing is done.”
How do you think that will go over?
Even if we got the funding to automate those 168 test ideas, they won’t catch rendering errors (the text falls off the screen), and sadly, if the UI changes, they will create a significant maintenance burden.
Here’s what Matt Actually Did
Before the developers began coding on the story, I had a chance to review it and work on the story-tests described in my second blog post. These story-tests are important because they are concrete examples. The challenge I find with story-tests is in getting enough stories to provide good examples to guide development, increasing quality before “code complete” – but not too many. With too many story tests, I find that human nature takes over, and the developer skims over the tests, saying “oh yeah, it should do that” without physically doing the testing. (After all, the developer will think, that’s the tester’s job, right?)
After story review, the whole team does a kickoff, to make sure everyone is on the same page about what the story should do. This was the final chance to clarify expectations, to put the story into our own words, bounce it around, and see if anyone says “oh, I didn’t realize you meant that.” (This overcomes document under the door syndrome.)
Then the developers write the code. Time passes …
Code complete; story is moved to QA. I built the environment (5 minutes). I spent about an hour doing exploratory testing, simulating three users – one in FireFox 3, one in Selenium, and one in IE7 – and found the bounds issues above. I also explored logical concepts; for example, conversations only consist of page edits, and the ‘code complete’ version both allowed you to select signals and actually showed some. Then, I created a basic Selenium RC test, demonstrating that signals and page edits do appear when a user creates them. I added to the test every iteration for the next few months, until it is now modestly robust. Then I logged in as a fourth user, who only overlapped with the IE7 user by one workspace, and tested in Internet Explorer Six. Thus I saw the IE7 user’s actions, but not other users’ – and only the profile actions and actions within the workspace we overlapped in.
Then we got the software up to our staging server, and ran the entire company on staging for a week prior to releasing the software to production. Thus I was able to test with real-world data, logging in as different users and asking “does this look right?”, then exploratory test on staging.
A few other techniques to mitigate risk:
Create a staging clone and test update on it.
Log and monitor. Our software creates a wiki page with log output, sorted by either number of calls or duration. We can mine this output for most frequent operations (test those first) or slowest operations (mine for performance issues.) We can also log and monitor production, monitoring as a kind of performance test. Assuming that staging has a much larger use per user than production use and that the activity widget will have a slow ramp-up adoption curve, this combination may be “just enough” load testing. (Perf testing is implied throughout; the difference between service level expectations and service level capability is likely a separate blog post.)
Actively engage the business users in the test process. We have a very simplified Bugzilla form our employees can fill out in order to report bugs and a defined, non-painful test process. We train our business users; if Joe files a bug titled “the software is broke” with no text, we consider that a training issue, not a people issue.
Combine the test suite with the application. Our tests are expressed as wiki-pages; the test suite “pulls down” the page and then executes it as a very high level programming language. In this way, we test the app while we are writing our tests! Moreover, if I know what the test suite does, I can log into the app after a test run – which is roughly 12,000 GUI clicks and inspections – and look to see if the activities of the wikitester user are documented properly; then I can filter to see if they appear.
Regularly rotate browsers. We do our work in different browsers and different browser versions; combining that with creating tests in our application creates a level of automatic coverage.
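The “log and monitor” idea in the list above can be sketched in a few lines. This is not the Socialtext log format, which the post does not show; the (operation, seconds) records and operation names are assumptions made up for the example. The point is only the two rankings: most frequent operations first, and slowest operations first.

```python
from collections import defaultdict


def summarize(records):
    """Rank operations by call count and by total duration.

    records: iterable of (operation, seconds) pairs (a hypothetical log shape).
    Returns two lists of operation names: most-called first, slowest-total first.
    """
    calls = defaultdict(int)
    total = defaultdict(float)
    for operation, seconds in records:
        calls[operation] += 1
        total[operation] += seconds
    # Ties are broken alphabetically so the output is stable.
    by_calls = sorted(calls, key=lambda op: (-calls[op], op))
    by_duration = sorted(total, key=lambda op: (-total[op], op))
    return by_calls, by_duration


if __name__ == "__main__":
    sample = [  # made-up sample data
        ("view_page", 0.2), ("view_page", 0.3), ("search", 2.5),
        ("edit_page", 0.4), ("view_page", 0.1),
    ]
    by_calls, by_duration = summarize(sample)
    print("test these first:", by_calls)
    print("check these for performance:", by_duration)
```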
So that’s it
The big games of testing, as I see it, are coming up with the test ideas, selecting the most powerful test ideas to run in the time allotted, and summarizing the results in order to provide information and make recommendations to decision makers. (And, when bugs are slotted to be fixed, selecting the appropriate sub-set of tests to run to gain confidence that the fix ‘took’ and didn’t break anything else along the way.) In that way, testing is both an investigative activity and a risk-management activity.
Likewise, the challenge in the blog post is summarizing the strategy in a way that captures the essence of the work in a short period, without leaving important bits out, and, likewise, without putting everything in and putting your reader to sleep.
Still with me?
I’ll be talking more about this Oct 23rd at STPCon in Boston, in my session “Testing the user centric web.”
Still to come: Announcing the winners.
Test This – The Entries
In many performance disciplines, the assumption is that students will, yes, do a lot of practice, but also study the masters – both what they created and how they perform the work. We even do this in mathematics; my highest course at Salisbury was Math 480: History of Mathematics, where we basically studied the proofs of ancient mathematicians, from Euclid to Euler.
Yet, as I’ve said before, in software testing, our canon of sample tests includes the triangle problem and maybe the insurance problem (it’s on page 51), which haven’t changed since they were introduced in the 1960s and 1970s.
I was hoping, with this challenge, to change that a little, and there’s good news. Not only can you work through the problem, but in many cases, you can see how others worked through it:
* Several people left comments on the initial blog entry with test strategies.
* Several went on to post their own blog entries:
** Marlena Compton
** Joe Harter
** Justin Rohrman (with a follow-up)
** Markus Gaertner (with a Part II)
** And, last but not least, a truly unique reply came from Phil Kirkham. If I don’t give Phil second place, I may have to invent a prize category for “humor.”
I also received private entries from Justin Hunter, Ajay Balamurugadas, Parimala Shankaraiah, and Sharath Byregowda. (Justin generated his with his Hexawise tool. It’s /very/ interesting and collapses the exercise into something like twenty test cases.)
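For readers wondering how a tool can collapse hundreds of combinations into a couple of dozen cases, the usual idea is pairwise (all-pairs) coverage: every pair of parameter values appears together in at least one test. The sketch below is a naive greedy version written for illustration; it is not Hexawise’s algorithm, and the parameter values are hypothetical stand-ins for the widget’s options.

```python
from itertools import combinations, product

# Hypothetical parameters; only the rough sizes echo the widget discussion.
PARAMETERS = [
    [f"event_{i}" for i in range(1, 8)],        # 7 event types
    ["all", "followed", "conversations"],       # 3 views
    ["all networks", "one network"],            # 2 network filters
    ["IE6", "IE7", "FF3", "Safari4"],           # 4 browsers
]


def pairs_of(case):
    """All ((position, value), (position, value)) pairs one test case covers."""
    return set(combinations(enumerate(case), 2))


def pairwise_suite(parameters):
    """Greedily pick cases until every pair of values is covered at least once."""
    candidates = list(product(*parameters))
    uncovered = set()
    for case in candidates:
        uncovered |= pairs_of(case)
    suite = []
    while uncovered:
        # Take the candidate that covers the most still-uncovered pairs.
        best = max(candidates, key=lambda c: len(pairs_of(c) & uncovered))
        suite.append(best)
        uncovered -= pairs_of(best)
    return suite


if __name__ == "__main__":
    suite = pairwise_suite(PARAMETERS)
    full = len(list(product(*PARAMETERS)))
    print(f"{len(suite)} pairwise cases instead of {full} exhaustive ones")
```

The trade-off is that pairwise coverage says nothing about bugs that need three or more specific values to line up, so it trims the confirmatory matrix rather than replacing exploration.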
Needless to say, I have a lot of reading to do. If you would like to help me out, please, feel free to check out the public entries and vote with comments.
Over the next few days I will list my strategy, test plan, the actual bugs we found in the software, and yes, the winning entry. Keep it tuned here …
The Fishing Maturity Model
Imagine, for a moment, you own a fishing company. Perhaps it is in Louisiana, something like Bubba Gump’s Shrimp Company from the movie Forrest Gump.
What you do is very simple; every day, you go out, cast your nets, bring in a few hundred pounds of shrimp. Minus the cost of gasoline and maintenance on the boat, you can “make a good living.”
But the Blue family were not professional shrimp fishers; they were shrimp cookers.
Enter me, the Management Consultant, and my fishing maturity model.
I point out that you’ve been running your business in an ad-hoc fashion – just running out into the water and dragging the nets. Why, you don’t know if you’re doing well or not, and have no plan for improvement. I’d like to help.
The five levels of the fishing maturity model:
1 – Ad-hoc. Fishing is an improvised process.
2 – Planned. The location and timing of our ships is planned. With a knowledge of how we did for the past two weeks, knowing we will go to the same places, we can predict our shrimp intake.
3 – Managed. If we can take the shrimp fishing process and create standard processes – how fast to drive the boat, how deep to let out the nets, how quickly, etc. – we can, more importantly, improve our estimates over time.
4 – Measured. We track our results over time – to know exactly how many pounds of shrimp are delivered at what time with what processes.
5 – Optimizing. At level 5, we experiment with different techniques; to see what gathers more shrimp and what does not. This leads us to continual improvement.
Sounds good, right? Why, with a little work, this would make a decent 1-hour conference presentation. We could write a little book, create a certification, start running conferences …
The problem:
I’ve never fished with nets in my entire life. In fact, the last time I fished with a pole, I was ten years old at Webelos camp.
You see, I actually have no idea what I am talking about. I made it all up by inserting a process improvement framework over a specific activity. It’s Gunk; it’s Garbage.
Or is it?
The beauty of Maturity Model Mania (the MMM) is that it’s a non-falsifiable argument. If you boil it all down, all it’s saying is do the same thing every time, measure what you do, experiment, and then do what works best. How can you argue with that?
I’m not going to argue with the idea that evaluation when picking changes to technique is good. But that’s not what we’re talking about; we’re talking about a specific framework that, when you get down to it, introduces a series of wasteful and expensive processes that don’t really pay off until level five.
I would argue that, as testers – and quality experts – or heck, anyone doing knowledge work, it’s our job to not be fooled – not just by the software, not just by the project plan, but even by the people in suits with fancy ideas and PowerPoint. If Forrest Gump wants to run a successful fishing business, he probably wants to listen to just enough of the idea to see what parts of it apply, then use his own good judgement and discernment on what to implement. (On the other hand, if the guy actually fishes every day and has been wildly successful, you might want to listen to him.)
There are a lot of bogus ideas in software development. This is just one little reminder to trust your gut; don’t be afraid to say “This whole thing smells fishy to me.”
The tyranny of the deadline
Hang around software development or testing long enough, read enough blogs and books, and sooner or later, you are likely to hear the great crisis talk. It typically goes something like this:
(A) Software Development is in a state of crisis!
(B) (Quote the Chaos Report)
(C) Something must be done!
(D) Therefore – whatever the presenter is selling
The presenter could be selling anything – from process improvement, to ‘getting the requirements right’, to ‘prevention’, or ‘inspection’ – it doesn’t really matter.
The point is, software development is in a state of crisis and we must fix it!
Wait a minute. Let’s hold our horses. What is this Chaos report, exactly?
It turns out that a business think-tank called the Standish Group conducts a series of surveys to determine the overall state of the health of IT; the 1995 report is available on-line free. At the very least, it is an interesting read.
The Standish Group’s report (the “Chaos Report”) takes IT projects and divides them into three buckets:
Three Project Outcomes
Resolution Type 1, or project success: The project is completed on-time and on-budget, with all features and functions as initially specified.
Resolution Type 2, or project challenged: The project is completed and operational but over-budget, over the time estimate, and offers fewer features and functions than originally specified.
Resolution Type 3, or project impaired: The project is canceled at some point during the development cycle.
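Written out as a small classifier, the three definitions above look like the sketch below. The field names are my own, and one reading is assumed: the Type 2 wording joins its conditions with “and”, but here any completed project that misses on time, budget, or features is treated as challenged, since otherwise some projects would fall into no bucket at all.

```python
from dataclasses import dataclass


@dataclass
class Project:
    # Hypothetical field names, chosen for this example only.
    completed: bool
    on_time: bool
    on_budget: bool
    all_features: bool


def chaos_bucket(project):
    """Sort a project into one of the three resolution types described above."""
    if not project.completed:
        return "impaired"      # Type 3: canceled during development
    if project.on_time and project.on_budget and project.all_features:
        return "success"       # Type 1: everything as initially specified
    return "challenged"        # Type 2 (assumed to cover any miss)


if __name__ == "__main__":
    late_but_complete = Project(completed=True, on_time=False,
                                on_budget=True, all_features=True)
    print(chaos_bucket(late_but_complete))
```

Notice what the function never looks at: revenue, customer value, or whether the original estimate was any good. A project that ships late and goes on to be wildly profitable still comes out “challenged”.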
Three buckets. Really?
Imagine, for a moment, we used the same criteria to evaluate movies. Did you know that the movie Titanic had its original release date of July 1997 pushed back to December? Under these definitions it would be “challenged”, even though it was the single largest grossing film in the history of cinema.
Forget Titanic; Star Wars was six months late! That’s right kids; the single largest iconic geek movie of the 20th century – the one that launched five sequels, countless action figures, lunch boxes, breakfast cereals, toy light sabres that make whooshing sounds and its own version of “Monopoly” – should be chalked up as a failure. (It did run $3 million over budget, after all, costing $11 million instead of $8 million.)
Oh, bull pucky.
Even worse than late or over-budget, the Chaos report considers projects that were “canceled early” a failure, when in my career I have worked on projects that delivered 90% of the value in 50% of the allotted time. Likewise, some projects should be canceled if the competitive environment changes and the project becomes irrelevant.
What all this means is the report doesn’t allow for the possibility of change – positive or negative. Instead of measuring the execution of the software team, it actually measures the ability of the estimators at the beginning of the project; not only on their estimation, but also on their ability to predict the future (changes to the environment) that will happen before the project finishes.
Pictured this way, the ’shocking’ and ‘abysmal’ ‘failure’ rates actually mean that, as an industry, we aren’t so good at predicting the future.
I don’t think this is all that shocking, nor would I call it a crisis.
Where did we go wrong?
The biggest problem I see is that the chaos report is only looking at a single dimension – “did the project hit the deadline” instead of asking a more reasonable question, like, for example “did the delivered software have a good return on investment?”
Or, perhaps, one single question just isn’t enough. It’s like asking someone “what’s your weight?”; it’s hard to draw any conclusions without knowing their height. Even then, a weightlifter can come off “overweight”; you need body fat index to really tell. And even that might not tell you whether the person is healthy; you’d need to know cholesterol ... you get the point. Drawing conclusions with a single dimension can be dangerous.
How can we do it one better? Well, the usual. Develop in small chunks. Adapt to change. Plan to re-plan. Think while we do. Limit our work in progress, so that if a new opportunity arises, we have to throw away less work. Plan phase gates that release working software – not documents that may or may not be useful.
And, as testers, try not to be fooled by happy-talk – or even crisis-talk.
Meanwhile if you want to build a new framework to analyze projects in our industry – one with depth, that, to paraphrase Gandhi, is the change we want to see in the world – please, drop me a line; I’d be interested.
