Continued debate about the “Task Performance Indicator”

Continued debate from: http://www.iallenkelhet.no/slik-maler-du-effekten-av-nettstedet-ditt/

@Gerry McGovern and @Bjørn: I have to say that even though a web guru like McGovern uses this method and argues quite well for its advantages, I don’t trust the results, and the TPI number will be… well… useless?

You both say that the “optimal time” for the task is the most difficult part of the equation. The way you calculate this number is a “black box” of mystery, as @bjørn said earlier – you use the customer, the fastest participant, your own expertise… “We take a number of issues into account” (McGovern). I’m sorry, but this doesn’t seem like something that would lead to a credible result. If you are using your expertise to decide the optimal time, then you introduce a qualitative factor into the quantitative method, which actually makes it less trustworthy as an indicator.

This week I did a usability test with only one task – it’s basically four screens to fill out if you do it right. The fastest participant used less than 6 minutes and did it without any trouble. The slowest participant did it in 23 minutes. I wouldn’t try to draw any conclusions about time on task from these results (median = 882 seconds | average time = 877 seconds):

[Chart: Average time on task for usability test with 6 users]

The result you get from the test is largely dependent on the success rate. If the success rate is low, the TPI will be low.

Success rate is a number I have dropped from my usability analyses altogether. Why? Because it’s a number that depends on a large number of factors, and even if 10 out of 10 actually complete a task, that says next to nothing about how easy it was, how many in the “real” world would manage to complete the task, or how good the website really is.

Usability testing is a qualitative analysis, and it’s wrong to try to mask it as something quantitative by introducing this magic number called TPI.

With 15-20 users you will be able to see a (strong) recurring pattern, no doubt about that, but it’s a long way from seeing a pattern to grading someone’s website from 0-100, calling the grade a Task Performance Indicator, slapping it on the report, and forcing the client to improve the website so the number goes up!

*EDIT* Gah, so @josmag is complaining about wrong use of the error margin – even though I posted my disclaimer 2 seconds after my main post :D Let’s fix the error margin thingy and see where that leads me:

OLD POST:

(TPI for this website would be 1 × (360/882) ≈ 0.40 = 40%. OK, and then apply the error margin of +/- 19%, so I can trust that my result is really somewhere between 21% and 59%.

That means I get a TPI that either gives me: “A TPI under 30 is quite bad. You have a big problem.” OR “A TPI of 51-70 is good. It is still possible to improve your website.”

Doh? Does it suck or not? Well, the number doesn’t really give me any indication, but from what I saw in the test I would say: ignore the TPI and just fix the obvious problems.)

NEW POST:

TPI for this website would be 1 × (360/882) ≈ 0.40 = 40%, or between 1 × (360/714) and 1 × (360/1050), which gives me a number between 50% and 34%. Not a great difference from my original post. You (@josmag) are also correct that the median will be more trustworthy if I have more users in my test – we don’t know if the number will go up or down – and if we get users who don’t complete the task, I will have to adjust the success rate down from 100% by 5% for each user failing the task (with 20 users).
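For anyone who wants to check the arithmetic, here is the whole calculation in a few lines of Python (a sketch based on the formula as it has been described in this debate – success rate × optimal time / median time – with the numbers from my test):

```python
def tpi(success_rate, optimal_time, median_time):
    """TPI as described in this debate: success rate scaled by optimal/median time."""
    return success_rate * (optimal_time / median_time)

MEDIAN = 882   # seconds, median time on task from my 6-user test
OPTIMAL = 360  # seconds, the assumed "optimal time" for the task
MARGIN = 0.19  # the +/- 19% error margin, applied to the median time

print(f"TPI:  {tpi(1.0, OPTIMAL, MEDIAN):.1%}")                 # 40.8%
print(f"Low:  {tpi(1.0, OPTIMAL, MEDIAN * (1 + MARGIN)):.1%}")  # 34.3%
print(f"High: {tpi(1.0, OPTIMAL, MEDIAN * (1 - MARGIN)):.1%}")  # 50.4%
```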

I think the money spent on testing 20 users would be better spent if you split it across more than one usability test and used the money more wisely. And as @magnusrevang points out – use analytics to get the “magic numbers” for time on task and success rate.

/End NEW POST

Actually, knowing the competence of both Netlife Research and Gerry McGovern, I would trust their (expert) judgement a lot more than the actual number they get out of their magic black box :)

9 Comments

  1. haakonha says:

    When I think about it… I might not be able to apply the error margin to the TPI itself, but only to the time used on task (median time = 882 seconds +/- 19%), which actually doesn’t move the TPI as much. Damn. BUT the rest of the post is still valid, guys :D

  2. haakonha says:

    Now the error margin should be correct (in the “NEW POST” section).

  3. Gerry McGovern says:

    Valid points about the time on task, Haakonha. It definitely has a qualitative element. We split the measurements into two essential parts:
    1) Fix the basics (success rate and disaster rate)
    2) Best practice (completion time)

    The bigger argument I would have with your piece is the claim that the success rate doesn’t matter. It absolutely does. It’s central and core in our experience. If we test Task A with 15 customers and 13 fail to complete it, and we test Task B with 15 customers and 13 complete it, that tells a story. There are always reasons for the task failures.

    Now, one difference here may be that we don’t actually do standard usability tests. It’s much more like time-and-motion analysis. No think-aloud, no discussions. And with screen sharing we find that we are much more likely to get close to normal behavior patterns.

    We have done many hundreds of these tests by now and retested many after the improvements were made, and the figures do change – the metrics improve.

    Time, I think, is absolutely critical. We now see typical times beginning to emerge. And if the time on the task is only twice the optimal time, it’s not factored in so much, but if it is 10 times, then it is. Getting the web team to focus on the customer’s time is, for me, the single most important change in thinking. It changes the whole culture and approach. I know it’s not easy to get the optimal time, and it will rarely be exact, but being able to say “this is taking four times longer than it should” is a real change in the management model.

    1. haakonha says:

      Success rate matters, but not in a quantitative way. I never use success rate to prove anything. In a standard usability test we can have ONE of 20 users failing a task, but the way he fails might be the most important finding in the study.

      As in the example (I used earlier in this debate) about the shopper who asked if he would have to pay customs and tax when shopping on the Norwegian website with English(-only) language. He was ready to abandon the shop because of this. One user, one failure – huge consequences. I’m sure you have tons of examples of situations like this yourself.

      Fine – if 10 more users fail at the same thing, then it’s easier to spot and easier to argue for change, but you can’t really claim that one user is less important than 10. It all depends… because it’s a qualitative method – depending on judgement and analysis that doesn’t include calculators.

      In your TPI calculation you put a lot of faith in the accuracy of the success rate. Basically, if the success rate drops below 0.8, you will instantly get “you need to improve your site” no matter how good the time on task is. (For example, with an ideal time of 80 and a median of 90, you are already down to a score of 71 with the success rate at 80%.)

      I claim that success rate is not an accurate measurement, and you should not rely on it so heavily. Feel free to disagree.

      You say that “if the time on the task is only twice the optimal time then it’s not factored in so much, but if it is 10 times, then it is”. OK?

      Example 1: Success rate 80%, optimal time 120 sec, median 240 sec = 0.8 × (120/240) = 40% = “A TPI of 31-50 is good enough. But there is a lot to work on.” (Lots of room to improve.)

      Example 2: Success rate 80%, optimal time 120, median 1200 = 0.8 × (120/1200) = 8%

      Um, yeah, that is bad. But so is 40%.
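      Plugging the numbers in confirms the arithmetic (a quick sketch using the formula as it has been described in this thread, not Gerry’s exact current calculation):

      ```python
      # TPI = success rate x (optimal time / median time), per this thread.
      examples = {
          "80% success, 80s optimal, 90s median": (0.8, 80, 90),
          "Example 1 (120s optimal, 240s median)": (0.8, 120, 240),
          "Example 2 (120s optimal, 1200s median)": (0.8, 120, 1200),
      }
      for label, (rate, optimal, median) in examples.items():
          print(f"{label}: {rate * optimal / median:.0%}")  # 71%, 40%, 8%
      ```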

      I claim that optimal time and median time are not accurate measurements and you should not rely on them so heavily.
      Feel free to disagree.

      OK:
      I realize that you are not abandoning the good practice of a regular usability study, but why this striving for an ideal time? And why put a number on the performance? What happens to the ideal time if you redesign? Back to square one? Do you use different ideal times for different target audiences? Is there such a thing as an ideal time – for anything?

      The usability community has fought for ages to get clients to understand the qualitative approach to testing, and then you guys just pull this silly number game out of your pockets. Why even bother?

      I think you are trying to be more “scientific” than you need to be.

      Show a video from one of the users in the usability test to management and ask if they think the way the users solve the problem could have been done better and faster. This is a better way than insisting on the TPI, and probably a lot cheaper too.

  4. tormodg says:

    Just tagging along. Very interesting discussion. We’ve developed a different content quality measurement system, which looks at the potential for content to solve user tasks. It is interesting to apply the TPI as well, because we can then match the results from both and see if they agree. Our method looks at how content is produced, while TPI looks at the way content is consumed. So for us this requires some compatibility thinking, and I’m just observing for now. Keep it up – it’s the actual dissection of methods that makes them useful and practical.

  5. Gerry McGovern says:

    I’m totally in favor of regular testing, so we definitely agree on that. However, we have other major disagreements. The reason the usability community is not listened to as much as it should be by management is that it is so qualitative. So much of a craft, so dependent on the individual usability expert. Management simply does not respect that. If you can’t measure, you can’t manage. We have to move beyond opinion.

    Success rate is critical and time on task is critical. If you have 19 out of 20 completing a task, that is amazing success. I rarely come across that, and I would be extremely careful in trying to make it 20 out of 20. It is almost impossible to get everyone to be able to complete a task. And I have found that trying to cover everyone actually negatively impacts overall task completion.

    Once you get a 90%-plus completion rate, you then really focus on the time on task. But going back to the original area of agreement, I am totally for a continuous improvement model based on testing. But you need to be able to know whether things are getting better. How do you know that:
    The completion rate is increasing?
    The completion time is decreasing?

  6. Gerry McGovern says:

    I also should have said that I agree with showing a video. But you need data to back it up. “Is this an exceptional case?” a manager recently asked me when I showed him a video of a customer failing to complete an important task. “No,” I replied. “About 50% of your customers cannot complete this top task.” Then he really paid attention. Managers listen to numbers. Right now, they listen to silly numbers like hits or page views. We need better numbers.

  7. Gord Hopkins says:

    Thanks for the great discussion of the TPI. I’m one of Gerry McGovern’s Canadian partners, and we’ve been using and helping to evolve the TPI method for the past 3 years. A number of issues have been discussed; here is my take on a few of them. I apologize in advance for the length of this reply.

    How do TPI and Usability Testing Differ?

    The TPI was not designed to replace usability testing (we do both) but to augment it with metrics that focus on top-task performance. The most important elements are the success rate, disaster rate, and time on task; however, some senior-level managers benefit from being given a single score. The calculation of that score has evolved over time, and some of the misconceptions in the blog discussion have arisen from references to older versions of the calculation. We are still striving to simplify the model and make the score more useful, so this discussion has been helpful.

    The major differences between usability testing and TPI testing are:
    1) TPI focuses on specific and repeatable top tasks, with the emphasis on measuring performance, whereas usability testing is often more exploratory, with an emphasis on understanding the “why”;
    2) TPI introduces the concept of the disaster rate (people think they got it right but they are wrong), because that type of failure can have very serious implications;
    3) TPI tasks are specific and unambiguous, with a clear success/failure endpoint, whereas usability testing tasks are often much more open-ended;
    4) because the emphasis is on performance and time to completion, no think-aloud protocol is used, no hints are given, etc., as you typically see in usability testing;
    5) if combined with frequency estimates, the TPI can provide reasonable ROI estimates that are typically not possible based on usability test results;
    6) the TPI takes confidence in results into account (confidence in disasters and non-confidence in successes) as an added parameter that, in our experience, is much more reliable than the satisfaction ratings we used to collect during usability tests.

    We find that focusing on “managing the tasks” makes a lot of other decisions much easier – e.g., what content is needed, when, where, and in what format. However, getting people to move from managing content to managing tasks often requires organizational change. Some of our clients are making that change and are seeing tremendous benefit.

    Is 15-20 Test Participants Enough?

    Obviously, recruiting a representative sample of website visitors is essential. If done correctly, we find the metrics stabilize within the first 12-15 people, to the point of diminishing returns. Does the TPI still have variability? Yes, typically about plus or minus 5-8%, but the important decisions are being made about the poorly performing tasks.

    If we get agreement from the client that 9 out of 10 people should be able to complete the task successfully, and we find that only 60% or fewer of the people can complete it, then only a simple binomial test is needed to show the significance of that finding at the p < 0.05 level – less than a 1-in-20 chance of this result occurring by chance.
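    A minimal sketch of that check in Python (using scipy; the 9-out-of-10 target is from the example above, and 9 successes out of 15 participants stands in for the “60% or fewer” case):

    ```python
    from scipy.stats import binomtest

    # Agreed target: 9 out of 10 people should complete the task (p = 0.9).
    # Observed: only 9 of 15 participants (60%) completed it.
    result = binomtest(k=9, n=15, p=0.9, alternative="less")
    print(f"p-value: {result.pvalue:.4f}")  # ~0.0022, well below 0.05
    ```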

    How and why do we calculate the “optimal time”?

    I should point out that we’ve been moving away from this particular term for a variety of reasons that have already been discussed. The time on task is the most important metric, but when combining it into a single score we need some way of measuring it relative to a target. Many of our clients see this as critical and demand some sort of comparison.

    We’ve been making the calculation more objective as the TPI has evolved. Basically, we work with client experts to identify the most efficient path for a given task. We then consider what reasonable changes could be made to the content, navigation, and interface to make that path a desirable target, given that it is a top task. Once we have agreement, we look at each step in the process and attach a best-practice time (based on other top industry websites) to completing each of the components – e.g., the number of seconds to position the cursor in a search field, enter a search string, and get the results page displayed.

    The component times are added together and we add 5 seconds to be conservative and to account for slight variations in registering the start and stop times for a task. This gives us a target value to go after that we and the client have agreed upon. It is not used for industry benchmarking, only internal benchmarking.
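    As described, the target time is simply the sum of the per-step best-practice times plus the 5-second buffer. A sketch with hypothetical component timings (the step names and seconds below are made up for illustration):

    ```python
    # Hypothetical best-practice component times (seconds) for a search-driven task.
    steps = {
        "position cursor in the search field": 2,
        "enter the search string": 5,
        "results page displayed": 3,
        "scan results and click the right link": 6,
        "locate the answer on the target page": 10,
    }

    BUFFER = 5  # conservative allowance for start/stop registration
    target_time = sum(steps.values()) + BUFFER
    print(f"Target time: {target_time} seconds")  # 31 seconds
    ```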

    What we find is that, as improvements are made, not only do the median times decrease but the variability of times drastically decreases as well because more participants immediately get on the right path and the path only has 2 to 5 steps.

    In terms of the impact of time on the overall TPI score, that has evolved over time as well. The current formula we use is based on a normal distribution, such that taking 2 times as long incurs less than a 5% penalty, 3 times as long incurs about a 10% penalty, and then it drops off more quickly, such that taking 6 times as long impacts the TPI by close to 50%. Even though the calculations are evolving, we are able to re-analyze past data, so we can create direct comparisons over time.
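    For illustration, a Gaussian-shaped multiplier with a fitted width reproduces those three anchor points (a sketch only; the width sigma = 4.25 is fitted to the figures above and is not the actual TPI formula):

    ```python
    import math

    def time_factor(ratio, sigma=4.25):
        """Multiplier applied for taking `ratio` times the target time.

        Gaussian-shaped fall-off; sigma is fitted to the anchors above
        (2x -> <5% penalty, 3x -> ~10%, 6x -> ~50%), not the real formula.
        """
        return math.exp(-((ratio - 1.0) ** 2) / (2 * sigma ** 2))

    for r in (2, 3, 6):
        print(f"{r}x the target time -> {1 - time_factor(r):.0%} penalty")
    # 2x -> 3%, 3x -> 10%, 6x -> 50%
    ```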

    I believe there is a place for both the TPI and usability testing and we’ve been using both with good results.

  8. In addition to the points Haakon makes, I have huge problems with this method. With TPI, I have to run a test with 20 participants to see whether my web page has improved or not. If the success rate goes up or the time to complete the task goes down, you see an improvement. To me, this seems like a massively slow way to improve your site. With analytics, feedback forms, and A/B testing I can get the same answer (is my website improving?) in close to real time. My question is simple: what does TPI bring to the table that we haven’t already got?
