Success Rate: The Simplest Usability Metric
Numbers are powerful (even though they are often misused in user experience). They offer a simple way to communicate usability findings to a general audience. Saying, for example, that "Amazon.com complies with 72% of the e-commerce usability guidelines" is a much more specific statement than "Amazon.com has great usability, but it doesn't do everything right."
Metrics are great for assessing long-term progress on a project and for setting goals. They are an integral part of a benchmarking program and can be used to assess if the money you invested in your redesign project was well spent.
Unfortunately, there is a conflict between the need for numbers and the need for insight. Although numbers can help you communicate usability status and the need for improvements, the true purpose of a user experience practice is to set the design direction, not to generate numbers for reports and presentations. Thus, some of the best research methods for usability (and, in particular, qualitative usability testing) conflict with the demands of metrics collection.
The best usability tests involve frequent small tests, rather than a few big ones. You gain maximum insight by working with 4–5 users and asking them to think out loud during the test. As soon as users identify a problem, you fix it immediately (rather than continue testing to see how bad it is). You then test again to see if the "fix" solved the problem.
Although small tests give you ample insight into how to improve design, such tests do not generate the sufficiently tight confidence intervals that traditional metrics require. Think-aloud protocols are the best way to understand users' thinking and thus how to design for them, but the extra time it takes for users to verbalize their thoughts contaminates task-time measures. Plus, qualitative tests often involve small tweaks from one session to the next and, because of that, metrics collected in such tests are rarely measuring the same thing.
Thus, the best usability methodology is the one least suited for generating detailed numbers.
Measuring Success
One of the more common metrics used in user experience is task success or completion. This is a very simple binary metric. When we run a study with multiple users, we usually report the success (or task-completion) rate: the percentage of users who were able to complete a task in a study.
Like most metrics, it is fairly coarse — it says nothing about why users fail or how well they perform the tasks they did complete.
Nonetheless, success rates are easy to collect and a very telling statistic. After all, if users can't accomplish their target task, all else is irrelevant. User success is the bottom line of usability.
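As a minimal sketch, with invented participant outcomes, the success-rate calculation is simply the share of participants who completed the task:

```python
# Minimal sketch: the binary success rate is the share of study participants
# who completed the task. The outcomes below are invented for illustration.

def success_rate(outcomes: list[bool]) -> float:
    """Return the fraction of participants who completed the task."""
    return sum(outcomes) / len(outcomes)

# True = participant completed the task, False = participant failed.
outcomes = [True, True, False, True, False]
print(f"Success rate: {success_rate(outcomes):.0%}")  # -> Success rate: 60%
```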
Levels of Success
Success rates are easy to measure, with one major exception: How do we account for cases of partial success? If users can accomplish part of a task, but fail other parts, how should we score them?
Let's say, for example, that the users' task is to order twelve yellow roses to be delivered to their mothers on their birthday. True task success would mean just that: Mom receives a dozen roses on her birthday. If a test user leaves the site in a state where this event will occur, we can certainly score the task as a success. If the user fails to place any order, we can just as easily score the task as a failure.
But there are other possibilities as well. For example, a user might:
- order twelve yellow tulips, twenty-four yellow roses, or some other deviant bouquet
- fail to specify a shipping address, and thus have the flowers delivered to their own billing address
- specify the correct address, but the wrong date
- do everything perfectly except forget to specify a gift message to enclose with the shipment, so that mom gets the flowers but has no idea who they are from
Each of these cases constitutes some degree of failure.
If a user does not perform a task as specified, you could be strict and score it as a failure. It's certainly a simple model: Users either do everything correctly or they fail. No middle ground. Success is success, without qualification.
However, we sometimes grant partial credit for a partially successful task. It can seem unreasonable to give the same score (zero) to both users who did nothing and those who successfully completed much of the task. How to score partial success depends on the magnitude of user error.
In the flower example, we might define several levels of success:
- complete success: the user places the order with no error, exactly as specified
- success with one minor issue: the user places the order but omits the gift message or orders the wrong flowers
- success with a major issue: the user places the order but enters the wrong date or delivery address
- failure: the user is not able to place the order
Of course, the precise levels of success will depend on the task and on your and your users' particular needs. (For example, if you did a survey and determined that most mothers would consider it a major offense to get tulips instead of roses, you might change the rating accordingly.)
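As a rough sketch of how such level definitions could be applied to individual sessions, consider the following; all observation fields are hypothetical, and a real study would adapt them to its own task:

```python
# Illustrative sketch: mapping observations from one flower-ordering session
# to the success levels defined above. All field names are hypothetical.
from dataclasses import dataclass

@dataclass
class OrderSession:
    order_placed: bool            # did any order go through?
    correct_flowers: bool         # a dozen yellow roses, as specified
    gift_message_included: bool   # minor issue if missing
    correct_address: bool         # major issue if wrong
    correct_date: bool            # major issue if wrong

def success_level(s: OrderSession) -> str:
    if not s.order_placed:
        return "failure"
    if not (s.correct_address and s.correct_date):
        return "success with a major issue"
    if not (s.correct_flowers and s.gift_message_included):
        return "success with a minor issue"
    return "complete success"

# Example: the order went through, but the gift message was forgotten.
session = OrderSession(True, True, False, True, True)
print(success_level(session))  # -> success with a minor issue
```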
Reporting Levels of Success
To report levels of success, you simply report the percentage of users at each level. So, for example, if 35 out of 100 users completed the task with a minor issue, you would say that 35% of your users were able to complete the task with a minor issue. As with any metric, you should also report the confidence interval for that number.
| Level of success | Number of users (out of 100) | How you report it |
|---|---|---|
| Complete success | 20 | 20% of our participants were able to complete the task successfully with no error. Based on this result, we expect that between 13% and 29% (*) of our general user population will complete the task with no error. |
| Success with a minor issue | 35 | 35% of our participants placed an order but had a minor issue. Based on this result, we expect that between 26% and 45% (*) of our general user population will complete the task with a minor error. |
| Success with a major issue | 30 | 30% of our participants placed an order but encountered a major issue. Based on this result, we expect that between 22% and 40% (*) of our general user population will complete the task with a major error. |
| Failure | 15 | 15% of our participants were not able to place the order. Based on this result, we expect that between 9% and 23% (*) of our general user population will not be able to place an order. |

(*) In this table, the ranges represent 95% confidence intervals calculated using the Adjusted Wald method.
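For illustration, the ranges marked (*) can be reproduced with the Adjusted Wald formula; the sketch below assumes 100 participants per row, as in the table:

```python
# Sketch of an Adjusted Wald 95% confidence interval for a proportion,
# applied to the counts from the table above (100 participants per row).
from statistics import NormalDist

def adjusted_wald_ci(successes: int, n: int, confidence: float = 0.95):
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # ~1.96 for 95%
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj             # adjusted proportion
    margin = z * (p_adj * (1 - p_adj) / n_adj) ** 0.5
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

for label, count in [("Complete success", 20), ("Minor issue", 35),
                     ("Major issue", 30), ("Failure", 15)]:
    low, high = adjusted_wald_ci(count, 100)
    print(f"{label}: observed {count}%, CI {low:.0%} to {high:.0%}")
# Complete success: observed 20%, CI 13% to 29%
# Minor issue: observed 35%, CI 26% to 45%
# Major issue: observed 30%, CI 22% to 40%
# Failure: observed 15%, CI 9% to 23%
```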
Note that this method simply amounts to using multiple metrics for success instead of just one — each level of success is a separate metric.
You can also use other metrics such as number of errors; for example, you could define different error types (e.g., wrong flowers, wrong shipping address) and track the number of people who made each of these errors. Doing so may actually give you a more nuanced picture than using levels of success because you might be able to say precisely which of the different errors is more common and, thus, focus on fixing that one.
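As a small sketch of this approach, with invented error labels and session data, counting how many participants made each error type could look like this:

```python
# Illustrative sketch: counting how many participants made each error type.
# The error labels and session data below are invented for illustration.
from collections import Counter

# One list of observed errors per participant; empty list = no errors.
sessions = [
    ["wrong flowers"],
    [],
    ["wrong shipping address", "missing gift message"],
    ["missing gift message"],
    [],
]

n = len(sessions)
# Use set() so each participant counts at most once per error type.
error_counts = Counter(err for errors in sessions for err in set(errors))
for error, count in error_counts.most_common():
    print(f"{error}: {count} of {n} participants ({count / n:.0%})")
```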
Do Not Use Numbers for Success Levels
A common error that people make when working with success levels is to assign numbers to them; for example, they may say:
- complete success = 1
- success with one minor issue = 0.66
- success with a major issue = 0.33
- failure = 0
Then, instead of reporting each level separately, they simply average these scores across their participants. In our example, they might say that the success rate is:
(20*1 + 35*0.66 + 30*0.33 + 15*0) / 100 = 0.53 = 53%
This approach is wrong! The numbers we assigned to the different levels of success are simply labels; they form an ordinal scale, not an interval or ratio scale. Even though there is an order across these levels (e.g., failure is worse than success with a major issue), the numbers themselves have no mathematical meaning: we cannot average them, because we have no guarantee that they are evenly spaced on a 0-to-1 scale (or whatever other scale we use between complete success and complete failure). In other words, we don't know, and have no reason to assume, that the difference between complete success and success with a minor issue is the same as the difference between failure and success with a major issue.
Since the temptation to average numbers is so strong in practice, we strongly recommend assigning word labels to levels of success instead of numbering them.
Source: https://www.nngroup.com/articles/success-rate-the-simplest-usability-metric/