Benchmarking, at its core, is a process of test, change, re-test. You can look outward to benchmark against your competitors or other industries, or you can take a more introspective approach, setting a baseline for future iterations of the same experience. Setting up research on a recurring cadence to gather results and then retest at a later date can provide immense value. Not only does it set you up for long-term success, but it also keeps teams accountable, giving you a structure and deadline for implementing changes.
Clients often ask us about benchmarking research, but over the years, we’ve found everyone’s definition is slightly different. At AnswerLab, we conduct a variety of benchmarking research types customized to our clients’ needs. We find most benchmarking can be categorized into one of three buckets, ranging from comparing experiences to one another qualitatively all the way to gathering data and metrics for a pure quantitative diagnostic of your product. Each of these approaches is used for slightly different product needs and questions and has its own considerations. Let’s dive in.
Setting up your benchmarking program
Conducting a true qualitative benchmark:
When you need to understand how your users interact with your product compared with another product (and the whys behind it)
True qualitative benchmarking is best used when you want to understand how high-level experiences compare to one another and how your customers feel about those experiences. This form of benchmarking is not just about what’s happening, but the why behind it. These benchmarking ratings are represented by colors or other qualitative descriptors, not numeric metrics and summary percentages.
Typically, this is more appropriate for foundational or early-stage research where you’re looking for information on how experiences are being perceived, what is liked or disliked, and the differences between them. This type of benchmarking can be useful when your stakeholders want to compare experiences to one another but are looking for directional feedback rather than statistically significant metrics. Findings from this research can help inform next steps and create a roadmap for future development. And it’s very flexible: you can do this with a group as small as 6 participants, but you can always include more. When you’re speaking with participants, this option should always be moderated, since it’s valuable to ask qualitative questions and probe when you want to dig into their responses. However, you can also take a heuristic approach, relying on the researcher to explore different experiences and rate particular tasks or components against one another.
Some common examples of where this might be used:
- You want to compare the experiences of recovering and resetting your password on a set of different websites to understand what makes a good experience as you begin developing or iterating on your own.
- You want to understand participants’ experiences of setup and onboarding on two or three similar products and sites to inform how to improve your onboarding experience.
Tips from our team:
- Make sure your prototypes or live accounts are set up and ready to go. For participants to interact with each of these experiences and sites seamlessly during a session, the research team will need access to the tools or technology for each of them. This might require prep time on the front end to set up dummy accounts and settings to keep things smooth and focused while you’re in research.
- Keep your sessions light. With this form of benchmarking, the moderator must ask the participants to think aloud, which means leaving significant time for each task. Keeping your scope and research plan light is critical in this scenario so participants have the time to really explain their thinking.
- Rely on expert findings, rather than error counts. We find value in this form of benchmarking because the moderator can point out which issues rise to the top and why, rather than relying on quant metrics that tell you what happened but not the cause. You won’t see hard numbers with this form of benchmarking, but rather scorecards and color coding to show what’s happening.
Quantitative benchmarking with a qualitative twist:
When you need to see how users are approaching defined tasks and experiences (often with a moderator for qualitative insights)
This form of benchmarking focuses on defining what’s actually happening during the participant’s experience in quantifiable terms. It is best used when you want to understand how very specific experiences perform compared with their previous performance and/or with other similar experiences. In this method, you would recruit enough participants to get around 20-40 attempts per task. You might track observed metrics like success rate and main reasons for failure, as well as participant-reported metrics like how familiar the participant was with the task and how difficult they found it.
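To make those observed metrics concrete, here is a minimal sketch of how a success rate from roughly 30 attempts might be tallied. The task, the counts, and the Wilson-interval calculation are our own illustration rather than a formula we prescribe; the wide interval is a reminder of why these numbers are most valuable when paired with qualitative observation.

```python
# A minimal sketch (hypothetical task and counts) of tallying an observed
# success rate from a moderated benchmark with ~30 attempts per task.
# The Wilson score interval shows how wide the uncertainty still is at this
# sample size, which is why we treat the numbers as directional.
from math import sqrt

def wilson_interval(successes: int, attempts: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a task success rate."""
    p = successes / attempts
    denom = 1 + z**2 / attempts
    center = (p + z**2 / (2 * attempts)) / denom
    margin = z * sqrt(p * (1 - p) / attempts + z**2 / (4 * attempts**2)) / denom
    return center - margin, center + margin

# Example: 24 successes out of 30 attempts on a hypothetical "reset password" task.
successes, attempts = 24, 30
low, high = wilson_interval(successes, attempts)
print(f"Success rate: {successes / attempts:.0%} (95% CI roughly {low:.0%}-{high:.0%})")
# -> Success rate: 80% (95% CI roughly 63%-90%)
```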
We typically see this form of benchmarking used when tasks are subjective and complex enough that you can’t use an unmoderated quant study. Unmoderated research tools are great at setting up structured and repeatable research tasks, but the technology isn’t sophisticated enough to determine task success or failure without a human reviewing and coding hours of video footage. However, if there is a way for task success to be self-reported or tracked, unmoderated research may be an option.
Task-based unmoderated research reports can tell you which tasks stumped people and how complicated or difficult participants found them. But if you don’t qualitatively note the error paths or where people got stuck, the report is just a data sheet of numerical values and far less useful for making the right changes. By observing sessions, our research experts can comment on the why behind the numbers so you can actively improve task success rates.
Some common examples of where this might be used:
- You want to see how the features of your product are working today and re-evaluate after product development
- You want to see how easy or difficult it is to complete tasks on your products compared with your competitors’ products
Tips from our team:
- Consistency is key. During benchmarking studies, the researcher (or the research team, if you’re using multiple moderators) has to create a nearly identical experience across participants to keep the data consistent and avoid influencing the participants or the results. If your number of participants requires multiple moderators, break up session days with regroup and analysis days to ensure full communication amongst researchers. Unlike in typical moderated research, we script everything the moderator says, including their responses to common participant questions. Ideally, there should not be a single moment of unpredictability.
- Rapport building is particularly important for benchmarking sessions, as it is not uncommon for participants to be on the defensive given the task-based nature of the session. If they feel they’re being tested or evaluated, participants can become nervous or stressed, and may withdraw, literally or emotionally, midway through the session if the moderator doesn’t address this appropriately. Leave some breathing room during the session to give the participant a short break after a couple of difficult tasks. Ask them a qualitative question to let them recover and feel heard before jumping back into another task.
- Take the “human” element into account when reviewing your results. When you’re conducting moderated benchmarking, you introduce a human element that can influence your results. Participants might talk nervously during a task or be self-deprecating if they fail a few tasks in a row. Participants behave differently because there’s a moderator present; it can get a little performative. This is a challenge, and a big reason why we don’t recommend using time on task in this style of benchmarking: it can give the false impression of being a clear indicator of usability, when in fact time spent per task doesn’t always cleanly map to usability.
Taking a pure quantitative approach:
When you need data and metrics to establish your product’s benchmark at scale
Pure quantitative benchmarking enables you to establish a quantitative baseline or set of standard metric scores that may be used for re-assessing your experience at regular intervals or after major updates and changes. These baselines may be established for a single product or across product lines. For pure quantitative benchmarking, we recommend a minimum of 200 participants per segment. By including large numbers of participants in these studies, you develop a data-driven foundation for understanding your product and the broader digital landscape.
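As a rough illustration of what sample size buys you, here is a back-of-the-envelope sketch of how the margin of error on a simple proportion metric (such as task success or top-2-box satisfaction) tightens as the per-segment sample grows. The formula and sample sizes below are illustrative assumptions, not a prescribed standard.

```python
# Back-of-the-envelope margin of error for a proportion metric at various
# per-segment sample sizes. p = 0.5 is the worst case; all numbers are
# illustrative assumptions, not a prescribed sample-size standard.
from math import sqrt

def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a proportion with n responses."""
    return z * sqrt(p * (1 - p) / n)

for n in (50, 100, 200, 400):
    print(f"n = {n:>3}: ±{margin_of_error(n):.1%}")
# n =  50: ±13.9%
# n = 100: ±9.8%
# n = 200: ±6.9%
# n = 400: ±4.9%
```

Under these assumptions, a 200-person segment keeps the margin of error under about seven points, which is the kind of precision that lets you treat changes over time as signal rather than noise.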
We conduct this form of benchmarking as either “point-in-time” or “ongoing.” For point-in-time, you would typically collect data to establish a baseline for a specific moment, then conduct studies at future points to compare against that baseline. In ongoing benchmarking, by contrast, we run surveys continuously, so we’re always collecting new data. We recommend conducting quarterly assessments of the data to understand how things are changing on a regular basis.
Some common examples of where this might be used:
- You want to conduct a continuous intercept survey on your website to monitor the impact of changes made to your site, discover conversion rates, and develop an understanding of how well you’re meeting customer needs.
- You want to launch a survey to understand user reactions to three prototypes for a new user flow on your website. The participant completes sample tasks and then answers a series of questions about the experience on a Likert scale.
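For that second example, the aggregation might look something like the toy sketch below. The prototypes, responses, and top-2-box summary are invented purely to show the shape of the data; a real study would have hundreds of responses per segment.

```python
# Toy Likert aggregation (1 = strongly disagree ... 5 = strongly agree) for a
# hypothetical statement like "This flow was easy to complete," grouped by
# prototype. A real benchmark would have hundreds of responses per segment.
from statistics import mean, stdev

responses = {
    "Prototype A": [4, 5, 3, 4, 4, 5, 2, 4],
    "Prototype B": [3, 2, 4, 3, 3, 2, 3, 4],
    "Prototype C": [5, 4, 5, 4, 5, 5, 4, 3],
}

for name, scores in responses.items():
    top2 = sum(s >= 4 for s in scores) / len(scores)  # share of 4s and 5s
    print(f"{name}: mean {mean(scores):.1f} (sd {stdev(scores):.1f}), top-2-box {top2:.0%}")
# Prototype A: mean 3.9 (sd 1.0), top-2-box 75%
# Prototype B: mean 3.0 (sd 0.8), top-2-box 25%
# Prototype C: mean 4.4 (sd 0.7), top-2-box 75%
```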
Tips from our team:
- Keep your tasks clear. This form of benchmarking works best when your tasks are straightforward enough to be evaluated without human observation. With sample sizes this large, you need these tasks to be self-reported or tracked through the software. There should not be any gray area or confusion that requires a human moderator to witness a session.
- Plan for the future. Consider which metrics will be useful to you over time, while allowing space to answer more immediate questions. Putting in the time at the beginning of the process to align metrics around research goals and plan for future benchmarks will help reduce confusion and get your team on the same page for the long term. Stay consistent with the measurements being used over time.
- Build a plan for analysis. It's easy to get lost in the vast amounts of data being collected in these studies. Instead of wandering through data, create an analysis plan (even before fielding) and stick to it. If you have time remaining, explore the data for additional relevant findings.
Building on an existing benchmarking program
While the three approaches above focus on setting up a benchmarking program, some clients come to us with a pre-existing set of metrics and standards they’ve been using for years. If you already have a benchmark set up and just need to continue running the program, we’ll help build an ongoing study to take over those needs.
What about metrics?
There are a lot of metrics you can collect during benchmarking research, but teams sometimes assume certain metrics correlate with product comprehension or usability when, in fact, the relationship is not always direct. We hear a lot of requests for benchmarking metrics such as time on task, click count, and error count, but these can be flawed.
Time on task, for example, is not always feasible to collect or a reliable metric, depending on your approach. Time is only one facet of an experience, and spending more time on a task can reflect excitement or exploration, not necessarily poor usability. As for error counts, preparing to collect this data requires a good deal of upfront work to compile a list of anticipated errors, and that list often overlooks unknown errors that surface during research sessions. We often find that simply counting errors isn’t as valuable as a qualitative discussion around high-frequency errors to understand the why behind them.
Next in this series: Three Common Benchmarking Metrics to Ditch and What to Use Instead