Mystery Shopper Research: Performance Standards (Part 5)

Welcome to Part 5 of our Mystery Shopper Research Series – where we explore how to select, define, and measure Performance Standards.


In Part 5 we:

  • Define Performance Standards—how to select, score, and report them
  • Share practical examples

Let’s Refer Back to the 7-Step Design Process

In this post we cover Step 4:

  1. What do I want to learn? Define and document your research objectives
  2. What scenarios do I want to test? Define specific journey and/or touchpoint scenarios. Categorize each as existing Customer, prospective Customer, or observation-only.
  3. Special profiling & selection needs: Specify what shoppers need to be or have (such as a status or product type)
  4. Select performance standards: Define the standards to be scored and the scoring logic (binary and scaled) with rubrics
  5. Plan volumes & timing: Build a fieldwork calendar with finalized volumes and cadence
  6. Design reporting: Decide how results will be presented—both as they come in (such as red flags) and at the final readout
  7. Design & execute training: Provide paths and scripts for shoppers, plus the tasks and equipment (such as recording devices) needed.

You Decide What Behaviors You Want to Measure

You’ve reached an important step in your Mystery Shopper Research program design.

In this step, you decide, specifically, which behaviors and conditions to measure and report numerically.  A behavior could be empathy expressed by Customer Service staff, while a condition could be that the hotel lobby was clean and organized.

At the end of the program, each behavior (and condition) will be presented with objective scores — supported by relevant qualitative comments.

The possibilities for what you want to measure with Mystery Shopper Research are nearly endless. We’ve measured literally thousands of different behaviors and conditions over the years.

So to help our Clients imagine what can be measured, I share this list — starting with a question:

How well do we…?

  • Deliver service
  • Drive sales
  • Provide an effective digital journey
  • Adhere to policies & procedures
  • Know products & services
  • Convey brand image
  • Pick up and handle sales signals
  • Manage queues and response times
  • Handle questions about competitors
  • Hand over Customers from one person or function to another
  • Give bad news when we have to
  • Compare to the competition
  • Show empathy
  • Present the right environment
  • Use branded words & phrases
  • Upsell / Cross-sell

The opportunities to select meaningful Performance Standards are many and varied.

Selecting the Right Performance Standards Matters

As you clarify your research objectives and determine what you want to learn (Step 1), the relevant Performance Standards tend to emerge.

Here are examples:

  • If your objective is to evaluate the application of a sales skills training, measure the sales behaviors taught.
  • If your objective is to assess product enquiry handling across 10 markets, focus on the behaviors required to resolve that enquiry well.
  • If your objective is to learn how best-in-class organizations say ‘no’, select standards for empathy, positive language, and specific ‘saying no’ techniques.

Focus on selecting Performance Standards that directly align to your research objective.

Resist the temptation to let everyone jump in and say what should be measured

If you let different organizational functions pile in with their own wishlists, you’ll end up creating a Frankenstein program.

Remember Frankenstein?

His monster is made out of body parts taken from different bodies.  And in Mystery Shopper Research, it’s pretty easy to spot a Frankenstein program.

Because there’s no central theme or direction to follow. So when scores come out, it becomes difficult to organize results into a coherent narrative and set of actions.

There are times when it makes sense to consider a separate or different Mystery Shopper or research program — rather than try to have one program with everything rolled into it.

Here’s a Model I Use to Help Select Performance Standards

When selecting Performance Standards (a Quality topic), I consider three input areas:

  • Input 1 — Who are we? (The Organization)
    Vision, Mission, Values, Brand attributes, Key business objective (We once counted ‘friendly’ 17 times on a client’s website — so we advised them to measure friendliness.)

  • Input 2 — What do they want? (Customer expectations)
What are our Customers’ expectations?  If Customers want patience, measure patience. If they want easy access, measure access, response time, and effectiveness. Your Voice of Customer data can help here.

  • Input 3 — What must we do? (Regulatory requirements)
    For regulated sectors, include required compliance items as needed.

When Clients get stuck selecting relevant Performance Standards, these three inputs can help them get unstuck.

Each Performance Standard is scored – here’s how

There are two ways to measure a Performance Standard.  Let’s work through each one so that you can make better decisions about how to measure your chosen standards.

Binary Compliance Standards

Some standards are best measured on a binary scale:

  • Yes / No
  • 1 / 0
  • Did it happen / Did it not happen

For example: was the appropriate Greeting delivered (or not)? Was the appropriate Verification conducted (or not)?

Do we really need to measure a Greeting behavior on a 4-point scale?  Definitely not.  Compliance standards are about presence/absence, not degree of achievement.

Scaled Range/Calibre Standards

Some standards are best measured on a range/calibre scale:

  • 3, 2, 1
  • Excellent / Good / Fair / Poor
  • How well something was done

For example, if you score the Tone of Voice in a Contact Center call as ‘Good’ — it means that the Tone was better than ‘Fair’ but not yet ‘Excellent’.  So there’s an opportunity to make it better — that’s the value of having a range.

The key to success is to specifically define what each level looks like with examples.

I’ve learned to start with the highest level of performance — whether that’s Excellent or 3 on a 3-point scale.  Define what the ‘best’ performance looks like. Then work your way down to the next level — Good or 2 on a 3-point scale.

I’ve learned that the best way to define a ‘2’ or a ‘Good’ score is to share what held this back from being a ‘3’ or ‘Excellent’.

Because it’s so much easier to describe what was missing when you coach someone to a higher level of performance.

If you’re struggling to decide between compliance or calibre, ask:

  • Are we checking for presence? (That’s compliance)
  • Or are we checking for quality/degree? (That’s calibre)
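The two scoring types above can be sketched as a simple data structure. This is a hypothetical illustration only — the standard names, weights, and scoring function are invented for this example, not taken from any particular Mystery Shopper platform:

```python
# Hypothetical sketch: binary (compliance) and scaled (calibre)
# Performance Standards, scored together into one visit percentage.
from dataclasses import dataclass

@dataclass
class Standard:
    name: str
    kind: str          # "compliance" (binary) or "calibre" (scaled)
    max_points: int    # 1 for binary; top of the scale for calibre

def score_visit(standards, observations):
    """Convert raw observations into a percentage score.

    observations maps each standard's name to the points awarded:
    0/1 for compliance, 1..max_points for calibre.
    """
    earned = sum(observations[s.name] for s in standards)
    possible = sum(s.max_points for s in standards)
    return round(100 * earned / possible, 1)

standards = [
    Standard("Greeting delivered", "compliance", 1),
    Standard("Verification conducted", "compliance", 1),
    Standard("Tone of voice", "calibre", 3),  # 3 = Excellent, 2 = Good, 1 = Fair
]

# One shopper's visit: greeting given, verification missed, tone rated "Good".
visit = {"Greeting delivered": 1, "Verification conducted": 0, "Tone of voice": 2}
print(score_visit(standards, visit))  # 60.0
```

Note how the binary and scaled standards coexist in one score: the compliance items contribute all-or-nothing points, while the calibre item contributes a graded amount.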

Which Standards Do Customers Remember?

Which standards do Customers feel more? Compliance standards?  Or Calibre standards?

Calibre.  Because that’s where humanity, tone, and style live.

Compliance behaviors matter of course — we’re not dismissing verification, for example.  But compliance standards don’t carry the same emotional resonance.

Compliance behaviors tend to be expected — such as wearing a clean professional uniform — and are mostly noticed when they’re not up to standard (a dirty uniform) or even missing (no Greeting was given).

Performance Standard Scores Should Be Objective

Every Performance Standard in the program is individually measured and has a score attached to it.  This means that every Performance Standard needs to be clearly defined along with clear scoring logic.

Mystery Shoppers should not score based on their opinion — that’s where many Mystery Shopper programs fail.  Mystery Shoppers are not real Customers.

The main focus of the program is not to gather Mystery Shopper perceptions — though of course their perceptions can provide The Magic 20% insight which we spoke about earlier in this series.

Mystery Shoppers are stand-ins for real Customers.  They follow a pre-designed scenario across a pre-selected set of journeys and touchpoints, and make observations along the way.

And they are paid to do this.

How to Ensure Your Scores Are Accurate

Calibration means anyone involved in scoring scores the same way — with shared definitions and clear distinctions between scoring levels for each standard.

Sometimes, depending on the simplicity or scale of the program, it is possible to calibrate across all Mystery Shoppers — and get everyone on the same page. 

But when you’re dealing with a complex or large scale program, calibrating the individual Mystery Shoppers isn’t always possible.

For these cases, we employ a small Quality Assurance team to score the recordings/transcripts.

Or we have them interview the Shopper (e.g., when language differs) to ensure accurate scoring and capture supporting qualitative input.
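Calibration can also be spot-checked numerically — for example, by having a QA scorer and a Mystery Shopper score the same recording and comparing the results. A minimal sketch, with invented scorer data (percent agreement is just one simple measure; it does not account for chance agreement):

```python
# Hypothetical calibration spot-check: how often did two scorers award
# the same score to the same standard on the same recording?

def percent_agreement(scores_a, scores_b):
    """Share of standards where both scorers gave identical scores."""
    assert scores_a.keys() == scores_b.keys()
    matches = sum(1 for k in scores_a if scores_a[k] == scores_b[k])
    return round(100 * matches / len(scores_a), 1)

# Invented example data: one recording, scored independently twice.
qa_scorer = {"Greeting": 1, "Verification": 1, "Tone": 3, "Empathy": 2}
shopper   = {"Greeting": 1, "Verification": 0, "Tone": 3, "Empathy": 2}

print(percent_agreement(qa_scorer, shopper))  # 75.0
```

A low agreement figure on a particular standard is a signal that its scoring rubric needs sharper level definitions, or that the scorers need a recalibration session.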

Where You Are in This Series

In this Part 5 article we:

  • Defined Performance Standards—how to select, score, and report them
  • Shared practical examples

We are currently working on the Part 6 article, with an intended focus on training, operations, and execution.

Thank you for reading!

I regularly share stories, strategies, and insights from our work across Contact Centers, Customer Service, and Customer Experience.  If this resonates, I’d love to stay connected.

You can drop me a line anytime, or subscribe via our website.

Daniel Ord
[email protected]
www.omnitouchinternational.com

Daniel Ord at the J. Paul Getty Villa in Malibu
