Designing Match Scores That Humans Trust

Showing confidence intervals, explanations, and counter-examples: the UX of AI ranking systems done right.

MindTorc Product Team · AI & Design · Jan 22, 2026 · 6 min read
We built a supplier-matching feature that returned a confidence score alongside each result. Users ignored the scores for three months. When we dug into session recordings, we saw why: they could not interpret what "87% match" meant. 87% of what? Compared to what? Better or worse than what they had seen yesterday? The number existed, but it communicated nothing actionable.

Getting users to engage with AI-generated scores is a UX problem, not a model problem. Here is what we learned.

Raw probabilities are not user-meaningful

Model probabilities communicate something about the model's internal state, not about whether a result is actually good for a specific user in a specific context. The first change we made was mapping raw scores to human-readable categories: "Strong match," "Good fit," and "Partial match" conveyed more actionable information than 94%, 87%, and 71% while being just as honest about the ranking.

If you want to keep numbers for power users, add them as secondary information rather than the headline. Lead with the human-readable label and show the number on hover or in an expanded view. The users who want the raw number will find it; the users who would be confused by it will never see it.
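
As a rough illustration of the label-first approach, here is a minimal Python sketch. The thresholds, the "Weak match" fallback label, and the names `ScoreDisplay` and `present_score` are assumptions made for this example, not values from our production system; you would tune the cut-offs against your own score distribution.

```python
from dataclasses import dataclass

# Hypothetical cut-offs, checked from highest to lowest.
LABEL_THRESHOLDS = [
    (0.90, "Strong match"),
    (0.80, "Good fit"),
    (0.60, "Partial match"),
]
FALLBACK_LABEL = "Weak match"  # assumed label for anything below the last cut-off


@dataclass
class ScoreDisplay:
    headline: str  # shown by default
    detail: str    # shown on hover or in an expanded view


def present_score(score: float) -> ScoreDisplay:
    """Map a raw score in [0, 1] to a label-first display."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    label = next(
        (name for floor, name in LABEL_THRESHOLDS if score >= floor),
        FALLBACK_LABEL,
    )
    return ScoreDisplay(headline=label, detail=f"{score:.0%} match score")
```

With these assumed thresholds, 0.94, 0.87, and 0.71 come out as "Strong match," "Good fit," and "Partial match," and the raw percentage stays available in `detail` for anyone who expands the result.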

Explanations move the needle more than accuracy

The biggest trust driver for match scores is not how accurate they are. It is whether users understand why a result got the score it did.

When we added a one-line explanation to each match ("Strong alignment on industry, budget, and delivery timeline"), session depth increased 34% and users started acting on lower-scored matches they had previously skipped. The explanation gave them enough context to apply their own judgment rather than either blindly following the number or ignoring it entirely.

This requires your model to surface feature importance in a way that can be translated into natural language. For tree-based models this is relatively straightforward, because per-feature contributions are cheap to compute. For neural networks you typically need a post-hoc attribution method (such as SHAP or integrated gradients), attention visualization, or a parallel explanation model. It is real engineering work, but it changes how the product feels to users far more than another point of model accuracy does.
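
To make the translation step concrete, here is a small Python sketch that turns per-feature contributions into a one-line explanation. It assumes the contributions have already been computed upstream by whatever attribution method fits your model; the feature names, the phrase table, and the `explain_match` function are hypothetical and exist only for this example.

```python
# Hypothetical mapping from model feature names to user-facing phrases.
FEATURE_PHRASES = {
    "industry_overlap": "industry",
    "budget_fit": "budget",
    "delivery_timeline_fit": "delivery timeline",
    "region_match": "region",
}


def join_phrases(items: list[str]) -> str:
    """Join phrases as 'a', 'a and b', or 'a, b, and c'."""
    if len(items) == 1:
        return items[0]
    if len(items) == 2:
        return f"{items[0]} and {items[1]}"
    return ", ".join(items[:-1]) + f", and {items[-1]}"


def explain_match(
    contributions: dict[str, float],
    top_k: int = 3,
    min_contribution: float = 0.05,  # assumed cut-off for "worth mentioning"
) -> str:
    """Build a one-line explanation from the strongest positive contributions."""
    strongest = sorted(
        (
            (name, value)
            for name, value in contributions.items()
            if value >= min_contribution and name in FEATURE_PHRASES
        ),
        key=lambda item: item[1],
        reverse=True,
    )[:top_k]
    if not strongest:
        return "No single factor stands out for this match"
    phrases = [FEATURE_PHRASES[name] for name, _ in strongest]
    return f"Strong alignment on {join_phrases(phrases)}"
```

Keeping the phrase table separate from the model means product and design can edit the wording without touching the attribution code, and features without an approved phrase are simply never shown.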

Counter-examples build trust in high-confidence results

When a score is very high, show what would lower it. "This is a strong match. If you need on-site presence or a team under 10 people, these alternatives might be worth a look." This is counter-intuitive because you are giving users reasons to look elsewhere. But it dramatically increases trust because it demonstrates that the score is reasoned, not arbitrary. Users who see counter-examples report higher satisfaction even when they still choose the top result.

Handle uncertainty as a feature, not a failure mode

Not every result scores above 90%. Most result lists also include matches scored below 50%, and how you display those low-confidence results matters a lot. A low number without context reads as "this result is bad." But low confidence often means something else: the model has not seen enough examples of this match type, or there is missing information that would resolve the ambiguity.

Label uncertain results with the reason for the uncertainty: "We do not have enough data about this vendor yet" or "This match type needs more information to score accurately." This prevents users from interpreting low confidence as low quality, which would cause them to miss results that are actually good candidates.

Measure downstream outcomes, not clicks

Match score UX is difficult to A/B test because you are affecting a judgment process, not a discrete action. The right measure is downstream: did users who engaged with scores make better decisions than users who did not?

We measure this through holdback experiments where a fraction of users see scores and explanations while the rest see a simple ranked list, then track downstream success metrics over several weeks. This takes longer to run than a typical A/B test but it tells you whether the AI is actually helping users, not just whether they click on it. That distinction matters a lot when you are deciding how much to invest in making the system more sophisticated.
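
For readers who want to set up something similar, here is a hedged Python sketch of the two pieces involved: a stable assignment of users to the group that sees scores, and a simple comparison of downstream success rates with a normal-approximation 95% interval. The experiment name, the function names, and the default fraction are assumptions for this example, and a real analysis would likely need more care (multiple metrics, sequential looks, segment effects).

```python
import hashlib
import math


def sees_scores(
    user_id: str,
    experiment: str = "match-score-ux",  # hypothetical experiment key
    treated_fraction: float = 0.5,
) -> bool:
    """Deterministically assign a user to the group that sees scores."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # roughly uniform in [0, 1)
    return bucket < treated_fraction


def success_rate_lift(
    treated_successes: int,
    treated_total: int,
    control_successes: int,
    control_total: int,
) -> tuple[float, tuple[float, float]]:
    """Difference in success rates plus a 95% normal-approximation interval."""
    p_t = treated_successes / treated_total
    p_c = control_successes / control_total
    diff = p_t - p_c
    std_err = math.sqrt(
        p_t * (1 - p_t) / treated_total + p_c * (1 - p_c) / control_total
    )
    return diff, (diff - 1.96 * std_err, diff + 1.96 * std_err)
```

Hashing the user ID together with the experiment name keeps each user in the same group for the full multi-week window, which is what makes downstream outcomes comparable. If the interval returned by `success_rate_lift` comfortably excludes zero, that is evidence the scores are changing decisions and not just attracting clicks.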