Education Resource

What is an Agent Quality Score?

The single, trackable number that tells you whether an AI agent is doing its job well — rolled up from its evals.

Updated 13 June 2026 · 5 min read · 3 sections

TL;DR

An agent quality score is a single, trackable measure of how well an AI agent performs its job, rolled up from the underlying evals — correctness, groundedness, task completion, policy adherence. It exists so teams can answer 'is this agent good, and is it getting better or worse?' with one number per agent and per version, instead of a wall of disconnected metrics.

What goes into an agent quality score?

An agent quality score is a composite of the eval signals that matter for the agent's job: task completion rate, groundedness (whether its claims are supported), tool-use correctness, policy adherence, and often a penalty for hallucinations or out-of-scope actions. Operational signals like cost and latency are usually tracked alongside the score rather than folded into it, so a cheap-but-wrong agent does not score well. The exact weighting is yours to set. What matters is that it stays consistent across versions, because changing the weights changes the number even when the agent has not changed.

How is an agent quality score calculated?

Each case in your eval dataset is scored by its graders (rule-based, LLM-as-a-judge, or human). Those per-case scores are aggregated into the component metrics, and the components are combined into one weighted score per agent version. Because the same dataset and graders are applied every time, the score is comparable from run to run, so a release that lowers the number can be caught before it ships.
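The roll-up described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the component names, the weights, the 0-to-1 score range and the shape of the case records are all assumptions made for the example, and a simple mean is only one possible way to aggregate per-case scores.

```python
from collections import defaultdict

# Hypothetical weights for illustration. They sum to 1 and should stay
# fixed across agent versions so that scores remain comparable.
WEIGHTS = {
    "task_completion": 0.4,
    "groundedness": 0.3,
    "tool_use": 0.2,
    "policy_adherence": 0.1,
}


def component_means(case_results):
    """Average the grader scores (each assumed to be in [0, 1]) per component."""
    collected = defaultdict(list)
    for case in case_results:
        for component, score in case["scores"].items():
            collected[component].append(score)
    return {name: sum(values) / len(values) for name, values in collected.items()}


def quality_score(case_results, weights=WEIGHTS):
    """Combine the component means into one weighted score for an agent version."""
    means = component_means(case_results)
    missing = set(weights) - set(means)
    if missing:
        raise ValueError(f"no grader scores for components: {sorted(missing)}")
    return sum(weights[name] * means[name] for name in weights)


# Two made-up eval cases, each already scored by its graders.
cases = [
    {"scores": {"task_completion": 1.0, "groundedness": 0.8,
                "tool_use": 1.0, "policy_adherence": 1.0}},
    {"scores": {"task_completion": 0.0, "groundedness": 0.6,
                "tool_use": 1.0, "policy_adherence": 1.0}},
]

print(round(quality_score(cases), 2))  # 0.71
```

Keeping the component means available separately, rather than only the final number, is what makes the breakdown usable later for diagnosing which part of the agent regressed.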

How do you use an agent quality score?

There are three main uses. Gate releases: block a deploy if the score regresses against the current version. Monitor live: score a sample of production traffic so the number reflects reality, not just the test set, and alert when it drifts down. Prioritise work: the component breakdown shows whether to fix retrieval, prompts, tools or the model. It is the headline metric that makes agent quality a managed trend rather than an anecdote.
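The three uses above might look like this in code. This is a hedged sketch under stated assumptions: the function names, the tolerance of 0.01, the window of 50 samples and the maximum drop of 0.05 are hypothetical values chosen for illustration, and scores are assumed to lie between 0 and 1.

```python
def gate_release(candidate_score, baseline_score, tolerance=0.01):
    """Return True if the candidate may ship, i.e. it has not regressed
    below the current version's score by more than `tolerance`."""
    return candidate_score >= baseline_score - tolerance


def has_drifted(live_scores, baseline_score, window=50, max_drop=0.05):
    """Return True when the mean of the most recent `window` sampled
    production scores is more than `max_drop` below the baseline."""
    recent = live_scores[-window:]
    if not recent:
        return False  # nothing sampled yet, so nothing to alert on
    return sum(recent) / len(recent) < baseline_score - max_drop


def weakest_component(component_scores):
    """Return the name of the lowest-scoring component, as a first
    pointer to where improvement work should go."""
    return min(component_scores, key=component_scores.get)


# Made-up numbers for illustration.
print(gate_release(candidate_score=0.80, baseline_score=0.82))  # False
print(has_drifted([0.70] * 10, baseline_score=0.80))            # True
print(weakest_component({"groundedness": 0.6, "tool_use": 0.9}))  # groundedness
```

A small tolerance on the gate avoids blocking releases over noise from non-deterministic graders. How large it should be depends on how much the score varies between repeated runs on the same version.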

Track a quality score for every agent with Prefactor

Prefactor gives enterprises runtime governance, observability, and control over every AI agent in production.
