
What is Agent Evaluation?

The shift from evaluating models at dev time to evaluating agents in production — what it means, what it measures, and why model benchmarks don't tell you if your agent works.

Updated 13 June 2026 · 5 min read · 4 sections

TL;DR

Agent evaluation is the practice of measuring whether an AI agent actually does its job in production — scoring its real behaviour across multi-step tasks, tool calls, and retrieved context, continuously. It is distinct from LLM evaluation, which scores a model in isolation on benchmarks at development time. As teams move from shipping models to shipping agents, evaluation has to move with them: from the model to the agent, from dev time to production, from sampled benchmarks to continuous scoring of live behaviour.

Agent evaluation vs LLM evaluation

The two sound similar and are often confused, but they measure different things at different layers.

LLM evaluation (and LLM observability) scores the model. It runs a model against a benchmark or a sampled dataset, usually at development time, and asks 'how good is this model at this task in isolation?' BLEU, MMLU, or a RAG faithfulness score on a fixed set are all useful signals, but they are signals about a component.

Agent evaluation scores the agent. The agent is the model plus its tools, its retrieval, its memory, its multi-step plan, and the actions it takes in the real world. Agent evaluation asks a different question: 'did this agent complete the task correctly, use the right tools, stay grounded and in policy, and do it at acceptable cost — on real production traffic, right now?'

This is the shift. You can have a state-of-the-art model and a broken agent. Evaluating the model tells you the engine is good; evaluating the agent tells you the car gets people where they are going.

Why model benchmarks don't tell you if your agent works

A model can top every public leaderboard and still fail as an agent. It can pick the wrong tool, call it with malformed arguments, lose the thread across a long task, hallucinate an action that was never authorised, or quietly degrade when a provider ships an update — none of which a static model benchmark measures.

Benchmarks are run once, on curated data, in isolation. Agents run continuously, on messy real inputs, with side effects. The properties that decide whether an agent is trustworthy in production — task completion, tool-use correctness, groundedness, policy adherence, cost and latency at the session level — only show up when you evaluate the agent doing its actual job. That is why agent evaluation is a production discipline, not a pre-launch checkbox.
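Two of the failure modes named above, malformed tool arguments and unauthorised actions, can be caught with simple rule-based checks over a session's tool calls. The following is a minimal sketch only: the tool names, the `TOOL_SCHEMAS` registry, the `ALLOWED_TOOLS` policy, and the `ToolCall` record are all hypothetical and stand in for whatever your agent framework actually records.

```python
from dataclasses import dataclass

# Hypothetical tool registry: tool name -> required argument names and types.
TOOL_SCHEMAS = {
    "lookup_order": {"order_id": str},
    "issue_refund": {"order_id": str, "amount": float},
}

# Hypothetical policy: the only tools this agent is authorised to call.
ALLOWED_TOOLS = {"lookup_order"}


@dataclass
class ToolCall:
    name: str
    args: dict


def tool_call_is_valid(call: ToolCall) -> bool:
    """True if the tool exists and its arguments match the schema exactly."""
    schema = TOOL_SCHEMAS.get(call.name)
    if schema is None:
        return False  # the agent called a tool that does not exist
    if set(call.args) != set(schema):
        return False  # missing or unexpected argument names
    return all(isinstance(call.args[k], t) for k, t in schema.items())


def check_session(calls: list[ToolCall]) -> dict:
    """Score one session's tool calls for correctness and policy adherence."""
    valid = sum(tool_call_is_valid(c) for c in calls)
    unauthorised = [c.name for c in calls if c.name not in ALLOWED_TOOLS]
    return {
        "tool_use_correctness": valid / len(calls) if calls else 1.0,
        "policy_adherence": not unauthorised,
        "unauthorised_tools": unauthorised,
    }


session = [
    ToolCall("lookup_order", {"order_id": "A1"}),
    ToolCall("issue_refund", {"order_id": "A1", "amount": "ten"}),
]
result = check_session(session)
```

Here the second call fails both checks: `amount` is a string rather than a float, and `issue_refund` is not in the allowed set. A static model benchmark would surface neither problem, because neither exists until the model is wired to real tools.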

What agent evaluation measures

A practical agent evaluation programme scores a handful of properties on every session, or a sample of them:

Task completion: did the agent actually resolve the user's request, end to end, rather than just produce plausible text?

Tool-use correctness: did it select the right tools and call them with valid inputs?

Groundedness: were the agent's claims supported by the context and data it retrieved, or did it invent them?

Policy adherence: did it stay within the actions and data it is allowed to touch?

Operational quality: cost, latency, and token usage per session.

These are scored with a mix of rule-based checks, LLM-as-a-judge graders, and human feedback — and the scores are tracked per agent and per version so quality becomes a trend, not an anecdote. See the practical how-to in [Agent Evals](/learn/agent-evals).
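To illustrate what "tracked per agent and per version" might look like, here is a minimal sketch that groups per-session scores by agent and version and reports a task-completion rate for each. The record fields (`agent`, `version`, `task_completed`) and the agent name are assumptions made for the example, not a schema from any particular tool.

```python
from collections import defaultdict
from statistics import mean


def completion_rate_by_version(records: list[dict]) -> dict:
    """Group session scores by (agent, version) and average task completion."""
    grouped = defaultdict(list)
    for r in records:
        grouped[(r["agent"], r["version"])].append(r["task_completed"])
    return {
        key: mean(1.0 if done else 0.0 for done in outcomes)
        for key, outcomes in grouped.items()
    }


# Hypothetical per-session scores, however they were produced
# (rule-based check, LLM-as-a-judge grader, or human feedback).
records = [
    {"agent": "support-bot", "version": "v1", "task_completed": True},
    {"agent": "support-bot", "version": "v1", "task_completed": True},
    {"agent": "support-bot", "version": "v2", "task_completed": True},
    {"agent": "support-bot", "version": "v2", "task_completed": False},
]
rates = completion_rate_by_version(records)
```

Comparing the rate for `v2` against `v1` is what turns individual session scores into a trend. In practice you would also want enough sessions per version for the difference to be meaningful.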

Evaluating agents in production, not just in CI

Because agents face inputs no test set anticipated and can drift with no code change, agent evaluation has to run in two places: offline, against a golden dataset on every change, to gate regressions before they ship; and online, scoring a sample of live production sessions continuously, to catch the failures and drift the test set never imagined.

This is the difference between knowing your agent worked last Tuesday and knowing it is working right now. It is also what separates agent evaluation from the dev-time, sampled world of model evaluation — and it is the layer Prefactor is built for: continuous evaluation of agents in production, with the identity and runtime controls to act on what the evaluation finds.
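The offline and online loops described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than a description of any product: `run_agent`, the golden-case format, the exact-match comparison, and the 90% threshold are all hypothetical, and a real programme would usually replace exact match with the graders discussed earlier.

```python
import random
from typing import Callable, Optional


def offline_gate(
    run_agent: Callable[[str], str],
    golden_cases: list[dict],
    min_pass_rate: float = 0.9,
) -> tuple[bool, float]:
    """Run the agent on a golden dataset and decide whether a change may ship."""
    if not golden_cases:
        raise ValueError("golden dataset is empty")
    # Exact match is a simplification; real graders are usually richer.
    passed = sum(run_agent(c["input"]) == c["expected"] for c in golden_cases)
    pass_rate = passed / len(golden_cases)
    return pass_rate >= min_pass_rate, pass_rate


def sample_for_online_scoring(
    session_ids: list[str], rate: float, seed: Optional[int] = None
) -> list[str]:
    """Pick a random fraction of live sessions to score continuously."""
    rng = random.Random(seed)
    return [s for s in session_ids if rng.random() < rate]


# Stand-in agent for the example: upper-cases its input.
golden = [
    {"input": "ok", "expected": "OK"},
    {"input": "no", "expected": "nope"},
]
ship, pass_rate = offline_gate(str.upper, golden)
to_score = sample_for_online_scoring(["s1", "s2", "s3"], rate=1.0)
```

The gate runs on every change and blocks a release when the pass rate drops below the threshold. The sampler runs against production traffic, so it keeps producing scores even when no code has changed, which is where drift shows up.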
