Agent Evals: A Practical Guide

What evals are, the four types that matter for agents, and how to ship your first eval this week: from vibes to verdicts.

Updated 13 June 2026 · 10 min read · 8 sections

§01 / OVERVIEW

TL;DR

Evals are repeatable tests that score an AI agent's outputs against expectations: the agent equivalent of a test suite. Where unit tests check code paths, evals check judgment: did the agent answer correctly, ground its claims, complete the task, stay in policy. Teams that ship reliable agents run evals on every change and on samples of live traffic.

§02 / THE GUIDE

What are evals?

Evals (short for evaluations) are repeatable, automated tests that score an AI system's outputs against defined expectations. For agents, an eval takes a task input (a user request, a document, a scenario), runs the agent, and scores the result: was the answer correct, was it grounded in the provided context, did the agent complete the task, did it use the right tools, did it stay within policy.

The comparison to unit tests is useful but imprecise in one important way. A unit test checks a deterministic code path: same input, same output, pass or fail. An agent's output varies between runs, and 'correct' is often a judgment call rather than a string match. So evals score rather than assert (a groundedness score of 0.84, a task completion verdict from a judge model, a pass rate across twenty test cases), and you set thresholds on the scores.
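To make "score, then threshold" concrete, here is a minimal sketch. The metric names, the `EvalScores` type, and the threshold values are illustrative assumptions, not a recommended configuration.

```python
from dataclasses import dataclass


@dataclass
class EvalScores:
    groundedness: float  # mean judge score in [0, 1]
    pass_rate: float     # fraction of test cases that passed


# Hypothetical thresholds; choose values that fit your own agent.
THRESHOLDS = {"groundedness": 0.80, "pass_rate": 0.90}


def failing_metrics(scores: EvalScores, thresholds: dict[str, float]) -> list[str]:
    """Return the names of metrics that fall below their threshold."""
    return [
        name
        for name, minimum in thresholds.items()
        if getattr(scores, name) < minimum
    ]


# 17 of 20 cases passed, groundedness averaged 0.84.
scores = EvalScores(groundedness=0.84, pass_rate=17 / 20)
print(failing_metrics(scores, THRESHOLDS))
```

Here groundedness clears its bar but the pass rate of 0.85 does not, so the change would be held back on that metric alone.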

Evals run in two places. Offline, against a fixed dataset of test cases, on every prompt or model change: this is your regression suite. Online, against samples of real production traffic: this is how you catch the failures your test set never imagined.

You're testing on vibes, and that's normal

Here is how most teams actually test agents today: an engineer changes a prompt, runs five or six familiar questions through the agent, reads the answers, decides they 'look right', and ships. One of our customers described their process exactly this way: they test internally and base it on vibes. No stored test cases, no scores, no record of what the last version did on the same inputs.

This is not a failing: it is the natural starting point, and vibes-based testing genuinely works at small scale. A capable engineer reading outputs catches a lot. It breaks down in three predictable ways. First, regressions hide: the prompt change that fixes your five test questions quietly breaks a sixth case you didn't re-run. Second, memory fails: nobody can recall how version 14 handled the edge case that version 19 just fumbled, so every debate about 'did it get worse?' is unresolvable. Third, it doesn't scale past one person: the engineer's taste is the spec, and it leaves when they do.

The exit from vibes testing is smaller than most teams assume. It is not a hundred test cases and an evaluation platform. It is twenty real examples, written down, scored the same way every time. That is an eval.

The four types of agent evals

Four eval types cover the practical landscape, and mature teams run all four.

Golden-dataset regression evals run the agent against a curated set of inputs with known-good expected outputs, on every change. Example: a support agent has 40 stored tickets with verified correct resolutions; every prompt change replays all 40 and reports pass rate. This is your safety net against regressions, and the first eval most teams should build.
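A golden-dataset replay can be sketched in a few lines. The JSON Lines layout, the field names, and `toy_agent` are assumptions made for illustration; a real suite would load the cases from a file and call the actual agent.

```python
import json

# Two illustrative cases in JSON Lines form (hypothetical field names).
GOLDEN_JSONL = """\
{"id": "t1", "input": "Where is my refund?", "expected": "refund_status"}
{"id": "t2", "input": "Cancel my plan", "expected": "cancel_subscription"}
"""


def load_cases(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def toy_agent(message):
    """Stand-in for the real agent: returns a resolution label."""
    return "refund_status" if "refund" in message.lower() else "escalate"


def replay(agent, cases):
    """Run every case and return (pass_rate, ids of failed cases)."""
    failed = [c["id"] for c in cases if agent(c["input"]) != c["expected"]]
    return 1 - len(failed) / len(cases), failed


rate, failed = replay(toy_agent, load_cases(GOLDEN_JSONL))
print(f"pass rate: {rate:.0%}, failed: {failed}")
```

Exact matching on a label works only when the expected output is a discrete resolution; free-text answers need a judge instead.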

LLM-as-a-judge evals use a second model to score outputs against a rubric: correctness, groundedness, tone, completeness. Example: a judge model receives the agent's answer plus the retrieved documents and scores whether every claim is supported, flagging the unsupported ones. This is how scoring scales beyond what humans can read, at the price of needing to calibrate the judge against human spot checks.
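Calibrating the judge against human spot checks can start as a simple agreement rate. The sketch below assumes verdicts are stored as "PASS" or "FAIL" strings; the function names and sample data are hypothetical.

```python
def agreement_rate(judge_verdicts, human_verdicts):
    """Fraction of spot-checked cases where the judge and the human agree."""
    if len(judge_verdicts) != len(human_verdicts) or not judge_verdicts:
        raise ValueError("need two non-empty verdict lists of equal length")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)


def disagreements(case_ids, judge_verdicts, human_verdicts):
    """Ids of cases worth reading by hand: the judge and the human differ."""
    return [
        cid
        for cid, j, h in zip(case_ids, judge_verdicts, human_verdicts)
        if j != h
    ]


ids = ["c1", "c2", "c3", "c4", "c5"]
judge = ["PASS", "PASS", "FAIL", "PASS", "FAIL"]
human = ["PASS", "FAIL", "FAIL", "PASS", "FAIL"]

print(agreement_rate(judge, human))
print(disagreements(ids, judge, human))
```

The disagreement list is usually more useful than the headline number: reading those cases shows whether the rubric or the judge prompt needs tightening.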

Human feedback evals capture signal from real users: thumbs up/down, ratings, corrections. Example: every agent response in your app carries a thumbs widget; the scores aggregate per agent version. Feedback is noisy and sparse but it is ground truth about user experience, and it is the cheapest way to calibrate your judge.
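Aggregating thumbs per agent version is mostly bookkeeping. This sketch assumes feedback arrives as `(agent_version, thumbs_up)` pairs, which is an illustrative shape rather than any particular product's schema.

```python
from collections import defaultdict


def thumbs_up_rate_by_version(events):
    """events: iterable of (agent_version, thumbs_up) pairs."""
    counts = defaultdict(lambda: [0, 0])  # version -> [ups, total]
    for version, thumbs_up in events:
        counts[version][0] += int(thumbs_up)
        counts[version][1] += 1
    return {version: ups / total for version, (ups, total) in counts.items()}


events = [
    ("v14", True), ("v14", True), ("v14", False),
    ("v15", True), ("v15", False), ("v15", False), ("v15", False),
]
rates = thumbs_up_rate_by_version(events)
print(rates)
```

With so few votes per version the rates are not statistically meaningful; because feedback is sparse, it is worth reporting the vote count alongside the rate.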

Online production sampling scores a percentage of live traffic continuously with the same judges you use offline. Example: 5% of sessions get groundedness-scored within minutes of completing; an alert fires when the rolling rate degrades. This is what turns evals from a deploy gate into a monitoring system.
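One way to sketch production sampling is a deterministic hash-based sampler plus a rolling pass-rate monitor. The window size, the alert threshold, and the class and function names are assumptions; the scoring itself (the judge call) is left out.

```python
import hashlib
from collections import deque


def is_sampled(session_id: str, rate: float = 0.05) -> bool:
    """Deterministically select roughly `rate` of sessions by hashing the id."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate


class RollingPassRate:
    """Tracks the last `window` scored sessions and flags degradation."""

    def __init__(self, window: int, alert_below: float):
        self.results = deque(maxlen=window)
        self.alert_below = alert_below

    def record(self, passed: bool) -> bool:
        """Record one scored session; return True if an alert should fire."""
        self.results.append(passed)
        window_full = len(self.results) == self.results.maxlen
        rate = sum(self.results) / len(self.results)
        return window_full and rate < self.alert_below


monitor = RollingPassRate(window=4, alert_below=0.75)
alerts = [monitor.record(p) for p in [True, True, True, False, False]]
print(alerts)
```

Hashing the session id, rather than drawing a random number, means the same session is always in or out of the sample, which makes a scored session easy to reproduce later.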

Eval-driven development: the agent quality loop

Teams that ship reliable agents converge on the same loop, whatever they call it. We call it the agent quality loop: trace, eval, feedback, change, re-eval.

It works like this. Production traces show you what the agent actually did, including the sessions that went wrong. Failed or suspicious sessions become eval cases: real failures, captured with their full context, added to the dataset. Evals score every proposed change against that growing dataset before it ships. User feedback and online sampling verify the change in production and surface the next failures. Those become new eval cases, and the loop closes.
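The step where a failed session becomes an eval case might look like the sketch below. The session fields (`session_id`, `input`, `retrieved_docs`) are hypothetical and will differ depending on how your traces are stored.

```python
def session_to_case(session, expected):
    """Turn a traced production session into an eval case with its context."""
    return {
        "id": f"prod-{session['session_id']}",
        "input": session["input"],
        "context": session.get("retrieved_docs", []),
        "expected": expected,  # written by whoever reviewed the failure
        "source": "production",
    }


def add_case(dataset, case):
    """Append the case unless one with the same id already exists."""
    if any(existing["id"] == case["id"] for existing in dataset):
        return False
    dataset.append(case)
    return True


dataset = []
failed_session = {
    "session_id": "8f2a",
    "input": "Can I get a refund after 45 days?",
    "retrieved_docs": ["Refunds are available within 30 days of purchase."],
}
case = session_to_case(failed_session, expected="Decline: outside the 30-day window.")
print(add_case(dataset, case), add_case(dataset, case), len(dataset))
```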

Two properties make this loop work. The eval dataset grows from production reality, not from imagined test cases, so it stays representative of what your agent actually faces. And every change is scored against the same cases as the last change, so quality becomes a trend you can plot rather than an argument you have in Slack.

This is eval-driven development in the same sense as test-driven development: the eval for a failure is written before the fix, and the fix is accepted when the eval passes, without breaking the rest of the suite.
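The acceptance rule described above (the new eval passes and nothing else breaks) can be expressed as a small check. This is a sketch that assumes each run is stored as a mapping from case id to a pass/fail boolean; the names are illustrative.

```python
def accept_fix(before: dict[str, bool], after: dict[str, bool], new_case_id: str):
    """Decide whether a fix ships.

    `before` and `after` map case id -> passed, for the old and new agent
    versions. The fix is accepted only if the new case now passes and no
    previously passing case has started failing.
    """
    regressions = sorted(
        cid for cid, passed in before.items()
        if passed and not after.get(cid, False)
    )
    accepted = after.get(new_case_id, False) and not regressions
    return accepted, regressions


before = {"c1": True, "c2": True, "c21": False}  # c21 is the new failure case
clean_fix = {"c1": True, "c2": True, "c21": True}
leaky_fix = {"c1": True, "c2": False, "c21": True}

print(accept_fix(before, clean_fix, "c21"))
print(accept_fix(before, leaky_fix, "c21"))
```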

How do I write my first eval for an AI agent?

Here is the 30-minute version, no platform required.

Step one (ten minutes): collect twenty real cases. Pull them from production logs, support escalations, or your own testing history. Each case needs an input and an expectation: for a support agent, the customer message plus the correct resolution; for a RAG agent, the question plus the source passage that contains the answer. Real cases beat synthetic ones every time; they encode the weirdness of actual usage.

Step two (ten minutes): write the judge prompt. A minimal judge prompt has four parts: the role ('You are evaluating a customer support agent's response'), the materials (the input, the agent's output, and the expected outcome or source context), the criteria ('The response is correct if it resolves the issue described, makes no claims unsupported by the provided context, and follows refund policy'), and the output format ('Return PASS or FAIL and one sentence of reasoning'). Resist the urge to score ten dimensions at once: one clear criterion, scored reliably, beats a vague rubric.
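The four-part judge prompt from step two can be kept as a plain template. The wording follows the example in the text; the template layout, the function names, and the reply parser are assumptions, and the call to the judge model itself is deliberately left out.

```python
JUDGE_TEMPLATE = """\
You are evaluating a customer support agent's response.

Customer message:
{input}

Agent response:
{output}

Expected outcome:
{expected}

The response is correct if it resolves the issue described, makes no claims
unsupported by the provided context, and follows refund policy.

Return PASS or FAIL and one sentence of reasoning."""


def build_judge_prompt(case_input, agent_output, expected):
    """Fill the role, materials, criteria, and output format into one prompt."""
    return JUDGE_TEMPLATE.format(
        input=case_input, output=agent_output, expected=expected
    )


def parse_verdict(judge_reply: str) -> bool:
    """True for PASS, False for FAIL; raise if the reply starts with neither."""
    words = judge_reply.strip().split(maxsplit=1)
    first = words[0].strip(".:,").upper() if words else ""
    if first not in ("PASS", "FAIL"):
        raise ValueError(f"unparseable judge reply: {judge_reply!r}")
    return first == "PASS"


prompt = build_judge_prompt(
    "I was charged twice.", "I have refunded the duplicate charge.", "Refund issued."
)
print(parse_verdict("PASS. The duplicate charge was refunded."))
```

Raising on an unparseable reply, rather than counting it as a failure, keeps judge formatting problems from being mistaken for agent regressions.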

Step three (ten minutes): wire it into a loop. A script that runs all twenty cases through your agent, sends each result to the judge, and prints a pass rate. Run it on your current version to get a baseline. Run it again on every change.
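The loop from step three might look like the sketch below. `stub_agent` and `stub_judge` are placeholders so the script runs on its own; in practice the first would call your agent and the second would send a judge prompt to a model and parse its verdict.

```python
def run_eval(cases, agent, judge):
    """Run each case through the agent, ask the judge, and tally the results."""
    failures = []
    for case in cases:
        output = agent(case["input"])
        if not judge(case["input"], output, case["expected"]):
            failures.append(case["input"])
    pass_rate = (len(cases) - len(failures)) / len(cases)
    return pass_rate, failures


def stub_agent(message):
    """Placeholder agent: only knows how to handle refunds."""
    if "refund" in message.lower():
        return "Your refund was issued."
    return "I am not sure."


def stub_judge(case_input, output, expected):
    """Placeholder judge: passes if the expected phrase appears in the output."""
    return expected.lower() in output.lower()


cases = [
    {"input": "I want a refund for order 12", "expected": "refund was issued"},
    {"input": "Reset my password", "expected": "reset link"},
]

pass_rate, failures = run_eval(cases, stub_agent, stub_judge)
print(f"pass rate: {pass_rate:.0%}")
for failed_input in failures:
    print(f"FAILED: {failed_input}")
```

Printing the failing inputs alongside the pass rate matters: the number tells you whether quality moved, the list tells you where to look.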

That is a real eval. It will feel too small. It will also catch your next regression before your users do, which is more than vibes ever did.

What's the difference between evals and testing?

Traditional software testing verifies deterministic behaviour: given input X, the code returns Y, every time. The test asserts equality and the answer is binary.

Evals verify judgment under non-determinism. The same agent, on the same input, can phrase its answer differently every run, and several different answers may all be correct. So evals score properties of the output (correctness, groundedness, completeness, policy adherence) rather than asserting exact matches, and they aggregate over many cases rather than trusting any single run.
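Because a single run is not trustworthy, one option is to run each case several times and score the property rather than the exact string. The run count, the 0.8 per-case bar, and the toy agent below are assumptions for illustration.

```python
import random


def case_pass_rate(agent, check, case_input, runs=5):
    """Run the same input several times; return the fraction of passing runs."""
    return sum(check(agent(case_input)) for _ in range(runs)) / runs


def suite_pass_rate(agent, cases, runs=5, min_case_rate=0.8):
    """A case counts as passing if at least `min_case_rate` of its runs pass."""
    passed = sum(
        case_pass_rate(agent, check, case_input, runs) >= min_case_rate
        for case_input, check in cases
    )
    return passed / len(cases)


_rng = random.Random(0)


def toy_agent(question):
    """Stand-in agent whose phrasing varies between runs."""
    if "capital" in question:
        return _rng.choice(
            ["Paris.", "The capital is Paris.", "It is Paris, France."]
        )
    return "I don't know."


# Each case pairs an input with a property check, not an exact expected string.
cases = [
    ("What is the capital of France?", lambda out: "paris" in out.lower()),
    ("What is 2 + 2?", lambda out: "4" in out),
]

print(suite_pass_rate(toy_agent, cases))
```

All three phrasings of the first answer satisfy the check, which is the point: several different outputs can be correct, and the eval should not penalise the variation.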

The operational consequence: a unit test suite at 100% stays at 100% until someone changes the code. An eval suite's pass rate can move when nothing in your repository changed, because the model provider shipped an update, a tool's API started returning different data, or user behaviour drifted. This is why evals belong in monitoring, not just CI. Testing tells you your code still works. Evals tell you your agent still works, which is a different and perishable fact.

You still need both. Tool integrations, parsers, and retrieval pipelines are deterministic code and deserve ordinary tests. The agent's judgment on top of them needs evals.

How many evals do I need before production?

Fewer than you fear, more than zero. A defensible minimum for a first production deployment: twenty to fifty cases covering your core task, your top three failure modes, and a handful of adversarial or out-of-scope inputs (the customer who asks for something the agent must refuse). That suite, run on every change, catches the regressions that matter.

What actually determines readiness is coverage of consequence, not case count. Fifty cases that include every action your agent can take with real-world side effects (refunds it can issue, emails it can send, records it can modify) beat five hundred variations of the happy path. Write at least one eval per consequential action, including one that verifies the agent declines the action when it should.
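Coverage of consequence can be checked mechanically if each case is tagged with the action it exercises. The tagging scheme (`action`, `should_decline`) and the action names are hypothetical.

```python
def coverage_gaps(actions, cases):
    """Report consequential actions that lack a 'performs' or 'declines' case."""
    gaps = {}
    for action in actions:
        relevant = [c for c in cases if c["action"] == action]
        missing = []
        if not any(not c["should_decline"] for c in relevant):
            missing.append("performs")
        if not any(c["should_decline"] for c in relevant):
            missing.append("declines")
        if missing:
            gaps[action] = missing
    return gaps


actions = ["issue_refund", "send_email", "modify_record"]
cases = [
    {"action": "issue_refund", "should_decline": False},
    {"action": "issue_refund", "should_decline": True},
    {"action": "send_email", "should_decline": False},
]

print(coverage_gaps(actions, cases))
```

In this example refunds are covered in both directions, sending email has no case where the agent must refuse, and record modification has no cases at all.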

Then let production grow the suite. Every escalation, bad-feedback session, and incident becomes a case. Teams that follow this discipline typically reach one to two hundred meaningful cases within a few months, each one earned from a real failure, which is precisely what makes the suite worth running.

And pair the offline suite with online sampling from day one. Your test set guards against the failures you have imagined; sampled scoring of live traffic guards against the ones you have not.

The agent evals tooling landscape

An honest map of the tools, because they solve different problems.

Open-source eval frameworks (DeepEval, Ragas, OpenAI Evals) give you metrics and test harnesses as code. DeepEval brings a pytest-style workflow with a broad metric library; Ragas specialises in RAG metrics like faithfulness and context precision; OpenAI Evals is a registry-driven harness. All three are strong for offline, developer-run evaluation and free to adopt. You assemble the production side (sampling, dashboards, alerting) yourself.

Observability platforms with evals (Langfuse, LangSmith) attach scores to traces. Because they already capture your traffic, adding LLM-as-a-judge scoring to it is natural, and both are credible here: Langfuse is open-source with strong tracing, LangSmith is deeply integrated with LangChain. Their evals are trace-centric: strong on scoring what happened, lighter on completion verdicts, business feedback capture, and per-agent quality trends over versions.

Eval-first commercial platforms (Braintrust and similar) centre the workflow on datasets, experiments, and judge calibration, and are strongest for teams running large structured eval programmes pre-deploy.

Prefactor sits at the agent quality layer: evals (golden-dataset and judge-based), human feedback capture, and per-agent cost and quality analytics in one place, designed around multi-step agents, sessions, sub-agents, and tool calls, not just single completions. We integrate with traces you already collect rather than demanding you switch observability stacks. The right choice depends on where your gap is: if you lack offline testing, start with an open-source framework this week; if you lack production quality visibility, that is the gap we built for.

§03 / NEXT STEPS · Prefactor: watch, evaluate, improve, prove

Run evals on every change and every agent in production

Prefactor helps teams observe, evaluate, and improve their AI agents in production, across every framework and provider.
