What is Eval-Driven Development?

Test-driven development for agents — write the eval before the fix, ship only when it passes.

Updated 13 June 2026
TL;DR

Eval-driven development (EDD) is test-driven development applied to non-deterministic AI agents: you write the eval for a desired behaviour or a known failure before you make the fix, and a change ships only when it passes that new eval without breaking the rest of the suite. It is the operating rhythm that makes agent optimization safe at speed: quality becomes a measured trend, not a Slack argument.

How does the eval-driven loop work?

Trace what the agent actually did, including the sessions that went wrong. Turn failures into eval cases: real inputs paired with the outcome they should have produced. Score every proposed change against that growing dataset before it ships. Verify in production with online sampling and user feedback; these surface the next failures, which in turn become new cases. The loop closes, and because the dataset grows from production reality, it stays representative of what the agent actually faces.
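The "turn failures into eval cases" step can be sketched as a small record type plus a helper that converts a flagged trace into a case. This is only an illustration under stated assumptions: `Trace`, `EvalCase`, `case_from_trace`, and `grow_dataset` are hypothetical names, not part of any particular tracing or eval library, and the expected outcome is assumed to come from a human review of the failed session.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    """One recorded agent session (hypothetical shape)."""
    session_id: str
    user_input: str
    agent_output: str
    failed: bool  # e.g. flagged by online sampling or user feedback


@dataclass(frozen=True)
class EvalCase:
    """A real input paired with the outcome it should have produced."""
    case_id: str
    user_input: str
    expected_outcome: str


def case_from_trace(trace: Trace, expected_outcome: str) -> EvalCase:
    """Turn a failed production trace into a reusable eval case."""
    if not trace.failed:
        raise ValueError("only failed traces become cases in this sketch")
    return EvalCase(
        case_id=f"case-{trace.session_id}",
        user_input=trace.user_input,
        expected_outcome=expected_outcome,
    )


def grow_dataset(
    dataset: list[EvalCase],
    traces: list[Trace],
    expected: dict[str, str],  # session_id -> reviewed expected outcome
) -> list[EvalCase]:
    """Return the dataset plus one new case per reviewed, failed trace."""
    known = {case.case_id for case in dataset}
    new_cases = [
        case_from_trace(t, expected[t.session_id])
        for t in traces
        if t.failed
        and t.session_id in expected
        and f"case-{t.session_id}" not in known
    ]
    return dataset + new_cases
```

Calling `grow_dataset` again with the same traces adds nothing, so the helper can run on every batch of production traces without duplicating cases.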

How is eval-driven development different from normal testing?

Traditional tests assert fixed outputs for fixed inputs: each one passes or fails, and a green suite stays green until the code changes. Evals score judgement under non-determinism: the same input can yield several acceptable answers, so you score properties (correctness, groundedness, completion) and aggregate over many cases. An eval suite's pass rate can also move when nothing in your repo has changed, because a model was updated or behaviour drifted, which is why evals belong in monitoring, not just CI.
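To make the contrast concrete, here is a minimal sketch of property scoring aggregated over repeated samples, plus a simple drift check for monitoring. Everything in it is an assumption for illustration: the function names, the 0.8 threshold, the 0.05 tolerance, and the toy agent are hypothetical, and real property checks would usually be richer rules or model-based judges.

```python
import random
from statistics import mean
from typing import Callable

Agent = Callable[[str], str]
# A property check maps (input, output) to a score in [0, 1].
PropertyCheck = Callable[[str, str], float]


def score_case(agent: Agent, user_input: str,
               checks: dict[str, PropertyCheck], samples: int = 5) -> dict[str, float]:
    """Run the agent several times and average each property's score."""
    outputs = [agent(user_input) for _ in range(samples)]
    return {
        name: mean(check(user_input, out) for out in outputs)
        for name, check in checks.items()
    }


def pass_rate(agent: Agent, inputs: list[str], checks: dict[str, PropertyCheck],
              threshold: float = 0.8, samples: int = 5) -> float:
    """Fraction of cases where every property averages at or above threshold."""
    passed = 0
    for user_input in inputs:
        scores = score_case(agent, user_input, checks, samples)
        if all(score >= threshold for score in scores.values()):
            passed += 1
    return passed / len(inputs)


def drifted(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    """Flag a pass-rate drop larger than the tolerance, even with no code change."""
    return baseline - current > tolerance


def toy_agent(user_input: str) -> str:
    # Two differently worded answers; both are acceptable.
    return random.choice([
        "Paris is the capital of France.",
        "The capital of France is Paris.",
    ])


checks = {
    "correctness": lambda q, a: 1.0 if "Paris" in a else 0.0,
    "completion": lambda q, a: 1.0 if a.endswith(".") else 0.0,
}
rate = pass_rate(toy_agent, ["What is the capital of France?"], checks)
```

An exact-match assertion would fail on one of the two phrasings, while the property checks accept both. Running `drifted` on a schedule against a stored baseline is one way to notice a model update that no commit in the repo explains.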

How do you start eval-driven development?

Smaller than you think. Collect twenty real cases, write a minimal judge or rule for each, and script a run that reports a pass rate. From then on, the rule is simple: when production surfaces a failure, write the eval that captures it before you write the fix, and accept the fix only when the eval passes and the rest of the suite still does. Within months you have a couple of hundred cases, each earned from a real incident — and an agent that improves without regressing.
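A starting script along these lines can be very small. The sketch below assumes each case is an input plus a hand-written rule over the agent's output, and it runs each case once for brevity; a non-deterministic agent would normally be sampled several times per case. The names `run_suite`, `pass_rate`, and `accept_fix`, and the example cases and agents, are hypothetical.

```python
from typing import Callable

Agent = Callable[[str], str]
Rule = Callable[[str], bool]  # takes the agent's output, returns pass/fail


def run_suite(agent: Agent, cases: dict[str, tuple[str, Rule]]) -> dict[str, bool]:
    """Run every case once and record pass/fail per case id."""
    return {
        case_id: rule(agent(user_input))
        for case_id, (user_input, rule) in cases.items()
    }


def pass_rate(results: dict[str, bool]) -> float:
    """Share of cases that passed."""
    return sum(results.values()) / len(results)


def accept_fix(before: dict[str, bool], after: dict[str, bool],
               new_case_id: str) -> bool:
    """Accept only if the new eval passes and no previously passing case fails."""
    regressions = [
        case_id for case_id, ok in before.items()
        if ok and not after.get(case_id, False)
    ]
    return after.get(new_case_id, False) and not regressions


# Illustrative cases: the second one captures a production failure.
cases = {
    "greets": ("hello", lambda out: "hi" in out.lower()),
    "refund-42": ("refund order 42", lambda out: "refund issued" in out.lower()),
}


def old_agent(user_input: str) -> str:
    return "Hi there!"


def fixed_agent(user_input: str) -> str:
    return "Refund issued." if "refund" in user_input else "Hi there!"


before = run_suite(old_agent, cases)
after = run_suite(fixed_agent, cases)
ship = accept_fix(before, after, new_case_id="refund-42")
```

Here `before` records the failing `refund-42` case written ahead of the fix, and `ship` is true only because the fixed agent passes it while `greets` still passes. A change that fixed the refund case but broke the greeting would be rejected.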
