Glossary
Evaluation Harness (Agent)
An agent evaluation harness is the test infrastructure that automatically runs an agent against a suite of benchmark tasks, captures outputs, scores them against defined criteria, and generates performance reports. It is the CI layer for agent quality — running on every code or prompt change to catch regressions before they reach production. A mature evaluation harness covers accuracy, latency, cost, tool-call correctness, and safety-policy adherence.