LLM-as-a-Judge

Reviewed 9 April 2026 · Canonical definition

LLM-as-a-judge is an evaluation technique in which one language model scores another model's outputs against criteria you define, such as correctness, groundedness, tone, and task completion. It makes evaluation scalable: instead of humans reviewing every output, a judge model scores thousands of outputs per hour, while humans audit a sample to check that the judge's scores stay aligned with human judgement. Judge models carry known biases: position bias (favouring an answer because of where it appears in the prompt), verbosity bias (favouring longer answers), and self-preference (favouring outputs that resemble the judge's own). These are mitigated with structured rubrics, randomised orderings, and regular calibration against human spot checks. For agents, judges can score whole trajectories and tool calls, not just final answers.
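Two of the mitigations named above, randomised ordering and calibration against human spot checks, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the `PairwiseJudge` signature, the `"A"`/`"B"` verdict format, and both function names are hypothetical, and the stand-in judge below replaces a real call to a judge model.

```python
import random
from typing import Callable, Sequence

# Hypothetical judge interface (an assumption, not a real library API):
# takes (prompt, answer_in_slot_a, answer_in_slot_b) and returns "A" or "B".
# In practice this would wrap a call to a judge model with a structured rubric.
PairwiseJudge = Callable[[str, str, str], str]


def judge_with_random_order(
    judge: PairwiseJudge,
    prompt: str,
    first: str,
    second: str,
    rng: random.Random,
) -> str:
    """Show the two answers in a random order and map the verdict back.

    Returns "first" or "second", referring to the caller's arguments,
    so a judge that always favours one slot cannot systematically
    favour one candidate.
    """
    swapped = rng.random() < 0.5
    slot_a, slot_b = (second, first) if swapped else (first, second)
    verdict = judge(prompt, slot_a, slot_b)
    if verdict not in ("A", "B"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    picked_slot_a = verdict == "A"
    # When swapped, slot A holds `second`; otherwise it holds `first`.
    return "second" if picked_slot_a == swapped else "first"


def agreement_rate(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of audited items where the judge matches the human label."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


if __name__ == "__main__":
    # Stand-in judge with pure position bias: it always picks slot A.
    def always_slot_a(prompt: str, a: str, b: str) -> str:
        return "A"

    rng = random.Random(0)
    verdicts = [
        judge_with_random_order(always_slot_a, "q", "answer 1", "answer 2", rng)
        for _ in range(1000)
    ]
    # With random ordering, the biased judge's wins split roughly evenly.
    print(verdicts.count("first"), verdicts.count("second"))
```

Randomising the order does not remove position bias from any single verdict; it spreads the bias evenly across candidates so that it averages out over many comparisons. The agreement rate is only meaningful on items that humans actually reviewed, and a drop over time is a signal to revisit the rubric or the judge.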
