LLM-as-a-Judge
LLM-as-a-judge is an evaluation technique where one language model scores another model's outputs against criteria you define — correctness, groundedness, tone, task completion. It makes evaluation scalable: instead of humans reviewing every output, a judge model scores thousands per hour, with humans auditing a sample to keep the judge honest. Judge models carry known biases — position bias, verbosity bias, and self-preference — which are mitigated with structured rubrics, randomised orderings, and regular calibration against human spot checks. For agents, judges can score whole trajectories and tool calls, not just final answers.
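As an illustration of two of these mitigations, the sketch below shows one way to guard against position bias in a pairwise comparison and to measure calibration against human spot checks. It is a minimal sketch under stated assumptions, not a reference implementation: the `Judge` callable, `debiased_preference`, `agreement_rate`, and the two toy judges are hypothetical names introduced here, and a real judge would wrap a call to a language model. Note that it evaluates both orderings and keeps only consistent verdicts, a stricter variant of randomising the order.

```python
from typing import Callable, Literal, Sequence

# Hypothetical judge interface (an assumption for this sketch): given a task
# prompt and two candidate answers, return which displayed position it prefers.
Judge = Callable[[str, str, str], Literal["first", "second"]]


def debiased_preference(judge: Judge, prompt: str, answer_a: str, answer_b: str) -> str:
    """Query the judge with both orderings and accept only a consistent verdict."""
    verdict_ab = judge(prompt, answer_a, answer_b)  # A is shown first
    verdict_ba = judge(prompt, answer_b, answer_a)  # B is shown first

    a_wins_ab = verdict_ab == "first"
    a_wins_ba = verdict_ba == "second"

    if a_wins_ab and a_wins_ba:
        return "A"
    if not a_wins_ab and not a_wins_ba:
        return "B"
    # The verdict flipped when the order changed, so it reflects position,
    # not quality. Report a tie rather than trusting either answer.
    return "tie"


def agreement_rate(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of audited items where the judge's label matches the human's."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


# Toy stand-in judges, for demonstration only.
def always_first(prompt: str, first: str, second: str) -> str:
    """A maximally position-biased judge: always prefers the first slot."""
    return "first"


def longer_wins(prompt: str, first: str, second: str) -> str:
    """A verbosity-biased judge: prefers the longer answer regardless of slot."""
    return "first" if len(first) >= len(second) else "second"


if __name__ == "__main__":
    print(debiased_preference(always_first, "q", "x", "yy"))  # tie
    print(debiased_preference(longer_wins, "q", "short", "much longer"))  # B
    print(agreement_rate(["A", "B", "tie", "A"], ["A", "B", "A", "A"]))  # 0.75
```

The swap check catches the position-biased judge but not the verbosity-biased one, which prefers the longer answer in both orderings. That is why the paragraph above pairs ordering controls with structured rubrics and human calibration rather than relying on any single mitigation.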