Education Resource

What is LLM-as-a-Judge?

How one model scores another: the scalable backbone of modern agent evaluation, from judge prompts and bias controls to agent-as-a-judge.

Updated 13 June 2026 · 9 min read · 9 sections

§01 / OVERVIEW

TL;DR

LLM-as-a-judge uses one language model to score another model's or agent's output against a rubric. It is the scalable way to measure qualities like correctness, groundedness and helpfulness that exact-match metrics miss. Studies have reported judge models agreeing with human raters more than 80% of the time on benchmarks such as MT-Bench, but judges carry position, verbosity and self-preference biases that must be controlled with clear rubrics and calibration.

§02 / THE GUIDE

What is LLM-as-a-judge?

LLM-as-a-judge is the practice of using one large language model to score another model's or agent's output against a rubric or reference answer. It scales the subjective quality checks (correctness, groundedness, helpfulness, tone) that exact-match metrics cannot capture, and studies have reported over 80% agreement with human raters on benchmarks such as MT-Bench.

The technique emerged because the things we most want to know about an agent's output are rarely a string match. Did the answer resolve the user's problem? Is every claim supported by the retrieved context? Is the tone appropriate? Classic metrics like BLEU, ROUGE or exact match were built for translation, summarisation and classification, not for open-ended generation, and they fall silent exactly where agent quality lives.

A judge runs in two places, the same two places every eval runs: offline, against a fixed dataset on every prompt or model change, and online, against a sample of live production traffic. The first is your regression net; the second catches the failures your test set never imagined.
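Scoring "a sample of live production traffic" can be sketched as a deterministic, hash-based sampler, so that the same trace is always either in or out of the sample. This is only an illustration: the function name `should_score`, the `trace_id` argument and the 10% rate are assumptions, not something the guide prescribes.

```python
import hashlib


def should_score(trace_id: str, sample_rate: float) -> bool:
    """Deterministically select roughly `sample_rate` of traces for online judging."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Treat the first 8 bytes of the hash as a 64-bit bucket number.
    bucket = int.from_bytes(digest[:8], "big")
    # A trace is sampled when its bucket falls below the scaled rate.
    return bucket < sample_rate * 2**64


# Hypothetical trace ids; in practice these would come from your tracing system.
trace_ids = [f"trace-{i}" for i in range(1000)]
sampled = [t for t in trace_ids if should_score(t, 0.10)]
```

Hashing the trace id rather than drawing a random number means a re-run of the sampler picks the same traces, which keeps online scores comparable from one run to the next.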

Why use an LLM as a judge?

Most agent outputs have no single correct string, so traditional metrics fail. A judge model reads the output the way a human reviewer would and returns a score or verdict, making it possible to evaluate quality at a scale and speed no human review team could match, on every change and on live traffic.

The economics are the point. A human reviewer might score a few dozen responses an hour; a judge scores thousands, consistently, for the price of an API call. That is what turns evaluation from a quarterly research exercise into something you can run in CI on every pull request and continuously in production.

The trade is that a judge is an approximation of human judgement, not a replacement for it. Used well, it is calibrated against a small set of human labels and spot-checked over time. Used badly, it is a confident number nobody has validated. The rest of this guide is about staying on the right side of that line.

How does LLM-as-a-judge work?

A judge pipeline takes three inputs (the task, the candidate output and an evaluation rubric, plus optionally a reference answer or retrieved context) and asks a judge model to return a score and its reasoning. Three scoring modes dominate: single-output scoring, pairwise comparison of two outputs, and reference-based grading.

Single-output scoring asks the judge to rate one response against a rubric, for example a groundedness score from 0 to 1 or a PASS/FAIL on policy adherence. Pairwise comparison shows the judge two responses and asks which is better; this is the method behind preference rankings and model leaderboards. Reference-based grading gives the judge a known-good answer or source passage and asks whether the candidate matches or is supported by it.

Whatever the mode, the judge should return structured output (a verdict plus one line of reasoning, or a numeric score on a fixed scale) so that results aggregate into pass rates and trends rather than prose you have to read by hand.
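As a minimal sketch of why structured output matters, the snippet below parses judge replies and aggregates them into a pass rate. It assumes the judge was asked to reply with a JSON object of the form `{"verdict": "PASS" | "FAIL", "reasoning": "..."}`. That shape, and the names `Verdict`, `parse_verdict` and `pass_rate`, are illustrative assumptions rather than a standard.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    passed: bool
    reasoning: str


def parse_verdict(raw: str) -> Verdict:
    """Parse one judge reply in the assumed JSON shape; reject anything else."""
    data = json.loads(raw)
    label = str(data["verdict"]).strip().upper()
    if label not in {"PASS", "FAIL"}:
        raise ValueError(f"unexpected verdict: {label!r}")
    return Verdict(passed=(label == "PASS"), reasoning=str(data.get("reasoning", "")))


def pass_rate(verdicts: list[Verdict]) -> float:
    """Share of verdicts that passed."""
    if not verdicts:
        raise ValueError("no verdicts to aggregate")
    return sum(v.passed for v in verdicts) / len(verdicts)


# Made-up replies standing in for real judge output.
replies = [
    '{"verdict": "PASS", "reasoning": "Resolves the issue and follows policy."}',
    '{"verdict": "FAIL", "reasoning": "Promises a refund the policy does not allow."}',
]
rate = pass_rate([parse_verdict(r) for r in replies])
```

Rejecting unexpected labels, rather than silently treating them as failures, keeps a malformed judge reply from distorting the pass rate.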

How do you write an LLM-as-a-judge prompt?

A reliable judge prompt has four parts: the role (what is being evaluated), the materials (the input, the output, and any reference or context), the criteria (an explicit, single-focus rubric), and the output format (for example, return PASS or FAIL with one sentence of reasoning). One clear criterion scored reliably beats a ten-dimension rubric scored vaguely.

A minimal example for a support agent: 'You are evaluating a customer support response. Here is the customer message, the agent's reply, and the relevant policy. The response passes if it resolves the issue, makes no claim unsupported by the policy, and follows the refund rules. Return PASS or FAIL and one sentence of reasoning.' That prompt is short, specific, and produces a result you can aggregate.

Three habits make judges more reliable: fix the scale and define what each point means, require the judge to give its reasoning before its verdict rather than after, and resist scoring many dimensions at once: run separate judges for correctness, groundedness and tone instead of asking one prompt to weigh all three.
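The four parts and the three habits above can be sketched as a small prompt builder. The template wording, the function name `build_judge_prompt` and the `REASONING:` / `VERDICT:` reply format are assumptions chosen for illustration. The one deliberate choice is that the reply format asks for reasoning before the verdict.

```python
JUDGE_TEMPLATE = """\
You are evaluating {role}.

Input:
{task}

Output to evaluate:
{output}

Reference:
{reference}

Criterion (score only this): {criterion}

Write one sentence of reasoning first, then the verdict.
Reply in exactly this format:
REASONING: <one sentence>
VERDICT: <PASS or FAIL>
"""


def build_judge_prompt(role: str, task: str, output: str,
                       reference: str, criterion: str) -> str:
    """Fill the template with the role, the materials and a single criterion."""
    return JUDGE_TEMPLATE.format(
        role=role, task=task, output=output,
        reference=reference, criterion=criterion,
    )


# Example values are invented for illustration.
prompt = build_judge_prompt(
    role="a customer support response",
    task="I was charged twice. Can I get a refund?",
    output="Yes, the duplicate charge will be refunded within 5 days.",
    reference="Duplicate charges are refunded within 5 business days.",
    criterion="The reply makes no claim unsupported by the reference policy.",
)
```

To score correctness, groundedness and tone, one option under this sketch is to call the builder three times with three different `criterion` strings rather than widening a single prompt.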

LLM-as-a-judge vs human evaluation

Human evaluation is the ground truth, but it is slow, expensive and inconsistent at scale. LLM judges approximate it: research on benchmarks such as MT-Bench has reported judge-to-human agreement above 80%, comparable to the agreement between two human raters. The practical pattern is to calibrate the judge against a small human-labelled set, then let it scale.
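Calibration against a human-labelled set can start with two numbers: raw agreement and a chance-corrected agreement. The sketch below computes both. Cohen's kappa is not mentioned in the guide; it is added here as a common companion statistic, because raw agreement can look high simply because most items pass. The labels are made up for illustration.

```python
from collections import Counter


def agreement_rate(judge: list[str], human: list[str]) -> float:
    """Fraction of items where the judge and the human gave the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    observed = agreement_rate(judge, human)
    n = len(judge)
    judge_counts, human_counts = Counter(judge), Counter(human)
    expected = sum(judge_counts[label] * human_counts[label]
                   for label in judge_counts) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Invented labels for ten responses.
judge_labels = ["PASS", "PASS", "FAIL", "PASS", "FAIL",
                "FAIL", "PASS", "PASS", "FAIL", "PASS"]
human_labels = ["PASS", "PASS", "FAIL", "FAIL", "FAIL",
                "FAIL", "PASS", "PASS", "PASS", "PASS"]

print(agreement_rate(judge_labels, human_labels))          # 0.8
print(round(cohens_kappa(judge_labels, human_labels), 3))  # 0.583
```

Here the judge agrees with the human on 8 of 10 items, but once chance agreement is removed the figure drops to about 0.58, which is a more cautious basis for deciding whether to let the judge scale.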

Keep humans in the loop where it matters most: building the initial calibration set, spot-checking the judge's scores periodically, and making the final call on high-stakes outputs such as anything with legal, financial or safety consequences. Human feedback from real users (thumbs up and down, corrections) is also the cheapest way to keep a judge honest over time.

The goal is not to remove humans but to spend their attention where it is most valuable. The judge handles volume; people handle calibration and the hard cases.

What are the biases in LLM-as-a-judge, and how do you control them?

Judge models show measurable biases: position bias (favouring an option because of where it appears, typically the first slot in a pairwise test), verbosity bias (preferring longer answers regardless of quality), and self-preference bias (favouring outputs from the same model family as the judge). Control them by randomising option order, scoring against an explicit rubric rather than overall impression, and calibrating the judge against human labels.

Specific mitigations help. For pairwise tests, run each comparison twice with the positions swapped and average the two results, so that a consistent preference for one slot cancels out. For verbosity, state in the rubric that length is not a quality signal and score only the criteria you care about. For self-preference, use a judge from a different model family than the agent under test.
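The position-swap mitigation can be sketched as follows. The judge is passed in as a plain callable that returns `"first"` or `"second"`. That interface and the name `debiased_preference` are assumptions for illustration, and the stub judge stands in for a real model call.

```python
from typing import Callable

# Hypothetical judge interface: given two responses in display order,
# return "first" or "second" for the one it prefers.
PairwiseJudge = Callable[[str, str], str]


def debiased_preference(judge: PairwiseJudge, a: str, b: str) -> float:
    """Run both orderings and return the share in which `a` wins (0.0, 0.5 or 1.0)."""
    forward = judge(a, b)   # a is shown first
    backward = judge(b, a)  # a is shown second
    for result in (forward, backward):
        if result not in {"first", "second"}:
            raise ValueError(f"unexpected judge result: {result!r}")
    a_wins = (forward == "first") + (backward == "second")
    return a_wins / 2


def always_first(first: str, second: str) -> str:
    """Stub judge with pure position bias: it always prefers the first slot."""
    return "first"


score = debiased_preference(always_first, "response A", "response B")
print(score)  # 0.5
```

A purely position-biased judge ends up at 0.5, a tie, which is the desired outcome: its preference carried no information about quality. A score of 0.5 on a real judge is also a useful signal that the pair is too close to call.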

For high-stakes evaluation, an ensemble of judges from different providers, with disagreements escalated to a human, is more robust than trusting any single model. The principle throughout: the more explicit and narrow the rubric, the less room a judge has to fall back on a biased gut feel.
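One way to sketch the ensemble-with-escalation idea is a vote that only returns a verdict when enough judges agree. The function name `ensemble_verdict`, the `ESCALATE` label, the judge names and the default of requiring unanimity are all assumptions chosen for illustration.

```python
from collections import Counter


def ensemble_verdict(verdicts: dict[str, str], min_agreement: float = 1.0) -> str:
    """Return the majority label if its share reaches `min_agreement`, else "ESCALATE".

    `verdicts` maps a judge name to that judge's label, e.g. "PASS" or "FAIL".
    """
    if not verdicts:
        raise ValueError("at least one judge verdict is required")
    label, votes = Counter(verdicts.values()).most_common(1)[0]
    if votes / len(verdicts) >= min_agreement:
        return label
    return "ESCALATE"  # hand the case to a human reviewer


# Judge names are placeholders for models from different providers.
unanimous = ensemble_verdict({"judge_a": "PASS", "judge_b": "PASS", "judge_c": "PASS"})
split = ensemble_verdict({"judge_a": "PASS", "judge_b": "FAIL", "judge_c": "PASS"})
print(unanimous, split)  # PASS ESCALATE
```

Requiring unanimity by default sends every disagreement to a human, which matches the high-stakes case described above; lowering `min_agreement` trades human review time for coverage.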

What is agent-as-a-judge?

Agent-as-a-judge extends LLM-as-a-judge from single answers to multi-step agents. Instead of scoring only the final output, an agentic judge inspects the whole trajectory (the plan, the tool calls, the retrieved context and the intermediate steps), giving richer, step-level feedback on where and why an agent went wrong.

This matters because an agent can reach a correct-looking answer through a broken process: it called the wrong tool, looped unnecessarily, or grounded a claim in a document it never actually retrieved. A final-answer judge misses all of that. A trajectory-aware judge can flag the exact step that failed, which is what makes the feedback actionable for the engineer fixing it.
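A trajectory-aware check can be sketched as a loop that judges each step with the steps before it as context and reports the first failure. The `Step` shape, the `StepJudge` interface and the stub judge below are assumptions for illustration. A real agentic judge would call a model where the stub applies a simple rule.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Step:
    kind: str                       # e.g. "plan", "retrieval", "tool_call", "answer"
    content: str
    doc_ids: tuple[str, ...] = ()   # documents retrieved or cited in this step


# Hypothetical interface: (steps so far, current step) -> did the step pass?
StepJudge = Callable[[list[Step], Step], bool]


def first_failed_step(trajectory: list[Step], judge: StepJudge) -> Optional[int]:
    """Return the index of the first step the judge rejects, or None if all pass."""
    for index, step in enumerate(trajectory):
        if not judge(trajectory[:index], step):
            return index
    return None


def grounded_in_retrieved(history: list[Step], step: Step) -> bool:
    """Stub judge: an answer may only cite documents retrieved earlier."""
    if step.kind != "answer":
        return True
    retrieved = {d for s in history if s.kind == "retrieval" for d in s.doc_ids}
    return set(step.doc_ids) <= retrieved


trajectory = [
    Step("plan", "Look up the refund policy, then answer."),
    Step("retrieval", "Fetched refund policy.", doc_ids=("kb-1",)),
    Step("answer", "Refunds take 5 days.", doc_ids=("kb-2",)),  # cites an unretrieved doc
]
print(first_failed_step(trajectory, grounded_in_retrieved))  # 2
```

The returned index points at the answer step that cites a document the agent never retrieved, which is the kind of failure a final-answer judge would not see.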

Agent-as-a-judge is the natural evaluation method for the kind of multi-step, tool-using agents most teams are now shipping, and it is where agent evaluation is heading as agents take on longer, more autonomous tasks.

Which model should you use as a judge?

Use a capable frontier model as the judge, ideally from a different family than the agent under test to reduce self-preference bias, and calibrate it against human labels before trusting its scores. For high-volume online scoring, a smaller calibrated model can cut cost, but verify its agreement with the larger judge first.

The judge does not have to be the most expensive model for every job. A common pattern is to use a strong frontier model to score offline regression runs and to calibrate, then deploy a cheaper, faster model for continuous online sampling once you have confirmed the two agree closely on your tasks.

Whatever you pick, treat the judge itself as something that can drift: model providers ship updates, and a judge that agreed with your humans last quarter may not this quarter. Re-check calibration on a schedule, the same way you re-run the evals it powers.

How LLM-as-a-judge fits the agent quality loop

LLM-as-a-judge is one scorer inside a larger loop: trace what the agent did, score it with judges and human feedback, turn failures into eval cases, ship a change, and re-score. Judges make that loop affordable by scoring every change offline and a sample of live traffic online. That is the difference between knowing your agent works and hoping it does.

It pairs naturally with the other eval types. Golden-dataset regression evals catch the failures you have already seen; judge-based evals score the open-ended quality those fixed answers cannot; human feedback and online sampling surface the failures you have not imagined yet. Together they form the Evaluate layer between an agent and the systems it touches.

This is the layer Prefactor is built for. Prefactor runs judge-based and golden-dataset evals over multi-step agents (sessions, sub-agents and tool calls, not just single completions), captures human feedback, and tracks per-agent quality and cost over time. It integrates with the traces you already collect rather than asking you to switch observability stacks. For the wider picture of eval types and how to ship your first one, see the Agent Evals guide.

§03 / NEXT STEPS · Prefactor: watch, evaluate, improve, prove

Score every agent with judge-based evals in production

Prefactor helps teams observe, evaluate, and improve their AI agents in production, across every framework and provider.
