Education Resource

What is RAG Evaluation?

Measuring whether a retrieval-augmented system fetches the right context and generates faithful, relevant answers.

Updated 13 June 2026 · 5 min read · 3 sections
TL;DR

RAG evaluation measures whether a retrieval-augmented generation system does two jobs well: retrieving the right context, and generating an answer that is faithful to it. The core metrics are faithfulness (is every claim supported by the retrieved context), context precision and recall (did retrieval surface the right passages and not junk), and answer relevance (did the answer actually address the question). For agents that use RAG, this is one component of the wider agent evaluation, not the whole of it.

What metrics are used to evaluate RAG?

Four do most of the work. Faithfulness (also called groundedness): the share of the answer's claims that are supported by the retrieved context, which makes it the direct measure of hallucination. Context precision: the share of retrieved passages that were actually relevant. Context recall: whether the passages that contained the answer were retrieved at all. Answer relevance: whether the response addresses the user's question rather than drifting. Faithfulness and context recall catch the two most common RAG failures: making things up, and not retrieving the answer in the first place.
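The three countable metrics above can be sketched in a few lines of Python. This is a minimal illustration, not any particular library's definition: the `RagSample` type and its fields are hypothetical, the relevance and claim-support labels are assumed to come from human annotation or a judge model, and context precision is the simple set-based version (some tools weight it by rank). Answer relevance is left out because it needs a judge or an embedding model to score.

```python
from dataclasses import dataclass


@dataclass
class RagSample:
    """One evaluated question. All fields are hypothetical, for illustration."""
    retrieved_ids: list[str]     # passage ids returned by retrieval
    relevant_ids: set[str]       # labelled passages that contain the answer
    claim_supported: list[bool]  # one flag per answer claim: backed by context?


def context_precision(sample: RagSample) -> float:
    """Share of retrieved passages that are relevant."""
    if not sample.retrieved_ids:
        return 0.0
    hits = sum(1 for pid in sample.retrieved_ids if pid in sample.relevant_ids)
    return hits / len(sample.retrieved_ids)


def context_recall(sample: RagSample) -> float:
    """Share of relevant passages that retrieval actually surfaced."""
    if not sample.relevant_ids:
        return 1.0  # assumption: nothing to find counts as full recall
    found = sample.relevant_ids & set(sample.retrieved_ids)
    return len(found) / len(sample.relevant_ids)


def faithfulness(sample: RagSample) -> float:
    """Share of the answer's claims that the retrieved context supports."""
    if not sample.claim_supported:
        return 1.0  # assumption: an answer with no claims has nothing unsupported
    return sum(sample.claim_supported) / len(sample.claim_supported)


sample = RagSample(
    retrieved_ids=["p1", "p7", "p9", "p4"],
    relevant_ids={"p1", "p4", "p5"},
    claim_supported=[True, True, False, True],
)

print(context_precision(sample))  # 2 of 4 retrieved passages are relevant
print(context_recall(sample))     # 2 of 3 relevant passages were retrieved
print(faithfulness(sample))       # 3 of 4 claims are supported
```

In this made-up sample, retrieval missed one answer-bearing passage (`p5`) and the answer contains one unsupported claim, so both of the common failure modes show up in the scores.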

How is RAG evaluation different from agent evaluation?

RAG evaluation scores a retrieval-and-generation pipeline: query in, retrieved context, answer out. Agent evaluation scores the whole agent, which may use RAG as one tool among many alongside multi-step plans and actions. If your agent retrieves and answers, RAG metrics are a crucial part of its eval, but a RAG score alone will not tell you whether the agent called the right tool, stayed in policy, or completed the task. Evaluate the RAG component with RAG metrics, and the agent around it with agent evals.
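To make the distinction concrete, here is a small sketch of a single agent run that passes its RAG checks and still fails as an agent. Everything in it is hypothetical: the `AgentRun` record, the tool names, and the 0.8 threshold are assumptions chosen for illustration, and real agent evals usually check far more than this.

```python
from dataclasses import dataclass


@dataclass
class AgentRun:
    """A hypothetical record of one agent run, for illustration only."""
    tools_called: list[str]
    expected_tools: list[str]
    forbidden_tools: set[str]
    task_completed: bool
    rag_faithfulness: float    # scored on the RAG step of this run
    rag_context_recall: float  # scored on the RAG step of this run


def rag_checks(run: AgentRun, threshold: float = 0.8) -> dict[str, bool]:
    """Checks on the retrieval-and-generation component only."""
    return {
        "faithful": run.rag_faithfulness >= threshold,
        "context_recalled": run.rag_context_recall >= threshold,
    }


def agent_checks(run: AgentRun) -> dict[str, bool]:
    """Checks on the agent around the RAG component."""
    return {
        "right_tools": run.tools_called == run.expected_tools,
        "in_policy": not (set(run.tools_called) & run.forbidden_tools),
        "task_completed": run.task_completed,
    }


run = AgentRun(
    tools_called=["search_kb", "issue_refund"],
    expected_tools=["search_kb", "open_ticket"],
    forbidden_tools={"issue_refund"},
    task_completed=False,
    rag_faithfulness=0.95,
    rag_context_recall=1.0,
)

print(rag_checks(run))    # every RAG check passes
print(agent_checks(run))  # every agent-level check fails
```

Keeping the two sets of checks separate makes it clear which layer a regression belongs to: the retrieval-and-answer step, or the agent's tool use and task completion.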

How do you evaluate a RAG agent in production?

Run RAG evals in the same two places as any agent eval. Offline, run them against a golden dataset of questions with known source passages, and rerun whenever the prompt, model, or index changes. Online, score a sample of live answers for faithfulness and relevance with an LLM-as-a-judge. Production sampling is what catches the failures a fixed test set never imagined, such as a new document that confuses retrieval or a model update that starts paraphrasing beyond the source. Feed those live failures back into the golden dataset.
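The offline and online halves of that loop might look roughly like the sketch below. It is written under several assumptions: the golden dataset is a list of (question, source passage ids) pairs, `fake_retrieve` stands in for a real retriever, the 0.9 gate is arbitrary, and live traffic is sampled by hashing a request id. The LLM judge itself is not shown.

```python
import hashlib
from typing import Callable

# Assumed shape of a golden case: (question, ids of its known source passages).
GoldenCase = tuple[str, set[str]]


def offline_recall(cases: list[GoldenCase],
                   retrieve: Callable[[str], list[str]]) -> float:
    """Mean context recall over the golden dataset."""
    if not cases:
        raise ValueError("golden dataset is empty")
    scores = []
    for question, source_ids in cases:
        if not source_ids:
            raise ValueError(f"no source passages labelled for: {question}")
        retrieved = set(retrieve(question))
        scores.append(len(source_ids & retrieved) / len(source_ids))
    return sum(scores) / len(scores)


def in_sample(request_id: str, rate: float) -> bool:
    """Deterministically pick roughly `rate` of live requests for judging."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def fake_retrieve(question: str) -> list[str]:
    """Stand-in retriever with a hard-coded index (hypothetical data)."""
    index = {
        "What is the refund window?": ["doc-refunds", "doc-shipping"],
        "Who approves expenses?": ["doc-travel"],
    }
    return index.get(question, [])


golden: list[GoldenCase] = [
    ("What is the refund window?", {"doc-refunds"}),
    ("Who approves expenses?", {"doc-expenses"}),
]

score = offline_recall(golden, fake_retrieve)
gate_passed = score >= 0.9  # assumed threshold; run on every prompt, model or index change

# A live failure found by the online judge is promoted into the golden set.
golden.append(("Can contractors claim mileage?", {"doc-contractors"}))
```

Hashing the request id rather than calling a random number generator keeps the sampling decision reproducible, so the same request is always either in or out of the judged sample.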

