
RAG Evaluation Checklist

A practical checklist for evaluating RAG retrieval quality, source faithfulness, citations, no-answer behavior, latency, and human review effort.

A RAG evaluation should measure retrieval relevance, source faithfulness, citation quality, no-answer behavior, latency, and review effort. Fluency is not the goal. A fluent answer can still use the wrong source, invent a claim, or hide missing evidence.

The evaluation should split retrieval from generation. If the right chunks were never retrieved, the answer cannot be faithful. If the chunks were retrieved but the answer adds unsupported claims, the generation or output policy is the problem.
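The split between retrieval and generation can be made explicit when triaging a failed fixture. The following is a minimal sketch, not a prescribed implementation: the `Verdict` enum, the `triage` function, and the idea of counting unsupported claims per answer are illustrative assumptions, and the expected chunk IDs are assumed to come from a hand-labelled fixture.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    RETRIEVAL_FAILURE = "retrieval_failure"
    GENERATION_FAILURE = "generation_failure"


def triage(expected_chunk_ids: set[str],
           retrieved_chunk_ids: set[str],
           unsupported_claims: int) -> Verdict:
    """Attribute a fixture result to retrieval or generation (hypothetical helper)."""
    # Retrieval is checked first: if a required chunk is missing, the
    # answer cannot be faithful no matter how well it reads.
    if not expected_chunk_ids <= retrieved_chunk_ids:
        return Verdict.RETRIEVAL_FAILURE
    # The evidence was available, so any unsupported claim points at the
    # generation side (prompt, format, or output policy).
    if unsupported_claims > 0:
        return Verdict.GENERATION_FAILURE
    return Verdict.PASS


print(triage({"c1", "c2"}, {"c1"}, 0))        # required chunk c2 was never retrieved
print(triage({"c1"}, {"c1", "c7"}, 2))        # evidence present, answer added claims
```

Checking retrieval before generation is a deliberate ordering: it keeps a good-sounding answer from masking a missing source.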

Use this checklist with RAG No-Answer Testing and the AI Hallucination Testing Guide when unsupported claims are the main risk.

Build the fixture set

Start with a small set of representative questions. Include answerable questions, ambiguous questions, no-answer questions, and adversarial prompts that pressure the system to make unsupported claims.

Each fixture should record:

Do not build the fixture set only from easy examples. The no-answer and ambiguous cases are what reveal whether the system understands its evidence boundary.
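One way to keep the fixture set honest is to store each fixture as a small record and check that every question type is represented. This is a sketch under assumptions: the field names (`fixture_id`, `expected_source_ids`, `notes`) are hypothetical and are not this guide's own list of fields; only the four question kinds come from the text above.

```python
from collections import Counter
from dataclasses import dataclass, field

# The four question types named in this section.
KINDS = ("answerable", "ambiguous", "no_answer", "adversarial")


@dataclass
class Fixture:
    fixture_id: str                      # hypothetical field names throughout
    question: str
    kind: str
    expected_source_ids: list[str] = field(default_factory=list)
    notes: str = ""

    def __post_init__(self):
        if self.kind not in KINDS:
            raise ValueError(f"unknown fixture kind: {self.kind}")


def missing_kinds(fixtures: list[Fixture]) -> list[str]:
    """Return the question types that have no fixture yet."""
    counts = Counter(f.kind for f in fixtures)
    return [kind for kind in KINDS if counts[kind] == 0]


fixtures = [
    Fixture("f1", "What is the refund window?", "answerable", ["policy-12"]),
    Fixture("f2", "What is the CEO's home address?", "no_answer"),
]
print(missing_kinds(fixtures))
```

A no-answer fixture simply leaves `expected_source_ids` empty, which makes the evidence boundary part of the test data rather than something a reviewer has to remember.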

Score retrieval quality

Before reading the generated answer, inspect the retrieved sources. Ask whether the retrieved chunks contain enough information to answer the question. Score relevance, completeness, and freshness where it matters, and note whether the chunks contain conflicting information.

Common retrieval failures:

If retrieval fails, do not treat a good-sounding answer as success. It may be relying on model prior knowledge instead of the approved corpus.
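Retrieval can be scored without looking at the generated answer at all. The sketch below assumes each fixture lists the chunk IDs a reviewer expects to see; `recall_at_k` is used here as a rough stand-in for completeness and `precision_at_k` for relevance. Both function names and the choice of these two metrics are assumptions for illustration, and neither covers freshness or conflicting chunks.

```python
from typing import Optional


def recall_at_k(expected_ids: set[str], retrieved_ids: list[str], k: int) -> Optional[float]:
    """Share of expected chunks that appear in the top k results."""
    if not expected_ids:
        # No-answer fixtures have nothing to recall; score them separately.
        return None
    top_k = set(retrieved_ids[:k])
    return len(expected_ids & top_k) / len(expected_ids)


def precision_at_k(expected_ids: set[str], retrieved_ids: list[str], k: int) -> float:
    """Share of the top k results that a reviewer marked as expected."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for chunk_id in top_k if chunk_id in expected_ids)
    return hits / len(top_k)


expected = {"policy-12", "policy-13"}
retrieved = ["policy-12", "blog-4", "policy-13"]   # ranked order
print(recall_at_k(expected, retrieved, k=2))
print(precision_at_k(expected, retrieved, k=2))
```

Low recall with a fluent answer is the case described above: the answer is probably leaning on model prior knowledge rather than the approved corpus.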

This separation also helps teams prioritize fixes. Retrieval failures may require corpus cleanup, metadata, chunking, or ranking changes. Generation failures may require prompt, format, or verification changes. Mixing the two hides the real bottleneck, both during review and when planning retests.

Score answer faithfulness

Faithfulness asks whether the answer’s material claims are supported by retrieved sources. It is stricter than general correctness. The answer may be true in the world and still fail if the source packet does not support it.
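Faithfulness can be recorded claim by claim. The sketch below assumes a human reviewer has already split the answer into material claims and noted which retrieved chunk, if any, supports each one; the `Claim` record and `faithfulness` function are hypothetical names, and the code only tallies the reviewer's labels rather than detecting support automatically.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    text: str
    supporting_chunk_id: Optional[str]   # None when the reviewer found no support


def faithfulness(claims: list[Claim], retrieved_chunk_ids: set[str]):
    """Return (score, unsupported_claims) for one answer."""
    # A claim counts as supported only if its evidence is in the source
    # packet. A true statement backed by a chunk that was never retrieved
    # still fails, which is what makes this stricter than correctness.
    unsupported = [c for c in claims if c.supporting_chunk_id not in retrieved_chunk_ids]
    score = 1.0 if not claims else 1 - len(unsupported) / len(claims)
    return score, unsupported


claims = [
    Claim("Refunds are available for 30 days.", "policy-12"),
    Claim("Refunds are processed within 24 hours.", None),
]
score, unsupported = faithfulness(claims, {"policy-12", "policy-13"})
print(score, [c.text for c in unsupported])
```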

Check:

Use the LLM Output Verification Guide for a general claim-by-claim review pattern.

Score citations and no-answer behavior

Citations should help users or reviewers inspect the answer. A citation that points to a broad document but not the supporting passage is weak, because the reviewer still has to find the evidence. A citation attached to a claim that the source does not support is worse than no citation because it creates false confidence.
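A cheap first pass on citation quality is to check that each citation names a known source and points at a passage that actually appears in it. This is a sketch under assumptions: the `corpus` mapping, the idea that a citation carries a quoted passage, and the four status labels are all hypothetical. Finding the passage does not prove the claim is supported; it only tells the reviewer where to look.

```python
def _normalize(text: str) -> str:
    # Collapse whitespace and case so line wrapping does not cause misses.
    return " ".join(text.lower().split())


def check_citation(quote: str, source_id: str, corpus: dict[str, str]) -> str:
    """Classify one citation as unknown_source, broad, located, or not_found."""
    source_text = corpus.get(source_id)
    if source_text is None:
        return "unknown_source"      # cites something outside the corpus
    if not quote.strip():
        return "broad"               # document-level citation, no passage given
    if _normalize(quote) in _normalize(source_text):
        return "located"             # passage exists; support still needs review
    return "not_found"               # quoted passage is not in the cited source


corpus = {"policy-12": "Refunds are available within 30 days\nof purchase."}
print(check_citation("within 30 days of purchase", "policy-12", corpus))
print(check_citation("within 24 hours", "policy-12", corpus))
```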

No-answer behavior matters just as much. If the corpus does not answer the question, the system should say what evidence is missing and avoid filling the gap. The refusal can still be helpful by suggesting the next source or escalation path.
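No-answer behavior can be tallied across the fixture set as four outcomes. The sketch below assumes each result has been reduced to two booleans, whether the corpus lacks an answer and whether the system declined; the outcome labels and the `no_answer_outcomes` function are illustrative names, and deciding whether a response counts as declining is left to the reviewer.

```python
from collections import Counter

# (should_abstain, did_abstain) -> outcome label (hypothetical labels)
OUTCOMES = {
    (True, True): "correct_abstention",
    (True, False): "filled_gap",        # answered without evidence
    (False, True): "over_refusal",      # declined despite available evidence
    (False, False): "answered",
}


def no_answer_outcomes(results) -> Counter:
    """Count outcomes for an iterable of (should_abstain, did_abstain) pairs."""
    return Counter(OUTCOMES[(bool(s), bool(d))] for s, d in results)


results = [(True, True), (True, False), (False, False), (False, True)]
for outcome, count in no_answer_outcomes(results).items():
    print(outcome, count)
```

Tracking over-refusals alongside filled gaps keeps the system from passing the no-answer tests by declining everything.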

Score latency and review effort

RAG systems are operational tools, so measure the burden they create. A slow answer may be fine for research and unacceptable for support. An answer with visible citations and clear uncertainty is easier to review than a dense paragraph that requires source hunting.

Review effort should include how long it takes a human to confirm or reject the answer. If the answer is technically correct but costly to verify, the product may need a better output format.
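Latency and review effort can be summarized side by side so that a fast but hard-to-verify answer does not look like a win. This is a minimal sketch: the nearest-rank percentile, the `latency_budget_s` parameter, and the field names in the summary are assumptions, and the acceptable budget depends on the use case (research versus support).

```python
import math
from statistics import median


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile for pct in 1..100 over a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(latencies_s: list[float], review_seconds: list[float],
              latency_budget_s: float) -> dict:
    """Combine system latency with human time-to-verify (hypothetical fields)."""
    p95 = percentile(latencies_s, 95)
    return {
        "latency_p50_s": percentile(latencies_s, 50),
        "latency_p95_s": p95,
        "within_budget": p95 <= latency_budget_s,
        # Time a reviewer needed to confirm or reject each answer.
        "median_review_s": median(review_seconds),
    }


print(summarize([1.0, 2.0, 3.0, 4.0, 10.0], [30, 90, 60], latency_budget_s=5.0))
```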

Verification checklist

Before accepting a RAG result, confirm:

Use the RAG Evaluation Template to record fixtures, scores, failures, and retest notes.

FAQ

What should a RAG evaluation measure?

A RAG evaluation should measure retrieval relevance, source faithfulness, citation quality, no-answer behavior, latency, and review effort.

Why is answer fluency not enough?

Answer fluency is not enough because a fluent answer can still use the wrong source, invent a claim, or hide missing evidence.
