RAG Evaluation Checklist

A practical checklist for evaluating RAG retrieval quality, source faithfulness, citations, no-answer behavior, latency, and human review effort.

Short answer

A RAG evaluation should measure retrieval relevance, source faithfulness, citation quality, no-answer behavior, latency, and review effort. Fluency is not the goal: a fluent answer can still use the wrong source, invent a claim, or hide missing evidence.

Split retrieval from generation. If the right chunks were never retrieved, the answer cannot be faithful. If they were retrieved but the answer adds unsupported claims, the generation or output policy is the problem.

Use this checklist with RAG No-Answer Testing and the AI Hallucination Testing Guide when unsupported claims are the main risk.

Build the fixture set

Start with a small set of representative questions. Include answerable questions, ambiguous questions, no-answer questions, and adversarial prompts that pressure the system to make unsupported claims.

Each fixture should record:

User question.
Expected source or source group.
Expected answer behavior.
Required citation behavior.
Whether refusal is acceptable or required.
Severity if the answer is wrong.

Do not build the fixture set only from easy examples. The no-answer and ambiguous cases are what reveal whether the system understands its evidence boundary.
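
The fixture fields listed above could be captured in a small record type. The sketch below is one possible shape, not a required schema: the class and field names, the three-level refusal enum, the severity labels, and both example questions are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Refusal(Enum):
    NOT_ACCEPTABLE = "not_acceptable"
    ACCEPTABLE = "acceptable"
    REQUIRED = "required"


@dataclass(frozen=True)
class Fixture:
    question: str
    expected_sources: tuple[str, ...]  # hypothetical source or source-group IDs
    expected_behavior: str
    citation_rule: str
    refusal: Refusal
    severity: str  # assumed scale: "low", "medium", "high"


# Illustrative fixtures only; the questions and source IDs are made up.
fixtures = [
    Fixture(
        question="What is the refund window for annual plans?",
        expected_sources=("billing-policy",),
        expected_behavior="State the window and its conditions.",
        citation_rule="Cite the passage that states the window.",
        refusal=Refusal.NOT_ACCEPTABLE,
        severity="high",
    ),
    Fixture(
        question="What will the price be next year?",
        expected_sources=(),
        expected_behavior="Say the corpus does not cover future pricing.",
        citation_rule="No citation expected.",
        refusal=Refusal.REQUIRED,
        severity="medium",
    ),
]


def coverage(fixtures):
    """Count fixtures per refusal category to spot a set built only from easy cases."""
    counts = {r: 0 for r in Refusal}
    for f in fixtures:
        counts[f.refusal] += 1
    return counts
```

Calling `coverage(fixtures)` before a run gives a quick view of whether no-answer cases are represented at all.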

Score retrieval quality

Before reading the generated answer, inspect retrieved sources. Ask whether the retrieved chunks contain enough information to answer the question. Score relevance, completeness, freshness if relevant, and whether the chunks include conflicting information.

Common retrieval failures:

The right document is missing.
The right document is retrieved but the relevant passage is absent.
Similar but wrong documents are ranked higher.
The chunk lacks context needed for interpretation.
Old or superseded content appears without warning.

If retrieval fails, do not treat a good-sounding answer as success. It may be relying on model prior knowledge instead of the approved corpus.
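
Part of this inspection can be automated when fixtures name their expected sources. The sketch below assumes each retrieved chunk carries a source ID; the function names and IDs are hypothetical. It only checks whether the expected source was retrieved and how highly it ranked, so passage-level sufficiency, freshness, and conflicts still need a human look.

```python
def recall_at_k(retrieved_ids, expected_ids, k):
    """Fraction of expected sources that appear in the top-k retrieved results."""
    expected = set(expected_ids)
    if not expected:
        return None  # no-answer fixture: recall is undefined
    top_k = set(retrieved_ids[:k])
    return len(top_k & expected) / len(expected)


def first_relevant_rank(retrieved_ids, expected_ids):
    """1-based rank of the first expected source, or None if it was never retrieved."""
    expected = set(expected_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in expected:
            return rank
    return None


# Example: a similar but wrong document outranks the expected one.
retrieved = ["pricing-faq", "billing-policy", "old-billing-policy"]
print(recall_at_k(retrieved, ["billing-policy"], k=1))      # 0.0
print(recall_at_k(retrieved, ["billing-policy"], k=3))      # 1.0
print(first_relevant_rank(retrieved, ["billing-policy"]))   # 2
```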

This separation also helps teams prioritize fixes. Retrieval failures may require corpus cleanup, metadata, chunking, or ranking changes. Generation failures may require prompt, format, or verification changes. Mixing the two hides the real bottleneck, both during review and when planning retests.
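
One way to keep the two failure types apart in a results sheet is to assign each fixture a single stage label. This is a minimal sketch that assumes a reviewer has already judged retrieval sufficiency and answer faithfulness as booleans; the function names and labels are illustrative.

```python
from collections import Counter


def triage(retrieval_sufficient: bool, answer_faithful: bool) -> str:
    """Attribute a fixture result to the stage that needs fixing."""
    if not retrieval_sufficient:
        # Checked first: without the right chunks, the answer cannot be
        # faithful to the corpus, however good it sounds.
        return "retrieval"
    if not answer_faithful:
        return "generation"
    return "pass"


def tally(results):
    """results: iterable of (retrieval_sufficient, answer_faithful) pairs."""
    return Counter(triage(r, f) for r, f in results)
```

A tally dominated by "retrieval" points toward corpus, chunking, or ranking work; one dominated by "generation" points toward prompt, format, or verification changes.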

Score answer faithfulness

Faithfulness asks whether the answer’s material claims are supported by retrieved sources. It is stricter than general correctness. The answer may be true in the world and still fail if the source packet does not support it.

Check:

Does every material claim map to a retrieved source?
Are caveats preserved?
Are numbers, names, dates, and constraints copied accurately?
Does the answer combine sources into an unsupported conclusion?
Does it say when evidence is incomplete?
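
The check on numbers is the easiest of these to partly automate. The sketch below is a crude heuristic, not a faithfulness metric: it flags numeric tokens in the answer that do not appear verbatim in any retrieved source. The function name and regular expression are assumptions, and reformatted values (for example "80 percent" versus "80%") will be flagged even when they agree.

```python
import re

# Matches integers, decimals, and thousands-separated numbers, with an optional "%".
NUMBER = re.compile(r"\d+(?:[.,]\d+)*%?")


def unsupported_numbers(answer: str, sources: list[str]) -> list[str]:
    """Return numeric tokens from the answer that appear in none of the sources."""
    source_numbers = set()
    for text in sources:
        source_numbers.update(NUMBER.findall(text))
    return [n for n in NUMBER.findall(answer) if n not in source_numbers]


sources = [
    "Refunds are available within 14 days.",
    "Refunds cover up to 80% of the fee.",
]
answer = "Refunds are available within 30 days for up to 80% of the fee."
print(unsupported_numbers(answer, sources))  # ['30']
```

Anything this flags still needs a reviewer's judgment; an empty result does not prove the answer is faithful.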

Use the LLM Output Verification Guide for a general claim-by-claim review pattern.

Score citations and no-answer behavior

Citations should help users or reviewers inspect the answer. A citation that points to a broad document but not the supporting passage is weak, because the reviewer still has to hunt for the evidence. A citation attached to a claim that the source does not support is worse than no citation because it creates false confidence.
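
Citation review can be recorded per claim. The sketch below assumes a reviewer marks whether each cited passage supports its claim; the `Citation` type, its fields, and the report keys are hypothetical. It separates citations that point outside the retrieved set from citations that point at a retrieved passage that does not support the claim.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    claim: str
    passage_id: str       # hypothetical ID of the cited passage
    supports_claim: bool  # reviewer judgment, not computed


def citation_report(citations, retrieved_ids):
    """Summarize citation quality for one answer."""
    retrieved = set(retrieved_ids)
    # Cited passage was never retrieved, so the reviewer cannot inspect it.
    dangling = [c for c in citations if c.passage_id not in retrieved]
    # Cited passage exists but does not back the claim: false confidence.
    misleading = [
        c for c in citations
        if c.passage_id in retrieved and not c.supports_claim
    ]
    supported = len(citations) - len(dangling) - len(misleading)
    precision = supported / len(citations) if citations else None
    return {"precision": precision, "dangling": dangling, "misleading": misleading}
```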

No-answer behavior matters just as much. If the corpus does not answer the question, the system should say what evidence is missing and avoid filling the gap. The refusal can still be helpful by suggesting the next source or escalation path.
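
No-answer results fall into four outcomes once each fixture states whether refusal is required. The sketch below uses invented labels and assumes a reviewer has decided whether the system actually refused. Fixtures where refusal is merely acceptable are not modeled here and would need their own rule.

```python
def no_answer_outcome(refusal_required: bool, system_refused: bool) -> str:
    """Classify one fixture result for no-answer scoring."""
    if refusal_required and system_refused:
        return "correct_refusal"
    if refusal_required and not system_refused:
        return "gap_filled"    # answered although the corpus has no evidence
    if system_refused:
        return "over_refusal"  # refused although the corpus could answer
    return "answered"
```

Tracking "gap_filled" and "over_refusal" separately matters because tightening one tends to loosen the other.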

Score latency and review effort

RAG systems are operational tools, so measure the burden they create. A slow answer may be fine for research and unacceptable for support. An answer with visible citations and clear uncertainty is easier to review than a dense paragraph that requires source hunting.

Review effort should include how long it takes a human to confirm or reject the answer. If the answer is technically correct but costly to verify, the product may need a better output format.
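
Both measurements reduce to simple summaries over the fixture run. The sketch below assumes latencies are logged in seconds and review time in minutes; the function names, the choice of the 95th percentile, and the budget parameter are illustrative choices, not recommendations from this guide.

```python
import statistics


def p95(values):
    """95th percentile; needs at least two data points."""
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return statistics.quantiles(values, n=20, method="inclusive")[-1]


def summarize(latencies_s, review_minutes, latency_budget_s):
    """Summarize operational burden for one evaluation run."""
    tail = p95(latencies_s)
    return {
        "p95_latency_s": tail,
        "median_review_min": statistics.median(review_minutes),
        "within_budget": tail <= latency_budget_s,
    }
```

The latency budget should come from the workflow: a research tool and a support tool would pass very different values for `latency_budget_s`.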

Verification checklist

Before accepting a RAG result, confirm:

Retrieved sources are relevant and sufficient.
Material claims are faithful to sources.
Citations point to supporting evidence.
The system declines and names the missing evidence when the corpus does not answer the question.
Ambiguous source cases preserve uncertainty.
Latency fits the workflow.
Reviewer effort is acceptable.
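
The checklist above can be applied as an all-or-nothing gate. This is a minimal sketch that assumes each check has been reduced to a boolean by a reviewer or an earlier scoring step; the check names are invented for illustration.

```python
CHECKS = (
    "retrieval_sufficient",
    "claims_faithful",
    "citations_support_claims",
    "no_answer_works",
    "uncertainty_preserved",
    "latency_fits",
    "review_effort_acceptable",
)


def accept(result: dict) -> tuple[bool, list[str]]:
    """Return (accepted, failed_checks). A missing check counts as a failure."""
    failed = [name for name in CHECKS if not result.get(name, False)]
    return (not failed, failed)
```

Treating a missing check as a failure keeps a partially reviewed result from being accepted by default.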

Use the RAG Evaluation Template to record fixtures, scores, failures, and retest notes.

FAQ

What should a RAG evaluation measure?

A RAG evaluation should measure retrieval relevance, source faithfulness, citation quality, no-answer behavior, latency, and review effort.

Why is answer fluency not enough?

Answer fluency is not enough because a fluent answer can still use the wrong source, invent a claim, or hide missing evidence.
