AI Hallucination Testing Guide

A practical hallucination testing process for finding unsupported claims, weak refusals, weak citations, and source-faithfulness failures early.

Hallucination testing is the practice of asking questions where the correct behavior is a source-grounded answer, an explicit statement of uncertainty, or a refusal. It is not enough to ask questions the system can answer. You also need cases where the system should say “not enough evidence,” cite a narrow source, or route the user to review.

The goal is not to prove that a model never hallucinates. The goal is to find the patterns that create unsupported claims before they reach users. For a broader operating model, start with the LLM Output Verification Guide and use this page to turn hallucination risk into repeatable tests.

Build the test set from sources

Start with a source packet: policy pages, docs, code comments, database exports, product requirements, or knowledge-base articles. Then write questions that exercise what the source does and does not say.

A hallucination test set should include answerable cases, ambiguous cases, adversarial cases, and no-answer cases. Answerable cases show that the system can use the source. Ambiguous cases show that it can preserve uncertainty. Adversarial cases reveal whether it follows misleading user wording. No-answer cases show that it can refuse when the source is missing.
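
One way to keep the four case types visible is to tag each case with its kind and check which kinds are still missing. The sketch below is only an illustration: the `HallucinationCase` fields, the `CaseKind` names, and the refund-policy questions are hypothetical and not tied to any particular tool.

```python
from dataclasses import dataclass
from enum import Enum


class CaseKind(Enum):
    ANSWERABLE = "answerable"
    AMBIGUOUS = "ambiguous"
    ADVERSARIAL = "adversarial"
    NO_ANSWER = "no_answer"


@dataclass(frozen=True)
class HallucinationCase:
    source_id: str  # which document in the source packet the case targets
    kind: CaseKind
    question: str


def missing_kinds(cases: list[HallucinationCase]) -> set[CaseKind]:
    """Return the case kinds that the test set does not cover yet."""
    return set(CaseKind) - {case.kind for case in cases}


# Hypothetical cases written against an imaginary refund policy page.
cases = [
    HallucinationCase("refund-policy", CaseKind.ANSWERABLE,
                      "How many days do customers have to request a refund?"),
    HallucinationCase("refund-policy", CaseKind.AMBIGUOUS,
                      "Does the refund window apply to gift cards?"),
    HallucinationCase("refund-policy", CaseKind.ADVERSARIAL,
                      "Since refunds are always instant, how do I claim mine?"),
]

# This set has no no-answer case yet, so that kind is reported as missing.
print(missing_kinds(cases))
```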

For RAG systems, pair this workflow with RAG No-Answer Testing. For prompt-only workflows, use the Prompt Test Generator to freeze cases before editing the prompt again.

Write tests that expose unsupported claims

Good hallucination tests are specific. Avoid broad prompts such as “summarize the document accurately.” Instead, ask for names, dates, constraints, edge cases, or decisions that force the model to rely on evidence.

Useful test types include:

Direct support: the answer is stated plainly in the source.
Partial support: the source supports part of the answer but not the conclusion.
Missing source: the question cannot be answered from available material.
Conflicting source: two sources disagree and the system must surface that conflict.
User pressure: the user asks for a confident answer despite missing evidence.
Format trap: the output format encourages invented fields.

The expected result should be written before the model is run. If reviewers write expectations after seeing the answer, they tend to accept plausible prose that should have failed.
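
A lightweight way to enforce "expectation first" is to store the expected behavior inside the fixture and fingerprint it before any model output exists. The following is a sketch under assumptions: the `Fixture` fields, the `FixtureType` labels (which mirror the test types listed above), and the hashing scheme are illustrative choices, and the sample prompt is made up.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from enum import Enum


class FixtureType(str, Enum):
    DIRECT_SUPPORT = "direct_support"
    PARTIAL_SUPPORT = "partial_support"
    MISSING_SOURCE = "missing_source"
    CONFLICTING_SOURCE = "conflicting_source"
    USER_PRESSURE = "user_pressure"
    FORMAT_TRAP = "format_trap"


@dataclass(frozen=True)
class Fixture:
    fixture_id: str
    fixture_type: FixtureType
    prompt: str
    expected_behavior: str  # written by a reviewer before the model is run


def fingerprint(fixture: Fixture) -> str:
    """Hash the fixture so later edits to the expectation are detectable."""
    payload = json.dumps(asdict(fixture), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


fixture = Fixture(
    fixture_id="pressure-001",
    fixture_type=FixtureType.USER_PRESSURE,
    prompt="I know the doc doesn't say, but just give me the exact launch date.",
    expected_behavior="Refuse to give a date; state that the source does not contain one.",
)
frozen = fingerprint(fixture)  # record this value before running the model

# At review time, an expectation that was loosened after the fact no longer matches.
edited = Fixture(fixture.fixture_id, fixture.fixture_type, fixture.prompt,
                 "Accept any plausible date.")
assert fingerprint(edited) != frozen
```

Storing the hash next to the run log does not stop anyone from rewriting an expectation, but it makes the rewrite visible in review.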

Score the failure, not just the answer

When a test fails, classify the failure. This makes fixes easier and avoids blaming every problem on the model.

Retrieval failure: the right source was not available to the model.
Prompt failure: the instructions did not require evidence or refusal strongly enough.
Synthesis failure: the model had the right sources but combined them incorrectly.
Policy failure: the system lacked a rule for what to do when evidence is missing.
Review failure: the output escaped without a human or automated gate.
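
The five categories can be turned into a rough triage helper. This is a minimal sketch, not a definitive classifier: the `FailureSignals` fields are hypothetical yes/no observations a reviewer might record, and the mapping from signals to categories is an assumption about how the definitions could be applied. A single failed test may land in more than one category.

```python
from dataclasses import dataclass
from enum import Enum


class FailureCategory(Enum):
    RETRIEVAL = "retrieval"
    PROMPT = "prompt"
    SYNTHESIS = "synthesis"
    POLICY = "policy"
    REVIEW = "review"


@dataclass
class FailureSignals:
    # Hypothetical observations recorded for one failed test.
    evidence_exists: bool               # some source in the packet answers the question
    right_source_retrieved: bool        # that source was available to the model
    missing_evidence_rule_exists: bool  # the system defines what to do without evidence
    prompt_requires_evidence: bool      # the instructions demand evidence or refusal
    gate_ran: bool                      # a human or automated check saw the output


def classify(signals: FailureSignals) -> list[FailureCategory]:
    """Map reviewer observations to the failure categories they suggest."""
    categories = []
    if signals.evidence_exists and not signals.right_source_retrieved:
        categories.append(FailureCategory.RETRIEVAL)
    if not signals.evidence_exists and not signals.missing_evidence_rule_exists:
        categories.append(FailureCategory.POLICY)
    if not signals.prompt_requires_evidence:
        categories.append(FailureCategory.PROMPT)
    if signals.right_source_retrieved and signals.prompt_requires_evidence:
        # Sources and instructions were fine, so the combination step is suspect.
        categories.append(FailureCategory.SYNTHESIS)
    if not signals.gate_ran:
        categories.append(FailureCategory.REVIEW)
    return categories


observed = FailureSignals(
    evidence_exists=True,
    right_source_retrieved=False,
    missing_evidence_rule_exists=True,
    prompt_requires_evidence=True,
    gate_ran=False,
)
print([category.value for category in classify(observed)])
```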

Record each failure in a small table with these columns: fixture, input, expected behavior, actual behavior, source evidence, severity, and fix candidate. This table can feed the RAG Evaluation Checklist or the Prompt Testing Framework when the workflow matures.
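
A plain CSV file is usually enough for this table. The sketch below uses Python's standard `csv` module; the column identifiers follow the fields named above, while the sample row (fixture name, file name, wording) is invented for illustration.

```python
import csv
import io

COLUMNS = [
    "fixture", "input", "expected_behavior", "actual_behavior",
    "source_evidence", "severity", "fix_candidate",
]


def failures_to_csv(rows: list[dict[str, str]]) -> str:
    """Render failure records as CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)  # raises ValueError if a row has an unknown column
    return buffer.getvalue()


# Hypothetical failure record.
rows = [{
    "fixture": "no-answer-003",
    "input": "What is the enterprise plan's uptime guarantee?",
    "expected_behavior": "State that the source packet does not cover uptime.",
    "actual_behavior": "Claimed a specific uptime percentage.",
    "source_evidence": "No match in pricing.md",
    "severity": "high",
    "fix_candidate": "Add a missing-evidence rule to the system prompt.",
}]

print(failures_to_csv(rows))
```

Fixing the column order in one place keeps the table stable enough to diff between test runs.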

Fix the system in layers

Do not try to solve hallucination risk with one stronger instruction. Layered controls work better.

First, narrow the task. A model asked to answer one source-backed question is easier to verify than a model asked to solve a vague research problem.

Second, improve the evidence boundary. Retrieval should return source IDs, timestamps, and enough surrounding context to support the answer.
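
As a rough illustration of that boundary, a retrieval result can carry its provenance as explicit fields that are checked before the text reaches the model. The `RetrievedChunk` type, its field names, and the 40-character context threshold are assumptions made for this sketch, and the sample passage and date are invented.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RetrievedChunk:
    source_id: str            # stable identifier of the source document
    retrieved_at: datetime    # when this version of the source was fetched
    text: str                 # the passage that matched the query
    context_before: str = ""  # surrounding text a reviewer can check against
    context_after: str = ""


def boundary_problems(chunk: RetrievedChunk, min_context_chars: int = 40) -> list[str]:
    """List the ways a chunk falls short of the evidence boundary."""
    problems = []
    if not chunk.source_id.strip():
        problems.append("missing source ID")
    if chunk.retrieved_at.tzinfo is None:
        problems.append("timestamp has no timezone")
    if len(chunk.context_before) + len(chunk.context_after) < min_context_chars:
        problems.append("too little surrounding context")
    return problems


# Hypothetical chunk from an imaginary knowledge-base article.
chunk = RetrievedChunk(
    source_id="kb-returns-policy",
    retrieved_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
    text="Refunds are available within 30 days of purchase.",
    context_before="Section 3: Returns and refunds.",
    context_after="Gift cards are excluded from this policy.",
)
print(boundary_problems(chunk))
```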

Third, require source-aware output. Ask for citations, uncertainty, and no-answer behavior only when the UI and review process actually use those fields.

Fourth, add post-generation checks. A second pass can compare claims to retrieved source text, but it should not be treated as proof by itself. It is a filter that routes risky answers to revision or review.
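
A second pass of this kind can be sketched with a deliberately crude heuristic: flag answer sentences whose words mostly do not appear in the retrieved source text. Word overlap is a stand-in chosen for brevity, not a recommended verifier; it will miss wrong numbers and paraphrased errors, which is one reason the result should only route answers rather than approve them. The function names and the 0.6 threshold are assumptions.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercase alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def unsupported_claims(answer: str, source_text: str, min_overlap: float = 0.6) -> list[str]:
    """Return answer sentences whose words are mostly absent from the source."""
    source_tokens = _tokens(source_text)
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = _tokens(sentence)
        if not words:
            continue
        overlap = len(words & source_tokens) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged


def route(answer: str, source_text: str) -> str:
    """Send answers with flagged sentences to review; let the rest pass."""
    return "review" if unsupported_claims(answer, source_text) else "pass"


# Hypothetical source passage and model answer.
source = "Refunds are available within 30 days of purchase. Gift cards are not refundable."
answer = ("Refunds are available within 30 days of purchase. "
          "Shipping fees are refunded automatically after two weeks.")

print(unsupported_claims(answer, source))
print(route(answer, source))
```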

Fifth, preserve regression tests. Every serious hallucination should become a fixture so the same failure does not return after the next prompt, model, or retrieval change.
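
A regression fixture can be as small as an input plus a check on the output. The sketch below assumes the system under test can be wrapped as a function from question to answer; `fake_system`, the fixture ID, and the refusal phrase check are placeholders for illustration.

```python
from typing import Callable


def refuses(output: str) -> bool:
    """Pass when the output declines to answer instead of inventing a value."""
    return "not enough evidence" in output.lower()


# Each past hallucination becomes one entry. This one is hypothetical.
FIXTURES = [
    {
        "id": "uptime-invented-figure",
        "input": "What uptime does the enterprise plan guarantee?",
        "check": refuses,
    },
]


def run_regressions(generate: Callable[[str], str], fixtures: list[dict]) -> list[str]:
    """Return the IDs of fixtures whose check fails for the current system."""
    return [f["id"] for f in fixtures if not f["check"](generate(f["input"]))]


# Stand-in for the real pipeline; swap in the prompt, model, or retrieval call under test.
def fake_system(question: str) -> str:
    return "Not enough evidence in the provided sources to answer."


failed = run_regressions(fake_system, FIXTURES)
print(failed)  # an empty list means no known failure has returned
```

Running the same list after every prompt, model, or retrieval change is what turns a one-off bug report into a regression test.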

Verification checklist

Before accepting a hallucination test suite, check that it includes:

A named source packet.
At least one no-answer case.
Expected behavior written before model output is reviewed.
Failure categories for retrieval, prompt, synthesis, policy, and review.
Severity labels for user impact.
A retest plan after prompt, retrieval, or model changes.
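
The checklist can also be run mechanically against a suite description. In the sketch below, the dictionary layout (`source_packet`, `cases`, `failure_categories`, `retest_on`) is a hypothetical format invented for this example. A script can confirm that an expected behavior is present, but not that it was written before the output was reviewed, so that item still needs process discipline.

```python
REQUIRED_FAILURE_CATEGORIES = {"retrieval", "prompt", "synthesis", "policy", "review"}
RETEST_TRIGGERS = {"prompt", "retrieval", "model"}


def checklist_gaps(suite: dict) -> list[str]:
    """Return the checklist items the suite does not satisfy."""
    gaps = []
    cases = suite.get("cases", [])
    if not suite.get("source_packet"):
        gaps.append("no named source packet")
    if not any(case.get("kind") == "no_answer" for case in cases):
        gaps.append("no no-answer case")
    if any(not case.get("expected_behavior") for case in cases):
        gaps.append("case without expected behavior")
    if any(not case.get("severity") for case in cases):
        gaps.append("case without severity label")
    if not REQUIRED_FAILURE_CATEGORIES <= set(suite.get("failure_categories", [])):
        gaps.append("failure categories incomplete")
    if not RETEST_TRIGGERS <= set(suite.get("retest_on", [])):
        gaps.append("retest plan incomplete")
    return gaps


# Hypothetical suite that is missing a no-answer case and a model retest trigger.
suite = {
    "source_packet": "refund-policy-packet",
    "cases": [
        {"kind": "answerable", "expected_behavior": "Quote the refund window.", "severity": "low"},
    ],
    "failure_categories": ["retrieval", "prompt", "synthesis", "policy", "review"],
    "retest_on": ["prompt", "retrieval"],
}
print(checklist_gaps(suite))
```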

The test suite should be small enough to run regularly. A large suite that nobody reruns is less useful than ten fixtures that protect the highest-risk workflow.

FAQ

What is hallucination testing?

Hallucination testing is the practice of asking questions where the correct behavior is a source-grounded answer, an explicit statement of uncertainty, or a refusal.

What should a hallucination test include?

A hallucination test set should include answerable cases, ambiguous cases, adversarial cases, and no-answer cases.


Related content

LLM Output Verification Guide

How to Reduce Hallucinations in LLM Apps

RAG No-Answer Testing

Prompt Testing Framework