RAG Evaluation Checklist
A practical checklist for evaluating RAG retrieval quality, source faithfulness, citations, no-answer behavior, latency, and human review effort.
Workflow
A support knowledge bot workflow with approved retrieval, citation checks, no-answer handling, escalation rules, and human review gates before rollout.
A customer support knowledge bot answers customer or internal helpdesk questions using an approved knowledge base. The value is speed and consistency, but the risk is high: a fluent answer can misstate policy, expose internal information, or promise an outcome the team cannot deliver. The workflow should therefore optimize for grounded answers, no-answer behavior, and escalation.
This workflow is suitable for product FAQs, troubleshooting steps, policy lookup, internal support enablement, and draft replies. It is not suitable for making exceptions, issuing refunds, changing account settings, or giving legal or medical advice without stronger human review and domain controls.
Inputs should include the user question, customer context if allowed, approved knowledge base, support policy, escalation rules, and response channel. The workflow should clearly separate customer-visible sources from internal-only sources.
Outputs should include a cited answer, confidence or evidence label, source list, escalation reason if needed, and any policy boundary. For internal agents, the bot can expose more diagnostic detail. For customers, it should be concise and careful.
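The output fields above can be sketched as a small record with an audience-dependent view. This is a minimal illustration, not a prescribed schema: the class name `BotResponse`, the field names, and the `render` helper are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BotResponse:
    """Hypothetical output record; field names are illustrative assumptions."""
    answer: str
    evidence_label: str                      # e.g. "supported", "partial", "none"
    source_ids: List[str] = field(default_factory=list)
    escalation_reason: Optional[str] = None
    policy_boundary: Optional[str] = None
    # Diagnostic detail intended for internal agents only.
    retrieval_scores: List[float] = field(default_factory=list)


def render(response: BotResponse, audience: str) -> dict:
    """Build the view for an audience: "customer" or "internal"."""
    view = {
        "answer": response.answer,
        "evidence_label": response.evidence_label,
        "sources": list(response.source_ids),
    }
    if response.escalation_reason:
        view["escalation_reason"] = response.escalation_reason
    if response.policy_boundary:
        view["policy_boundary"] = response.policy_boundary
    if audience == "internal":
        # Only internal agents see the extra diagnostics.
        view["retrieval_scores"] = list(response.retrieval_scores)
    return view
```

Keeping one record and rendering two views means the customer-facing reply cannot accidentally gain diagnostic fields that were only meant for agents.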
A practical stack includes knowledge base retrieval, document metadata, LLM answer generation, citation checking, a no-answer classifier, ticketing integration, and a human review queue. If the bot can create or update tickets, those actions should be classified under the agent permission design model.
The retrieval layer should be evaluated with the RAG evaluation checklist. The answer layer should be tested for faithfulness, citation support, refusal behavior, and policy tone.
First, define the approved source set. Do not let the bot answer support questions from arbitrary web search or outdated drafts. Each source should have an owner, last-reviewed date, audience label, and product version when relevant.
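One way to enforce the approved source set is to filter on the metadata before anything reaches retrieval. The sketch below assumes a hypothetical `Source` record and a 180-day review window; both are illustrative choices, not values taken from any policy.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Source:
    """Hypothetical metadata record for one knowledge base document."""
    source_id: str
    owner: str
    last_reviewed: date
    audience: str                        # "customer" or "internal"
    product_version: Optional[str] = None


def approved_sources(sources: List[Source], channel: str, today: date,
                     max_age_days: int = 180) -> List[Source]:
    """Keep sources that have an owner, a recent review, and a permitted audience.

    max_age_days=180 is an assumed threshold; set it from your own review policy.
    """
    cutoff = today - timedelta(days=max_age_days)
    # Customer channels may only see customer-visible sources.
    allowed = {"customer"} if channel == "customer" else {"customer", "internal"}
    return [
        s for s in sources
        if s.owner and s.last_reviewed >= cutoff and s.audience in allowed
    ]
```

Passing `today` explicitly keeps the filter deterministic, which makes it easier to test stale-document cases.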
Second, design retrieval. The bot should retrieve enough context to answer, but not so much that irrelevant snippets confuse the response. Store retrieved source IDs and snippets in the trace.
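A minimal sketch of that trade-off: drop low-scoring candidates, cap the context size, and write what was kept to the trace. The names `RetrievedSnippet`, `select_context`, and `trace_record`, and the defaults `top_k=4` and `min_score=0.5`, are assumptions for illustration; real thresholds should come from evaluation.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class RetrievedSnippet:
    source_id: str
    snippet: str
    score: float          # retriever relevance score, higher is better


def select_context(candidates: List[RetrievedSnippet], top_k: int = 4,
                   min_score: float = 0.5) -> List[RetrievedSnippet]:
    """Keep at most top_k snippets whose score clears min_score."""
    kept = [c for c in candidates if c.score >= min_score]
    kept.sort(key=lambda c: c.score, reverse=True)
    return kept[:top_k]


def trace_record(question: str, selected: List[RetrievedSnippet]) -> str:
    """Serialize the question and the retained snippets as one JSON trace line."""
    return json.dumps({
        "question": question,
        "retrieved": [asdict(s) for s in selected],
    })
```

Storing the snippet text alongside the source ID lets a reviewer see exactly what the model was given, even if the document is edited later.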
Third, generate the answer with constraints. The answer should use only approved sources, cite or reference the source when helpful, and avoid making promises not present in policy. If no approved source supports an answer, the bot should ask a clarifying question or escalate.
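The answer, clarify, or escalate choice can be made explicit before any text is generated. The sketch below is one possible ordering of the checks, assuming an upstream step has already judged which snippets support the question and whether the question is ambiguous; the `Action` enum and `decide_action` function are hypothetical names.

```python
from enum import Enum
from typing import List


class Action(Enum):
    ANSWER = "answer"
    CLARIFY = "clarify"
    ESCALATE = "escalate"


def decide_action(supporting_snippets: List[str],
                  question_is_ambiguous: bool) -> Action:
    """Choose the next step from evidence and ambiguity.

    Assumed ordering: an ambiguous question gets a clarifying question first;
    a clear question with no supporting source is escalated; otherwise answer.
    """
    if question_is_ambiguous:
        return Action.CLARIFY
    if not supporting_snippets:
        return Action.ESCALATE
    return Action.ANSWER
```

Making the decision a separate function gives evaluation fixtures something concrete to assert against, independent of the generated wording.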
Fourth, route sensitive cases. Billing, refunds, security, privacy, account access, legal requests, and angry customers should have explicit escalation rules. The human-in-the-loop AI workflows guide helps define what the support owner needs to review.
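Explicit escalation rules can start as a simple lookup that runs before generation. The keyword lists below are illustrative assumptions and a crude stand-in for a real classifier; they do not attempt to detect tone, so angry-customer routing would need a separate signal.

```python
from typing import Dict, List, Tuple

# Hypothetical keyword rules per sensitive topic; tune these to your own tickets.
SENSITIVE_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "billing": ("invoice", "charge", "billing"),
    "refunds": ("refund", "chargeback"),
    "security": ("breach", "vulnerability", "hacked"),
    "privacy": ("gdpr", "delete my data", "personal data"),
    "account_access": ("password", "locked out", "two-factor"),
    "legal": ("lawyer", "subpoena", "lawsuit"),
}


def escalation_topics(question: str) -> List[str]:
    """Return the sensitive topics a question touches, sorted by name."""
    text = question.lower()
    return sorted(
        topic
        for topic, keywords in SENSITIVE_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    )
```

A non-empty result would send the conversation to the human review queue with the matched topics attached as the escalation reason.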
The answer must be faithful to retrieved sources and refuse unsupported claims. A reviewer should be able to compare the final response with the source snippets and see why the bot answered, refused, or escalated. Use the RAG no-answer testing method to create cases where the correct behavior is not to answer.
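No-answer cases can be kept as fixtures that record the expected behavior rather than an expected text. The sketch below assumes the workflow can be called as a function that returns one of "answer", "refuse", or "escalate"; `stub_bot` is a placeholder for that call, and the example questions are invented.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class NoAnswerCase:
    question: str
    expected: str          # "answer", "refuse", or "escalate"


def failing_cases(bot: Callable[[str], str],
                  cases: List[NoAnswerCase]) -> List[NoAnswerCase]:
    """Return the cases where the bot's behavior differs from the expected one."""
    return [case for case in cases if bot(case.question) != case.expected]


def stub_bot(question: str) -> str:
    """Placeholder for the real workflow; replace with an actual call."""
    return "answer" if "password" in question.lower() else "refuse"


CASES = [
    NoAnswerCase("How do I reset my password?", "answer"),
    NoAnswerCase("What is the warranty on a product you do not sell?", "refuse"),
    NoAnswerCase("Can you waive my cancellation fee?", "escalate"),
]

if __name__ == "__main__":
    for case in failing_cases(stub_bot, CASES):
        print(f"FAIL: {case.question!r} expected {case.expected}")
```

With the stub, the fee-waiver case fails because the stub refuses instead of escalating, which is the kind of gap these fixtures are meant to surface.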
The gate should also test stale docs, contradictory docs, ambiguous customer questions, missing product version, and internal-only policy notes. The bot should not expose internal notes in a customer-facing channel.
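The internal-notes rule is one of the easier gate checks to automate. A minimal sketch, assuming each cited source ID can be looked up in an audience map; the function name and the fail-closed handling of unknown IDs are assumptions.

```python
from typing import Dict, List


def leaked_internal_sources(cited_ids: List[str],
                            audience_by_id: Dict[str, str],
                            channel: str) -> List[str]:
    """Return cited source IDs that must not appear in a customer-facing reply.

    Unknown IDs are treated as internal (fail closed), an assumed policy.
    """
    if channel != "customer":
        return []
    return [
        source_id for source_id in cited_ids
        if audience_by_id.get(source_id, "internal") != "customer"
    ]
```

A gate would block or escalate any customer-channel draft for which this list is non-empty. Note that it only checks citations; paraphrased internal content still needs reviewer or model-based checks.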
Add a sampling rule after launch. Even if the bot begins in draft mode, support owners should review a mix of accepted, edited, rejected, and escalated answers. That sample shows whether the workflow is improving resolution quality or merely shifting review effort to places where it is harder to see.
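A sampling rule of this kind can be sketched as a stratified draw over the four outcomes, so that rare outcomes are not crowded out by accepted answers. The record shape (a dict with an "outcome" key) and the default of five per outcome are assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List

OUTCOMES = ("accepted", "edited", "rejected", "escalated")


def sample_for_review(records: List[Dict], per_outcome: int = 5,
                      seed: int = 0) -> List[Dict]:
    """Draw up to per_outcome records from each outcome group.

    A fixed seed makes the draw reproducible for audits.
    """
    rng = random.Random(seed)
    by_outcome = defaultdict(list)
    for record in records:
        by_outcome[record["outcome"]].append(record)

    sample: List[Dict] = []
    for outcome in OUTCOMES:
        group = by_outcome.get(outcome, [])
        # Take the whole group when it is smaller than the quota.
        sample.extend(rng.sample(group, min(per_outcome, len(group))))
    return sample
```

Sampling accepted answers matters as much as sampling rejected ones, since an agent approving a draft without reading it looks identical to a correct draft in the raw counts.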
Human support owners should review low-confidence, policy-sensitive, or customer-visible cases before full automation. Early rollout can use draft mode: the bot prepares a reply and the support agent approves or edits it. Rejection reasons should feed evaluation fixtures.
Support bots fail by inventing policy, citing sources that do not support the answer, ignoring newer docs, exposing internal context, answering ambiguous questions too confidently, or failing to escalate. They also fail when success is measured only by deflection instead of resolution quality and customer risk.
The agent observability guide is useful once the workflow includes ticketing actions, because every retrieval, answer, escalation, and human edit should be visible in the trace.
A support knowledge bot should refuse, ask a clarifying question, or escalate when approved sources do not support a safe answer.
A support bot should cite or reference approved support sources when useful, and it should keep internal-only policy notes out of customer-facing replies.
Use the RAG Evaluation Template to test retrieval relevance, faithfulness, citation quality, and no-answer behavior before expanding support coverage.