- Retrieval Quality – Evaluate whether the right chunks are retrieved and how well they are ranked (chunk relevance, context relevance, context precision, Precision @ K).
- Generation Quality – Evaluate whether the model uses the context effectively and grounds responses in retrieved context (chunk attribution utilization, context adherence, completeness). For ground truth adherence, correctness, and instruction adherence — which apply to both RAG and non-RAG use cases — see Response Quality metrics.
## Problems RAG metrics help you solve
- Answers hallucinate or contradict your source documents. Use Context Adherence to see whether the model’s claims stay grounded in the retrieved context.
- Retrieved documents look unrelated to the question. Use Chunk Relevance and Context Relevance to understand whether individual chunks — and the overall context — actually help answer the query.
- Retrieval returns lots of noise in the top K results. Use Context Precision and Precision @ K to quantify how many of the highest-ranked chunks are truly relevant and how that changes as you adjust K.
- The model ignores some of the retrieved information. Use Chunk Attribution Utilization to see which chunks influenced the answer and where useful context was left on the table.
- Answers are grounded but still feel incomplete. Use Completeness to detect when important details from the retrieved context never make it into the final response.
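To make the first two symptoms above concrete, here is a minimal sketch of how crude proxies for retrieval noise and chunk utilization could be computed from binary relevance labels and the final answer text. The function names and string heuristics are illustrative only; the actual metrics rely on model-based judges, not word overlap.

```python
# Toy proxies for two symptoms above, assuming we already have binary
# relevance labels per chunk plus the final answer text. Real metrics
# use LLM judges; these string heuristics only illustrate the idea.

def retrieval_noise(relevance_labels):
    """Fraction of retrieved chunks judged irrelevant (the 'noise' symptom)."""
    return 1 - sum(relevance_labels) / len(relevance_labels)

def crude_utilization(chunks, answer):
    """Fraction of chunks sharing at least one word with the answer."""
    answer_words = set(answer.lower().split())
    used = [c for c in chunks if set(c.lower().split()) & answer_words]
    return len(used) / len(chunks)

chunks = ["Refunds are issued within 14 days.", "Our office is in Berlin."]
answer = "Refunds are issued within 14 days of purchase."
print(retrieval_noise([1, 0]))            # 0.5: half the chunks are noise
print(crude_utilization(chunks, answer))  # 0.5: only one chunk surfaced
```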
## Diagnose your RAG problem

Not sure which metric to start with? Walk through these symptoms to find the right one.

### My answers contain made-up information
- Diagnosis: The model is hallucinating — generating claims not supported by your documents.
- Start with: Context Adherence to measure how well responses stay grounded in retrieved context.
- If Context Adherence is high but answers are still wrong: The problem may be in retrieval. Check Context Relevance to see if you’re retrieving the right documents in the first place.
### My retrieval returns too much noise
- Diagnosis: Your retriever is pulling in chunks that don’t help answer the query.
- Start with: Chunk Relevance to see which individual chunks are useful vs. noise.
- Then check: Context Precision to get an aggregate view of how much of your retrieved context is actually relevant.
- To tune your K: Use Precision @ K to find the sweet spot where adding more chunks stops helping.
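Tuning K with Precision @ K can be sketched as a simple sweep over ranked relevance labels. The labels below are hypothetical judge outputs (1 = relevant, 0 = noise) in retrieval order:

```python
# Sweep K to see where adding more chunks stops helping.
# Labels are in retrieval (ranked) order: 1 = relevant, 0 = noise.

def precision_at_k(labels, k):
    """Fraction of the top K retrieved chunks that are relevant."""
    top = labels[:k]
    return sum(top) / len(top)

labels = [1, 1, 0, 1, 0, 0]  # hypothetical judge scores, ranked
for k in range(1, len(labels) + 1):
    print(f"Precision @ {k}: {precision_at_k(labels, k):.2f}")
# Here every chunk past position 4 is noise, so precision only
# falls as K grows beyond 4 — a reasonable cutoff for this query.
```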
### The answer is correct but feels thin
- Diagnosis: The model is being too conservative — it’s grounded but not using all available information.
- Start with: Completeness to detect when relevant details from the context are left out of the response.
- Also check: Chunk Attribution Utilization to see which chunks actually influenced the answer and which were ignored.
### Some retrieved chunks never get used
- Diagnosis: You’re retrieving context that the model ignores — either the chunks are marginally relevant or your prompt isn’t encouraging full utilization.
- Start with: Chunk Attribution Utilization to see exactly which chunks contributed to the response.
- If attribution is low but chunks are relevant: Consider adjusting your prompt to encourage the model to incorporate more context.
## How RAG metrics connect
RAG evaluation flows from retrieval to generation. Here’s how the metrics relate:

- Chunk Relevance is the foundation — it determines whether individual chunks help answer the query and feeds into precision metrics.
- Context Relevance asks whether the overall context is sufficient, bridging retrieval and generation.
- Context Adherence checks if the model stays grounded, while Completeness checks if it uses everything relevant.
- Chunk Attribution reveals which specific chunks actually influenced the response.
## Retrieval Quality
| Name | Description | Supported Nodes | When to Use | Example Use Case |
|---|---|---|---|---|
| Chunk Relevance | Measures whether each retrieved chunk contains information that could help answer the user’s query. | Retriever span | When evaluating the relevance of individual retrieved chunks to the query. | A RAG system that needs to ensure each retrieved document chunk contributes useful information toward answering user questions. |
| Context Relevance (Query Adherence) | Evaluates whether the retrieved context is relevant to the user’s query. | Retriever span | When assessing the quality of your retrieval system’s results. | An internal knowledge base search that retrieves company policies relevant to specific employee questions. |
| Context Precision | Measures the percentage of relevant chunks in the retrieved context, weighted by their position in the retrieval order. | Retriever span | When evaluating the overall quality of your retrieval system’s results and ranking effectiveness. | A document search system that needs to ensure retrieved chunks are relevant and properly ranked by importance. |
| Precision @ K | Measures the percentage of relevant chunks among the top K retrieved chunks at a specific rank position. | Retriever span | When determining the optimal number of chunks to retrieve (Top K) and evaluating ranking quality at specific positions. | A RAG system that needs to optimize retrieval parameters to balance between capturing all relevant chunks and avoiding irrelevant ones. |
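One common way to compute a position-weighted Context Precision (as described in the table above) is to average Precision @ k over the ranks where a relevant chunk appears, so that relevant chunks ranked early count for more. This is a sketch of that formulation, not necessarily the exact weighting used by any specific platform:

```python
# Position-weighted context precision: average of Precision@k taken
# at each rank k where the chunk is relevant. Relevant chunks ranked
# early therefore score higher than the same chunks ranked late.

def context_precision(labels):
    """labels: relevance (1/0) of chunks in retrieval order."""
    score, hits = 0.0, 0
    for k, rel in enumerate(labels, start=1):
        if rel:
            hits += 1
            score += hits / k  # Precision@k at this relevant position
    return score / hits if hits else 0.0

print(context_precision([1, 1, 0, 1]))  # ~0.92: relevant chunks ranked early
print(context_precision([0, 1, 1, 1]))  # ~0.64: same chunks, worse ranking
```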
## Generation Quality
| Name | Description | Supported Nodes | When to Use | Example Use Case |
|---|---|---|---|---|
| Chunk Attribution Utilization | Assesses whether the response uses the retrieved chunks and properly attributes information to source documents. | Retriever span | When implementing RAG systems and want to ensure proper attribution and that retrieved information is used efficiently. | A legal research assistant that must cite specific cases and statutes when providing legal information. |
| Context Adherence | Measures how well the response aligns with the provided context. | LLM span | When you want to ensure the model is grounding its responses in the provided context. | A financial advisor bot that must base investment recommendations on the client’s specific financial situation and goals. |
| Completeness | Measures how thoroughly the response covers the relevant information available in the provided context. | LLM span | When evaluating if responses fully address the user’s intent. | A healthcare chatbot, when provided with a patient’s medical record as context, must include all relevant critical information from that record in its response. |
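Context Adherence in the table above is judged by a model, but the underlying idea can be illustrated with a naive lexical stand-in: flag response sentences whose words barely overlap the retrieved context. The function name, example strings, and threshold are all assumptions for illustration, not a real adherence implementation:

```python
# A crude lexical proxy for Context Adherence: flag response sentences
# whose words rarely appear in the retrieved context. Production metrics
# use LLM judges; this heuristic only sketches the grounding idea.
import re

def ungrounded_sentences(response, context, threshold=0.5):
    ctx_words = set(re.findall(r"\w+", context.lower()))
    flagged = []
    for sent in re.split(r"(?<=[.!?])\s+", response.strip()):
        words = set(re.findall(r"\w+", sent.lower()))
        if words and len(words & ctx_words) / len(words) < threshold:
            flagged.append(sent)  # likely unsupported by the context
    return flagged

context = "Premium plans include 24/7 support and a 99.9% uptime guarantee."
response = "Premium plans include 24/7 support. They also come with free hardware."
print(ungrounded_sentences(response, context))
# ['They also come with free hardware.']
```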