- Retrieval Quality – Evaluate whether the right chunks are retrieved and how well they are ranked (chunk relevance, context relevance, context precision, Precision @ K).
- Generation Quality – Evaluate whether the model uses the context effectively and grounds responses in retrieved context (chunk attribution utilization, context adherence, completeness). For ground truth adherence, correctness, and instruction adherence — which apply to both RAG and non-RAG use cases — see Response Quality metrics.
## Problems RAG metrics help you solve
- Answers hallucinate or contradict your source documents. Use Context Adherence to see whether the model’s claims stay grounded in the retrieved context.
- Retrieved documents look unrelated to the question. Use Chunk Relevance and Context Relevance to understand whether individual chunks — and the overall context — actually help answer the query.
- Retrieval returns lots of noise in the top K results. Use Context Precision and Precision @ K to quantify how many of the highest-ranked chunks are truly relevant and how that changes as you adjust K.
- The model ignores some of the retrieved information. Use Chunk Attribution Utilization to see which chunks influenced the answer and where useful context was left on the table.
- Answers are grounded but still feel incomplete. Use Completeness to detect when important details from the retrieved context never make it into the final response.
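To make the first two symptoms above concrete, here is a minimal sketch of how crude proxies for retrieval noise and chunk utilization could be computed from binary relevance labels and the final answer text. The function names and string heuristics are illustrative only; the actual metrics rely on model-based judges, not word overlap.

```python
# Toy proxies for two symptoms above, assuming we already have binary
# relevance labels per chunk plus the final answer text. Real metrics
# use LLM judges; these string heuristics only illustrate the idea.

def retrieval_noise(relevance_labels):
    """Fraction of retrieved chunks judged irrelevant (the 'noise' symptom)."""
    return 1 - sum(relevance_labels) / len(relevance_labels)

def crude_utilization(chunks, answer):
    """Fraction of chunks sharing at least one word with the answer."""
    answer_words = set(answer.lower().split())
    used = [c for c in chunks if set(c.lower().split()) & answer_words]
    return len(used) / len(chunks)

chunks = ["Refunds are issued within 14 days.", "Our office is in Berlin."]
answer = "Refunds are issued within 14 days of purchase."
print(retrieval_noise([1, 0]))            # 0.5: half the chunks are noise
print(crude_utilization(chunks, answer))  # 0.5: only one chunk surfaced
```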
## Diagnose your RAG problem

Not sure which metric to start with? Walk through these symptoms to find the right one.

### My answers contain made-up information
- Diagnosis: The model is hallucinating — generating claims not supported by your documents.
- Start with: Context Adherence to measure how well responses stay grounded in retrieved context.
- If Context Adherence is high but answers are still wrong: The problem may be in retrieval. Check Context Relevance to see if you’re retrieving the right documents in the first place.
### My retrieval returns too much noise
- Diagnosis: Your retriever is pulling in chunks that don’t help answer the query.
- Start with: Chunk Relevance to see which individual chunks are useful vs. noise.
- Then check: Context Precision to get an aggregate view of how much of your retrieved context is actually relevant.
- To tune your K: Use Precision @ K to find the sweet spot where adding more chunks stops helping.
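Tuning K with Precision @ K can be sketched as a simple sweep over ranked relevance labels. The labels below are hypothetical judge outputs (1 = relevant, 0 = noise) in retrieval order:

```python
# Sweep K to see where adding more chunks stops helping.
# Labels are in retrieval (ranked) order: 1 = relevant, 0 = noise.

def precision_at_k(labels, k):
    """Fraction of the top K retrieved chunks that are relevant."""
    top = labels[:k]
    return sum(top) / len(top)

labels = [1, 1, 0, 1, 0, 0]  # hypothetical judge scores, ranked
for k in range(1, len(labels) + 1):
    print(f"Precision @ {k}: {precision_at_k(labels, k):.2f}")
# Here every chunk past position 4 is noise, so precision only
# falls as K grows beyond 4 — a reasonable cutoff for this query.
```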
### The answer is correct but feels thin
- Diagnosis: The model is being too conservative — it’s grounded but not using all available information.
- Start with: Completeness to detect when relevant details from the context are left out of the response.
- Also check: Chunk Attribution Utilization to see which chunks actually influenced the answer and which were ignored.
### Some retrieved chunks never get used
- Diagnosis: You’re retrieving context that the model ignores — either the chunks are marginally relevant or your prompt isn’t encouraging full utilization.
- Start with: Chunk Attribution Utilization to see exactly which chunks contributed to the response.
- If attribution is low but chunks are relevant: Consider adjusting your prompt to encourage the model to incorporate more context.
## How RAG metrics connect
RAG evaluation flows from retrieval to generation. Here’s how the metrics relate:

- Chunk Relevance is the foundation — it determines whether individual chunks help answer the query and feeds into precision metrics.
- Context Relevance asks whether the overall context is sufficient, bridging retrieval and generation.
- Context Adherence checks if the model stays grounded, while Completeness checks if it uses everything relevant.
- Chunk Attribution reveals which specific chunks actually influenced the response.
## Retrieval Quality
| Name | Description | Supported Nodes | When to Use | Example Use Case |
|---|---|---|---|---|
| Chunk Relevance | Measures whether each retrieved chunk contains information that could help answer the user’s query. | Retriever span | When evaluating the relevance of individual retrieved chunks to the query. | A RAG system that needs to ensure each retrieved document chunk contributes useful information toward answering user questions. |
| Context Relevance (Query Adherence) | Evaluates whether the retrieved context is relevant to the user’s query. | Retriever span | When assessing the quality of your retrieval system’s results. | An internal knowledge base search that retrieves company policies relevant to specific employee questions. |
| Context Precision | Measures the percentage of relevant chunks in the retrieved context, weighted by their position in the retrieval order. | Retriever span | When evaluating the overall quality of your retrieval system’s results and ranking effectiveness. | A document search system that needs to ensure retrieved chunks are relevant and properly ranked by importance. |
| Precision @ K | Measures the percentage of relevant chunks among the top K retrieved chunks at a specific rank position. | Retriever span | When determining the optimal number of chunks to retrieve (Top K) and evaluating ranking quality at specific positions. | A RAG system that needs to optimize retrieval parameters to balance between capturing all relevant chunks and avoiding irrelevant ones. |
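One common way to compute a position-weighted Context Precision (as described in the table above) is to average Precision @ k over the ranks where a relevant chunk appears, so that relevant chunks ranked early count for more. This is a sketch of that formulation, not necessarily the exact weighting used by any specific platform:

```python
# Position-weighted context precision: average of Precision@k taken
# at each rank k where the chunk is relevant. Relevant chunks ranked
# early therefore score higher than the same chunks ranked late.

def context_precision(labels):
    """labels: relevance (1/0) of chunks in retrieval order."""
    score, hits = 0.0, 0
    for k, rel in enumerate(labels, start=1):
        if rel:
            hits += 1
            score += hits / k  # Precision@k at this relevant position
    return score / hits if hits else 0.0

print(context_precision([1, 1, 0, 1]))  # ~0.92: relevant chunks ranked early
print(context_precision([0, 1, 1, 1]))  # ~0.64: same chunks, worse ranking
```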
## Generation Quality
| Name | Description | Supported Nodes | When to Use | Example Use Case |
|---|---|---|---|---|
| Chunk Attribution Utilization | Assesses whether the response uses the retrieved chunks and properly attributes information to source documents. | Retriever span | When implementing RAG systems and want to ensure proper attribution and that retrieved information is used efficiently. | A legal research assistant that must cite specific cases and statutes when providing legal information. |
| Context Adherence | Measures how well the response aligns with the provided context. | LLM span | When you want to ensure the model is grounding its responses in the provided context. | A financial advisor bot that must base investment recommendations on the client’s specific financial situation and goals. |
| Completeness | Measures how thoroughly the response covers the relevant information available in the provided context. | LLM span | When evaluating if responses fully address the user’s intent. | A healthcare chatbot, when provided with a patient’s medical record as context, must include all relevant critical information from that record in its response. |
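Context Adherence in the table above is judged by a model, but the underlying idea can be illustrated with a naive lexical stand-in: flag response sentences whose words barely overlap the retrieved context. The function name, example strings, and threshold are all assumptions for illustration, not a real adherence implementation:

```python
# A crude lexical proxy for Context Adherence: flag response sentences
# whose words rarely appear in the retrieved context. Production metrics
# use LLM judges; this heuristic only sketches the grounding idea.
import re

def ungrounded_sentences(response, context, threshold=0.5):
    ctx_words = set(re.findall(r"\w+", context.lower()))
    flagged = []
    for sent in re.split(r"(?<=[.!?])\s+", response.strip()):
        words = set(re.findall(r"\w+", sent.lower()))
        if words and len(words & ctx_words) / len(words) < threshold:
            flagged.append(sent)  # likely unsupported by the context
    return flagged

context = "Premium plans include 24/7 support and a 99.9% uptime guarantee."
response = "Premium plans include 24/7 support. They also come with free hardware."
print(ungrounded_sentences(response, context))
# ['They also come with free hardware.']
```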