Context Relevance
Context Relevance asks whether your retrieved context, as a whole, contains enough information to fully answer the user query. High Context Relevance values indicate strong confidence that the retrieved context is sufficient to answer the question. Low Context Relevance values are a sign that you need to increase your Top K, modify your retrieval strategy, or use better embeddings.

Context Relevance is distinct from Context Adherence: Context Relevance evaluates whether the retrieved context is relevant to a user's query, whereas Context Adherence measures how well the response aligns with the provided context.
Chunk Relevance vs. Context Relevance
- Chunk Relevance evaluates each chunk individually: does this chunk contain anything useful for answering the query?
- Context Relevance evaluates the retrieved context as a whole: do all of these chunks, taken together, cover everything needed to answer the query end-to-end?
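The difference in scope can be sketched in code. This is an illustrative toy only: the production metric is computed by an LLM judge, not by keyword matching, and the function names here are hypothetical.

```python
# Toy sketch: chunk-level vs. context-level relevance.
# A chunk is "relevant" if it contains ANY query term; the whole context is
# "relevant" only if the chunks TOGETHER cover ALL query terms.

def chunk_relevance(chunk: str, query_terms: set[str]) -> bool:
    # Per-chunk question: does this chunk contain anything useful?
    return any(term in chunk.lower() for term in query_terms)

def context_relevance(chunks: list[str], query_terms: set[str]) -> bool:
    # Whole-context question: do the chunks, taken together, cover the query?
    combined = " ".join(chunks).lower()
    return all(term in combined for term in query_terms)

query_terms = {"capital", "population", "france"}
chunks = [
    "Paris is the capital of France.",           # useful on its own
    "France has a population of about 68M.",     # useful on its own
]
print([chunk_relevance(c, query_terms) for c in chunks])  # [True, True]
print(context_relevance(chunks, query_terms))             # True
```

Note that every chunk can be individually relevant while Context Relevance is still low: two chunks about a country's capital cover nothing about its population.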
Reading Context Relevance with Context Precision
- High Context Relevance & High Context Precision: Retrieved context is both sufficient and mostly noise-free — focus next on generation quality and grounding.
- High Context Relevance & Low Context Precision: The right information is present but mixed with a lot of irrelevant chunks — keep your recall but prune noise (better filters, reranking, or a lower Top K).
- Low Context Relevance & High Context Precision: Most chunks are on-topic, but together they still miss pieces needed for a full answer — broaden retrieval (higher Top K, alternate retriever, or additional data sources).
- Low Context Relevance & Low Context Precision: Retrieval is both incomplete and noisy — revisit embeddings, indexing, and query formulation end-to-end.
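The four quadrants above amount to a small decision table. A hedged sketch, assuming scores in [0, 1]; the helper name and the 0.5 cutoff are arbitrary illustrations, not recommended thresholds.

```python
# Hypothetical triage helper mapping (Context Relevance, Context Precision)
# to the next step suggested by each quadrant.

def retrieval_triage(context_relevance: float, context_precision: float,
                     threshold: float = 0.5) -> str:
    high_rel = context_relevance >= threshold
    high_prec = context_precision >= threshold
    if high_rel and high_prec:
        return "focus on generation quality and grounding"
    if high_rel and not high_prec:
        return "keep recall, prune noise (filters, reranking, lower Top K)"
    if not high_rel and high_prec:
        return "broaden retrieval (higher Top K, more sources)"
    return "revisit embeddings, indexing, and query formulation"

# Right information retrieved, but buried in noise:
print(retrieval_triage(0.9, 0.2))
```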
Best practices
Use for Results Assessment
Leverage Context Relevance when assessing the quality of your retrieval system's results and determining how well the retrieved context addresses user queries.
Combine with Other Metrics
Use Context Relevance alongside Context Adherence, Correctness, and Completeness for a comprehensive view of response quality.
Performance Benchmarks
We evaluated Context Relevance against human expert labels on an internal dataset of RAG samples using top frontier models.

| Model | F1 (True) |
|---|---|
| GPT-4.1 | 0.82 |
| GPT-4.1-mini (judges=3) | 0.85 |
| Claude Sonnet 4.5 | 0.81 |
| Gemini 3 Flash | 0.81 |
GPT-4.1 Classification Report
| | Precision | Recall | F1-Score |
|---|---|---|---|
| False | 0.82 | 0.99 | 0.89 |
| True | 0.97 | 0.71 | 0.82 |
Confusion Matrix (Normalized)
| Actual \ Predicted | True | False |
|---|---|---|
| True | 0.708 | 0.292 |
| False | 0.014 | 0.986 |
Benchmarks based on internal evaluation dataset. Performance may vary by use case.
Related Resources
If you would like to dive deeper or start implementing Context Relevance, check out the following resources:

Examples
- Context Relevance Examples - Log in and explore the “Context Relevance” Log Stream in the “Preset Metric Examples” Project to see this metric in action.