Precision @ K measures the percentage of relevant chunks among the top K retrieved chunks, where K is a specific rank position.
Low Precision
Few or no chunks in the top K are relevant to the query
High Precision
Most or all chunks in the top K are relevant to the query
Calculation method
Precision @ K is computed based on Chunk Relevance scores for the top K chunks:
Chunk Relevance Calculation
First, Chunk Relevance is computed for each retrieved chunk, producing a binary classification (Relevant or Not Relevant) for each chunk.
Rank Ordering
The retrieved chunks are ordered by their rank position (as logged in the retriever span), with position 1 being the highest-ranked chunk.
Top K Selection
The top K chunks are selected based on their rank order, where K is the specified rank position (e.g., K=3 means the top 3 chunks).
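Putting these three steps together, the computation can be sketched in a few lines of Python. This is an illustrative sketch only, not Galileo's implementation: the function name, the binary labels, and the handling of retrievals shorter than K are all assumptions.

```python
def precision_at_k(relevance_labels: list[bool], k: int) -> float:
    """Fraction of relevant chunks among the top K retrieved chunks.

    relevance_labels must be ordered by rank position, with index 0 holding
    the highest-ranked chunk (rank 1), as logged in the retriever span.
    """
    if k <= 0:
        raise ValueError("K must be a positive rank position")
    top_k = relevance_labels[:k]  # Top K selection by rank order
    # If fewer than K chunks were retrieved, this divides by the number actually
    # retrieved; dividing by K instead is an equally reasonable convention.
    return sum(top_k) / len(top_k) if top_k else 0.0


# Ranks 1-5: relevant, not relevant, relevant, relevant, not relevant
labels = [True, False, True, True, False]
print(precision_at_k(labels, k=3))  # 2 of the top 3 chunks are relevant -> ~0.67
```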
Understanding Precision @ K
How Precision @ K Helps Optimize Top K
Example scenario: When retrieving 10 chunks (Top K = 10), Context Precision is 40%, meaning only 4 out of 10 chunks are relevant. This suggests reducing Top K.
Using Precision @ K: Evaluating Precision @ 4 shows it is only 40%, meaning that for 60% of examples the useful chunks sit in ranks 5-10. Precision @ 7, however, is 90%, indicating that for 90% of examples the most relevant chunks fall within the top 7.
Optimization decision: Reducing Top K from 10 to 7 captures the relevant chunks for most queries while reducing unnecessary retrieval and processing.
Precision @ K differs from Context Precision: Precision @ K evaluates precision at a specific rank K, which helps assess ranking quality, while Context Precision considers all retrieved chunks and measures overall noise in retrieval.
Choosing K
Guidance for Selecting K
Start with small K values: Begin with small values such as K = 1, 3, or 5 to understand how well the very top-ranked chunks support high-quality responses.
Analyze Precision @ multiple K values: Evaluate Precision @ K across a range of K values (for example, 1, 3, 5, 10) to see where the metric plateaus. Points where precision stops improving significantly often indicate a good upper bound for K (see the sketch after this list).
Balance recall and efficiency: Larger K values may improve recall by including more relevant chunks, but at the cost of more noise, higher latency, and higher token usage. The chosen K should balance these tradeoffs for the specific application.
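The sketch below illustrates this kind of multi-K sweep, assuming per-query relevance labels have already been collected from evaluation runs. The example data and helper function are hypothetical, not part of Galileo's API.

```python
from statistics import mean

def precision_at_k(labels, k):
    top_k = labels[:k]
    return sum(top_k) / len(top_k) if top_k else 0.0

# Hypothetical rank-ordered relevance labels for three queries, 10 chunks each.
logged_queries = [
    [True, True, False, True, False, False, True, False, False, False],
    [True, False, True, True, False, True, False, False, False, False],
    [False, True, True, False, True, False, False, True, False, False],
]

# Sweep K and look for where average precision stops changing meaningfully;
# that plateau is often a reasonable upper bound for Top K.
for k in (1, 3, 5, 10):
    avg = mean(precision_at_k(labels, k) for labels in logged_queries)
    print(f"Precision @ {k}: {avg:.2f}")
```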
Optimizing your RAG pipeline
Addressing Low Precision @ K Scores
Optimize Top K value: Evaluate Precision @ K metrics at different K values to find the optimal number of chunks to retrieve. Reduce K if precision remains high at lower values.
Improve ranking quality: If Precision @ K is low but higher K values show better precision, focus on improving ranking/reranking to move relevant chunks earlier.
Enhance retrieval quality: Refine embedding models, similarity search algorithms, or retrieval parameters to better match queries with relevant content.
Implement reranking: Use a reranking model to improve the order of retrieved chunks, ensuring the most relevant ones appear in the top K positions.
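For the reranking suggestion above, one common approach is a cross-encoder reranker. The sketch below assumes the sentence-transformers library is installed; the model name, query, and chunks are illustrative, and your pipeline's retriever and reranker may differ.

```python
from sentence_transformers import CrossEncoder

# Example cross-encoder checkpoint; swap in whichever reranker you use.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_k: int) -> list[str]:
    # Score every (query, chunk) pair and keep the top_k highest-scoring chunks,
    # so the most relevant content lands in the top K positions.
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

retrieved = [
    "Chunk about pricing tiers",
    "Unrelated changelog entry",
    "Chunk about billing cycles",
]
print(rerank("How is billing calculated?", retrieved, top_k=2))
```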
Comparing Precision @ K and Context Precision
Understanding Metric Combinations
High Precision @ K, High Context Precision: The retrieval system is performing well overall. The top K positions contain mostly relevant chunks (good ranking), and the overall retrieved set has minimal noise. This indicates both effective ranking and high-quality retrieval.
High Precision @ K, Low Context Precision: The top K positions contain mostly relevant chunks (good ranking), but the overall retrieved set has significant noise. This indicates that while the ranking algorithm prioritizes relevant content effectively, the retrieval system is bringing back too many irrelevant chunks beyond the top K. Consider reducing Top K or improving retrieval quality.
Low Precision @ K, High Context Precision: While the overall retrieval contains mostly relevant chunks (low noise), the ranking is poor. Relevant chunks are distributed throughout the retrieved set rather than concentrated in the top K positions. This suggests the retrieval system finds relevant content but needs better ranking or reranking.
Low Precision @ K, Low Context Precision: Both metrics indicate problems. The retrieval system has high noise (many irrelevant chunks) and poor ranking (relevant chunks are not in top positions). This suggests fundamental issues with both retrieval quality and ranking that need to be addressed.
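To see which combination applies, both metrics can be computed from the same rank-ordered relevance labels. The sketch below treats Context Precision simply as the share of relevant chunks across all retrieved chunks, matching the "noise in retrieval" description above; the data and function names are illustrative.

```python
def precision_at_k(labels, k):
    top_k = labels[:k]
    return sum(top_k) / len(top_k) if top_k else 0.0

def context_precision(labels):
    # Simplified here as the share of relevant chunks across ALL retrieved chunks.
    return sum(labels) / len(labels) if labels else 0.0

# Example: good ranking (top 3 all relevant) but a noisy overall retrieval,
# i.e. the "High Precision @ K, Low Context Precision" combination.
labels = [True, True, True, False, False, False, False, False, False, False]
print(f"Precision @ 3:     {precision_at_k(labels, 3):.2f}")   # 1.00 -> high
print(f"Context Precision: {context_precision(labels):.2f}")   # 0.30 -> low
```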
Best practices
Determine Optimal Top K
Evaluating Precision @ K at multiple K values helps find the optimal number of chunks to retrieve for each use case.
Monitor Ranking Quality
Tracking Precision @ K helps ensure retrieval systems rank relevant chunks appropriately in the top positions.
Combine with Context Precision
Precision @ K works alongside Context Precision for a comprehensive view of retrieval effectiveness at both specific ranks and overall.
Analyze Across Queries
Evaluating Precision @ K across different query types helps identify patterns and optimize retrieval strategies accordingly.
When optimizing for Precision @ K, the goal is to find the right balance: a K value that’s high enough to capture all relevant chunks but low enough to avoid retrieving too many irrelevant chunks. Evaluating Precision @ K metrics at different K values helps make data-driven decisions about the Top K parameter.
Creating multiple Precision @ K variants in Galileo
Configuring Multiple K Values
Define separate metric configurations: Create distinct Precision @ K configurations in Galileo for each K value of interest (such as K = 1, 3, 5, 10) so they appear as separate metrics in experiments and Log streams.
Use code-based metric customization: For each additional K value, create a new code-based metric in Galileo, copy the prefilled scorer code from the preset Precision @ K metric, and update the value of K in the code. This allows multiple Precision @ K variants to share the same logic while differing only in the K parameter.
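As a rough illustration of the "differ only in K" idea, a parameterized scorer might look like the sketch below. This is not the prefilled scorer code that Galileo's preset Precision @ K metric provides; it only shows that each variant can share the same scoring logic while changing a single constant.

```python
K = 7  # The only value that changes between Precision @ 1, @ 3, @ 5, @ 10 variants.

def score_precision_at_k(relevance_labels: list[bool], k: int = K) -> float:
    """Fraction of relevant chunks among the top k retrieved chunks."""
    top_k = relevance_labels[:k]
    return sum(top_k) / len(top_k) if top_k else 0.0
```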