Maximizing Chunk Utilization
Learn how to boost your AI model’s performance by fully leveraging retrieved text chunks.
Imagine you’ve built a retrieval-augmented generation (RAG) system that does a decent job finding relevant texts, yet the final answers often remain disappointingly shallow. The system might pick up on a simple definition but ignore deeper nuances, skip over chunks that are clearly relevant, or fail to combine ideas from multiple sources. All that potential remains untapped.
Under the hood, three main deficiencies hold most RAG pipelines back:
- Importance Ranking: Traditional pipelines retrieve documents, but don’t weigh or highlight which pieces of information truly matter.
- Cross-Chunk Integration: Often, these systems overlook how chunks relate to each other, even though combining them yields richer answers.
- Relevance Signaling: The model rarely receives clear cues about which sections are absolutely crucial to the user’s query.
These three factors—ranking, integration, and signaling—determine whether the model’s final answer simply skims the surface or includes the full depth of information in your document repository.
The Gap: Chunk Attribution and Utilization
When building RAG applications, there’s a critical gap between retrieving relevant information and actually using it effectively. Even when your system successfully finds the right chunks, the final responses often fall short in three key ways:
- The model cherry-picks basic information while ignoring deeper insights
- Retrieved chunks with high relevance scores are completely ignored
- Related information across multiple chunks isn’t synthesized together
Let’s look at how these problems manifest through two key concepts:
Chunk Attribution measures how well the model acknowledges and uses information from multiple sources. Poor chunk attribution occurs when the model ignores relevant chunks or fails to integrate information from multiple sources, leading to incomplete analysis and missed insights.
Chunk Utilization focuses on how thoroughly the model uses the content within each chunk. Even when chunks are properly attributed, the model often uses only a fraction of the available information, resulting in superficial responses that miss important details and nuances.
Here’s an example that demonstrates these issues:
Component | Content |
---|---|
User Query | "What are the fundamental concepts and operations in arithmetic, and how are they used in mathematics?" |
Retrieved Chunk 1 | "In mathematics, arithmetic is the basic study of numbers. The four basic arithmetic operations are addition, subtraction, multiplication, and division, although other operations such as exponentiation and extraction of roots are also studied in arithmetic. Other arithmetic topics includes working with signed numbers, fractions, decimals and percentages." |
Retrieved Chunk 2 | "Mathematics includes the study of numbers, shapes and patterns. Structure: how things are organized. This subfield is usually called algebra. Place: where things are and their arrangement. This subfield is usually called geometry. Change: how things become different. This subfield is usually called analysis." |
Retrieved Chunk 3 | "Multiplication is an arithmetic operation for finding the product of two numbers. With natural numbers, multiplication gives the number of tiles in a rectangle. Multiplication between numbers is said to be commutative—when the order of the numbers does not influence the value of the product." |
Model Response | "Arithmetic is the basic study of numbers in mathematics. The main operations are addition, subtraction, multiplication, and division. These operations are used to work with numbers in mathematics." |
The response uses only basic definitional information despite having access to rich details about properties, applications, and relationships. This happens because our basic implementation lacks the mechanisms needed to:
- Identify which parts of each chunk are most relevant to the query
- Understand how information across chunks relates to each other
- Determine the relative importance of different pieces of information
Baseline Performance Metrics
When measuring this basic implementation against our key metrics, we see significant room for improvement:
- Chunk Utilization: Only 38% of relevant information within retrieved chunks is actually used in the final response
- Chunk Attribution: On average, only 2 out of 5 retrieved chunks are referenced or used in the response
- Completeness: This metric captures whether the final answer includes all of the essential information available to the model. In this case, only 73% of essential information from the source material makes it into the final answer
These metrics highlight the gap between having access to relevant information and actually using it effectively in the response. Let’s look at how we can improve these numbers through an enhanced implementation.
Improving the System: Enhanced Retrieval and Guidance
Let’s look at how we can improve each part of the system through three key enhancements:
1. Cross-Encoder Reranking for Better Document Selection
Vector similarity alone treats the query and the document as two separate embeddings. A cross-encoder, however, reads them side by side (e.g., as “query [SEP] document text”). This produces richer relevance scores since the model can detect subtle connections or contradictory nuances.
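Below is a minimal reranking sketch built on the `sentence-transformers` `CrossEncoder`. The model name, the score threshold, and the exact metadata keys are illustrative assumptions rather than the article’s original code, and each document is assumed to be a dict with a `text` field:

```python
# A minimal reranking sketch (not the original code). Assumes documents are
# dicts with a "text" key; the model name and threshold are illustrative.
from sentence_transformers import CrossEncoder

def rerank_documents(query, documents, top_k=5, threshold=0.2):
    """Score (query, document) pairs with a cross-encoder and keep the best ones."""
    cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    # Build (query, document text) pairs so the model reads both together
    pairs = [(query, doc["text"]) for doc in documents]
    raw_scores = cross_encoder.predict(pairs)

    # Min-max normalize the scores into [0, 1] for easier comparison
    lo, hi = float(min(raw_scores)), float(max(raw_scores))
    scores = [(s - lo) / (hi - lo) if hi > lo else 1.0 for s in raw_scores]

    reranked = []
    for doc, score in zip(documents, scores):
        if score < threshold:
            continue  # filter out documents below the relevance threshold
        doc = dict(doc)
        doc["combined_score"] = round(float(score), 3)
        doc["relevance"] = "high" if score > 0.7 else "medium" if score > 0.4 else "low"
        reranked.append(doc)

    # Return only the top-k most relevant documents
    reranked.sort(key=lambda d: d["combined_score"], reverse=True)
    return reranked[:top_k]
```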
The reranking code adds extra metadata, like `combined_score` and `relevance`, which become invaluable signals later on. This process:
- Takes the query and each document and creates pairs for evaluation
- Passes these pairs through a cross-encoder model that evaluates them together
- Normalizes the scores to be between 0 and 1 for easier comparison
- Filters out documents below our threshold
- Adds metadata that helps the model understand document importance
- Returns only the top k most relevant documents
2. Prompting for Synthesis and Better Information Integration
If you want the model to combine details from multiple chunks, you need to tell it how to do so:
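The exact wording below is an illustrative sketch rather than a canonical prompt; the point is to make the synthesis rules explicit:

```python
# An illustrative system prompt; the wording is an assumption, adapt it freely.
SYNTHESIS_SYSTEM_PROMPT = """You are an assistant that answers questions using the retrieved documents provided.

Rules:
1. Use ALL relevant documents, not just the first or most obvious one.
2. Cite sources explicitly, e.g. "According to Document 2, ...".
3. Connect related information that appears across different documents.
4. Give more weight to documents with higher relevance scores.
5. Structure the answer: core answer first, then supporting details,
   properties, and relationships drawn from the documents.
If the documents do not contain the answer, say so rather than guessing."""
```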
This system prompt:
- Sets clear expectations for how to handle multiple documents
- Requires explicit citation of sources
- Emphasizes the importance of connecting related information
- Guides the model to consider document relevance scores
- Provides a structure for building comprehensive responses
By laying out these rules, you remind the model that it isn’t enough to restate each chunk separately. It should blend details, note relevant sources, and produce a single, well-structured explanation. With explicit instructions to reference scores or cross-document relationships, your pipeline gains contextual awareness.
3. Enriched Formatting for Clear Information Structure
When you format each chunk, include metadata that matters:
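A minimal formatting helper, assuming the metadata keys produced by the reranking sketch above (`source`, `relevance`, `combined_score`):

```python
# A minimal formatting sketch; the field names follow the reranking sketch above.
def format_documents(documents):
    """Render each retrieved chunk with the metadata the model should see."""
    blocks = []
    for i, doc in enumerate(documents, start=1):
        blocks.append(
            f"Document {i}\n"
            f"Source: {doc.get('source', 'unknown')}\n"
            f"Relevance: {doc.get('relevance', 'unknown')} "
            f"(score: {doc.get('combined_score', 0.0):.2f})\n"
            f"Content: {doc['text']}"
        )
    return "\n\n".join(blocks)
```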
This enhanced formatting:
- Numbers each document for easy reference
- Shows the source of the information
- Includes the relevance category (high/medium/low)
- Displays the numerical relevance score
- Formats everything consistently for easier processing
By embedding the relevance tag and final numeric score, you give the language model reasons to say: “Document 1 is obviously the most relevant, so I should prioritize these details.” This extra nudge prevents the model from ignoring vital chunks.
Putting It All Together
Here’s how these components work together in the final implementation:
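One way the pieces could be wired together is sketched below, assuming the helpers defined above, a document store exposing a `search()` method (see the appendix), and an OpenAI-compatible chat client; the model name is an assumption:

```python
# A pipeline sketch tying the pieces together; the model name and client
# choice are assumptions, not a prescribed setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_question(question, store, k=10, top_k=5):
    # 1. Retrieve candidate chunks with vector search
    candidates = store.search(question, k=k)
    # 2. Rerank them with the cross-encoder to judge true relevance
    relevant = rerank_documents(question, candidates, top_k=top_k)
    # 3. Format the chunks with clear structure and metadata
    context = format_documents(relevant)
    # 4.-5. Ask the model to synthesize across all relevant sources
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYNTHESIS_SYSTEM_PROMPT},
            {"role": "user", "content": f"Documents:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```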
The complete workflow:
- When a question comes in, we first retrieve potentially relevant documents using vector search
- These documents go through the reranking process to evaluate their actual relevance to the query
- The documents are formatted with clear structure and metadata highlighting their importance
- The enhanced system prompt guides the model in synthesizing information across documents
- The model generates a response that incorporates information from all relevant sources
Performance Gains: Putting Theory into Practice
The enhanced implementation delivers significant improvements across key metrics:
Metric | Basic Implementation | Enhanced Implementation |
---|---|---|
Chunk Utilization | 38% of relevant content used | ~100% of relevant content used |
Chunk Attribution | 2 out of 5 chunks referenced | All relevant chunks referenced |
Completeness | 73% of essential information | 100% of essential information |
These improvements transform our RAG system from a basic fact-finder into a true knowledge synthesizer. Instead of just retrieving information, the system now:
- Prioritizes the most relevant documents
- Guides the model to use more of the available information
- Encourages connections between related concepts
- Produces more comprehensive and accurate responses
Summary
Building an effective RAG system doesn’t stop at retrieving the right documents. To unlock more nuanced answers, you must ensure your system actively uses every piece of relevant text. Adding a cross-encoder for reranking, improving prompt design for synthesis, and clearly formatting metadata are simple yet powerful steps toward maximizing chunk utilization.
By addressing the three core deficiencies—importance ranking, cross-chunk integration, and relevance signaling—you can dramatically improve your RAG system’s performance without changing the underlying retrieval mechanism or language model.
Appendix: Building a Basic Document Store
If you’re building this solution from scratch, you’ll need to start with a basic document store implementation. This section walks you through creating the foundation that we’ll improve upon.
Below is a sample implementation of how you might load a dataset, chunk it, build a FAISS index, and perform a simple k-nearest-neighbors search for relevant chunks. Let’s break down each component:
Document Store Initialization
The `DocumentStore` class serves as our foundation for managing and retrieving documents:
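A sketch of what the constructor might look like; the dataset name and default parameter values are assumptions, and the `_chunk_text` and `_build_index` methods are filled in over the next sections:

```python
# A sketch of the DocumentStore constructor. The dataset name and default
# parameters are assumptions; _chunk_text and _build_index are defined below.
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

class DocumentStore:
    def __init__(self, num_docs=100, chunk_size=200, k=5):
        self.k = k  # how many similar chunks to retrieve per query
        self.encoder = SentenceTransformer("all-MiniLM-L6-v2")

        # Load Wikipedia articles from the HuggingFace hub
        dataset = load_dataset(
            "wikimedia/wikipedia", "20231101.simple", split=f"train[:{num_docs}]"
        )

        # Chunk each article and attach metadata to every chunk
        self.documents = []
        for article in dataset:
            text = article.get("text", "")
            if len(text) < 100:
                continue  # skip empty or very short articles
            for position, chunk in enumerate(self._chunk_text(text, chunk_size)):
                self.documents.append({
                    "text": chunk,
                    "source": article.get("title", "unknown"),
                    "position": position,  # where the chunk sits in the article
                    "length": len(chunk),
                })

        # Embed the chunks and build the FAISS index (defined below)
        self._build_index()
```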
This initialization code handles several important tasks:
- Dataset Loading: We use the HuggingFace datasets library to load Wikipedia articles. The `num_docs` parameter controls how many articles to load, making it easy to start small for testing.
- Document Processing:
  - Each article is split into chunks using the `_chunk_text` method
  - Empty or very short articles (< 100 characters) are skipped to maintain quality
  - Each chunk gets comprehensive metadata including its source, position in the original document, and length
- Embedding Setup:
  - We use the `SentenceTransformer` model `all-MiniLM-L6-v2`, which provides a good balance of speed and quality
  - The `k` parameter determines how many similar documents to retrieve for each query
Text Chunking Implementation
The chunking strategy is crucial for document retrieval. Our implementation uses a sentence-aware approach to maintain context:
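A possible `_chunk_text` implementation (shown unindented for readability; it belongs inside the `DocumentStore` class, and the default chunk size is an assumption):

```python
import re

# A sketch of the sentence-aware chunker; belongs inside the DocumentStore class.
def _chunk_text(self, text, chunk_size=200):
    """Split text into chunks of roughly chunk_size words without breaking sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        words = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the target size
        if current and current_len + words > chunk_size:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += words

    # Keep whatever remains so the last chunk is not lost
    if current:
        chunks.append(" ".join(current))
    return chunks
```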
This chunking implementation has several key features:
- Sentence-Aware Splitting:
  - Uses the regex `(?<=[.!?])\s+` to split on sentence boundaries
  - Preserves sentence integrity instead of breaking mid-sentence
  - Maintains natural language flow and context
- Dynamic Chunk Size:
  - Tracks chunk size using word count as a proxy for tokens
  - Tries to keep chunks close to the target size while respecting sentence boundaries
  - Prevents chunks from growing too large while keeping related content together
- Clean Handling:
  - Joins sentences with proper spacing
  - Handles the last chunk properly to avoid losing content
  - Maintains document structure for better retrieval
Building the Vector Index
For efficient similarity search, we build a FAISS index. FAISS is particularly good at handling large-scale similarity search:
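A sketch of the index-building step, again as a `DocumentStore` method:

```python
import faiss

# A sketch of the index build; belongs inside the DocumentStore class.
def _build_index(self):
    """Embed every chunk and index the normalized vectors with FAISS."""
    texts = [doc["text"] for doc in self.documents]

    # Encode all chunks in one batch into fixed-dimension dense vectors
    embeddings = self.encoder.encode(texts, convert_to_numpy=True).astype("float32")

    # L2-normalize so that inner product equals cosine similarity
    faiss.normalize_L2(embeddings)

    # Exact inner-product index over the normalized vectors
    self.index = faiss.IndexFlatIP(embeddings.shape[1])
    self.index.add(embeddings)
```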
The indexing process involves several important steps:
- Embedding Generation:
  - Uses `SentenceTransformer` to create dense vector representations
  - Processes all documents in a single batch for efficiency
  - Creates fixed-dimension embeddings for each chunk
- Vector Normalization:
  - Applies L2 normalization to standardize vector lengths
  - Ensures cosine similarity calculations are accurate
  - Improves search quality by making magnitudes comparable
- FAISS Index Creation:
  - Uses `IndexFlatIP` for exact inner product calculations
  - Optimized for cosine similarity between normalized vectors
  - Enables fast nearest neighbor search at scale
Basic Search Implementation
The search functionality ties everything together, enabling efficient retrieval of relevant chunks:
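A sketch of the `search` method, with illustrative relevance thresholds:

```python
import copy

import faiss

# A sketch of the search method; belongs inside the DocumentStore class.
def search(self, query, k=None):
    """Return the k chunks most similar to the query, with relevance metadata."""
    k = k or self.k

    # Encode and normalize the query exactly like the indexed documents
    query_vec = self.encoder.encode([query], convert_to_numpy=True).astype("float32")
    faiss.normalize_L2(query_vec)  # shape (1, dim), as FAISS expects

    # Exact nearest-neighbor search over the normalized vectors
    scores, indices = self.index.search(query_vec, k)

    results = []
    for score, idx in zip(scores[0], indices[0]):
        if idx == -1:
            continue  # FAISS pads with -1 when fewer than k results exist
        doc = copy.deepcopy(self.documents[idx])  # don't mutate the stored chunk
        doc["similarity"] = float(score)
        # Illustrative thresholds for the relevance category
        doc["relevance"] = "high" if score > 0.7 else "medium" if score > 0.4 else "low"
        results.append(doc)
    return results
```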
The search implementation includes several sophisticated features:
- Query Processing:
  - Converts the query text to a vector using the same encoder
  - Normalizes the query vector for consistent similarity calculations
  - Reshapes the vector to the 2-D array FAISS expects
- Similarity Search:
  - Uses FAISS to find the k nearest neighbors
  - Returns both similarity scores and document indices
  - Runs as an exact, exhaustive scan with `IndexFlatIP`; approximate FAISS indexes can make this sub-linear at larger scales
- Result Processing:
  - Creates deep copies to prevent modifying the original documents
  - Adds similarity scores to metadata
  - Categorizes relevance based on score thresholds
  - Returns a clean, structured result format
This foundation is flexible enough to support cross-encoder reranking or advanced filtering strategies. As your needs grow, you’ll find myriad ways to refine retrieval, relevance scoring, and chunk usage even further. Some potential improvements include:
- Adding chunk overlap to capture context at boundaries
- Implementing more sophisticated chunking strategies
- Adding filters based on metadata
- Incorporating hybrid search approaches
- Adding caching for frequently accessed documents