Imagine you’ve built a retrieval-augmented generation (RAG) system that does a decent job finding relevant texts, yet the final answers often remain disappointingly shallow. The system might pick up on a simple definition but ignore deeper nuances, skip over chunks that are clearly relevant, or fail to combine ideas from multiple sources. All that potential remains untapped.
Under the hood, three main deficiencies hold most RAG pipelines back:
- Importance Ranking: Traditional pipelines retrieve documents, but don’t weigh or highlight which pieces of information truly matter.
- Cross-Chunk Integration: Often, these systems overlook how chunks relate to each other, even though combining them yields richer answers.
- Relevance Signaling: The model rarely receives clear cues about which sections are absolutely crucial to the user’s query.
These three factors—ranking, integration, and signaling—determine whether the model’s final answer simply skims the surface or includes the full depth of information in your document repository.
The gap: chunk attribution and utilization
When building RAG applications, there’s a critical gap between retrieving relevant information and actually using it effectively. Even when your system successfully finds the right chunks, the final responses often fall short in three key ways:
- The model cherry-picks basic information while ignoring deeper insights
- Retrieved chunks with high relevance scores are completely ignored
- Related information across multiple chunks isn’t synthesized together
Let’s look at how these problems manifest through two key concepts:
Chunk Attribution measures how well the model acknowledges and uses information from multiple sources. Poor chunk attribution occurs when the model ignores relevant chunks or fails to integrate information from multiple sources, leading to incomplete analysis and missed insights.
Chunk Utilization focuses on how thoroughly the model uses the content within each chunk. Even when chunks are properly attributed, the model often uses only a fraction of the available information, resulting in superficial responses that miss important details and nuances.
Here’s an example that demonstrates these issues:
| Component | Content |
|---|---|
| User Query | "What are the fundamental concepts and operations in arithmetic, and how are they used in mathematics?" |
| Retrieved Chunk 1 | "In mathematics, arithmetic is the basic study of numbers. The four basic arithmetic operations are addition, subtraction, multiplication, and division, although other operations such as exponentiation and extraction of roots are also studied in arithmetic. Other arithmetic topics includes working with signed numbers, fractions, decimals and percentages." |
| Retrieved Chunk 2 | "Mathematics includes the study of numbers, shapes and patterns. Structure: how things are organized. This subfield is usually called algebra. Place: where things are and their arrangement. This subfield is usually called geometry. Change: how things become different. This subfield is usually called analysis." |
| Retrieved Chunk 3 | "Multiplication is an arithmetic operation for finding the product of two numbers. With natural numbers, multiplication gives the number of tiles in a rectangle. Multiplication between numbers is said to be commutative—when the order of the numbers does not influence the value of the product." |
| Model Response | "Arithmetic is the basic study of numbers in mathematics. The main operations are addition, subtraction, multiplication, and division. These operations are used to work with numbers in mathematics." |
The response uses only basic definitional information despite having access to rich details about properties, applications, and relationships. This happens because our basic implementation lacks the mechanisms needed to:
- Identify which parts of each chunk are most relevant to the query
- Understand how information across chunks relates to each other
- Determine the relative importance of different pieces of information
When measuring this basic implementation against our key metrics, we see significant room for improvement:
- Chunk Utilization: Only 38% of relevant information within retrieved chunks is actually used in the final response
- Chunk Attribution: On average, only 2 out of 5 retrieved chunks are referenced or used in the response
- Completeness: Completeness measures whether the final answer covers all of the relevant information available in the retrieved context. In this case, only 73% of essential information from the source material makes it into the final answer
These metrics highlight the gap between having access to relevant information and actually using it effectively in the response. Let’s look at how we can improve these numbers through enhanced implementation.
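If you don't have a dedicated evaluation tool, you can approximate chunk utilization and attribution with simple text overlap. The sketch below is purely illustrative: the word-overlap heuristic, the `min_recall` threshold, and the sentence splitting are assumptions for demonstration, not how these metrics are formally defined.

```python
import re
from typing import Dict, List

def _sentence_used(sentence: str, response: str, min_recall: float = 0.6) -> bool:
    # Treat a sentence as "used" if most of its content words appear in the response.
    # This is a crude proxy, not a formal definition of utilization.
    sent_words = {w for w in re.findall(r"\w+", sentence.lower()) if len(w) > 3}
    resp_words = set(re.findall(r"\w+", response.lower()))
    return bool(sent_words) and len(sent_words & resp_words) / len(sent_words) >= min_recall

def rough_rag_metrics(chunks: List[str], response: str) -> Dict[str, float]:
    used, total, attributed = 0, 0, 0
    for chunk in chunks:
        sentences = [s for s in re.split(r'(?<=[.!?])\s+', chunk) if s.strip()]
        hits = sum(_sentence_used(s, response) for s in sentences)
        used += hits
        total += len(sentences)
        attributed += 1 if hits else 0  # chunk contributed at least one sentence
    return {
        "chunk_utilization": used / max(total, 1),              # share of chunk sentences reflected
        "chunk_attribution": attributed / max(len(chunks), 1),  # share of chunks referenced
    }
```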
Improving the system: enhanced retrieval and guidance
Let’s look at how we can improve each part of the system through three key enhancements:
1. Cross-encoder reranking for better document selection
Vector similarity alone treats the query and the document as two separate embeddings. A cross-encoder, however, reads them side by side (e.g., as “query [SEP] document text”). This produces richer relevance scores since the model can detect subtle connections or contradictory nuances.
```python
def _rerank_documents(self, docs: List[Dict], query: str) -> List[Dict]:
    # Create pairs for the cross-encoder to evaluate
    pairs = [(query, doc['text']) for doc in docs]

    # Get relevance scores from the cross-encoder
    cross_scores = self.cross_encoder.predict(pairs)

    # Convert scores to 0-1 range
    min_score = min(cross_scores)
    max_score = max(cross_scores)
    score_range = max_score - min_score

    reranked_docs = []
    for doc, score in zip(docs, cross_scores):
        normalized_score = (score - min_score) / score_range if score_range > 0 else 0.5
        if normalized_score >= self.reranking_threshold:
            doc['metadata']['combined_score'] = normalized_score
            doc['metadata']['relevance'] = 'high' if normalized_score > 0.8 else \
                                           'medium' if normalized_score > 0.6 else 'low'
            reranked_docs.append(doc)

    # Sort and return top k documents
    reranked_docs.sort(key=lambda x: x['metadata']['combined_score'], reverse=True)
    return reranked_docs[:self.k]
```
The reranking code adds extra metadata, like `combined_score` and `relevance`, which become invaluable signals later on. This process:
- Takes the query and each document and creates pairs for evaluation
- Passes these pairs through a cross-encoder model that evaluates them together
- Normalizes the scores to be between 0 and 1 for easier comparison
- Filters out documents below our threshold
- Adds metadata that helps the model understand document importance
- Returns only the top k most relevant documents
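The method above assumes the store holds a `self.cross_encoder` and a `self.reranking_threshold`. A minimal sketch of how these might be wired into the constructor is shown below, using the `CrossEncoder` class from sentence-transformers; the `ms-marco-MiniLM-L-6-v2` checkpoint is just one common choice, and the extra constructor arguments mirror the ones used later in the article.

```python
from sentence_transformers import CrossEncoder

class DocumentStore:
    def __init__(self, num_docs=1000, k=3, chunk_size=512,
                 use_reranking=False, reranking_threshold=0.6):
        # ... dataset loading, chunking, and index building as shown in the appendix ...
        self.k = k
        self.use_reranking = use_reranking
        self.reranking_threshold = reranking_threshold
        if use_reranking:
            # A widely used passage-reranking cross-encoder; any compatible model works
            self.cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
```

With `use_reranking=True`, `search` would then pass its vector-search results through `_rerank_documents` before returning them.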
2. System prompt guidance for cross-chunk synthesis
If you want the model to combine details from multiple chunks, you need to tell it how to do so:
```python
SYSTEM_PROMPT = """You are a knowledgeable assistant tasked with providing comprehensive answers by analyzing and synthesizing information from multiple provided documents.
IMPORTANT INSTRUCTIONS:
1. Cross-reference information between documents - identify and connect complementary or contrasting information.
2. For each concept you discuss, cite the specific documents that support your explanation.
3. When multiple documents contain related information, explicitly connect their content.
4. If documents present different aspects of a concept, synthesize them into a cohesive explanation.
5. Pay attention to the relevance scores of documents.
6. Ensure you consider and utilize information from all relevant documents.
7. Structure your response to build from foundational concepts to more advanced applications."""
```
This system prompt:
- Sets clear expectations for how to handle multiple documents
- Requires explicit citation of sources
- Emphasizes the importance of connecting related information
- Guides the model to consider document relevance scores
- Provides a structure for building comprehensive responses
By laying out these rules, you remind the model that it isn’t enough to restate each chunk separately. It should blend details, note relevant sources, and produce a single, well-structured explanation. With explicit instructions to reference scores or cross-document relationships, your pipeline gains contextual awareness.
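The final implementation further down also references a `Prompts.ENHANCED` template for the user message, which isn't shown in this article. The version below is a hypothetical sketch of what such a template might contain; the `Prompts` class and the exact wording are assumptions, kept consistent with the `query` and `documents` placeholders used later.

```python
class Prompts:
    # Hypothetical user-prompt template; the real Prompts.ENHANCED may differ
    ENHANCED = """Answer the question using ONLY the documents provided below.

Documents (each with a relevance label and score):
{documents}

Question: {query}

Draw on every relevant document, cite documents by number (e.g., "Document 2"),
and explicitly connect related information that appears across documents."""
```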
3. Enhanced document formatting with relevance metadata
When you format each chunk, include metadata that matters:
```python
def format_documents_enhanced(documents: list) -> str:
    formatted = []
    for i, doc in enumerate(documents):
        # Prefer the reranking score; fall back to the vector-search score if absent
        score = doc['metadata'].get('combined_score', doc['metadata'].get('score'))
        score_text = f"{score:.3f}" if score is not None else "N/A"
        formatted.append(
            f"Document {i+1} (Source: {doc['metadata']['source']}, "
            f"Relevance: {doc['metadata']['relevance']}, "
            f"Score: {score_text}):\n{doc['text']}"
        )
    return "\n\n".join(formatted)
```
This enhanced formatting:
- Numbers each document for easy reference
- Shows the source of the information
- Includes the relevance category (high/medium/low)
- Displays the numerical relevance score
- Formats everything consistently for easier processing
By embedding the relevance tag and final numeric score, you give the language model reasons to say: “Document 1 is obviously the most relevant, so I should prioritize these details.” This extra nudge prevents the model from ignoring vital chunks.
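For a single reranked chunk, the formatted context handed to the model looks roughly like this (the sample document and score are illustrative):

```python
sample_docs = [{
    "text": "Multiplication is an arithmetic operation for finding the product of two numbers.",
    "metadata": {"source": "Multiplication", "relevance": "high", "combined_score": 0.91},
}]

print(format_documents_enhanced(sample_docs))
# Document 1 (Source: Multiplication, Relevance: high, Score: 0.910):
# Multiplication is an arithmetic operation for finding the product of two numbers.
```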
Putting it all together
Here’s how these components work together in the final implementation:
```python
import os
import openai

def query(question: str):
    client = openai.OpenAI(
        api_key=os.getenv("OPENAI_API_KEY"),
        base_url=os.getenv("OPENAI_BASE_URL")
    )

    # Get and rerank documents
    docs = DocumentStore(
        num_docs=500,
        k=5,
        use_reranking=True,
        reranking_threshold=0.6
    ).search(question)

    # Format with enhanced metadata
    prompt = Prompts.ENHANCED.format(
        query=question,
        documents=format_documents_enhanced(docs)
    )

    # Generate guided response
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content.strip()
```
The complete workflow:
- When a question comes in, we first retrieve potentially relevant documents using vector search
- These documents go through the reranking process to evaluate their actual relevance to the query
- The documents are formatted with clear structure and metadata highlighting their importance
- The enhanced system prompt guides the model in synthesizing information across documents
- The model generates a response that incorporates information from all relevant sources
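Invoking the pipeline is then a single call; the question below reuses the example query from earlier in the article:

```python
answer = query(
    "What are the fundamental concepts and operations in arithmetic, "
    "and how are they used in mathematics?"
)
print(answer)
```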
The enhanced implementation delivers significant improvements across key metrics:
| Metric | Basic Implementation | Enhanced Implementation |
|---|---|---|
| Chunk Utilization | 38% of relevant content used | ~100% of relevant content used |
| Chunk Attribution | 2 out of 5 chunks referenced | All relevant chunks referenced |
| Completeness | 73% of essential information | 100% of essential information |
These improvements transform our RAG system from a basic fact-finder into a true knowledge synthesizer. Instead of just retrieving information, the system now:
- Prioritizes the most relevant documents
- Guides the model to use more of the available information
- Encourages connections between related concepts
- Produces more comprehensive and accurate responses
Summary
Building an effective RAG system doesn’t stop at retrieving the right documents. To unlock more nuanced answers, you must ensure your system actively uses every piece of relevant text. Adding a cross-encoder for reranking, improving prompt design for synthesis, and clearly formatting metadata are simple yet powerful steps toward maximizing chunk utilization.
By addressing the three core deficiencies—importance ranking, cross-chunk integration, and relevance signaling—you can dramatically improve your RAG system’s performance without changing the underlying retrieval mechanism or language model.
Appendix: building a basic document store
If you’re building this solution from scratch, you’ll need to start with a basic document store implementation. This section walks you through creating the foundation that we’ll improve upon.
Below is a sample implementation of how you might load a dataset, chunk it, build a FAISS index, and perform a simple k-nearest-neighbors search for relevant chunks. Let’s break down each component:
Document store initialization
The `DocumentStore` class serves as our foundation for managing and retrieving documents:
```python
from sentence_transformers import SentenceTransformer
import faiss
from datasets import load_dataset
import numpy as np
from typing import List, Dict
import re


class DocumentStore:
    def __init__(self, num_docs=1000, k=3, chunk_size=512):
        # Load Wikipedia dataset with configurable parameters
        print(f"Loading {num_docs} Wikipedia articles...")
        dataset = load_dataset("wikipedia", "20220301.simple",
                               split=f"train[:{num_docs}]",
                               trust_remote_code=True)

        # Process and store documents with chunking
        self.documents = []
        for item in dataset:
            # Skip empty or very short articles
            if not item["text"] or len(item["text"]) < 100:
                continue

            # Split text into chunks
            chunks = self._chunk_text(item["text"], chunk_size)
            for i, chunk in enumerate(chunks):
                self.documents.append({
                    "text": chunk,
                    "metadata": {
                        "source": item["title"],
                        "chunk_id": i,
                        "total_chunks": len(chunks),
                        "relevance": "medium",
                        "length": len(chunk),
                        "url": item.get("url", ""),
                        "original_length": len(item["text"])
                    }
                })

        print(f"Processed {len(self.documents)} chunks from valid documents")

        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.k = k
        self._build_index()
```
This initialization code handles several important tasks:
- Dataset Loading: We use the HuggingFace datasets library to load Wikipedia articles. The `num_docs` parameter controls how many articles to load, making it easy to start small for testing.
- Document Processing:
  - Each article is split into chunks using the `_chunk_text` method
  - Empty or very short articles (< 100 characters) are skipped to maintain quality
  - Each chunk gets comprehensive metadata including its source, position in the original document, and length
- Embedding Setup:
  - We use the `SentenceTransformer` model `all-MiniLM-L6-v2`, which provides a good balance of speed and quality
  - The `k` parameter determines how many similar documents to retrieve for each query
Text chunking implementation
The chunking strategy is crucial for document retrieval. Our implementation uses a sentence-aware approach to maintain context:
```python
def _chunk_text(self, text: str, chunk_size: int) -> List[str]:
    """
    Split text into chunks of approximately chunk_size tokens.
    Uses sentence boundaries where possible to maintain context.
    """
    # First split into sentences
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks = []
    current_chunk = []
    current_length = 0

    for sentence in sentences:
        # Rough estimation of tokens (words + punctuation)
        sentence_length = len(sentence.split())

        if current_length + sentence_length > chunk_size and current_chunk:
            # If adding this sentence would exceed chunk_size, save current chunk
            chunks.append(' '.join(current_chunk))
            current_chunk = [sentence]
            current_length = sentence_length
        else:
            # Add sentence to current chunk
            current_chunk.append(sentence)
            current_length += sentence_length

    # Add the last chunk if it exists
    if current_chunk:
        chunks.append(' '.join(current_chunk))

    return chunks
```
This chunking implementation has several key features:
- Sentence-Aware Splitting:
  - Uses the regex `(?<=[.!?])\s+` to split on sentence boundaries
  - Preserves sentence integrity instead of breaking mid-sentence
  - Maintains natural language flow and context
- Dynamic Chunk Size:
  - Tracks chunk size using word count as a proxy for tokens
  - Tries to keep chunks close to the target size while respecting sentence boundaries
  - Prevents chunks from growing too large while keeping related content together
- Clean Handling:
  - Joins sentences with proper spacing
  - Handles the last chunk properly to avoid losing content
  - Maintains document structure for better retrieval
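As a quick sanity check, you can exercise the chunker directly. Because `_chunk_text` doesn't touch any instance state, the sketch below calls it without building a full store; the sample text and the 20-word budget are just for illustration.

```python
text = ("Arithmetic is the basic study of numbers. "
        "The four basic operations are addition, subtraction, multiplication, and division. "
        "Other topics include fractions, decimals, and percentages.")

# _chunk_text ignores `self`, so we can pass None for a standalone check
for chunk in DocumentStore._chunk_text(None, text, chunk_size=20):
    print(repr(chunk))
# The first two sentences (17 words) fit within the 20-word budget and form one chunk;
# the third sentence starts a new chunk because adding it would exceed the budget.
```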
Building the vector index
For efficient similarity search, we build a FAISS index. FAISS is particularly good at handling large-scale similarity search:
```python
def _build_index(self):
    print("Building FAISS index...")
    texts = [doc["text"] for doc in self.documents]
    self.embeddings = self.encoder.encode(texts)
    dimension = self.embeddings.shape[1]

    # Use L2 normalization for better similarity search
    faiss.normalize_L2(self.embeddings)

    # Create an index that's optimized for cosine similarity
    self.index = faiss.IndexFlatIP(dimension)
    self.index.add(self.embeddings.astype('float32'))
    print("Index built successfully")
```
The indexing process involves several important steps:
- Embedding Generation:
  - Uses SentenceTransformer to create dense vector representations
  - Processes all documents in batch for efficiency
  - Creates fixed-dimension embeddings for each chunk
- Vector Normalization:
  - Applies L2 normalization to standardize vector lengths
  - Ensures cosine similarity calculations are accurate
  - Improves search quality by making magnitudes comparable
- FAISS Index Creation:
  - Uses `IndexFlatIP` for exact inner product calculations
  - Optimized for cosine similarity between normalized vectors
  - Enables fast nearest neighbor search at scale
Basic search implementation
The search functionality ties everything together, enabling efficient retrieval of relevant chunks:
```python
def search(self, query: str) -> list:
    # Encode and normalize the query
    query_vector = self.encoder.encode([query])[0]
    query_vector = query_vector / np.linalg.norm(query_vector)
    query_vector = query_vector.reshape(1, -1)

    # Search for similar documents
    scores, indices = self.index.search(query_vector.astype('float32'), self.k)

    # Process results
    results = []
    for score, idx in zip(scores[0], indices[0]):
        doc = self.documents[idx].copy()
        doc['metadata'] = doc['metadata'].copy()
        doc['metadata']['score'] = float(score)
        doc['metadata']['relevance'] = 'high' if score > 0.8 else \
                                       'medium' if score > 0.6 else 'low'
        results.append(doc)

    return results
```
The search implementation includes several key steps:
- Query Processing:
  - Converts the query text to a vector using the same encoder
  - Normalizes the query vector for consistent similarity calculations
  - Reshapes the vector to match FAISS requirements
- Similarity Search:
  - Uses FAISS to find the k nearest neighbors
  - Returns both similarity scores and document indices
  - Performs an exact, exhaustive comparison; `IndexFlatIP` scans every vector, which is fast at this scale
- Result Processing:
  - Copies each document and its metadata so the stored originals aren't modified
  - Adds similarity scores to metadata
  - Categorizes relevance based on score thresholds
  - Returns a clean, structured result format
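Putting the appendix pieces together, a minimal end-to-end run of the basic store might look like the following; the document count and the question are just examples, and the first run will download the dataset slice and the embedding model.

```python
store = DocumentStore(num_docs=500, k=3)

results = store.search("What are the basic operations in arithmetic?")
for doc in results:
    meta = doc["metadata"]
    print(f"{meta['source']} (score={meta['score']:.3f}, relevance={meta['relevance']})")
    print(doc["text"][:120], "...")
```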
This foundation is flexible enough to support cross-encoder reranking or advanced filtering strategies. As your needs grow, you’ll find myriad ways to refine retrieval, relevance scoring, and chunk usage even further. Some potential improvements include:
- Adding chunk overlap to capture context at boundaries (see the sketch after this list)
- Implementing more sophisticated chunking strategies
- Adding filters based on metadata
- Incorporating hybrid search approaches
- Adding caching for frequently accessed documents
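As one example of these refinements, here is a sketch of a sliding-window variant of `_chunk_text` that repeats the last few sentences of each chunk at the start of the next one; the `overlap_sentences` parameter and its default are assumptions for illustration, not part of the implementation above.

```python
def _chunk_text_with_overlap(self, text: str, chunk_size: int,
                             overlap_sentences: int = 2) -> List[str]:
    """Sentence-aware chunking that carries the last few sentences
    of each chunk into the next one to preserve context at boundaries."""
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks, current, current_length = [], [], 0

    for sentence in sentences:
        sentence_length = len(sentence.split())
        if current_length + sentence_length > chunk_size and current:
            chunks.append(' '.join(current))
            # Start the next chunk with the tail of the previous one
            current = current[-overlap_sentences:] + [sentence]
            current_length = sum(len(s.split()) for s in current)
        else:
            current.append(sentence)
            current_length += sentence_length

    if current:
        chunks.append(' '.join(current))
    return chunks
```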