Imagine you’ve built a retrieval-augmented generation (RAG) system that does a decent job finding relevant texts, yet the final answers often remain disappointingly shallow. The system might pick up on a simple definition but ignore deeper nuances, skip over chunks that are clearly relevant, or fail to combine ideas from multiple sources. All that potential remains untapped.

Under the hood, three main deficiencies hold most RAG pipelines back:

  1. Importance Ranking: Traditional pipelines retrieve documents, but don’t weigh or highlight which pieces of information truly matter.
  2. Cross-Chunk Integration: Often, these systems overlook how chunks relate to each other, even though combining them yields richer answers.
  3. Relevance Signaling: The model rarely receives clear cues about which sections are absolutely crucial to the user’s query.

These three factors—ranking, integration, and signaling—determine whether the model’s final answer simply skims the surface or includes the full depth of information in your document repository.

The Gap: Chunk Attribution and Utilization

When building RAG applications, there’s a critical gap between retrieving relevant information and actually using it effectively. Even when your system successfully finds the right chunks, the final responses often fall short in three key ways:

  1. The model cherry-picks basic information while ignoring deeper insights
  2. Retrieved chunks with high relevance scores are completely ignored
  3. Related information across multiple chunks isn’t synthesized together

Let’s look at how these problems manifest through two key concepts:

Chunk Attribution measures how well the model acknowledges and uses information from multiple sources. Poor chunk attribution occurs when the model ignores relevant chunks or fails to integrate information from multiple sources, leading to incomplete analysis and missed insights.

Chunk Utilization focuses on how thoroughly the model uses the content within each chunk. Even when chunks are properly attributed, the model often uses only a fraction of the available information, resulting in superficial responses that miss important details and nuances.
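
Neither metric has a single canonical formula. As a rough intuition, the sketch below approximates both with simple word overlap between the response and each retrieved chunk; the function name and the attribution threshold are illustrative choices, not a standard definition.

from typing import Dict, List

def rough_chunk_metrics(response: str, chunks: List[str], attribution_threshold: float = 0.2) -> Dict:
    # Word-overlap heuristic: a chunk counts as "attributed" if enough of its
    # vocabulary shows up in the response; utilization is the average overlap.
    response_words = set(response.lower().split())
    utilizations = []
    for chunk in chunks:
        chunk_words = set(chunk.lower().split())
        utilizations.append(len(chunk_words & response_words) / max(len(chunk_words), 1))
    attributed = sum(1 for u in utilizations if u >= attribution_threshold)
    return {
        "chunk_attribution": f"{attributed}/{len(chunks)} chunks used",
        "chunk_utilization": sum(utilizations) / max(len(utilizations), 1),
    }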

Here’s an example that demonstrates these issues:

| Component | Content |
| --- | --- |
| User Query | “What are the fundamental concepts and operations in arithmetic, and how are they used in mathematics?” |
| Retrieved Chunk 1 | “In mathematics, arithmetic is the basic study of numbers. The four basic arithmetic operations are addition, subtraction, multiplication, and division, although other operations such as exponentiation and extraction of roots are also studied in arithmetic. Other arithmetic topics includes working with signed numbers, fractions, decimals and percentages.” |
| Retrieved Chunk 2 | “Mathematics includes the study of numbers, shapes and patterns. Structure: how things are organized. This subfield is usually called algebra. Place: where things are and their arrangement. This subfield is usually called geometry. Change: how things become different. This subfield is usually called analysis.” |
| Retrieved Chunk 3 | “Multiplication is an arithmetic operation for finding the product of two numbers. With natural numbers, multiplication gives the number of tiles in a rectangle. Multiplication between numbers is said to be commutative—when the order of the numbers does not influence the value of the product.” |
| Model Response | “Arithmetic is the basic study of numbers in mathematics. The main operations are addition, subtraction, multiplication, and division. These operations are used to work with numbers in mathematics.” |

The response uses only basic definitional information despite having access to rich details about properties, applications, and relationships. This happens because our basic implementation lacks the mechanisms needed to:

  • Identify which parts of each chunk are most relevant to the query
  • Understand how information across chunks relates to each other
  • Determine the relative importance of different pieces of information

Baseline Performance Metrics

When measuring this basic implementation against our key metrics, we see significant room for improvement:

  • Chunk Utilization: Only 38% of relevant information within retrieved chunks is actually used in the final response
  • Chunk Attribution: On average, only 2 out of 5 retrieved chunks are referenced or used in the response
  • Completeness: Measures whether the final answer covers all of the essential information available in the source material. In this case, only 73% of that information makes it into the final answer

These metrics highlight the gap between having access to relevant information and actually using it effectively in the response. Let’s look at how we can improve these numbers through enhanced implementation.

Improving the System: Enhanced Retrieval and Guidance

Let’s look at how we can improve each part of the system through three key enhancements:

1. Cross-Encoder Reranking for Better Document Selection

Vector similarity alone compares the query and the document as two independently computed embeddings. A cross-encoder, however, reads them side by side (e.g., as “query [SEP] document text”). This produces richer relevance scores, since the model can detect subtle connections, or mismatches, that separate embeddings miss.

def _rerank_documents(self, docs: List[Dict], query: str) -> List[Dict]:
    # Create pairs for the cross-encoder to evaluate
    pairs = [(query, doc['text']) for doc in docs]

    # Get relevance scores from the cross-encoder
    cross_scores = self.cross_encoder.predict(pairs)

    # Convert scores to 0-1 range
    min_score = min(cross_scores)
    max_score = max(cross_scores)
    score_range = max_score - min_score

    reranked_docs = []
    for doc, score in zip(docs, cross_scores):
        normalized_score = (score - min_score) / score_range if score_range > 0 else 0.5

        if normalized_score >= self.reranking_threshold:
            doc['metadata']['combined_score'] = normalized_score
            doc['metadata']['relevance'] = 'high' if normalized_score > 0.8 else \
                                         'medium' if normalized_score > 0.6 else 'low'
            reranked_docs.append(doc)

    # Sort and return top k documents
    reranked_docs.sort(key=lambda x: x['metadata']['combined_score'], reverse=True)
    return reranked_docs[:self.k]

The reranking code adds extra metadata, like combined_score and relevance, which become invaluable signals later on. This process:

  1. Takes the query and each document and creates pairs for evaluation
  2. Passes these pairs through a cross-encoder model that evaluates them together
  3. Normalizes the scores to be between 0 and 1 for easier comparison
  4. Filters out documents below our threshold
  5. Adds metadata that helps the model understand document importance
  6. Returns only the top k most relevant documents
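
Note that _rerank_documents assumes the store already exposes a cross_encoder attribute and a reranking_threshold; the basic DocumentStore in the appendix doesn’t create these, so its constructor would need to accept them (as the later query() example assumes). One plausible setup, using the sentence-transformers CrossEncoder class with a commonly used passage-ranking checkpoint, looks like this sketch:

from sentence_transformers import CrossEncoder

# Assumed reranker setup: a small cross-encoder fine-tuned for passage ranking.
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# The model scores each (query, passage) pair jointly rather than comparing
# two independently computed embeddings.
question = "What are the basic arithmetic operations?"
pairs = [
    (question, "The four basic arithmetic operations are addition, subtraction, "
               "multiplication, and division."),
    (question, "Geometry studies shapes, sizes, and the properties of space."),
]
print(cross_encoder.predict(pairs))  # the arithmetic passage should score higher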

2. Prompting for Synthesis and Better Information Integration

If you want the model to combine details from multiple chunks, you need to tell it how to do so:

SYSTEM_PROMPT = """You are a knowledgeable assistant tasked with providing comprehensive answers by analyzing and synthesizing information from multiple provided documents.

IMPORTANT INSTRUCTIONS:
1. Cross-reference information between documents - identify and connect complementary or contrasting information.
2. For each concept you discuss, cite the specific documents that support your explanation.
3. When multiple documents contain related information, explicitly connect their content.
4. If documents present different aspects of a concept, synthesize them into a cohesive explanation.
5. Pay attention to the relevance scores of documents.
6. Ensure you consider and utilize information from all relevant documents.
7. Structure your response to build from foundational concepts to more advanced applications."""

This system prompt:

  1. Sets clear expectations for how to handle multiple documents
  2. Requires explicit citation of sources
  3. Emphasizes the importance of connecting related information
  4. Guides the model to consider document relevance scores
  5. Provides a structure for building comprehensive responses

By laying out these rules, you remind the model that it isn’t enough to restate each chunk separately. It should blend details, note relevant sources, and produce a single, well-structured explanation. With explicit instructions to reference scores or cross-document relationships, your pipeline gains contextual awareness.

3. Enriched Formatting for Clear Information Structure

When you format each chunk, include metadata that matters:

def format_documents_enhanced(documents: list) -> str:
    formatted = []
    for i, doc in enumerate(documents):
        meta = doc['metadata']
        # Prefer the reranker's combined_score, fall back to the raw retrieval score
        score = meta.get('combined_score', meta.get('score'))
        score_str = f"{score:.3f}" if isinstance(score, (int, float)) else "N/A"
        formatted.append(
            f"Document {i+1} (Source: {meta['source']}, "
            f"Relevance: {meta['relevance']}, "
            f"Score: {score_str}):\n{doc['text']}"
        )
    return "\n\n".join(formatted)

This enhanced formatting:

  1. Numbers each document for easy reference
  2. Shows the source of the information
  3. Includes the relevance category (high/medium/low)
  4. Displays the numerical relevance score
  5. Formats everything consistently for easier processing
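
For instance, feeding a single hand-constructed document (with illustrative values) through the function above renders like this:

sample_docs = [{
    "text": "Multiplication is an arithmetic operation for finding the product of two numbers.",
    "metadata": {"source": "Multiplication", "relevance": "high", "combined_score": 0.912}
}]
print(format_documents_enhanced(sample_docs))
# Document 1 (Source: Multiplication, Relevance: high, Score: 0.912):
# Multiplication is an arithmetic operation for finding the product of two numbers.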

By embedding the relevance tag and final numeric score, you give the language model reasons to say: “Document 1 is obviously the most relevant, so I should prioritize these details.” This extra nudge prevents the model from ignoring vital chunks.

Putting It All Together

Here’s how these components work together in the final implementation:

import os
import openai

def query(question: str):
    client = openai.OpenAI(
        api_key=os.getenv("OPENAI_API_KEY"),
        base_url=os.getenv("OPENAI_BASE_URL")
    )

    # Get and rerank documents
    docs = DocumentStore(
        num_docs=500,
        k=5,
        use_reranking=True,
        reranking_threshold=0.6
    ).search(question)

    # Format with enhanced metadata
    prompt = Prompts.ENHANCED.format(
        query=question,
        documents=format_documents_enhanced(docs)
    )

    # Generate guided response
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt}
        ]
    )

    return response.choices[0].message.content.strip()
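
The Prompts.ENHANCED template isn’t shown in this walkthrough; a minimal version consistent with the format(query=..., documents=...) call above might look like the following sketch (our assumption, not the exact template):

class Prompts:
    # Hypothetical user-prompt template matching the Prompts.ENHANCED usage above
    ENHANCED = """Answer the question using the documents below.
Synthesize across documents, cite them by number (e.g., "Document 2"),
and give more weight to documents with higher relevance scores.

Documents:
{documents}

Question: {query}

Answer:"""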

The complete workflow:

  1. When a question comes in, we first retrieve potentially relevant documents using vector search
  2. These documents go through the reranking process to evaluate their actual relevance to the query
  3. The documents are formatted with clear structure and metadata highlighting their importance
  4. The enhanced system prompt guides the model in synthesizing information across documents
  5. The model generates a response that incorporates information from all relevant sources

Performance Gains: Putting Theory into Practice

The enhanced implementation delivers significant improvements across key metrics:

| Metric | Basic Implementation | Enhanced Implementation |
| --- | --- | --- |
| Chunk Utilization | 38% of relevant content used | ~100% of relevant content used |
| Chunk Attribution | 2 out of 5 chunks referenced | All relevant chunks referenced |
| Completeness | 73% of essential information | 100% of essential information |

These improvements transform our RAG system from a basic fact-finder into a true knowledge synthesizer. Instead of just retrieving information, the system now:

  • Prioritizes the most relevant documents
  • Guides the model to use more of the available information
  • Encourages connections between related concepts
  • Produces more comprehensive and accurate responses

Summary

Building an effective RAG system doesn’t stop at retrieving the right documents. To unlock more nuanced answers, you must ensure your system actively uses every piece of relevant text. Adding a cross-encoder for reranking, improving prompt design for synthesis, and clearly formatting metadata are simple yet powerful steps toward maximizing chunk utilization.

By addressing the three core deficiencies—importance ranking, cross-chunk integration, and relevance signaling—you can dramatically improve your RAG system’s performance without changing the underlying retrieval mechanism or language model.

Appendix: Building a Basic Document Store

If you’re building this solution from scratch, you’ll need to start with a basic document store implementation. This section walks you through creating the foundation that we’ll improve upon.

Below is a sample implementation of how you might load a dataset, chunk it, build a FAISS index, and perform a simple k-nearest-neighbors search for relevant chunks. Let’s break down each component:

Document Store Initialization

The DocumentStore class serves as our foundation for managing and retrieving documents:

from sentence_transformers import SentenceTransformer
import faiss
from datasets import load_dataset
import numpy as np
from typing import List, Dict
import re

class DocumentStore:
    def __init__(self, num_docs=1000, k=3, chunk_size=512):
        # Load Wikipedia dataset with configurable parameters
        print(f"Loading {num_docs} Wikipedia articles...")
        dataset = load_dataset("wikipedia", "20220301.simple",
                             split=f"train[:{num_docs}]",
                             trust_remote_code=True)

        # Process and store documents with chunking
        self.documents = []
        for item in dataset:
            # Skip empty or very short articles
            if not item["text"] or len(item["text"]) < 100:
                continue

            # Split text into chunks
            chunks = self._chunk_text(item["text"], chunk_size)

            for i, chunk in enumerate(chunks):
                self.documents.append({
                    "text": chunk,
                    "metadata": {
                        "source": item["title"],
                        "chunk_id": i,
                        "total_chunks": len(chunks),
                        "relevance": "medium",
                        "length": len(chunk),
                        "url": item.get("url", ""),
                        "original_length": len(item["text"])
                    }
                })

        print(f"Processed {len(self.documents)} chunks from valid documents")

        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.k = k
        self._build_index()

This initialization code handles several important tasks:

  1. Dataset Loading: We use the HuggingFace datasets library to load Wikipedia articles. The num_docs parameter controls how many articles to load, making it easy to start small for testing.

  2. Document Processing:

    • Each article is split into chunks using the _chunk_text method
    • Empty or very short articles (< 100 characters) are skipped to maintain quality
    • Each chunk gets comprehensive metadata including its source, position in the original document, and length
  3. Embedding Setup:

    • We use the SentenceTransformer model ‘all-MiniLM-L6-v2’, which provides a good balance of speed and quality
    • The k parameter determines how many similar documents to retrieve for each query

Text Chunking Implementation

The chunking strategy is crucial for document retrieval. Our implementation uses a sentence-aware approach to maintain context:

def _chunk_text(self, text: str, chunk_size: int) -> List[str]:
    """
    Split text into chunks of approximately chunk_size tokens.
    Uses sentence boundaries where possible to maintain context.
    """
    # First split into sentences
    sentences = re.split(r'(?<=[.!?])\s+', text)

    chunks = []
    current_chunk = []
    current_length = 0

    for sentence in sentences:
        # Rough token estimate: count whitespace-separated words
        sentence_length = len(sentence.split())

        if current_length + sentence_length > chunk_size and current_chunk:
            # If adding this sentence would exceed chunk_size, save current chunk
            chunks.append(' '.join(current_chunk))
            current_chunk = [sentence]
            current_length = sentence_length
        else:
            # Add sentence to current chunk
            current_chunk.append(sentence)
            current_length += sentence_length

    # Add the last chunk if it exists
    if current_chunk:
        chunks.append(' '.join(current_chunk))

    return chunks

This chunking implementation has several key features:

  1. Sentence-Aware Splitting:

    • Uses regex (?<=[.!?])\s+ to split on sentence boundaries
    • Preserves sentence integrity instead of breaking mid-sentence
    • Maintains natural language flow and context
  2. Dynamic Chunk Size:

    • Tracks chunk size using word count as a proxy for tokens
    • Tries to keep chunks close to the target size while respecting sentence boundaries
    • Prevents chunks from growing too large while keeping related content together
  3. Clean Handling:

    • Joins sentences with proper spacing
    • Handles the last chunk properly to avoid losing content
    • Maintains document structure for better retrieval
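
Since _chunk_text never touches instance state, you can exercise it without loading the dataset. A quick illustrative check with a deliberately small chunk_size:

# Call the unbound method directly; it ignores `self`, so passing None works here.
text = ("Arithmetic is the study of numbers. The four basic operations are "
        "addition, subtraction, multiplication, and division. Multiplication "
        "is commutative. Division by zero is undefined.")
for i, chunk in enumerate(DocumentStore._chunk_text(None, text, chunk_size=12)):
    print(i, chunk)
# 0 Arithmetic is the study of numbers.
# 1 The four basic operations are addition, subtraction, multiplication, and division.
# 2 Multiplication is commutative. Division by zero is undefined.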

Building the Vector Index

For efficient similarity search, we build a FAISS index. FAISS is particularly good at handling large-scale similarity search:

def _build_index(self):
    print("Building FAISS index...")
    texts = [doc["text"] for doc in self.documents]
    self.embeddings = self.encoder.encode(texts)
    dimension = self.embeddings.shape[1]

    # Use L2 normalization for better similarity search
    faiss.normalize_L2(self.embeddings)

    # Create an index that's optimized for cosine similarity
    self.index = faiss.IndexFlatIP(dimension)
    self.index.add(self.embeddings.astype('float32'))
    print("Index built successfully")

The indexing process involves several important steps:

  1. Embedding Generation:

    • Uses SentenceTransformer to create dense vector representations
    • Processes all documents in batch for efficiency
    • Creates fixed-dimension embeddings for each chunk
  2. Vector Normalization:

    • Applies L2 normalization to standardize vector lengths
    • Ensures cosine similarity calculations are accurate
    • Improves search quality by making magnitudes comparable
  3. FAISS Index Creation:

    • Uses IndexFlatIP for exact inner product calculations
    • Optimized for cosine similarity between normalized vectors
    • Enables fast nearest neighbor search at scale

Basic Search Implementation

The search functionality ties everything together, enabling efficient retrieval of relevant chunks:

def search(self, query: str) -> list:
    # Encode and normalize the query
    query_vector = self.encoder.encode([query])[0]
    query_vector = query_vector / np.linalg.norm(query_vector)
    query_vector = query_vector.reshape(1, -1)

    # Search for similar documents
    scores, indices = self.index.search(query_vector.astype('float32'), self.k)

    # Process results
    results = []
    for score, idx in zip(scores[0], indices[0]):
        doc = self.documents[idx].copy()
        doc['metadata'] = doc['metadata'].copy()
        doc['metadata']['score'] = float(score)
        doc['metadata']['relevance'] = 'high' if score > 0.8 else \
                                     'medium' if score > 0.6 else 'low'
        results.append(doc)

    return results

The search implementation includes several sophisticated features:

  1. Query Processing:

    • Converts the query text to a vector using the same encoder
    • Normalizes the query vector for consistent similarity calculations
    • Reshapes the vector to match FAISS requirements
  2. Similarity Search:

    • Uses FAISS to find the k nearest neighbors
    • Returns both similarity scores and document indices
    • Performs an exact, brute-force scan over the index; FAISS’s optimized implementation keeps this fast at this scale
  3. Result Processing:

    • Copies each document and its metadata so the stored originals aren’t modified
    • Adds similarity scores to metadata
    • Categorizes relevance based on score thresholds
    • Returns a clean, structured result format
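
Wiring these pieces together, basic usage of the store looks like this (the Wikipedia download and embedding step run on first use):

# Minimal end-to-end usage of the basic document store defined above.
store = DocumentStore(num_docs=200, k=3)
results = store.search("What are the basic arithmetic operations?")
for doc in results:
    meta = doc["metadata"]
    print(f"{meta['source']} (score={meta['score']:.3f}, relevance={meta['relevance']})")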

This foundation is flexible enough to support cross-encoder reranking or advanced filtering strategies. As your needs grow, you’ll find myriad ways to refine retrieval, relevance scoring, and chunk usage even further. Some potential improvements include:

  • Adding chunk overlap to capture context at boundaries (a minimal sketch follows this list)
  • Implementing more sophisticated chunking strategies
  • Adding filters based on metadata
  • Incorporating hybrid search approaches
  • Adding caching for frequently accessed documents
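
As a concrete example of the first item, here is a minimal sketch of sentence-level overlap that reuses the same sentence-splitting approach as _chunk_text; the overlap_sentences parameter is an illustrative knob, not part of the implementation above.

import re
from typing import List

def chunk_with_overlap(text: str, chunk_size: int = 512, overlap_sentences: int = 1) -> List[str]:
    """Sentence-aware chunking that repeats the last few sentences of each
    chunk at the start of the next one, so boundary context isn't lost."""
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks, current, current_len = [], [], 0

    for sentence in sentences:
        length = len(sentence.split())
        if current and current_len + length > chunk_size:
            chunks.append(' '.join(current))
            # Carry the last `overlap_sentences` sentences into the next chunk
            current = current[-overlap_sentences:] if overlap_sentences else []
            current_len = sum(len(s.split()) for s in current)
        current.append(sentence)
        current_len += length

    if current:
        chunks.append(' '.join(current))
    return chunks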