Completeness in RAG Systems
Learn how to ensure that your RAG systems provide complete answers using the Galileo completeness metric.
This guide explains the challenge of ensuring answer completeness in Retrieval-Augmented Generation (RAG) systems, using the Galileo completeness metric to measure success. We’ll compare a basic implementation with an enhanced version to demonstrate how different approaches affect answer completeness.
The Completeness Challenge
Answer completeness refers to how thoroughly and comprehensively a RAG system answers a given question. The Galileo completeness metric evaluates this by using the LLM as a judge to compare the response against its own knowledge of the topic. This approach has important implications:
- The metric can identify when a response misses information that is part of open domain knowledge
- The metric cannot identify gaps in information that is not part of open domain knowledge
- The evaluation helps ensure responses are complete relative to what is generally known about a topic
A complete answer should:
- Cover all relevant aspects of the question
- Include all significant details from the source documents
- Synthesize information from multiple sources when relevant
- Provide proper context and background information
- Not miss any important information that is part of open domain knowledge
The Galileo completeness metric evaluates these aspects by analyzing how well the answer covers all relevant information that is part of open domain knowledge, using the LLM as a judge to identify any gaps or omissions in the response.
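To illustrate the LLM-as-judge idea in general terms, the sketch below asks an OpenAI chat model to flag missing open-domain facts and assign a completeness score. This is not the Galileo metric itself; the judge prompt, model name, and scoring scale are illustrative assumptions.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are evaluating the completeness of an answer.
Question: {question}
Answer: {answer}

Using your own knowledge of the topic, list any significant facts that are
part of open domain knowledge but missing from the answer, then give a
completeness score between 0 and 1."""


def judge_completeness(question: str, answer: str, model: str = "gpt-4o") -> str:
    """Ask an LLM judge to flag open-domain knowledge missing from the answer."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,
    )
    return response.choices[0].message.content
```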
Basic Implementation
The basic implementation (ensure-completeness-basic.py) demonstrates several limitations that lead to incomplete answers:
Limited Document Retrieval
The basic implementation suffers from several critical limitations in document retrieval. It’s designed to return only a single document, which significantly limits the breadth of information available for answering questions. The search mechanism relies on basic L2 distance for similarity matching, which is less effective for semantic search compared to more sophisticated approaches. Additionally, the system lacks any form of reranking or relevance scoring, meaning it can’t refine its results based on deeper semantic understanding. The chunk size is also quite large at 1000 characters, which can lead to less precise and potentially less relevant information being included in each chunk.
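To make these limitations concrete, here is a minimal sketch of single-document retrieval over raw L2 distance. It assumes FAISS and sentence-transformers with the all-MiniLM-L6-v2 model; the actual ensure-completeness-basic.py script may differ in its specifics.

```python
import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model


def build_basic_index(chunks: list[str]) -> faiss.IndexFlatL2:
    """Index chunks with raw L2 distance, the basic implementation's similarity measure."""
    embeddings = embedder.encode(chunks).astype("float32")
    index = faiss.IndexFlatL2(embeddings.shape[1])
    index.add(embeddings)
    return index


def basic_retrieve(query: str, index: faiss.IndexFlatL2, chunks: list[str]) -> list[str]:
    """Return only the single nearest chunk, so other relevant documents are never seen."""
    query_emb = embedder.encode([query]).astype("float32")
    _, ids = index.search(query_emb, 1)  # k=1: one document only, no reranking
    return [chunks[i] for i in ids[0]]
```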
Simple Document Processing
The document processing in the basic implementation is quite rudimentary. It uses a simple paragraph-based splitting approach followed by basic sentence-based chunking. This method doesn’t preserve context effectively, as it doesn’t maintain relationships between chunks or consider the semantic coherence of the text. The metadata stored with each chunk is minimal, containing only basic information like source and chunk ID, without any sophisticated enrichment or relationship tracking. The chunking strategy is also fixed, applying the same approach regardless of the content type or structure.
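A rough sketch of this kind of naive chunking is shown below; the exact splitting rules and metadata fields in ensure-completeness-basic.py may differ.

```python
def basic_chunk(document: str, source: str, max_chars: int = 1000) -> list[dict]:
    """Naive chunking: split on paragraphs, then pack sentences up to ~1000 characters.

    There is no overlap, no semantic grouping, and only minimal metadata per chunk.
    """
    chunks, buffer = [], ""
    for paragraph in document.split("\n\n"):
        for sentence in paragraph.split(". "):
            if buffer and len(buffer) + len(sentence) > max_chars:
                chunks.append({"text": buffer.strip(), "source": source, "chunk_id": len(chunks)})
                buffer = ""
            buffer += sentence + ". "
    if buffer.strip():
        chunks.append({"text": buffer.strip(), "source": source, "chunk_id": len(chunks)})
    return chunks
```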
Basic Prompting
The prompting strategy in the basic implementation is minimal and lacks specific guidance for ensuring complete answers. The system prompt is very simple, providing no explicit instructions about completeness or how to handle multiple sources of information. There are no requirements for document synthesis or citation, which can lead to answers that don’t fully utilize the available information or properly attribute their sources.
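A prompt in this style might look like the sketch below; the wording is illustrative rather than the script's actual prompt.

```python
BASIC_SYSTEM_PROMPT = "You are a helpful assistant. Answer the question using the provided context."


def basic_messages(query: str, context: str) -> list[dict]:
    """No instructions about completeness, synthesis, or citation."""
    return [
        {"role": "system", "content": BASIC_SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ]
```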
These limitations in the basic implementation often result in incomplete answers that miss key information from other relevant documents, lose context due to poor chunking, fail to synthesize information across different sources, and omit important details. This typically leads to lower completeness scores.
Improved Implementation
The improved implementation (ensure-completeness-enhanced.py) addresses these limitations through several significant improvements:
Improved Document Retrieval
The improved implementation strengthens document retrieval by returning multiple relevant documents (10 or more) instead of just one. It uses cosine similarity for semantic matching, which produces better results for finding conceptually related content. The system implements reranking with a threshold of 0.6, allowing it to refine and prioritize the most relevant results. The chunk size is also reduced to 512 characters, making each chunk more precise and focused.
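A minimal sketch of this retrieval step, again assuming FAISS and sentence-transformers, normalizes embeddings so that inner product equals cosine similarity. The model name and helper names are assumptions rather than the actual code in ensure-completeness-enhanced.py.

```python
import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model


def build_cosine_index(chunks: list[str]) -> faiss.IndexFlatIP:
    """Normalize embeddings so inner product search ranks by cosine similarity."""
    embeddings = embedder.encode(chunks, normalize_embeddings=True).astype("float32")
    index = faiss.IndexFlatIP(embeddings.shape[1])
    index.add(embeddings)
    return index


def improved_retrieve(query: str, index: faiss.IndexFlatIP, chunks: list[str],
                      k: int = 10) -> list[tuple[str, float]]:
    """Return the top-k chunks (10 by default) with their cosine similarity scores."""
    query_emb = embedder.encode([query], normalize_embeddings=True).astype("float32")
    scores, ids = index.search(query_emb, k)
    return [(chunks[i], float(s)) for i, s in zip(ids[0], scores[0])]
```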
Sophisticated Document Processing
The document processing in the improved implementation is much more sophisticated. It uses context-aware chunking that preserves sentence relationships and maintains semantic coherence. Each chunk is enriched with rich metadata including relevance scores, source tracking, and detailed formatting with relevance indicators. The system also implements dynamic chunking that adapts to the content structure, ensuring better context preservation.
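One way to sketch context-aware chunking with sentence overlap and richer metadata is shown below; the 512-character limit comes from the text above, while the overlap size and metadata fields are illustrative choices rather than the enhanced script's exact behavior.

```python
def context_aware_chunk(document: str, source: str,
                        max_chars: int = 512, overlap_sentences: int = 1) -> list[dict]:
    """Build smaller 512-character chunks that carry trailing sentences forward as overlap,
    and attach metadata that preserves ordering relationships between chunks."""
    sentences = [s.strip() for s in document.replace("\n", " ").split(". ") if s.strip()]
    groups, current = [], []
    for sentence in sentences:
        if current and sum(len(s) for s in current) + len(sentence) > max_chars:
            groups.append(current)
            current = current[-overlap_sentences:]  # carry context into the next chunk
        current.append(sentence)
    if current:
        groups.append(current)
    return [
        {
            "text": ". ".join(group) + ".",
            "source": source,
            "chunk_id": i,
            "prev_chunk_id": i - 1 if i > 0 else None,  # keep chunk ordering relationships
        }
        for i, group in enumerate(groups)
    ]
```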
Improved Prompting
The improved implementation uses a comprehensive system prompt that provides clear guidance for ensuring complete answers. It explicitly instructs the model to use all relevant information from the retrieved documents, requires proper citation of sources, and encourages synthesis of information across multiple documents. The prompt also specifies thoroughness requirements and provides a clear role for the model as a knowledgeable science historian.
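A system prompt along these lines might look like the following sketch; the exact wording and document formatting in ensure-completeness-enhanced.py may differ.

```python
IMPROVED_SYSTEM_PROMPT = """You are a knowledgeable science historian.
Answer using ALL relevant information from the retrieved documents below.
- Cite the source document for every claim, e.g. [doc 2].
- Synthesize information that is spread across multiple documents.
- Include significant details, dates, names, and context; do not summarize away specifics.
- Prefer the retrieved documents over your general knowledge when they conflict."""


def improved_messages(query: str, docs: list[dict]) -> list[dict]:
    """Format each document with its source and relevance score so citation is easy."""
    context = "\n\n".join(
        f"[doc {i}] (source: {d['source']}, relevance: {d['score']:.2f})\n{d['text']}"
        for i, d in enumerate(docs, start=1)
    )
    return [
        {"role": "system", "content": IMPROVED_SYSTEM_PROMPT},
        {"role": "user", "content": f"Documents:\n{context}\n\nQuestion: {query}"},
    ]
```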
Together, these improvements lead to more comprehensive answers that better incorporate multiple sources, preserve context through more effective chunking, synthesize information across documents, and include all significant details. This typically results in higher Galileo completeness scores.
Key Differences in Practice
When comparing the two implementations:
Document Selection
The basic implementation’s search functionality is quite limited, returning only a single document which can lead to missing key information. In contrast, the improved implementation uses a more sophisticated approach that retrieves multiple documents and implements reranking to ensure comprehensive coverage of the topic. The enhanced version also uses a multiplier for initial results when reranking is enabled, allowing for better selection of the most relevant content.
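Building on the improved_retrieve sketch above, an over-fetch-then-rerank step could look like the following. The cross-encoder model and the sigmoid used to map its logits into [0, 1] (so the 0.6 threshold applies) are assumptions, not details taken from the enhanced script.

```python
import numpy as np
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed reranking model


def retrieve_with_rerank(query: str, index, chunks: list[str],
                         k: int = 10, multiplier: int = 3,
                         threshold: float = 0.6) -> list[tuple[str, float]]:
    """Over-fetch k * multiplier candidates, rescore them with a cross-encoder,
    drop anything below the relevance threshold, and keep the top k."""
    candidates = improved_retrieve(query, index, chunks, k=k * multiplier)
    logits = reranker.predict([(query, text) for text, _ in candidates])
    probs = 1 / (1 + np.exp(-logits))  # map logits to [0, 1] so the 0.6 threshold applies
    reranked = sorted(zip((text for text, _ in candidates), probs),
                      key=lambda pair: pair[1], reverse=True)
    return [(text, float(score)) for text, score in reranked if score >= threshold][:k]
```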
Information Synthesis
The basic implementation’s prompting strategy is minimal and doesn’t encourage synthesis of information across multiple sources. The improved implementation, however, uses a more sophisticated prompt that explicitly requires the model to synthesize information from multiple documents and provide comprehensive answers. This leads to more complete and well-rounded responses that draw from all available relevant information.
Context Preservation
The basic implementation uses a simple paragraph-based chunking approach that can lead to loss of context, especially with larger chunks. The improved implementation, on the other hand, uses a more sophisticated context-aware chunking strategy that maintains sentence relationships and semantic coherence. This results in better preservation of context and more coherent information retrieval.
Answer Quality
The basic implementation’s prompting strategy often results in answers that miss key details and lack proper context. The improved implementation, with its comprehensive prompting strategy, produces answers that are more thorough, properly cited, and effectively synthesize information from multiple sources. This leads to higher quality, more complete answers that better serve the user’s needs.
Measuring Success with Galileo
The Galileo completeness metric provides a quantitative way to evaluate the effectiveness of RAG systems by assessing several key aspects:
- Coverage of relevant information: how well the system incorporates all necessary details from the source documents
- Synthesis of multiple sources: whether information from different documents is properly combined and presented coherently
- Citation of sources: whether all information is properly attributed
- Context preservation: whether the meaning and relationships between pieces of information are maintained
- Overall thoroughness of the answer
The improved implementation consistently achieves higher completeness scores by addressing the limitations of the basic approach through better document retrieval, processing, and prompting strategies.
Practical Example: Penicillin Discovery
To illustrate the difference between the basic and improved implementations, let’s examine how they handle a query about the discovery of penicillin. This example is particularly interesting because penicillin’s discovery is well-known open domain knowledge that the LLM already has access to. The completeness metric evaluates how well the system uses the retrieved documents compared to this baseline knowledge.
Basic Implementation (75% Completeness)
Improved Implementation (100% Completeness)
This example clearly demonstrates how the improved implementation achieves higher completeness by:
- Retrieving and utilizing multiple relevant documents
- Properly attributing information to specific sources
- Synthesizing information across documents
- Including all significant details and context
- Presenting information in a coherent narrative
- Prioritizing retrieved document content over general knowledge
The difference in completeness scores (75% vs 100%) reflects the improved implementation’s ability to provide more thorough, well-sourced, and contextually rich answers that effectively utilize the provided documents rather than falling back on general knowledge.
Best Practices for Ensuring Completeness
Document Retrieval
Effective document retrieval is crucial for ensuring complete answers. The improved approach retrieves multiple relevant documents (typically 10 or more) to ensure comprehensive coverage of the topic. It uses semantic similarity search to find conceptually related content, rather than just keyword matches. The implementation of reranking with a threshold of 0.6 helps refine and prioritize the most relevant results. The system also uses appropriate chunk sizes to ensure that the retrieved information is precise and focused.
Document Processing
Sophisticated document processing is essential for maintaining context and relationships between pieces of information. The improved approach uses context-aware chunking that preserves sentence relationships and semantic coherence. Each chunk is enriched with relevant metadata, including source tracking and relevance indicators. The system also implements dynamic chunking that adapts to the content structure, ensuring better context preservation and more coherent information retrieval.
Prompting
Effective prompting is crucial for guiding the model to produce complete and well-structured answers. The improved approach uses a comprehensive system prompt that provides clear instructions for ensuring completeness. It requires proper citation of sources and encourages synthesis of information across multiple documents. The prompt also specifies thoroughness requirements and provides a clear role for the model, helping it understand the expectations for the quality and completeness of its responses.
By implementing these best practices, RAG systems can achieve higher completeness scores and provide more comprehensive, accurate answers to user queries.