The completeness challenge
Answer completeness refers to how thoroughly and comprehensively a RAG system answers a given question. The Galileo completeness metric evaluates this by using the LLM as a judge to compare the response against its own knowledge of the topic. This approach has important implications:

- The metric can identify when a response misses information that is part of open domain knowledge
- The metric cannot identify gaps in information that is not part of open domain knowledge
- The evaluation helps ensure responses are complete relative to what is generally known about a topic
In practice, this means a complete answer should:

- Cover all relevant aspects of the question
- Include all significant details from the source documents
- Synthesize information from multiple sources when relevant
- Provide proper context and background information
- Not miss any important information that is part of open domain knowledge
Basic implementation
The basic implementation (ensure-completeness-basic.py) demonstrates several limitations that lead to incomplete answers:
1. Limited Document Retrieval: only a small number of documents (often just the top match) is retrieved, so relevant information in other documents never reaches the model.
2. Simple Document Processing: documents are truncated or used as-is, which can cut off significant details and strip surrounding context.
3. Basic Prompting: the prompt asks for an answer without instructing the model to be thorough, synthesize sources, or cite them (see the sketch after this list).
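To make these limitations concrete, here is a minimal sketch of the basic pattern. It is not the actual ensure-completeness-basic.py; the `retriever` object and its `search` method are hypothetical stand-ins for whatever vector store the script uses:

```python
# Illustrative sketch only: the retriever and its .search() method
# are hypothetical, and this is not the actual ensure-completeness-basic.py.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def basic_rag_answer(question: str, retriever) -> str:
    # Limitation 1 (limited document retrieval): only the single
    # top-scoring document is fetched, so details spread across
    # other documents never reach the model.
    docs = retriever.search(question, k=1)

    # Limitation 2 (simple document processing): naive truncation
    # can cut off relevant details mid-document.
    context = docs[0].text[:2000]

    # Limitation 3 (basic prompting): the prompt asks for an answer
    # but never instructs the model to be thorough or cite sources.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Context: {context}\n\nQuestion: {question}",
        }],
    )
    return response.choices[0].message.content
```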
Improved implementation
The improved implementation (ensure-completeness-enhanced.py) addresses these limitations through several significant improvements:
1. Improved Document Retrieval: multiple relevant documents are retrieved, so complementary details from different sources are all available to the model.
2. Sophisticated Document Processing: passages are kept intact and labeled by source, preserving context instead of truncating it.
3. Improved Prompting: the prompt explicitly instructs the model to cover all aspects of the question, synthesize across sources, and attribute information (see the sketch after this list).
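Again as an illustration rather than the actual ensure-completeness-enhanced.py, the same pipeline with these three improvements might look like the following (the `retriever` API remains a hypothetical assumption):

```python
# Illustrative sketch only; not the actual ensure-completeness-enhanced.py.
from openai import OpenAI

client = OpenAI()

COMPLETE_ANSWER_PROMPT = """Answer using ONLY the sources below.
Cover every aspect of the question, synthesize information across
sources, cite each source you use as [Source N], and include any
relevant background or context the sources provide.

{sources}

Question: {question}"""

def enhanced_rag_answer(question: str, retriever) -> str:
    # Improvement 1 (improved document retrieval): fetch several
    # documents so complementary details are all available.
    docs = retriever.search(question, k=5)

    # Improvement 2 (sophisticated document processing): keep whole,
    # labeled passages so source boundaries and context are preserved
    # and the model can attribute what it uses.
    sources = "\n\n".join(
        f"[Source {i + 1}] {doc.text}" for i, doc in enumerate(docs)
    )

    # Improvement 3 (improved prompting): ask explicitly for coverage,
    # synthesis, and attribution rather than a bare answer.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": COMPLETE_ANSWER_PROMPT.format(
                sources=sources, question=question
            ),
        }],
    )
    return response.choices[0].message.content
```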
Key differences in practice
When comparing the two implementations:

1. Document Selection: the improved version draws on several relevant documents rather than a single top match.
2. Information Synthesis: related details from different sources are combined into one coherent answer instead of being reported in isolation.
3. Context Preservation: intact, source-labeled passages keep the meaning and relationships between pieces of information.
4. Answer Quality: answers are more thorough, better attributed, and score higher on the completeness metric.
Measuring success with Galileo
The Galileo completeness metric provides a quantitative way to evaluate the effectiveness of RAG systems across several key aspects. It measures coverage of relevant information, evaluating how well the system incorporates the necessary details from the source documents. It assesses the synthesis of multiple sources, ensuring that information from different documents is combined and presented coherently, and it checks citation of sources, verifying that information is properly attributed. It also considers context preservation, ensuring that the meaning and relationships between pieces of information are maintained, and finally provides an overall assessment of answer thoroughness. The improved implementation consistently achieves higher completeness scores by addressing the limitations of the basic approach through better document retrieval, processing, and prompting strategies.
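Galileo computes the completeness score for you. As a rough, hand-rolled illustration of the LLM-as-judge idea behind it, a check might look like the sketch below; the judge prompt and the 0 to 100 scale are assumptions for demonstration, not Galileo's actual implementation:

```python
# Rough illustration of the LLM-as-judge idea behind a completeness
# metric. This is NOT Galileo's implementation; the judge prompt and
# the 0-100 scale are assumptions for demonstration only.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading the completeness of an answer.

Question: {question}
Answer: {answer}

Using your own knowledge of the topic, list any significant points
the answer misses, then output a completeness score from 0 to 100
on the final line as: SCORE: <number>"""

def judge_completeness(question: str, answer: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
    )
    # Parse the score from the final "SCORE: <number>" line.
    last_line = response.choices[0].message.content.strip().splitlines()[-1]
    return int(last_line.split(":")[1].strip())
```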
Practical example: penicillin discovery

To illustrate the difference between the basic and improved implementations, let’s examine how they handle a query about the discovery of penicillin. This example is particularly interesting because penicillin’s discovery is well-known open domain knowledge that the LLM already has access to; the completeness metric therefore evaluates how well the system uses the retrieved documents compared to this baseline knowledge.
1. Basic Implementation (75% Completeness): the response misses some of the details available in the retrieved documents, as the lower score reflects.
2. Improved Implementation (100% Completeness): the response reaches full completeness by doing the following (a brief comparison snippet follows this list):
- Retrieving and utilizing multiple relevant documents
- Properly attributing information to specific sources
- Synthesizing information across documents
- Including all significant details and context
- Presenting information in a coherent narrative
- Prioritizing retrieved document content over general knowledge
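Tying the sketches together, a hypothetical comparison run on this query (reusing the illustrative `retriever` and the two answer functions from the sketches above) might look like:

```python
# Hypothetical comparison run; `retriever`, basic_rag_answer, and
# enhanced_rag_answer follow the illustrative sketches above.
question = "Who discovered penicillin, and how was the discovery made?"

print(basic_rag_answer(question, retriever))     # tends to miss details
print(enhanced_rag_answer(question, retriever))  # multi-source, cited answer
```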
Best practices for ensuring completeness
1. Document Retrieval: retrieve enough relevant documents to cover every aspect of the question, not just the single best match.
2. Document Processing: preserve context, for example by chunking documents with overlap rather than truncating them (see the sketch after this list).
3. Prompting: explicitly instruct the model to be thorough, synthesize across sources, and cite the documents it uses.
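As one concrete example of the document-processing practice, overlapping chunking keeps facts that straddle a chunk boundary intact in at least one chunk. This is a minimal sketch; the sizes are assumptions to tune for your corpus:

```python
# Minimal sketch of overlapping chunking; chunk_size and overlap
# are illustrative defaults, not values from the example scripts.
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks so facts that straddle a
    chunk boundary appear intact in at least one chunk."""
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]
```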