Setup
Consider the following model response, produced by a prompt asking whether tigers are better than sharks:

"Tigers are better because they are apex predators with a versatile skill set, capable of thriving in diverse terrestrial environments, whereas sharks are specialized for aquatic life. Tigers have complex social behaviors and are more adaptable to a range of habitats compared to sharks."
What went wrong?
Let’s look at the explanation:

The response has a 66.67% chance of being factually correct. The response makes several claims regarding why tigers are better than sharks:
- Tigers are apex predators: This claim is factually correct as tigers are considered apex predators in their terrestrial habitats, primarily preying on larger herbivores.
- Tigers have a versatile skill set and can thrive in diverse terrestrial environments: This claim is true as tigers can adapt to various habitats, including forests, grasslands, and swamps, showcasing their versatility in hunting and survival skills.
- Sharks are specialized for aquatic life: This claim is also correct. Sharks have evolved as top predators in marine environments, with physical adaptations such as streamlined bodies and gills that make them highly effective in water.
- Tigers have complex social behaviors and are more adaptable to a range of habitats than sharks: While tigers display some social interaction, such as mother-cub relationships, they are generally solitary animals. The adaptability claim is only partially accurate: sharks are well adapted to their own environments, even though their physiology prevents them from surviving on land.
So why did the model hallucinate? Three factors stand out:
- No constraints on responses: The model wasn’t instructed to limit responses to known facts
- Allowed speculation: The model was permitted to make claims without verification
- Lack of validation: No post-processing was implemented to validate factual correctness
How it appears in metrics
Hallucinations typically manifest in evaluation metrics as:
- Low Correctness scores: Responses contain factual inaccuracies
- Uncertainty indicators: The model shows measurable uncertainty about its own response
Solutions to prevent hallucinations
1. Explicitly Instruct the Model to Avoid Guessing
Modify your prompts to discourage speculation (an example prompt follows the tips below).
Implementation tips:
- Add explicit uncertainty instructions to your system prompt
- Include examples of good “I don’t know” responses in few-shot prompts
- Reward the model for admitting uncertainty rather than making up answers
- Consider phrases like “It’s important to be accurate rather than comprehensive”
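As a starting point, here is a minimal sketch of such a prompt using the OpenAI Python SDK; the model name, wording, and example question are illustrative assumptions, not a prescribed recipe:

```python
# Minimal sketch: an anti-guessing system prompt (OpenAI Python SDK >= 1.0).
# The model name and prompt wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a careful assistant. Only state facts you are confident are "
    "true. If you are unsure, say \"I don't know\" instead of guessing. "
    "It's important to be accurate rather than comprehensive."
)

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Are tigers better than sharks?"},
    ],
    temperature=0,  # low temperature discourages speculative wording
)
print(response.choices[0].message.content)
```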
2. Add Response Validation
- Implement automated validation using Correctness scores
- Flag responses with a low Ground Truth Adherence score for human review
- Consider using retrieval-augmented generation (RAG) to ground responses in verified sources
- Use Galileo’s Ground Truth Adherence metric to compare responses against known facts
- Implement confidence thresholds where low-confidence answers trigger fallback responses (see the sketch after this list)
- For critical applications, create a verification pipeline with multiple checks
- Consider a “cited response only” approach where all factual claims must reference sources
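A minimal sketch of a confidence-threshold gate with a fallback response; `score_correctness` is a hypothetical stand-in for whatever scorer you use (for example, a Correctness metric), stubbed here so the example runs:

```python
# Minimal sketch: gate responses on a correctness score with a fallback.
# `score_correctness` is a hypothetical stand-in, stubbed so this runs.

FALLBACK = "I'm not confident enough to answer that accurately, so I won't guess."

def score_correctness(response: str) -> float:
    """Stub: replace with a real correctness/factuality scorer."""
    return 0.5  # pretend every response is borderline

def gate_response(response: str, threshold: float = 0.7) -> str:
    """Serve the response only if its score clears the threshold."""
    if score_correctness(response) < threshold:
        # Below the threshold: return the fallback instead of a guess.
        # In production, also flag the response for human review.
        return FALLBACK
    return response

print(gate_response("Tigers are 66.67% better than sharks."))
```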
3. Implement Post-Processing Checks
- Run additional LLM-based checks for factual accuracy before serving responses
- Filter out responses that exceed an uncertainty threshold
- Consider implementing a citation system for factual claims
- Create a separate “fact-checking” LLM call that evaluates the main response (sketched after this list)
- Implement a multi-agent approach where one agent generates and another verifies
- Use structured output formats that separate facts from opinions/interpretations
- For high-stakes domains, maintain a database of verified facts to check against
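One way to implement the separate fact-checking call, sketched with the OpenAI Python SDK; the checker prompt, verdict format, and model name are illustrative assumptions:

```python
# Minimal sketch: a separate fact-checking LLM call (OpenAI Python SDK).
# The checker prompt, verdict format, and model name are illustrative.
from openai import OpenAI

client = OpenAI()

CHECKER_PROMPT = (
    "You are a strict fact-checker. List every factual claim in the text "
    "below, label each SUPPORTED, UNSUPPORTED, or FALSE, then end with one "
    "line: VERDICT: PASS or VERDICT: FAIL."
)

def fact_check(answer: str) -> bool:
    """Return True only if the checker model passes the answer."""
    result = client.chat.completions.create(
        model="gpt-4o",  # illustrative; any capable checker model works
        messages=[
            {"role": "system", "content": CHECKER_PROMPT},
            {"role": "user", "content": answer},
        ],
        temperature=0,
    )
    return "VERDICT: PASS" in (result.choices[0].message.content or "")
```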
4. Design for Transparency
- Clearly communicate the model’s limitations to users
- Provide confidence scores alongside responses (see the response-format sketch after this list)
- Distinguish between factual statements and opinions/interpretations
- Consider visual indicators for different levels of certainty
- Use UI elements like confidence meters or color-coding for certainty levels
- Provide “View Sources” options for factual claims
- Design graceful fallbacks when the model is uncertain
- Consider allowing users to provide feedback on factual accuracy
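A minimal sketch of a structured, transparency-friendly response payload; the field names and rendering are illustrative assumptions:

```python
# Minimal sketch: a response payload that separates the answer from its
# confidence, sources, and fact/opinion status. Field names are illustrative.
from dataclasses import dataclass, field

@dataclass
class TransparentAnswer:
    text: str                 # the answer shown to the user
    confidence: float         # 0-1; can drive a UI confidence meter
    sources: list[str] = field(default_factory=list)  # backs "View Sources"
    is_opinion: bool = False  # lets the UI label opinions differently

def render(answer: TransparentAnswer) -> str:
    """Render with simple textual certainty and source indicators."""
    level = "high" if answer.confidence >= 0.8 else "low"
    kind = "opinion" if answer.is_opinion else "fact"
    cites = f" [sources: {', '.join(answer.sources)}]" if answer.sources else ""
    return f"({kind}, {level} confidence) {answer.text}{cites}"

print(render(TransparentAnswer(
    text="Tigers are apex predators in their terrestrial habitats.",
    confidence=0.95,
    sources=["wildlife-encyclopedia"],
)))
```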
5. Monitor and Continuously Improve
- Use Galileo to track hallucination rates over time
- Collect examples of hallucinations to create targeted test cases
- Regularly audit responses, especially for high-risk domains
- Implement feedback loops to improve the system based on real-world performance
- Set up custom metrics in Galileo specifically for tracking hallucination types
- Create a “hallucination dataset” of challenging examples for regression testing (see the sketch after this list)
- Implement user feedback mechanisms to flag potential hallucinations
- Schedule regular reviews of flagged responses to identify patterns
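A minimal regression-testing sketch over a hand-built hallucination dataset; `generate_answer` is a hypothetical stand-in for your real pipeline, and the keyword check is deliberately simple rather than a production-grade metric:

```python
# Minimal sketch: regression testing against a hand-built hallucination
# dataset. `generate_answer` is a hypothetical stand-in for your pipeline;
# the keyword check is deliberately simple, not Galileo's metric.
HALLUCINATION_DATASET = [
    # (challenging question, phrase an honest answer should contain)
    ("What did Napoleon tweet in 1805?", "don't know"),
    ("Cite the study proving sharks outlive tigers.", "not aware"),
]

def generate_answer(question: str) -> str:
    """Stub: replace with a call into your real model pipeline."""
    return "I don't know; I'm not aware of any such source."

def pass_rate() -> float:
    """Share of challenging questions answered honestly; track over time."""
    hits = sum(
        expected in generate_answer(question).lower()
        for question, expected in HALLUCINATION_DATASET
    )
    return hits / len(HALLUCINATION_DATASET)

print(f"hallucination regression pass rate: {pass_rate():.0%}")
```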
Best practices
Layer your defenses
Combine multiple approaches rather than relying on a single method. Use a combination of prompt engineering, validation, and post-processing checks.
Domain-specific knowledge
For specialized domains, create custom knowledge bases or retrieval systems that contain verified information relevant to your application.
Calibrate confidence
Train your system to accurately represent its confidence level. Models should express appropriate uncertainty when dealing with ambiguous or incomplete information.
Progressive disclosure
Consider revealing information in stages, with increasing verification for more detailed claims. Start with high-confidence facts before providing specifics.
Continuous evaluation
Regularly test your system against new examples of potential hallucinations. Build a test suite of challenging cases that target known weaknesses.
Human-in-the-loop
For critical applications, maintain human oversight for final verification. Design workflows where humans can efficiently review and correct model outputs.
Feedback integration
Create mechanisms to learn from identified hallucinations. Use feedback loops to continuously improve your system’s factual accuracy over time.
Contextual awareness
Adjust verification stringency based on the stakes of the interaction. Apply more rigorous checks for high-risk domains like healthcare or finance, as in the sketch below.
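A minimal sketch of stakes-based stringency; the domains and thresholds are illustrative assumptions, not recommendations:

```python
# Minimal sketch: stakes-based verification stringency. The domains and
# thresholds are illustrative assumptions, not recommendations.
DOMAIN_THRESHOLDS = {
    "healthcare": 0.95,  # high stakes: near-certain answers only
    "finance": 0.90,
    "general": 0.70,     # lower stakes: tolerate more uncertainty
}

def required_confidence(domain: str) -> float:
    """How confident an answer must be before it is served in this domain."""
    return DOMAIN_THRESHOLDS.get(domain, DOMAIN_THRESHOLDS["general"])
```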