Correctness
Evaluate factual accuracy in AI outputs using Galileo Guardrail Metrics to detect and prevent hallucinations in your AI systems.
Correctness measures whether a given model response contains factually accurate information.
Correctness is a continuous metric ranging from 0 to 1:
- 0 (Low Correctness): the response contains factual errors.
- 1 (High Correctness): the response is factually accurate.
This metric is particularly valuable for uncovering open-domain hallucinations: factual errors that don’t relate to any specific documents or context provided to the model.
Calculation Method
Correctness is computed through a multi-step process:
1. Model Request: Additional evaluation requests are sent to OpenAI’s GPT-4o model to analyze the response.
2. Prompt Engineering: A carefully engineered chain-of-thought prompt asks the model to evaluate whether the response contains factually accurate information.
3. Multiple Evaluations: The system requests multiple distinct responses to this prompt to ensure robust evaluation through consensus.
4. Result Analysis: Each evaluation generates both an explanation of the reasoning and a binary judgment (yes/no) on factual accuracy.
5. Score Calculation: The final Correctness score is computed as the ratio of ‘yes’ responses to the total number of evaluation responses.
We also surface one of the generated explanations, always choosing one that aligns with the majority judgment among the responses:
- If the score is greater than 0.5, the explanation will provide an argument that the response is factual
- If the score is less than 0.5, the explanation will provide an argument that it is not factual
This metric is computed by prompting an LLM multiple times, so it requires additional LLM calls, which may affect usage and billing.
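Concretely, the consensus scoring and explanation selection can be sketched as follows. This is a minimal illustration of the logic described above, not Galileo’s implementation; the `(judgment, explanation)` input shape is an assumption:

```python
def correctness_score(evaluations):
    """Compute a consensus Correctness score from multiple LLM evaluations.

    `evaluations` is a list of (judgment, explanation) pairs, where
    judgment is True for a 'yes' (factual) verdict and False for 'no'.
    """
    yes_votes = sum(1 for judgment, _ in evaluations if judgment)
    score = yes_votes / len(evaluations)

    # Surface an explanation that aligns with the majority judgment.
    # A tie (score == 0.5) falls back to a 'not factual' explanation here;
    # a real system would need an explicit tie-breaking policy.
    majority = score > 0.5
    explanation = next(expl for judgment, expl in evaluations if judgment == majority)
    return score, explanation

# Example: three of four evaluations judged the response factual.
evals = [
    (True, "The dates and figures match well-established facts."),
    (True, "No claims contradict known information."),
    (False, "One statistic appears outdated."),
    (True, "All named entities are described accurately."),
]
score, explanation = correctness_score(evals)  # score == 0.75
```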
Understanding Correctness
How Correctness Differs from Context Adherence
It’s important to understand the distinction between related metrics:
Correctness: Measures whether a model response has factually correct information, regardless of whether that information is contained in the provided context.
Context Adherence: Measures whether the response adheres specifically to the information provided in the context.
Example: In a text-to-SQL scenario, a response could be factually correct (high Correctness) but not derived from the provided context (low Context Adherence). Conversely, a response could faithfully represent the context (high Context Adherence) but contain factual errors if the context itself is incorrect.
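To make the distinction concrete, consider this small illustrative case (the inputs and the metric outcomes are hypothetical):

```python
# Hypothetical case: the provided context is wrong, but the response is right.
context = "The Eiffel Tower is located in Berlin."
response = "The Eiffel Tower is located in Paris."

# Correctness: high (the response is factually accurate).
# Context Adherence: low (it contradicts the provided context).
```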
Optimizing Your AI System
Addressing Low Correctness Scores
When a response has a low Correctness score, it’s likely that the response contains non-factual information. To improve your system:
Flag and examine potentially non-factual responses: Identify patterns in responses that tend to contain factual errors.
Adjust your prompts: Instruct the model to stick to information it’s given in the context and avoid speculation.
Implement verification steps: Add checks for factual accuracy before responses reach end users (see the sketch after this list).
Consider model selection: Some models may be more factually accurate than others for specific domains.
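One way to implement the verification step above is to gate each response on its Correctness score before it reaches end users. The threshold and fallback message below are illustrative assumptions, not Galileo defaults:

```python
import logging

logger = logging.getLogger("guardrails")

CORRECTNESS_THRESHOLD = 0.7  # illustrative value; tune per application

def gate_response(response: str, correctness: float) -> str:
    """Serve the response only if its Correctness score clears the threshold."""
    if correctness < CORRECTNESS_THRESHOLD:
        # Flag the response for offline review instead of serving it.
        logger.warning("Low correctness (%.2f): %s", correctness, response)
        return "I couldn't verify that answer; please try rephrasing the question."
    return response
```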
Best Practices
Implement Fact-Checking
For critical applications, implement automated fact-checking against trusted knowledge bases or databases.
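For example, claims extracted from a response can be checked against a trusted lookup. `KNOWN_FACTS` and the claim format below are placeholders standing in for a real knowledge base and claim-extraction step:

```python
# Placeholder knowledge base: maps a claim key to its trusted value.
KNOWN_FACTS = {
    "boiling point of water at sea level": "100 °C",
}

def fact_check(claims: dict[str, str]) -> list[str]:
    """Return the claim keys that contradict the trusted knowledge base."""
    return [
        key
        for key, value in claims.items()
        if key in KNOWN_FACTS and KNOWN_FACTS[key] != value
    ]

errors = fact_check({"boiling point of water at sea level": "90 °C"})
# errors == ["boiling point of water at sea level"]
```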
Use Grounding Techniques
Instruct models to ground their responses in verifiable information and cite sources when possible.
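A grounding instruction can be as simple as a system prompt. The wording below is one illustrative example, not a prescribed template:

```python
# Illustrative system prompt encouraging grounded, cited answers.
GROUNDED_SYSTEM_PROMPT = (
    "Answer using only the information in the provided context. "
    "Cite the source passage for each factual claim. "
    "If the context does not contain the answer, say you don't know "
    "rather than guessing."
)
```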
Monitor Domain-Specific Accuracy
Track Correctness scores across different knowledge domains to identify areas where your model may be less reliable.
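A minimal sketch of per-domain tracking, assuming logged evaluations can be reduced to (domain, score) pairs:

```python
from collections import defaultdict
from statistics import mean

def correctness_by_domain(results):
    """Average Correctness score per knowledge domain.

    `results` is an iterable of (domain, score) pairs; the schema is
    an assumption about how evaluations are logged.
    """
    by_domain = defaultdict(list)
    for domain, score in results:
        by_domain[domain].append(score)
    return {domain: mean(scores) for domain, scores in by_domain.items()}

stats = correctness_by_domain([
    ("medicine", 0.62), ("medicine", 0.70), ("geography", 0.95),
])
# stats ~= {'medicine': 0.66, 'geography': 0.95}
```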
Create Factual Guardrails
Develop domain-specific guardrails that can catch common factual errors before they reach users.
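A simple rule-based guardrail can pattern-match known error classes before a response is served. The rules below are illustrative placeholders; a real rule set would be broader and maintained alongside the application:

```python
import re

# (pattern, error description) pairs for known factual error classes.
FACTUAL_RULES = [
    (re.compile(r"\b13\s+planets\b", re.IGNORECASE),
     "The solar system does not have 13 planets."),
]

def check_guardrails(response: str) -> list[str]:
    """Return descriptions of any known factual errors found in the response."""
    return [msg for pattern, msg in FACTUAL_RULES if pattern.search(response)]
```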
When optimizing for Correctness, remember that even human experts can disagree on certain facts. Consider implementing confidence levels for responses, especially in domains with evolving knowledge or subjective elements.
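One lightweight way to express confidence is to band the Correctness score. The boundaries below are illustrative and should be tuned to how much disagreement your domain tolerates:

```python
def confidence_band(score: float) -> str:
    """Map a Correctness score to a coarse confidence label."""
    if score >= 0.9:
        return "high"
    if score >= 0.6:
        return "medium"
    return "low"
```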