Correctness measures whether a given model response contains factually accurate information.
Correctness scores range from 0 to 1:
- Low Correctness: The response contains factual errors.
- High Correctness: The response is factually accurate.
Calculation method
Correctness is computed through a multi-step process:
1. Model Request: Additional evaluation requests are sent to OpenAI’s GPT-4o model to analyze the response.
2. Prompt Engineering: A carefully engineered chain-of-thought prompt asks the model to evaluate whether the response contains factually accurate information.
3. Multiple Evaluations: The system requests multiple distinct responses to this prompt to ensure robust evaluation through consensus.
4. Result Analysis: Each evaluation generates both an explanation of the reasoning and a binary judgment (yes/no) on factual accuracy.
5. Score Calculation: The final Correctness score is computed as the ratio of ‘yes’ responses to the total number of evaluation responses.
- If the score is greater than 0.5, the explanation will provide an argument that the response is factual
- If the score is less than 0.5, the explanation will provide an argument that it is not factual
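The voting scheme can be illustrated with a short Python sketch. This is illustrative only: it assumes the official OpenAI Python SDK, and the judge prompt, sample count, and vote parsing below are hypothetical stand-ins for the production implementation.

```python
from openai import OpenAI  # assumes the official OpenAI Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical judge prompt; the production chain-of-thought prompt differs.
JUDGE_PROMPT = """You are a fact-checking judge. Think step by step about
whether the response below is factually accurate, then end with a single
line containing only 'yes' or 'no'.

Response to evaluate:
{response}"""

def correctness_score(response_text: str, n_samples: int = 5) -> float:
    """Sample several judge completions and return the fraction voting 'yes'."""
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(response=response_text)}],
        n=n_samples,      # request several distinct evaluations in one call
        temperature=1.0,  # nonzero temperature so the votes can differ
    )
    yes_votes = sum(
        1
        for choice in completion.choices
        if choice.message.content.strip().splitlines()[-1].strip().lower().startswith("yes")
    )
    return yes_votes / n_samples
```

With five samples, four ‘yes’ votes yield a score of 0.8, matching the ratio described in step 5.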
This metric is computed by prompting an LLM multiple times, and thus requires additional LLM calls to compute, which may impact usage and billing.
Understanding correctness
How Correctness Differs from Context Adherence
Correctness: Measures whether a model response contains factually accurate information, regardless of whether that information comes from the provided context.
Context Adherence: Measures whether the response adheres specifically to the information provided in the context.
Example: In a text-to-SQL scenario, a response could be factually correct (high Correctness) but not derived from the provided context (low Context Adherence). Conversely, a response could faithfully represent the context (high Context Adherence) but contain factual errors if the context itself is incorrect.
Optimizing your AI system
Addressing Low Correctness Scores
- Flag and examine potentially non-factual responses: Identify patterns in responses that tend to contain factual errors.
- Adjust your prompts: Instruct the model to stick to information it’s given in the context and avoid speculation.
- Implement verification steps: Add additional checks for factual accuracy before responses reach end users (see the sketch after this list).
- Consider model selection: Some models may be more factually accurate than others for specific domains.
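One hypothetical way to wire a verification step into a serving path, reusing the `correctness_score` sketch from the calculation section above (the threshold and fallback message are illustrative, not prescriptive):

```python
# Reuses the hypothetical correctness_score() sketch defined earlier.
CORRECTNESS_THRESHOLD = 0.7  # illustrative cutoff; tune for your application

def verified_reply(candidate: str) -> str:
    """Serve a candidate response only if it clears the Correctness gate."""
    if correctness_score(candidate) < CORRECTNESS_THRESHOLD:
        # Fall back instead of shipping a likely non-factual answer.
        return "I'm not confident in that answer; please verify independently."
    return candidate
```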
Best practices
Implement Fact-Checking
For critical applications, implement automated fact-checking against trusted knowledge bases or databases.
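As a rough sketch of this idea, assuming you maintain a trusted store of facts (here a plain dict standing in for a real knowledge base or database), a claim checker might look like:

```python
# Toy knowledge base; in practice this would be a curated database or API.
TRUSTED_FACTS = {
    "boiling point of water at sea level": "100 °C",
    "speed of light in vacuum": "299,792,458 m/s",
}

def check_claim(topic: str, claimed_value: str) -> bool:
    """Return True if the claim matches the trusted record, False if it
    contradicts it; raise if there is no record to check against."""
    if topic not in TRUSTED_FACTS:
        raise KeyError(f"no trusted record for {topic!r}")
    return TRUSTED_FACTS[topic] == claimed_value
```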
Use Grounding Techniques
Instruct models to ground their responses in verifiable information and cite sources when possible.
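For example, a grounding system prompt along these lines (the wording is illustrative, not prescriptive):

```python
GROUNDED_SYSTEM_PROMPT = (
    "Answer using only the information in the provided context. "
    "Cite the source passage for every factual claim. "
    "If the context does not contain the answer, say you do not know "
    "instead of guessing."
)
```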
Monitor Domain-Specific Accuracy
Track Correctness scores across different knowledge domains to identify areas where your model may be less reliable.
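A minimal way to aggregate per-domain averages, assuming each logged evaluation carries a domain tag and a score (the record schema here is hypothetical):

```python
from collections import defaultdict

def scores_by_domain(evaluations: list[dict]) -> dict[str, float]:
    """Average Correctness per domain from records shaped like
    {"domain": "medicine", "score": 0.8}."""
    buckets: defaultdict[str, list[float]] = defaultdict(list)
    for record in evaluations:
        buckets[record["domain"]].append(record["score"])
    return {domain: sum(s) / len(s) for domain, s in buckets.items()}
```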
Create Factual Guardrails
Develop domain-specific guardrails that can catch common factual errors before they reach users.
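A simple rule-based guardrail might scan responses for known error patterns before they ship; the deny-list below is purely illustrative:

```python
import re

# Hypothetical deny-list of known-wrong claims for a given domain.
KNOWN_ERRORS = [
    re.compile(r"great wall of china.*visible from space", re.IGNORECASE),
    re.compile(r"humans use only 10% of their brains?", re.IGNORECASE),
]

def passes_guardrails(response_text: str) -> bool:
    """Return False if the response matches any known factual-error pattern."""
    return not any(p.search(response_text) for p in KNOWN_ERRORS)
```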
When optimizing for Correctness, remember that even human experts can disagree on certain facts. Consider implementing confidence levels for responses, especially in domains with evolving knowledge or subjective elements.
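One hedged way to surface such confidence levels is to band the raw Correctness score; the thresholds below are illustrative and should be tuned per domain:

```python
def confidence_label(score: float) -> str:
    """Map a Correctness score in [0, 1] to a coarse confidence band.
    Thresholds are illustrative, not calibrated."""
    if score >= 0.8:
        return "high confidence"
    if score >= 0.5:
        return "medium confidence"
    return "low confidence"
```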