Ground Truth Adherence measures whether a model's response is semantically equivalent to your reference answer (Ground Truth). For example, "Paris is the capital of France" and "France's capital city is Paris" are worded differently but semantically equivalent.
Low Adherence
The model's response is semantically different from the Ground Truth.

High Adherence
The model's response is semantically equivalent to the Ground Truth.

Ground Truth Adherence is useful for:
- Evaluating model performance against a benchmark dataset
- Ensuring consistency in critical applications
- Measuring the impact of model or prompt changes
This metric is only supported in experiments and requires a Ground Truth to be set in the output column of your experiment's dataset.
Calculation method

Ground Truth Adherence is computed through a multi-step process:

1. Model Request: Additional evaluation requests are sent to OpenAI's GPT-4o model to analyze the semantic relationship between responses.
2. Prompt Engineering: A carefully engineered chain-of-thought prompt asks the model to evaluate whether the response and Ground Truth convey the same meaning.
3. Multiple Evaluations: The system requests multiple distinct responses to this prompt to ensure robust evaluation through consensus.
4. Result Analysis: Each evaluation generates both an explanation of the reasoning and a binary judgment (yes/no) on semantic equivalence.
5. Score Calculation: The final Ground Truth Adherence score is computed as the ratio of 'yes' responses to the total number of evaluation responses.
Because this metric is computed by prompting an LLM multiple times, it requires additional LLM calls, which may impact usage and billing.
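To make the procedure concrete, here is a minimal sketch of the consensus-scoring loop, assuming the OpenAI Python client. The prompt wording and the ground_truth_adherence helper are illustrative assumptions, not the production implementation.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical chain-of-thought judge prompt; the production prompt is more
# carefully engineered than this sketch.
JUDGE_PROMPT = """You are judging semantic equivalence.

Ground Truth: {ground_truth}
Response: {response}

Think step by step about whether the two convey the same meaning, then
answer on the final line with exactly "Verdict: yes" or "Verdict: no".
"""

def ground_truth_adherence(response: str, ground_truth: str, n: int = 5) -> float:
    """Return the fraction of 'yes' verdicts across n independent judgments."""
    completion = client.chat.completions.create(
        model="gpt-4o",
        temperature=1.0,  # sampling variation yields distinct judgments
        n=n,              # request n completions in a single call
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                ground_truth=ground_truth, response=response
            ),
        }],
    )
    # Each choice carries both the reasoning (the explanation) and a final
    # binary verdict; the score is the ratio of 'yes' verdicts to the total.
    verdicts = [
        "verdict: yes" in (choice.message.content or "").lower()
        for choice in completion.choices
    ]
    return sum(verdicts) / len(verdicts)
```

Each judgment's chain-of-thought text can also be surfaced as the explanation that accompanies the score.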
Understanding ground truth adherence
Differentiating from Other Metrics
- Ground Truth Adherence: Measures semantic equivalence to a reference answer.
- Correctness: Measures factual accuracy regardless of any reference answer.
- Context Adherence: Measures alignment with provided context, not a reference answer.
Optimizing your AI system
Addressing Low Ground Truth Adherence
- Analyze divergence patterns: Identify common ways in which responses differ from ground truth.
- Refine your prompts: Adjust instructions to guide the model toward your expected output format and content.
- Consider few-shot examples: Provide examples in your prompt that demonstrate the desired response pattern, as in the sketch after this list.
- Evaluate ground truth quality: Ensure your reference answers are clear, consistent, and representative of ideal responses.
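As an illustration of the few-shot suggestion above, a prompt can embed a couple of examples that model the expected style; the questions and wording here are hypothetical.

```python
# Illustrative few-shot prompt demonstrating the desired response pattern.
FEW_SHOT_PROMPT = """Answer in one short, factual sentence.

Q: What is the capital of Japan?
A: The capital of Japan is Tokyo.

Q: Who painted the Mona Lisa?
A: Leonardo da Vinci painted the Mona Lisa.

Q: {question}
A:"""

prompt = FEW_SHOT_PROMPT.format(
    question="What is the largest planet in the solar system?"
)
```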
Best practices
- Maintain Diverse Ground Truths: Create a varied set of reference answers that cover different response styles and edge cases.
- Set Clear Evaluation Criteria: Define what constitutes semantic equivalence for your specific use case and domain.
- Monitor Across Model Versions: Track Ground Truth Adherence when upgrading models to ensure consistent performance.
- Balance with Other Metrics: Use Ground Truth Adherence alongside metrics like Correctness and Instruction Adherence for a complete evaluation.
When optimizing for Ground Truth Adherence, remember that there may be multiple valid ways to express the same information. Consider whether strict adherence to specific wording is necessary, or if semantic equivalence is sufficient for your use case.