Ground Truth Adherence
Measure semantic equivalence between model outputs and reference answers using Galileo’s Guardrail Metrics to ensure alignment with expected responses.
Ground Truth Adherence measures whether a model’s response is semantically equivalent to your reference answer (Ground Truth).
Ground Truth Adherence is a continuous metric ranging from 0 to 1:
- **0 (Low Adherence)**: The model's response is semantically different from the Ground Truth.
- **1 (High Adherence)**: The model's response is semantically equivalent to the Ground Truth.
This metric helps evaluate how closely your model’s outputs match expected or ideal responses, which is particularly valuable for:
- Evaluating model performance against a benchmark dataset
- Ensuring consistency in critical applications
- Measuring the impact of model or prompt changes
This metric requires a Ground Truth to be set. Check out this page to learn how to add a Ground Truth to your runs.
Calculation Method
Ground Truth Adherence is computed through a multi-step process:
1. **Model Request**: Additional evaluation requests are sent to OpenAI's GPT-4o model to analyze the semantic relationship between responses.
2. **Prompt Engineering**: A carefully engineered chain-of-thought prompt asks the model to evaluate whether the response and the Ground Truth convey the same meaning.
3. **Multiple Evaluations**: The system requests multiple distinct responses to this prompt to ensure robust evaluation through consensus.
4. **Result Analysis**: Each evaluation generates both an explanation of the reasoning and a binary judgment (yes/no) on semantic equivalence.
5. **Score Calculation**: The final Ground Truth Adherence score is the ratio of "yes" responses to the total number of evaluation responses. For example, if four of five evaluations return "yes", the score is 0.8.
We also surface one of the generated explanations, always choosing one that aligns with the majority judgment among the responses.
This metric is computed by prompting an LLM multiple times, and thus requires additional LLM calls to compute, which may impact usage and billing.
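Putting these steps together, the consensus mechanism can be sketched in a few lines of Python. This is a minimal illustration assuming the OpenAI Python SDK; the judge prompt, the number of evaluations (`NUM_JUDGES`), and the vote-parsing logic are simplified assumptions, not Galileo's actual implementation.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
NUM_JUDGES = 5     # hypothetical number of evaluation responses

# Illustrative chain-of-thought judge prompt (not Galileo's actual prompt).
JUDGE_PROMPT = """You are comparing a model response to a reference answer.

Reference answer: {ground_truth}
Model response: {response}

Reason step by step about whether the two convey the same meaning,
then finish with a final line containing exactly "yes" or "no"."""

def ground_truth_adherence(response: str, ground_truth: str) -> tuple[float, str]:
    """Return (score, explanation): score is the fraction of 'yes' votes."""
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(ground_truth=ground_truth,
                                           response=response),
        }],
        n=NUM_JUDGES,     # request several distinct judgments in one call
        temperature=1.0,  # nonzero temperature so judgments can differ
    )
    texts = [choice.message.content or "" for choice in completion.choices]
    # A judgment counts as "yes" if its final line starts with "yes".
    votes = [(t.strip().splitlines() or [""])[-1].lower().startswith("yes")
             for t in texts]
    score = sum(votes) / len(votes)
    # Surface one explanation that agrees with the majority judgment.
    majority = score >= 0.5
    explanation = next(t for t, v in zip(texts, votes) if v == majority)
    return score, explanation
```

Calling `ground_truth_adherence("The capital of France is Paris.", "Paris is France's capital city.")` would typically return a score near 1.0, since most judgments should find the two responses semantically equivalent. In practice Galileo computes this for you; the sketch is only meant to make the consensus scoring concrete.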
Understanding Ground Truth Adherence
Differentiating from Other Metrics
It’s important to understand how Ground Truth Adherence differs from related metrics:
- **Ground Truth Adherence**: Measures semantic equivalence to a reference answer.
- **Correctness**: Measures factual accuracy, regardless of any reference answer.
- **Context Adherence**: Measures alignment with the provided context, not a reference answer.

For example, a response can be factually accurate (scoring high on Correctness) while still conveying different information than your reference answer (scoring low on Ground Truth Adherence).
Optimizing Your AI System
Addressing Low Ground Truth Adherence
When responses have low Ground Truth Adherence scores, your model is generating outputs that differ semantically from your reference answers. To improve your system:
- **Analyze divergence patterns**: Identify common ways in which responses differ from the Ground Truth.
- **Refine your prompts**: Adjust instructions to guide the model toward your expected output format and content.
- **Consider few-shot examples**: Provide examples in your prompt that demonstrate the desired response pattern (see the sketch after this list).
- **Evaluate Ground Truth quality**: Ensure your reference answers are clear, consistent, and representative of ideal responses.
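As a concrete illustration of the few-shot suggestion above, you might embed reference-style question/answer pairs ahead of the live question so the model imitates the style of your Ground Truths. This sketch uses the OpenAI chat-message format; the system instruction and example pairs are hypothetical placeholders.

```python
# Hypothetical few-shot prompt: the example Q/A pairs demonstrate the
# concise, factual style that the reference answers are written in.
few_shot_messages = [
    {"role": "system",
     "content": "Answer in one concise, factual sentence."},
    # Illustrative examples mirroring the ground-truth style:
    {"role": "user", "content": "What is the boiling point of water at sea level?"},
    {"role": "assistant", "content": "Water boils at 100 °C (212 °F) at sea level."},
    {"role": "user", "content": "What is the chemical symbol for gold?"},
    {"role": "assistant", "content": "The chemical symbol for gold is Au."},
    # The live question goes last:
    {"role": "user", "content": "What is the speed of light in a vacuum?"},
]
```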
Best Practices
- **Maintain Diverse Ground Truths**: Create a varied set of reference answers that cover different response styles and edge cases.
- **Set Clear Evaluation Criteria**: Define what constitutes semantic equivalence for your specific use case and domain.
- **Monitor Across Model Versions**: Track Ground Truth Adherence when upgrading models to ensure consistent performance.
- **Balance with Other Metrics**: Use Ground Truth Adherence alongside metrics like Correctness and Instruction Adherence for a complete evaluation.
When optimizing for Ground Truth Adherence, remember that there may be multiple valid ways to express the same information. Consider whether strict adherence to specific wording is necessary, or if semantic equivalence is sufficient for your use case.