Measure semantic equivalence between model outputs and reference answers using Galileo’s Guardrail Metrics to ensure alignment with expected responses.
Ground Truth Adherence measures whether a model’s response is semantically equivalent to your reference answer (Ground Truth).
Ground Truth Adherence is a continuous metric ranging from 0 to 1:
Low Adherence (0): The model's response is semantically different from the Ground Truth.
High Adherence (1): The model's response is semantically equivalent to the Ground Truth.
This metric helps evaluate how closely your model’s outputs match expected or ideal responses, which is particularly valuable when comparing prompts or model versions against a fixed set of reference answers.
This metric is only supported in experiments and requires a Ground Truth to be set in the output column of your experiment’s dataset.
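Below is a minimal sketch of what such a dataset might look like before upload. The rows, the file name, and the upload path (CSV, SDK, or UI) are illustrative assumptions; the only requirement described above is that the Ground Truth lives in the output column.

```python
# Hypothetical example dataset for a Ground Truth Adherence experiment.
# The "output" column holds the reference answer (Ground Truth) for each input.
import csv

rows = [
    {"input": "What is the capital of France?",
     "output": "Paris is the capital of France."},
    {"input": "Who wrote 'Pride and Prejudice'?",
     "output": "Jane Austen wrote 'Pride and Prejudice'."},
]

with open("experiment_dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "output"])
    writer.writeheader()
    writer.writerows(rows)
```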
Ground Truth Adherence is computed through a multi-step process:
Model Request: Additional evaluation requests are sent to OpenAI’s GPT-4o model to analyze the semantic relationship between the model’s response and the Ground Truth.
Prompt Engineering: A carefully engineered chain-of-thought prompt asks the evaluator to judge whether the response and the Ground Truth convey the same meaning.
Multiple Evaluations: The system requests multiple distinct responses to this prompt to ensure robust evaluation through consensus.
Result Analysis: Each evaluation generates both an explanation of the reasoning and a binary judgment (yes/no) on semantic equivalence.
Score Calculation: The final Ground Truth Adherence score is computed as the ratio of ‘yes’ judgments to the total number of evaluation responses.
We also surface one of the generated explanations, always choosing one that aligns with the majority judgment among the responses.
Because this metric is computed by prompting an LLM multiple times, it requires additional LLM calls, which may impact usage and billing.
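To make the scoring concrete, here is a minimal sketch of the consensus calculation, assuming a hypothetical `judge` callable that sends the chain-of-thought prompt to an LLM and returns an explanation plus a yes/no verdict. It illustrates the logic described above, not Galileo’s internal implementation.

```python
# Sketch of consensus scoring: score = fraction of 'yes' verdicts, plus an
# explanation that agrees with the majority verdict. The judge callable is a
# hypothetical stand-in for the chain-of-thought evaluation prompt.
from collections import Counter
from typing import Callable, Tuple

def ground_truth_adherence(
    response: str,
    ground_truth: str,
    judge: Callable[[str, str], Tuple[str, bool]],  # returns (explanation, is_equivalent)
    num_evaluations: int = 5,
) -> Tuple[float, str]:
    results = [judge(response, ground_truth) for _ in range(num_evaluations)]
    verdicts = [is_equivalent for _, is_equivalent in results]

    # Ratio of 'yes' judgments to the total number of evaluations.
    score = sum(verdicts) / len(verdicts)

    # Surface one explanation that aligns with the majority judgment.
    majority = Counter(verdicts).most_common(1)[0][0]
    explanation = next(expl for expl, verdict in results if verdict == majority)
    return score, explanation
```

For example, if four of five evaluations return ‘yes’, the score is 0.8, and the surfaced explanation is drawn from one of the ‘yes’ evaluations.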
It’s important to understand how Ground Truth Adherence differs from related metrics:
Ground Truth Adherence: Measures semantic equivalence to a reference answer.
Correctness: Measures factual accuracy regardless of any reference answer.
Context Adherence: Measures alignment with provided context, not a reference answer.
When responses have low Ground Truth Adherence scores, your model is generating outputs that differ semantically from your reference answers. To improve your system:
Analyze divergence patterns: Identify common ways in which responses differ from ground truth (see the sketch after this list).
Refine your prompts: Adjust instructions to guide the model toward your expected output format and content.
Consider few-shot examples: Provide examples in your prompt that demonstrate the desired response pattern.
Evaluate ground truth quality: Ensure your reference answers are clear, consistent, and representative of ideal responses.
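As a starting point for the divergence analysis mentioned above, the sketch below pulls low-scoring rows out of exported experiment results for manual review. The file name and column names ("response", "ground_truth", "ground_truth_adherence") are assumptions; adjust them to match however you export results from your experiments.

```python
# Hypothetical triage of low-adherence rows: inspect response vs. Ground Truth
# pairs to spot recurring divergence patterns (format, missing details, style).
import pandas as pd

results = pd.read_csv("experiment_results.csv")  # assumed export of experiment rows

low_adherence = results[results["ground_truth_adherence"] < 0.5]
for _, row in low_adherence.iterrows():
    print(f"--- score: {row['ground_truth_adherence']:.2f}")
    print("response:    ", row["response"])
    print("ground truth:", row["ground_truth"])
```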
Create a varied set of reference answers that cover different response styles and edge cases.
Define what constitutes semantic equivalence for your specific use case and domain.
Track Ground Truth Adherence when upgrading models to ensure consistent performance.
Use Ground Truth Adherence alongside metrics like Correctness and Instruction Adherence for a complete evaluation.
When optimizing for Ground Truth Adherence, remember that there may be multiple valid ways to express the same information. Consider whether strict adherence to specific wording is necessary, or if semantic equivalence is sufficient for your use case.