Ground Truth Adherence measures whether a model’s response is semantically equivalent to your reference answer (Ground Truth).

Ground Truth Adherence is a continuous metric ranging from 0 to 1:

0 (Low Adherence): The model's response is semantically different from the Ground Truth.

1 (High Adherence): The model's response is semantically equivalent to the Ground Truth.

This metric helps evaluate how closely your model’s outputs match expected or ideal responses, which is particularly valuable for:

  • Evaluating model performance against a benchmark dataset
  • Ensuring consistency in critical applications
  • Measuring the impact of model or prompt changes

This metric requires a Ground Truth to be set. Check out this page to learn how to add a Ground Truth to your runs.

Calculation Method

Ground Truth Adherence is computed through a multi-step process:

1. Model Request: Additional evaluation requests are sent to OpenAI’s GPT-4o model to analyze the semantic relationship between the model's response and the Ground Truth.

2. Prompt Engineering: A carefully engineered chain-of-thought prompt asks the model to evaluate whether the response and Ground Truth convey the same meaning.

3. Multiple Evaluations: The system requests multiple distinct responses to this prompt to ensure robust evaluation through consensus.

4. Result Analysis: Each evaluation generates both an explanation of the reasoning and a binary judgment (yes/no) on semantic equivalence.

5. Score Calculation: The final Ground Truth Adherence score is computed as the ratio of ‘yes’ responses to the total number of evaluation responses.

We also surface one of the generated explanations, always choosing one that aligns with the majority judgment among the responses.
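The score calculation and explanation selection described above can be sketched in a few lines of Python. This is a minimal illustration, not the product's implementation: the `Evaluation` class, the `ground_truth_adherence` function, and the tie-breaking rule (a tie counts as 'yes') are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    """One judge response (hypothetical shape): a yes/no verdict plus its reasoning."""
    equivalent: bool
    explanation: str


def ground_truth_adherence(evaluations: list[Evaluation]) -> tuple[float, str]:
    """Return (score, explanation) for a list of judge responses."""
    if not evaluations:
        raise ValueError("at least one evaluation is required")

    # Score = number of 'yes' judgments / total number of judgments.
    yes_count = sum(1 for e in evaluations if e.equivalent)
    score = yes_count / len(evaluations)

    # Assumption: on an exact tie, treat 'yes' as the majority judgment.
    majority_is_yes = yes_count * 2 >= len(evaluations)

    # Surface the first explanation that agrees with the majority judgment.
    explanation = next(
        e.explanation for e in evaluations if e.equivalent == majority_is_yes
    )
    return score, explanation


evaluations = [
    Evaluation(True, "Both answers name the same capital city."),
    Evaluation(False, "The response omits the year given in the Ground Truth."),
    Evaluation(True, "The wording differs but the meaning is the same."),
]
score, explanation = ground_truth_adherence(evaluations)
print(round(score, 2), explanation)
```

With two 'yes' judgments out of three, the score is 2/3 and the surfaced explanation comes from one of the 'yes' responses.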

Because this metric is computed by prompting an LLM multiple times, it requires additional LLM calls, which may impact usage and billing.

Understanding Ground Truth Adherence

Differentiating from Other Metrics

It’s important to understand how Ground Truth Adherence differs from related metrics:

Ground Truth Adherence: Measures semantic equivalence to a reference answer.

Correctness: Measures factual accuracy regardless of any reference answer.

Context Adherence: Measures alignment with provided context, not a reference answer.

Optimizing Your AI System

Addressing Low Ground Truth Adherence

When responses have low Ground Truth Adherence scores, your model is generating outputs that differ semantically from your reference answers. To improve your system:

Analyze divergence patterns: Identify common ways in which responses differ from ground truth.

Refine your prompts: Adjust instructions to guide the model toward your expected output format and content.

Consider few-shot examples: Provide examples in your prompt that demonstrate the desired response pattern.

Evaluate ground truth quality: Ensure your reference answers are clear, consistent, and representative of ideal responses.

Best Practices

Maintain diverse Ground Truths: Create a varied set of reference answers that cover different response styles and edge cases.

Set clear evaluation criteria: Define what constitutes semantic equivalence for your specific use case and domain.

Monitor across model versions: Track Ground Truth Adherence when upgrading models to ensure consistent performance.

Balance with other metrics: Use Ground Truth Adherence alongside metrics like Correctness and Instruction Adherence for a complete evaluation.
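As one way to monitor the metric across model versions, the sketch below compares mean Ground Truth Adherence between two runs over the same benchmark dataset. It assumes you have already exported the per-sample scores as plain lists of floats; the `compare_versions` function and the `tolerance` default of 0.05 are hypothetical choices for illustration, not product settings.

```python
from statistics import mean


def compare_versions(
    baseline_scores: list[float],
    candidate_scores: list[float],
    tolerance: float = 0.05,  # hypothetical threshold; tune for your use case
) -> dict:
    """Compare mean Ground Truth Adherence between two model or prompt versions."""
    if not baseline_scores or not candidate_scores:
        raise ValueError("both runs need at least one score")

    baseline_mean = mean(baseline_scores)
    candidate_mean = mean(candidate_scores)
    return {
        "baseline_mean": baseline_mean,
        "candidate_mean": candidate_mean,
        "delta": candidate_mean - baseline_mean,
        # Flag a regression only when the drop exceeds the tolerance.
        "regressed": candidate_mean < baseline_mean - tolerance,
    }


report = compare_versions([0.8, 1.0, 0.9], [0.6, 0.7, 0.5])
print(report["regressed"])
```

A comparison like this is only meaningful when both runs use the same inputs and the same Ground Truth answers.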

When optimizing for Ground Truth Adherence, remember that there may be multiple valid ways to express the same information. Consider whether strict adherence to specific wording is necessary, or if semantic equivalence is sufficient for your use case.