Correctness measures whether a given model response contains factually accurate information.

Correctness is a continuous metric ranging from 0 to 1:

  • 0 (Low Correctness): The response contains factual errors.
  • 1 (High Correctness): The response is factually accurate.

This metric is particularly valuable for uncovering open-domain hallucinations: factual errors that don’t relate to any specific documents or context provided to the model.

Calculation Method

Correctness is computed through a multi-step process:

1. Model Request: Additional evaluation requests are sent to OpenAI’s GPT-4o model to analyze the response.

2. Prompt Engineering: A carefully engineered chain-of-thought prompt asks the model to evaluate whether the response contains factually accurate information.

3. Multiple Evaluations: The system requests multiple distinct responses to this prompt to ensure robust evaluation through consensus.

4. Result Analysis: Each evaluation generates both an explanation of the reasoning and a binary judgment (yes/no) on factual accuracy.

5. Score Calculation: The final Correctness score is computed as the ratio of ‘yes’ responses to the total number of evaluation responses.

We also surface one of the generated explanations, always choosing one that aligns with the majority judgment among the responses:

  • If the score is greater than 0.5, the explanation will provide an argument that the response is factual
  • If the score is less than 0.5, the explanation will provide an argument that it is not factual
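The scoring and explanation-selection logic described above can be sketched as follows. This is a minimal illustration, not the production implementation; the `evaluations` list of (explanation, judgment) pairs is a hypothetical structure standing in for parsed LLM output:

```python
def correctness_score(evaluations):
    """Compute a Correctness score from multiple LLM evaluations.

    `evaluations` is a list of (explanation, judgment) tuples, where
    judgment is "yes" or "no" on factual accuracy. (Hypothetical
    structure for illustration; a real pipeline would parse raw LLM
    responses into this shape.)
    """
    yes_votes = sum(1 for _, judgment in evaluations if judgment == "yes")
    score = yes_votes / len(evaluations)

    # Surface an explanation that aligns with the majority judgment.
    majority = "yes" if score > 0.5 else "no"
    explanation = next(
        (expl for expl, judgment in evaluations if judgment == majority),
        None,
    )
    return score, explanation
```

For example, with three evaluations of which two judged the response factual, the score is 2/3 and the surfaced explanation is one arguing that the response is factual.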

This metric is computed by prompting an LLM multiple times; these additional LLM calls may affect usage and billing.

Understanding Correctness

How Correctness Differs from Context Adherence

It’s important to understand the distinction between related metrics:

Correctness: Measures whether a model response has factually correct information, regardless of whether that information is contained in the provided context.

Context Adherence: Measures whether the response adheres specifically to the information provided in the context.

Example: In a text-to-SQL scenario, a response could be factually correct (high Correctness) but not derived from the provided context (low Context Adherence). Conversely, a response could faithfully represent the context (high Context Adherence) but contain factual errors if the context itself is incorrect.

Optimizing Your AI System

Addressing Low Correctness Scores

When a response has a low Correctness score, it’s likely that the response contains non-factual information. To improve your system:

  • Flag and examine potentially non-factual responses: Identify patterns in responses that tend to contain factual errors.
  • Adjust your prompts: Instruct the model to stick to information it’s given in the context and avoid speculation.
  • Implement verification steps: Add additional checks for factual accuracy before responses reach end users.
  • Consider model selection: Some models may be more factually accurate than others for specific domains.

Best Practices

Implement Fact-Checking

For critical applications, implement automated fact-checking against trusted knowledge bases or databases.
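As a minimal sketch of this idea, extracted claims from a response could be checked against a small trusted store before delivery. The `KNOWN_FACTS` dictionary and `check_claims` helper below are hypothetical; a production system would query a curated knowledge base or database instead:

```python
# Hypothetical trusted knowledge base mapping claims to truth values.
KNOWN_FACTS = {
    "water boils at 100 c at sea level": True,
    "the sun orbits the earth": False,
}

def check_claims(claims):
    """Flag claims that contradict the trusted knowledge base.

    Claims absent from the store are left for human review rather
    than being marked wrong.
    """
    flagged = []
    for claim in claims:
        verdict = KNOWN_FACTS.get(claim.lower())
        if verdict is False:
            flagged.append(claim)
    return flagged
```

A check like this works best as one layer among several: it catches known falsehoods cheaply, while novel or unverifiable claims still flow to the Correctness metric and human review.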

Use Grounding Techniques

Instruct models to ground their responses in verifiable information and cite sources when possible.

Monitor Domain-Specific Accuracy

Track Correctness scores across different knowledge domains to identify areas where your model may be less reliable.

Create Factual Guardrails

Develop domain-specific guardrails that can catch common factual errors before they reach users.

When optimizing for Correctness, remember that even human experts can disagree on certain facts. Consider implementing confidence levels for responses, especially in domains with evolving knowledge or subjective elements.