Problems response quality metrics help you solve
- You’re not sure whether the model’s answers are actually correct. Correctness helps you spot factual mistakes in responses, even when there is no single reference answer.
- You have reference answers and want to know how close the model gets. Ground Truth Adherence tells you when a response is semantically equivalent to your gold answer and when it drifts away.
- The model keeps ignoring or twisting your instructions. Instruction Adherence highlights where responses fail to follow the structure, constraints, or style you asked for.
Diagnose your response quality problem
Not sure which metric to start with? Walk through these symptoms to find the right one.
Responses contain factual errors
Diagnosis: The model is generating incorrect information, either from outdated training data or hallucination.
Start with: Correctness to systematically identify factually wrong statements.
When to use this vs. Ground Truth Adherence: Use Correctness when you don't have reference answers and want to catch general factual mistakes. Use Ground Truth Adherence when you have gold-standard answers to compare against.
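Correctness checks of this kind are typically run by an LLM judge. As a rough sketch of the pattern (the prompt wording and the `parse_verdict` convention here are illustrative assumptions, not this product's actual judge), you prompt a second model to fact-check the response and emit a machine-readable verdict:

```python
# Sketch of an LLM-as-judge correctness check.
# The template and VERDICT convention are hypothetical; the actual
# judge prompt used by the metric is internal to the platform.

JUDGE_TEMPLATE = """You are a fact-checker. Evaluate the response below.
List any factually incorrect statements, then output a final line
of the form VERDICT: correct or VERDICT: incorrect.

Response to evaluate:
{response}"""

def build_judge_prompt(response: str) -> str:
    """Fill the judge template with the response under evaluation."""
    return JUDGE_TEMPLATE.format(response=response)

def parse_verdict(judge_output: str) -> bool:
    """Return True if the judge marked the response factually correct."""
    for line in judge_output.splitlines():
        if line.strip().upper().startswith("VERDICT:"):
            return "incorrect" not in line.lower()
    raise ValueError("judge output contained no VERDICT line")
```

The structured `VERDICT:` line is what makes the judge's free-text reasoning usable as a metric: the explanation stays human-readable while the final token is trivially parseable.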
Responses don't match expected answers
Diagnosis: You have reference answers (from experts, previous systems, or test datasets) and the model's outputs don't align with them.
Start with: Ground Truth Adherence to measure semantic similarity to your gold answers.
Note: This metric requires ground truth data, so it's primarily used in experiments and test suites, not real-time production monitoring.
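In practice this metric compares each response to its gold answer with a semantic similarity score (typically embedding-based). As a self-contained stand-in for that idea, here is a crude lexical version, token-level F1 overlap, which captures the same shape of comparison without an embedding model (the function name and thresholds are illustrative, not the platform's scoring method):

```python
from collections import Counter

def token_f1(response: str, ground_truth: str) -> float:
    """Crude lexical stand-in for semantic similarity: token-level F1.

    Real ground-truth adherence scoring would use embeddings or an LLM
    judge; this only illustrates response-vs-gold comparison.
    """
    resp = response.lower().split()
    gold = ground_truth.lower().split()
    if not resp or not gold:
        return 0.0
    # Multiset intersection counts shared tokens with multiplicity.
    overlap = sum((Counter(resp) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(resp)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

A score of 1.0 means an exact lexical match and 0.0 means no shared tokens; a semantic scorer would additionally credit paraphrases, which is exactly why this metric is framed as semantic equivalence rather than string equality.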
The model ignores my prompt rules
Diagnosis: You’ve specified constraints (format, length, tone, prohibited topics) but the model keeps violating them.Start with: Instruction Adherence to detect when responses break your prompt’s rules.Common instruction failures: Ignoring format requirements (JSON, bullets, tables), exceeding length limits, using wrong tone, mentioning prohibited topics, skipping required elements.
| Name | Description | Supported Nodes | When to Use | Example Use Case |
|---|---|---|---|---|
| Ground Truth Adherence | Measures how well the response aligns with established ground truth. This metric is only available for experiments, as it requires ground truth to be set in your dataset. | Trace | When evaluating model responses against known correct answers. | A customer service AI that must provide accurate product specifications from an official catalog. |
| Correctness (factuality) | Evaluates the factual accuracy of information provided in the response. | LLM span | When accuracy of information is critical to your application. | A medical information system providing drug interaction details to healthcare professionals. |
| Instruction Adherence | Assesses whether the model followed the instructions in your prompt template. | LLM span | When using complex prompts and need to verify the model is following all instructions. | A content generation system that must follow specific brand guidelines and formatting requirements. |