Response quality metrics help you measure how well your AI system answers user questions, follows instructions, and provides useful information in any setting — with or without RAG.

Problems response quality metrics help you solve

  • You’re not sure whether the model’s answers are actually correct. Correctness helps you spot factual mistakes in responses, even when there is no single reference answer.
  • You have reference answers and want to know how close the model gets. Ground Truth Adherence tells you when a response is semantically equivalent to your gold answer and when it drifts away.
  • The model keeps ignoring or twisting your instructions. Instruction Adherence highlights where responses fail to follow the structure, constraints, or style you asked for.

Diagnose your response quality problem

Not sure which metric to start with? Walk through these symptoms to find the right one.
Diagnosis: The model is generating incorrect information, either from outdated training data or hallucination.
Start with: Correctness to systematically identify factually wrong statements.
When to use this vs. Ground Truth Adherence: Use Correctness when you don’t have reference answers and want to catch general factual mistakes. Use Ground Truth Adherence when you have gold-standard answers to compare against.
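Correctness-style metrics are typically implemented by extracting the factual claims in a response and judging each one. The sketch below shows only the shape of that harness: `split_claims` is a naive sentence splitter, and `judge_claim` is a stub standing in for a real LLM judge or fact source — both names are hypothetical, not part of any platform API.

```python
def split_claims(response: str) -> list[str]:
    # Naive sentence split; a production metric would extract
    # atomic factual claims instead.
    return [s.strip() for s in response.split(".") if s.strip()]

def judge_claim(claim: str) -> bool:
    # Stub: replace with a call to an LLM judge or an external fact source.
    known_facts = {"water boils at 100 c at sea level"}
    return claim.lower() in known_facts

def correctness(response: str) -> float:
    # Fraction of claims judged factually correct, in [0, 1].
    claims = split_claims(response)
    if not claims:
        return 0.0
    return sum(judge_claim(c) for c in claims) / len(claims)

print(correctness("Water boils at 100 C at sea level. Water boils at 50 C on Everest."))
```

Because only half the claims pass the stub judge, this example scores 0.5 — the metric degrades gracefully as incorrect claims accumulate rather than failing the whole response.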
Diagnosis: You have reference answers (from experts, previous systems, or test datasets) and the model’s outputs don’t align with them.
Start with: Ground Truth Adherence to measure semantic similarity to your gold answers.
Note: This metric requires ground truth data, so it’s primarily used in experiments and test suites, not real-time production monitoring.
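Semantic similarity to a gold answer is usually computed by embedding both texts and comparing the vectors. The sketch below is illustrative only: it uses a toy bag-of-words vectorizer as a stand-in for a real sentence-embedding model, and `adherence_score` is a hypothetical helper, not the platform's API.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use a
    # sentence-embedding model here instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def adherence_score(response: str, ground_truth: str) -> float:
    # Score in [0, 1]: higher means the response is closer to the gold answer.
    return cosine(embed(response), embed(ground_truth))

print(adherence_score("the battery lasts 10 hours", "battery life is 10 hours"))
```

With a real embedding model, paraphrases of the gold answer score high even with little word overlap — which is exactly the "semantically equivalent" behavior this metric is meant to capture.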
Diagnosis: You’ve specified constraints (format, length, tone, prohibited topics) but the model keeps violating them.
Start with: Instruction Adherence to detect when responses break your prompt’s rules.
Common instruction failures: ignoring format requirements (JSON, bullets, tables), exceeding length limits, using the wrong tone, mentioning prohibited topics, and skipping required elements.
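Several of these failure modes (format, length, prohibited topics) can be caught with deterministic checks before any LLM judge gets involved. The sketch below is a minimal example of that idea; `check_instructions` and its parameters are hypothetical, and softer constraints like tone or style would still need an LLM-based evaluation.

```python
import json

def check_instructions(response: str, max_words: int = 50,
                       prohibited: tuple = ("pricing",)) -> list[str]:
    # Returns the list of violated rules; an empty list means every
    # deterministic check passed.
    violations = []
    try:
        json.loads(response)  # format requirement: must be valid JSON
    except ValueError:
        violations.append("not valid JSON")
    if len(response.split()) > max_words:  # length requirement
        violations.append(f"exceeds {max_words} words")
    for topic in prohibited:  # prohibited-topic requirement
        if topic in response.lower():
            violations.append(f"mentions prohibited topic: {topic}")
    return violations

print(check_instructions('{"answer": "The SKU is A-12"}'))
```

Returning the full list of violations, rather than a single pass/fail bit, makes it easier to see which instruction the model keeps breaking.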
High Correctness + Low Instruction Adherence? The model knows the right answer but isn’t presenting it the way you asked. This is a prompt engineering issue — your instructions may need to be more explicit or positioned differently (system prompt vs. user message).
Response Quality metrics work with or without retrieved context. If you’re building a RAG system, combine these with RAG metrics — use RAG metrics to evaluate retrieval and grounding, and Response Quality metrics to evaluate factual accuracy and instruction-following.
Below is a quick reference table of these Response Quality metrics:
| Name | Description | Supported Nodes | When to Use | Example Use Case |
|---|---|---|---|---|
| Ground Truth Adherence | Measures how well the response aligns with established ground truth. Only available for experiments, as it needs ground truth set in your dataset. | Trace | When evaluating model responses against known correct answers. | A customer service AI that must provide accurate product specifications from an official catalog. |
| Correctness (factuality) | Evaluates the factual accuracy of information provided in the response. | LLM span | When accuracy of information is critical to your application. | A medical information system providing drug interaction details to healthcare professionals. |
| Instruction Adherence | Assesses whether the model followed the instructions in your prompt template. | LLM span | When using complex prompts and you need to verify the model is following all instructions. | A content generation system that must follow specific brand guidelines and formatting requirements. |

Next steps