Explore Galileo’s comprehensive metrics framework for evaluating and improving AI system performance across multiple dimensions.
Galileo comes with a set of ready-to-use metrics that allow you to see how your AI is performing. With these metrics, you can quickly spot problems, track improvements, and make your AI work better for your users.
To calculate metrics, you first need to configure an integration, either with an LLM or with Luna-2. Connect Galileo to your language model by adding your API key on the integrations page in the Galileo application.
Metrics can be used with both experiments and log streams.
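For example, here is a minimal sketch of running an experiment with built-in metrics via the Galileo Python SDK. The project, dataset, and metric names are placeholders, and helper signatures can differ between SDK versions, so check the SDK reference for your release.

```python
# Minimal sketch: run an experiment with built-in metrics using the
# Galileo Python SDK. Assumes GALILEO_API_KEY is set in the environment
# and that a dataset named "my-dataset" exists; all names here are
# placeholders, not required values.
from galileo.datasets import get_dataset
from galileo.experiments import run_experiment

def my_app(input):
    # Your application logic; return the model's response as a string.
    return f"Echo: {input}"

run_experiment(
    "quickstart-experiment",
    project="my-project",
    dataset=get_dataset(name="my-dataset"),
    function=my_app,
    metrics=["correctness", "completeness"],  # built-in metric names
)
```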
To get the most value from Galileo's metrics, combine them deliberately. The metrics fall into five key categories, each addressing a specific aspect of AI system performance, and most teams benefit from using metrics from more than one category, depending on which aspects matter most to them. Galileo also supports custom metrics, which can be used alongside the built-in options.
Category | Description | When to Use |
---|---|---|
Agentic | Metrics that evaluate how effectively AI agents perform tasks, use tools, and progress toward goals. | When building and optimizing AI systems that take actions, make decisions, or use tools to accomplish tasks. |
Expression and Readability | Metrics that evaluate the style, tone, clarity, and overall presentation of AI-generated content. | When the format, tone, and presentation of AI outputs are important for user experience or brand consistency. |
Model Confidence | Metrics that measure how certain or uncertain your AI model is about its responses. | When you want to flag low-confidence responses for review, improve system reliability, or better understand model uncertainty. |
Response Quality | Metrics that assess the accuracy, completeness, relevance, and overall quality of AI-generated responses. | When evaluating how well AI systems answer questions, follow instructions, or provide information based on context. |
Safety and Compliance | Metrics that identify potential risks, harmful content, bias, or privacy concerns in AI interactions. | When ensuring AI systems meet regulatory requirements, protect user privacy, and avoid generating harmful or biased content. |
Agentic metrics evaluate how well your AI agents use tools, make decisions, and accomplish multi-step tasks. They're a good fit when you're building agents that need to interact with external systems or complete complex workflows.
Name | Description | When to Use | Example Use Case |
---|---|---|---|
Tool Error | Detects errors or failures during the execution of tools. | When your AI agents use tools and you want to track error rates. | A coding assistant that uses external APIs to run code and must handle and report execution errors appropriately. |
Tool Selection Quality | Evaluates whether the agent selected the most appropriate tools for the task. | When optimizing agent systems for effective tool usage. | A data analysis agent that must choose the right visualization or statistical method based on the data type and user question. |
Action Advancement | Measures how effectively each action advances toward the goal. | When assessing whether an agent is making meaningful progress in multi-step tasks. | A travel planning agent that needs to book flights, hotels, and activities in the correct sequence. |
Action Completion | Determines whether the agent successfully accomplished all of the user's goals. | To assess whether an agent completed the desired goal. | A coding agent tasked with closing engineering tickets. |
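Agentic metrics can only score the steps you log, so agent traces need to record LLM decisions and tool calls. As a rough sketch, assuming the Python SDK's GalileoLogger span API (method and parameter names may vary by version), logging one decision and one tool call might look like this:

```python
# Sketch: log an agent trace with an LLM step and a tool call so that
# agentic metrics (e.g. Tool Selection Quality, Tool Error) have spans
# to evaluate. Project and log stream names are placeholders.
from galileo import GalileoLogger

logger = GalileoLogger(project="agent-demo", log_stream="dev")
logger.start_trace(input="Find flights to Berlin")

# The model decides which tool to call.
logger.add_llm_span(
    input="Find flights to Berlin",
    output='{"tool": "search_flights", "args": {"destination": "BER"}}',
    model="gpt-4o",
)

# The tool executes; failures recorded here feed the Tool Error metric.
logger.add_tool_span(
    input='{"destination": "BER"}',
    output='[{"flight": "LH123", "price": 180}]',
    name="search_flights",
)

logger.conclude(output="Found flight LH123 to Berlin for $180.")
logger.flush()
```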
Expression and Readability metrics assess the style, tone, and clarity of your AI's generated content. They're helpful when you want your AI to communicate clearly, match your brand's voice, or produce content that's easy for users to understand.
Name | Description | When to Use | Example Use Case |
---|---|---|---|
Tone | Evaluates the emotional tone and style of the response. | When the style and tone of AI responses matter for your brand or user experience. | A luxury brand’s customer service chatbot that must maintain a sophisticated, professional tone consistent with the brand image. |
BLEU & ROUGE | Standard NLP metrics for evaluating text generation quality. These metrics are only available in experiments because they require ground truth in your dataset. | When you want to quantitatively assess the similarity between generated and reference texts. | Evaluating the quality of machine-translation or summarization outputs against human-written references. |
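BLEU and ROUGE are openly specified metrics, so you can reproduce the general idea locally. The sketch below uses the third-party sacrebleu and rouge-score packages to illustrate what the scores measure; it is not Galileo's implementation, and exact scores can vary with tokenization settings.

```python
# Illustration of what BLEU and ROUGE measure, using the open-source
# sacrebleu and rouge-score packages (pip install sacrebleu rouge-score).
import sacrebleu
from rouge_score import rouge_scorer

reference = "The cat sat on the mat."
candidate = "A cat was sitting on the mat."

# BLEU: n-gram precision overlap, scored 0-100.
bleu = sacrebleu.sentence_bleu(candidate, [reference])
print(f"BLEU: {bleu.score:.1f}")

# ROUGE-L: longest-common-subsequence precision/recall, reported as F1.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)
print(f"ROUGE-L F1: {scores['rougeL'].fmeasure:.2f}")
```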
Model Confidence metrics show how certain or uncertain your AI model is about its answers. They're useful when you want to flag low-confidence responses for review or improve your system's reliability.
Name | Description | When to Use | Example Use Case |
---|---|---|---|
Uncertainty | Measures the model’s confidence in its generated response. | When you want to understand how certain the model is about its answers. | Flagging responses where the model is unsure, so a human can review them before sending to a user. |
Prompt Perplexity | Evaluates how difficult or unusual the prompt is for the model to process. | When you want to identify prompts that may confuse the model or lead to lower-quality responses. | Detecting outlier prompts in a customer support chatbot to improve prompt engineering. |
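Perplexity is conventionally derived from token log-probabilities: it is the exponential of the negative mean log-probability, so lower values mean the text was more predictable to the model. The sketch below shows that standard calculation; Galileo's exact formulation may differ.

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(-mean log-probability). Lower means the text
    was more predictable (less surprising) to the model."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Token log-probabilities as returned by most LLM APIs (e.g. the OpenAI
# chat completions logprobs option). Values here are illustrative.
logprobs = [-0.12, -0.45, -2.30, -0.08]
print(f"perplexity: {perplexity(logprobs):.2f}")
```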
Response Quality metrics evaluate how well your AI answers user questions and follows instructions. They're especially helpful when you want to ensure your system provides accurate, complete, and relevant responses.
Name | Description | When to Use | Example Use Case |
---|---|---|---|
Chunk Attribution | Assesses whether the response properly attributes information to source documents. | When building RAG systems and you want to ensure proper attribution. | A legal research assistant that must cite specific cases and statutes when providing legal information. |
Chunk Utilization | Measures how effectively the model uses the retrieved chunks in its response. | When optimizing RAG performance to ensure retrieved information is used efficiently. | A technical support chatbot that needs to incorporate relevant product documentation in troubleshooting responses. |
Completeness | Measures whether the response addresses all aspects of the user’s query. | When evaluating if responses fully address the user’s intent. | A healthcare chatbot that must address all symptoms mentioned by a patient when suggesting next steps. |
Context Adherence | Measures how well the response aligns with the provided context. | When you want to ensure the model is grounding its responses in the provided context. | A financial advisor bot that must base investment recommendations on the client’s specific financial situation and goals. |
Context Relevance (Query Adherence) | Evaluates whether the retrieved context is relevant to the user’s query. | When assessing the quality of your retrieval system’s results. | An internal knowledge base search that retrieves company policies relevant to specific employee questions. |
Correctness (Factuality) | Evaluates the factual accuracy of information provided in the response. | When accuracy of information is critical to your application. | A medical information system providing drug interaction details to healthcare professionals. |
Ground Truth Adherence | Measures how well the response aligns with established ground truth. This metric is only available in experiments because it requires ground truth in your dataset. | When evaluating model responses against known correct answers. | A customer service AI that must provide accurate product specifications from an official catalog. |
Instruction Adherence | Assesses whether the model followed the instructions in your prompt template. | When using complex prompts and need to verify the model is following all instructions. | A content generation system that must follow specific brand guidelines and formatting requirements. |
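RAG-focused metrics such as Chunk Attribution and Chunk Utilization can only score the chunks you log, so the retrieval step needs to be recorded alongside the LLM call. A rough sketch, again assuming the Python SDK's GalileoLogger span API (parameter names may differ by version):

```python
# Sketch: record a retrieval step and the subsequent LLM call so that
# RAG metrics (Chunk Attribution, Chunk Utilization, Context Adherence)
# have retrieved documents to evaluate. Names are placeholders.
from galileo import GalileoLogger

logger = GalileoLogger(project="rag-demo", log_stream="dev")
logger.start_trace(input="What is our refund policy?")

# The retrieved chunks logged here are what the RAG metrics score.
logger.add_retriever_span(
    input="What is our refund policy?",
    output=[
        "Refunds are available within 30 days of purchase.",
        "Shipping costs are non-refundable.",
    ],
)

logger.add_llm_span(
    input="What is our refund policy?",
    output="You can request a refund within 30 days of purchase.",
    model="gpt-4o",
)

logger.conclude(output="You can request a refund within 30 days of purchase.")
logger.flush()
```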
Safety and Compliance metrics identify potential risks, harmful content, or compliance issues in your AI's responses. They're important when you need to protect users, meet regulatory requirements, or avoid generating biased or unsafe content.
Name | Description | When to Use | Example Use Case |
---|---|---|---|
PII / CPNI / PHI | Identifies personally identifiable or sensitive information in prompts and responses. | When handling potentially sensitive data or in regulated industries. | A healthcare chatbot that must detect and redact patient information in conversation logs. |
Prompt Injection | Detects attempts to manipulate the model through malicious prompts. | When allowing user input to be processed directly by your AI system. | A public-facing AI assistant that needs protection from users trying to bypass content filters or extract sensitive information. |
Sexism / Bias | Detects gender-based bias or discriminatory content. | When ensuring AI outputs are free from bias and discrimination. | A resume screening assistant that must evaluate job candidates without gender or demographic bias. |
Toxicity | Identifies harmful, offensive, or inappropriate content. | When monitoring AI outputs for harmful content or implementing content filtering. | A social media content moderation system that must detect and flag potentially harmful user-generated content. |
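To make concrete what a PII check looks for, here is a deliberately naive, regex-based illustration. Galileo's detectors are model-based and far more robust; this toy only shows the category of content being flagged, not the product's implementation.

```python
# Toy illustration of the kind of content a PII metric flags.
# Not Galileo's implementation; real detectors are model-based.
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def flag_pii(text: str) -> dict[str, list[str]]:
    """Return naive PII matches found in the text, keyed by type."""
    hits = {}
    for kind, pattern in PII_PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[kind] = found
    return hits

print(flag_pii("Reach me at jane@example.com or 555-867-5309."))
# {'email': ['jane@example.com'], 'phone': ['555-867-5309']}
```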