Metrics Overview

Galileo comes with a set of ready-to-use metrics that allow you to see how your AI is performing. With these metrics, you can quickly spot problems, track improvements, and make your AI work better for your users.

To calculate metrics, you need to configure an integration with an LLM or with Luna-2. To connect Galileo to your language model, add your API key on the integrations page in the Galileo application.

Metrics can be used with experiments and Log streams.

Using metrics effectively

To get the most value from Galileo’s metrics:

Start with key metrics - Focus on metrics most relevant to your use case
Establish baselines - Understand your current performance before making changes
Track trends over time - Monitor how metrics change as you iterate on your system
Combine multiple metrics - Look at related metrics together for a more complete picture
Set thresholds - Define acceptable ranges for critical metrics
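The baseline and threshold steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not Galileo's SDK: the metric names, scores, and threshold values are all hypothetical, and the scores are assumed to have been exported already as numbers between 0 and 1.

```python
from statistics import mean

# Hypothetical metric scores in [0, 1]; names and values are illustrative only.
baseline_runs = {
    "context_adherence": [0.82, 0.85, 0.80],
    "completeness": [0.70, 0.74, 0.72],
}
latest_run = {"context_adherence": 0.78, "completeness": 0.75}

# Assumed minimum acceptable values for the critical metrics.
thresholds = {"context_adherence": 0.80, "completeness": 0.70}


def summarize(baseline_runs, latest_run, thresholds):
    """Compare the latest scores with the baseline mean and the threshold."""
    report = {}
    for name, history in baseline_runs.items():
        baseline = mean(history)
        latest = latest_run[name]
        report[name] = {
            "baseline": round(baseline, 3),
            "delta": round(latest - baseline, 3),
            "below_threshold": latest < thresholds[name],
        }
    return report


report = summarize(baseline_runs, latest_run, thresholds)
for name, row in report.items():
    print(name, row)
```

Keeping the baseline, the change from baseline, and the threshold check side by side makes it easier to tell a real regression from normal run-to-run variation.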

Out-of-the-Box metric categories

Our metrics fall into five key categories, each addressing a specific aspect of AI system performance. Teams often benefit from combining metrics from more than one category, depending on which matter most to them. Galileo also supports custom metrics that can be implemented alongside the out-of-the-box metrics.

Category	Description	When to Use	Example Use Cases
Agentic	Metrics that evaluate how effectively AI agents perform tasks, use tools, and progress toward goals.	When building and optimizing AI systems that take actions, make decisions, or use tools to accomplish tasks.	Evaluating a travel planning agent’s ability to book complete itineraries; assessing a coding assistant’s appropriate use of APIs and libraries; measuring a data analysis agent’s tool selection effectiveness
Expression and Readability	Metrics that evaluate the style, tone, clarity, and overall presentation of AI-generated content.	When the format, tone, and presentation of AI outputs are important for user experience or brand consistency.	Ensuring a luxury brand chatbot maintains a sophisticated tone; verifying educational content is presented at the appropriate reading level; measuring clarity and conciseness in technical documentation generation
Model Confidence	Metrics that measure how certain or uncertain your AI model is about its responses.	When you want to flag low-confidence responses for review, improve system reliability, or better understand model uncertainty.	Flagging uncertain answers in a customer support chatbot for human review; identifying low-confidence predictions in a medical diagnosis assistant; improving user trust by surfacing confidence scores in AI-generated content
Response Quality	Metrics that assess the accuracy, completeness, relevance, and overall quality of AI-generated responses.	When evaluating how well AI systems answer questions, follow instructions, or provide information based on context.	Measuring factual accuracy in a medical information system; evaluating how well a RAG system uses retrieved information; assessing if customer service responses address all parts of a query
Safety and Compliance	Metrics that identify potential risks, harmful content, bias, or privacy concerns in AI interactions.	When ensuring AI systems meet regulatory requirements, protect user privacy, and avoid generating harmful or biased content.	Detecting PII in healthcare chatbot conversations; identifying potential prompt injection attacks in public-facing systems; measuring bias in hiring or loan approval recommendation systems

Agentic performance metrics

Use these metrics to evaluate how well your AI agents use tools, make decisions, and accomplish multi-step tasks. They’re a good fit when you’re building agents that need to interact with external systems or complete complex workflows.

Name	Description	When to Use	Example Use Case
Tool Error	Detects errors or failures during the execution of tools.	When you are implementing AI agents that use tools and want to track error rates.	A coding assistant that uses external APIs to run code and must handle and report execution errors appropriately.
Tool Selection Quality	Evaluates whether the agent selected the most appropriate tools for the task.	When optimizing agent systems for effective tool usage.	A data analysis agent that must choose the right visualization or statistical method based on the data type and user question.
Action Advancement	Measures how effectively each action advances toward the goal.	When assessing whether an agent is making meaningful progress in multi-step tasks.	A travel planning agent that needs to book flights, hotels, and activities in the correct sequence.
Action Completion	Determines whether the agent successfully accomplished all of the user’s goals.	When assessing whether an agent fully accomplished what the user asked for.	A coding agent tasked with resolving and closing engineering tickets.
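To make "tracking error rates" concrete, the sketch below aggregates tool-call outcomes that have already been labeled as succeeded or failed. It is an assumption-laden illustration: the `ToolCall` record, the tool names, and the helper functions are hypothetical, and it does not show how Galileo itself detects tool errors.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ToolCall:
    """Hypothetical record of one tool invocation and whether it succeeded."""
    tool: str
    ok: bool


def tool_error_rate(calls):
    """Fraction of calls that failed; 0.0 when there are no calls."""
    if not calls:
        return 0.0
    failures = sum(1 for call in calls if not call.ok)
    return failures / len(calls)


def error_rate_by_tool(calls):
    """Group calls by tool name and compute the error rate for each group."""
    grouped = defaultdict(list)
    for call in calls:
        grouped[call.tool].append(call)
    return {tool: tool_error_rate(group) for tool, group in grouped.items()}


# Illustrative data for a travel planning agent.
calls = [
    ToolCall("search_flights", True),
    ToolCall("book_hotel", False),
    ToolCall("search_flights", True),
    ToolCall("book_hotel", True),
]

print(tool_error_rate(calls))
print(error_rate_by_tool(calls))
```

Breaking the rate down per tool helps separate a single unreliable integration from a general problem with how the agent calls tools.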

Expression and readability metrics

Use these metrics to assess the style, tone, and clarity of your AI’s generated content. They’re helpful when you want your AI to communicate clearly, match your brand’s voice, or produce content that’s easy for users to understand.

Name	Description	When to Use	Example Use Case
Tone	Evaluates the emotional tone and style of the response.	When the style and tone of AI responses matter for your brand or user experience.	A luxury brand’s customer service chatbot that must maintain a sophisticated, professional tone consistent with the brand image.
BLEU & ROUGE	Standard NLP metrics for evaluating text generation quality. These metrics are available only for experiments because they require ground truth to be set in your dataset.	When you want to quantitatively assess the similarity between generated and reference texts.	Evaluating the quality of machine-translated or summarization outputs against human-written references.
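As a rough intuition for what these reference-based metrics measure, the sketch below computes clipped unigram precision (the building block of BLEU) and unigram recall (the idea behind ROUGE-1) between a generated text and a reference. It is a simplification under stated assumptions: whitespace tokenization, a single reference, and no higher-order n-grams or brevity penalty, so its numbers will not match a full BLEU or ROUGE implementation. The function names are illustrative.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def unigram_overlap(candidate, reference):
    """Return (clipped unigram precision, unigram recall) for one reference."""
    cand_counts = Counter(tokenize(candidate))
    ref_counts = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count per token, which "clips"
    # repeated candidate words to the number of times they occur in the reference.
    overlap = sum((cand_counts & ref_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    return precision, recall


precision, recall = unigram_overlap(
    "the cat sat on the mat",
    "the cat is on the mat",
)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Both scores depend entirely on the reference text, which is why metrics of this kind need ground truth in the dataset.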

Model confidence metrics

Use these metrics to understand how certain or uncertain your AI model is about its answers. They’re useful when you want to flag low-confidence responses for review or improve your system’s reliability.

Name	Description	When to Use	Example Use Case
Uncertainty	Measures the model’s confidence in its generated response.	When you want to understand how certain the model is about its answers.	Flagging responses where the model is unsure, so a human can review them before sending to a user.
Prompt Perplexity	Evaluates how difficult or unusual the prompt is for the model to process.	When you want to identify prompts that may confuse the model or lead to lower-quality responses.	Detecting outlier prompts in a customer support chatbot to improve prompt engineering.
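For intuition about perplexity, the sketch below applies the standard definition: the exponential of the negative mean token log-probability. How Galileo computes Prompt Perplexity is not described here, so treat this as the textbook formula only. The log-probabilities, prompt labels, and review threshold are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Return exp of the negative mean natural-log probability of the tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical per-token log-probabilities for two prompts.
prompts = {
    "common phrasing": [math.log(0.5)] * 4,
    "unusual phrasing": [math.log(0.1)] * 4,
}

REVIEW_THRESHOLD = 5.0  # assumed cutoff; tune against your own data

scores = {name: perplexity(logprobs) for name, logprobs in prompts.items()}
flagged = [name for name, score in scores.items() if score > REVIEW_THRESHOLD]

print(scores)
print(flagged)
```

Higher perplexity means the model found the tokens less predictable, so an outlier prompt stands out as a score well above the rest.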

Response quality metrics

Use these metrics to evaluate how well your AI answers user questions and follows instructions. They’re especially helpful when you want to ensure your system is providing accurate, complete, and relevant responses.

Name	Description	When to Use	Example Use Case
Chunk Attribution	Assesses whether the response properly attributes information to source documents.	When you are implementing RAG systems and want to ensure proper attribution.	A legal research assistant that must cite specific cases and statutes when providing legal information.
Chunk Utilization	Measures how effectively the model uses the retrieved chunks in its response.	When optimizing RAG performance to ensure retrieved information is used efficiently.	A technical support chatbot that needs to incorporate relevant product documentation in troubleshooting responses.
Completeness	Measures whether the response addresses all aspects of the user’s query.	When evaluating if responses fully address the user’s intent.	A healthcare chatbot that must address all symptoms mentioned by a patient when suggesting next steps.
Context Adherence	Measures how well the response aligns with the provided context.	When you want to ensure the model is grounding its responses in the provided context.	A financial advisor bot that must base investment recommendations on the client’s specific financial situation and goals.
Context Relevance (Query Adherence)	Evaluates whether the retrieved context is relevant to the user’s query.	When assessing the quality of your retrieval system’s results.	An internal knowledge base search that retrieves company policies relevant to specific employee questions.
Correctness (factuality)	Evaluates the factual accuracy of information provided in the response.	When accuracy of information is critical to your application.	A medical information system providing drug interaction details to healthcare professionals.
Ground Truth Adherence	Measures how well the response aligns with established ground truth. This metric is available only for experiments because it requires ground truth to be set in your dataset.	When evaluating model responses against known correct answers.	A customer service AI that must provide accurate product specifications from an official catalog.
Instruction Adherence	Assesses whether the model followed the instructions in your prompt template.	When you are using complex prompts and need to verify that the model is following all instructions.	A content generation system that must follow specific brand guidelines and formatting requirements.

Safety and compliance metrics

Use these metrics to identify potential risks, harmful content, or compliance issues in your AI’s responses. They’re important when you need to protect users, meet regulatory requirements, or avoid generating biased or unsafe content.

Name	Description	When to Use	Example Use Case
PII / CPNI / PHI	Identifies personally identifiable or sensitive information in prompts and responses.	When handling potentially sensitive data or in regulated industries.	A healthcare chatbot that must detect and redact patient information in conversation logs.
Prompt Injection	Detects attempts to manipulate the model through malicious prompts.	When allowing user input to be processed directly by your AI system.	A public-facing AI assistant that needs protection from users trying to bypass content filters or extract sensitive information.
Sexism / Bias	Detects gender-based bias or discriminatory content.	When ensuring AI outputs are free from bias and discrimination.	A resume screening assistant that must evaluate job candidates without gender or demographic bias.
Toxicity	Identifies harmful, offensive, or inappropriate content.	When monitoring AI outputs for harmful content or implementing content filtering.	A social media content moderation system that must detect and flag potentially harmful user-generated content.
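To illustrate the detect-and-redact use case in its simplest form, the sketch below uses two regular expressions to find email addresses and US Social Security number patterns in a log line. This is a toy example, not how Galileo's PII metric works: pattern matching like this misses most real-world PII, and the patterns, labels, and function name are assumptions made for illustration.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text):
    """Replace each match with its label and return (redacted text, labels found)."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            found.append(label)
            text = pattern.sub(f"[{label}]", text)
    return text, found


redacted, labels = redact("Contact jane@example.com, SSN 123-45-6789")
print(redacted)
print(labels)
```

Returning the labels alongside the redacted text lets you count how often each kind of sensitive data appears without storing the raw values.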
