Explore Galileo’s comprehensive metrics framework for evaluating and improving AI system performance across multiple dimensions.
Galileo comes with a set of ready-to-use metrics that allow you to see how your AI is performing. With these metrics, you can quickly spot problems, track improvements, and make your AI work better for your users.
To calculate metrics, you first need to configure an integration, either with an LLM or with Luna-2. Connect Galileo to your language model by adding your API key on the integrations page in the Galileo application.
Metrics can be used with both experiments and log streams.
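For example, here is a minimal sketch of running an experiment with built-in metrics via the Galileo Python SDK. The project, dataset, and metric names are placeholders, and helper signatures can differ between SDK versions, so check the SDK reference for your release.

```python
# Minimal sketch: run an experiment with built-in metrics using the
# Galileo Python SDK. Assumes GALILEO_API_KEY is set in the environment
# and that a dataset named "my-dataset" exists; all names here are
# placeholders, not required values.
from galileo.datasets import get_dataset
from galileo.experiments import run_experiment

def my_app(input):
    # Your application logic; return the model's response as a string.
    return f"Echo: {input}"

run_experiment(
    "quickstart-experiment",
    project="my-project",
    dataset=get_dataset(name="my-dataset"),
    function=my_app,
    metrics=["correctness", "completeness"],  # built-in metric names
)
```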
To get the most value from Galileo's metrics, combine them deliberately. The metrics fall into five key categories, each addressing a specific aspect of AI system performance, and most teams benefit from using metrics from more than one category, depending on which aspects matter most to them. Galileo also supports custom metrics, which can be used alongside the built-in options.
Category | Description | When to Use |
---|---|---|
Agentic | Metrics that evaluate how effectively AI agents perform tasks, use tools, and progress toward goals. | When building and optimizing AI systems that take actions, make decisions, or use tools to accomplish tasks. |
Expression and Readability | Metrics that evaluate the style, tone, clarity, and overall presentation of AI-generated content. | When the format, tone, and presentation of AI outputs are important for user experience or brand consistency. |
Model Confidence | Metrics that measure how certain or uncertain your AI model is about its responses. | When you want to flag low-confidence responses for review, improve system reliability, or better understand model uncertainty. |
Response Quality | Metrics that assess the accuracy, completeness, relevance, and overall quality of AI-generated responses. | When evaluating how well AI systems answer questions, follow instructions, or provide information based on context. |
Safety and Compliance | Metrics that identify potential risks, harmful content, bias, or privacy concerns in AI interactions. | When ensuring AI systems meet regulatory requirements, protect user privacy, and avoid generating harmful or biased content. |
Agentic metrics evaluate how well your AI agents use tools, make decisions, and accomplish multi-step tasks. They're a good fit when you're building agents that need to interact with external systems or complete complex workflows.
Name | Description | When to Use | Example Use Case |
---|---|---|---|
Tool Error | Detects errors or failures during the execution of tools. | When your AI agents use tools and you want to track error rates. | A coding assistant that uses external APIs to run code and must handle and report execution errors appropriately. |
Tool Selection Quality | Evaluates whether the agent selected the most appropriate tools for the task. | When optimizing agent systems for effective tool usage. | A data analysis agent that must choose the right visualization or statistical method based on the data type and user question. |
Action Advancement | Measures how effectively each action advances toward the goal. | When assessing whether an agent is making meaningful progress in multi-step tasks. | A travel planning agent that needs to book flights, hotels, and activities in the correct sequence. |
Action Completion | Determines whether the agent successfully accomplished all of the user's goals. | To assess whether an agent completed the desired goal. | A coding agent tasked with closing engineering tickets. |
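Agentic metrics can only score the steps you log, so agent traces need to record LLM decisions and tool calls. As a rough sketch, assuming the Python SDK's GalileoLogger span API (method and parameter names may vary by version), logging one decision and one tool call might look like this:

```python
# Sketch: log an agent trace with an LLM step and a tool call so that
# agentic metrics (e.g. Tool Selection Quality, Tool Error) have spans
# to evaluate. Project and log stream names are placeholders.
from galileo import GalileoLogger

logger = GalileoLogger(project="agent-demo", log_stream="dev")
logger.start_trace(input="Find flights to Berlin")

# The model decides which tool to call.
logger.add_llm_span(
    input="Find flights to Berlin",
    output='{"tool": "search_flights", "args": {"destination": "BER"}}',
    model="gpt-4o",
)

# The tool executes; failures recorded here feed the Tool Error metric.
logger.add_tool_span(
    input='{"destination": "BER"}',
    output='[{"flight": "LH123", "price": 180}]',
    name="search_flights",
)

logger.conclude(output="Found flight LH123 to Berlin for $180.")
logger.flush()
```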
Expression and Readability metrics assess the style, tone, and clarity of your AI's generated content. They're helpful when you want your AI to communicate clearly, match your brand's voice, or produce content that's easy for users to understand.
Name | Description | When to Use | Example Use Case |
---|---|---|---|
Tone | Evaluates the emotional tone and style of the response. | When the style and tone of AI responses matter for your brand or user experience. | A luxury brand’s customer service chatbot that must maintain a sophisticated, professional tone consistent with the brand image. |
BLEU & ROUGE | Standard NLP metrics for evaluating text generation quality. These metrics are only available in experiments because they require ground truth in your dataset. | When you want to quantitatively assess the similarity between generated and reference texts. | Evaluating the quality of machine-translation or summarization outputs against human-written references. |
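BLEU and ROUGE are openly specified metrics, so you can reproduce the general idea locally. The sketch below uses the third-party sacrebleu and rouge-score packages to illustrate what the scores measure; it is not Galileo's implementation, and exact scores can vary with tokenization settings.

```python
# Illustration of what BLEU and ROUGE measure, using the open-source
# sacrebleu and rouge-score packages (pip install sacrebleu rouge-score).
import sacrebleu
from rouge_score import rouge_scorer

reference = "The cat sat on the mat."
candidate = "A cat was sitting on the mat."

# BLEU: n-gram precision overlap, scored 0-100.
bleu = sacrebleu.sentence_bleu(candidate, [reference])
print(f"BLEU: {bleu.score:.1f}")

# ROUGE-L: longest-common-subsequence precision/recall, reported as F1.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)
print(f"ROUGE-L F1: {scores['rougeL'].fmeasure:.2f}")
```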
Model Confidence metrics show how certain or uncertain your AI model is about its answers. They're useful when you want to flag low-confidence responses for review or improve your system's reliability.
Name | Description | When to Use | Example Use Case |
---|---|---|---|
Uncertainty | Measures the model’s confidence in its generated response. | When you want to understand how certain the model is about its answers. | Flagging responses where the model is unsure, so a human can review them before sending to a user. |
Prompt Perplexity | Evaluates how difficult or unusual the prompt is for the model to process. | When you want to identify prompts that may confuse the model or lead to lower-quality responses. | Detecting outlier prompts in a customer support chatbot to improve prompt engineering. |
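Perplexity is conventionally derived from token log-probabilities: it is the exponential of the negative mean log-probability, so lower values mean the text was more predictable to the model. The sketch below shows that standard calculation; Galileo's exact formulation may differ.

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(-mean log-probability). Lower means the text
    was more predictable (less surprising) to the model."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Token log-probabilities as returned by most LLM APIs (e.g. the OpenAI
# chat completions logprobs option). Values here are illustrative.
logprobs = [-0.12, -0.45, -2.30, -0.08]
print(f"perplexity: {perplexity(logprobs):.2f}")
```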
Response Quality metrics evaluate how well your AI answers user questions and follows instructions. They're especially helpful when you want to ensure your system provides accurate, complete, and relevant responses.
Name | Description | When to Use | Example Use Case |
---|---|---|---|
Chunk Attribution | Assesses whether the response properly attributes information to source documents. | When building RAG systems and you want to ensure proper attribution. | A legal research assistant that must cite specific cases and statutes when providing legal information. |
Chunk Utilization | Measures how effectively the model uses the retrieved chunks in its response. | When optimizing RAG performance to ensure retrieved information is used efficiently. | A technical support chatbot that needs to incorporate relevant product documentation in troubleshooting responses. |
Completeness | Measures whether the response addresses all aspects of the user’s query. | When evaluating if responses fully address the user’s intent. | A healthcare chatbot that must address all symptoms mentioned by a patient when suggesting next steps. |
Context Adherence | Measures how well the response aligns with the provided context. | When you want to ensure the model is grounding its responses in the provided context. | A financial advisor bot that must base investment recommendations on the client’s specific financial situation and goals. |
Context Relevance (Query Adherence) | Evaluates whether the retrieved context is relevant to the user’s query. | When assessing the quality of your retrieval system’s results. | An internal knowledge base search that retrieves company policies relevant to specific employee questions. |
Correctness (Factuality) | Evaluates the factual accuracy of information provided in the response. | When accuracy of information is critical to your application. | A medical information system providing drug interaction details to healthcare professionals. |
Ground Truth Adherence | Measures how well the response aligns with established ground truth. This metric is only available in experiments because it requires ground truth in your dataset. | When evaluating model responses against known correct answers. | A customer service AI that must provide accurate product specifications from an official catalog. |
Instruction Adherence | Assesses whether the model followed the instructions in your prompt template. | When using complex prompts and need to verify the model is following all instructions. | A content generation system that must follow specific brand guidelines and formatting requirements. |
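RAG-focused metrics such as Chunk Attribution and Chunk Utilization can only score the chunks you log, so the retrieval step needs to be recorded alongside the LLM call. A rough sketch, again assuming the Python SDK's GalileoLogger span API (parameter names may differ by version):

```python
# Sketch: record a retrieval step and the subsequent LLM call so that
# RAG metrics (Chunk Attribution, Chunk Utilization, Context Adherence)
# have retrieved documents to evaluate. Names are placeholders.
from galileo import GalileoLogger

logger = GalileoLogger(project="rag-demo", log_stream="dev")
logger.start_trace(input="What is our refund policy?")

# The retrieved chunks logged here are what the RAG metrics score.
logger.add_retriever_span(
    input="What is our refund policy?",
    output=[
        "Refunds are available within 30 days of purchase.",
        "Shipping costs are non-refundable.",
    ],
)

logger.add_llm_span(
    input="What is our refund policy?",
    output="You can request a refund within 30 days of purchase.",
    model="gpt-4o",
)

logger.conclude(output="You can request a refund within 30 days of purchase.")
logger.flush()
```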
Safety and Compliance metrics identify potential risks, harmful content, or compliance issues in your AI's responses. They're important when you need to protect users, meet regulatory requirements, or avoid generating biased or unsafe content.
Name | Description | When to Use | Example Use Case |
---|---|---|---|
PII / CPNI / PHI | Identifies personally identifiable or sensitive information in prompts and responses. | When handling potentially sensitive data or in regulated industries. | A healthcare chatbot that must detect and redact patient information in conversation logs. |
Prompt Injection | Detects attempts to manipulate the model through malicious prompts. | When allowing user input to be processed directly by your AI system. | A public-facing AI assistant that needs protection from users trying to bypass content filters or extract sensitive information. |
Sexism / Bias | Detects gender-based bias or discriminatory content. | When ensuring AI outputs are free from bias and discrimination. | A resume screening assistant that must evaluate job candidates without gender or demographic bias. |
Toxicity | Identifies harmful, offensive, or inappropriate content. | When monitoring AI outputs for harmful content or implementing content filtering. | A social media content moderation system that must detect and flag potentially harmful user-generated content. |
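To make concrete what a PII check looks for, here is a deliberately naive, regex-based illustration. Galileo's detectors are model-based and far more robust; this toy only shows the category of content being flagged, not the product's implementation.

```python
# Toy illustration of the kind of content a PII metric flags.
# Not Galileo's implementation; real detectors are model-based.
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def flag_pii(text: str) -> dict[str, list[str]]:
    """Return naive PII matches found in the text, keyed by type."""
    hits = {}
    for kind, pattern in PII_PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[kind] = found
    return hits

print(flag_pii("Reach me at jane@example.com or 555-867-5309."))
# {'email': ['jane@example.com'], 'phone': ['555-867-5309']}
```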