Explore Galileo’s comprehensive metrics framework for evaluating and improving AI system performance across multiple dimensions.
Category | Description | When to Use | Example Use Cases |
---|---|---|---|
Agentic | Metrics that evaluate how effectively AI agents perform tasks, use tools, and progress toward goals. | When building and optimizing AI systems that take actions, make decisions, or use tools to accomplish tasks. | |
Expression and Readability | Metrics that evaluate the style, tone, clarity, and overall presentation of AI-generated content. | When the format, tone, and presentation of AI outputs are important for user experience or brand consistency. | |
Model Confidence | Metrics that measure how certain or uncertain your AI model is about its responses. | When you want to flag low-confidence responses for review, improve system reliability, or better understand model uncertainty. | |
Response Quality | Metrics that assess the accuracy, completeness, relevance, and overall quality of AI-generated responses. | When evaluating how well AI systems answer questions, follow instructions, or provide information based on context. | |
Safety and Compliance | Metrics that identify potential risks, harmful content, bias, or privacy concerns in AI interactions. | When ensuring AI systems meet regulatory requirements, protect user privacy, and avoid generating harmful or biased content. | |

Agentic metrics

Name | Description | When to Use | Example Use Case |
---|---|---|---|
Action advancement | Measures how effectively each action advances toward the goal. | When assessing whether an agent is making meaningful progress in multi-step tasks. | A travel planning agent that needs to book flights, hotels, and activities in the correct sequence. |
Action completion | Determines whether the agent successfully accomplished all of the user’s goals. | When assessing whether an agent completed the desired goal. | A coding agent tasked with closing engineering tickets. |
Agent efficiency | Determines whether an agent provides a precise answer or resolution to every user request by way of an efficient path. | When assessing whether an agent is taking the most efficient path to a solution. | A complex multi-agent chatbot that needs to respond quickly. |
Agent flow | Measures the correctness and coherence of an agentic trajectory by validating it against user-specified natural language tests. | To assess a multi-agent system, or a system with multiple tools. | An internal process agent that needs to follow strict process rules. |
Conversation quality | A binary metric that assesses whether a chatbot interaction left the user feeling satisfied and positive or frustrated and dissatisfied. | When building customer-facing chatbots. | A health insurance chatbot. |
Intent change | Measures a significant shift in the user’s primary conversational goal or workflow during a session, relative to their initial stated intent. | When you want a holistic view of an entire user session to understand which capabilities the user interacts with. | A multi-purpose chatbot for a bank. |
Tool error | Detects errors or failures during the execution of tools. | When implementing AI agents that use tools and you want to track error rates. | A coding assistant that uses external APIs to run code and must handle and report execution errors appropriately. |
Tool selection quality | Evaluates whether the agent selected the most appropriate tools for the task. | When optimizing agent systems for effective tool usage. | A data analysis agent that must choose the right visualization or statistical method based on the data type and user question. |
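To make the idea of tracking tool error rates concrete, here is a minimal Python sketch. It is not Galileo's implementation: the `ToolCall` record and the `tool_error_rate` helper are hypothetical names, and the sketch assumes you already have a success flag for each tool invocation in a trace.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """Hypothetical record of one tool invocation in an agent trace."""
    tool_name: str
    succeeded: bool


def tool_error_rate(calls: list[ToolCall]) -> dict[str, float]:
    """Return the fraction of failed calls for each tool name."""
    totals: dict[str, int] = {}
    failures: dict[str, int] = {}
    for call in calls:
        totals[call.tool_name] = totals.get(call.tool_name, 0) + 1
        if not call.succeeded:
            failures[call.tool_name] = failures.get(call.tool_name, 0) + 1
    return {name: failures.get(name, 0) / count for name, count in totals.items()}


# Illustrative trace: one of two code executions failed, the search succeeded.
trace = [
    ToolCall("run_code", True),
    ToolCall("run_code", False),
    ToolCall("search_docs", True),
]
rates = tool_error_rate(trace)
print(rates)
```

Aggregating per tool rather than per trace makes it easier to see which tool is responsible for most failures.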

Expression and Readability metrics

Name | Description | When to Use | Example Use Case |
---|---|---|---|
Tone | Evaluates the emotional tone and style of the response. | When the style and tone of AI responses matter for your brand or user experience. | A luxury brand’s customer service chatbot that must maintain a sophisticated, professional tone consistent with the brand image. |
BLEU & ROUGE | Standard NLP metrics for evaluating text generation quality. These metrics are available only for experiments because they require ground truth to be set in your dataset. | When you want to quantitatively assess the similarity between generated and reference texts. | Evaluating the quality of machine-translated or summarization outputs against human-written references. |
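The overlap idea behind BLEU and ROUGE can be sketched in a few lines of Python. This is a simplified illustration, not the full metrics and not Galileo's implementation: it computes only clipped unigram precision (the core of BLEU-1, without the brevity penalty or higher-order n-grams) and unigram recall (the core of ROUGE-1). The function name `unigram_overlap` and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter


def unigram_overlap(candidate: str, reference: str) -> tuple[float, float]:
    """Return (clipped unigram precision, unigram recall) for two texts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token, which clips
    # repeated candidate tokens to the number of times the reference has them.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall


precision, recall = unigram_overlap(
    "the cat sat on the mat",  # generated text
    "the cat is on the mat",   # ground-truth reference
)
print(precision, recall)
```

Both scores need a reference text, which is why metrics of this kind depend on ground truth being present in the dataset.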

Model Confidence metrics

Name | Description | When to Use | Example Use Case |
---|---|---|---|
Uncertainty | Measures the model’s confidence in its generated response. | When you want to understand how certain the model is about its answers. | Flagging responses where the model is unsure, so a human can review them before sending to a user. |
Prompt Perplexity | Evaluates how difficult or unusual the prompt is for the model to process. | When you want to identify prompts that may confuse the model or lead to lower-quality responses. | Detecting outlier prompts in a customer support chatbot to improve prompt engineering. |
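As background for the perplexity-based metric, the sketch below uses the standard definition of perplexity: the exponential of the average negative log-probability per token. How Galileo obtains token log-probabilities and computes its score is not described here, so the `perplexity` helper and the sample values are assumptions for illustration only.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(mean negative log-probability per token)."""
    if not token_logprobs:
        raise ValueError("at least one token log-probability is required")
    avg_negative_logprob = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_negative_logprob)


# Illustrative values: every token predicted with probability 0.9 versus 0.25.
familiar = perplexity([math.log(0.9)] * 4)
unusual = perplexity([math.log(0.25)] * 4)
print(familiar, unusual)
```

Lower values mean the model found the text more predictable. In this example the first sequence scores roughly 1.11 and the second roughly 4.0, so the second would be the candidate outlier.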

Response Quality metrics

Name | Description | When to Use | Example Use Case |
---|---|---|---|
Chunk Attribution Utilization | Assesses whether the response uses the retrieved chunks and properly attributes information to source documents. | When implementing RAG systems and you want to ensure proper attribution and efficient use of retrieved information. | A legal research assistant that must cite specific cases and statutes when providing legal information. |
Completeness | Measures whether the response addresses all aspects of the user’s query. | When evaluating if responses fully address the user’s intent. | A healthcare chatbot that must address all symptoms mentioned by a patient when suggesting next steps. |
Context Adherence | Measures how well the response aligns with the provided context. | When you want to ensure the model is grounding its responses in the provided context. | A financial advisor bot that must base investment recommendations on the client’s specific financial situation and goals. |
Context Relevance (Query Adherence) | Evaluates whether the retrieved context is relevant to the user’s query. | When assessing the quality of your retrieval system’s results. | An internal knowledge base search that retrieves company policies relevant to specific employee questions. |
Correctness (factuality) | Evaluates the factual accuracy of information provided in the response. | When accuracy of information is critical to your application. | A medical information system providing drug interaction details to healthcare professionals. |
Ground Truth Adherence | Measures how well the response aligns with established ground truth. This metric is available only for experiments because it requires ground truth to be set in your dataset. | When evaluating model responses against known correct answers. | A customer service AI that must provide accurate product specifications from an official catalog. |
Instruction Adherence | Assesses whether the model followed the instructions in your prompt template. | When using complex prompts and you need to verify that the model is following all instructions. | A content generation system that must follow specific brand guidelines and formatting requirements. |
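For intuition about what "using a retrieved chunk" can mean, here is a deliberately crude Python sketch based on token overlap. It is a stand-in heuristic, not how Galileo computes its response quality metrics; the `chunk_token_overlap` helper, the whitespace tokenization, and the sample texts are all assumptions.

```python
def chunk_token_overlap(response: str, chunks: list[str]) -> list[float]:
    """For each chunk, return the share of its distinct tokens found in the response."""
    response_tokens = set(response.lower().split())
    scores: list[float] = []
    for chunk in chunks:
        chunk_tokens = set(chunk.lower().split())
        if not chunk_tokens:
            scores.append(0.0)  # an empty chunk cannot contribute anything
            continue
        scores.append(len(chunk_tokens & response_tokens) / len(chunk_tokens))
    return scores


# Illustrative retrieval result: only the first chunk is relevant to the answer.
chunks = [
    "refunds are issued within 14 days",
    "our office is closed on sundays",
]
response = "refunds are issued within 14 days of purchase"
scores = chunk_token_overlap(response, chunks)
print(scores)
```

A chunk with a score near zero was probably retrieved without influencing the answer, which points at the retriever rather than the generator. Lexical overlap misses paraphrases, so treat it only as a rough first signal.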

Safety and Compliance metrics

Name | Description | When to Use | Example Use Case |
---|---|---|---|
PII / CPNI / PHI | Identifies personally identifiable or sensitive information in prompts and responses. | When handling potentially sensitive data or in regulated industries. | A healthcare chatbot that must detect and redact patient information in conversation logs. |
Prompt Injection | Detects attempts to manipulate the model through malicious prompts. | When allowing user input to be processed directly by your AI system. | A public-facing AI assistant that needs protection from users trying to bypass content filters or extract sensitive information. |
Sexism / Bias | Detects gender-based bias or discriminatory content. | When ensuring AI outputs are free from bias and discrimination. | A resume screening assistant that must evaluate job candidates without gender or demographic bias. |
Toxicity | Identifies harmful, offensive, or inappropriate content. | When monitoring AI outputs for harmful content or implementing content filtering. | A social media content moderation system that must detect and flag potentially harmful user-generated content. |
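The detect-and-redact use case for PII can be illustrated with a small regex-based Python sketch. This is not Galileo's detector: the two patterns, the `PII_PATTERNS` table, and the `redact_pii` helper are assumptions for the example, and they cover only simple email addresses and one 10-digit phone format.

```python
import re

# Hypothetical, intentionally narrow patterns; real PII detection needs far more coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact_pii(text: str) -> tuple[str, list[str]]:
    """Replace matches with a [LABEL] placeholder and report which labels were found."""
    found: list[str] = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            found.append(label)
            text = pattern.sub(f"[{label}]", text)
    return text, found


redacted, labels = redact_pii("Reach me at jane.doe@example.com or 555-123-4567.")
print(redacted)
print(labels)
```

Returning the detected labels alongside the redacted text lets you both log a per-category count and store a safe version of the conversation.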