Explore Galileo’s comprehensive out-of-the-box metrics for evaluating and improving AI system performance across multiple dimensions.

Name | Category | Description | When to Use | Example Use Case |
---|---|---|---|---|
Action Advancement | Agentic | Measures how effectively each action advances toward the goal. | When assessing whether an agent is making meaningful progress in multi-step tasks. | A travel planning agent that needs to book flights, hotels, and activities in the correct sequence. |
Action Completion | Agentic | Measures whether the agent completed the intended action. | When evaluating agent task completion rates and success. | An e-commerce assistant that needs to successfully add items to cart, apply discounts, and complete checkout. |
Chunk Attribution | Response Quality | Assesses whether the response properly attributes information to source documents. | When implementing RAG systems and you want to ensure proper attribution. | A legal research assistant that must cite specific cases and statutes when providing legal information. |
Chunk Utilization | Response Quality | Measures how effectively the model uses the retrieved chunks in its response. | When optimizing RAG performance to ensure retrieved information is used efficiently. | A technical support chatbot that needs to incorporate relevant product documentation in troubleshooting responses. |
Completeness | Response Quality | Measures whether the response addresses all aspects of the user’s query. | When evaluating if responses fully address the user’s intent. | A healthcare chatbot that must address all symptoms mentioned by a patient when suggesting next steps. |
Context Adherence | Response Quality | Measures how well the response aligns with the provided context. | When you want to ensure the model is grounding its responses in the provided context. | A financial advisor bot that must base investment recommendations on the client’s specific financial situation and goals. |
Context Relevance (Query Adherence) | Response Quality | Evaluates whether the retrieved context is relevant to the user’s query. | When assessing the quality of your retrieval system’s results. | An internal knowledge base search that retrieves company policies relevant to specific employee questions. |
Correctness (factuality) | Response Quality | Evaluates the factual accuracy of information provided in the response. | When accuracy of information is critical to your application. | A medical information system providing drug interaction details to healthcare professionals. |
Ground Truth Adherence | Response Quality | Measures how well the response aligns with established ground truth. | When evaluating model responses against known correct answers. | A customer service AI that must provide accurate product specifications from an official catalog. |
Instruction Adherence | Response Quality | Assesses whether the model followed the instructions in your prompt template. | When using complex prompts and you need to verify that the model is following all instructions. | A content generation system that must follow specific brand guidelines and formatting requirements. |
PII / CPNI / PHI | Safety and Compliance | Identifies personally identifiable or sensitive information in prompts and responses. | When handling potentially sensitive data or in regulated industries. | A healthcare chatbot that must detect and redact patient information in conversation logs. |
Prompt Injection | Safety and Compliance | Detects attempts to manipulate the model through malicious prompts. | When allowing user input to be processed directly by your AI system. | A public-facing AI assistant that needs protection from users trying to bypass content filters or extract sensitive information. |
Sexism / Bias | Safety and Compliance | Detects gender-based bias or discriminatory content. | When ensuring AI outputs are free from bias and discrimination. | A resume screening assistant that must evaluate job candidates without gender or demographic bias. |
Tool Errors | Agentic | Detects errors or failures during the execution of tools. | When implementing AI agents that use tools and you want to track error rates. | A coding assistant that uses external APIs to run code and must handle and report execution errors appropriately. |
Tool Selection Quality | Agentic | Evaluates whether the agent selected the most appropriate tools for the task. | When optimizing agent systems for effective tool usage. | A data analysis agent that must choose the right visualization or statistical method based on the data type and user question. |
Toxicity | Safety and Compliance | Identifies harmful, offensive, or inappropriate content. | When monitoring AI outputs for harmful content or implementing content filtering. | A social media content moderation system that must detect and flag potentially harmful user-generated content. |
Tone | Expression and Readability | Evaluates the emotional tone and style of the response. | When the style and tone of AI responses matter for your brand or user experience. | A luxury brand’s customer service chatbot that must maintain a sophisticated, professional tone consistent with the brand image. |
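
When deciding which metrics to enable for a given application, it can help to treat the table above as a small lookup structure grouped by category. The following is a minimal sketch in plain Python, not Galileo SDK code: `METRIC_CATEGORIES`, `metrics_by_category`, and `metrics_for` are hypothetical names introduced only for illustration, and the metric names and categories are copied from the table.

```python
from collections import defaultdict

# Metric name -> category, transcribed from the table above.
# This dict is an illustrative assumption, not a Galileo API object.
METRIC_CATEGORIES = {
    "Action Advancement": "Agentic",
    "Action Completion": "Agentic",
    "Chunk Attribution": "Response Quality",
    "Chunk Utilization": "Response Quality",
    "Completeness": "Response Quality",
    "Context Adherence": "Response Quality",
    "Context Relevance (Query Adherence)": "Response Quality",
    "Correctness (factuality)": "Response Quality",
    "Ground Truth Adherence": "Response Quality",
    "Instruction Adherence": "Response Quality",
    "PII / CPNI / PHI": "Safety and Compliance",
    "Prompt Injection": "Safety and Compliance",
    "Sexism / Bias": "Safety and Compliance",
    "Tool Errors": "Agentic",
    "Tool Selection Quality": "Agentic",
    "Toxicity": "Safety and Compliance",
    "Tone": "Expression and Readability",
}


def metrics_by_category(catalogue):
    """Group metric names under their category, preserving table order."""
    grouped = defaultdict(list)
    for name, category in catalogue.items():
        grouped[category].append(name)
    return dict(grouped)


def metrics_for(category, catalogue=METRIC_CATEGORIES):
    """Return the metric names listed under one category."""
    return [name for name, cat in catalogue.items() if cat == category]


if __name__ == "__main__":
    # Example: list the metrics relevant to a tool-using agent.
    for name in metrics_for("Agentic"):
        print(name)
```

For example, a team building a RAG pipeline might start from `metrics_for("Response Quality")` and then narrow the list using the "When to Use" column.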