Galileo provides a comprehensive suite of pre-built metrics designed to evaluate various aspects of AI system performance without requiring custom implementation. These metrics span four categories: Response Quality, Agentic, Safety and Compliance, and Expression and Readability. Each metric addresses a specific evaluation need, from measuring factual correctness to detecting potential bias or tracking tool-usage effectiveness. The table below outlines all available native metrics, their purposes, and practical use cases to help you select the right measurements for your AI applications.

| Name | Category | Description | When to Use | Example Use Case |
| --- | --- | --- | --- | --- |
| Action Advancement | Agentic | Measures how effectively each action advances toward the goal. | When assessing whether an agent is making meaningful progress in multi-step tasks. | A travel planning agent that needs to book flights, hotels, and activities in the correct sequence. |
| Action Completion | Agentic | Measures whether the agent completed the intended action. | When evaluating agent task completion rates and success. | An e-commerce assistant that needs to successfully add items to cart, apply discounts, and complete checkout. |
| Chunk Attribution | Response Quality | Assesses whether the response properly attributes information to source documents. | When implementing RAG systems and you want to ensure proper attribution. | A legal research assistant that must cite specific cases and statutes when providing legal information. |
| Chunk Utilization | Response Quality | Measures how effectively the model uses the retrieved chunks in its response. | When optimizing RAG performance to ensure retrieved information is used efficiently. | A technical support chatbot that needs to incorporate relevant product documentation in troubleshooting responses. |
| Completeness | Response Quality | Measures whether the response addresses all aspects of the user's query. | When evaluating if responses fully address the user's intent. | A healthcare chatbot that must address all symptoms mentioned by a patient when suggesting next steps. |
| Context Adherence | Response Quality | Measures how well the response aligns with the provided context. | When you want to ensure the model is grounding its responses in the provided context. | A financial advisor bot that must base investment recommendations on the client's specific financial situation and goals. |
| Context Relevance (Query Adherence) | Response Quality | Evaluates whether the retrieved context is relevant to the user's query. | When assessing the quality of your retrieval system's results. | An internal knowledge base search that retrieves company policies relevant to specific employee questions. |
| Correctness (Factuality) | Response Quality | Evaluates the factual accuracy of information provided in the response. | When accuracy of information is critical to your application. | A medical information system providing drug interaction details to healthcare professionals. |
| Ground Truth Adherence | Response Quality | Measures how well the response aligns with established ground truth. | When evaluating model responses against known correct answers. | A customer service AI that must provide accurate product specifications from an official catalog. |
| Instruction Adherence | Response Quality | Assesses whether the model followed the instructions in your prompt template. | When using complex prompts and you need to verify the model is following all instructions. | A content generation system that must follow specific brand guidelines and formatting requirements. |
| PII / CPNI / PHI | Safety and Compliance | Identifies personally identifiable or sensitive information in prompts and responses. | When handling potentially sensitive data or operating in regulated industries. | A healthcare chatbot that must detect and redact patient information in conversation logs. |
| Prompt Injection | Safety and Compliance | Detects attempts to manipulate the model through malicious prompts. | When allowing user input to be processed directly by your AI system. | A public-facing AI assistant that needs protection from users trying to bypass content filters or extract sensitive information. |
| Sexism / Bias | Safety and Compliance | Detects gender-based bias or discriminatory content. | When ensuring AI outputs are free from bias and discrimination. | A resume screening assistant that must evaluate job candidates without gender or demographic bias. |
| Tool Errors | Agentic | Detects errors or failures during the execution of tools. | When implementing AI agents that use tools and you want to track error rates. | A coding assistant that uses external APIs to run code and must handle and report execution errors appropriately. |
| Tool Selection Quality | Agentic | Evaluates whether the agent selected the most appropriate tools for the task. | When optimizing agent systems for effective tool usage. | A data analysis agent that must choose the right visualization or statistical method based on the data type and user question. |
| Toxicity | Safety and Compliance | Identifies harmful, offensive, or inappropriate content. | When monitoring AI outputs for harmful content or implementing content filtering. | A social media content moderation system that must detect and flag potentially harmful user-generated content. |
| Tone | Expression and Readability | Evaluates the emotional tone and style of the response. | When the style and tone of AI responses matter for your brand or user experience. | A luxury brand's customer service chatbot that must maintain a sophisticated, professional tone consistent with the brand image. |
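To build intuition for what a metric like Chunk Utilization captures, the sketch below approximates it with a naive lexical overlap: the fraction of each retrieved chunk's vocabulary that appears in the response. This is purely illustrative, not Galileo's implementation (the native metrics are computed for you), and the `chunk_utilization` function name is hypothetical.

```python
# Illustrative only: a naive lexical approximation of a chunk-utilization
# score, NOT how Galileo computes its native metric. It checks what
# fraction of each retrieved chunk's vocabulary appears in the response.
import re


def _tokens(text: str) -> set:
    """Lowercase word tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def chunk_utilization(response: str, chunks: list) -> float:
    """Mean per-chunk overlap between chunk vocabulary and the response.

    Returns a value in [0, 1]; 1.0 means every chunk's vocabulary is
    fully reflected in the response.
    """
    if not chunks:
        return 0.0
    resp = _tokens(response)
    overlaps = []
    for chunk in chunks:
        words = _tokens(chunk)
        overlaps.append(len(words & resp) / len(words) if words else 0.0)
    return sum(overlaps) / len(overlaps)


# The second chunk goes unused, so utilization lands well below 1.0:
chunks = [
    "Reset the router by holding the power button for ten seconds.",
    "Warranty claims require the original purchase receipt.",
]
response = "Hold the power button for ten seconds to reset the router."
print(round(chunk_utilization(response, chunks), 2))
```

A low score on a sketch like this flags the same situation the native metric targets: chunks were retrieved but their content never made it into the answer, suggesting wasted retrieval budget.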