Metrics Comparison

Galileo provides a suite of pre-built metrics for evaluating different aspects of AI system performance without requiring custom implementation. These metrics span four categories: response quality, agentic capabilities, safety and compliance, and expression and readability. Each metric addresses a specific evaluation need, from measuring factual correctness to detecting potential biases or tracking tool usage effectiveness. The table below outlines all available native metrics, their purposes, and practical use cases to help you select the right measurements for your AI applications.

Name	Category	Description	When to Use	Example Use Case
Action Advancement	Agentic	Measures how effectively each action advances toward the goal.	When assessing whether an agent is making meaningful progress in multi-step tasks.	A travel planning agent that needs to book flights, hotels, and activities in the correct sequence.
Action Completion	Agentic	Measures whether the agent completed the intended action.	When evaluating agent task completion rates and success.	An e-commerce assistant that needs to successfully add items to cart, apply discounts, and complete checkout.
Chunk Attribution	Response Quality	Assesses whether the response properly attributes information to source documents.	When implementing RAG systems where you want to ensure proper attribution.	A legal research assistant that must cite specific cases and statutes when providing legal information.
Chunk Utilization	Response Quality	Measures how effectively the model uses the retrieved chunks in its response.	When optimizing RAG performance to ensure retrieved information is used efficiently.	A technical support chatbot that needs to incorporate relevant product documentation in troubleshooting responses.
Completeness	Response Quality	Measures whether the response addresses all aspects of the user's query.	When evaluating if responses fully address the user's intent.	A healthcare chatbot that must address all symptoms mentioned by a patient when suggesting next steps.
Context Adherence	Response Quality	Measures how well the response aligns with the provided context.	When you want to ensure the model is grounding its responses in the provided context.	A financial advisor bot that must base investment recommendations on the client's specific financial situation and goals.
Context Relevance (Query Adherence)	Response Quality	Evaluates whether the retrieved context is relevant to the user's query.	When assessing the quality of your retrieval system's results.	An internal knowledge base search that retrieves company policies relevant to specific employee questions.
Correctness (factuality)	Response Quality	Evaluates the factual accuracy of information provided in the response.	When accuracy of information is critical to your application.	A medical information system providing drug interaction details to healthcare professionals.
Ground Truth Adherence	Response Quality	Measures how well the response aligns with established ground truth.	When evaluating model responses against known correct answers.	A customer service AI that must provide accurate product specifications from an official catalog.
Instruction Adherence	Response Quality	Assesses whether the model followed the instructions in your prompt template.	When using complex prompts and you need to verify the model is following all instructions.	A content generation system that must follow specific brand guidelines and formatting requirements.
PII / CPNI / PHI	Safety and Compliance	Identifies personally identifiable or sensitive information in prompts and responses.	When handling potentially sensitive data or operating in regulated industries.	A healthcare chatbot that must detect and redact patient information in conversation logs.
Prompt Injection	Safety and Compliance	Detects attempts to manipulate the model through malicious prompts.	When allowing user input to be processed directly by your AI system.	A public-facing AI assistant that needs protection from users trying to bypass content filters or extract sensitive information.
Sexism / Bias	Safety and Compliance	Detects gender-based bias or discriminatory content.	When ensuring AI outputs are free from bias and discrimination.	A resume screening assistant that must evaluate job candidates without gender or demographic bias.
Tool Errors	Agentic	Detects errors or failures during the execution of tools.	When implementing AI agents that use tools and you want to track error rates.	A coding assistant that uses external APIs to run code and must handle and report execution errors appropriately.
Tool Selection Quality	Agentic	Evaluates whether the agent selected the most appropriate tools for the task.	When optimizing agent systems for effective tool usage.	A data analysis agent that must choose the right visualization or statistical method based on the data type and user question.
Toxicity	Safety and Compliance	Identifies harmful, offensive, or inappropriate content.	When monitoring AI outputs for harmful content or implementing content filtering.	A social media content moderation system that must detect and flag potentially harmful user-generated content.
Tone	Expression and Readability	Evaluates the emotional tone and style of the response.	When the style and tone of AI responses matter for your brand or user experience.	A luxury brand's customer service chatbot that must maintain a sophisticated, professional tone consistent with the brand image.
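
One way to work with the table when choosing metrics is to treat it as plain data and filter it by category. The sketch below is illustrative only: it uses no Galileo SDK calls, the names `METRIC_CATEGORIES`, `metrics_by_category`, and `select_metrics` are hypothetical helpers introduced here, and the metric names and categories are copied directly from the table above.

```python
from collections import defaultdict

# Metric name -> category, copied from the table above.
# Plain data for illustration; this is NOT a Galileo SDK object.
METRIC_CATEGORIES = {
    "Action Advancement": "Agentic",
    "Action Completion": "Agentic",
    "Chunk Attribution": "Response Quality",
    "Chunk Utilization": "Response Quality",
    "Completeness": "Response Quality",
    "Context Adherence": "Response Quality",
    "Context Relevance (Query Adherence)": "Response Quality",
    "Correctness (factuality)": "Response Quality",
    "Ground Truth Adherence": "Response Quality",
    "Instruction Adherence": "Response Quality",
    "PII / CPNI / PHI": "Safety and Compliance",
    "Prompt Injection": "Safety and Compliance",
    "Sexism / Bias": "Safety and Compliance",
    "Tool Errors": "Agentic",
    "Tool Selection Quality": "Agentic",
    "Toxicity": "Safety and Compliance",
    "Tone": "Expression and Readability",
}


def metrics_by_category(catalog):
    """Group metric names under their category (hypothetical helper)."""
    grouped = defaultdict(list)
    for name, category in catalog.items():
        grouped[category].append(name)
    # Sort names so the result is stable regardless of insertion order.
    return {category: sorted(names) for category, names in grouped.items()}


def select_metrics(catalog, categories):
    """Return the sorted metric names belonging to any requested category."""
    wanted = set(categories)
    return sorted(name for name, category in catalog.items() if category in wanted)


if __name__ == "__main__":
    # Example: an agent that calls tools and accepts public user input
    # might start from the agentic and safety metrics.
    for metric in select_metrics(
        METRIC_CATEGORIES, ["Agentic", "Safety and Compliance"]
    ):
        print(metric)
```

Starting from a category and then narrowing by the "When to Use" column keeps the selection tied to the evaluation need rather than to individual metric names.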
