Metrics Overview
Explore Galileo’s comprehensive metrics framework for evaluating and improving AI system performance across multiple dimensions.
Galileo provides a robust set of metrics for evaluating your AI systems. These metrics help you identify issues, understand performance patterns, and implement targeted improvements to your AI applications.
Metric Categories
Our metrics are organized into five key categories, each addressing a specific aspect of AI system performance:
Response Quality
Metrics that evaluate how well your AI system responds to user queries, focusing on accuracy, relevance, and comprehensiveness.
Expression & Readability
Metrics that assess the linguistic quality and readability of AI-generated content, including tone and fluency.
Safety & Compliance
Metrics that identify potential risks related to harmful content, PII exposure, and compliance violations.
Model Confidence
Metrics that measure how confident the model is in its responses, helping identify areas of uncertainty.
Agentic Performance
Metrics specifically designed for evaluating AI agents that use tools to complete tasks.
Response Quality Metrics
These metrics help you understand how well your AI system is responding to user queries:
Completeness
Measures whether the response addresses all aspects of the user’s query.
Correctness
Evaluates the factual accuracy of information provided in the response.
Instruction Adherence
Assesses whether the model followed the instructions in your prompt template.
Ground Truth Adherence
Measures how well the response aligns with established ground truth.
Chunk Relevance
Evaluates whether the retrieved chunks are relevant to the user’s query.
Chunk Attribution
Identifies which retrieved chunks actually influenced the model’s response.
Chunk Utilization
Measures how effectively the model uses the retrieved chunks in its response.
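Galileo computes the chunk-level metrics with its own scorers, but the intuition behind Chunk Relevance is straightforward: score each retrieved chunk against the user’s query. The sketch below approximates this with embedding cosine similarity using the sentence-transformers package; the model name, query, and chunks are illustrative, and this is a rough stand-in rather than Galileo’s actual implementation.

```python
from sentence_transformers import SentenceTransformer, util

# Any sentence-embedding model works; this one is an arbitrary choice.
model = SentenceTransformer("all-MiniLM-L6-v2")

query = "What is the refund policy for annual plans?"
retrieved_chunks = [
    "Annual plans can be refunded within 30 days of purchase.",
    "Our headquarters are located in San Francisco.",
]

# Embed the query and each retrieved chunk, then rank chunks by
# cosine similarity to the query as a crude relevance score.
query_emb = model.encode(query, convert_to_tensor=True)
chunk_embs = model.encode(retrieved_chunks, convert_to_tensor=True)
similarities = util.cos_sim(query_emb, chunk_embs)[0]

for chunk, score in zip(retrieved_chunks, similarities):
    print(f"{score.item():.2f}  {chunk}")
```

The refund chunk scores far higher than the unrelated one, which is exactly the signal a chunk-relevance metric surfaces.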
Safety and Compliance Metrics
These metrics help identify potential risks and compliance issues:
PII Detection
Identifies personally identifiable information in prompts and responses.
Prompt Injection
Detects attempts to manipulate the model through malicious prompts.
Toxicity
Identifies harmful, offensive, or inappropriate content.
Sexism
Detects gender-based bias or discriminatory content.
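Production-grade PII detection (including Galileo’s) relies on trained models rather than pattern matching, but a naive sketch conveys the idea. The patterns and helper below are hypothetical and cover only two PII types:

```python
import re

# Naive, illustrative patterns for two common PII types. Real PII
# detection handles many more types and uses ML models, not regexes.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def find_pii(text: str) -> dict:
    # Return every match for each pattern found in the text.
    matches = {}
    for label, pattern in PII_PATTERNS.items():
        found = pattern.findall(text)
        if found:
            matches[label] = found
    return matches

print(find_pii("Contact jane.doe@example.com or 555-123-4567."))
# {'email': ['jane.doe@example.com'], 'us_phone': ['555-123-4567']}
```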
Model Confidence Metrics
These metrics help you understand the model’s certainty in its responses:
Uncertainty
Measures how uncertain the model is in its generated response, helping flag outputs it may be guessing at.
Prompt Perplexity
Evaluates how difficult or unusual the prompt is for the model to process.
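Galileo computes both of these for you, but the underlying idea is easy to state: perplexity is the exponential of the average negative log-likelihood a language model assigns to a piece of text, so unusual or confusing prompts score higher. A minimal sketch using Hugging Face transformers (GPT-2 is an arbitrary stand-in, not necessarily the model Galileo uses):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model.eval()

def prompt_perplexity(prompt: str) -> float:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean
        # cross-entropy (negative log-likelihood) over the tokens.
        outputs = model(**inputs, labels=inputs["input_ids"])
    # Perplexity is the exponential of the mean negative log-likelihood.
    return torch.exp(outputs.loss).item()

print(prompt_perplexity("Summarize the attached quarterly sales report."))
```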
Agentic Performance Metrics
These metrics are specifically designed for AI agents that use tools:
Tool Error
Detects errors or failures during the execution of tools.
Tool Selection Quality
Evaluates whether the agent selected the most appropriate tools for the task.
Action Advancement
Measures how effectively each action advances toward the goal.
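One common source of tool-error signals is a malformed tool call. As a hypothetical sketch (the tool registry and checker below are invented for illustration and are not part of Galileo’s API), an agent’s tool call can be validated against the tool’s declared arguments:

```python
import json

# Hypothetical tool registry: each tool declares its required arguments.
TOOLS = {
    "get_weather": {"required": ["city"]},
    "send_email": {"required": ["to", "subject", "body"]},
}

def check_tool_call(name: str, raw_args: str) -> list:
    """Return a list of problems with a single tool call (empty = OK)."""
    if name not in TOOLS:
        return [f"unknown tool: {name}"]
    try:
        args = json.loads(raw_args)
    except json.JSONDecodeError:
        return ["arguments are not valid JSON"]
    return [
        f"missing required argument: {field}"
        for field in TOOLS[name]["required"]
        if field not in args
    ]

print(check_tool_call("get_weather", '{"city": "Berlin"}'))  # []
print(check_tool_call("send_email", '{"to": "a@b.com"}'))    # missing fields
```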
Expression and Readability Metrics
These metrics assess the linguistic quality of AI-generated content:
Tone
Evaluates the emotional tone and style of the response.
BLEU & ROUGE
Standard NLP metrics for evaluating text generation quality.
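Both metrics compare generated text against a ground-truth reference: BLEU via n-gram precision, ROUGE via n-gram and longest-common-subsequence overlap. A minimal example using the nltk and rouge-score packages (the reference and candidate strings are illustrative):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "The cat sat on the mat."
candidate = "A cat was sitting on the mat."

# BLEU: n-gram overlap between the candidate and the reference(s).
# Smoothing avoids zero scores on short texts with missing n-grams.
bleu = sentence_bleu(
    [reference.split()],
    candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-1: unigram overlap; ROUGE-L: longest common subsequence.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```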
Using Metrics Effectively
To get the most value from Galileo’s metrics:
- Start with key metrics: focus on the metrics most relevant to your use case
- Establish baselines: understand your current performance before making changes
- Track trends over time: monitor how metrics change as you iterate on your system
- Combine multiple metrics: look at related metrics together for a more complete picture
- Set thresholds: define acceptable ranges for critical metrics, as sketched below
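A threshold check can be as simple as comparing each run’s scores against minimum (or maximum) acceptable values. The metric names and values below are hypothetical:

```python
# Hypothetical metric scores from an evaluation run.
scores = {"correctness": 0.92, "completeness": 0.78, "toxicity": 0.01}

# Illustrative thresholds: floors for quality metrics,
# ceilings for risk metrics.
floors = {"correctness": 0.85, "completeness": 0.80}
ceilings = {"toxicity": 0.05}

failures = [
    name for name, minimum in floors.items()
    if scores.get(name, 0.0) < minimum
] + [
    name for name, maximum in ceilings.items()
    if scores.get(name, 1.0) > maximum
]

if failures:
    raise SystemExit(f"Metric thresholds violated: {failures}")
print("All metric thresholds passed.")
```

Wired into CI, a check like this turns metric regressions into failing builds instead of silent drift.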