Galileo comes with a set of ready-to-use Out-of-the-Box metrics that show you how your AI is performing. With these metrics, you can quickly spot problems, track improvements, and make your AI work better for your users. Each metric applies to particular node types (such as sessions, traces, or different span types). You can extend these metrics with your own custom metrics, created either as LLM-as-a-judge metrics or as code-based metrics.

To calculate Out-of-the-Box metrics or LLM-as-a-judge metrics, you first need to configure an integration with an LLM or with Luna-2. Connect Galileo to your language model by adding your API key on the integrations page from within the Galileo application.

You can further tune metric calculation to your requirements using continuous learning via human feedback (CLHF). This lets you provide feedback in natural language that automatically refines the metrics to better align with your domain or with your expected inputs and outputs. Metrics can be used with both experiments and Log streams.

Configure Galileo for Out-of-the-Box and LLM-as-a-judge metrics

Most Out-of-the-Box metrics and all LLM-as-a-judge metrics are LLM-based metrics.
LLM-based metrics use an LLM to evaluate inputs and outputs. To use these metrics from Log streams or Experiments, you first need to configure an integration with an LLM platform.
  1. Select the user menu - In the Galileo application, select the user menu in the bottom left.
  2. Open the integrations page - Select Integrations from the user menu.
  3. Add an integration - Locate the option for the LLM platform you are using, then select the + Add Integration button.
  4. Add the settings - Set the relevant settings for your integration, such as your API keys or endpoints, then select Save.
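Once the integration is configured, you can select Out-of-the-Box metrics when running an experiment from the Python SDK. The sketch below is a minimal example and makes some assumptions: it uses the Galileo Python SDK's run_experiment and get_dataset helpers, a "correctness" metric identifier, and a GALILEO_API_KEY environment variable. Check the SDK reference for the exact function names and metric identifiers available in your version.

```python
import os
from galileo.experiments import run_experiment
from galileo.datasets import get_dataset

# The SDK reads your Galileo credentials from the environment
# (assumed variable name; use the API key from your Galileo console).
os.environ.setdefault("GALILEO_API_KEY", "your-galileo-api-key")

def call_llm(input: str) -> str:
    # Replace with your actual application logic, for example an OpenAI
    # call routed through the integration you configured above.
    return f"Echo: {input}"

# Run an experiment over a stored dataset and score every output with an
# Out-of-the-Box metric. The LLM integration you added is used to compute
# LLM-based metrics such as correctness.
results = run_experiment(
    "ootb-metrics-demo",
    dataset=get_dataset(name="my-dataset"),
    function=call_llm,
    metrics=["correctness"],
    project="my-project",
)
```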

Using metrics effectively

To get the most value from Galileo’s metrics:
  1. Start with key metrics - Focus on metrics most relevant to your use case
  2. Establish baselines - Understand your current performance before making changes
  3. Track trends over time - Monitor how metrics change as you iterate on your system
  4. Combine multiple metrics - Look at related metrics together for a more complete picture
  5. Set thresholds - Define acceptable ranges for critical metrics (see the sketch after this list)
  6. Improve the metrics - Use CLHF to continuously improve the metrics
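For example, once an experiment or Log stream has produced metric scores, a simple gate like the sketch below can enforce the thresholds you define. The metric names, threshold values, and score dictionary here are purely illustrative assumptions; substitute the metrics and values you actually track.

```python
# Illustrative threshold gate: flag a run if any critical metric falls
# below the acceptable range you have defined for it.
THRESHOLDS = {              # hypothetical metric names and minimum scores
    "correctness": 0.80,
    "context_adherence": 0.75,
    "toxicity": 0.95,       # interpreted here as "share of non-toxic responses"
}

def check_thresholds(scores: dict[str, float]) -> list[str]:
    """Return a human-readable failure for any metric below its threshold."""
    failures = []
    for metric, minimum in THRESHOLDS.items():
        value = scores.get(metric)
        if value is None:
            failures.append(f"{metric}: no score reported")
        elif value < minimum:
            failures.append(f"{metric}: {value:.2f} is below the threshold of {minimum:.2f}")
    return failures

if __name__ == "__main__":
    # These scores would normally come from your experiment results or a
    # Log stream export.
    example_scores = {"correctness": 0.91, "context_adherence": 0.68, "toxicity": 0.99}
    for failure in check_thresholds(example_scores):
        print("FAILED:", failure)
```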

Out-of-the-Box metric categories

Our metrics can be broken down into five key categories, each addressing a specific aspect of AI system performance. Most teams benefit from combining metrics from more than one category, depending on which aspects matter most to them. Galileo also supports custom metrics that can be used alongside the Out-of-the-Box metric options.
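As a rough illustration of a code-based custom metric, the sketch below scores a response on whether it stays within a target length. The function name and signature are assumptions made for illustration only; the exact scorer interface and registration steps are described in the custom metrics documentation.

```python
# Hypothetical code-based custom metric: score how well a response stays
# within a target word budget (1.0 = within budget, falling toward 0.0 as
# it overruns). Adapt the signature to the scorer interface described in
# the custom metrics documentation.
TARGET_WORDS = 150

def conciseness_scorer(response: str) -> float:
    words = len(response.split())
    if words <= TARGET_WORDS:
        return 1.0
    # Linearly penalize overruns, bottoming out at 0.0 at twice the budget.
    return max(0.0, 1.0 - (words - TARGET_WORDS) / TARGET_WORDS)

if __name__ == "__main__":
    print(conciseness_scorer("A short, on-budget answer."))  # 1.0
    print(conciseness_scorer("word " * 300))                 # 0.0
```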

1. Agentic performance metrics

Agentic Performance Metrics evaluate how effectively AI agents perform tasks, use tools, and progress toward goals.
When to use: When building and optimizing AI systems that take actions, make decisions, or use tools to accomplish tasks.
Example use cases:
  • Evaluating a travel planning agent’s ability to book complete itineraries
  • Assessing a coding assistant’s appropriate use of APIs and libraries
  • Measuring a data analysis agent’s tool selection effectiveness

2. Expression and readability metrics

Expression and Readability Metrics assess the style, tone, clarity, and overall presentation of AI-generated content.
When to use: When the format, tone, and presentation of AI outputs are important for user experience or brand consistency.
Example use cases:
  • Ensuring a luxury brand chatbot maintains a sophisticated tone
  • Verifying educational content is presented at the appropriate reading level
  • Measuring clarity and conciseness in technical documentation generation

3. Model confidence metrics

Model Confidence Metrics measure how certain or uncertain your AI model is about its responses.
When to use: When you want to flag low-confidence responses for review, improve system reliability, or better understand model uncertainty.
Example use cases:
  • Flagging uncertain answers in a customer support chatbot for human review
  • Identifying low-confidence predictions in a medical diagnosis assistant
  • Improving user trust by surfacing confidence scores in AI-generated content

4. Response quality metrics

Response Quality Metrics assess the accuracy, completeness, relevance, and overall quality of AI-generated responses.
When to use: When evaluating how well AI systems answer questions, follow instructions, or provide information based on context.
Example use cases:
  • Measuring factual accuracy in a medical information system
  • Evaluating how well a RAG system uses retrieved information
  • Assessing if customer service responses address all parts of a query

5. Safety and compliance metrics

Safety And Compliance Metrics identify potential risks, harmful content, bias, or privacy concerns in AI interactions.
When to use: When ensuring AI systems meet regulatory requirements, protect user privacy, and avoid generating harmful or biased content.
Example use cases:
  • Detecting PII in healthcare chatbot conversations
  • Identifying potential prompt injection attacks in public-facing systems
  • Measuring bias in hiring or loan approval recommendation systems
The Metrics Comparison provides a full list of the Out-of-the-Box metrics for each category.

Next steps