Galileo AI provides a set of built-in, preset metrics designed to evaluate various aspects of LLM, agent, and retrieval-based workflows. You can also create custom metrics using LLM-as-a-judge, or code. This guide provides a reference for using these metrics in your experiments.

Out-of-the-box metrics reference

The tables below list the constants used in code to access each metric. To use these metrics, import the relevant enum:
from galileo.schema.metrics import GalileoScorers

LLM-as-a-judge Metrics

| Metric | Enum value |
| --- | --- |
| Action Advancement | `GalileoScorers.action_advancement` |
| Action Completion | `GalileoScorers.action_completion` |
| Agent Efficiency | `GalileoScorers.agent_efficiency` |
| Agent Flow | `GalileoScorers.agent_flow` |
| BLEU | `GalileoScorers.bleu` |
| Chunk Attribution Utilization | `GalileoScorers.chunk_attribution_utilization` |
| Completeness | `GalileoScorers.completeness` |
| Context Adherence | `GalileoScorers.context_adherence` |
| Context Relevance (Query Adherence) | `GalileoScorers.context_relevance` |
| Conversation Quality | `GalileoScorers.conversation_quality` |
| Correctness (factuality) | `GalileoScorers.correctness` |
| Ground Truth Adherence | `GalileoScorers.ground_truth_adherence` |
| Instruction Adherence | `GalileoScorers.instruction_adherence` |
| PII (personally identifiable information) | `GalileoScorers.input_pii`, `GalileoScorers.output_pii` |
| Prompt Injection | `GalileoScorers.prompt_injection` |
| Prompt Perplexity | `GalileoScorers.prompt_perplexity` |
| ROUGE | `GalileoScorers.rouge` |
| Sexism / Bias | `GalileoScorers.input_sexism`, `GalileoScorers.output_sexism` |
| Tone | `GalileoScorers.input_tone`, `GalileoScorers.output_tone` |
| Tool Errors | `GalileoScorers.tool_error_rate` |
| Tool Selection Quality | `GalileoScorers.tool_selection_quality` |
| Toxicity | `GalileoScorers.input_toxicity`, `GalileoScorers.output_toxicity` |
| User Intent Change | `GalileoScorers.user_intent_change` |

Luna-2 metrics

If you are using the Galileo Luna-2 model, then use these metric values.
| Metric | Enum value |
| --- | --- |
| Action Advancement | `GalileoScorers.action_advancement_luna` |
| Action Completion | `GalileoScorers.action_completion_luna` |
| Chunk Attribution Utilization | `GalileoScorers.chunk_attribution_utilization_luna` |
| Completeness | `GalileoScorers.completeness_luna` |
| Context Adherence | `GalileoScorers.context_adherence_luna` |
| PII (personally identifiable information) | `GalileoScorers.input_pii`, `GalileoScorers.output_pii` |
| Prompt Injection | `GalileoScorers.prompt_injection_luna` |
| Sexism / Bias | `GalileoScorers.input_sexism_luna`, `GalileoScorers.output_sexism_luna` |
| Tone | `GalileoScorers.input_tone`, `GalileoScorers.output_tone` |
| Tool Errors | `GalileoScorers.tool_error_rate_luna` |
| Tool Selection Quality | `GalileoScorers.tool_selection_quality_luna` |
| Toxicity | `GalileoScorers.input_toxicity_luna`, `GalileoScorers.output_toxicity_luna` |
| Uncertainty | `GalileoScorers.uncertainty` |

How do I use metrics in experiments?

The `run_experiment` function (available in both the Python and TypeScript SDKs) takes a list of metrics as one of its arguments.

Preset metrics

Pass a list of one or more metrics to the `run_experiment` function, as shown below:
import os
from galileo.experiments import run_experiment
from galileo.datasets import get_dataset
from galileo.openai import openai
from galileo.schema.metrics import GalileoScorers

dataset = get_dataset(name="fictional_character_names")

# Define a custom "runner" function for your experiment.
def my_custom_llm_runner(input):
    client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a great storyteller."},
            # Single quotes inside the f-string avoid a SyntaxError on Python < 3.12.
            {"role": "user", "content": f"Write a story about {input['topic']}"},
        ],
    ).choices[0].message.content

# Run the experiment!
results = run_experiment(
    "test-experiment",
    project="my-test-project-1",
    dataset=dataset,
    function=my_custom_llm_runner,
    metrics=[
        # List metrics here
        GalileoScorers.action_advancement,
        GalileoScorers.completeness,
        GalileoScorers.instruction_adherence
    ],
)
For more information, read about running experiments with the Galileo SDKs.

Custom metrics

You can use custom metrics in the same way as Galileo’s preset metrics. At a high level, this involves the following steps:
  1. Create your metric in the Galileo Console (or in code). Your custom metric will return a numerical score based on its input.
  2. Pass the name of your new metric to `run_experiment`, as in the example below.
For example, if you have a metric called "Compliance - do not recommend any financial actions", you would pass it to an experiment like this:
from galileo.experiments import run_experiment

results = run_experiment(
    "finance-experiment",
    dataset=dataset,
    function=llm_call,
    metrics=["Compliance - do not recommend any financial actions"],
    project="my-project",
)
Custom metrics provide the flexibility to define precisely what you want to measure, enabling deep analysis and targeted improvement. For a detailed walkthrough on creating them, see Custom Metrics.
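To illustrate the shape a code-based custom metric takes, here is a hypothetical scorer that returns a numerical score from a model output. The function name and keyword list are purely illustrative, not part of the Galileo SDK; the actual registration API is covered in the Custom Metrics guide.

```python
# Hypothetical compliance scorer: returns 1.0 if the output avoids
# recommending financial actions, 0.0 otherwise. The phrase list is
# an illustrative stand-in for real compliance logic.
FINANCIAL_ACTION_PHRASES = [
    "you should buy",
    "you should sell",
    "invest in",
    "recommend purchasing",
]

def compliance_scorer(output: str) -> float:
    """Score 1.0 when no financial-action phrase appears in the output."""
    lowered = output.lower()
    if any(phrase in lowered for phrase in FINANCIAL_ACTION_PHRASES):
        return 0.0
    return 1.0
```

A real custom metric would follow the same pattern: take model output (and optionally input or context), and return a numerical score.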

Ground truth data

Ground truth is the authoritative, validated answer or label used to benchmark model performance. For LLM metrics, this often means a gold-standard answer, fact, or supporting evidence against which outputs are compared. Metrics that compare outputs directly to a reference answer, label, or fact, such as BLEU, ROUGE, and Ground Truth Adherence, require ground truth data to compute their scores.
These metrics are only supported in experiments, as they require the ground truth to be set in the dataset used by the experiment.
To set the ground truth, set it in the output field of your dataset, either in the Galileo Console or in code.
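To illustrate what a ground-truth comparison involves, here is a minimal exact-match scorer, a simplified stand-in for reference-based metrics like BLEU or ROUGE that compare a model output against the dataset's output (ground truth) field. It is not part of the Galileo SDK.

```python
def exact_match(model_output: str, ground_truth: str) -> float:
    """Return 1.0 when the output matches the ground truth
    (ignoring case and surrounding whitespace), 0.0 otherwise."""
    normalized_output = model_output.strip().lower()
    normalized_truth = ground_truth.strip().lower()
    return 1.0 if normalized_output == normalized_truth else 0.0
```

Real ground-truth metrics are more forgiving than exact match (BLEU and ROUGE score n-gram overlap, for example), but the structure is the same: without a reference value in the dataset, there is nothing to compare against.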

Are metrics LLM-agnostic?

Yes, all metrics are designed to work across any LLM integrated with Galileo.
