Galileo AI provides a set of built-in, preset metrics designed to evaluate various aspects of LLM, agent, and retrieval-based workflows. This guide is a reference for Galileo AI’s preset metrics, including the SDK slugs used to access them in code.

Metrics reference

The table below lists the constants used in code to access each metric. To use these metrics, import the relevant enum:
from galileo.schema.metrics import GalileoScorers
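If you want to see every available preset slug at runtime, you can iterate over the enum. This is a quick sketch that assumes GalileoScorers is a standard Python Enum whose values hold the metric slugs:

from galileo.schema.metrics import GalileoScorers

# Print each preset metric's enum name and its underlying slug.
# Assumes GalileoScorers is a standard Python Enum; .value holds the slug string.
for scorer in GalileoScorers:
    print(scorer.name, "->", scorer.value)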

Metrics

| Metric | Enum Value |
| --- | --- |
| Action Advancement | GalileoScorers.action_advancement |
| Action Completion | GalileoScorers.action_completion |
| BLEU | GalileoScorers.bleu |
| Chunk Attribution | GalileoScorers.chunk_attribution_utilization |
| Chunk Utilization | GalileoScorers.chunk_attribution_utilization |
| Completeness | GalileoScorers.completeness |
| Context Adherence | GalileoScorers.context_adherence |
| Context Relevance (Query Adherence) | GalileoScorers.context_relevance |
| Correctness (factuality) | GalileoScorers.correctness |
| Ground Truth Adherence | GalileoScorers.ground_truth_adherence |
| Instruction Adherence | GalileoScorers.instruction_adherence |
| Prompt Injection | GalileoScorers.prompt_injection |
| Prompt Perplexity | GalileoScorers.prompt_perplexity |
| ROUGE | GalileoScorers.rouge |
| Sexism / Bias | GalileoScorers.input_sexism, GalileoScorers.output_sexism |
| Tone | GalileoScorers.input_tone, GalileoScorers.output_tone |
| Tool Errors | GalileoScorers.tool_error_rate |
| Tool Selection Quality | GalileoScorers.tool_selection_quality |
| Toxicity | GalileoScorers.input_toxicity, GalileoScorers.output_toxicity |
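Note that Sexism / Bias, Tone, and Toxicity have separate input and output variants. Pass both enum values if you want the metric scored on both the user input and the model output, for example:

from galileo.schema.metrics import GalileoScorers

# Score toxicity on both sides of the interaction.
metrics = [
    GalileoScorers.input_toxicity,
    GalileoScorers.output_toxicity,
]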

Luna metrics

If you are using the Galileo Luna-2 model, then use these metric values.
| Metric | Enum Value |
| --- | --- |
| Action Advancement | GalileoScorers.action_advancement_luna |
| Action Completion | GalileoScorers.action_completion_luna |
| Chunk Attribution | GalileoScorers.chunk_attribution_utilization_luna |
| Chunk Utilization | GalileoScorers.chunk_attribution_utilization_luna |
| Completeness | GalileoScorers.completeness_luna |
| Context Adherence | GalileoScorers.context_adherence_luna |
| Prompt Injection | GalileoScorers.prompt_injection_luna |
| Sexism / Bias | GalileoScorers.input_sexism_luna, GalileoScorers.output_sexism_luna |
| Tone | GalileoScorers.input_tone_luna, GalileoScorers.output_tone_luna |
| Tool Errors | GalileoScorers.tool_error_rate_luna |
| Tool Selection Quality | GalileoScorers.tool_selection_quality_luna |
| Toxicity | GalileoScorers.input_toxicity_luna, GalileoScorers.output_toxicity_luna |
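Luna metrics are passed to run_experiment in exactly the same way as the standard presets; only the enum value changes. For example, a sketch using the same metrics argument shown later in this guide:

from galileo.schema.metrics import GalileoScorers

# Same metrics as the standard presets, scored with the Luna-2 model.
luna_metrics = [
    GalileoScorers.context_adherence_luna,
    GalileoScorers.completeness_luna,
]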

How do I use metrics in the SDK?

The run_experiment function (Python, TypeScript) takes a list of metrics as one of its arguments.

Preset metrics

Pass a list of one or more metrics to the run_experiment function, as shown below:
import os
from galileo.experiments import run_experiment
from galileo.datasets import get_dataset
from galileo.openai import openai
from galileo.schema.metrics import GalileoScorers

dataset = get_dataset(name="fictional_character_names")

# Define a custom "runner" function for your experiment.
def my_custom_llm_runner(input):
    client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a great storyteller."},
            {"role": "user", "content": f"Write a story about {input['topic']}"},
        ],
    ).choices[0].message.content

# Run the experiment!
results = run_experiment(
    "test-experiment",
    project="my-test-project-1",
    dataset=dataset,
    function=my_custom_llm_runner,
    metrics=[
        # List metrics here
        GalileoScorers.action_advancement,
        GalileoScorers.completeness,
        GalileoScorers.instruction_adherence,
    ],
)
For more information, read about running experiments with the Galileo SDKs.

Custom metrics

You can use custom metrics in the same way as Galileo’s preset metrics. At a high level, this involves the following steps:
  1. Create your metric in the Galileo Console (or in code). Your custom metric will return a numerical score based on its input.
  2. Pass the name of your new metric to the run_experiment function, as in the example below.
In this example, we reference a custom metric that was saved in the Console with the name My custom metric and the description "counts the length of the input and output".
import os
from galileo.experiments import run_experiment
from galileo.datasets import get_dataset
from galileo.openai import openai

dataset = get_dataset(name="fictional_character_names")

# Define a custom "runner" function for your experiment.
def my_custom_llm_runner(input):
    client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a great storyteller."},
            {"role": "user", "content": f"Write a story about {input['topic']}"},
        ],
    ).choices[0].message.content

# Run the experiment!
results = run_experiment(
    "test-experiment",
    project="my-test-project-1",
    dataset=dataset,
    function=my_custom_llm_runner,
    metrics=[
        "My custom metric",  # List your custom metrics by name here
    ],
)
Custom metrics provide the flexibility to define precisely what you want to measure, enabling deep analysis and targeted improvement. For a detailed walkthrough on creating them, see Custom Metrics.

Ground truth data

Ground truth is the authoritative, validated answer or label used to benchmark model performance. For LLM metrics, this often means a gold-standard answer, fact, or supporting evidence against which outputs are compared. Metrics that score outputs by direct comparison to a reference answer, label, or fact (for example, BLEU, ROUGE, and Ground Truth Adherence) require ground truth data to compute their scores.
These metrics are only supported in experiments, as they require the ground truth to be set in the dataset used by the experiment.
To set the ground truth, set the output field of your dataset, either in the Galileo Console or in code.
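For example, when creating a dataset in code, you might include the ground truth in each record's output field. This is a minimal sketch: it assumes a create_dataset helper in galileo.datasets that accepts a list of records with input and output fields; check the datasets documentation for the exact signature.

from galileo.datasets import create_dataset

# Sketch: each record's "output" holds the ground truth that metrics compare against.
# The create_dataset arguments shown here are assumptions; see the datasets docs.
create_dataset(
    name="fictional_character_names",
    content=[
        {"input": {"topic": "a shy dragon"}, "output": "A story about a shy dragon..."},
        {"input": {"topic": "a lost robot"}, "output": "A story about a lost robot..."},
    ],
)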

Are metrics LLM-agnostic?

Yes, all metrics are designed to work across any LLM integrated with Galileo.
