Galileo AI provides a set of built-in, preset metrics designed to evaluate various aspects of LLM, agent, and retrieval-based workflows. This guide provides a reference for Galileo AI’s preset metrics, including their descriptions and SDK slugs.


Metrics Reference

The table below summarizes each metric’s purpose and the exact slug to reference in the SDK.

| Metric | SDK Slug(s) | Description |
| --- | --- | --- |
| Action Advancement | agentic_workflow_success | Measures whether a step or span moves the user closer to their overall goal within the session. |
| Action Completion | agentic_session_success | Assesses if the user’s goal was ultimately achieved at the session or trace level. |
| BLEU | bleu | A case-sensitive measurement of the difference between a model generation and target generation at the sentence level. |
| Chunk Attribution | chunk_attribution_utilization_gpt | Measures whether or not each chunk retrieved in a RAG pipeline had an effect on the model’s response. |
| Chunk Utilization | chunk_attribution_utilization_gpt | Measures the fraction of text in each retrieved chunk that had an impact on the model’s response in a RAG pipeline. |
| Completeness | completeness_gpt | Assesses whether the response covers all necessary aspects of the prompt or question. |
| Context Adherence | context_adherence_gpt | Evaluates if the LLM output is consistent with and grounded in the provided context. |
| Context Relevance (Query Adherence) | context_relevance | Measures whether the retrieved context has enough information to answer the user’s query. |
| Correctness (factuality) | correctness | Evaluates whether the output is factually correct based on available information. |
| Ground Truth Adherence | ground_truth_adherence | Measures semantic equivalence between model output and ground truth, typically using LLM-based judgment. |
| Instruction Adherence | instruction_adherence | Checks if the LLM output follows the explicit instructions given in the prompt. |
| Prompt Injection | prompt_injection_gpt | Measures the presence of prompt injection attacks in inputs to the LLM. |
| Prompt Perplexity | prompt_perplexity | Indicates how “surprising” or difficult the prompt is for the model. |
| ROUGE | rouge | Measures the unigram overlap between model generation and target generation as a single F-1 score. |
| Sexism / Bias | input_sexist_gpt, output_sexist_gpt | Measures how ‘sexist’ an input or output might be perceived, as a value between 0 and 1 (1 being more sexist). |
| Tone | input_tone_gpt, output_tone_gpt | Detects the tone (e.g., polite, neutral, aggressive) of the input/output. |
| Tool Errors | tool_error_rate | Flags errors that occur when an agent or LLM calls a tool (e.g., an API or function call fails). |
| Tool Selection Quality | tool_selection_quality | Determines if the agent/LLM selected the correct tool(s) and provided appropriate arguments. |
| Toxicity | input_toxicity_gpt, output_toxicity_gpt | Measures the presence and severity of harmful, offensive, or abusive language. |

How do I use metrics in the SDK?

The run experiment function (available in both the Python and TypeScript SDKs) takes a list of metrics as one of its arguments.

Preset metrics

Pass a list of one or more metric slugs from the table above to the run_experiment function, as shown below:

from galileo.experiments import run_experiment
from galileo.datasets import get_dataset
from galileo.openai import openai

dataset = get_dataset(name="fictional_character_names")

# Define a custom "runner" function for your experiment.
def my_custom_llm_runner(input):
    return openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a great storyteller."},
            {"role": "user", "content": f"Write a story about {input['topic']}"},
        ],
    ).choices[0].message.content

# Run the experiment!
results = run_experiment(
    "test-experiment",
    project="my-test-project-1",
    dataset=dataset,
    function=my_custom_llm_runner,
    metrics=[
        # List metric slugs here
        "agentic_workflow_success",
        "completeness_gpt",
        "instruction_adherence",
    ],
)

For more information, read about running experiments with the Python or the TypeScript SDK.

Custom metrics

You can use custom metrics in the same way as Galileo’s preset metrics. At a high level, this involves the following steps:

  1. Create your metric in the Galileo console (or in code). Your custom metric will return a numerical score based on its input.
  2. Pass the name of your new metric to the run_experiment function, as in the example below.

In this example, we reference a custom metric that was saved in the console with the name “My custom metric”.

from galileo.experiments import run_experiment
from galileo.datasets import get_dataset
from galileo.openai import openai

dataset = get_dataset(name="fictional_character_names")

# Define a custom "runner" function for your experiment.
def my_custom_llm_runner(input):
    return openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a great storyteller."},
            {"role": "user", "content": f"Write a story about {input['topic']}"},
        ],
    ).choices[0].message.content

# Run the experiment!
results = run_experiment(
    "test-experiment",
    project="my-test-project-1",
    dataset=dataset,
    function=my_custom_llm_runner,
    metrics=[
        "My custom metric",  # List your custom metrics by name here
    ],
)

Custom metrics provide the flexibility to define precisely what you want to measure, enabling deep analysis and targeted improvement. For a detailed walkthrough on creating them, see Custom Metrics.
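
As a rough sketch of the kind of scoring logic a code-based custom metric might implement, the function below takes a model input/output pair and returns a numerical score between 0 and 1. The function name, parameters, and 200-word scoring rule are purely illustrative assumptions, not part of Galileo’s API; see Custom Metrics for how to create and register the metric itself.

# Illustrative scoring logic for a hypothetical custom metric. The function
# name, parameters, and 200-word threshold are assumptions for this sketch,
# not part of the Galileo SDK.
def story_length_score(model_input: str, model_output: str) -> float:
    # Longer stories score closer to 1.0; 200+ words earns full credit.
    word_count = len(model_output.split())
    return min(word_count / 200, 1.0)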


Which metrics require ground truth data?

Ground truth is the authoritative, validated answer or label used to benchmark model performance. For LLM metrics, this often means a gold-standard answer, fact, or supporting evidence against which outputs are compared.

The following metrics require ground truth data to compute their scores, as they involve direct comparison to a reference answer, label, or fact:

  - BLEU (bleu)
  - ROUGE (rouge)
  - Ground Truth Adherence (ground_truth_adherence)

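As a minimal sketch of what ground truth looks like in practice, each dataset row pairs a model input with the reference answer that the generation is compared against. The "input" and "output" field names below are assumptions for illustration, not a required schema:

# Illustrative only: rows pairing an input with its ground truth ("output").
# The field names are assumptions; use the columns your own dataset defines.
rows_with_ground_truth = [
    {"input": "What year did the first Moon landing occur?", "output": "1969"},
    {"input": "Who wrote 'Pride and Prejudice'?", "output": "Jane Austen"},
]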

Are metrics LLM-agnostic?

Yes, all metrics are designed to work across any LLM integrated with Galileo.
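
For example, the preset metric slugs stay exactly the same when the runner targets a different model; only the model name inside the runner changes. The model name below is just an illustration of swapping models, assuming it is available through your integration:

from galileo.experiments import run_experiment
from galileo.datasets import get_dataset
from galileo.openai import openai

dataset = get_dataset(name="fictional_character_names")

# Same metrics, different model: the slugs passed to `metrics` do not change.
def my_other_llm_runner(input):
    return openai.chat.completions.create(
        model="gpt-4o-mini",  # example only; substitute any model you have access to
        messages=[
            {"role": "user", "content": f"Write a story about {input['topic']}"},
        ],
    ).choices[0].message.content

results = run_experiment(
    "test-experiment-alt-model",
    project="my-test-project-1",
    dataset=dataset,
    function=my_other_llm_runner,
    metrics=["completeness_gpt", "instruction_adherence"],
)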

