Galileo AI provides a set of built-in, preset metrics designed to evaluate various aspects of LLM, agent, and retrieval-based workflows. This guide provides a reference for Galileo AI’s preset metrics, including their descriptions and SDK slugs.


Metrics Reference

The table below summarizes each metric’s purpose and the exact slug to reference in the SDK.

| Metric | SDK Slug(s) | Description |
| --- | --- | --- |
| Action Advancement | agentic_workflow_success | Measures whether a step or span moves the user closer to their overall goal within the session. |
| Action Completion | agentic_session_success | Assesses if the user’s goal was ultimately achieved at the session or trace level. |
| BLEU | bleu | A case-sensitive measurement of the difference between a model generation and target generation at the sentence level. |
| Chunk Attribution | chunk_attribution_utilization_gpt | Measures whether or not each chunk retrieved in a RAG pipeline had an effect on the model’s response. |
| Chunk Utilization | chunk_attribution_utilization_gpt | Measures the fraction of text in each retrieved chunk that had an impact on the model’s response in a RAG pipeline. |
| Completeness | completeness_gpt | Assesses whether the response covers all necessary aspects of the prompt or question. |
| Context Adherence | context_adherence_gpt | Evaluates if the LLM output is consistent with and grounded in the provided context. |
| Context Relevance (Query Adherence) | context_relevance | Measures whether the retrieved context has enough information to answer the user’s query. |
| Correctness (factuality) | correctness | Evaluates whether the output is factually correct based on available information. |
| Ground Truth Adherence | ground_truth_adherence | Measures semantic equivalence between model output and ground truth, typically using LLM-based judgment. |
| Instruction Adherence | instruction_adherence | Checks if the LLM output follows the explicit instructions given in the prompt. |
| Prompt Injection | prompt_injection_gpt | Measures the presence of prompt injection attacks in inputs to the LLM. |
| Prompt Perplexity | prompt_perplexity | Indicates how “surprising” or difficult the prompt is for the model. |
| ROUGE | rouge | Measures the unigram overlap between model generation and target generation as a single F-1 score. |
| Sexism / Bias | input_sexist_gpt, output_sexist_gpt | Measures how ‘sexist’ an input or output might be perceived, as a value between 0 and 1 (1 being more sexist). |
| Tone | input_tone_gpt, output_tone_gpt | Detects the tone (e.g., polite, neutral, aggressive) of the input/output. |
| Tool Errors | tool_error_rate | Flags errors that occur when an agent or LLM calls a tool (e.g., an API or function call fails). |
| Tool Selection Quality | tool_selection_quality | Determines if the agent/LLM selected the correct tool(s) and provided appropriate arguments. |
| Toxicity | input_toxicity_gpt, output_toxicity_gpt | Measures the presence and severity of harmful, offensive, or abusive language. |

How do I use metrics in the SDK?

The run experiment function (available in both the Python and TypeScript SDKs) takes a list of metrics as one of its arguments.

Preset metrics

Pass a list of one or more metric slugs from the table above to the run_experiment function, as shown below:

from galileo.experiments import run_experiment
from galileo.datasets import get_dataset
from galileo.openai import openai

dataset = get_dataset(name="fictional_character_names")

# Define a custom "runner" function for your experiment.
def my_custom_llm_runner(input):
    return openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a great storyteller."},
            {"role": "user", "content": f"Write a story about {input['topic']}"},
        ],
    ).choices[0].message.content

# Run the experiment!
results = run_experiment(
    "test-experiment",
    project="my-test-project-1",
    dataset=dataset,
    function=my_custom_llm_runner,
    metrics=[
        # List metric slugs here
        "agentic_workflow_success",
        "completeness_gpt",
        "instruction_adherence",
    ],
)

For more information, read about running experiments with the Python or the TypeScript SDK.

Custom metrics

You can use custom metrics in the same way as Galileo’s preset metrics. At a high level, this involves the following steps:

  1. Create your metric in the Galileo console (or in code). Your custom metric will return a numerical score based on its input.
  2. Pass the name of your new metric to the run_experiment function, as in the example below.

In this example, we reference a custom metric that was saved in the console with the name “My custom metric”.

from galileo.experiments import run_experiment
from galileo.datasets import get_dataset
from galileo.openai import openai

dataset = get_dataset(name="fictional_character_names")

# Define a custom "runner" function for your experiment.
def my_custom_llm_runner(input):
    return openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a great storyteller."},
            {"role": "user", "content": f"Write a story about {input['topic']}"},
        ],
    ).choices[0].message.content

# Run the experiment!
results = run_experiment(
    "test-experiment",
    project="my-test-project-1",
    dataset=dataset,
    function=my_custom_llm_runner,
    metrics=[
        "My custom metric",  # List your custom metrics by name here
    ],
)

Custom metrics provide the flexibility to define precisely what you want to measure, enabling deep analysis and targeted improvement. For a detailed walkthrough on creating them, see Custom Metrics.
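
As a rough sketch of the kind of scoring logic a code-based custom metric might implement, the function below takes a model input/output pair and returns a numerical score between 0 and 1. The function name, parameters, and 200-word scoring rule are purely illustrative assumptions, not part of Galileo’s API; see Custom Metrics for how to create and register the metric itself.

# Illustrative scoring logic for a hypothetical custom metric. The function
# name, parameters, and 200-word threshold are assumptions for this sketch,
# not part of the Galileo SDK.
def story_length_score(model_input: str, model_output: str) -> float:
    # Longer stories score closer to 1.0; 200+ words earns full credit.
    word_count = len(model_output.split())
    return min(word_count / 200, 1.0)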


Which metrics require ground truth data?

Ground truth is the authoritative, validated answer or label used to benchmark model performance. For LLM metrics, this often means a gold-standard answer, fact, or supporting evidence against which outputs are compared.

The following metrics require ground truth data to compute their scores, as they involve direct comparison to a reference answer, label, or fact:

  - BLEU (bleu)
  - ROUGE (rouge)
  - Ground Truth Adherence (ground_truth_adherence)

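As a minimal sketch of what ground truth looks like in practice, each dataset row pairs a model input with the reference answer that the generation is compared against. The "input" and "output" field names below are assumptions for illustration, not a required schema:

# Illustrative only: rows pairing an input with its ground truth ("output").
# The field names are assumptions; use the columns your own dataset defines.
rows_with_ground_truth = [
    {"input": "What year did the first Moon landing occur?", "output": "1969"},
    {"input": "Who wrote 'Pride and Prejudice'?", "output": "Jane Austen"},
]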

Are metrics LLM-agnostic?

Yes, all metrics are designed to work across any LLM integrated with Galileo.
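
For example, the preset metric slugs stay exactly the same when the runner targets a different model; only the model name inside the runner changes. The model name below is just an illustration of swapping models, assuming it is available through your integration:

from galileo.experiments import run_experiment
from galileo.datasets import get_dataset
from galileo.openai import openai

dataset = get_dataset(name="fictional_character_names")

# Same metrics, different model: the slugs passed to `metrics` do not change.
def my_other_llm_runner(input):
    return openai.chat.completions.create(
        model="gpt-4o-mini",  # example only; substitute any model you have access to
        messages=[
            {"role": "user", "content": f"Write a story about {input['topic']}"},
        ],
    ).choices[0].message.content

results = run_experiment(
    "test-experiment-alt-model",
    project="my-test-project-1",
    dataset=dataset,
    function=my_other_llm_runner,
    metrics=["completeness_gpt", "instruction_adherence"],
)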

