Galileo AI provides a set of built-in, preset metrics designed to evaluate various aspects of LLM, agent, and retrieval-based workflows. You can also create custom metrics using LLM-as-a-judge or code. This guide provides a reference for using these metrics in your experiments.

Out-of-the-box metrics reference

The tables below give the constants used in code to access each metric. To use these metrics, import the relevant enum:
from galileo import GalileoMetrics

LLM-as-a-judge Metrics

| Metric | Enum Value |
| --- | --- |
| Action Advancement | `GalileoMetrics.action_advancement` |
| Action Completion | `GalileoMetrics.action_completion` |
| Agent Efficiency | `GalileoMetrics.agent_efficiency` |
| Agent Flow | `GalileoMetrics.agent_flow` |
| BLEU | `GalileoMetrics.bleu` |
| Chunk Attribution Utilization | `GalileoMetrics.chunk_attribution_utilization` |
| Completeness | `GalileoMetrics.completeness` |
| Context Adherence | `GalileoMetrics.context_adherence` |
| Context Precision | `GalileoMetrics.context_precision` |
| Context Relevance (Query Adherence) | `GalileoMetrics.context_relevance` |
| Conversation Quality | `GalileoMetrics.conversation_quality` |
| Correctness (factuality) | `GalileoMetrics.correctness` |
| Ground Truth Adherence | `GalileoMetrics.ground_truth_adherence` |
| Instruction Adherence | `GalileoMetrics.instruction_adherence` |
| PII (personally identifiable information) | `GalileoMetrics.input_pii`, `GalileoMetrics.output_pii` |
| Prompt Injection | `GalileoMetrics.prompt_injection` |
| Prompt Perplexity | `GalileoMetrics.prompt_perplexity` |
| ROUGE | `GalileoMetrics.rouge` |
| Sexism / Bias | `GalileoMetrics.input_sexism`, `GalileoMetrics.output_sexism` |
| Tone | `GalileoMetrics.input_tone`, `GalileoMetrics.output_tone` |
| Tool Errors | `GalileoMetrics.tool_error_rate` |
| Tool Selection Quality | `GalileoMetrics.tool_selection_quality` |
| Reasoning Coherence | `GalileoMetrics.reasoning_coherence` |
| SQL Correctness | `GalileoMetrics.sql_correctness` |
| SQL Adherence | `GalileoMetrics.sql_adherence` |
| SQL Injection | `GalileoMetrics.sql_injection` |
| SQL Efficiency | `GalileoMetrics.sql_efficiency` |
| Toxicity | `GalileoMetrics.input_toxicity`, `GalileoMetrics.output_toxicity` |
| User Intent Change | `GalileoMetrics.user_intent_change` |

Luna-2 metrics

If you are using the Galileo Luna-2 model, then use these metric values.
| Metric | Enum Value |
| --- | --- |
| Action Advancement | `GalileoMetrics.action_advancement_luna` |
| Action Completion | `GalileoMetrics.action_completion_luna` |
| Chunk Attribution Utilization | `GalileoMetrics.chunk_attribution_utilization_luna` |
| Completeness | `GalileoMetrics.completeness_luna` |
| Context Adherence | `GalileoMetrics.context_adherence_luna` |
| PII (personally identifiable information) | `GalileoMetrics.input_pii`, `GalileoMetrics.output_pii` |
| Prompt Injection | `GalileoMetrics.prompt_injection_luna` |
| Sexism / Bias | `GalileoMetrics.input_sexism_luna`, `GalileoMetrics.output_sexism_luna` |
| Tone | `GalileoMetrics.input_tone`, `GalileoMetrics.output_tone` |
| Tool Errors | `GalileoMetrics.tool_error_rate_luna` |
| Tool Selection Quality | `GalileoMetrics.tool_selection_quality_luna` |
| Toxicity | `GalileoMetrics.input_toxicity_luna`, `GalileoMetrics.output_toxicity_luna` |
| Uncertainty | `GalileoMetrics.uncertainty` |

How do I use metrics in experiments?

The run_experiment function (available in both the Python and TypeScript SDKs) takes a list of metrics as one of its arguments.

Preset metrics

Pass a list of one or more metric enum values to the run_experiment function, as shown below:
import os
from galileo.experiments import run_experiment
from galileo.datasets import get_dataset
from galileo.openai import openai
from galileo import GalileoMetrics

dataset = get_dataset(name="fictional_character_names")

# Define a custom "runner" function for your experiment.
def my_custom_llm_runner(input):
    client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a great storyteller."},
            {"role": "user", "content": f"Write a story about {input['topic']}"},
        ],
    ).choices[0].message.content

# Run the experiment!
results = run_experiment(
    "test-experiment",
    project="my-test-project-1",
    dataset=dataset,
    function=my_custom_llm_runner,
    metrics=[
        # List metrics here
        GalileoMetrics.action_advancement,
        GalileoMetrics.completeness,
        GalileoMetrics.instruction_adherence
    ],
)
For more information, read about running experiments with the Galileo SDKs.

Custom metrics

You can use custom metrics in the same way as Galileo’s preset metrics. At a high level, this involves the following steps:
  1. Create your metric in the Galileo Console (or in code). Your custom metric will return a numerical score based on its input.
  2. Pass the name of your new metric to run_experiment, as in the example below.
For example, if you have a metric called "Compliance - do not recommend any financial actions", you would pass it to an experiment like this:
from galileo.experiments import run_experiment

results = run_experiment(
    "finance-experiment",
    dataset=dataset,
    function=llm_call,
    metrics=["Compliance - do not recommend any financial actions"],
    project="my-project",
)
Custom metrics provide the flexibility to define precisely what you want to measure, enabling deep analysis and targeted improvement. For a detailed walkthrough on creating them, see Custom Metrics.
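To illustrate the contract a code-based custom metric follows, here is a minimal sketch. The names (FORBIDDEN, compliance_score) are hypothetical, and registering the metric with Galileo is covered in the Custom Metrics guide; the point is simply that a custom metric takes the model output and returns a numerical score.

```python
# Illustrative sketch only: a code-based custom metric.
# FORBIDDEN and compliance_score are hypothetical names, not Galileo APIs.
FORBIDDEN = ("buy", "sell", "invest")

def compliance_score(output: str) -> float:
    """Return 1.0 if the output recommends no financial actions, else 0.0."""
    text = output.lower()
    return 0.0 if any(word in text for word in FORBIDDEN) else 1.0
```

A binary 0/1 score like this keeps aggregation simple, but a custom metric can return any numerical value your analysis needs.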

Ground truth data

Ground truth is the authoritative, validated answer or label used to benchmark model performance. For LLM metrics, this often means a gold-standard answer, fact, or supporting evidence against which outputs are compared. Metrics that compare the output directly to a reference answer, label, or fact, such as BLEU, ROUGE, and Ground Truth Adherence, require ground truth data to compute their scores.
These metrics are only supported in experiments, as they require the ground truth to be set in the dataset used by the experiment.
To set the ground truth, add it to the output field of your dataset, either in the Galileo Console or in code.
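As a sketch, dataset rows with ground truth look like this: each row's output field holds the reference answer a ground-truth metric compares against. The row contents and dataset name are hypothetical, and the create_dataset call (commented out, since it requires Galileo credentials) is an assumption based on the get_dataset usage earlier in this guide.

```python
# Hypothetical rows for a dataset with ground truth: the "output" field
# of each row is the reference answer the metric compares against.
rows = [
    {"input": "What is the capital of France?", "output": "Paris"},
    {"input": "Who wrote 'Hamlet'?", "output": "William Shakespeare"},
]

# To upload the dataset (requires a configured Galileo API key);
# the exact call may differ in your SDK version:
# from galileo.datasets import create_dataset
# create_dataset(name="qa-with-ground-truth", content=rows)
```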

Are metrics LLM-agnostic?

Yes, all metrics are designed to work across any LLM integrated with Galileo.

Next steps

Metrics Overview

Explore Galileo’s comprehensive metrics framework for evaluating and improving AI system performance across multiple dimensions.

Experiments Overview

Learn how to use datasets and experiments to improve your application.

Run experiments

Learn how to run experiments in Galileo using the Galileo SDKs and custom metrics.