Galileo AI provides a set of built-in, preset metrics designed to evaluate various aspects of LLM, agent, and retrieval-based workflows. You can also create custom metrics using LLM-as-a-judge, or code. This guide provides a reference for using these metrics in your experiments.

Out-of-the-box metrics reference

The tables below list the constants used in code to access each metric. To use these metrics, import the relevant enum:
from galileo.schema.metrics import GalileoScorers

LLM-as-a-judge Metrics

| Metric | Enum value |
| --- | --- |
| Action Advancement | `GalileoScorers.action_advancement` |
| Action Completion | `GalileoScorers.action_completion` |
| Agent Efficiency | `GalileoScorers.agent_efficiency` |
| Agent Flow | `GalileoScorers.agent_flow` |
| BLEU | `GalileoScorers.bleu` |
| Chunk Attribution Utilization | `GalileoScorers.chunk_attribution_utilization` |
| Completeness | `GalileoScorers.completeness` |
| Context Adherence | `GalileoScorers.context_adherence` |
| Context Relevance (Query Adherence) | `GalileoScorers.context_relevance` |
| Conversation Quality | `GalileoScorers.conversation_quality` |
| Correctness (factuality) | `GalileoScorers.correctness` |
| Ground Truth Adherence | `GalileoScorers.ground_truth_adherence` |
| Instruction Adherence | `GalileoScorers.instruction_adherence` |
| PII (personally identifiable information) | `GalileoScorers.input_pii`, `GalileoScorers.output_pii` |
| Prompt Injection | `GalileoScorers.prompt_injection` |
| Prompt Perplexity | `GalileoScorers.prompt_perplexity` |
| ROUGE | `GalileoScorers.rouge` |
| Sexism / Bias | `GalileoScorers.input_sexism`, `GalileoScorers.output_sexism` |
| Tone | `GalileoScorers.input_tone`, `GalileoScorers.output_tone` |
| Tool Errors | `GalileoScorers.tool_error_rate` |
| Tool Selection Quality | `GalileoScorers.tool_selection_quality` |
| Toxicity | `GalileoScorers.input_toxicity`, `GalileoScorers.output_toxicity` |
| User Intent Change | `GalileoScorers.user_intent_change` |

Luna-2 metrics

If you are using the Galileo Luna-2 model, then use these metric values.
| Metric | Enum value |
| --- | --- |
| Action Advancement | `GalileoScorers.action_advancement_luna` |
| Action Completion | `GalileoScorers.action_completion_luna` |
| Chunk Attribution Utilization | `GalileoScorers.chunk_attribution_utilization_luna` |
| Completeness | `GalileoScorers.completeness_luna` |
| Context Adherence | `GalileoScorers.context_adherence_luna` |
| PII (personally identifiable information) | `GalileoScorers.input_pii`, `GalileoScorers.output_pii` |
| Prompt Injection | `GalileoScorers.prompt_injection_luna` |
| Sexism / Bias | `GalileoScorers.input_sexism_luna`, `GalileoScorers.output_sexism_luna` |
| Tone | `GalileoScorers.input_tone`, `GalileoScorers.output_tone` |
| Tool Errors | `GalileoScorers.tool_error_rate_luna` |
| Tool Selection Quality | `GalileoScorers.tool_selection_quality_luna` |
| Toxicity | `GalileoScorers.input_toxicity_luna`, `GalileoScorers.output_toxicity_luna` |
| Uncertainty | `GalileoScorers.uncertainty` |

How do I use metrics in experiments?

The `run_experiment` function (available in both the Python and TypeScript SDKs) takes a list of metrics as one of its arguments.

Preset metrics

Pass a list of one or more metrics to the `run_experiment` function, as shown below:
import os
from galileo.experiments import run_experiment
from galileo.datasets import get_dataset
from galileo.openai import openai
from galileo.schema.metrics import GalileoScorers

dataset = get_dataset(name="fictional_character_names")

# Define a custom "runner" function for your experiment.
def my_custom_llm_runner(input):
    client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a great storyteller."},
            # Single quotes inside the f-string avoid a SyntaxError on Python < 3.12.
            {"role": "user", "content": f"Write a story about {input['topic']}"},
        ],
    ).choices[0].message.content

# Run the experiment!
results = run_experiment(
    "test-experiment",
    project="my-test-project-1",
    dataset=dataset,
    function=my_custom_llm_runner,
    metrics=[
        # List metrics here
        GalileoScorers.action_advancement,
        GalileoScorers.completeness,
        GalileoScorers.instruction_adherence
    ],
)
For more information, read about running experiments with the Galileo SDKs.

Custom metrics

You can use custom metrics in the same way as Galileo’s preset metrics. At a high level, this involves the following steps:
  1. Create your metric in the Galileo Console (or in code). Your custom metric will return a numerical score based on its input.
  2. Pass the name of your new metric to `run_experiment`, as in the example below.
For example, if you have a metric called "Compliance - do not recommend any financial actions", you would pass it to an experiment like this:
from galileo.experiments import run_experiment

results = run_experiment(
    "finance-experiment",
    dataset=dataset,
    function=llm_call,
    metrics=["Compliance - do not recommend any financial actions"],
    project="my-project",
)
Custom metrics provide the flexibility to define precisely what you want to measure, enabling deep analysis and targeted improvement. For a detailed walkthrough on creating them, see Custom Metrics.
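To illustrate the shape a code-based custom metric takes, here is a hypothetical scorer that returns a numerical score from a model output. The function name and keyword list are purely illustrative, not part of the Galileo SDK; the actual registration API is covered in the Custom Metrics guide.

```python
# Hypothetical compliance scorer: returns 1.0 if the output avoids
# recommending financial actions, 0.0 otherwise. The phrase list is
# an illustrative stand-in for real compliance logic.
FINANCIAL_ACTION_PHRASES = [
    "you should buy",
    "you should sell",
    "invest in",
    "recommend purchasing",
]

def compliance_scorer(output: str) -> float:
    """Score 1.0 when no financial-action phrase appears in the output."""
    lowered = output.lower()
    if any(phrase in lowered for phrase in FINANCIAL_ACTION_PHRASES):
        return 0.0
    return 1.0
```

A real custom metric would follow the same pattern: take model output (and optionally input or context), and return a numerical score.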

Ground truth data

Ground truth is the authoritative, validated answer or label used to benchmark model performance. For LLM metrics, this often means a gold-standard answer, fact, or supporting evidence against which outputs are compared. Metrics that compare outputs directly to a reference answer, label, or fact, such as BLEU, ROUGE, and Ground Truth Adherence, require ground truth data to compute their scores.
These metrics are only supported in experiments, as they require the ground truth to be set in the dataset used by the experiment.
To set the ground truth, set it in the output field of your dataset, either in the Galileo Console or in code.
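To illustrate what a ground-truth comparison involves, here is a minimal exact-match scorer, a simplified stand-in for reference-based metrics like BLEU or ROUGE that compare a model output against the dataset's output (ground truth) field. It is not part of the Galileo SDK.

```python
def exact_match(model_output: str, ground_truth: str) -> float:
    """Return 1.0 when the output matches the ground truth
    (ignoring case and surrounding whitespace), 0.0 otherwise."""
    normalized_output = model_output.strip().lower()
    normalized_truth = ground_truth.strip().lower()
    return 1.0 if normalized_output == normalized_truth else 0.0
```

Real ground-truth metrics are more forgiving than exact match (BLEU and ROUGE score n-gram overlap, for example), but the structure is the same: without a reference value in the dataset, there is nothing to compare against.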

Are metrics LLM-agnostic?

Yes, all metrics are designed to work across any LLM integrated with Galileo.
