As you progress from initial testing to systematic evaluation, you’ll want to run experiments to validate your application’s performance and behavior. There are several ways to structure your experiments, starting from the simplest approaches and moving to more sophisticated implementations. Experiments fit into the initial prompt engineering and model selection phases of your app, as well as into application development, such as testing or a CI/CD pipeline, so you can build experiments into your SDLC for evaluation-driven development. AI engineers and data scientists can use experiments in notebooks or simple applications to try out prompts or different models, then add experiments to their production apps so they can be run against complex applications and scenarios, including RAG and agentic flows.

Configure an LLM integration

To calculate out-of-the-box metrics or custom LLM-as-a-judge metrics, you will need to either configure an integration with an LLM or set up the Luna-2 SLM. To configure an LLM, visit the relevant API platform to obtain an API key, then add it using the integrations page in the Galileo Console. If you are using a custom code-based metric, you don’t need an LLM integration.

Experiment flow

The entry point for running experiments is a call to the run experiment function (see the run_experiment Python SDK docs, or the runExperiment TypeScript SDK docs for more details). Experiments take a dataset and pass each row either to a prompt template or to a custom function. This custom function can range from a simple call to an LLM to a full agentic workflow. Experiments also take a list of one or more metrics to use to evaluate the traces; each can be one of the out-of-the-box metrics, referenced using the constants provided by the Galileo SDK, or the name of a custom metric. For each row in the dataset, a new trace is created, and either the prompt template is logged as an LLM span, or every span created in the custom function is logged to that trace.
When you call your application code from an experiment, the experiment runner starts a new session and trace for every row in your dataset. Make sure your application code doesn’t start a new session or trace manually, or conclude or flush the trace. If you are using the log wrapper or a third-party integration, this is handled for you. If you are logging manually, you will need to check whether an experiment is in progress. See the Experiment SDK docs for details on how to do this.
All the traces logged to Galileo can then be evaluated using the metrics of your choice. If you are building experiments into your production application, you will need to provide a way to invoke the experiment runner; for example, you can do this inside a unit test.
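One common pattern for running experiments from a unit test (not specific to the Galileo SDK) is to gate the experiment call behind an environment variable, so it only runs in CI or when explicitly enabled. A minimal sketch, using a hypothetical variable name `RUN_GALILEO_EXPERIMENTS`:

```python
import os

def should_run_experiments() -> bool:
    """Return True only when experiment runs are explicitly enabled,
    for example via the CI pipeline's environment configuration."""
    # RUN_GALILEO_EXPERIMENTS is a hypothetical variable name;
    # use whatever convention fits your pipeline.
    flag = os.environ.get("RUN_GALILEO_EXPERIMENTS", "")
    return flag.strip().lower() in {"1", "true", "yes"}
```

A pytest test could then skip itself when the flag is unset (for example with `@pytest.mark.skipif(not should_run_experiments(), reason="experiments disabled")`) and otherwise call `run_experiment` as shown in the examples that follow.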
Which approach should I use?
| Approach | When to use | Output generation |
| --- | --- | --- |
| Generated output | You already have output from your AI system to evaluate | No LLM generation needed; output already exists in your dataset |
| Prompt template | You want Galileo to generate output using a prompt and an LLM | LLM generates output |
| Custom function | You need to run complex application logic (RAG, agents, multi-step) | Your application generates output |

Run experiments with prompt template

A simple way to get started with experimentation is by evaluating prompts against datasets. This is especially valuable during the initial prompt development and refinement phase, where you want to test different prompt variations. Assuming you’ve previously created a dataset, you can use the following code to run an experiment:
from galileo import Message, MessageRole
from galileo.prompts import create_prompt
from galileo.experiments import run_experiment
from galileo.datasets import get_dataset
from galileo import GalileoMetrics

from dotenv import load_dotenv
load_dotenv()

project = "my-project"

# 1a. If the prompt does not exist, create it:
prompt = create_prompt(
    name="geography-prompt",
    template=[
        Message(role=MessageRole.system,
                content="""
                You are a geography expert.
                Respond with only the continent name.
                """),
        Message(role=MessageRole.user, content="{{input}}")
    ]
)

# 1b. (OPTIONAL) If the prompt already exists, fetch it:
# prompt = get_prompt(name="geography-prompt")

# 2. Run the experiment and get results
results = run_experiment(
    "geography-experiment",
    # Name of a dataset you created
    dataset=get_dataset(name="countries"),
    prompt_template=prompt,
    # Optional
    prompt_settings={
        "max_tokens": 256,
        # Make sure you have an integration set up
        # for the model alias you're using
        "model_alias": "GPT-4o",
        "temperature": 0.8
    },
    metrics=[GalileoMetrics.correctness],
    project=project
)

Run experiments with generated output

Available as of Galileo Python SDK v1.50.1. Bring your own data from any system (production logs, external LLMs, or manual curation) and evaluate it directly.
If you already have output from your AI system, you can evaluate it directly in Galileo without regenerating anything. Unlike the prompt-driven flow where Galileo calls an LLM to generate output, this flow uses the output that already exists in your dataset. You only pay for metric computation. This is ideal for:
  • Evaluating production output: Export traces from your live system, run quality metrics to find issues
  • Comparing model providers: Collect output from OpenAI, Anthropic, and Gemini offline, then score them all in Galileo
  • Regression testing: After improving your RAG pipeline, run the same metrics on new output to see if scores improved
  • A/B testing: Run the same inputs through two different systems, put both outputs in datasets, compare metric scores
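For the A/B-testing case above, the key step is shaping each system's outputs into rows with `input` and `generated_output` columns. A minimal sketch in plain Python (`build_eval_rows` is an illustrative helper, not part of the SDK); each list of rows can then be uploaded with `create_dataset` and scored with the same metrics:

```python
def build_eval_rows(inputs, outputs):
    """Pair each input with the output one system produced, in the
    row shape the generated-output flow expects."""
    if len(inputs) != len(outputs):
        raise ValueError("inputs and outputs must align one-to-one")
    return [
        {"input": question, "generated_output": answer}
        for question, answer in zip(inputs, outputs)
    ]

inputs = ["What is the capital of France?", "Name the largest ocean."]
system_a_rows = build_eval_rows(inputs, ["Paris.", "The Pacific Ocean."])
system_b_rows = build_eval_rows(inputs, ["The capital is Paris.", "Pacific."])
```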

How it works

  1. Create a dataset with an input column and a generated_output column
  2. Call run_experiment without a prompt_template — Galileo detects the generated_output column automatically
  3. Metrics are computed directly on the existing output — no LLM calls needed for generation

Example

This flow is currently supported in the Python SDK (v1.50.1+). TypeScript support uses the same API — omit promptTemplate to use this flow.
Python
from galileo.datasets import create_dataset, get_dataset
from galileo.experiments import run_experiment
from galileo import GalileoMetrics

from dotenv import load_dotenv
load_dotenv()

# Option A: Create a new dataset from local data
dataset = create_dataset(
    name="my-eval-dataset",
    content=[
        {
            "input": "What is the capital of France?",
            "generated_output": "The capital of France is Paris.",
            # Optional — enables Ground Truth Adherence
            "ground_truth": "Paris",
        },
        {
            "input": "Explain quantum computing in one sentence.",
            "generated_output": (
                "Quantum computing uses qubits to perform"
                " calculations exponentially faster than"
                " classical computers."
            ),
            # No ground_truth — other metrics still work
        },
    ],
)

# Option B: Use an existing Galileo dataset (uncomment the line below)
# dataset = get_dataset(name="my-existing-dataset")

# Run experiment — no prompt template needed
results = run_experiment(
    "evaluate-my-output",
    dataset=dataset,
    metrics=[
        GalileoMetrics.completeness,
        GalileoMetrics.context_adherence,
        GalileoMetrics.ground_truth_adherence,  # Uses ground_truth when available
    ],
    project="my-project",
)

Using this flow

You can use this flow in two ways:
  1. Already have a Galileo dataset? Pass it directly to run_experiment(...).
  2. Have local Python data (e.g., a list of dictionaries)? First upload it with create_dataset(...), then pass the returned dataset to run_experiment(...).

Dataset columns

Your dataset needs at minimum an input column and a generated_output column.
| Column | Required | Description |
| --- | --- | --- |
| `input` | Yes | The user query or prompt input |
| `generated_output` | Yes | The output from your AI system. Must have data in at least one of the first 100 rows. |
| `ground_truth` | No | Expected answer, used by the Ground Truth Adherence metric. The SDK also accepts `output` as an alias for backward compatibility. |
| `metadata` | No | Additional context for filtering or grouping |
Column naming: The SDK accepts both ground_truth and output for the reference/expected answer column. Internally they map to the same field. In the Galileo Console and CSV exports, this column is displayed as “Ground Truth”. We recommend using ground_truth in new datasets for clarity.
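If you assemble rows locally and some of them use the legacy output name, a small normalization pass keeps new datasets on the recommended ground_truth name. A minimal sketch (`normalize_row` is an illustrative helper, not an SDK function):

```python
def normalize_row(row: dict) -> dict:
    """Rename the legacy 'output' column to 'ground_truth'.

    The SDK accepts both names for the expected answer, but
    standardizing on 'ground_truth' keeps new datasets consistent.
    """
    row = dict(row)  # avoid mutating the caller's dict
    if "output" in row and "ground_truth" not in row:
        row["ground_truth"] = row.pop("output")
    return row
```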

Flow determination

When you call run_experiment:
  • If you provide a prompt_template, the prompt-driven flow is always used (even if the dataset has a generated_output column).
  • If you omit prompt_template and the dataset has a generated_output column with at least one non-empty value in the first 100 rows, the generated output flow is used.
  • If you omit prompt_template and the dataset has no generated_output column, or the column exists but has no non-empty values in the sampled rows, an error is returned.
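If you are building datasets locally, you can mirror this check before calling run_experiment. A minimal sketch (`uses_generated_output_flow` is an illustrative helper that approximates the rule above, not the SDK's actual implementation):

```python
def uses_generated_output_flow(rows, sample_size=100):
    """Return True if at least one of the first `sample_size` rows
    has a non-empty generated_output value."""
    return any(
        str(row.get("generated_output") or "").strip()
        for row in rows[:sample_size]
    )
```

If this returns False for your rows and you omit prompt_template, expect run_experiment to return an error.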

Run experiments with custom function

Once you’re comfortable with basic prompt testing, you might want to evaluate more complex parts of your app using your datasets. This approach is particularly useful when you have a generation function in your app that takes a set of inputs, which you can model with a dataset. If your experiment runs code that uses the log decorator or a third-party SDK integration, all the spans these create will be logged to the experiment. This example uses the log decorator; the span it creates will be logged to the experiment.
from galileo import log
from galileo.experiments import run_experiment
from galileo.datasets import get_dataset
from galileo import GalileoMetrics

dataset = get_dataset(name="countries")

@log(span_type="llm", name="My Span")
def llm_call(input):
    # Your custom function code goes here
    result = ...  # e.g. the output of a model call
    return result

results = run_experiment(
    "geography-experiment",
    dataset=dataset,
    function=llm_call,
    metrics=[GalileoMetrics.correctness],
    project="my-project",
)
This example uses the OpenAI SDK wrapper. The LLM span created by the wrapper will be logged to the experiment.
import os
from galileo.experiments import run_experiment
from galileo.datasets import get_dataset
from galileo.openai import openai
from galileo import GalileoMetrics

client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
dataset = get_dataset(name="countries")

def llm_call(input):
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[
          {
            "role": "system",
            "content": "You are a geography expert."
          },
          {
            "role": "user",
            "content": f"""
            Which continent does the following country belong to: {input}
            """
          }
        ],
    ).choices[0].message.content

results = run_experiment(
    "geography-experiment",
    dataset=dataset,
    function=llm_call,
    metrics=[GalileoMetrics.correctness],
    project="my-project",
)

Run experiments against complex code with custom functions

Custom functions can be as complex as required, including multiple steps, agents, RAG, and more. This means you can build experiments around an existing application, allowing you to run experiments against the full application you have built, using datasets to mimic user inputs. For example, if you have a multi-agent LangGraph chatbot application, you can run an experiment against it using a dataset to define different user inputs, and log every stage in the agentic flow as part of that experiment.

To enable this, you will need to make some small changes to your application logic to handle the logging context from the experiment. When functions in your application are run by the run_experiment call, a logger is created by the experiment runner and a trace is started. This logger can be passed through the application, or accessed using the @log decorator, by calling galileo_context.get_logger_instance() in Python, or getLogger() in TypeScript. You will need to change your code to use this logger instead of creating a new logger and starting a new trace.

Get an existing logger and check for an existing trace

The Galileo SDK maintains a context that tracks the current logger. You can get this logger with the following code:
from galileo import galileo_context

# Get the current logger
galileo_logger = galileo_context.get_logger_instance()
If there isn’t a current logger, one will be created by this call, so this will always return a logger. Once you have the logger, you can check for an existing trace by accessing the current parent trace from the logger. If this is not set, then there is no active trace.
has_existing_trace = galileo_logger.current_parent() is not None
You can use this to decide if you need to create a new trace in your application. If there is no parent trace, you can safely create a new one.
def process_message(input):
    # Get the Galileo logger instance
    galileo_logger = galileo_context.get_logger_instance()

    # If there is a current parent trace, we are in an experiment
    # Otherwise, we start a new trace for the chat workflow
    is_in_experiment = False
    if not galileo_logger.current_parent():
        galileo_logger.start_trace(
            input=input,
            name="Chat Workflow"
        )
    else:
        is_in_experiment = True

    # Your code goes here to process the input and create log spans as needed
    # You can also pass this log to other functions, or access it in those using
    # galileo_context.get_logger_instance()

    # If we are not in an experiment, we conclude and flush the trace
    if not is_in_experiment:
        galileo_logger.conclude("Some output")
        galileo_logger.flush()
You can then safely call your code from the experiment runner as well as in your normal application logic. When called from the experiment runner, your traces will be logged to that experiment. When called from your application code, the traces will be logged as normal.

Using third-party integrations with experiments

If you are using third-party integrations, there may be some configuration you need to do to make the integrations work with experiments. See the following documentation for more details:

Custom function logging principles

There are a few important principles to understand when logging experiments in code.
  • When running an experiment, a new logger is created for you and set in the Galileo context. If you create a new logger manually in the application code used in your experiment, this logger will not be used in the experiment.
  • To access the logger to manually add traces inside the experiment code, you can call galileo_context.get_logger_instance() (Python) or getLogger() (TypeScript) to get the current logger.
  • To detect if there is an active trace, use the current_parent() (Python) or currentParent (TypeScript) method on the logger. This will return None/undefined if there isn’t an active trace.
  • If your application code creates a logger or starts a trace, make sure this does not happen during an experiment; use the experiment’s logger and trace instead.
  • Every row in a dataset becomes a new trace. Traces you create manually will not be associated with the experiment.
  • Do not conclude or flush the logger in your experiment code; the experiment runner does this for you.

Set metrics for your experiment

When you run an experiment, you need to define which metrics you want to evaluate for each row in the dataset. For out-of-the-box metrics, use the constants provided by the Galileo SDK.
from galileo.experiments import run_experiment
from galileo import GalileoMetrics

results = run_experiment(
    "finance-experiment",
    dataset=dataset,
    function=llm_call,
    metrics=[GalileoMetrics.correctness],
    project="my-project",
)
For custom metrics, use the name you set when you created the metric. For example, if you have a custom LLM-as-a-judge metric called "Compliance - do not recommend any financial actions", you would pass its name to the experiment like this:
from galileo.experiments import run_experiment

results = run_experiment(
    "finance-experiment",
    dataset=dataset,
    function=llm_call,
    metrics=["Compliance - do not recommend any financial actions"],
    project="my-project",
)

Ground truth

For Ground Truth Adherence, you also need to set the ground truth in your dataset. This is set in the ground_truth column.
dataset = [
  {
    "input": "Spain",
    "ground_truth": "Spain is in Europe"
  }
]
If you set the ground_truth column when using other metrics, the value is not used in the metric calculation, but it appears in the Galileo console under the “Dataset Ground Truth” column. This can be helpful for manual review.

Custom dataset evaluation

As your testing needs become more specific, you might need to work with custom or local datasets. This approach is perfect for focused testing of edge cases or when building up your test suite with specific scenarios:
import os
from galileo.experiments import run_experiment
from galileo.openai import openai
from galileo import GalileoMetrics

dataset = [
  {
    "input": "Spain"
  }
]

def llm_call(input):
    client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    return client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": "You are a geography expert."
            },
            {
                "role": "user",
                "content": f"Which continent does the following country belong to: {input}"
            }
        ],
    ).choices[0].message.content

results = run_experiment(
    "geography-experiment",
    dataset=dataset,
    function=llm_call,
    metrics=[GalileoMetrics.correctness],
    project="my-project"
)

Custom metrics for deep analysis

For the most sophisticated level of testing, you might need to track specific aspects of your application’s behavior. Custom metrics provide the flexibility to define precisely what you want to measure, enabling deep analysis and targeted improvement:
import os
from galileo import Trace, Span
from galileo.experiments import run_experiment
from galileo.openai import openai
from galileo.schema.metrics import LocalMetricConfig

# 1. Scorer Function
def brevity_rank(step: Span | Trace) -> str:
    """Rank response brevity based on word count."""
    word_count = len(step.output.content.split(" "))
    if word_count <= 3:
        return "Terse"
    if word_count <= 5:
        return "Temperate"
    return "Talkative"

# 2. Configure the Local Metric
terseness = LocalMetricConfig[str](
    name="Terseness",
    scorer_fn=brevity_rank
)

# 3. Dataset
countries_dataset = [
    {"input": "Indonesia"},
    {"input": "New Zealand"},
    {"input": "Greenland"},
    {"input": "China"},
]

# 4. LLM-Call Function
def llm_call(input):
    client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    return (
        client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": """
                    You are a geography expert. Always answer as succinctly as possible.
                    """
                },
                {
                    "role": "user",
                    "content": f"""
                    Which continent does the following country belong to: {input}
                    """
                },
            ],
        )
        .choices[0]
        .message.content
    )

# 5. Run the Experiment!
results = run_experiment(
    "terseness-custom-metric",
    dataset=countries_dataset,
    function=llm_call,
    metrics=[terseness],  # You can add multiple custom metrics here
    project="My first project",
)
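Because scorer functions are plain Python, you can sanity-check them locally before running a full experiment. A minimal sketch using a stand-in object (`fake_step` is illustrative, not part of the SDK) that exposes the same `.output.content` shape the scorer reads:

```python
from types import SimpleNamespace

def brevity_rank(step) -> str:
    """Same word-count logic as the scorer above."""
    word_count = len(step.output.content.split(" "))
    if word_count <= 3:
        return "Terse"
    if word_count <= 5:
        return "Temperate"
    return "Talkative"

def fake_step(text: str):
    # Stand-in exposing the .output.content attribute the scorer reads
    return SimpleNamespace(output=SimpleNamespace(content=text))

assert brevity_rank(fake_step("Oceania")) == "Terse"
assert brevity_rank(fake_step("It is in South America")) == "Temperate"
```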
Each of these experimentation approaches fits into different stages of your development and testing workflow. As you progress from simple prompt testing to sophisticated custom metrics, Galileo’s experimentation framework provides the tools you need to gather insights and improve your application’s performance at every level of complexity.

Experimenting with agentic and RAG applications

The experimentation framework extends naturally to more complex applications like agentic AI systems and RAG (Retrieval-Augmented Generation) applications. When working with agents, you can evaluate various aspects of their behavior, from decision-making capabilities to tool usage patterns. This is particularly valuable when testing how agents handle complex workflows, multi-step reasoning, or tool selection.

For RAG applications, experimentation helps validate both the retrieval and generation components of your system. You can assess the quality of retrieved context, measure response relevance, and ensure that your RAG pipeline maintains high accuracy across different types of queries. This is especially important when fine-tuning retrieval parameters or testing different reranking strategies.

The same experimentation patterns shown above apply to these more complex systems. You can use predefined datasets to benchmark performance, create custom datasets for specific edge cases, and define specialized metrics that capture the unique aspects of agent behavior or RAG performance. This systematic approach to testing helps ensure that your advanced AI applications maintain high quality and reliability in production environments.

Best practices

  1. Use consistent datasets: Use the same dataset when comparing different prompts or models to ensure fair comparisons.
  2. Test multiple variations: Run experiments with different prompt variations to find the best approach.
  3. Use appropriate metrics: Choose metrics that are relevant to your specific use case.
  4. Start small: Begin with a small dataset to quickly iterate and refine your approach before scaling up.
  5. Document your experiments: Keep track of what you’re testing and why to make it easier to interpret results.

Next steps

Experiments SDK

Metrics