Experiments let you evaluate prompts, models, and your application code against metrics of your choice, using well-defined inputs. You can use experiments at every stage of your AI application development lifecycle, and they are a key part of evaluation-driven development:
  • Prompt engineering: Use experiments to iterate on and test different prompts against well-defined inputs or against your application code.
  • Model selection: Run experiments against different models for a comprehensive comparison of how each model performs on your specific data, use case, or application.
  • Application testing: Experiments are a way to repeatably test your application with known data, for example in unit, integration, or end-to-end testing in a CI/CD pipeline.
Data science teams often use experiments for prompt engineering and model selection, either through the Galileo console or in code in notebooks. AI engineers use experiments in code to validate prompts and models and to test their applications.

Components of an experiment

An experiment is made up of three components:
  • One or more metrics that you want to evaluate. These can be out-of-the-box Galileo metrics, Luna-2 metrics, custom LLM-as-a-judge metrics, or custom code-based metrics.
  • A dataset containing well-defined inputs, and optional outputs for evaluations that require ground truth.
  • A means of running the dataset against a model and evaluating the response. This can be by sending prompts with placeholders for data from the dataset, or by running custom application code.
You can manage datasets and run experiments either through the playground in the Galileo console or in code. Datasets can be managed through the Galileo console (created in the console or uploaded via code), or managed purely in code.

Metrics

When you run an experiment, you define which metrics you want to evaluate against. These can be out-of-the-box Galileo metrics, Luna-2 metrics, custom LLM-as-a-judge metrics, or custom code-based metrics. You can learn more about metrics in our metrics documentation.
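For example, a custom code-based metric is conceptually just a function that scores a response. The sketch below is a minimal illustration and does not use the Galileo SDK; the function name and signature are assumptions, and the metrics documentation describes how to register real custom metrics.

```python
# Illustrative code-based metric: score 1.0 if the ground truth from the
# dataset row appears in the model's response, otherwise 0.0.
# The name and signature are assumptions for this sketch, not a Galileo API.
def contains_ground_truth(response: str, ground_truth: str) -> float:
    return 1.0 if ground_truth.lower() in response.lower() else 0.0
```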

Datasets

Datasets are well-defined sets of data with inputs and optional ground-truth values. The inputs can be full prompts, data that is injected into a defined prompt, or inputs to an AI application, such as user inputs to a chatbot. Some metrics require ground truth so that the response can be evaluated against it.

Datasets can be created in the Galileo console by uploading a file, using synthetic data generation, or manually creating rows. You can also create datasets in code, or define datasets inline that are not saved to Galileo. Datasets saved to Galileo are versioned, with the full history maintained when you update a dataset. You can learn more about datasets in our dataset documentation.
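As an example, a small inline dataset for a support chatbot might pair user inputs with expected answers. The field names below are illustrative assumptions; use whichever fields your prompts and metrics expect.

```python
# Illustrative inline dataset: each row has an input and an optional
# ground-truth value for metrics that need one. Field names are
# assumptions for this sketch.
dataset = [
    {
        "input": "What is your refund policy?",
        "expected_output": "Refunds are available within 30 days.",
    },
    {
        "input": "How do I reset my password?",
        "expected_output": "Use the 'Forgot password' link on the sign-in page.",
    },
]
```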

Running the experiments

Experiments can be run in multiple ways, depending on your needs: through the playground in the Galileo console, or in code, either by running a prompt with placeholders against a dataset or by running your own application code over the dataset.
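The sketch below shows the shape of a code-driven run using a prompt with placeholders, reusing the dataset and contains_ground_truth metric from the sketches above. The call_model function is a hypothetical stand-in for your LLM client or application code, not a Galileo API; the real SDK also takes care of logging and metric calculation for you.

```python
# Minimal sketch of a code-driven run: fill the prompt template from each
# dataset row, call the model (or your application), and score the response.
# Reuses `dataset` and `contains_ground_truth` from the sketches above.
PROMPT_TEMPLATE = "Answer the customer's question: {input}"

def call_model(prompt: str) -> str:
    # Hypothetical stand-in; replace with a real LLM call or your app's entry point.
    return f"(placeholder response to: {prompt})"

results = []
for row in dataset:
    prompt = PROMPT_TEMPLATE.format(input=row["input"])  # inject dataset values
    response = call_model(prompt)
    score = contains_ground_truth(response, row["expected_output"])
    results.append({"input": prompt, "response": response, "score": score})
```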

Output of an experiment

When you run an experiment, the output contains:
  • The input. When using prompts, this is the full prompt with dataset values injected into the placeholders
  • The response from the LLM or your code
  • System metrics, such as latency, the number of tokens used, and the estimated cost
  • The calculated metrics
If you are using the playground, the output appears in the playground, and you can optionally save it to an experiment log stream. If you run your experiment in code, the output is written to an experiment log stream. These log streams contain one trace per dataset row and are visible in the Experiments tab, where each experiment appears as a single line with average values for the system and evaluated metrics. You can then drill into each experiment to see all of its traces and spans.
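If you want to mirror that per-experiment summary while prototyping outside Galileo, you can average the per-row values yourself. The snippet below assumes the results list from the run sketch above and is purely illustrative.

```python
# Aggregate per-row scores into an experiment-level average, similar to the
# single summary line shown for each experiment in the Experiments tab.
average_score = sum(r["score"] for r in results) / len(results)
print(f"rows: {len(results)}, average score: {average_score:.2f}")
```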

Typical experimentation flow

When you start working with your application, you’ll naturally progress from basic testing to more comprehensive evaluation. This journey helps you build confidence in your application’s performance and systematically improve its behavior. As you advance, you’ll find that organizing your test cases into datasets becomes essential for effective experimentation - allowing you to track performance, identify patterns, and measure improvements over time.

1. Initial Testing

Run your application with simple test cases to get a feel for how it performs. This is like taking your first steps:
  • Test with straightforward, expected inputs
  • Watch how your application behaves in ideal conditions
  • Look for any immediate issues or unexpected behaviors
  • Get comfortable with your metrics and what they tell you
This phase helps you establish a baseline for what “good” looks like.

2. Expanding Test Coverage

Once you’re comfortable with basic testing, it’s time to broaden your horizons. This is where Galileo’s dataset features become valuable:
  • Introduce more complex and varied inputs
  • Use Galileo’s datasets to organize and maintain test cases
  • Or bring your own dataset if you already have test data
  • Run experiments to look for patterns in how your application handles different types of inputs
Think of this as stress-testing your application across a wide range of scenarios.

3. Finding and Fixing Issues

As you test more extensively, you’ll discover areas where your application needs improvement:
  • Identify specific inputs that cause problems
  • Add these inputs to a dataset
  • Look for patterns in problematic cases
  • Track how your fixes perform against these problem cases
  • Build a library of test cases for regression testing
This systematic approach helps you not only fix issues but also prevent them from recurring.
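To keep those regression cases running automatically, for example in the CI/CD pipelines mentioned earlier, a simple test can fail the build when quality drops below a threshold. The sketch below uses pytest and reuses the names from the earlier sketches; the helper and the 0.8 threshold are illustrative assumptions.

```python
# Illustrative regression test for a CI pipeline: re-run the problem-case
# dataset and fail if the average score regresses. Reuses PROMPT_TEMPLATE,
# call_model, contains_ground_truth, and dataset from the sketches above.
def average_score_for(rows) -> float:
    scores = [
        contains_ground_truth(
            call_model(PROMPT_TEMPLATE.format(input=row["input"])),
            row["expected_output"],
        )
        for row in rows
    ]
    return sum(scores) / len(scores)

def test_problem_cases_do_not_regress():
    # 0.8 is an arbitrary example threshold, not a recommendation.
    assert average_score_for(dataset) >= 0.8
```
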
4. Continuous Improvement

Now you’re in a cycle of continuous improvement:
  • Regularly run tests against your datasets
  • Monitor for new issues or patterns
  • Quickly identify when changes cause problems
  • Maintain datasets that represent your key test cases
  • Track your app’s improving performance over time
This ongoing process helps ensure your application keeps getting better while maintaining quality.

Next steps