Experiments let you evaluate prompts, models, and your application code against metrics of your choice, using well-defined inputs. You can use experiments at every stage of your AI application development lifecycle, and they are a key part of evaluation-driven development:
  • Prompt engineering: Use experiments to iterate on and test different prompts against well-defined inputs or against your application code.
  • Model selection: Run experiments against different models for a comprehensive comparison of how each model performs on your specific data, use case, or application.
  • Application testing: Experiments are a way to repeatably test your application with known data, for example in unit, integration, or end-to-end testing in a CI/CD pipeline.
Data science teams often use experiments for prompt engineering and model selection, either through the Galileo console or in code in notebooks. AI engineers use experiments in code to validate prompts and models and to test their applications.

Components of an experiment

An experiment is made up of three components:
  • One or more metrics that you want to evaluate. These can be out-of-the-box Galileo metrics, Luna-2 metrics, custom LLM-as-a-judge metrics, or custom code-based metrics.
  • A dataset containing well-defined inputs, and optional outputs for evaluations that require ground truth.
  • A means of running the dataset against a model and evaluating the response. This can be by sending prompts with placeholders for data from the dataset, or by running custom application code.
You can manage datasets and run experiments either through the playground in the Galileo console or in code. Datasets can be managed through the Galileo console (created in the console or uploaded via code), or managed purely in code.

Metrics

When you run an experiment, you define which metrics you want to evaluate against. These can be out-of-the-box Galileo metrics, Luna-2 metrics, custom LLM-as-a-judge metrics, or custom code-based metrics. You can learn more about metrics in our metrics documentation.
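For example, a custom code-based metric is conceptually just a function that scores a response. The sketch below is a minimal illustration and does not use the Galileo SDK; the function name and signature are assumptions, and the metrics documentation describes how to register real custom metrics.

```python
# Illustrative code-based metric: score 1.0 if the ground truth from the
# dataset row appears in the model's response, otherwise 0.0.
# The name and signature are assumptions for this sketch, not a Galileo API.
def contains_ground_truth(response: str, ground_truth: str) -> float:
    return 1.0 if ground_truth.lower() in response.lower() else 0.0
```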

Datasets

Datasets are well-defined sets of data with inputs and optional ground-truth values. The inputs can be full prompts, data that is injected into a defined prompt, or inputs to an AI application, such as user inputs to a chatbot. Some metrics require ground truth so that the response can be evaluated against it.

Datasets can be created in the Galileo console by uploading a file, using synthetic data generation, or manually creating rows. You can also create datasets in code, or define datasets inline that are not saved to Galileo. Datasets saved to Galileo are versioned, with the full history maintained when you update a dataset. You can learn more about datasets in our dataset documentation.
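As an example, a small inline dataset for a support chatbot might pair user inputs with expected answers. The field names below are illustrative assumptions; use whichever fields your prompts and metrics expect.

```python
# Illustrative inline dataset: each row has an input and an optional
# ground-truth value for metrics that need one. Field names are
# assumptions for this sketch.
dataset = [
    {
        "input": "What is your refund policy?",
        "expected_output": "Refunds are available within 30 days.",
    },
    {
        "input": "How do I reset my password?",
        "expected_output": "Use the 'Forgot password' link on the sign-in page.",
    },
]
```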

Running the experiments

Experiments can be run in multiple ways, depending on your needs: through the playground in the Galileo console, or in code, either by running a prompt with placeholders against a dataset or by running your own application code over the dataset.
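The sketch below shows the shape of a code-driven run using a prompt with placeholders, reusing the dataset and contains_ground_truth metric from the sketches above. The call_model function is a hypothetical stand-in for your LLM client or application code, not a Galileo API; the real SDK also takes care of logging and metric calculation for you.

```python
# Minimal sketch of a code-driven run: fill the prompt template from each
# dataset row, call the model (or your application), and score the response.
# Reuses `dataset` and `contains_ground_truth` from the sketches above.
PROMPT_TEMPLATE = "Answer the customer's question: {input}"

def call_model(prompt: str) -> str:
    # Hypothetical stand-in; replace with a real LLM call or your app's entry point.
    return f"(placeholder response to: {prompt})"

results = []
for row in dataset:
    prompt = PROMPT_TEMPLATE.format(input=row["input"])  # inject dataset values
    response = call_model(prompt)
    score = contains_ground_truth(response, row["expected_output"])
    results.append({"input": prompt, "response": response, "score": score})
```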

Output of an experiment

When you run an experiment, the output contains:
  • The input. When using prompts, this is the full prompt with dataset values injected into the placeholders
  • The response from the LLM or your code
  • System metrics, such as latency, the number of tokens used, and the estimated cost
  • The calculated metrics
If you are using the playground, the output appears in the playground, and you can optionally save it to an experiment log stream. If you run your experiment in code, the output is written to an experiment log stream. These log streams contain one trace per dataset row and are visible in the Experiments tab, where each experiment appears as a single line with average values for the system and evaluated metrics. You can then drill into each experiment to see all of its traces and spans.
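If you want to mirror that per-experiment summary while prototyping outside Galileo, you can average the per-row values yourself. The snippet below assumes the results list from the run sketch above and is purely illustrative.

```python
# Aggregate per-row scores into an experiment-level average, similar to the
# single summary line shown for each experiment in the Experiments tab.
average_score = sum(r["score"] for r in results) / len(results)
print(f"rows: {len(results)}, average score: {average_score:.2f}")
```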

Typical experimentation flow

When you start working with your application, you’ll naturally progress from basic testing to more comprehensive evaluation. This journey helps you build confidence in your application’s performance and systematically improve its behavior. As you advance, you’ll find that organizing your test cases into datasets becomes essential for effective experimentation - allowing you to track performance, identify patterns, and measure improvements over time.

1. Initial Testing

Run your application with simple test cases to get a feel for how it performs. This is like taking your first steps:
  • Test with straightforward, expected inputs
  • Watch how your application behaves in ideal conditions
  • Look for any immediate issues or unexpected behaviors
  • Get comfortable with your metrics and what they tell you
This phase helps you establish a baseline for what “good” looks like.

2. Expanding Test Coverage

Once you’re comfortable with basic testing, it’s time to broaden your horizons. This is where Galileo’s dataset features become valuable:
  • Introduce more complex and varied inputs
  • Use Galileo’s datasets to organize and maintain test cases
  • Or bring your own dataset if you already have test data
  • Run experiments to look for patterns in how your application handles different types of inputs
Think of this as stress-testing your application across a wide range of scenarios.

3. Finding and Fixing Issues

As you test more extensively, you’ll discover areas where your application needs improvement:
  • Identify specific inputs that cause problems
  • Add these inputs to a dataset
  • Look for patterns in problematic cases
  • Track how your fixes perform against these problem cases
  • Build a library of test cases for regression testing
This systematic approach helps you not only fix issues but also prevent them from recurring.
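To keep those regression cases running automatically, for example in the CI/CD pipelines mentioned earlier, a simple test can fail the build when quality drops below a threshold. The sketch below uses pytest and reuses the names from the earlier sketches; the helper and the 0.8 threshold are illustrative assumptions.

```python
# Illustrative regression test for a CI pipeline: re-run the problem-case
# dataset and fail if the average score regresses. Reuses PROMPT_TEMPLATE,
# call_model, contains_ground_truth, and dataset from the sketches above.
def average_score_for(rows) -> float:
    scores = [
        contains_ground_truth(
            call_model(PROMPT_TEMPLATE.format(input=row["input"])),
            row["expected_output"],
        )
        for row in rows
    ]
    return sum(scores) / len(scores)

def test_problem_cases_do_not_regress():
    # 0.8 is an arbitrary example threshold, not a recommendation.
    assert average_score_for(dataset) >= 0.8
```
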
4. Continuous Improvement

Now you’re in a cycle of continuous improvement:
  • Regularly run tests against your datasets
  • Monitor for new issues or patterns
  • Quickly identify when changes cause problems
  • Maintain datasets that represent your key test cases
  • Track your app’s improving performance over time
This ongoing process helps ensure your application keeps getting better while maintaining quality.

Next steps