- Prompt engineering: Use experiments to iterate on and test different prompts against well-defined inputs or your application code.
- Model selection: Run experiments against different models to compare how each one performs on your specific data, use case, or application.
- Application testing: Experiments are a repeatable way to test your application with known data, for example in unit, integration, or end-to-end testing in a CI/CD pipeline.
Run experiments in playgrounds
Learn about running experiments in the Galileo console using playgrounds and datasets.
Experiments SDK overview
Learn how to run experiments with multiple data points using datasets and prompt templates.
Components of an experiment
An experiment is made up of three components:
- One or more metrics that you want to evaluate. These can be out-of-the-box Galileo metrics, Luna-2 metrics, custom LLM-as-a-judge metrics, or custom code-based metrics.
- A dataset containing well-defined inputs, and optional outputs for evaluations that require ground truth.
- A means of running the dataset against a model and evaluating the response. This can be done by sending prompts with placeholders for data from the dataset, or by running custom application code.
Metrics
When you run an experiment, you define which metrics you want to evaluate against. These can be:
- Out-of-the-box Galileo metrics
- Luna-2 metrics
- Custom LLM-as-a-judge metrics
- Custom code-based metrics
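Out-of-the-box and Luna-2 metrics are referenced by name when you run an experiment, while custom code-based metrics are functions you write yourself. As a rough illustration, here is a minimal sketch of what a code-based scorer could look like; the argument names and the way a scorer is registered are assumptions, so check the custom metrics documentation for the real interface.

```python
# Hypothetical sketch of a custom code-based metric.
# Assumption: a scorer receives the input, the model's response, and the
# dataset's ground truth, and returns a numeric score. The actual
# registration mechanism and signature may differ in the SDK.

def exact_match(input: str, output: str, expected_output: str) -> float:
    """Return 1.0 when the response matches the ground truth exactly."""
    return 1.0 if output.strip() == expected_output.strip() else 0.0
```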
Datasets
Datasets are well-defined sets of data with inputs and optional ground truth values. The inputs can be full prompts, data that is injected into a defined prompt, or inputs to an AI application, such as user inputs to a chatbot. Some metrics require ground truth in order to evaluate the response against it.

Datasets can be created in the Galileo console by uploading a file, using synthetic data generation, or by manually creating rows. You can also create datasets in code, or define datasets inline that are not saved to Galileo. Datasets saved to Galileo are versioned, with the full history maintained when you update a dataset. You can learn more about datasets in our dataset documentation.
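For example, creating a small dataset in code might look like the following sketch. It assumes the Python SDK exposes a create_dataset helper and that rows use input and output fields; treat the exact names as assumptions and check the dataset documentation.

```python
from galileo.datasets import create_dataset  # assumption: SDK helper name

# Each row pairs an input with an optional ground-truth output.
dataset = create_dataset(
    name="geography-questions",
    content=[
        {"input": "Which continent is Spain in?", "output": "Europe"},
        {"input": "Which continent is Japan in?", "output": "Asia"},
    ],
)
```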
Running the experiments
Experiments can be run in several ways, depending on your needs:
- Using a playground
- Running an experiment in code using a prompt template against an LLM
- Running an experiment in code against a custom function in code
- Running an experiment in code against your existing application code
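The two code-based approaches might look like the sketch below. It assumes the Python SDK's run_experiment entry point and get_dataset helper; the parameter names (prompt_template, function, metrics, project) are illustrative, so check the SDK reference for the exact signatures.

```python
from galileo.datasets import get_dataset        # assumption: SDK helper names
from galileo.experiments import run_experiment  # assumption

dataset = get_dataset(name="geography-questions")

# 1. Prompt template: {{input}} is a placeholder that is replaced by each
#    dataset row's input before the prompt is sent to the model.
run_experiment(
    "geography-prompt-experiment",
    dataset=dataset,
    prompt_template="Answer concisely: {{input}}",  # assumption: templating syntax
    metrics=["correctness"],
    project="my-project",
)

# 2. Custom function: each dataset row is passed to your own code, which can
#    call an LLM directly or exercise your full application.
def run_my_app(input: str) -> str:
    # Stand-in for your application code; call your real app or LLM here.
    return f"A concise answer to: {input}"

run_experiment(
    "geography-app-experiment",
    dataset=dataset,
    function=run_my_app,  # assumption: parameter name
    metrics=["correctness"],
    project="my-project",
)
```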
Output of an experiment
When you run an experiment, the output contains:
- The input. When using prompts, the input is the full prompt with dataset values injected into the placeholder variables
- The response from the LLM or your code
- System metrics, such as latency, the number of tokens used, and the estimated cost
- The calculated metrics
Typical experimentation flow
When you start working with your application, you’ll naturally progress from basic testing to more comprehensive evaluation. This journey helps you build confidence in your application’s performance and systematically improve its behavior. As you advance, you’ll find that organizing your test cases into datasets becomes essential for effective experimentation, allowing you to track performance, identify patterns, and measure improvements over time.
1. Initial Testing
Run your application with simple test cases to get a feel for how it performs. This is like taking your first steps:
- Test with straightforward, expected inputs
- Watch how your application behaves in ideal conditions
- Look for any immediate issues or unexpected behaviors
- Get comfortable with your metrics and what they tell you
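For this first step you don’t need a saved dataset; as noted above, datasets can be defined inline. A minimal sketch, under the same run_experiment assumptions as earlier (whether inline rows are accepted for the dataset parameter is itself an assumption):

```python
from galileo.experiments import run_experiment  # assumption: SDK entry point

# A couple of straightforward, expected inputs, defined inline rather than
# saved as a Galileo dataset.
smoke_test_rows = [
    {"input": "Which continent is Spain in?", "output": "Europe"},
    {"input": "Which continent is Japan in?", "output": "Asia"},
]

run_experiment(
    "smoke-test",
    dataset=smoke_test_rows,  # assumption: inline rows accepted here
    prompt_template="Answer concisely: {{input}}",
    metrics=["correctness"],
    project="my-project",
)
```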
2. Expanding Test Coverage
Once you’re comfortable with basic testing, it’s time to broaden your horizons. This is where Galileo’s dataset features become valuable:
- Introduce more complex and varied inputs
- Use Galileo’s datasets to organize and maintain test cases
- Or bring your own dataset if you already have test data
- Run experiments to look for patterns in how your application handles different types of inputs
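If you already have test data, bringing it into Galileo as a versioned dataset might look like this sketch (create_dataset and get_dataset are the same assumed helper names as above; the JSON file and its shape are hypothetical):

```python
import json

from galileo.datasets import create_dataset, get_dataset  # assumption

# Load existing test cases, e.g. [{"input": ..., "output": ...}, ...]
with open("test_cases.json") as f:
    rows = json.load(f)

# Save them as a Galileo dataset so future runs are comparable over time.
create_dataset(name="regression-cases", content=rows)

# Later experiments fetch the same (versioned) dataset by name.
dataset = get_dataset(name="regression-cases")
```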
3. Finding and Fixing Issues
As you test more extensively, you’ll discover areas where your application needs improvement:
- Identify specific inputs that cause problems
- Add these inputs to a dataset
- Look for patterns in problematic cases
- Track how your fixes perform against these problem cases
- Build a library of test cases for regression testing
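Capturing a problem case as soon as you find it might look like the sketch below. Whether dataset objects expose an add_rows method is an assumption; check the dataset documentation for how rows are actually appended.

```python
from galileo.datasets import get_dataset  # assumption: SDK helper name

dataset = get_dataset(name="regression-cases")

# A tricky input discovered during testing, with the expected answer.
# Updating a saved dataset creates a new version, so history is kept.
dataset.add_rows(  # assumption: method name
    [{"input": "Which continent is Türkiye in?", "output": "Asia and Europe"}]
)
```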
4. Continuous Improvement
Now you’re in a cycle of continuous improvement:
- Regularly run tests against your datasets
- Monitor for new issues or patterns
- Quickly identify when changes cause problems
- Maintain datasets that represent your key test cases
- Track your app’s improving performance over time
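In a CI/CD pipeline, this often reduces to a small scheduled script. A sketch under the same SDK assumptions as above; naming each run by date is just one convention for making experiments easy to compare in the console:

```python
# ci_regression.py - hypothetical script run on every merge or nightly.
import datetime

from galileo.datasets import get_dataset        # assumption: SDK helper names
from galileo.experiments import run_experiment  # assumption

# Date-stamped names keep successive runs distinct and easy to compare.
name = f"regression-{datetime.date.today().isoformat()}"

run_experiment(
    name,
    dataset=get_dataset(name="regression-cases"),
    prompt_template="Answer concisely: {{input}}",
    metrics=["correctness"],
    project="my-project",
)
```

Each run then appears alongside earlier ones, so you can compare experiments and spot regressions as your application changes.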
Next steps
Create a dataset
Learn how to create and manage datasets in Galileo.
Run experiments in playgrounds
Learn about running experiments in the Galileo console using playgrounds and datasets.
Run experiments with code
Learn how to run experiments in code using the Galileo SDK.
Compare experiments
Learn how to compare experiments in Galileo.