- Prompt engineering: Use experiments to iterate on and test different prompts against well-defined inputs or against your application code.
- Model selection: Experiments can be run against different models to provide a comprehensive way to compare each model against your specific data, use case, or application.
- Application testing: Experiments give you a repeatable way to test your application with known data, for example in unit, integration, or end-to-end testing in a CI/CD pipeline.
Components of an experiment
An experiment is made up of three components:
- One or more metrics that you want to evaluate.
- A dataset containing well-defined inputs, and optional outputs for evaluations that require ground truth.
- A means of running the dataset against a model and evaluating the response. This can be done by sending prompts with placeholders for data from the dataset, or by running custom application code.
Metrics
When you run an experiment, you define which metrics you want to evaluate against. These can be:
- Out-of-the-box Galileo metrics
- Luna-2 metrics
- Custom LLM-as-a-judge metrics
- Custom code-based metrics
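For example, a custom code-based metric can be as simple as a plain Python function that scores each response. The sketch below is illustrative only: the `(input, output)` signature and the way the function is passed alongside built-in metric names are assumptions about the local-metric interface, so check the metrics reference linked under Next steps for the exact shape.

```python
# A minimal sketch of a custom code-based metric as a plain Python function.
# The (input, output) signature and float return value are assumptions about
# the local-metric interface.

def response_length_score(input: str, output: str) -> float:
    """Return 1.0 if the response stays under 500 characters, else 0.0."""
    return 1.0 if len(output) <= 500 else 0.0

# When running an experiment in code, a function like this would be passed in
# the metrics list alongside built-in metric names, for example:
# metrics=["correctness", response_length_score]
```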
Datasets
Datasets are well-defined sets of data with inputs and optional ground truth values. The inputs can be full prompts, data that is injected into a defined prompt, or inputs to an AI application, such as user inputs to a chatbot. Some metrics require ground truth so that the response can be evaluated against it.

Datasets can be created in the Galileo console by uploading a file, using synthetic data generation, or by manually creating rows. You can also create datasets in code, or define datasets inline that are not saved to Galileo. Datasets saved to Galileo are versioned, with the full history maintained when you update a dataset. You can learn more about datasets in our dataset documentation.
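If you want to create a dataset in code, a sketch like the one below illustrates the idea; the `create_dataset` function, its `name` and `content` parameters, and the row shape are assumptions about the Python SDK, so treat the dataset documentation as the authoritative reference.

```python
# A minimal sketch of creating a dataset in code (assumed SDK interface).
from galileo.datasets import create_dataset

dataset = create_dataset(
    name="geography-questions",
    content=[
        # Each row has an input, plus an optional ground-truth output for
        # metrics that need it.
        {"input": "Which continent is Spain in?", "output": "Europe"},
        {"input": "Which continent is Japan in?", "output": "Asia"},
    ],
)
```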
Running the experiments
Experiments can be run in multiple ways, depending on your needs:
- Using a playground
- Running an experiment in code using a prompt template against an LLM
- Running an experiment in code against a custom function
- Running an experiment in code against your existing application code
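As a rough sketch of what running an experiment in code can look like, the example below runs a dataset first against a prompt template and then against a custom function. The `run_experiment` import path and its `dataset`, `prompt_template`, `function`, `metrics`, and `project` arguments are assumptions here; the Run experiments in code guide under Next steps is the authoritative reference.

```python
# A sketch of running an experiment in code; argument names are assumptions
# and may differ from your SDK version.
from galileo.datasets import get_dataset
from galileo.experiments import run_experiment

dataset = get_dataset(name="geography-questions")

# Variant 1: run a prompt template against an LLM, with each dataset row
# injected into the {{input}} placeholder.
run_experiment(
    "geography-prompt-experiment",
    dataset=dataset,
    prompt_template="Answer in one word: {{input}}",
    metrics=["correctness"],
    project="my-project",
)

# Variant 2: run the dataset through your own application code. Each dataset
# row is passed to the function, and the returned value is evaluated.
def answer_with_my_app(input: str) -> str:
    # Call your application or LLM pipeline here; this stub just echoes.
    return f"You asked: {input}"

run_experiment(
    "geography-app-experiment",
    dataset=dataset,
    function=answer_with_my_app,
    metrics=["correctness"],
    project="my-project",
)
```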
Output of an experiment
When you run an experiment, the output contains:
- The input. When using prompts, the input is the full prompt with dataset values injected as variables
- The response from the LLM or your code
- System metrics, such as latency, the number of tokens used, and the estimated cost
- The calculated metrics
Logging experiments
Experiments belong to projects, with one project containing many experiments. Each experiment has a single Log stream with multiple traces. When you use a dataset with an experiment, each row in the dataset is logged as a separate trace in the experiment’s Log stream. Before you start planning your experiments, make sure you have created the relevant project to run them in.

Typical experimentation flow
When you start working with your application, you’ll naturally progress from basic testing to more comprehensive evaluation. This journey helps you build confidence in your application’s performance and systematically improve its behavior. As you advance, you’ll find that organizing your test cases into datasets becomes essential for effective experimentation, allowing you to track performance, identify patterns, and measure improvements over time.

1. Initial Testing
Run your application with simple test cases to get a feel for how it performs. This is like taking your first steps:
- Test with straightforward, expected inputs
- Watch how your application behaves in ideal conditions
- Look for any immediate issues or unexpected behaviors
- Get comfortable with your metrics and what they tell you
2. Expanding Test Coverage
Once you’re comfortable with basic testing, it’s time to broaden your horizons. This is where Galileo’s dataset features become valuable:
- Introduce more complex and varied inputs
- Use Galileo’s datasets to organize and maintain test cases
- Or bring your own dataset if you already have test data
- Run experiments to look for patterns in how your application handles different types of inputs
3. Finding and Fixing Issues
As you test more extensively, you’ll discover areas where your application needs improvement:
- Identify specific inputs that cause problems
- Add these inputs to a dataset
- Look for patterns in problematic cases
- Track how your fixes perform against these problem cases
- Build a library of test cases for regression testing
4. Continuous Improvement
Now you’re in a cycle of continuous improvement:
- Regularly run tests against your datasets
- Monitor for new issues or patterns
- Quickly identify when changes cause problems
- Maintain datasets that represent your key test cases
- Track your app’s improving performance over time
Initial setup
To log experiments to Galileo, you need to configure the SDK to connect to Galileo using an API key, and optionally a URL for a custom deployment, as well as set the project name to log the experiments to.

API key
To get started running experiments with Galileo, you need to configure your API key, and optionally the URL of your Galileo deployment if you are using a custom-hosted or self-deployed version. These are set as environment variables. In development, you can use a .env file for these; for a production deployment, make sure you configure these correctly for your deployment platform.
If you are using the free version of Galileo, there is no need to set the GALILEO_CONSOLE_URL environment variable.

| Environment variable | Description |
|---|---|
| GALILEO_API_KEY | Your Galileo API key. |
| GALILEO_CONSOLE_URL | For custom Galileo deployments only, set this to the URL of your Galileo Console to log to. If this is not set, it will default to the hosted Galileo version at app.galileo.ai. |
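For local development, a .env file with these variables might look like the sketch below; the values are placeholders, and the GALILEO_CONSOLE_URL line applies only to custom deployments.

```
# .env for local development (placeholder values)
GALILEO_API_KEY=your-api-key-here

# Only needed for custom or self-hosted deployments; omit on the hosted version
# GALILEO_CONSOLE_URL=https://console.your-galileo-deployment.example
```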
Project
The project can be configured as an environment variable, or directly in code.

| Environment variable | Description |
|---|---|
| GALILEO_PROJECT | The Galileo project to log to. If this is not set, you will need to pass the project name in code. |
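If you would rather set the project in code than via GALILEO_PROJECT, the minimal sketch below shows the idea; as with the earlier examples, the `run_experiment` import path and its `project`, `dataset`, `prompt_template`, and `metrics` arguments are assumptions about the SDK rather than a definitive reference.

```python
# A minimal sketch: passing the project name in code instead of setting
# the GALILEO_PROJECT environment variable. Argument names are assumptions.
from galileo.datasets import get_dataset
from galileo.experiments import run_experiment

run_experiment(
    "my-experiment",
    dataset=get_dataset(name="my-dataset"),
    prompt_template="Summarize: {{input}}",  # placeholder syntax is an assumption
    metrics=["correctness"],
    project="my-project",  # the experiment is logged to this project
)
```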
Next steps
Experiments SDK
Run experiments in code
Learn how to run experiments in Galileo using the Galileo SDKs.
Run experiments in playgrounds
Learn about running experiments in the Galileo console using playgrounds and datasets.
Run experiments in unit tests
Learn how to run experiments in unit tests that you can use during development, or in your CI/CD pipelines.
Compare experiments
Learn how to compare experiments in Galileo.
Datasets
Learn more about datasets, the data driving your experiments.
Prompts
Learn how to create and use prompts in experiments.
Out-of-the-box and custom metrics
Metrics reference guide
A list of supported metrics and how to use them in experiments.
Local metrics
Create and run custom metrics directly in code.
Custom code-based metrics
Create reusable custom metrics right in the Galileo Console.
Custom LLM-as-a-judge metrics
Create reusable custom metrics using LLMs to evaluate your response quality.