Set up your unit test
A typical setup for unit testing with evals has three parts:

- A dataset with a set of well-defined inputs to your application. When you first develop your application, this dataset can be defined by a product expert or data science team and augmented with synthetic data. Once your application is in production, you can export real-world data from Log streams into your dataset.
- A set of metrics that you want to evaluate in the experiment. These can be out-of-the-box metrics or custom metrics, built either with LLM-as-a-judge or with code. Out-of-the-box and LLM-as-a-judge metrics require a configured LLM integration to evaluate them.
- An application to unit test. This should be configured to use a real LLM for the code under test, the same LLM as production; other parts of your application can be mocked as required. You may also need to change how you start, conclude, and flush sessions and traces in your application to support running this code in experiments. See our run experiments with custom functions guide for more details.
Although it can be tempting to use a different (for example, cheaper) LLM during unit testing, using a different LLM than production will give unit test results that do not match the actual behavior of your production system.
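The sketch below pulls these three pieces together. It assumes the Galileo Python SDK's `create_dataset` helper and the OpenAI client; the dataset name, metric name, function name, and model are illustrative, and the model should match whatever you run in production.

```python
# A minimal sketch of the setup described above: a dataset, metrics, and the
# application function under test. Names and the create_dataset call shape are
# assumptions; check the Galileo SDK reference for the exact signature.
from galileo.datasets import create_dataset
from openai import OpenAI

client = OpenAI()  # the same LLM provider and model as production

# 1. A dataset of well-defined inputs (could also be exported from Log streams).
dataset = create_dataset(
    name="golden-questions",
    content=[
        {"input": "What are your opening hours?"},
        {"input": "How do I reset my password?"},
    ],
)

# 2. The metrics to evaluate, such as an out-of-the-box scorer or a custom metric name.
metrics = ["correctness"]

# 3. The application code under test, calling the production LLM.
def answer_question(input: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # use the same model as production
        messages=[{"role": "user", "content": input}],
    )
    return response.choices[0].message.content
```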
Run the experiment
To run the experiment, create a unit test using the framework of your choice. You can then run an experiment with the run experiment function, passing in a named dataset and calling your application code. When you run the experiment, you provide an experiment name. This name needs to be unique; if you set a non-unique name, the SDK appends the run date and time to the name of the created experiment to make it unique, and the new name is returned in the response.
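As a sketch, a pytest-style test might look like the following. It assumes the Galileo Python SDK's `run_experiment` and `get_dataset` helpers, and reuses the `answer_question` function from the setup sketch above, imported here from a placeholder `my_app` module; the project, dataset, experiment, and metric names are all illustrative.

```python
# A hedged sketch of running an experiment from a unit test; exact parameter
# names and the response shape may differ between SDK versions.
from galileo.datasets import get_dataset
from galileo.experiments import run_experiment

from my_app import answer_question  # placeholder module for your application code


def test_answer_question_experiment():
    results = run_experiment(
        "release-checks",                        # experiment name; date-stamped if already taken
        project="my-project",
        dataset=get_dataset(name="golden-questions"),
        function=answer_question,                # the application function under test
        metrics=["correctness"],
    )
    # The response includes the created experiment, including the (possibly
    # date-stamped) unique name you will need when polling for results.
    assert results is not None
```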
Check the results
When you run an experiment, it starts running in the background. You can poll for the status of the experiment and, once it has finished and all metrics are evaluated, check the average value of each metric across the entire dataset. You can then assert against these values, for example failing the unit test if an average is below a defined threshold.
To poll the experiment, retrieve it by name. If you used a non-unique name when running the experiment, the actual name, with the date and time appended, is returned from the call to the run experiment function.
The average values are exposed as average_<metric_name>, where <metric_name> is the value of the metric used, such as a GalileoScorers value or the name of a custom metric.
Check the returned experiment for this property and its metrics. If they are not yet set, wait a few seconds, retrieve the experiment again, and re-check.
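A hedged sketch of this polling loop is below. It assumes a `get_experiment` helper that retrieves the experiment by name; the `aggregate_metrics` attribute, project name, polling interval, and threshold are illustrative, so check the SDK reference for the exact call and response shape.

```python
# A sketch of polling an experiment and asserting on an average metric value.
import time

from galileo.experiments import get_experiment  # assumed helper; see the SDK reference

THRESHOLD = 0.8  # illustrative pass/fail threshold for the unit test


def wait_for_average_metric(experiment_name: str, metric_name: str) -> float:
    """Poll the experiment until its average metric value is available."""
    for _ in range(30):  # poll for up to ~60 seconds
        experiment = get_experiment(project="my-project", experiment_name=experiment_name)
        # Averages are exposed as average_<metric_name>; the container used here
        # (aggregate_metrics) is an assumption for illustration.
        metrics = getattr(experiment, "aggregate_metrics", None) or {}
        value = metrics.get(f"average_{metric_name}")
        if value is not None:  # metrics are only set once evaluation has finished
            return value
        time.sleep(2)  # not ready yet; wait and retrieve the experiment again
    raise AssertionError(f"average_{metric_name} was not available in time")


def test_average_correctness_meets_threshold():
    value = wait_for_average_metric("release-checks", "correctness")
    assert value >= THRESHOLD, f"average_correctness {value} is below {THRESHOLD}"
```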
Sample projects
The sample projects that are created when you create a new Galileo organization all contain unit tests that run experiments. Check out these projects for more details.
Simple Chatbot
Learn more about the simple chatbot sample project
Multi-Agent Banking Chatbot With LangGraph
Learn more about the multi-agent banking chatbot with LangGraph sample project