Once you have run some experiments, the next natural step is to compare the results of your experiments, allowing you to optimize prompts, select the best model for your use case, or tune your input data to suit your needs. Galileo allows you to compare up to five different experiments, showing the difference in outputs, metrics, latency, and token usage.

Prerequisites

To compare experiments, you will need:
  • A project containing two or more experiments, created either from the Galileo Console or in code

Compare experiments

Experiments can be compared from the Experiments tab in the Galileo Console. Experiments are part of a project, so select the relevant project to see the experiments tab.
  1. Open the Experiments tab.
    The experiments tab for a project
  2. Select the experiments you want to compare by checking the box next to each one. You can select between two and five experiments.
    The experiments tab with check boxes checked on the left of two experiments
  3. Select the Compare experiments button to open the comparison page.
    The compare experiments button, above the rows of experiments
  4. The experiments appear side by side on the comparison page.
    2 experiments side by side showing metrics, input prompt and output

Review the comparison

The comparison shows each experiment’s metrics, inputs, and outputs.
  1. If your experiments have multiple inputs, you can navigate between them using the forward and backward buttons. Inputs are aligned by position: input one from experiment one is compared to input one from experiment two, and so on.
    All the experiments in the comparison should have the same number of inputs. If they do not, you can only navigate as far as the experiment with the fewest inputs.
    The navigation buttons to navigate between inputs
  2. The Details section shows the model used, averages and totals for the cost of generating responses and computing metrics, and averages for the metric scores.
    For experiments with multiple inputs, averages are calculated across all inputs.
    The details tab showing two experiments, one using GPT-3.5 Turbo, the other using GPT-4o mini. Each detail has averages and totals for costs, and averages for metrics
  3. The Metrics section shows the metrics for the currently selected input. These include system metrics (latency and the number of input and output tokens) and the metrics selected for the experiment.
    Comparing two sets of metrics with latency, number of tokens, instruction adherence, and validate investment advice
    If you hover over a metric, a pop-up explains the reasoning behind the score, along with the LLM used as the judge, the cost of the judgment, and the number of judges used.
    Hovering over a metric showing an explanation in a popup
  4. The Input and Output sections show the input sent to the experiment and the output generated by the LLM.
    The inputs and outputs for an experiment
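The positional alignment and averaging described above can be sketched in plain Python. This is an illustrative sketch only, not the Galileo SDK; the experiment dictionaries and the `average` helper are hypothetical stand-ins for what the Console computes.

```python
# Illustrative sketch of how the comparison view pairs inputs by position
# and averages metrics across inputs. These data structures are
# hypothetical, not the Galileo SDK.

experiment_a = {
    "name": "gpt-3.5-turbo run",
    "inputs": ["What is compound interest?", "Is this stock a good buy?"],
    "latency_ms": [820, 940],
}
experiment_b = {
    "name": "gpt-4o-mini run",
    "inputs": ["What is compound interest?", "Is this stock a good buy?", "Extra input"],
    "latency_ms": [610, 700, 655],
}

# Inputs are compared by position. zip() stops at the experiment with the
# fewest inputs, mirroring how navigation is limited when input counts differ.
pairs = list(zip(experiment_a["inputs"], experiment_b["inputs"]))
print(len(pairs))  # prints 2: only positions present in both experiments

# The Details section averages metrics across all of an experiment's inputs.
def average(values):
    return sum(values) / len(values)

print(average(experiment_a["latency_ms"]))  # prints 880.0
print(average(experiment_b["latency_ms"]))  # prints 655.0
```

Pairing by position rather than by content is why keeping the same dataset (and input order) across experiments makes the comparison meaningful.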