Overview

This guide shows you how to create a custom local metric in Python to use in an experiment.

In this example, you will create a metric that rates the brevity (shortness) of an LLM's response based on word count. The sample code for the experiment uses OpenAI as the LLM.

Before you start

To complete this how-to, you will need:

Install dependencies

To use Galileo, you need to install some package dependencies and configure environment variables.

1

Install Required Dependencies

Install the required dependencies for your app. Create a virtual environment using your preferred method, then install dependencies inside that environment:

Python
pip install "galileo[openai]" python-dotenv
2

Create a `.env` file and add the following values:

.env
GALILEO_API_KEY=your_galileo_api_key
GALILEO_PROJECT=your_project_name

OPENAI_API_KEY=your_openai_api_key

Create your local metric

1

Create a file for your experiment called `experiment.py`.

2

Create a scorer function

The scorer function assigns one of three ranks, "Terse", "Temperate", or "Talkative", based on how many words the model outputs. Add this code to your experiment.py file.

Python
from galileo import Trace, Span

def brevity_rank(step: Span | Trace) -> str:
    """
    Rank response brevity based on word count.
    """
    # split() (no argument) handles repeated whitespace correctly
    word_count = len(step.output.content.split())
    if word_count <= 3:
        return "Terse"
    if word_count <= 5:
        return "Temperate"
    return "Talkative"
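Before wiring the scorer into Galileo, you can sanity-check the thresholds locally. The snippet below repeats the ranking logic as a plain function of a string (without the Galileo types, so it runs on its own):

```python
# Same thresholds as brevity_rank above, applied directly to a string:
def rank_words(text: str) -> str:
    word_count = len(text.split())
    if word_count <= 3:
        return "Terse"
    if word_count <= 5:
        return "Temperate"
    return "Talkative"

print(rank_words("Asia"))                                           # Terse (1 word)
print(rank_words("Asia, specifically Southeast Asia"))              # Temperate (4 words)
print(rank_words("Indonesia is located on the continent of Asia"))  # Talkative (8 words)
```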
3

Create an aggregator function

Since the scorer returns a single rank per record, the aggregator takes that rank and returns it, flagging overly long responses by converting "Talkative" to "Terrible". Add this code to your experiment.py file.

Python
def brevity_aggregator(ranks: list[str]) -> str:
    """
    Extract the single rank and adjust if it's 'Talkative'.
    """
    # `ranks` has exactly one element, 
    # because brevity_rank outputs one value per record.
    return "Terrible" if ranks[0] == "Talkative" else ranks[0]
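For example, here is how the aggregator maps single-record score lists (the function is repeated so the snippet runs standalone):

```python
def brevity_aggregator(ranks: list[str]) -> str:
    # Per record, the scorer's outputs arrive as a one-element list.
    return "Terrible" if ranks[0] == "Talkative" else ranks[0]

print(brevity_aggregator(["Terse"]))      # Terse
print(brevity_aggregator(["Temperate"]))  # Temperate
print(brevity_aggregator(["Talkative"]))  # Terrible
```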
4

Create the local metric configuration

Here, we tell Galileo that our custom metric returns a `str`. We give it a name ("Terseness"), then assign the scorer and aggregator functions. Add this code to your experiment.py file.

Python
from galileo.schema.metrics import LocalMetricConfig

terseness = LocalMetricConfig[str](
    # Metric name (shown as a column in Galileo)
    name="Terseness",
    # Scorer Function defined above
    scorer_fn=brevity_rank,
    # Aggregator Function defined above
    aggregator_fn=brevity_aggregator,
)

The metric has been created. Next, we can use it in an experiment.

Prepare the experiment

For this example, we’ll ask the LLM to specify the continent of four countries, encouraging it to be succinct.

1

Create a dataset

Create a dataset of inputs to the experiment by adding this code to your experiment.py file.

Python
# Simple dataset with four countries:
countries_dataset = [
    {"input": "Indonesia"},
    {"input": "New Zealand"},
    {"input": "Greenland"},
    {"input": "China"},
]
2

Call the LLM

Next, you need a function for the experiment to call with each dataset input. Add this code to your experiment.py file.

Python
import os

from galileo.openai import openai

# The function that calls the LLM for each input:
def llm_call(input):
    client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    return (
        client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system", 
                    "content": """
                    You are a geography expert.
                    Always answer as succinctly as possible.
                    """
                },
                {
                    "role": "user", 
                    "content": f"""
                    Which continent does the following
                    country belong to: {input}
                    """
                },
            ],
        )
        .choices[0]
        .message.content
    )
3

Add code to run the experiment

Finally, add code to run the experiment using your dataset and custom local metric.

Python
import os

from galileo.experiments import run_experiment

# Run the Experiment!
results = run_experiment(
    "terseness-local-metric",
    dataset=countries_dataset,
    function=llm_call,
    metrics=[terseness],
    project=os.environ["GALILEO_PROJECT"],
)

Run the experiment

Now that your experiment is set up, you can run it to see the results of your local metric.

1

Run the experiment code

Terminal
python experiment.py

When the experiment runs, it will output a link to view the results in the terminal.

Terminal
(.venv)  python experiment.py
Experiment terseness-local-metric has completed and results are available
at https://console.galileo.ai//project/xxx/experiments/xxx
2

View the experiment

Follow the link in your terminal to view the results of the experiment. The experiment has four rows, one per item in the dataset.

The new Terseness metric is available in both the Traces table and the metrics pane when you select a row.

You have successfully created a local metric and used it in an experiment.

See also