Overview

A Local Metric (or Local scorer) is a custom metric that you can attach to an experiment—just like a Galileo preset metric. The key difference is that a Local Metric lives in code on your machine, so you share it by sharing your code. Local Metrics are ideal for running isolated tests and refining outcomes when you need more control than built-in metrics offer. This guide explains what Local Metrics are, how to create one, and where to view results.

If you’d rather see a full code example first, jump to the Run the experiment section, then come back to review the implementation details.

Galileo currently supports Local scorers in Python only; TypeScript support is planned.

Local Scorer Components

A Local scorer consists of three main parts:

  1. Scorer Function
    Receives a single Span or Trace (containing the LLM input and output) and computes a score. The exact measurement is up to you—for example, you might measure the length of the output or rate it based on the presence/absence of specific words.

  2. Aggregator Function
    Aggregates the scores generated by the Scorer Function and returns a final metric value. This function receives a list of the type returned by your Scorer. For instance, if your Scorer returns a str, the Aggregator will be called with a list[str]. The Aggregator’s return value can also be any type (e.g., str, bool, int), depending on how you want to represent the final metric.

  3. LocalMetricConfig[type]
    A generic configuration class provided by Galileo’s Python SDK that combines your Scorer and Aggregator into a custom metric.

    • The generic type should match the type returned by your Aggregator.
    • Example: If your Scorer returns bool values, you would use LocalMetricConfig[bool](…), and your Aggregator must accept a list[bool] and return a bool.

Scorer and Aggregator functions can be simple lambdas when your logic is straightforward, as in the sketch below.
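
For instance, here is a minimal sketch of a fully lambda-based metric, using the same LocalMetricConfig API demonstrated later in this guide (the metric name and keyword check are illustrative, not part of the SDK):

Python
from galileo.schema.metrics import LocalMetricConfig

# The Scorer returns a bool per record, so the Aggregator receives a
# list[bool] and returns a bool, and the generic parameter is bool:
mentions_continent = LocalMetricConfig[bool](
    name="Mentions Continent",
    scorer_fn=lambda step: "continent" in step.output.content.lower(),
    aggregator_fn=lambda flags: all(flags),  # True only if every score is True
)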

Example: Response Brevity Metric

Below is a step-by-step implementation of a Local Metric that rates the brevity (shortness) of an LLM’s response based on word count.

Creating the Local Scorer

Step 1: Create a Scorer Function

The Scorer Function assigns one of three ranks—"Terse", "Temperate", or "Talkative"—depending on how many words the model outputs:

Python
from galileo import Trace, Span

def brevity_rank(step: Span | Trace) -> str:
    """Rank response brevity based on word count."""
    word_count = len(step.output.content.split())  # split() also handles repeated whitespace and newlines
    if word_count <= 3:
        return "Terse"
    if word_count <= 5:
        return "Temperate"
    return "Talkative"

Step 2: Create an Aggregator Function

Since our Scorer returns a single rank per record, the Aggregator receives a one-element list. It returns that rank unchanged, except that it downgrades "Talkative" responses to "Terrible":

Python
def brevity_aggregator(ranks: list[str]) -> str:
    """Extract the single rank and adjust if it's 'Talkative'."""
    # `ranks` has exactly one element, because brevity_rank outputs one value per record.
    return "Terrible" if ranks[0] == "Talkative" else ranks[0]

Step 3: Create the Local Metric Configuration

Here, we tell Galileo that our custom metric returns a str. We give it a name (“Terseness”), assign the Scorer and Aggregator, and voilà—our Local Metric is ready:

Python
from galileo.schema.metrics import LocalMetricConfig

terseness = LocalMetricConfig[str](
    name="Terseness",        # Metric name (shown as a column in Galileo)
    scorer_fn=brevity_rank,  # Scorer Function defined above
    aggregator_fn=brevity_aggregator,  # Aggregator Function defined above
)

The metric has been created. Next, we can use it in an experiment.

Prepare the Experiment

For this example, we’ll ask the LLM to specify the continent of four countries, encouraging it to be succinct:

Python
from galileo.openai import openai

# Simple dataset with four countries:
countries_dataset = [
    {"input": "Indonesia"},
    {"input": "New Zealand"},
    {"input": "Greenland"},
    {"input": "China"},
]

# The function that calls the LLM for each input:
def llm_call(input):
    return (
        openai.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system", 
                    "content": "You are a geography expert. Always answer as succinctly as possible."
                },
                {
                    "role": "user", 
                    "content": f"Which continent does the following country belong to: {input}"
                },
            ],
        )
        .choices[0]
        .message.content
    )
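
Before wiring everything into an experiment, you can smoke-test the function on its own (this makes a real OpenAI call, so your API credentials must be configured):

Python
print(llm_call("Japan"))  # expected: something like "Asia"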

Run the Experiment

The snippet below brings all the preceding code samples together into a single file, combining the dataset, the LLM call, and our custom metric in a single run_experiment call. After it runs, you’ll see a URL in your terminal that directs you to the experiment results in the Galileo console:

Python
from galileo import Trace, Span
from galileo.experiments import run_experiment
from galileo.openai import openai
from galileo.schema.metrics import LocalMetricConfig

# 1. Scorer Function
def brevity_rank(step: Span | Trace) -> str:
    """Rank response brevity based on word count."""
    word_count = len(step.output.content.split())  # split() also handles repeated whitespace and newlines
    if word_count <= 3:
        return "Terse"
    if word_count <= 5:
        return "Temperate"
    return "Talkative"

# 2. Aggregator Function
def brevity_aggregator(ranks: list[str]) -> str:
    """Extract the single rank and adjust if it's 'Talkative'."""
    return "Terrible" if ranks[0] == "Talkative" else ranks[0]

# 3. Configure the Local Metric
terseness = LocalMetricConfig[str](
    name="Terseness",
    scorer_fn=brevity_rank,
    aggregator_fn=brevity_aggregator,
)

# 4. Dataset
countries_dataset = [
    {"input": "Indonesia"},
    {"input": "New Zealand"},
    {"input": "Greenland"},
    {"input": "China"},
]

# 5. LLM-Call Function
def llm_call(input):
    return (
        openai.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system", 
                    "content": "You are a geography expert. Always answer as succinctly as possible."
                },
                {
                    "role": "user", 
                    "content": f"Which continent does the following country belong to: {input}"
                },
            ],
        )
        .choices[0]
        .message.content
    )

# 6. Run the Experiment!
results = run_experiment(
    "terseness-custom-metric",
    dataset=countries_dataset,
    function=llm_call,
    metrics=[terseness],  # You can add multiple custom metrics here
    project="My first project",
)
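
Since metrics accepts a list, a single run can be scored several ways at once. For example, if you also defined the Average Length sketch from earlier in this guide in the same file, you could pass both metrics:

Python
results = run_experiment(
    "terseness-custom-metric",
    dataset=countries_dataset,
    function=llm_call,
    metrics=[terseness, average_length],  # each metric appears as its own column
    project="My first project",
)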

View the Results

After the experiment completes, your terminal output will include a URL to the experiment page in the Galileo console. There, you’ll see a new column labeled Terseness (or whatever name you chose) containing your custom metric’s results for each input.


Conclusion

Local Metrics let you tailor evaluation to your exact needs by defining custom scoring logic in code. Whether you want to measure response brevity, detect specific keywords, or implement a complex scoring algorithm, Local Metrics integrate seamlessly with Galileo’s experimentation framework. Once you’ve defined your Scorer and Aggregator functions and wrapped them in a LocalMetricConfig, running the experiment is as simple as calling run_experiment. The results appear alongside Galileo’s built-in metrics, so you can compare, visualize, and analyze everything in one place.

With Local Metrics, you have full control over how you measure LLM behavior—unlocking deeper insights and more targeted evaluations for your AI applications.