Custom metrics allow you to define specific evaluation criteria for your LLM applications. Galileo supports two types of custom metrics:
  • Registered custom metrics: Metrics that can be shared across your organization
  • Local metrics: Metrics that run in your local notebook environment

Registered custom metrics

Registered custom metrics are stored and run in Galileo’s environment and can be used across your organization.

Create a registered custom metric

You can create a registered custom metric either through the Python SDK or directly in the Galileo UI. Let’s walk through the UI approach:
1. Navigate to the Metrics section

In the Galileo platform, go to the Metrics section and select the Create New Metric button in the top right corner.
2. Select the Code metric type

From the dialog that appears, choose the Code-powered metric type. This option allows you to write custom Python code to evaluate your LLM outputs.
3. Write your custom metric

Select the step level you'd like to apply this metric to (for example, Sessions, Traces, or LlmSpan). Then, use the code editor to write your custom metric. The editor provides a template with the required functions and helpful comments to guide you, and lets you write and test your metric directly in the browser. You'll need to define the scorer_fn and aggregator_fn functions described below.
4. Save your metric

After writing your custom metric code, select the Save button in the bottom right corner of the code editor. Your metric will be validated and, if there are no errors, it will be saved and become available for use across your organization. You can now select this metric when running evaluations.
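For example, a registered metric can later be referenced by name when running an experiment from the Python SDK. The snippet below is a minimal sketch: the metric name "Response Length Difference", the dataset and project names, and the exact run_experiment parameters are assumptions to check against the SDK reference:
from galileo.experiments import run_experiment

def my_app(input: str) -> str:
    # Hypothetical application function under test
    return f"Echo: {input}"

run_experiment(
    "registered-metric-demo",
    dataset_name="my-dataset",                # assumed parameter name
    function=my_app,
    metrics=["Response Length Difference"],   # the name you gave the metric in the UI
    project="my-project",
)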

The scorer function

This function evaluates individual responses and returns a score:
def scorer_fn(
    *,
    step_object: (
        Session | Trace | WorkflowSpan | AgentSpan |
        LlmSpan | RetrieverSpan | ToolSpan
    ),
    **kwargs: Any
) -> float | int | bool | str:
    # Your scoring logic here
    return score
The function must accept **kwargs to ensure forward/backward compatibility. Here’s a complete example that measures the difference in length between the output and ground truth:
from typing import Any, Union

from galileo import LlmSpan

def scorer_fn(*,
              step_object: LlmSpan,
              **kwargs: Any) -> Union[float, int, bool, str, None]:
    node_output = step_object.output.content
    reference_output = step_object.dataset_output
    return abs(len(node_output) - len(reference_output))
Parameter details:
  • step_object: The step object represents the unit of your LLM application being evaluated. It can be one of several types from the galileo library:
    • Session - A complete user session containing multiple traces
    • Trace - A single execution trace containing multiple spans
    • WorkflowSpan - A workflow-level span containing child spans
    • AgentSpan - An agent execution span
    • LlmSpan - A single LLM call span
    • RetrieverSpan - A retriever/search operation span
    • ToolSpan - A tool execution span
All step objects provide access to key attributes for evaluation:
  • Input/Output data: Access the input prompt and generated output (e.g., step_object.output.content for LLM responses)
  • Metadata: Additional context like timestamps, model information, and custom metadata
  • Dataset references: Ground truth or reference data when available (e.g., step_object.dataset_output)
  • Hierarchical data: For Session/Trace/Workflow objects, access child spans and nested execution data
For detailed documentation on each step object type and their specific attributes, refer to the Galileo Python SDK documentation. Each type has unique properties tailored to its execution context—for example, LlmSpan includes model parameters and token counts, while RetrieverSpan includes retrieved documents and search queries.
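For instance, a scorer at the LlmSpan level can combine the output and the dataset reference shown above. The sketch below relies only on the attributes already used in this guide (output.content and dataset_output), and the import mirrors the Session example later on:
from typing import Any, Union

from galileo import LlmSpan

def scorer_fn(*,
              step_object: LlmSpan,
              **kwargs: Any) -> Union[float, None]:
    """Return 1.0 when the reference answer appears verbatim in the model output."""
    response = step_object.output.content
    reference = step_object.dataset_output

    if not reference:
        # No ground truth attached to this row, so skip scoring this span
        return None

    return 1.0 if reference.strip().lower() in response.lower() else 0.0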

The aggregator function

This function aggregates individual scores into summary metrics across experiments:
def aggregator_fn(*,
                 scores: List[Union[float, int, bool, str, None]]
                 ) -> Dict[str, Union[float, int, bool, str, None]]:
    # Your aggregation logic here
    return {
        "Metric Name 1": aggregated_value_1,
        "Metric Name 2": aggregated_value_2
    }
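Because individual scores can be any of the listed types, the right aggregation depends on what your scorer returns. For example, a scorer that returns booleans might be summarized as a pass rate (plain Python, no SDK-specific calls):
from typing import Dict, List, Union

def aggregator_fn(*,
                  scores: List[Union[float, int, bool, str, None]]
                  ) -> Dict[str, Union[float, int, bool, str, None]]:
    # Keep only boolean results; None means the scorer skipped that step
    results = [s for s in scores if isinstance(s, bool)]
    return {
        "Pass Rate": sum(results) / len(results) if results else None,
        "Evaluated Steps": len(results),
    }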

Complete example: trace counter

Let’s create a custom metric that counts the number of traces in a Session:
from galileo import Session

def scorer_fn(*, step_object: Session, **kwargs) -> int:
    num_traces = len(step_object.traces)
    return num_traces

def aggregator_fn(*, scores: list[int]) -> dict[str, int]:
    return {
        "Total Traces Across Sessions": sum(scores),
        "Average Num Traces": sum(scores) / len(scores) if scores else 0,
    }

Execution environment

Registered custom metrics run in a sandboxed Python 3.10 environment with only the Python standard library and the Galileo SDK installed. To use additional PyPI packages, declare them at the top of your metric file using uv's script dependency format:
# /// script
# dependencies = [
#   "requests<3",
#   "rich",
# ]
# ///
For full documentation on defining dependencies, check out the ‘uv’ script dependency docs.
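Putting the pieces together, a registered metric file that uses a PyPI package might look like the sketch below. The textstat package and its flesch_reading_ease function are only an illustration of the dependency mechanism, and the LlmSpan import follows the same pattern as the earlier examples:
# /// script
# dependencies = [
#   "textstat",
# ]
# ///
from typing import Any, Dict, List, Union

import textstat

from galileo import LlmSpan

def scorer_fn(*, step_object: LlmSpan, **kwargs: Any) -> float:
    # Higher Flesch reading-ease scores mean the output is easier to read
    return float(textstat.flesch_reading_ease(step_object.output.content))

def aggregator_fn(*,
                  scores: List[Union[float, int, bool, str, None]]
                  ) -> Dict[str, Union[float, int, bool, str, None]]:
    numeric = [s for s in scores if isinstance(s, (int, float))]
    return {
        "Average Reading Ease": sum(numeric) / len(numeric) if numeric else None
    }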

Local metrics

A Local metric (or Local scorer) is a custom metric that you can attach to an experiment — just like a Galileo preset metric. The key difference is that a Local Metric lives in code on your machine, so you share it by sharing your code. Local Metrics are ideal for running isolated tests and refining outcomes when you need more control than built-in metrics offer. You can also use any library or custom Python code with your local metrics, including calling out to LLMs or other APIs.
Galileo currently supports local scorers in Python only.

Local scorer components

A Local scorer consists of three main parts:
  1. Scorer Function Receives a single Span or Trace containing the LLM input and output, and computes a score. The exact measurement is up to you — for example, you might measure the length of the output or rate it based on the presence/absence of specific words.
  2. Aggregator Function Aggregates the scores generated by the Scorer Function and returns a final metric value. This function receives a list of the type returned by your Scorer. For instance, if your Scorer returns a str, the Aggregator will be called with a list[str]. The Aggregator’s return value can also be any type (e.g., str, bool, int), depending on how you want to represent the final metric.
  3. LocalMetricConfig[type] A typed callable provided by Galileo’s Python SDK that combines your Scorer and Aggregator into a custom metric.
    • The generic type should match the type returned by your Aggregator.
    • Example: If your Scorer returns bool values, you would use LocalMetricConfig[bool](…), and your Aggregator must accept a list[bool] and return a bool.
Scorer and aggregator functions can be simple lambdas when your logic is straightforward.

Local metrics let you tailor evaluation to your exact needs by defining custom scoring logic in code. Whether you want to measure response brevity, detect specific keywords, or implement a complex scoring algorithm, local metrics integrate seamlessly with Galileo's experimentation framework. Once you've defined your scorer and aggregator functions and wrapped them in a LocalMetricConfig, running the experiment is as simple as calling run_experiment. The results appear alongside Galileo's built-in metrics, so you can compare, visualize, and analyze everything in one place. With local metrics, you have full control over how you measure LLM behavior, unlocking deeper insights and more targeted evaluations for your AI applications.
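As a sketch of how these pieces fit together, the example below flags experiments that produce overly long outputs. The import path for LocalMetricConfig and the keyword arguments name, scorer_fn, and aggregator_fn are assumptions to verify against the Python SDK reference:
from galileo import Span, Trace
from galileo.schema.metrics import LocalMetricConfig

# Scorer: flag outputs that exceed a length budget
def output_too_long(step: Span | Trace) -> bool:
    return len(step.output.content) > 500

# Aggregator: any single True flags the whole run
long_output_metric = LocalMetricConfig[bool](
    name="output_too_long",
    scorer_fn=output_too_long,
    aggregator_fn=lambda scores: any(scores),
)
You would then pass long_output_metric in the metrics list of run_experiment alongside any built-in metrics.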

Create a local metric

Learn how to create a local metric in Python to use in your experiments

Comparison: registered custom metrics vs. local metrics

| Feature | Registered Custom Metrics | Local Metrics |
| --- | --- | --- |
| Creation | Python client, activated via UI | Python client only |
| Sharing | Organization-wide | Current project only |
| Environment | Server-side | Local Python environment |
| Libraries | Any available library | Any available library |
| Resources | Restricted by Galileo | Local resources |

Common use cases

Custom metrics are ideal for:
  • Heuristic evaluation: Checking for specific patterns, keywords, or structural elements
  • Model-guided evaluation: Using pre-trained models to detect entities or LLMs to grade outputs
  • Business-specific metrics: Measuring domain-specific quality indicators
  • Comparative analysis: Comparing outputs against ground truth or reference data

Simple example: sentiment scorer

Here’s a simple custom metric that evaluates the sentiment of responses:
from galileo import Span, Trace

def scorer_fn(step: Span | Trace) -> float:
    """
    A simple sentiment scorer that counts positive and negative words.
    Returns a score between -1 (negative) and 1 (positive).
    """
    positive_words = [
        "good", "great", "excellent",
        "positive", "happy", "best", "wonderful"
    ]
    negative_words = [
        "bad", "poor", "negative", "terrible",
        "worst", "awful", "horrible"
    ]

    node_output = step.output.content

    # Convert to lowercase for case-insensitive matching
    text = node_output.lower()

    # Count occurrences
    positive_count = sum(text.count(word) for word in positive_words)
    negative_count = sum(text.count(word) for word in negative_words)

    total_count = positive_count + negative_count

    # Calculate sentiment score
    if total_count == 0:
        return 0.0  # Neutral

    return (positive_count - negative_count) / total_count

def aggregator_fn(scores: list[float]) -> dict[str, float | int]:
    """Aggregate sentiment scores across responses."""
    if not scores:
        return {"Average Sentiment": 0.0}

    avg_sentiment = sum(scores) / len(scores)

    return {
        "Average Sentiment": round(avg_sentiment, 2),
        "Positive Responses": sum(1 for s in scores if s > 0.2),
        "Neutral Responses": sum(1 for s in scores if -0.2 <= s <= 0.2),
        "Negative Responses": sum(1 for s in scores if s < -0.2)
    }
This simple sentiment scorer:
  • Counts positive and negative words in responses
  • Calculates a sentiment score between -1 (negative) and 1 (positive)
  • Aggregates results to show the distribution of positive, neutral, and negative responses
You can easily extend this with more sophisticated sentiment analysis techniques or domain-specific terminology.
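For example, swapping the word lists for an off-the-shelf sentiment model is a small change. The sketch below assumes the vaderSentiment package; the aggregator above works unchanged because VADER's compound score also falls between -1 and 1:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

from galileo import Span, Trace

_analyzer = SentimentIntensityAnalyzer()

def scorer_fn(step: Span | Trace) -> float:
    # Compound score ranges from -1 (most negative) to 1 (most positive)
    return _analyzer.polarity_scores(step.output.content)["compound"]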

Next steps