Composite metrics are advanced custom metrics that can access and leverage the results of other metrics to perform sophisticated evaluations. Unlike standard metrics that operate independently, composite metrics build upon previously computed metric values to create more nuanced and context-aware assessments.

What are composite metrics?

A composite metric is a custom metric that has access to other metrics computed on the current step or any of its child steps. This allows you to:
  • Combine multiple metric scores into a single comprehensive evaluation
  • Apply conditional logic based on metric values
  • Create hierarchical evaluations that aggregate scores across sessions, traces, and spans
  • Build context-aware metrics that only calculate when certain conditions are met
Composite metrics use the required_metrics parameter to specify which metrics they depend on. These required metrics are guaranteed to be computed before the composite metric runs, and their values are accessible via the step_object.metrics dictionary.

Common use cases

Conditional evaluation

Calculate a metric only when another metric meets certain criteria.
Example: Only calculate adherence if the input prompt is correct.
Required metrics: GalileoMetrics.correctness, GalileoMetrics.context_adherence
from galileo import GalileoMetrics, LlmSpan

def scorer_fn(*, step_object: LlmSpan, **kwargs) -> float:
    # Only calculate adherence if correctness score is high
    correctness = step_object.metrics[GalileoMetrics.correctness]
    if correctness < 0.7:
        return 0.0  # Skip adherence calculation for incorrect inputs

    # Calculate adherence for correct inputs
    adherence = step_object.metrics[GalileoMetrics.context_adherence]
    return adherence

Hierarchical aggregation

Aggregate metric values across different levels of your application hierarchy.
Example: Calculate average metric scores across all spans in a session.
Required metrics: GalileoMetrics.context_adherence
from galileo import GalileoMetrics, Session

def scorer_fn(*, step_object: Session, **kwargs) -> float:
    llm_scores = []

    # Collect scores from all LLM spans across all traces
    for trace in step_object.traces:
        for span in trace.spans:
            if span.type == "llm":
                score = span.metrics[GalileoMetrics.context_adherence]
                llm_scores.append(score)

    # Return average score
    return sum(llm_scores) / len(llm_scores) if llm_scores else 0.0

Multi-metric analysis

Combine multiple metrics to detect specific patterns or issues.
Example: Check for PII and count occurrences if found.
Required metrics: GalileoMetrics.output_pii
import re

from galileo import GalileoMetrics, LlmSpan

def scorer_fn(*, step_object: LlmSpan, **kwargs) -> int:
    # Check if PII is present
    has_pii = step_object.metrics[GalileoMetrics.output_pii]

    if not has_pii:
        return 0

    # If PII found, count how many times the SSN pattern appears
    output = step_object.output.content
    ssn_pattern = r'\b\d{3}-\d{2}-\d{4}\b'
    ssn_count = len(re.findall(ssn_pattern, output))

    return ssn_count

Cross-span evaluation

Evaluate metrics across different span types in a trace.
Example: Combine retriever and LLM metrics for RAG evaluation.
Required metrics: GalileoMetrics.context_relevance, GalileoMetrics.context_adherence
from galileo import GalileoMetrics, Trace

def scorer_fn(*, step_object: Trace, **kwargs) -> float:
    retriever_score = 0.0
    llm_score = 0.0

    for span in step_object.spans:
        if span.type == "retriever":
            retriever_score = span.metrics[GalileoMetrics.context_relevance]
        elif span.type == "llm":
            llm_score = span.metrics[GalileoMetrics.context_adherence]

    # Combine both scores
    return (retriever_score + llm_score) / 2

Specifying required metrics

The required_metrics parameter tells Galileo which metrics must be computed before your composite metric runs. This ensures the metric values are available when your scorer function executes. You specify required metrics when creating your code-based custom metric:
  • In the UI: Select metrics from the “Required Metrics” dropdown
  • In the Python SDK: Pass the required_metrics parameter, as sketched below
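For illustration only, an SDK call might look like the sketch below. The create_custom_metric function name and its other arguments are assumptions made for this example, not a confirmed SDK signature; the point is the required_metrics parameter itself. Check the SDK reference for the exact call in your version.
from galileo import GalileoMetrics

# Hypothetical creation call: the function name and all arguments other than
# required_metrics are placeholders, not a confirmed SDK signature.
create_custom_metric(
    name="Conditional Adherence",
    scorer_fn=scorer_fn,  # a scorer_fn defined as in the examples above
    required_metrics=[
        GalileoMetrics.correctness,
        GalileoMetrics.context_adherence,
    ],
)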

Galileo preset metrics

For Galileo’s built-in metrics, use the GalileoMetrics enum. For example, you might select:
  • GalileoMetrics.context_adherence
  • GalileoMetrics.context_adherence_luna
  • GalileoMetrics.correctness
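If you want to check which preset metrics your installed SDK version exposes, you can introspect the enum. This is a minimal sketch assuming GalileoMetrics behaves like a standard Python Enum, as its usage here suggests:
from galileo import GalileoMetrics

# List the preset metric names available in this SDK version
# (assumes GalileoMetrics is a standard Python Enum).
for metric in GalileoMetrics:
    print(metric.name)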

Custom metrics

For your own custom metrics, reference them by name as strings. You can also mix custom metrics with Galileo preset metrics, as shown in the example below:
  • "My Custom Metric" (string for custom metric)
  • "Compliance Check" (string for custom metric)
  • GalileoMetrics.output_pii (Galileo preset metric)
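Put together, a mixed dependency list looks like this (the custom metric names here are placeholders for metrics you have already created):
from galileo import GalileoMetrics

# Custom metrics referenced by name, combined with a Galileo preset metric
required_metrics = [
    "My Custom Metric",
    "Compliance Check",
    GalileoMetrics.output_pii,
]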

Accessing metric values

Once you’ve specified required metrics, access them through the step_object.metrics dictionary:
from galileo import GalileoMetrics, LlmSpan

def scorer_fn(*, step_object: LlmSpan, **kwargs) -> float:
    # Access metrics using the same enum or string used in required_metrics
    adherence = step_object.metrics[GalileoMetrics.context_adherence]
    custom_score = step_object.metrics["My Custom Metric"]

    # Use the metric values in your logic
    return (adherence + custom_score) / 2

Complete example: multi-level session metric

This example demonstrates a comprehensive composite metric that aggregates scores from all hierarchy levels.
Required metrics to select (in the UI dropdown or the SDK parameter):
  • GalileoMetrics.conversation_quality
  • GalileoMetrics.action_completion
  • GalileoMetrics.agent_efficiency
  • GalileoMetrics.action_completion_luna
  • GalileoMetrics.action_advancement
  • GalileoMetrics.context_adherence
  • GalileoMetrics.context_relevance
  • GalileoMetrics.tool_error_rate
from galileo import GalileoMetrics, Session

def scorer_fn(*, step_object: Session, **kwargs) -> float:
    """
    Comprehensive session score combining metrics from all hierarchy levels.
    """
    # Session-level metrics
    conversation_quality = step_object.metrics[
        GalileoMetrics.conversation_quality
    ]
    action_completion = step_object.metrics[GalileoMetrics.action_completion]
    agent_efficiency = step_object.metrics[GalileoMetrics.agent_efficiency]

    # Collect trace-level metrics
    trace_scores = []
    for trace in step_object.traces:
        trace_scores.append(
            trace.metrics[GalileoMetrics.action_completion_luna]
        )
        trace_scores.append(trace.metrics[GalileoMetrics.action_advancement])

    # Collect span-level metrics by type
    llm_scores = []
    retriever_scores = []
    tool_scores = []

    for trace in step_object.traces:
        for span in trace.spans:
            if span.type == "llm":
                llm_scores.append(
                    span.metrics[GalileoMetrics.context_adherence]
                )
            elif span.type == "retriever":
                retriever_scores.append(
                    span.metrics[GalileoMetrics.context_relevance]
                )
            elif span.type == "tool":
                tool_scores.append(
                    1 - span.metrics[GalileoMetrics.tool_error_rate]
                )

    # Calculate averages for each level
    session_avg = (
        conversation_quality + action_completion + agent_efficiency
    ) / 3
    trace_avg = sum(trace_scores) / len(trace_scores) if trace_scores else 0.5
    llm_avg = sum(llm_scores) / len(llm_scores) if llm_scores else 0.5
    retriever_avg = (
        sum(retriever_scores) / len(retriever_scores)
        if retriever_scores
        else 0.5
    )
    tool_avg = sum(tool_scores) / len(tool_scores) if tool_scores else 0.5

    # Return weighted average across all levels
    return (session_avg + trace_avg + llm_avg + retriever_avg + tool_avg) / 5

Best practices

Be specific with required metrics

Only include metrics you actually use. This improves performance and makes your metric’s dependencies clear:
# Bad - includes unnecessary metrics
required_metrics = [
    GalileoMetrics.context_adherence,
    GalileoMetrics.context_relevance,
    GalileoMetrics.completeness,  # Not used in scorer
    GalileoMetrics.correctness     # Not used in scorer
]

# Good - only required metrics
required_metrics = [
    GalileoMetrics.context_adherence,
    GalileoMetrics.context_relevance
]

Use appropriate step types

Match your composite metric’s step type to where the required metrics exist, as illustrated in the sketch after this list:
  • Session: Can access session, trace, and span metrics
  • Trace: Can access trace and span metrics
  • Span: Can only access metrics on that specific span
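As a sketch of the middle case, a Trace-scoped composite metric can read its own trace-level metrics plus metrics on its child spans, but not session-level metrics. This follows the access patterns shown in the examples above, with GalileoMetrics.action_advancement and GalileoMetrics.context_adherence listed as required metrics; the specific metrics chosen are just for illustration.
from galileo import GalileoMetrics, Trace

def scorer_fn(*, step_object: Trace, **kwargs) -> float:
    # Trace-level metric, available directly on the trace
    advancement = step_object.metrics[GalileoMetrics.action_advancement]

    # Span-level metrics, available on the trace's child spans
    llm_scores = [
        span.metrics[GalileoMetrics.context_adherence]
        for span in step_object.spans
        if span.type == "llm"
    ]
    llm_avg = sum(llm_scores) / len(llm_scores) if llm_scores else 0.0

    # Session-level metrics are not reachable from a Trace-scoped scorer
    return (advancement + llm_avg) / 2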

Execution restrictions

Composite metrics depend on the successful completion of their required metrics:
  • While any required metric is not yet final (e.g., queued or computing), the composite metric remains queued.
  • If any required metric finishes without a successful final status (e.g., failed, not computed, or not applicable), the composite metric raises an error that includes the failed statuses of those required metrics.
  • Metrics not listed in required_metrics do not affect the composite metric—only the required ones gate execution.

Creating composite metrics

Composite metrics can be created in two ways:
  1. Galileo Console UI: Use the custom code-based metrics editor and select required metrics from the “Required Metrics” dropdown
  2. Python SDK: Add the required_metrics parameter when creating code-based metrics
Composite metrics are only supported for code-based custom metrics. LLM-as-a-judge metrics do not support the required_metrics parameter.

Next steps