Local Metrics
Learn how to create and use local scorers with Galileo’s Python SDK
Overview
A Local Metric (or Local scorer) is a custom metric that you can attach to an experiment—just like a Galileo preset metric. The key difference is that a Local Metric lives in code on your machine, so you share it by sharing your code. Local Metrics are ideal for running isolated tests and refining outcomes when you need more control than built-in metrics offer. This guide explains what Local Metrics are, how to create one, and where to view results.
If you’d rather see a full code example first, jump to the Run the Experiment section, then come back to review the implementation details.
Local Scorer Components
A Local Scorer consists of three main parts:
- **Scorer Function**: receives a single `Span` or `Trace` (containing the LLM input and output) and computes a score. The exact measurement is up to you; for example, you might measure the length of the output or rate it based on the presence or absence of specific words.
- **Aggregator Function**: aggregates the scores generated by the Scorer Function and returns a final metric value. This function receives a list of the type returned by your Scorer; for instance, if your Scorer returns a `str`, the Aggregator will be called with a `list[str]`. The Aggregator’s return value can also be any type (e.g., `str`, `bool`, `int`), depending on how you want to represent the final metric.
- **`LocalMetricConfig[type]`**: a typed callable provided by Galileo’s Python SDK that combines your Scorer and Aggregator into a custom metric.
  - The generic `type` should match the type returned by your Aggregator.
  - Example: if your Scorer returns `bool` values, you would use `LocalMetricConfig[bool](…)`, and your Aggregator must accept a `list[bool]` and return a `bool`.
Scorer and Aggregator functions can be simple lambdas when your logic is straightforward.
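For instance, a metric that scores each span by output length and averages those scores can be wired up in a few lines. This is a minimal sketch: the import path and the `scorer_fn`/`aggregator_fn` keyword names are assumptions based on common Galileo examples, so check the SDK reference for your version.

```python
from galileo.schema.metrics import LocalMetricConfig  # import path may vary by SDK version

# Score each span by the character length of its output,
# then average the per-span scores into a single metric value.
output_length = LocalMetricConfig[float](
    name="Average Output Length",
    scorer_fn=lambda span: float(len(span.output or "")),
    aggregator_fn=lambda scores: sum(scores) / len(scores),
)
```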
Example: Response Brevity Metric
Below is a step-by-step implementation of a Local Metric that rates the brevity (shortness) of an LLM’s response based on word count.
Creating the Local Scorer
Create a Scorer Function
The Scorer Function assigns one of three ranks (`"Terse"`, `"Temperate"`, or `"Talkative"`) depending on how many words the model outputs.
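A sketch of such a Scorer follows; the word-count thresholds are arbitrary choices for this example, and we assume the response text is available on the span’s `output` attribute:

```python
def rank_brevity(span) -> str:
    """Rank an LLM response by its word count."""
    word_count = len(span.output.split())  # assumes `span.output` holds the response text
    if word_count < 5:
        return "Terse"
    if word_count < 20:
        return "Temperate"
    return "Talkative"
```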
Create an Aggregator Function
Since our Scorer returns a single rank per record, the Aggregator simply examines that rank and returns it, modifying it to flag overly long responses as `"Terrible"`.
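One way to express that, treating `"Talkative"` as the overly long case (a sketch matching the Scorer above):

```python
def aggregate_brevity(scores: list[str]) -> str:
    """Collapse the per-record ranks into the final metric value."""
    rank = scores[0]  # our Scorer yields exactly one rank per record
    # Flag overly long responses as "Terrible"; pass other ranks through.
    return "Terrible" if rank == "Talkative" else rank
```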
Create the Local Metric Configuration
Here, we tell Galileo that our custom metric returns a `str`. We give it a name (“Terseness”), assign the Scorer and Aggregator, and voilà, our Local Metric is ready.
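A sketch of the configuration (again, the import path and the `scorer_fn`/`aggregator_fn` keyword names are assumptions; consult the SDK reference for your version):

```python
from galileo.schema.metrics import LocalMetricConfig

# The generic parameter `str` matches the Aggregator's return type.
brevity_metric = LocalMetricConfig[str](
    name="Terseness",
    scorer_fn=rank_brevity,
    aggregator_fn=aggregate_brevity,
)
```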
The metric has been created. Next, we can use it in an experiment.
Prepare the Experiment
For this example, we’ll ask the LLM to specify the continent of four countries, encouraging it to be succinct.
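The sketch below shows one way to set that up. The `galileo.openai` wrapper and the dataset-row shape follow common Galileo examples, and the model name is an arbitrary choice; adjust to your setup:

```python
from galileo.openai import openai  # Galileo's OpenAI wrapper, assumed to log LLM calls automatically

dataset = [
    {"input": "Which continent is France in?"},
    {"input": "Which continent is Japan in?"},
    {"input": "Which continent is Brazil in?"},
    {"input": "Which continent is Egypt in?"},
]

def ask_continent(input: str) -> str:
    """Ask the model for a country's continent, nudging it to be brief."""
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer in as few words as possible."},
            {"role": "user", "content": input},
        ],
    )
    return response.choices[0].message.content
```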
Run the Experiment
The snippet below brings all the preceding code samples together into a single file, combining the dataset, the LLM call, and our custom metric in a call to `run_experiment`. After it runs, you’ll see a URL in your terminal that directs you to the experiment results in the Galileo console.
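A complete sketch, assuming the import paths and keyword names used above; the experiment and project names are placeholders:

```python
from galileo.experiments import run_experiment
from galileo.openai import openai
from galileo.schema.metrics import LocalMetricConfig

# --- Scorer: rank each response by word count ---
def rank_brevity(span) -> str:
    word_count = len(span.output.split())
    if word_count < 5:
        return "Terse"
    if word_count < 20:
        return "Temperate"
    return "Talkative"

# --- Aggregator: one rank per record; flag overly long responses ---
def aggregate_brevity(scores: list[str]) -> str:
    rank = scores[0]
    return "Terrible" if rank == "Talkative" else rank

# --- Local Metric configuration ---
brevity_metric = LocalMetricConfig[str](
    name="Terseness",
    scorer_fn=rank_brevity,
    aggregator_fn=aggregate_brevity,
)

# --- Dataset and LLM call ---
dataset = [
    {"input": "Which continent is France in?"},
    {"input": "Which continent is Japan in?"},
    {"input": "Which continent is Brazil in?"},
    {"input": "Which continent is Egypt in?"},
]

def ask_continent(input: str) -> str:
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer in as few words as possible."},
            {"role": "user", "content": input},
        ],
    )
    return response.choices[0].message.content

# --- Run the experiment with the custom metric attached ---
run_experiment(
    "brevity-experiment",       # placeholder experiment name
    dataset=dataset,
    function=ask_continent,
    metrics=[brevity_metric],
    project="my-project",       # placeholder project name
)
```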
View the Results
After the experiment completes, your terminal output will include a URL directing you to the experiment page in the Galileo console. On the experiment’s page, you’ll see a new column labeled Terseness (or whatever name you chose) containing your custom metric’s results for each input.
Conclusion
Local Metrics let you tailor evaluation to your exact needs by defining custom scoring logic in code. Whether you want to measure response brevity, detect specific keywords, or implement a complex scoring algorithm, Local Metrics integrate seamlessly with Galileo’s experimentation framework. Once you’ve defined your Scorer and Aggregator functions and wrapped them in a `LocalMetricConfig`, running the experiment is as simple as calling `run_experiment`. The results appear alongside Galileo’s built-in metrics, so you can compare, visualize, and analyze everything in one place.
With Local Metrics, you have full control over how you measure LLM behavior—unlocking deeper insights and more targeted evaluations for your AI applications.
Related Resources
- Metrics Overview: Understand Galileo’s built-in metrics and how they work.
- Running Experiments with Code: Get a broader view of how to set up and run experiments, including integrating custom metrics.
- Experiment Reference (Python SDK): API reference for `run_experiment` and related functions.