Custom Metrics - Code-based scoring
Learn how to create, register, and use custom metrics to evaluate your LLM applications
Custom metrics allow you to define specific evaluation criteria for your LLM applications. Galileo supports two types of custom metrics:
- Registered Scorers: Server-side metrics that can be shared across your organization
- Custom Scorers: Local metrics that run in your notebook environment
Registered Scorers
Registered Scorers run in Galileo’s backend environment and can be used across your organization in both Evaluate and Observe projects.
Creating a Registered Scorer
You can create a registered scorer either through the Python SDK or directly in the Galileo UI. Let’s walk through the UI approach:
Navigate to the Metrics section
In the Galileo platform, go to the Metrics section and click the “Create New Metric” button in the top right corner.
Select the Code metric type
From the dialog that appears, choose the “Code” metric type. This option allows you to write custom Python code to evaluate your LLM outputs.
Write your custom metric
Use the code editor to write your custom metric. The editor provides a template with the required functions and helpful comments to guide you.
The code editor allows you to write and test your metric directly in the browser. You’ll need to define at least the scorer_fn and aggregator_fn functions as described below.
Save your metric
After writing your custom metric code, click the “Save” button in the top right corner of the code editor. Your metric will be validated and, if there are no errors, it will be saved and become available for use across your organization.
You can now select this metric when running evaluations in both the Evaluate and Observe modules.
1. The Scorer Function (scorer_fn)
This function evaluates individual responses and returns a score.
The function must accept **kwargs to ensure forward compatibility. Here’s a complete example that measures the difference in length between the output and ground truth:
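A minimal sketch of such a scorer is below. It uses keyword-only parameters matching the names described under “Parameter details”, and it assumes the ground truth lives under an illustrative dataset_variables key named "target"; adjust both to your own setup.

```python
from typing import Dict


def scorer_fn(
    *,
    index: int,
    node_input: str,
    node_output: str,
    dataset_variables: Dict[str, str],
    **kwargs,
) -> float:
    # "target" is an illustrative key; use whichever dataset column holds
    # your ground truth.
    ground_truth = dataset_variables.get("target", "")
    # Absolute difference in character length between output and ground truth.
    return float(abs(len(node_output) - len(ground_truth)))
```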
Parameter details:
- index: Row index in the dataset
- node_input: Input to the node
- node_output: Output from the node
- node_name, node_type, node_id, tools: Workflow/chain-specific parameters
- dataset_variables: Key-value pairs from the dataset (includes ground truth)
2. The Aggregator Function (aggregator_fn)
This function aggregates individual scores into summary metrics:
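A minimal sketch, assuming the aggregator receives the per-row scores as a keyword-only list and returns a mapping of summary names to values:

```python
from typing import Dict, List


def aggregator_fn(*, scores: List[float], **kwargs) -> Dict[str, float]:
    # Turn the per-row scores into named summary metrics for the run.
    if not scores:
        return {"Average Score": 0.0, "Max Score": 0.0}
    return {
        "Average Score": sum(scores) / len(scores),
        "Max Score": max(scores),
    }
```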
Optional Functions
Score Type Function
This function defines the return type of your scorer (default is float).
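A sketch of what this can look like; the hook name score_type used here is an assumption, so use whatever name the editor template provides:

```python
from typing import Type


def score_type() -> Type:
    # Return int (or str, or bool) when your scorer does not produce a float.
    return int
```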
Node Type Restriction
This function restricts which node types your scorer can evaluate. For example, to only score retriever nodes:
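A sketch of restricting a scorer to retriever nodes; the hook name scoreable_node_types_fn is an assumption based on common template conventions:

```python
from typing import List


def scoreable_node_types_fn() -> List[str]:
    # Only rows produced by retriever nodes will be passed to scorer_fn.
    return ["retriever"]
```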
LLM Credentials Access
Your scorer can also be given access to LLM credentials during execution.
When enabled, credentials are passed to scorer_fn as a dictionary:
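A sketch of reading those credentials inside scorer_fn; the keyword name credentials and the provider key shown here are purely illustrative:

```python
from typing import Any, Dict


def scorer_fn(*, node_input: str, node_output: str, **kwargs) -> float:
    # Hypothetical: when credential access is enabled, the credentials
    # dictionary arrives alongside the other keyword arguments.
    credentials: Dict[str, Any] = kwargs.get("credentials", {})
    api_key = credentials.get("openai_api_key")  # illustrative key name
    # ... call your LLM provider with api_key to grade node_output ...
    return 1.0
```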
Complete Example: Response Length Scorer
Let’s create a custom metric that measures response length:
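A sketch of the complete scorer file, combining the required functions with the same signature conventions as above:

```python
from typing import Dict, List, Type


def scorer_fn(*, node_input: str, node_output: str, **kwargs) -> int:
    # Score each response by its length in characters.
    return len(node_output)


def aggregator_fn(*, scores: List[int], **kwargs) -> Dict[str, float]:
    # Summarize response lengths across the whole run.
    if not scores:
        return {"Average Length": 0.0, "Min Length": 0.0, "Max Length": 0.0}
    return {
        "Average Length": sum(scores) / len(scores),
        "Min Length": float(min(scores)),
        "Max Length": float(max(scores)),
    }


def score_type() -> Type:
    # Lengths are integers rather than the default float.
    return int
```

If you register the metric through the Python client instead of the UI, this code typically lives in its own file that you point the registration call at; the exact call depends on your client version.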
Execution Environment
Registered Scorers run in a Python 3.10 environment with a fixed set of pre-installed libraries.
We provide advance notice before major version updates to these libraries.
Custom Scorers
If you need additional libraries or want to run metrics locally, use Custom Scorers. These run in your notebook environment but are limited to the Evaluate module.
Creating a Custom Scorer
Custom Scorers require two functions:
1. The Executor Function
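The executor runs once per row and returns that row’s score. A sketch, where the row object’s response attribute is an assumption about your SDK version’s row type:

```python
def executor(row) -> float:
    # Return this row's score; row.response is assumed to hold the model output.
    return 1.0 if "thank you" in row.response.lower() else 0.0
```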
2. The Aggregator Function
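The aggregator then folds the per-row scores into summary values; the (scores, indices) signature here is a sketch, so match it to your SDK version:

```python
from typing import Dict, List


def aggregator(scores: List[float], indices: List[int]) -> Dict[str, float]:
    # Receives every per-row score (and its row index) and returns named summaries.
    return {"Fraction Passing": sum(scores) / len(scores) if scores else 0.0}
```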
Example: Response Length Custom Scorer
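A sketch of both functions for a response-length metric, under the same assumptions about the row object and signatures as above:

```python
from typing import Dict, List


def executor(row) -> float:
    # Length of the model response in characters.
    return float(len(row.response))


def aggregator(scores: List[float], indices: List[int]) -> Dict[str, float]:
    if not scores:
        return {"Average Length": 0.0, "Max Length": 0.0}
    return {
        "Average Length": sum(scores) / len(scores),
        "Max Length": max(scores),
    }
```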
Use your Custom Scorer:
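A sketch of wiring the scorer into an evaluation run with the Python client; the CustomScorer constructor arguments, the promptquality import, and the run call are assumptions about the client interface, so check them against your installed version:

```python
import promptquality as pq

# Hypothetical wiring; argument names may differ in your SDK version.
length_scorer = pq.CustomScorer(
    name="Response Length",
    executor=executor,
    aggregator=aggregator,
)

pq.run(
    template="Answer the question: {question}",  # illustrative template
    dataset="my_dataset.csv",                    # illustrative dataset path
    scorers=[length_scorer],
)
```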
Comparison: Registered vs. Custom Scorers
| Feature | Registered Scorers | Custom Scorers |
| --- | --- | --- |
| Creation | Python client, activatable via UI | Python client only |
| Sharing | Organization-wide | Current project only |
| Modules | Evaluate and Observe | Evaluate only |
| Definition | Independent Python file | Within notebook |
| Environment | Server-side | Local Python environment |
| Libraries | Limited to Galileo environment | Any available library |
| Resources | Restricted by Galileo | Local resources |
Common Use Cases
Custom metrics are ideal for:
- Heuristic evaluation: Checking for specific patterns, keywords, or structural elements
- Model-guided evaluation: Using pre-trained models to detect entities or LLMs to grade outputs
- Business-specific metrics: Measuring domain-specific quality indicators
- Comparative analysis: Comparing outputs against ground truth or reference data
Simple Example: Sentiment Scorer
Here’s a simple custom metric that evaluates the sentiment of responses:
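A sketch using small illustrative word lists and the registered-scorer signature conventions from earlier; swap in your own vocabulary or a proper sentiment model as needed:

```python
from typing import Dict, List

# Illustrative word lists; extend with domain-specific terms or swap in a model.
POSITIVE_WORDS = {"good", "great", "excellent", "helpful", "clear", "correct"}
NEGATIVE_WORDS = {"bad", "poor", "wrong", "confusing", "unhelpful", "incorrect"}


def scorer_fn(*, node_output: str, **kwargs) -> float:
    words = node_output.lower().split()
    positive = sum(1 for word in words if word in POSITIVE_WORDS)
    negative = sum(1 for word in words if word in NEGATIVE_WORDS)
    total = positive + negative
    if total == 0:
        return 0.0
    # Ranges from -1 (entirely negative words) to 1 (entirely positive words).
    return (positive - negative) / total


def aggregator_fn(*, scores: List[float], **kwargs) -> Dict[str, float]:
    # Report how responses are distributed across sentiment buckets.
    return {
        "Positive Responses": float(sum(1 for s in scores if s > 0)),
        "Neutral Responses": float(sum(1 for s in scores if s == 0)),
        "Negative Responses": float(sum(1 for s in scores if s < 0)),
    }
```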
This simple sentiment scorer:
- Counts positive and negative words in responses
- Calculates a sentiment score between -1 (negative) and 1 (positive)
- Aggregates results to show the distribution of positive, neutral, and negative responses
You can easily extend this with more sophisticated sentiment analysis techniques or domain-specific terminology.