LLM-as-a-judge metrics
LLM-as-a-judge metrics are natural language prompts that are run against an LLM, using the input and output from a span, trace, or session. When the span, trace, or session is logged, all the details, including inputs and outputs, are sent to the LLM along with a prompt, and the response from the prompt is used to score the metric. The response needs to be a fixed type for the metric to be evaluated correctly. Currently, the following response types are supported:

| Type | Allowed return values | Description |
|---|---|---|
| Boolean | true, false | A true or false prompt. The prompt must return true or false only. |
Create a new LLM-as-a-judge metric
You can create and manage LLM-as-a-judge metrics from the Galileo console.

1. Navigate to the Metrics section
In the Galileo console, go to the Metrics hub and select the + Create Metric button in the top right corner.

2. Select the LLM-as-a-Judge metric type
From the dialog that appears, choose the LLM-as-a-Judge metric type. This allows you to create metrics that use an LLM to evaluate responses based on criteria you define.

3. Give your metric a name and description
If you are planning to use this metric in an experiment, the name you set here is the name of the metric that you pass to the run experiments function. For example, if you have a metric called "Compliance - do not recommend any financial actions", you would pass that name to the run experiments function, as shown in the sketch below.
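A minimal sketch, assuming the Galileo Python SDK's run_experiment function; the experiment, project, dataset, and application function names here are placeholders:

```python
from galileo.datasets import get_dataset
from galileo.experiments import run_experiment

# Hypothetical application function that the experiment runs on each dataset row.
def my_app(input):
    # Call your LLM application here and return its output as a string.
    ...

results = run_experiment(
    "compliance-experiment",                 # placeholder experiment name
    project="my-project",                    # placeholder project name
    dataset=get_dataset(name="my-dataset"),  # placeholder dataset name
    function=my_app,
    # Pass your custom metric by the exact name you gave it in the console.
    metrics=["Compliance - do not recommend any financial actions"],
)
```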
4. Configure the metric
In the Prompt Editor, configure the following fields:
- LLM Model - select the model you want to use to evaluate this metric from your available integrations.
- Apply to - select what you want this metric to apply to: an entire session, a trace, or an individual LLM, retriever, or tool span.
- Step-by-step reasoning - turn this on to see metric explanations that give the reasoning behind the score. This uses more LLM tokens, with a corresponding increase in cost.
- No. of judges - select the number of judges to use. The metric is run against each judge, and an average score is returned. The more judges you use, the more accurate the score, but at a higher cost.
5. Configure the metric prompt
In the Prompt dialog, enter your prompt, and ensure it returns the right response type. For example, the prompt below includes an explicit instruction to return a boolean type.
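An illustrative sketch of such a prompt (the compliance criteria and wording here are examples, not a built-in prompt):

```
You are evaluating the output of an AI assistant for compliance.
Review the input and output below and decide whether the output
recommends any specific financial actions to the user.

Return true if the output does NOT recommend any financial actions.
Return false if it does.

Respond with true or false only. Do not include any other text.
```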
See our prompt engineering guide to learn more about writing an effective prompt, and what happens behind the scenes with your prompts.
6. Optional - get help writing a prompt for an LLM span metric
If you are creating an LLM span metric, you can use the Help me write option. Describe what you want the prompt to evaluate in natural language, then select the Auto-Generate Prompt button and Galileo will create a prompt for you using the selected model.
7. Save your metric
Once you are happy with your metric, select the Create metric button to save it. You can now enable this metric for your log streams.

Metric versions
As you use your metric against real-world data, you may want to iterate on the prompt or configuration to improve how it performs. Every time you update the metric, a new version is created and becomes the default. You can see the version history and select the default version from the Version History tab.


Best practices for LLM-as-a-Judge metrics
When to use LLM-as-a-Judge metrics
LLM-as-a-Judge metrics are particularly valuable for:
- Subjective evaluations: Assessing qualities like helpfulness, creativity, or appropriateness
- Complex criteria: Evaluating adherence to multiple guidelines or requirements
- Nuanced feedback: Getting detailed explanations about strengths and weaknesses
- Human-like judgment: Approximating how a human might perceive the quality of a response
Understanding the number of AI judges
The “Number of AI Judges” setting allows you to configure how many independent LLM evaluations to run in a chain-poll approach. This feature balances evaluation accuracy with processing efficiency:
- Using more judges generally produces more consistent and reliable evaluations by reducing the impact of individual outlier judgments
- However, increasing the number of judges also increases processing time and associated costs
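As a rough illustration of the averaging described above (a conceptual sketch, not Galileo's internal implementation), several boolean judgments combine into a single score like this:

```python
# Conceptual sketch: averaging boolean judgments from multiple judges.
judgments = [True, True, False, True, True]  # hypothetical responses from 5 judges

score = sum(judgments) / len(judgments)
print(score)  # 0.8 - with more judges, a single outlier judgment shifts the score less
```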
Limitations and considerations
While powerful, LLM-as-a-Judge metrics have some limitations to keep in mind:
- Potential bias: The LLM judge may have inherent biases that affect its evaluations
- Consistency challenges: Evaluations may vary slightly between runs
- Cost considerations: Using LLMs for evaluation incurs additional API costs
- Prompt sensitivity: The quality of evaluation depends heavily on how well the prompt is crafted
Next steps
LLM-as-a-Judge Prompt Engineering Guide
Learn best practices for prompt engineering with custom LLM-as-a-judge metrics.
Metrics overview
Explore Galileo’s comprehensive metrics framework for evaluating and improving AI system performance across multiple dimensions.
Custom code-based metrics
Learn how to create, register, and use custom code-based metrics to evaluate your LLM applications.
Run experiments
Learn how to run experiments in Galileo using the Galileo SDKs and custom metrics.