LLM-as-a-judge metrics
LLM-as-a-judge metrics are natural language prompts that are run against an LLM, using the input and output from a span, trace, or session. When the span, trace, or session is logged, all the details, including inputs and outputs, are sent to the LLM along with a prompt, and the response from the prompt is used to score the metric. The response needs to be a fixed type for the metric to be evaluated correctly. Currently, the following output types are supported; an illustrative example prompt follows the table:

Output type | Allowed return values | Description |
---|---|---|
Boolean | true, false | A true or false prompt. The prompt must return only true or false. |
Categorical | A string value from a predefined set of categories | A prompt that returns a string value from a set of defined categories. The prompt must also define the possible category values. |
Count | An integer, 0 or greater | An integer from 0 upwards that represents the count of something. |
Discrete | An integer in a defined range | A prompt that returns an integer in a range defined in the prompt. For example, a score from 0-5. |
Percentage | 0.0 - 1.0 | A prompt that returns a percentage value, scored from 0.0 to 1.0, with 0.0 being 0% and 1.0 being 100%. |
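For example, a prompt for a Discrete metric might end with an explicit instruction about the allowed range (illustrative only; adapt the criteria to your own use case):

```
Rate how polite the assistant's response is on a scale from 0 to 5, where 0 is
openly rude and 5 is consistently courteous.

Return only a single integer from 0 to 5, with no other text.
```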
Create a new LLM-as-a-judge metric in the Galileo console
1
Navigate to the Metrics section
In the Galileo console, go to the Metrics hub and select the + Create Metric button in the top right corner.

2
Select the LLM-as-a-Judge metric type
From the dialog that appears, choose the LLM-as-a-Judge metric type. This allows you to create metrics that use an LLM to evaluate responses based on criteria you define.

3
Give your metric a name and description
If you are planning to use this metric in an experiment, the name you set here is the name of the metric that you pass to the run experiment function. For example, if you have a metric called "Compliance - do not recommend any financial actions", you would pass this to an experiment like this:
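The following is a hedged sketch using the Galileo Python SDK's experiments API; the dataset name, application function, and project are placeholders, and the exact `run_experiment` signature may vary between SDK versions:

```python
from galileo.datasets import get_dataset
from galileo.experiments import run_experiment

# Placeholder application function that the experiment runs for each dataset row.
def my_financial_chatbot(input):
    # Call your LLM application here and return its response.
    return "I can explain the options, but I can't recommend specific actions."

run_experiment(
    "compliance-experiment",                           # placeholder experiment name
    dataset=get_dataset(name="financial-questions"),   # placeholder dataset
    function=my_financial_chatbot,
    # Pass the metric using the exact name you gave it in the console:
    metrics=["Compliance - do not recommend any financial actions"],
    project="my-project",                              # placeholder project name
)
```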
4
Define what this metric applies to
In the Apply to box, select what level this metric applies to. Each level is described in the table below, with an illustrative logging sketch after the table.
Level | Description |
---|---|
Session | A session can include multiple traces. Apply your metric to a session when you want to evaluate a multiple-step interaction, including multiple RAG retrievals, tool calls, or LLM calls. This could be a full conversation between a user and agent, with multiple back and forth interactions. |
Trace | A trace is typically one single interaction. Apply your metric to a trace when you want to evaluate a single step interaction, or a single step in a session. In a standard chatbot, this starts with a user message, includes internal agent tool calls, and ends with the first agent response back to the user. |
Retriever span | When building a RAG application, a retriever span consists of the user query as input and the retrieved documents as output. Apply your metric to a retriever span when you are evaluating document retrieval and processing steps. |
LLM span | An LLM span is a single LLM call. Its input contains the entire chat history up to that point, and its output is the latest LLM response. Apply your metric to an LLM span when you want to evaluate how the LLM processes the input and generates the output. |
Tool span | A tool span logs a tool call. The input is the arguments being passed to the tool, and the output is the tool's output. Apply your metric to a tool span when you are evaluating tools. |
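As a rough sketch of how these levels map to logged data with the Galileo Python SDK (the `@log` decorator, `span_type` values, and `galileo_context` arguments shown here are assumptions; check the SDK reference for the exact API):

```python
from galileo import galileo_context, log

# Each decorated function is logged as a span within the current trace.
@log(span_type="retriever")  # assumed span type name
def retrieve_documents(query: str) -> list:
    return ["doc 1 text", "doc 2 text"]  # placeholder retrieval results

@log(span_type="tool")  # assumed span type name
def lookup_account(account_id: str) -> dict:
    return {"id": account_id, "status": "active"}  # placeholder tool call

# A top-level call produces a trace containing the spans above; multiple
# traces logged in the same session are grouped together.
@log(span_type="workflow")  # assumed span type name
def answer_question(query: str) -> str:
    docs = retrieve_documents(query)
    account = lookup_account("abc-123")
    return f"Found {len(docs)} documents for account {account['id']}"

# Placeholder project and log stream names.
with galileo_context(project="my-project", log_stream="my-log-stream"):
    answer_question("What is the status of my account?")
```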
5
Configure the metric
In the Prompt Editor, configure the following fields:
- LLM Model - select the model from the available integrations that you want to use to evaluate this metric.
- Step-by-step reasoning - turn this on to see metric explanations, giving the reasoning behind the score. This uses more LLM tokens, with a corresponding increase in cost.
- No. of judges - select the number of judges to use. The metric is run against each judge, and an average score is returned. The more judges you use, the more accurate the score, but with a higher cost.
6
Configure the metric prompt
In the Prompt dialog, enter your prompt. Ensure your prompt returns the right response type; for example, a boolean metric prompt needs an explicit instruction to return only true or false.
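The following prompt is an illustrative example (adapt the criteria to your own use case) with an explicit instruction to return a boolean type:

```
You are evaluating whether an assistant response recommends specific financial
actions, such as buying, selling, or holding a particular asset.

Review the input and the output. The response is compliant if it does not
recommend any specific financial action.

Return true if the response is compliant, and false if it is not. Return only
the word true or false, with no other text.
```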
See our prompt engineering guide to learn more about writing an effective prompt, and what happens behind the scenes with your prompts.
7
Optional - get help writing a prompt using Prompt Assist
To help you create a metric prompt, you can use the Prompt assist feature. This allows you to define how you want the metric to work in natural language, and Galileo will create the metric prompt for you. To use prompt assist, select the Show prompt assist button.
Provide a natural language description of what you want the metric to measure, including:

- The output you would like, e.g. binary, categories, or a number range.
- The criteria you would like the metric to use to decide the output values.

Once done, select the Generate prompt button to generate the metric prompt; an example description follows below.
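For example, you might give Prompt assist a description like this (illustrative only):

```
I want a true/false metric. Return true if the response does not recommend any
specific financial actions, such as buying or selling a particular asset, and
false if it does.
```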
8
Test your metric
When you have your metric configured, it is important to test it against multiple inputs and outputs. You can then use the results of the tests to iterate on the metric prompt and configuration, for example by experimenting with different models or numbers of judges.

To test your metric, head to the Test Metric tab. You can either test the metric by passing in a manual input, or by using logged sessions, traces, spans, or experiments. Due to the complex structure of sessions and traces, manual input is only supported for metrics that apply to spans. To test session-level and trace-level metrics, you need to test with existing logged sessions or traces, or with experiments.

For manual input testing, provide the input and output you want to test against, then select the Test button. You will see the result of the metric, and an explanation if you have step-by-step reasoning turned on.

To test against current logs or experiments, select the project, then the source type, then the relevant Log stream or experiment. You can then run your metric against the last 5 logged sessions, traces, spans, or experiments by selecting the Test Metric button. The metric will be calculated, along with an explanation if you have step-by-step reasoning turned on.

After the metrics are calculated, select each row to see more details of the explanation, if available.


9
Save your metric
Once you are happy with your metric, select the Create metric button to save your metric. You can now enable this metric for your Log streams.

Create a new LLM-as-a-judge metric in code
In addition to creating custom LLM-as-a-judge metrics through the Galileo console, you can also create them in code.
Create a custom metric
When you create a custom metric, you need to provide a name and the prompt to use. You can optionally also provide the output type; what it applies to (span, trace, or session); the model to use; whether step-by-step reasoning should be generated; the number of LLM judges to use; and any tags.
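A minimal sketch using the Galileo Python SDK; the helper name and parameter names below are assumptions and may differ in your SDK version, so check the SDK reference for the exact signature:

```python
from galileo.metrics import create_custom_llm_metric  # assumed helper name

create_custom_llm_metric(
    name="Compliance - do not recommend any financial actions",
    user_prompt=(
        "Review the input and output. Return true if the output does not "
        "recommend any specific financial actions, and false if it does. "
        "Return only the word true or false."
    ),
    node_level="llm",           # assumed parameter: span, trace, or session level
    cot_enabled=True,           # assumed parameter: step-by-step reasoning
    model_name="gpt-4.1-mini",  # assumed parameter: judge model
    num_judges=3,               # assumed parameter: number of judges
)
```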
Delete a custom metric
You can also delete a metric by name.
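A matching sketch for deleting by name; again, the helper name is an assumption:

```python
from galileo.metrics import delete_metric  # assumed helper name

delete_metric(name="Compliance - do not recommend any financial actions")
```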
Metric versions
As you use your metric against real-world data, you may want to iterate on the prompt or configuration to improve how it performs against real user data. Every time you update the metric, a new version is created, and this new version becomes the default. You can see the version history and select the default version from the Version History tab.


Best practices for LLM-as-a-Judge metrics
When to use LLM-as-a-Judge metrics
LLM-as-a-Judge metrics are particularly valuable for:
- Subjective evaluations: Assessing qualities like helpfulness, creativity, or appropriateness
- Complex criteria: Evaluating adherence to multiple guidelines or requirements
- Nuanced feedback: Getting detailed explanations about strengths and weaknesses
- Human-like judgment: Approximating how a human might perceive the quality of a response
Understanding the number of AI judges
The “Number of AI Judges” setting allows you to configure how many independent LLM evaluations to run in a chain-poll approach. This feature balances evaluation accuracy with processing efficiency:
- Using more judges generally produces more consistent and reliable evaluations by reducing the impact of individual outlier judgments (see the small example after this list)
- However, increasing the number of judges also increases processing time and associated costs
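As a simple illustration of the averaging described above (plain Python arithmetic, not SDK code): three boolean judges returning true, true, and false would yield a score of about 0.67.

```python
# Illustrative only: a boolean metric evaluated by three independent judges.
judgments = [True, True, False]

# The reported metric value is the average of the individual verdicts.
score = sum(judgments) / len(judgments)
print(score)  # 0.666... - two of the three judges returned true
```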
Limitations and considerations
While powerful, LLM-as-a-Judge metrics have some limitations to keep in mind:
- Potential bias: The LLM judge may have inherent biases that affect its evaluations
- Consistency challenges: Evaluations may vary slightly between runs
- Cost considerations: Using LLMs for evaluation incurs additional API costs
- Prompt sensitivity: The quality of evaluation depends heavily on how well the prompt is crafted
Next steps
LLM-as-a-Judge Prompt Engineering Guide
Learn best practices for prompt engineering with custom LLM-as-a-judge metrics.
Metrics overview
Explore Galileo’s comprehensive metrics framework for evaluating and improving AI system performance across multiple dimensions.
Custom code-based metrics
Learn how to create, register, and use custom code-based metrics to evaluate your LLM applications.
Run experiments
Learn how to run experiments in Galileo using the Galileo SDKs and custom metrics.