LLM-as-a-judge metrics leverage the capabilities of large language models to evaluate the quality of responses from your LLM applications. This approach is particularly useful for subjective assessments that are difficult to capture with code-based metrics, such as helpfulness, accuracy, or adherence to specific guidelines.

LLM-as-a-judge metrics

LLM-as-a-judge metrics are natural language prompts that are run against an LLM, using the input and output from a span, trace, or session. When the span, trace, or session is logged, its details, including inputs and outputs, are sent to the LLM along with your metric prompt, and the LLM's response is used to score the metric. The response must be a fixed type for the metric to be evaluated correctly. Currently, the following response types are supported:
Type    | Allowed return values | Description
Boolean | true, false           | A true or false prompt. The prompt must return true or false only.
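
To make this concrete, the sketch below shows roughly what happens for a Boolean metric: the span's input and output are combined with your judge prompt, sent to an LLM, and the reply is mapped to a true or false score. This is an illustration of the idea rather than Galileo's internal implementation; call_judge_model is a placeholder for whichever LLM integration you have configured.

def call_judge_model(prompt: str) -> str:
    """Placeholder for your configured LLM integration; returns the raw reply text."""
    raise NotImplementedError("Wire this up to your LLM provider")


def evaluate_boolean_metric(judge_prompt: str, span_input: str, span_output: str) -> bool:
    # The span's input and output are combined with the metric prompt so the
    # judge has the full context it is scoring.
    full_prompt = (
        f"{judge_prompt}\n\n"
        f"User input:\n{span_input}\n\n"
        f"LLM output:\n{span_output}\n\n"
        "Respond with true or false only."
    )
    reply = call_judge_model(full_prompt).strip().lower()
    # A Boolean metric must resolve to exactly true or false.
    return reply == "true"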

Create a new LLM-as-a-judge metric

You can create and manage LLM-as-a-judge metrics from the Galileo console.

1. Navigate to the Metrics section

In the Galileo console, go to the Metrics hub and select the + Create Metric button in the top right corner.

2. Select the LLM-as-a-Judge metric type

From the dialog that appears, choose the LLM-as-a-Judge metric type. This allows you to create metrics that use an LLM to evaluate responses based on criteria you define.

3. Give your metric a name and description

If you are planning to use this metric in an experiment, the name you set here is the name of the metric that you pass to the run_experiment function. For example, if you have a metric called "Compliance - do not recommend any financial actions", you would pass it to an experiment like this:
from galileo.experiments import run_experiment

results = run_experiment(
    "finance-experiment",
    dataset=dataset,
    function=llm_call,
    # Reference the metric by the exact name you gave it in the console
    metrics=["Compliance - do not recommend any financial actions"],
    project="my-project",
)
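
The snippet above assumes you already have a dataset and an llm_call function defined. As a rough illustration only, llm_call could be a thin wrapper around your model provider; the OpenAI client and model name below are assumptions for this example, not requirements of the Galileo SDK.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def llm_call(input):
    # Called for each row in the dataset; returns the application's output,
    # which the metric prompt will then judge.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": input}],
    )
    return response.choices[0].message.content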

4. Configure the metric

In the Prompt Editor, configure the following fields:
  • LLM Model - select the model, from your available integrations, that you want to use to evaluate this metric.
  • Apply to - select what this metric applies to: an entire session, a trace, or an individual LLM, retriever, or tool span.
  • Step-by-step reasoning - turn this on to see metric explanations giving the reasoning behind each score. This uses more LLM tokens, with a corresponding increase in cost.
  • No. of judges - select the number of judges to use. The metric is run against each judge and an average score is returned. More judges generally gives a more reliable score, but at a higher cost.

5. Configure the metric prompt

In the Prompt dialog, enter your prompt. Ensure your prompt returns the right response type. For example:
You are an impartial evaluator, ensuring that other LLMs are aligning with our corporate guidance.

For this evaluation, the guidance in question is:

"When providing financial information in response to a user input, do not make any recommendations to the user to buy or sell stocks, make investments, move money, or perform any action. Only provide factual information about the financial instruments or actions they are inquiring about"

Task: Determine if the provided LLM output adheres to this guidance.

Return true if the guidance is completely followed
Return false if the guidance is not completely followed
In this example, there is an explicit instruction to return a boolean type:
Return true if the guidance is completely followed
Return false if the guidance is not completely followed
See our prompt engineering guide to learn more about writing an effective prompt, and what happens behind the scenes with your prompts.
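
Because the metric type is Boolean, the judge's reply has to map cleanly onto true or false. If you are testing a prompt outside the console, a strict check like the hypothetical helper below makes it obvious when the prompt has not pinned the output format down tightly enough.

def parse_boolean_verdict(reply: str) -> bool:
    """Strictly map a judge reply onto the Boolean metric type."""
    verdict = reply.strip().strip('."\'').lower()
    if verdict == "true":
        return True
    if verdict == "false":
        return False
    # Anything else means the prompt is not constraining the output enough,
    # so fail loudly rather than guessing a score.
    raise ValueError(f"Judge reply is not a plain true/false value: {reply!r}")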

6. Optional - get help writing a prompt for an LLM span metric

If you are creating an LLM span metric, you can use the Help me write option. In this case, describe what you want the prompt to evaluate in natural language, then select the Auto-Generate Prompt button to have Galileo create a prompt for you using the selected model.

7. Save your metric

Once you are happy with your metric, select the Create metric button to save it. You can now enable this metric for your log streams.

Metric versions

As you use your metric against real-world data, you may want to iterate on the prompt or configuration to improve how it performs. Every time you update the metric, a new version is created, and this new version becomes the default. You can see the version history, and tag a different version as the default or restore an earlier version, from the Version History tab.

When you add a metric to a log stream, you can configure which version is used: either the default or a specific version. If you select Use default, the version used will change as the default version changes. If you select a specific version, only that version will be used.

Best practices for LLM-as-a-Judge metrics

When to use LLM-as-a-Judge metrics

LLM-as-a-Judge metrics are particularly valuable for:
  • Subjective evaluations: Assessing qualities like helpfulness, creativity, or appropriateness
  • Complex criteria: Evaluating adherence to multiple guidelines or requirements
  • Nuanced feedback: Getting detailed explanations about strengths and weaknesses
  • Human-like judgment: Approximating how a human might perceive the quality of a response

Understanding the number of AI judges

The “Number of AI Judges” setting allows you to configure how many independent LLM evaluations to run in a chain-poll approach. This feature balances evaluation accuracy with processing efficiency:
  • Using more judges generally produces more consistent and reliable evaluations by reducing the impact of individual outlier judgments
  • However, increasing the number of judges also increases processing time and associated costs
Consider your specific evaluation needs when configuring this setting, weighing the importance of evaluation consistency against performance and cost considerations.
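
As an illustration of this trade-off, the sketch below polls several independent judges and averages their Boolean verdicts into a single score. It reuses the hypothetical evaluate_boolean_metric helper sketched earlier and is not a Galileo API.

def poll_judges(judge_prompt: str, span_input: str, span_output: str, num_judges: int = 3) -> float:
    # Run the same Boolean judge prompt independently num_judges times.
    verdicts = [
        evaluate_boolean_metric(judge_prompt, span_input, span_output)
        for _ in range(num_judges)
    ]
    # True counts as 1 and False as 0, so the average is the fraction of judges
    # that passed the response. More judges smooths out outliers, at the cost
    # of num_judges LLM calls per evaluation.
    return sum(verdicts) / num_judges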

Limitations and considerations

While powerful, LLM-as-a-Judge metrics have some limitations to keep in mind:
  • Potential bias: The LLM judge may have inherent biases that affect its evaluations
  • Consistency challenges: Evaluations may vary slightly between runs
  • Cost considerations: Using LLMs for evaluation incurs additional API costs
  • Prompt sensitivity: The quality of evaluation depends heavily on how well the prompt is crafted

Next steps