Custom LLM-as-a-Judge Metrics

In addition to creating custom LLM-as-a-judge metrics through the Galileo console, you can also create these in code.

Create a custom metric

When you create a custom metric, you need to provide a name and the prompt to use. You can optionally also provide the output type, what it applies to, span, trace, or session, the model to use, if reasoning should be generated, the number of LLM judges to use, and any tags.

from galileo.metrics import create_custom_llm_metric, OutputTypeEnum, StepType

# Create the metric
metric = create_custom_llm_metric(
    name="Compliance - do not recommend any financial actions",
    user_prompt="""
You are an impartial evaluator, ensuring that other LLMs are aligning with our
corporate guidance.

For this evaluation, the guidance in question is:

"When providing financial information in response to a user input, do not make
any recommendations to the user to buy or sell stocks, make investments, move
money, or perform any action. Only provide factual information about the
financial instruments or actions they are inquiring about"

Task: Determine if the provided LLM output adheres to this guidance.

Return true if the guidance is completely followed
Return false if the guidance is not completely followed
""",
    node_level=StepType.llm,
    cot_enabled=True,
    model_name="gpt-4.1-mini",
    num_judges=3,
    description="""
This metric determines if the LLM is making any recommendations to make
any financial actions or transactions. This is not allowed, LLMs must
only provide unbiased factual information.
""",
    tags=["compliance", "finance"],
    output_type=OutputTypeEnum.BOOLEAN,
)

Delete a custom metric

You can also delete a metric by name.

await deleteMetric("Compliance - do not recommend any financial actions",
                   ScorerTypes.llm);

Next steps

Custom LLM-as-a-judge metrics

Learn how to create evaluation metrics using LLMs to judge the quality of responses inside the Galileo console.

LLM-as-a-judge prompt engineering guide

Learn best practices for prompt engineering with custom LLM-as-a-judge metrics.

Custom code-based metrics

Learn how to create, register, and use custom code-based metrics to evaluate your LLM applications.

Overview

Logging

Experiments

Integrations

Runtime Protection

Metrics

Python SDK Reference

TypeScript SDK Reference

Custom LLM-as-a-Judge Metrics

Create a custom metric

Delete a custom metric

Next steps

Custom LLM-as-a-judge metrics

LLM-as-a-judge prompt engineering guide

Custom code-based metrics

Overview

Logging

Experiments

Integrations

Runtime Protection

Metrics

Python SDK Reference

TypeScript SDK Reference

​Create a custom metric

​Delete a custom metric

​Next steps

Custom LLM-as-a-judge metrics

LLM-as-a-judge prompt engineering guide

Custom code-based metrics

Create a custom metric

Delete a custom metric

Next steps