LLM-as-a-judge metrics evaluate LLM application outputs at scale, but they may not reflect your team’s domain-specific standards out of the box. Whether you’re adapting a preset metric to a new domain or refining a custom metric that still isn’t accurate enough, the metric prompt often needs tuning to capture your specific evaluation criteria, and doing that manually is time-consuming and hard to scale. Teams typically rewrite prompts, test changes, and repeat that cycle across multiple rounds with no guarantee the result is right.

Autotune lets anyone involved in building or reviewing metrics (annotators, product managers, or developers) provide feedback on metric outputs instead of editing prompts directly. Reviewers correct results and explain their reasoning in natural language. Galileo translates that feedback into prompt improvements and shows exactly what changed.

When to use Autotune

Use Autotune to improve metric performance when:
  • A new custom metric isn’t accurate enough for your use case
  • An existing metric isn’t generalizing well to a new domain or use case
  • An existing metric is producing inconsistent results with low reviewer agreement in production
  • The current prompt isn’t handling domain-specific edge cases reliably
  • Manual prompt iteration is too time-consuming to scale

How it works


1. Review metric results across logs: examine metric outputs across your logged spans, traces, or sessions where the metric isn’t performing as expected.
2. Identify incorrect outputs: flag results that are incorrect or do not match your team’s expectations.
3. Enter the expected value and explain why it’s correct: for each flagged result, enter the value the metric should have produced and add a natural-language explanation of why.
4. Retune the metric: run Autotune using the collected feedback. Galileo aggregates the feedback and adapts the metric prompt accordingly.
5. Review and test the updated prompt: inspect the changes to the prompt and test the updated metric before publishing.
6. Apply the improved metric to future runs: publish the updated metric so it is used for new logs and evaluations. Galileo automatically versions the metric so you can track changes and revert if needed. You can optionally recompute historical results with the updated metric after publishing.

How to provide good feedback

Autotune supports unlimited feedback per metric. Good feedback should include what output the metric should have produced and why the corrected result is right. Avoid vague corrections:
  • Vague: "This score is wrong"
    Good: "Score should be 60%: the user had 5 goals (A, B, C, D, E) but only completed B, D, and E, so 3 out of 5 were met (3/5 = 60%)"
  • Vague: "This should be flagged"
    Good: "Should be flagged: the response recommends ibuprofen at a specific dosage without disclaiming that this is not medical advice, which violates the safety criterion"

Which metrics are supported?

Autotune works across all LLM-as-a-judge metrics, output types, and metric levels.
  • Metric types: out-of-the-box and custom LLM-as-a-judge metrics
  • Output types: all types (boolean, categorical, percentage, count, and discrete)
  • Metric levels: all levels (spans, traces, and sessions)
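To make the output types concrete, the sketch below shows the kind of value each type implies when you enter an expected result. The alias names are assumptions for illustration and do not come from Galileo's schema:

```python
from typing import Union

# Illustrative value shapes for each supported output type
# (names are for illustration only, not Galileo's schema).
BooleanOutput = bool      # e.g. True / False ("flagged" or not)
CategoricalOutput = str   # e.g. "safe", "unsafe", "needs_review"
PercentageOutput = float  # e.g. 0.60 for 60%
CountOutput = int         # e.g. 3 goals completed
DiscreteOutput = int      # e.g. a rating on a fixed 1-5 scale

MetricOutput = Union[
    BooleanOutput, CategoricalOutput, PercentageOutput, CountOutput, DiscreteOutput
]
```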

Custom LLM-as-a-Judge Metrics

Learn how to create and configure custom LLM-as-a-judge metrics in the Galileo console.

LLM-as-a-Judge Prompt Engineering Guide

Learn best practices for writing effective metric prompts.

Metrics Overview

Explore Galileo’s comprehensive metrics framework for evaluating and improving AI system performance.