When to use Autotune
Use Autotune to improve metric performance when:
- A new custom metric isn’t accurate enough for your use case
- An existing metric isn’t generalizing well to a new domain or use case
- An existing metric is producing inconsistent results with low reviewer agreement in production
- The current prompt isn’t handling domain-specific edge cases reliably
- Manual prompt iteration is too time-consuming to scale
How it works
Review metric results across logs
Examine metric outputs across your logged spans, traces, or sessions where the metric isn’t performing as expected.
Identify incorrect outputs
Flag results that are incorrect or do not match your team’s expectations.
Enter the expected value and explain why it's correct
For each flagged result, enter the value the metric should have produced and add a natural-language explanation of why.
Retune the metric
Run Autotune using the collected feedback. Galileo aggregates the feedback and adapts the metric prompt accordingly.
Review and test the updated prompt
Inspect the changes to the prompt and test the updated metric before publishing.
You can optionally recompute historical results with the updated metric after publishing.
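The feedback collected in the steps above pairs each flagged result with an expected value and a rationale. A minimal sketch of what one such record might look like — the class and field names here are illustrative, not Galileo’s actual schema:

```python
from dataclasses import dataclass

@dataclass
class MetricFeedback:
    """One piece of corrective feedback for a metric result (hypothetical shape)."""
    span_id: str         # which logged span, trace, or session the metric scored
    observed_value: str  # what the metric actually produced
    expected_value: str  # what it should have produced
    explanation: str     # natural-language rationale for the correction

# Example: flagging an incorrect percentage score
feedback = [
    MetricFeedback(
        span_id="span-123",
        observed_value="80%",
        expected_value="60%",
        explanation="5 goals set (A-E), only B, D, and E completed: 3/5 = 60%",
    ),
]
```

Autotune aggregates records like these and adapts the metric prompt accordingly; the more specific the `explanation`, the more targeted the resulting prompt change can be.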
How to provide good feedback
Autotune supports unlimited feedback per metric. Good feedback should include what output the metric should have produced and why the corrected result is right. Avoid vague corrections:

| Vague | Good |
|---|---|
| "This score is wrong" | "Score should be 60% — the user had 5 goals (A, B, C, D, E) but only completed B, D, and E, so 3 out of 5 were met (3/5 = 60%)" |
| "This should be flagged" | "Should be flagged — the response recommends ibuprofen at a specific dosage without disclaiming that this is not medical advice, which violates the safety criterion" |
Which metrics are supported?
Autotune works across all LLM-as-a-judge metrics, output types, and metric levels.

| Category | Supported |
|---|---|
| Metric types | Out-of-the-box and custom LLM-as-a-judge metrics |
| Output types | All types — boolean, categorical, percentage, count, and discrete |
| Metric levels | All levels — spans, traces, and sessions |
Related resources
Custom LLM-as-a-Judge Metrics
Learn how to create and configure custom LLM-as-a-judge metrics in the Galileo console.
LLM-as-a-Judge Prompt Engineering Guide
Learn best practices for writing effective metric prompts.
Metrics Overview
Explore Galileo’s comprehensive metrics framework for evaluating and improving AI system performance.