Customizing your LLM-powered metrics via CLHF
Learn how to customize your LLM-powered metrics with Continuous Learning via Human Feedback.
As you start using Galileo's Preset LLM-powered metrics (e.g. Context Adherence or Instruction Adherence), or creating your own LLM-powered metrics, you might not always agree with the results. False positives and false negatives in metric values are often caused by domain-specific edge cases that aren't handled in the metric's prompt.
Galileo helps you address this problem: you can adapt and continuously improve your metrics through our Continuous Learning via Human Feedback (CLHF) features.
How it works
As you identify mistakes in your metrics, you can provide feedback to auto-improve them. Your feedback is translated (by an LLM) into few-shot examples that are appended to the metric's prompt. Few-shot examples help your LLM-as-a-judge in two ways (see the sketch after this list):
- Examples drawn from your domain data teach it what to expect from your domain.
- Concrete examples of edge cases teach it how to handle outlier scenarios.
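The snippet below is a minimal sketch of this idea, not Galileo's actual implementation: it shows how a piece of human feedback could be turned into a few-shot example and appended to a judge prompt. All names (`Feedback`, `build_tuned_prompt`, the prompt text) are illustrative assumptions.

```python
# Illustrative sketch only; not the Galileo SDK or its real prompt format.
from dataclasses import dataclass


@dataclass
class Feedback:
    """One piece of human feedback on a metric mistake."""
    model_input: str        # the input the judged response was produced for
    model_output: str       # the response the metric scored incorrectly
    wrong_explanation: str  # the explanation the metric originally produced
    critique: str           # your critique of that explanation
    desired_value: int      # 0 or 1 (which side of 0.5 the value should be on)


BASE_JUDGE_PROMPT = (
    "You are a judge. Given an input and a response, decide whether the "
    "response adheres to the provided context. Return a value between 0 and 1 "
    "and a short explanation."
)


def to_few_shot_example(fb: Feedback) -> str:
    """Convert one feedback record into a few-shot example for the judge."""
    return (
        f"Input: {fb.model_input}\n"
        f"Response: {fb.model_output}\n"
        f"Correct value: {fb.desired_value}\n"
        f"Reasoning: {fb.critique}"
    )


def build_tuned_prompt(feedback: list[Feedback], max_examples: int = 15) -> str:
    """Append the most recent few-shot examples (up to the limit) to the base prompt."""
    examples = [to_few_shot_example(fb) for fb in feedback[-max_examples:]]
    return BASE_JUDGE_PROMPT + "\n\nExamples:\n\n" + "\n\n".join(examples)
```

The judge then sees your domain data and your reasoning about edge cases every time it scores a new row.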
This process has been shown to increase metric accuracy by 20-30%.
How to create good feedback
When entering feedback, write a critique of the explanation that accompanied the erroneous metric value. Be as precise as possible in your critique, stating exactly why the metric should have returned the desired value.
How many examples to provide
You should see significant improvement with just one or two examples.
If a small number of examples doesn't work, adding more may help. We recommend an iterative workflow (sketched after this list):
- Provide feedback using just one or a small number of examples
- Retune the metric
- Run the updated metric on the same data
- Look over the results to see if the problem is fully resolved
- If not, provide feedback on a few more examples and repeat the process
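A hypothetical sketch of this loop is shown below. The helper functions (`run_metric`, `find_mistakes`, `submit_feedback`, `retune`) stand in for actions you perform in the Galileo console; they are not real API calls.

```python
# Hypothetical sketch of the iterative CLHF workflow; the callables are
# placeholders for console actions, not Galileo SDK functions.
def improve_metric(metric, dataset, run_metric, find_mistakes,
                   submit_feedback, retune, max_rounds: int = 5):
    """Iteratively feed back a few examples at a time until the metric is fixed."""
    for _ in range(max_rounds):
        results = run_metric(metric, dataset)    # re-score the same data
        mistakes = find_mistakes(results)        # rows where you disagree with the metric
        if not mistakes:
            return metric                        # problem resolved
        for example in mistakes[:2]:             # one or two examples per round
            submit_feedback(metric, example)     # critique the metric's explanation
        metric = retune(metric)                  # regenerate the metric prompt
    return metric
```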
Providing an example to CLHF typically only helps with the error the metric made in that example, not with other errors that might be occurring in other cases.
If there are several different cases that you want the metric to handle differently, your feedback needs to include at least one instance of each of those cases.
What makes good examples
Pick examples where the metric's value is high (i.e., >= 0.5) but you think it should be low, or where it is low (< 0.5) but you want it to be high.
Don’t submit feedback on:
- Cases where you disagree with the explanation but agree with the value
- Cases where the value was on the correct side of 0.5 but you want it to be more or less extreme, e.g. the value was 0.67 but you want it to be 1.
If the metric's current value is >= 0.5, CLHF interprets your feedback as "ideally the value should be 0"; if the value is < 0.5, it interprets your feedback as "ideally the value should be 1".
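In other words (illustrative pseudocode of the rule above, not Galileo's implementation):

```python
def target_value(current_value: float) -> int:
    """Feedback on a high value asks for 0; feedback on a low value asks for 1."""
    return 0 if current_value >= 0.5 else 1


assert target_value(0.67) == 0  # high value -> interpreted as "should be low"
assert target_value(0.2) == 1   # low value  -> interpreted as "should be high"
```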
Your feedback should state unambiguously what was wrong with the metric's value and explanation for the example: clearly explain what you disagree with in the explanation and how you would want it to differ.
Limits on the number of examples
Examples are limited to 15 per metric per project. If you submit more than 15 examples to CLHF for a given metric in a given project, only the most recently submitted 15 will be used.
How to use it
See this video on how to use Continuous Learning via Human Feedback to improve your metric accuracy:
Which metrics support CLHF?
- Context Adherence
- Instruction Adherence
- Correctness
- Preset and custom LLM-as-a-judge metrics