Conversation quality is a custom LLM-as-a-judge session-level metric, with a pre-created prompt available from Galileo.

Conversation Quality is a binary evaluation metric that assesses whether a chatbot interaction left the user feeling satisfied and positive, or frustrated and dissatisfied, based on tone, engagement, and overall experience.
This is a boolean metric, returning either 0% (false) or 100% (true): 0% means the interaction left the user feeling frustrated and dissatisfied; 100% means it left the user feeling satisfied and positive. If you use multiple judges, the score is the percentage of judges who scored true rather than false. For example, if 4 out of 5 judges scored the metric as true, the score would be 80%.
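
As a quick illustration (a sketch, not Galileo's implementation), the multi-judge score is simply the fraction of true votes expressed as a percentage:

```python
def conversation_quality_score(judge_votes: list[bool]) -> float:
    """Average boolean judge votes into a percentage score.

    Illustrative sketch of the aggregation described above,
    not Galileo's actual implementation.
    """
    if not judge_votes:
        raise ValueError("at least one judge vote is required")
    return 100 * sum(judge_votes) / len(judge_votes)


# 4 of 5 judges voted true -> 80.0
print(conversation_quality_score([True, True, True, True, False]))
```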

Create the conversation quality metric

This metric must be created manually, using a prompt defined by Galileo.

1. Create a new LLM-as-a-judge metric

Create a new LLM-as-a-judge metric by following the instructions in our LLM-as-a-judge concept guide. Use the following settings:
| Setting | Value |
| --- | --- |
| Name | Conversation quality |
| LLM Model | Select your preferred model |
| Apply to | Session |
| Advanced Settings | Configure these as required for your needs |

2. Set the prompt

Set the prompt to the following:
## Task Overview
You are an expert conversation analyst tasked with evaluating the quality of chatbot interactions. 
Your job is to classify each conversation session as either "GOOD" (true) or "BAD" (false) based on user satisfaction, tone, engagement, and overall experience.

## Rubric

#### What Makes a GOOD Conversation (true):
- User does not express harassment, irritation, or frustration directed at the bot
- An out-of-scope query (where the bot cannot fulfill the user's request due to functional limitations) does not automatically qualify as bad quality.
- User frustration is directed at external circumstances rather than the bot (e.g., delivery issues, third-party services)
- Bot successfully de-escalates user frustration about external factors

#### What Makes a BAD Conversation (false):
- Impatient, frustrated, or hostile tone from the user directed at the bot
- Repeated bot clarifying questions without meaningful progress  
- User frustration is specifically about bot performance or errors (e.g., wrong information, system crashes)
- User makes negative comparisons between bot and other services

#### Important Note on Abrupt Endings:
**Not all abrupt endings indicate BAD conversations.** Distinguish between:
- **Task-completed departures (potentially GOOD)**: User receives complete answer/solution and leaves without further comment. No negative sentiment expressed.
- **Out-of-scope departures (potentially GOOD)**: User's request is outside bot's capabilities but bot clearly communicates this limitation. User accepts explanation without frustration directed at bot.
- **Frustration-driven departures (BAD)**: User leaves due to bot failures, with expressions of frustration, giving up statements, or clear dissatisfaction before departure.

3. Save the metric

Save the metric, then turn it on for your Log stream.
Your metric is now ready to use in your project.
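
Once the metric is enabled, sessions logged to that Log stream will be scored. Below is a minimal sketch of logging a scorable session with the Galileo Python SDK; it assumes the `galileo` package's `GalileoLogger` and a `GALILEO_API_KEY` in your environment, and method names such as `start_session`, `start_trace`, `add_llm_span`, and `conclude` should be verified against the SDK reference for your version:

```python
from galileo import GalileoLogger

# Assumed SDK surface: verify these names against the Galileo
# Python SDK reference for your version.
logger = GalileoLogger(project="my-project", log_stream="my-log-stream")

# Group the exchange into a session so the session-level
# conversation quality metric has something to score.
logger.start_session(name="support-chat-example")

logger.start_trace(input="Where is my order?")
logger.add_llm_span(
    input="Where is my order?",
    output="Your order shipped yesterday and should arrive tomorrow.",
    model="gpt-4o",
)
logger.conclude(output="Your order shipped yesterday and should arrive tomorrow.")
logger.flush()  # send the logged session to Galileo
```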