Agent efficiency is a custom LLM-as-a-judge session-level metric, with a pre-created prompt available from Galileo.

Agent efficiency is a binary metric that evaluates how efficiently your agentic workflows run. An agentic session is considered efficient, or optimal, when the agent provides a precise answer or resolution to every user ask via an efficient path. An ask could be a question that requires an answer, or a request that requires resolution through tool usage.
Efficiency here means the agent makes no redundant tool calls, asks the user no redundant questions or clarifications, communicates precisely and concisely, and reaches its goal in the minimum number of steps. This is a boolean metric, returning either 0% (false) or 100% (true): 0% means the agent is not efficient, 100% means it is. If you use multiple judges, the score is the percentage of judges who scored true rather than false. For example, if 4 out of 5 judges scored true, the score would be 80%.
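
To make the multi-judge scoring concrete, here is a minimal sketch of the aggregation arithmetic. The function and example votes are hypothetical and purely illustrative; Galileo computes this score for you.

```python
# Minimal sketch of the multi-judge scoring arithmetic described above.
# Hypothetical helper, not part of the Galileo SDK.

def aggregate_judge_score(votes: list[bool]) -> float:
    """Return the percentage of judges that scored the session as efficient."""
    if not votes:
        raise ValueError("At least one judge vote is required")
    return 100 * sum(votes) / len(votes)

# A single judge yields a strictly binary score: 0.0 or 100.0.
print(aggregate_judge_score([True]))  # 100.0

# With five judges, four of whom scored true, the score is 80%.
print(aggregate_judge_score([True, True, True, True, False]))  # 80.0
```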

Create the agent efficiency metric

This metric must be created manually, using a prompt defined by Galileo.
Step 1: Create a new LLM-as-a-judge metric

Create a new LLM-as-a-judge metric by following the instructions in our LLM-as-a-judge concept guide. Use the following settings:
| Setting | Value |
| --- | --- |
| Name | Agent efficiency |
| LLM Model | Select your preferred model |
| Apply to | Session |
| Advanced Settings | Configure these as required for your needs |
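
If you keep metric configurations alongside your code, the table above can be captured as plain data. This is a purely illustrative sketch; the keys mirror the table and are not a Galileo SDK schema.

```python
# Hypothetical record of the settings above; illustrative only, not a Galileo API object.
agent_efficiency_metric = {
    "name": "Agent efficiency",
    "llm_model": "<your preferred judge model>",  # any supported judge model
    "apply_to": "Session",                        # session-level metric
    "advanced_settings": {},                      # configure as required for your needs
}
```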
Step 2: Set the prompt

Set the prompt to the following:
### Overview:
This is a binary classification prompt designed for automated metric calculation in applications. You will be given specific instructions defining your evaluation task, followed by clear rubrics that determine when to classify content as True or False. Your role is to carefully analyze the provided content against these criteria and return a structured JSON response.

### Instructions:
You will receive the complete chat history from a chatbot application, capturing interactions between a user and an assistant.
In the chat history, the user may ask questions, issue requests, or give commands. Treat all as user intent that requires a response. The assistant may respond with text, call tools to resolve actions, or take multiple steps involving internal reasoning, planning, and tool selection before replying.
I want you to analyze the chat history and determine if the conversation is optimal.

### Evaluation Criteria:
1. The assistant understands the user's request clearly, minimizing unnecessary follow-ups.
2. Responses are clear, concise, and directly address the user’s intent.
3. If tools are used, the assistant selects the correct ones with appropriate arguments.
4. If tools are used, there should be no tool errors.
5. When additional information is needed, the assistant asks precise and well-formulated questions.
6. The user is satisfied with the assistant’s responses by the end.
7. The assistant answers all the user's questions and requests, and the conversation is not cut off.

### Your Task:
1. Analyze the chat step by step.
2. Explain your observations before concluding.
3. Determine if the conversation meets all optimality criteria.

### Rubric:
True: If the conversation meets all the optimality criteria and is fully optimized.
False: If the conversation is even slightly suboptimal.
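
The prompt's overview mentions a structured JSON response; the exact response schema is supplied by Galileo at evaluation time. The sketch below only illustrates how a boolean verdict could be extracted from such a response. The response shape and field names (`explanation`, `value`) are assumptions for illustration, not Galileo's actual schema.

```python
import json

# Hypothetical judge output; the real response schema is defined by Galileo at runtime.
raw_response = (
    '{"explanation": "Every user ask was resolved with no redundant '
    'tool calls or questions.", "value": true}'
)

def parse_verdict(raw: str) -> bool:
    """Extract the boolean True/False classification from a judge's JSON response."""
    payload = json.loads(raw)
    return bool(payload["value"])

print(parse_verdict(raw_response))  # True -> the session is scored efficient (100%)
```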
Step 3: Save the metric

Save the metric, then turn it on for your Log stream.
Your metric is now ready to use in your project.