Agent Flow is a custom, session-level LLM-as-a-judge metric based on a pre-created prompt from Galileo that you customize with your own tests.

Agent Flow is a binary evaluation metric that measures the correctness and coherence of an agentic trajectory by validating it against user-specified natural language tests.
A trajectory passes the Agent Flow metric if and only if the agent's actual behavior and output satisfy all of the user-defined natural-language conditions.
When you create this metric, you will need to provide the flow conditions in the prompt.
This is a boolean metric, returning either 0% (false) or 100% (true): 0% means the agent failed one or more of the user-defined tests, and 100% means it passed them all. If you use multiple judges, the score is the percentage of judges that scored true rather than false. For example, if 4 out of 5 judges scored the metric as true, the score would be 80%.
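As a rough illustration of this scoring logic (plain Python, not Galileo code; all verdict values below are hypothetical):

```python
# Illustrative sketch of the Agent Flow scoring logic -- not Galileo code.

# A single judge returns true only if ALL user-defined tests pass.
test_results = [True, True, True]      # hypothetical per-test outcomes
verdict = all(test_results)            # -> True

# With multiple judges, the score is the fraction that returned true.
judge_verdicts = [True, True, True, True, False]   # hypothetical: 4 of 5
score = 100 * sum(judge_verdicts) / len(judge_verdicts)
print(f"Agent Flow score: {score:.0f}%")           # Agent Flow score: 80%
```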

Create the Agent Flow metric

This metric needs to be manually created, using a prompt defined by Galileo.
1. Create a new LLM-as-a-judge metric

Create a new LLM-as-a-judge metric by following the instructions in our LLM-as-a-judge concept guide. Use the following settings:
| Setting | Value |
| --- | --- |
| Name | Agent flow |
| LLM Model | Select your preferred model |
| Apply to | Session |
| Advanced Settings | Configure these as required for your needs |
2. Set the prompt

Set the prompt to the following:
### Overview:
You are an evaluator for agentic system interactions. Your role is to assess whether user-defined tests pass or fail based on interaction logs. You will receive:
1. Condensed Logs: Complete logs of interaction history between user and LLM (including tool usage)
2. User Tests: A list of specific validation criteria defined by the user

Objective: Determine if ALL user-defined tests pass during the interaction.

### Instructions:
A test PASSES if:
    1. the test is applicable to the given data (the if/when condition from the test is triggered during the conversation), AND
    2. the check/requirements of the test hold true for the given data
OR
    1. the test is not applicable to the given data (no parts of the logs are relevant to the test)

When generating your explanation, walk through each test: explain its applicability, its outcome, and your overall conclusion. Be thorough and precise in your reasoning.

### User-defined tests
{{ Add your tests here }}

### Rubric:
True: ALL user-defined tests PASS for the given logs
False: ONE or more user-defined tests did NOT PASS for the given logs
3. Customize the prompt by adding your user-defined tests

This prompt needs to be customized for your application and the inputs and outputs you expect. Replace {{ Add your tests here }} with a numbered list of natural-language tests for evaluating the agent's trajectory. These can include:
  • Expected tool or agent calls, using the tool or agent names
  • Conditions on tool or agent calling (e.g. if tool x is called, don’t call agent y)
  • Expectations around the input or output parameters to tools and agents
  • Limitations on the number of tool or agent calls
For example, imagine you are creating an agent that provides advice on exercises for different body parts, such as for a physical therapy application. The agent has multiple tools, including list_by_target_muscle_for_exercised, list_by_body_part_for_exercised, and list_of_bodyparts_for_exercised. Some user tests might be:
1. If a call to "list_by_target_muscle_for_exercised" returns an error that contains the text "target not found", the agent should subsequently attempt an alternative lookup by calling either "list_by_body_part_for_exercised" or "list_of_bodyparts_for_exercised"
2. When the user asks for exercises that target leg muscles, the agent must call at least one of the tools ["list_by_target_muscle_for_exercised", "list_by_body_part_for_exercised"] during the conversation
3. After receiving a successful response from "list_by_body_part_for_exercised", the agent's following natural-language message must contain at least one exercise name, the corresponding equipment, and an animated demonstration URL taken from the tool output
4. Every invocation of the tool "list_by_body_part_for_exercised" must include the required parameter "bodypart"
5. After receiving data from list_by_body_part_for_exercised, the agent response must include the exercise id for every exercise it presents to the user
6. No assistant message should include more than one tool invocation
7. The agent should conclude the conversation with a human-readable answer that summarizes the requested leg exercises using data returned from the tools
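To see how such tests map onto a real session, here is a hypothetical condensed log excerpt, plus a mechanical rendering of test 4 as Python. The log structure and field names are assumptions for illustration only; the judge works over Galileo's actual condensed logs and applies the tests in natural language.

```python
# Hypothetical condensed session log -- the field names are illustrative
# assumptions, not Galileo's actual log schema.
session_log = [
    {"role": "user", "content": "What exercises target my quads?"},
    {"role": "assistant", "tool_call": {
        "name": "list_by_body_part_for_exercised",
        "arguments": {"bodypart": "upper legs"},
    }},
    {"role": "tool", "name": "list_by_body_part_for_exercised",
     "content": [{"id": "0001", "name": "barbell squat",
                  "equipment": "barbell",
                  "gifUrl": "https://example.com/squat.gif"}]},
    {"role": "assistant",
     "content": "Try the barbell squat (id 0001, barbell): "
                "https://example.com/squat.gif"},
]

def test_4_passes(log: list[dict]) -> bool:
    """Test 4: every list_by_body_part_for_exercised call includes 'bodypart'."""
    calls = [m["tool_call"] for m in log if "tool_call" in m]
    relevant = [c for c in calls if c["name"] == "list_by_body_part_for_exercised"]
    # Vacuously passes if the tool is never called (test not applicable).
    return all("bodypart" in c["arguments"] for c in relevant)

print(test_4_passes(session_log))  # -> True
```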
4. Save the metric

Save the metric, then turn it on for your Log stream.
Your metric is now ready to use in your project.
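If you log your agent with the Galileo Python SDK, sessions in that Log stream are scored automatically once the metric is on. The sketch below is only an outline of what that logging might look like: the project and Log stream names are placeholders, and the SDK surface shown here (galileo_context, the log decorator, span_type) may differ in your installed version, so check the SDK documentation.

```python
# Rough sketch of logging a tool call so the Agent Flow metric has a
# session to score. Treat the names and parameters as assumptions and
# verify against the current Galileo Python SDK docs.
from galileo import galileo_context, log

# Route logs to the project/Log stream where Agent Flow is enabled
# ("exercise-agent" and "dev" are placeholder names).
galileo_context.init(project="exercise-agent", log_stream="dev")

@log(span_type="tool")
def list_by_body_part_for_exercised(bodypart: str) -> list:
    # Placeholder tool body; a real tool would call the exercise API.
    return [{"id": "0001", "name": "barbell squat"}]

list_by_body_part_for_exercised("upper legs")
galileo_context.flush()  # ensure spans are sent before the process exits
```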