Agent Flow is a binary metric that checks if an agent’s behavior satisfies all user-defined natural language conditions.
Agent Flow is a binary evaluation metric that measures the correctness and coherence of an agentic trajectory by validating it against user-specified natural language tests. A trajectory passes the Agent Flow metric if and only if the agent's realized behavior or output satisfies every user-defined condition.
To use this metric, you will need to create a copy and edit the prompt to provide your natural language tests.
Although the metric is boolean, it is reported as a confidence score that the agent flow satisfies all conditions, ranging from 0% (no confidence that all conditions are satisfied) to 100% (complete confidence that they are).

Agent Flow at a glance

Property: Description
Name: Agent Flow
Category: Agentic AI
Can be applied to: Session
LLM-as-a-judge Support
Luna Support
Protect Runtime Protection
Value Type: Boolean, shown as a percentage confidence score

When to use this metric

Agent Flow is useful for evaluating multi-agent systems that have well-defined paths or interactions:
Agents with multiple possible paths: Agent Flow can evaluate an agentic application that has multiple possible paths where you know the expected behavior for each user response. You can validate that the agent performs the expected behavior.
Agents with specific interaction rules: Agent Flow can validate specific interaction rules, for example ensuring the agent asks for confirmation before completing a purchase.
Agents with unconditional behaviors: Agent Flow can check for unconditional behaviors, such as verifying that the agent always calls the authentication tool during a conversation.

Score interpretation

Expected score: 80%–100%. On the score scale, values near 0 are rated Poor, values around 60% Fair, and values approaching 100% Excellent.

Configure Agent Flow

This metric needs to be manually customized to include your own natural language tests.
1. Create a copy of the Agent Flow metric

From the Metrics Hub, select the Agent Flow metric. A popup will appear asking you to duplicate the metric; select Duplicate metric to create a copy.
2. Locate the user-defined tests section

Locate the user-defined tests section in the prompt.
<user-defined-tests>
{{ Add your tests here }}
</user-defined-tests>
3. Customize the prompt by adding your user-defined tests

This prompt needs to be customized for your application and the inputs and outputs you expect. Replace {{ Add your tests here }} with a numbered list of natural language tests that can be used to evaluate the agent's trajectory. These can include:
  • Expected tool or agent calls, using the tool or agent names
  • Conditions on tool or agent calling (e.g. if tool x is called, don’t call agent y)
  • Expectations around the input or output parameters to tools and agents
  • Limitations on the number of tool or agent calls
For example, imagine you are creating an agent that provides advice on exercises for different body parts, such as in a physical therapy application. The agent has multiple tools, including list_by_target_muscle_for_exercised, list_by_body_part_for_exercised, and list_of_bodyparts_for_exercised. Some user-defined tests might be:
1. If a call to "list_by_target_muscle_for_exercised" returns an error that contains the text "target not found", the agent should subsequently attempt an alternative lookup by calling either "list_by_body_part_for_exercised" or "list_of_bodyparts_for_exercised"
2. When the user asks for exercises that target leg muscles, the agent must call at least one of the tools ["list_by_target_muscle_for_exercised", "list_by_body_part_for_exercised"] during the conversation
3. After receiving a successful response from "list_by_body_part_for_exercised", the agent's following natural-language message must contain at least one exercise name, the corresponding equipment, and an animated demonstration URL taken from the tool output
4. Every invocation of the tool "list_by_body_part_for_exercised" must include the required parameter "bodypart"
5. After receiving data from list_by_body_part_for_exercised, the agent response must include the exercise id for every exercise it presents to the user
6. No assistant message should include more than one tool invocation
7. The agent should conclude the conversation with a human-readable answer that summarizes the requested leg exercises using data returned from the tools
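Putting a few of these together, the filled-in section of the prompt might look like the following sketch (it uses the example tool names above; substitute the tests for your own application):
<user-defined-tests>
1. When the user asks for exercises that target leg muscles, the agent must call at least one of the tools ["list_by_target_muscle_for_exercised", "list_by_body_part_for_exercised"] during the conversation
2. Every invocation of the tool "list_by_body_part_for_exercised" must include the required parameter "bodypart"
3. No assistant message should include more than one tool invocation
</user-defined-tests>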
4. Save the metric

Save the metric, then turn it on for your Log stream.

Best practices

Trajectory tests are similar to unit tests for the agent's trajectory: they check whether certain conditions are followed along the agent's path. Write all the tests as a numbered list. For example:
1. If X happens, ask the user Y and call tool Z.
2. Tool X is always called before tool Y.
3. When the user asks X, reply with Y.
4. Tool Y should be called once in the conversation.
Each test should check a single condition only, and tests should be logically consistent and well defined.
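Because the metric is binary, the unit-test analogy extends to how results combine: a trajectory passes only if every test passes. As a minimal illustration (not Galileo's actual implementation, where an LLM judge evaluates each natural language test against the trajectory), the pass/fail semantics can be sketched with per-test booleans standing in for the judge's verdicts:

```python
# Illustrative sketch of Agent Flow's pass/fail semantics.
# Each key is a user-defined test; each value stands in for the
# judge's verdict on whether the trajectory satisfied that test.

def agent_flow_passes(test_results: dict[str, bool]) -> bool:
    """Return True only if every user-defined test is satisfied."""
    return all(test_results.values())

results = {
    "calls_auth_tool": True,
    "asks_confirmation_before_purchase": True,
    "at_most_one_tool_call_per_message": False,
}
print(agent_flow_passes(results))  # a single failing test fails the whole trajectory
```

This is also why each test should encode exactly one condition: with one condition per test, a failing trajectory immediately tells you which rule was broken.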