Agent Flow is a custom, session-level LLM-as-a-judge metric based on a pre-created prompt from Galileo that you customize with your own tests.

Agent Flow is a binary evaluation metric that measures the correctness and coherence of an agentic trajectory by validating it against user-specified natural language tests.
A trajectory passes the Agent Flow metric if and only if the agent's actual behavior and output satisfy all of the user-defined natural-language conditions.
When you create this metric, you will need to provide the flow conditions in the prompt.
This is a boolean metric, returning either 0% (false) or 100% (true): 0% means the agent failed one or more of the user-defined tests, and 100% means it passed them all. If you use multiple judges, the score is the percentage of judges that scored true rather than false. For example, if 4 out of 5 judges scored the metric as true, the score would be 80%.
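As a rough illustration of this scoring logic (plain Python, not Galileo code; all verdict values below are hypothetical):

```python
# Illustrative sketch of the Agent Flow scoring logic -- not Galileo code.

# A single judge returns true only if ALL user-defined tests pass.
test_results = [True, True, True]      # hypothetical per-test outcomes
verdict = all(test_results)            # -> True

# With multiple judges, the score is the fraction that returned true.
judge_verdicts = [True, True, True, True, False]   # hypothetical: 4 of 5
score = 100 * sum(judge_verdicts) / len(judge_verdicts)
print(f"Agent Flow score: {score:.0f}%")           # Agent Flow score: 80%
```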

Create the Agent Flow metric

This metric needs to be manually created, using a prompt defined by Galileo.
1. Create a new LLM-as-a-judge metric

Create a new LLM-as-a-judge metric by following the instructions in our LLM-as-a-judge concept guide. Use the following settings:
| Setting | Value |
| --- | --- |
| Name | Agent flow |
| LLM Model | Select your preferred model |
| Apply to | Session |
| Advanced Settings | Configure these as required for your needs |
2. Set the prompt

Set the prompt to the following:
### Overview:
You are an evaluator for agentic system interactions. Your role is to assess whether user-defined tests pass or fail based on interaction logs. You will receive:
1. Condensed Logs: Complete logs of interaction history between user and LLM (including tool usage)
2. User Tests: A list of specific validation criteria defined by the user

Objective: Determine if ALL user-defined tests pass during the interaction.

### Instructions:
A test PASSES if:
    1. the test is applicable to the given data (the if/when condition from the test is triggered during the conversation), AND
    2. the check/requirements of the test hold true for the given data
OR
    1. the test is not applicable to the given data (no parts of the logs are relevant to the test)

When generating your explanation, walk through each test: explain its applicability, its outcome, and your overall conclusion. Be thorough and precise in your reasoning.

### User-defined tests
{{ Add your tests here }}

### Rubric:
True: ALL user-defined tests PASS for the given logs
False: ONE or more user-defined tests did NOT PASS for the given logs
3. Customize the prompt by adding your user-defined tests

This prompt needs to be customized for your application and the inputs and outputs you expect. Replace {{ Add your tests here }} with a numbered list of natural-language tests for evaluating the agent's trajectory. These can include:
  • Expected tool or agent calls, using the tool or agent names
  • Conditions on tool or agent calling (e.g. if tool x is called, don’t call agent y)
  • Expectations around the input or output parameters to tools and agents
  • Limitations on the number of tool or agent calls
For example, imagine you are creating an agent that provides advice on exercises for different body parts, such as for a physical therapy application. The agent has multiple tools, including list_by_target_muscle_for_exercised, list_by_body_part_for_exercised, and list_of_bodyparts_for_exercised. Some user tests might be:
1. If a call to "list_by_target_muscle_for_exercised" returns an error that contains the text "target not found", the agent should subsequently attempt an alternative lookup by calling either "list_by_body_part_for_exercised" or "list_of_bodyparts_for_exercised"
2. When the user asks for exercises that target leg muscles, the agent must call at least one of the tools ["list_by_target_muscle_for_exercised", "list_by_body_part_for_exercised"] during the conversation
3. After receiving a successful response from "list_by_body_part_for_exercised", the agent's following natural-language message must contain at least one exercise name, the corresponding equipment, and an animated demonstration URL taken from the tool output
4. Every invocation of the tool "list_by_body_part_for_exercised" must include the required parameter "bodypart"
5. After receiving data from list_by_body_part_for_exercised, the agent response must include the exercise id for every exercise it presents to the user
6. No assistant message should include more than one tool invocation
7. The agent should conclude the conversation with a human-readable answer that summarizes the requested leg exercises using data returned from the tools
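To see how such tests map onto a real session, here is a hypothetical condensed log excerpt, plus a mechanical rendering of test 4 as Python. The log structure and field names are assumptions for illustration only; the judge works over Galileo's actual condensed logs and applies the tests in natural language.

```python
# Hypothetical condensed session log -- the field names are illustrative
# assumptions, not Galileo's actual log schema.
session_log = [
    {"role": "user", "content": "What exercises target my quads?"},
    {"role": "assistant", "tool_call": {
        "name": "list_by_body_part_for_exercised",
        "arguments": {"bodypart": "upper legs"},
    }},
    {"role": "tool", "name": "list_by_body_part_for_exercised",
     "content": [{"id": "0001", "name": "barbell squat",
                  "equipment": "barbell",
                  "gifUrl": "https://example.com/squat.gif"}]},
    {"role": "assistant",
     "content": "Try the barbell squat (id 0001, barbell): "
                "https://example.com/squat.gif"},
]

def test_4_passes(log: list[dict]) -> bool:
    """Test 4: every list_by_body_part_for_exercised call includes 'bodypart'."""
    calls = [m["tool_call"] for m in log if "tool_call" in m]
    relevant = [c for c in calls if c["name"] == "list_by_body_part_for_exercised"]
    # Vacuously passes if the tool is never called (test not applicable).
    return all("bodypart" in c["arguments"] for c in relevant)

print(test_4_passes(session_log))  # -> True
```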
4. Save the metric

Save the metric, then turn it on for your Log stream.
Your metric is now ready to use in your project.
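If you log your agent with the Galileo Python SDK, sessions in that Log stream are scored automatically once the metric is on. The sketch below is only an outline of what that logging might look like: the project and Log stream names are placeholders, and the SDK surface shown here (galileo_context, the log decorator, span_type) may differ in your installed version, so check the SDK documentation.

```python
# Rough sketch of logging a tool call so the Agent Flow metric has a
# session to score. Treat the names and parameters as assumptions and
# verify against the current Galileo Python SDK docs.
from galileo import galileo_context, log

# Route logs to the project/Log stream where Agent Flow is enabled
# ("exercise-agent" and "dev" are placeholder names).
galileo_context.init(project="exercise-agent", log_stream="dev")

@log(span_type="tool")
def list_by_body_part_for_exercised(bodypart: str) -> list:
    # Placeholder tool body; a real tool would call the exercise API.
    return [{"id": "0001", "name": "barbell squat"}]

list_by_body_part_for_exercised("upper legs")
galileo_context.flush()  # ensure spans are sent before the process exits
```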