If the response does not achieve an Action Completion score of 100%, at least one judge considered the model to have failed to accomplish every user goal.
Action Completion is calculated as follows:
1. Additional Requests
Multiple requests are sent to an LLM (e.g., OpenAI’s GPT-4o) using a carefully designed chain-of-thought prompt that adheres to the definition above.
2. Judgment Responses
The LLM generates multiple distinct responses, each containing:
An explanation.
A final judgment: “Yes” (goal accomplished) or “No” (goal not accomplished).
3. Score Computation
Action Completion Score = (Number of “Yes” Responses) / (Total Number of Responses)
4. Explanation Surfacing
One of the generated explanations is displayed alongside the score, always chosen to align with the majority judgment among the responses.
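The score computation and explanation selection described above can be sketched in a few lines of Python. This is a minimal illustration, not the product's implementation: `Judgment`, `action_completion_score`, and `pick_explanation` are hypothetical names, the judge responses are assumed to be already parsed, and the tie-breaking rule (a tie counts as "No") is an assumption, since the description above does not specify one.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One parsed judge response (hypothetical structure)."""
    explanation: str
    accomplished: bool  # True for "Yes", False for "No"


def action_completion_score(judgments: list[Judgment]) -> float:
    """Fraction of judge responses that answered "Yes"."""
    if not judgments:
        raise ValueError("at least one judgment is required")
    yes_count = sum(j.accomplished for j in judgments)
    return yes_count / len(judgments)


def pick_explanation(judgments: list[Judgment]) -> str:
    """Return the first explanation that agrees with the majority judgment.

    Assumption: a tie is treated as "No"; the source does not say.
    """
    majority = action_completion_score(judgments) > 0.5
    for j in judgments:
        if j.accomplished == majority:
            return j.explanation
    raise AssertionError("unreachable: the majority side is never empty")


# Example: three of four judges say the goals were accomplished.
judgments = [
    Judgment("All requested bookings were made.", True),
    Judgment("The hotel was booked but the flight was not.", False),
    Judgment("Both goals were completed.", True),
    Judgment("Every goal was met.", True),
]
print(action_completion_score(judgments))  # 0.75
print(pick_explanation(judgments))  # All requested bookings were made.
```

A score of 0.75 here means one of the four judges concluded that at least one user goal was not accomplished, which is why any score below 100% is worth inspecting.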
This metric requires multiple LLM calls to compute, which may impact usage and billing.
The Action Completion metric is the single best measure of whether an agent is truly useful. It is particularly valuable in the following scenarios:
Agentic Workflows: When an AI agent must decide on a course of action and select tools to accomplish tasks.
Multi-step Tasks: When completing a user’s request requires multiple steps or decisions.
Tool-using Assistants: When evaluating if the assistant successfully used the right tools.
Action Completion helps determine whether the agent successfully accomplished all of the user’s goals. By tracking Action Completion over time, teams can identify patterns where agents fall short, analyze why certain scenarios lead to failures, and focus on targeted fixes.
To optimize your agent’s performance and ensure high Action Completion scores, consider the following best practices:
Track Progress Over Time
Monitor Action Completion scores across different versions of your agent to identify trends and ensure continuous improvements in task completion capabilities.
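As a rough sketch of version-over-version tracking, the snippet below averages scores per agent version. It assumes the scores have already been exported as `(version, score)` pairs; the function name and the data layout are hypothetical and not part of any specific API.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_version(records):
    """Average Action Completion score per agent version.

    `records` is an iterable of (version, score) pairs, score in [0, 1].
    """
    by_version = defaultdict(list)
    for version, score in records:
        by_version[version].append(score)
    return {version: mean(scores) for version, scores in by_version.items()}


# Illustrative data only.
records = [("v1", 0.5), ("v1", 1.0), ("v2", 1.0), ("v2", 1.0)]
print(mean_score_by_version(records))  # {'v1': 0.75, 'v2': 1.0}
```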
Analyze Failure Patterns
When Action Completion scores are low, examine specific steps or scenarios where agents fail to meet user goals. Use this analysis to identify and address systematic issues.
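One simple way to start this analysis is to count imperfect scores per scenario. The sketch below assumes each record carries a scenario label of your own choosing; `failure_counts` and the labels are hypothetical. The default threshold of 1.0 reflects the point that any score below 100% means at least one judge saw an unmet goal.

```python
from collections import Counter


def failure_counts(records, threshold=1.0):
    """Count records scoring below `threshold`, grouped by scenario label."""
    return Counter(scenario for scenario, score in records if score < threshold)


# Illustrative data only.
records = [("refund", 0.4), ("refund", 0.6), ("booking", 1.0), ("booking", 0.8)]
print(failure_counts(records).most_common())  # [('refund', 2), ('booking', 1)]
```

Scenarios at the top of the list are reasonable first candidates for a closer look at individual traces.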
Combine with Other Metrics
Use Action Completion alongside other agentic metrics, such as Action Advancement, to get a comprehensive view of your assistant’s effectiveness and identify areas for improvement.
Test Edge Cases
Create evaluation datasets that include complex, multi-step tasks to thoroughly assess your agent’s ability to handle challenging scenarios and advance user goals effectively.
When optimizing for Action Completion, ensure you’re not sacrificing other important aspects like safety, factual accuracy, or user experience in pursuit of task completion.