Tool Selection Quality measures whether the agent selected the correct tools and, for each tool, the correct arguments.

This metric is particularly valuable for evaluating agentic AI systems where the model must decide which tools to use and how to use them correctly. Poor tool selection can lead to ineffective or incorrect responses.

Here’s a scale that shows the relationship between Tool Selection Quality and the potential impact on your AI system:

0 — Low Quality: The agent selected incorrect tools or used correct tools with incorrect parameters.

1 — High Quality: The agent selected the correct tools with the correct parameters.

Calculation Method

Tool Selection Quality is computed through a multi-step process:

1. Model Request: Multiple evaluation requests are sent to an LLM evaluator (e.g., OpenAI's GPT-4o mini) to analyze the agent's tool selection decisions.

2. Prompt Engineering: A carefully engineered chain-of-thought prompt guides the model to evaluate whether the selected tools and their parameters were appropriate for the task.
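The shape of such a prompt can be sketched as follows. This template is illustrative only; the wording, placeholder names, and the trailing "Judgment:" convention are assumptions, not the actual prompt used by the evaluator.

```python
# Hypothetical chain-of-thought evaluation prompt template (illustrative only).
TOOL_SELECTION_PROMPT = """\
You are evaluating an AI agent's tool selection.

Conversation history:
{history}

Available tools:
{tool_definitions}

Tool calls made by the agent:
{tool_calls}

Think step by step:
1. Does any unanswered user query require a tool call?
2. If so, did the agent pick the right tool?
3. Were all required arguments provided with correct values?

End your answer with a single line: "Judgment: yes" or "Judgment: no".
"""

def build_prompt(history: str, tool_definitions: str, tool_calls: str) -> str:
    """Fill the template with the turn under evaluation."""
    return TOOL_SELECTION_PROMPT.format(
        history=history, tool_definitions=tool_definitions, tool_calls=tool_calls
    )
```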

3. Multiple Evaluations: The system requests multiple distinct responses to this prompt to ensure robust evaluation through consensus.

4. Result Analysis: Each evaluation generates both an explanation of the reasoning and a binary judgment (yes/no) on tool selection appropriateness.
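Assuming each evaluator response ends with an explicit "Judgment: yes/no" line (a format convention invented here for illustration), extracting the explanation and the binary judgment could look like:

```python
def parse_evaluation(response: str) -> tuple[str, bool]:
    """Split an evaluator response into (explanation, judgment).

    Expects the final line to be "Judgment: yes" or "Judgment: no";
    this response format is an assumption for illustration.
    """
    lines = response.strip().splitlines()
    last = lines[-1].strip().lower().removesuffix(".")
    if not last.startswith("judgment:"):
        raise ValueError("no judgment line found in evaluator response")
    judgment = last.split(":", 1)[1].strip() == "yes"
    explanation = "\n".join(lines[:-1]).strip()
    return explanation, judgment
```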

5. Score Calculation: The final Tool Selection Quality score is computed as the ratio of positive ('yes') responses to the total number of evaluation responses.

We also surface one of the generated explanations, always choosing one that aligns with the majority judgment among the responses.
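The aggregation step can be sketched as follows, assuming the per-response judgments have already been parsed into booleans alongside their explanations:

```python
def tool_selection_score(judgments: list[bool],
                         explanations: list[str]) -> tuple[float, str]:
    """Aggregate binary judgments into a score and pick a representative explanation.

    The score is the fraction of 'yes' judgments; the surfaced explanation
    is one whose judgment agrees with the majority vote.
    """
    if not judgments:
        raise ValueError("need at least one evaluation response")
    score = sum(judgments) / len(judgments)
    majority = score >= 0.5  # ties break toward 'yes' in this sketch
    explanation = next(e for e, j in zip(explanations, judgments) if j == majority)
    return score, explanation
```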

This metric is computed by prompting an LLM multiple times, and thus requires additional LLM calls to compute, which may impact usage and billing.

Understanding Tool Selection Quality

When Tool Selection is Evaluated

Tool Selection Quality evaluates different scenarios:

No Tool Needed: The assistant is not expected to call tools if there are no unanswered user queries, if no tools can help answer any query, or if all the information to answer is contained in the history.

Tool Needed: When tools should be used, the turn is considered successful if the agent selected the correct tool and provided all required arguments with correct values.

Unsuccessful Selection: If the agent calls tools when it shouldn’t, or selects the wrong tool/arguments when it should call tools, the turn is considered unsuccessful.
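The rubric above can be summarized as a small decision function. This is a simplification for illustration: in practice the judgment is made by the LLM evaluator, not by rule-based code.

```python
def turn_is_successful(tool_needed: bool,
                       tools_called: bool,
                       correct_tool: bool = False,
                       correct_args: bool = False) -> bool:
    """Mirror the Tool Selection Quality rubric for a single turn."""
    if not tool_needed:
        # No tool needed: success means the agent did not call any tools.
        return not tools_called
    # Tool needed: success requires calling the right tool with correct arguments.
    return tools_called and correct_tool and correct_args
```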

Optimizing Your AI System

Addressing Low Tool Selection Quality

When a response has a low Tool Selection Quality score, consider these improvements:

Analyze error patterns: Identify common mistakes in tool selection or parameter usage.

Improve tool descriptions: Enhance tool documentation with clearer descriptions of when and how to use each tool.

Refine system prompts: Update instructions to provide better guidance on tool selection criteria.

Consider model capabilities: Some models may be better at tool selection than others.

Best Practices

Clear Tool Documentation

Provide detailed descriptions for each tool, including when to use it and what parameters are required.
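For example, a tool definition in the OpenAI function-calling format can document both when to use the tool and what each parameter means. The `get_weather` tool below is a hypothetical example:

```python
# Hypothetical tool definition in the OpenAI function-calling schema.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": (
            "Get the current weather for a city. Use this only when the user "
            "asks about current conditions; do not use it for forecasts."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "city": {
                    "type": "string",
                    "description": "City name, e.g. 'Paris'",
                },
                "unit": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "Temperature unit; defaults to celsius",
                },
            },
            "required": ["city"],
        },
    },
}
```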

Parameter Validation

Implement validation for tool parameters to prevent incorrect usage and provide helpful error messages.
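A minimal validator sketch, assuming tool parameters are described with a JSON-Schema-like dict such as the `parameters` object in a function-calling tool definition:

```python
def validate_args(schema: dict, args: dict) -> list[str]:
    """Check tool-call arguments against a JSON-Schema-like parameter spec.

    Returns a list of human-readable errors; an empty list means the call
    is valid. Only 'required' and 'enum' are checked in this sketch.
    """
    errors = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required parameter '{name}'")
    for name, value in args.items():
        if name not in props:
            errors.append(f"unknown parameter '{name}'")
            continue
        allowed = props[name].get("enum")
        if allowed is not None and value not in allowed:
            errors.append(f"'{name}' must be one of {allowed}, got {value!r}")
    return errors
```

Returning errors as readable strings lets you feed them back to the agent as a tool result, giving it a chance to retry with corrected arguments.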

Monitor Tool Usage Patterns

Track which tools are frequently misused to identify opportunities for improvement in tool design or documentation.

Fine-tune with Examples

Provide examples of correct tool usage in different scenarios to help the agent learn appropriate selection patterns.

Tool Selection Quality is most useful in Agentic Workflows, where an LLM decides the course of action to take by selecting a Tool. This metric helps you detect whether the right course of action was taken by the Agent.