Overview

Tool calling evaluation assesses how well your LLM:
  • Selects appropriate tools for given tasks
  • Provides correct parameters to function calls
  • Handles tool responses appropriately
  • Maintains context throughout tool interactions

Evaluation Types

Composo supports two main types of tool calling evaluation:

Immediate Function Evaluation

Evaluates tool calls before receiving the tool response. This assesses:
  • Sufficiency: Whether the tool call provides enough information
  • Relevance: Whether the selected tool is appropriate for the task
  • Parameter Quality: Whether arguments are complete and valid
  • Planning Capability: Whether the tool call demonstrates logical reasoning

Hindsight Function Evaluation

Evaluates tool calls after receiving the tool response. This assesses:
  • Regret Minimization: Whether the tool call was the optimal choice
  • Tool Understanding: Whether the LLM correctly anticipated the tool’s behavior
  • Information Retrieval: Whether the tool call successfully retrieved relevant information
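
As a rough sketch of the difference, assuming the evaluation type is determined simply by how much of the conversation you pass (the payload below reuses the Message Format example further down this page):

# Shared assistant tool-call turn (same shape as the Message Format example below).
tool_call_turn = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_123",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"location\": \"Paris, France\"}"},
    }],
}

# Immediate evaluation: the conversation ends at the tool call, before any tool response.
immediate_messages = [
    {"role": "user", "content": "What's the weather in Paris?"},
    tool_call_turn,
]

# Hindsight evaluation: the tool response (and final answer) are included, so the
# evaluator can judge whether the call actually retrieved what was needed.
hindsight_messages = immediate_messages + [
    {"role": "tool", "tool_call_id": "call_123", "content": "Currently 15°C with clear skies"},
    {"role": "assistant", "content": "The weather in Paris is currently 15°C with clear skies."},
]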

Criteria Format Requirements

Tool calling criteria must start with one of these prefixes:

Continuous Evaluation (0-1 scoring)

  • "Reward tool calls"
  • "Penalize tool calls"

Binary Evaluation (Pass/Fail)

  • "Tool call passes if"
  • "Tool call fails if"

Key Use Cases

Location Inference and Parameter Completeness

Evaluate whether the LLM can infer missing location information and provide complete parameters. Example Criteria:
"Reward tool calls that correctly provide the country, inferring it if necessary"
"Penalize tool calls that fail to specify location when required"

Function Name Appropriateness

Assess whether the LLM selects function names that clearly match the user’s intent. Example Criteria:
"Reward tool calls that use appropriate function names for the user's request"
"Penalize tool calls that select generic functions when specific ones are available"

JSON Argument Validation

Evaluate the technical correctness of function arguments. Example Criteria:
"Penalize tool calls with malformed JSON arguments"
"Reward tool calls that provide properly formatted and valid JSON parameters"

Multi-step Planning and Sequencing

Assess the LLM’s ability to plan and execute multi-step tool calling workflows. Example Criteria:
"Reward tool calls that demonstrate logical sequencing in multi-step workflows"
"Reward tool calls that show awareness of dependencies between different tools"

Tool Response Integration Quality

Evaluate how well the LLM uses tool responses to provide accurate final answers. Example Criteria:
"Reward tool calls that lead to accurate final responses based on tool results"
"Penalize tool calls that ignore or misinterpret tool response data"

Error Handling and Fallback Logic

Assess the LLM’s ability to handle tool failures and provide alternative solutions. Example Criteria:
"Reward tool calls that include fallback options when primary tools might fail"
"Penalize tool calls that don't consider potential tool failures"

Context-Aware Tool Selection

Evaluate whether the LLM considers conversation context when selecting tools. Example Criteria:
"Reward tool calls that consider previous conversation context in tool selection"
"Penalize tool calls that ignore relevant context from earlier messages"

Parameter Completeness vs. Over-specification

Assess the balance between providing necessary parameters and avoiding unnecessary ones. Example Criteria:
"Reward tool calls that provide all required parameters without unnecessary ones"
"Penalize tool calls that include irrelevant parameters that could cause errors"

Message Format

Tool calling evaluations use the same message format as response evaluations, but include tool definitions:
{
    "messages": [
        {"role": "user", "content": "What's the weather in Paris?"},
        {"role": "assistant", "content": null, "tool_calls": [
            {
                "id": "call_123",
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "arguments": "{\"location\": \"Paris, France\"}"
                }
            }
        ]},
        {"role": "tool", "tool_call_id": "call_123", "content": "Currently 15°C with clear skies"},
        {"role": "assistant", "content": "The weather in Paris is currently 15°C with clear skies."}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get current weather for a location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string"}
                    },
                    "required": ["location"]
                }
            }
        }
    ],
    "evaluation_criteria": "Reward tool calls that select appropriate tools and provide accurate parameters"
}

Evaluation Examples

Example 1: Immediate Evaluation (Before Tool Response)

User: "What's the weather like today?"
Assistant: calls get_weather with location parameter
Criteria: "Reward tool calls that correctly identify when weather information is needed"

Example 2: Parameter Quality Assessment

User: "Calculate 15 * 23"
Assistant: calls calculator function with correct expression
Criteria: "Reward tool calls that provide complete and valid parameters for mathematical operations"

Example 3: Hindsight Evaluation (After Tool Response)

User: "Find a restaurant near me and get directions"
Assistant: calls location service, then navigation tool
Tool Response: returns restaurant location and directions
Criteria: "Reward tool calls that successfully retrieve relevant information and lead to helpful responses"

Example 4: Planning Capability

User: "I need to book a flight and hotel for my trip"
Assistant: calls flight search, then hotel search based on flight location
Criteria: "Reward tool calls that demonstrate logical planning and sequential reasoning"

Best Practices

  1. Use Correct Criteria Prefixes: Always start tool calling criteria with "Reward tool calls", "Penalize tool calls", "Tool call passes if", or "Tool call fails if"
  2. Include Tool Definitions: Always provide the full set of available tools with complete function schemas
  3. Complete Conversations: Include the full conversation, including tool responses, when running hindsight evaluation
  4. Choose the Correct Evaluation Type: Use immediate evaluation to assess planning and hindsight evaluation to assess outcomes
  5. Specific Criteria: Write criteria that target a specific aspect, such as tool selection, parameter quality, or planning capability
  6. Test Edge Cases: Evaluate how your LLM handles invalid parameters or tool failures

Integration

Use our Python SDK to easily integrate tool calling evaluation into your workflow:
from composo import Composo

composo = Composo()

result = composo.evaluate(
    messages=messages,
    tools=tools,
    criteria="Reward tool calls that select appropriate tools for user requests"
)
print(f"Tool calling score: {result.score}")

Predefined Criteria Sets

Use our predefined criteria sets for common tool calling evaluation scenarios:
from composo import Composo, criteria

client = Composo()

# Tool call evaluation criteria
result = client.evaluate(
    messages=messages,
    tools=tools,
    criteria=criteria.tool_call
)

# Tool response evaluation criteria
result = client.evaluate(
    messages=messages,
    tools=tools,
    criteria=criteria.tool_response
)

Next Steps