Why Agent Evaluation Matters

As LLM applications evolve from simple chat interfaces to sophisticated agentic systems with tool calling, multi-step reasoning, and complex workflows, traditional evaluation approaches fail to capture what makes agents actually work in production.

The Composo Agent Framework

Start here with our battle-tested framework that evaluates agents across five critical dimensions. We developed this framework through extensive R&D and tested it with industry partners.

Proven Through Rigorous Research & Real-World Testing

This framework represents >12 months of intensive R&D with leading AI teams who needed agent evaluation that actually works in production. Here’s what makes it different.

The Research Journey

  • Thousands of production agent traces analyzed from both regulated enterprises and leading AI startups
  • 12 major framework iterations based on real-world failure modes we discovered
  • Validated across 8 industries including healthcare, finance, legal, and deep knowledge research
  • >85% accuracy in predicting agent success/failure before deployment
  • 3x faster debugging of agent issues compared to manual analysis
Why These Specific Metrics?

Our research revealed that agent failures cluster into five distinct patterns. Traditional “did it get the right answer?” evaluation misses >70% of these failure modes:
  • Exploration vs Exploitation imbalance: Agents that either never try new approaches (getting stuck) or never leverage what they’ve learned (inefficient loops)
  • Tool misuse patterns: Subtle errors in parameter formatting that work 90% of the time but fail catastrophically on edge cases
  • Goal drift: Agents that solve a problem but not the user’s problem
  • Hallucinated capabilities: Agents that hallucinate, as LLMs are always prone to do (e.g. claiming success when a tool actually returned an error, or dropping critical information from earlier in the conversation)
Each metric in our framework directly addresses these production failure modes. This isn’t academic theory; it’s battle-tested engineering derived from millions of real agent interactions.

Industry Validation

“Composo’s agent framework caught critical issues our own evaluation suite missed. It identified tool-calling patterns that would have caused production outages.” - ML Engineer, Fortune 500 Financial Services

“We reduced our agent failure rate by 35% after implementing Composo’s evaluation framework in our CI/CD pipeline.” - Head of AI, Healthcare Startup

This framework now evaluates over 10 million agent interactions monthly across our customer base, continuously proving its effectiveness at scale.

Core Agent Metrics

  • 🔍 Exploration: Reward agents that plan effectively, exploring new information and capabilities, and investigating unknowns despite uncertainty
  • ⚡ Exploitation: Reward agents that plan effectively, exploiting existing knowledge and available context to create reliable plans with predictable outcomes
  • 🔧 Tool Use: Reward agents that operate tools correctly in accordance with the tool definition, using all relevant context available in tool calls
  • 🎯 Goal Pursuit: Reward agents that work towards the goal specified by the user
  • ✅ Agent Faithfulness: Reward agents that only make claims that are directly supported by given source material or returns from tool calls, without any hallucination or speculation
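Each of these dimensions can also be scored on its own. Below is a minimal sketch, assuming the metric descriptions above are accepted verbatim as criteria strings and that a client, message trace, and tool definitions are set up as in the Implementation Guide below:

# Sketch: scoring each core dimension as a separate criterion string.
# Assumes composo_client, messages, and tools as defined in the
# Implementation Guide; the strings mirror the five metrics listed above.
core_agent_criteria = [
    "Reward agents that plan effectively: exploring new information and capabilities, and investigating unknowns despite uncertainty",
    "Reward agents that plan effectively: exploiting existing knowledge and available context to create reliable plans with predictable outcomes",
    "Reward agents that operate tools correctly in accordance with the tool definition, using all relevant context available in tool calls",
    "Reward agents that work towards the goal specified by the user",
    "Reward agents that only make claims that are directly supported by given source material or returns from tool calls without any hallucination or speculation",
]

results = composo_client.evaluate(
    messages=messages,
    tools=tools,
    criteria=core_agent_criteria,
)

for result in results:
    print(f"Score: {result.score}/1.00")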

Implementation Guide

Agent evaluation is currently only available with our default model, not the lightning model
Get started evaluating your agent in under 5 minutes using our pre-built agent framework:
from composo import Composo, criteria

composo_client = Composo(api_key="YOUR_API_KEY")

# Simple weather agent example
messages = [
    {"role": "user", "content": "What's the weather in Paris?"},
    {"role": "assistant", "content": None, "tool_calls": [
        {
            "id": "call_123",
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": "{\"location\": \"Paris, France\"}"
            }
        }
    ]},
    {"role": "tool", "tool_call_id": "call_123", "content": "Currently 15°C with clear skies"},
    {"role": "assistant", "content": "The weather in Paris is currently 15°C with clear skies."}
]

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City and country"}
                },
                "required": ["location"]
            }
        }
    }
]

# Evaluate with the agents framework
results = composo_client.evaluate(
    messages=messages,
    tools=tools,
    criteria=criteria.agent
)

for result in results:
    print(f"Score: {result.score}/1.00")
    print(f"Explanation: {result.explanation}\n")

Evaluating with Individual Metrics

You can also evaluate against specific metrics from the framework:
# Evaluate specific aspects of agent behavior
# (agent_trace and tool_definitions are your own message history and tool
# schemas, in the same format as the weather example above)
results = composo_client.evaluate(
    messages=agent_trace,
    tools=tool_definitions,
    criteria=[
        "Reward agents that work towards the goal specified by the user",
        "Reward agents that operate tools correctly in accordance with the tool definition",
        "Reward agents that only make claims directly supported by tool call returns"
    ]
)

Advanced Agent Metrics

Once you’ve mastered the core framework, explore these additional agent-level metrics for deeper insights (a usage sketch follows the list):
  • Agent Sequencing: Reward agents that follow logical sequences, such as gathering required information from the user before attempting specific lookups
  • Agent Efficiency: Reward agents that are efficient when working towards their goal
  • Agent Thoroughness: Reward agents that are fully comprehensive and thorough when working towards their goal
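A minimal sketch, assuming these metric descriptions can be passed directly as criteria strings (as with the individual core metrics) and reusing composo_client, messages, and tools from the Implementation Guide:

# Sketch: evaluating the advanced agent-level metrics as criteria strings.
results = composo_client.evaluate(
    messages=messages,
    tools=tools,
    criteria=[
        "Reward agents that follow logical sequences, such as gathering required information from the user before attempting specific lookups",
        "Reward agents that are efficient when working towards their goal",
        "Reward agents that are fully comprehensive and thorough when working towards their goal",
    ],
)

for result in results:
    print(f"Score: {result.score}/1.00")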

Evaluating Individual Tool Calls

For granular analysis, evaluate specific tool call steps within your agent trace (a scoping sketch follows the list):
  • Tool Call Formulation: Reward tool calls that formulate arguments using only information provided by the user or previous tool call returns without fabricating parameters
  • Tool Relevance: Reward tool calls that perform actions or retrieve information directly relevant to the goal
  • Response Completeness from Tool Returns: Reward responses that incorporate all relevant information from tool call returns needed to comprehensively answer the user's question
  • Response Precision from Tool Returns: Reward responses that include only the specific information from tool call returns that directly addresses the user's query
  • Response Faithfulness to Tool Returns: Reward responses that only make claims that are directly supported by given source material or returns from tool calls without any hallucination or speculation
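One way to scope the evaluation to a single step is to truncate the trace at the tool call of interest. The sketch below rests on that assumption (that the evaluator scores the final step of whatever trace you pass) and reuses messages and tools from the weather example in the Implementation Guide:

# Sketch: scoring the agent's single tool call from the weather example.
# Assumption: truncating the message list at the step of interest focuses
# the evaluation on that tool call.
tool_call_step = messages[:2]  # user request + the assistant's tool call

results = composo_client.evaluate(
    messages=tool_call_step,
    tools=tools,
    criteria=[
        "Reward tool calls that formulate arguments using only information provided by the user or previous tool call returns without fabricating parameters",
        "Reward tool calls that perform actions or retrieve information directly relevant to the goal",
    ],
)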

Writing Custom Agent Criteria

While our agent framework and additional metrics cover many use cases, you can write custom criteria for your specific domain. See our Criteria Writing guide for detailed instructions on crafting your own criteria. Common patterns for custom agent criteria (a usage sketch follows the examples):
# Healthcare agent
"Reward agents that appropriately defer to medical professionals for diagnosis"

# Financial agent
"Reward agents that verify account permissions before accessing sensitive data"

# Code generation agent
"Reward agents that validate syntax before executing code modifications"

# Research agent
"Reward agents that prioritize peer-reviewed sources over general web content"

Next Steps