Why Agent Evaluation Matters

As LLM applications evolve from simple chat interfaces to sophisticated agentic systems with tool calling, multi-step reasoning, and complex workflows, traditional evaluation approaches fail to capture what makes agents actually work in production.

The Composo Agent Framework

Start here with our battle-tested framework that evaluates agents across five critical dimensions. We developed this framework through extensive R&D and tested it with industry partners.

Proven Through Rigorous Research & Real-World Testing

This framework represents >12 months of intensive R&D with leading AI teams who needed agent evaluation that actually works in production. Here’s what makes it different.

The Research Journey

  • Thousands of production agent traces analyzed from both regulated enterprises and leading AI startups
  • 12 major framework iterations based on real-world failure modes we discovered
  • Validated across 8 industries including healthcare, finance, legal, and deep knowledge research
  • >85% accuracy in predicting agent success/failure before deployment
  • 3x faster debugging of agent issues compared to manual analysis
Why These Specific Metrics?

Our research revealed that agent failures cluster into five distinct patterns. Traditional “did it get the right answer?” evaluation misses >70% of these failure modes:
  • Exploration vs Exploitation imbalance: Agents that either never try new approaches (getting stuck) or never leverage what they’ve learned (inefficient loops)
  • Tool misuse patterns: Subtle errors in parameter formatting that work 90% of the time but fail catastrophically on edge cases
  • Goal drift: Agents that solve a problem but not the user’s problem
  • Hallucinated capabilities: Agents that hallucinate, as LLMs are always prone to do (e.g. claiming success when a tool actually returned an error, or dropping critical information from earlier in the conversation)
Each metric in our framework directly addresses these production failure modes. This isn’t academic theory; it’s battle-tested engineering derived from millions of real agent interactions.

Industry Validation

“Composo’s agent framework caught critical issues our own evaluation suite missed. It identified tool-calling patterns that would have caused production outages.” - ML Engineer, Fortune 500 Financial Services

“We reduced our agent failure rate by 35% after implementing Composo’s evaluation framework in our CI/CD pipeline.” - Head of AI, Healthcare Startup

This framework now evaluates over 10 million agent interactions monthly across our customer base, continuously proving its effectiveness at scale.

Core Agent Metrics

  • 🔍 Exploration: Reward agents that plan effectively, exploring new information and capabilities, and investigating unknowns despite uncertainty
  • ⚡ Exploitation: Reward agents that plan effectively, exploiting existing knowledge and available context to create reliable plans with predictable outcomes
  • 🔧 Tool Use: Reward agents that operate tools correctly in accordance with the tool definition, using all relevant context available in tool calls
  • 🎯 Goal Pursuit: Reward agents that work towards the goal specified by the user
  • ✅ Agent Faithfulness: Reward agents that only make claims that are directly supported by given source material or returns from tool calls, without any hallucination or speculation
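Each of these dimensions can also be scored on its own. Below is a minimal sketch, assuming the metric descriptions above are accepted verbatim as criteria strings and that a client, message trace, and tool definitions are set up as in the Implementation Guide below:

# Sketch: scoring each core dimension as a separate criterion string.
# Assumes composo_client, messages, and tools as defined in the
# Implementation Guide; the strings mirror the five metrics listed above.
core_agent_criteria = [
    "Reward agents that plan effectively: exploring new information and capabilities, and investigating unknowns despite uncertainty",
    "Reward agents that plan effectively: exploiting existing knowledge and available context to create reliable plans with predictable outcomes",
    "Reward agents that operate tools correctly in accordance with the tool definition, using all relevant context available in tool calls",
    "Reward agents that work towards the goal specified by the user",
    "Reward agents that only make claims that are directly supported by given source material or returns from tool calls without any hallucination or speculation",
]

results = composo_client.evaluate(
    messages=messages,
    tools=tools,
    criteria=core_agent_criteria,
)

for result in results:
    print(f"Score: {result.score}/1.00")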

Implementation Guide

Agent evaluation is currently only available with our default model, not the lightning model
Get started evaluating your agent in under 5 minutes using our pre-built agent framework:
from composo import Composo, criteria

composo_client = Composo(api_key="YOUR_API_KEY")

# Simple weather agent example
messages = [
    {"role": "user", "content": "What's the weather in Paris?"},
    {"role": "assistant", "content": None, "tool_calls": [
        {
            "id": "call_123",
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": "{\"location\": \"Paris, France\"}"
            }
        }
    ]},
    {"role": "tool", "tool_call_id": "call_123", "content": "Currently 15°C with clear skies"},
    {"role": "assistant", "content": "The weather in Paris is currently 15°C with clear skies."}
]

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City and country"}
                },
                "required": ["location"]
            }
        }
    }
]

# Evaluate with the agents framework
results = composo_client.evaluate(
    messages=messages,
    tools=tools,
    criteria=criteria.agent
)

for result in results:
    print(f"Score: {result.score}/1.00")
    print(f"Explanation: {result.explanation}\n")

Evaluating with Individual Metrics

You can also evaluate against specific metrics from the framework:
# Evaluate specific aspects of agent behavior
# (agent_trace and tool_definitions are your own message history and tool
# schemas, in the same format as the weather example above)
results = composo_client.evaluate(
    messages=agent_trace,
    tools=tool_definitions,
    criteria=[
        "Reward agents that work towards the goal specified by the user",
        "Reward agents that operate tools correctly in accordance with the tool definition",
        "Reward agents that only make claims directly supported by tool call returns"
    ]
)

Advanced Agent Metrics

Once you’ve mastered the core framework, explore these additional agent-level metrics for deeper insights (a usage sketch follows the list):
  • Agent Sequencing: Reward agents that follow logical sequences, such as gathering required information from the user before attempting specific lookups
  • Agent Efficiency: Reward agents that are efficient when working towards their goal
  • Agent Thoroughness: Reward agents that are fully comprehensive and thorough when working towards their goal
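A minimal sketch, assuming these metric descriptions can be passed directly as criteria strings (as with the individual core metrics) and reusing composo_client, messages, and tools from the Implementation Guide:

# Sketch: evaluating the advanced agent-level metrics as criteria strings.
results = composo_client.evaluate(
    messages=messages,
    tools=tools,
    criteria=[
        "Reward agents that follow logical sequences, such as gathering required information from the user before attempting specific lookups",
        "Reward agents that are efficient when working towards their goal",
        "Reward agents that are fully comprehensive and thorough when working towards their goal",
    ],
)

for result in results:
    print(f"Score: {result.score}/1.00")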

Evaluating Individual Tool Calls

For granular analysis, evaluate specific tool call steps within your agent trace (a scoping sketch follows the list):
  • Tool Call Formulation: Reward tool calls that formulate arguments using only information provided by the user or previous tool call returns without fabricating parameters
  • Tool Relevance: Reward tool calls that perform actions or retrieve information directly relevant to the goal
  • Response Completeness from Tool Returns: Reward responses that incorporate all relevant information from tool call returns needed to comprehensively answer the user's question
  • Response Precision from Tool Returns: Reward responses that include only the specific information from tool call returns that directly addresses the user's query
  • Response Faithfulness to Tool Returns: Reward responses that only make claims that are directly supported by given source material or returns from tool calls without any hallucination or speculation
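One way to scope the evaluation to a single step is to truncate the trace at the tool call of interest. The sketch below rests on that assumption (that the evaluator scores the final step of whatever trace you pass) and reuses messages and tools from the weather example in the Implementation Guide:

# Sketch: scoring the agent's single tool call from the weather example.
# Assumption: truncating the message list at the step of interest focuses
# the evaluation on that tool call.
tool_call_step = messages[:2]  # user request + the assistant's tool call

results = composo_client.evaluate(
    messages=tool_call_step,
    tools=tools,
    criteria=[
        "Reward tool calls that formulate arguments using only information provided by the user or previous tool call returns without fabricating parameters",
        "Reward tool calls that perform actions or retrieve information directly relevant to the goal",
    ],
)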

Writing Custom Agent Criteria

While our agent framework and additional metrics cover many use cases, you can write custom criteria for your specific domain. See our Criteria Writing guide for detailed instructions on crafting your own criteria. Common patterns for custom agent criteria (a usage sketch follows the examples):
# Healthcare agent
"Reward agents that appropriately defer to medical professionals for diagnosis"

# Financial agent
"Reward agents that verify account permissions before accessing sensitive data"

# Code generation agent
"Reward agents that validate syntax before executing code modifications"

# Research agent
"Reward agents that prioritize peer-reviewed sources over general web content"

Next Steps