The Challenge
When testing multi-turn agent conversations, teams often model tests as scripted dialogues (sketched in code after the example turns below):
Turn 1: User asks "Run the software compliance monitor"
Turn 2: Agent runs the tool and returns results
Turn 3: User asks "Which applications are unlicensed?"
Turn 4: Agent lists unlicensed applications
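A minimal sketch of what that rigid script looks like in code (the expected replies and assertion style are hypothetical, not part of the Composo SDK):
# Hypothetical rigid script: each agent reply must match a pre-written answer
scripted_turns = [
    ("Run the software compliance monitor",
     "Running compliance monitor... Found 3 unlicensed applications."),
    ("Which applications are unlicensed?",
     "The unlicensed applications are: Adobe Photoshop, Slack, Zoom."),
]

def test_scripted_dialogue(agent_function):
    history = []
    for user_message, expected_reply in scripted_turns:
        history.append({"role": "user", "content": user_message})
        reply = agent_function(history)
        history.append({"role": "assistant", "content": reply})
        # Fails as soon as the agent takes any valid-but-different path
        assert expected_reply in reply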
The problem: agent responses are non-deterministic. The agent might take a valid but different path:
Turn 1: User asks "Run the software compliance monitor"
Turn 2: Agent responds "Can you confirm you want me to run the software compliance monitor?"
Turn 3: [Pre-scripted user message no longer makes sense]
The agent’s response is correct—it’s asking for confirmation—but the rigid test script breaks because it expected a different flow.
This guide covers two approaches to solve this problem.
Approach 1: User Simulation Agent
Instead of testing exact conversation paths, test whether the agent achieves the intended outcome.
How It Works
- Define the test by its goal, not its transcript
- Use an LLM to simulate the user dynamically, adapting to whatever the agent responds
- Evaluate the outcome against your success criteria
Implementation
from composo import Composo
from openai import OpenAI
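# Composo scores the finished conversation; OpenAI powers the simulated user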
composo = Composo()
openai_client = OpenAI()
def run_dynamic_test(
agent_function,
test_goal: str,
initial_user_message: str,
reference_transcript: list[dict] | None = None,
max_turns: int = 10
):
"""
Run a multi-turn test with dynamic user simulation against your live agent.
Args:
agent_function: Your agent's response function (takes message history, returns response string)
test_goal: What the test should achieve (e.g., "Complete software compliance check")
initial_user_message: The first message to send to the agent
reference_transcript: Optional example conversation showing the intended flow
max_turns: Maximum conversation turns before stopping
"""
# Build the user simulator prompt
reference_context = ""
if reference_transcript:
reference_context = f"""
REFERENCE CONVERSATION (for context on what the user is trying to accomplish):
{format_transcript(reference_transcript)}
"""
simulator_system = f"""You are simulating a user in a test scenario.
GOAL: {test_goal}
{reference_context}
Your job:
- Play the user role to help the agent achieve the goal
- Adapt naturally if the agent asks clarifying questions or takes a different path
- Stay focused on the goal—don't introduce unrelated topics
- If the goal is achieved, respond with exactly: [TEST_COMPLETE]
Respond only with what the user would say next."""
# Run the conversation dynamically with the actual agent
conversation = []
conversation.append({"role": "user", "content": initial_user_message})
for turn in range(max_turns):
# Call the ACTUAL agent function being tested
agent_response = agent_function(conversation)
conversation.append({"role": "assistant", "content": agent_response})
# Simulate next user turn
simulator_response = openai_client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": simulator_system},
*conversation
]
)
next_user_message = simulator_response.choices[0].message.content
# Check if test is complete
if "[TEST_COMPLETE]" in next_user_message:
break
conversation.append({"role": "user", "content": next_user_message})
# Evaluate the outcome
result = composo.evaluate(
messages=conversation,
criteria=f"Reward conversations where the agent successfully achieves: {test_goal}"
)
return {
"conversation": conversation,
"goal_achieved": result.score >= 0.8,
"score": result.score,
"explanation": result.explanation
}
def format_transcript(transcript):
return "\n".join([f"{msg['role'].upper()}: {msg['content']}" for msg in transcript])
Example Usage
# Your agent function that you want to test
def my_agent_function(messages: list[dict]) -> str:
"""Your agent implementation that takes message history and returns a response."""
# ... your agent logic here ...
return agent_response_string
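# Illustration only (hypothetical, not from the original guide): a minimal concrete
# agent_function backed by the OpenAI chat API, so the test can run end to end.
# A real agent would typically call tools, retrieve context, and so on.
from openai import OpenAI

example_client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def example_openai_agent(messages: list[dict]) -> str:
    """Bare-bones agent: forwards the conversation history to a chat model."""
    response = example_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are an IT compliance assistant."},
            *messages,
        ],
    )
    return response.choices[0].message.content
# Pass agent_function=example_openai_agent below if you don't have an agent yet.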
# Optional: provide a reference transcript to guide the user simulator
reference_transcript = [
{"role": "user", "content": "Run the software compliance monitor"},
{"role": "assistant", "content": "Running compliance monitor... Found 3 unlicensed applications."},
{"role": "user", "content": "Which applications are unlicensed?"},
{"role": "assistant", "content": "The unlicensed applications are: Adobe Photoshop, Slack, Zoom."}
]
# Define what success looks like
test_goal = "Identify all unlicensed software applications on the user's system"
# Run the test against your ACTUAL agent
result = run_dynamic_test(
agent_function=my_agent_function,
test_goal=test_goal,
initial_user_message="Run the software compliance monitor",
reference_transcript=reference_transcript # Optional
)
print(f"Goal achieved: {result['goal_achieved']}")
print(f"Score: {result['score']}")
Approach 2: Turn-by-Turn Evaluation
If you have a reference conversation flow, you can test your agent’s ability to respond appropriately at each stage by progressively replaying the conversation and evaluating each response independently.
Key difference from Approach 1: Instead of letting the conversation evolve naturally (where the agent’s response affects the next user message), this approach uses a fixed sequence of user messages from a reference transcript. This allows you to test each turn independently without compounding effects.
How It Works
- Take a reference transcript showing the desired conversation flow
- At each user message, generate a fresh response from your agent given the conversation history so far
- Evaluate the generated response against your criteria
- Use the reference assistant response (not your agent’s response) for the conversation history when testing the next turn
- Aggregate scores across all turns
Implementation
from composo import Composo
composo = Composo()
def evaluate_progressive_turns(
agent_function,
reference_transcript: list[dict],
criteria: str | list[str] | dict[int, str | list[str]]
):
"""
Progressively test agent responses at each turn of a reference conversation.
For each user message in the transcript, generates a fresh response from your agent
and evaluates it. This tests how well your agent follows the intended conversation flow.
Example: Given transcript [U1, A1, U2, A2, U3, A3], this will:
- Generate A1' from your agent given [U1], evaluate [U1, A1']
- Generate A2' from your agent given [U1, A1, U2], evaluate [U1, A1, U2, A2']
- Generate A3' from your agent given [U1, A1, U2, A2, U3], evaluate [U1, A1, U2, A2, U3, A3']
Args:
agent_function: Your agent's response function (takes message history, returns response string)
reference_transcript: Reference conversation showing the desired flow
criteria: Evaluation criteria. Can be:
- Single string/list of strings (applied to all turns)
            - Dict mapping the index of an assistant message in reference_transcript to criteria (for turn-specific evaluation)
"""
results = []
conversation_history = []
for i, message in enumerate(reference_transcript):
if message["role"] == "user":
# Add user message to history
conversation_history.append(message)
elif message["role"] == "assistant":
# Generate fresh response from YOUR agent given the conversation so far
agent_response = agent_function(conversation_history)
# Create the conversation with the generated response
conversation_to_evaluate = conversation_history + [
{"role": "assistant", "content": agent_response}
]
# Get criteria for this specific turn (if dict) or use default
turn_criteria = criteria.get(i, criteria) if isinstance(criteria, dict) else criteria
# Evaluate this generated response
result = composo.evaluate(
messages=conversation_to_evaluate,
criteria=turn_criteria
)
results.append({
"turn": i,
"generated_response": agent_response[:100] + "...",
"score": result.score,
"explanation": result.explanation
})
# Use the REFERENCE assistant response for the next turn's context
# (so we're testing each turn independently, not compounding errors)
conversation_history.append(message)
# Calculate aggregate metrics
scores = [r["score"] for r in results if r["score"] is not None]
return {
"turn_results": results,
"average_score": sum(scores) / len(scores) if scores else None,
"min_score": min(scores) if scores else None,
"all_passed": all(s >= 0.8 for s in scores) if scores else False
}
Example Usage
# Your agent function that you want to test
def my_agent_function(messages: list[dict]) -> str:
"""Your agent implementation that takes message history and returns a response."""
# ... your agent logic here ...
return agent_response_string
# Reference conversation showing the desired flow
reference_transcript = [
{"role": "user", "content": "Run the software compliance monitor"},
{"role": "assistant", "content": "I'll run the software compliance monitor now. Scanning your system..."},
{"role": "user", "content": "What did you find?"},
{"role": "assistant", "content": "I found 3 applications without valid licenses: Adobe Photoshop, Slack, and Zoom."}
]
# Test your agent at each turn of the reference conversation
result = evaluate_progressive_turns(
agent_function=my_agent_function,
reference_transcript=reference_transcript,
criteria=[
"Reward responses that accurately execute the user's request",
"Reward responses that are clear and informative"
]
)
print(f"Average score: {result['average_score']}")
print(f"All turns passed: {result['all_passed']}")
for turn in result["turn_results"]:
print(f"Turn {turn['turn']}: {turn['score']:.2f}")
print(f" Generated: {turn['generated_response']}")
Turn-Specific Criteria
Different turns may have different expectations. You can specify criteria per turn and allow for multiple correct behaviors:
# Reference transcript
reference_transcript = [
{"role": "user", "content": "What's the weather in Paris?"},
{"role": "assistant", "content": "I'll check the weather for you."},
{"role": "user", "content": "Thanks"},
{"role": "assistant", "content": "The weather in Paris is currently 18°C and partly cloudy."}
]
# Turn-specific criteria allowing multiple correct behaviors
# (keys are the indices of the assistant messages in reference_transcript)
result = evaluate_progressive_turns(
agent_function=my_agent_function,
reference_transcript=reference_transcript,
criteria={
1: [
"Reward if the agent asks for clarification about which Paris (France, Texas, etc.)",
"Reward if the agent acknowledges and proceeds to check the weather",
"Reward if the agent immediately provides weather information"
],
3: [
"Reward if the agent provides the weather information",
"Reward if the agent confirms the request before providing information"
]
}
)
Adding multiple criteria allows you to specify that clarifying, acknowledging, or directly answering are all acceptable behaviors.
Combining with Agent Tracing
For comprehensive testing, combine either approach with Agent Tracing to capture detailed execution data:
from composo import Composo, ComposoTracer, Instruments, AgentTracer
from composo.models import criteria
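# Capture OpenAI calls made by the agent so they appear in the trace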
ComposoTracer.init(instruments=[Instruments.OPENAI])
composo = Composo()
def run_traced_dynamic_test(
agent_function,
test_goal: str,
initial_user_message: str,
reference_transcript: list[dict] | None = None
):
with AgentTracer("test_agent") as tracer:
# Run dynamic test against your actual agent (from Approach 1)
result = run_dynamic_test(
agent_function=agent_function,
test_goal=test_goal,
initial_user_message=initial_user_message,
reference_transcript=reference_transcript
)
# Evaluate with agent-specific criteria
trace_results = composo.evaluate_trace(
tracer.trace,
criteria=criteria.agent # Uses full agent evaluation framework
)
return {
"conversation": result["conversation"],
"agent_metrics": trace_results,
"goal_achieved": result["goal_achieved"],
"score": result["score"]
}
# You can also wrap Approach 2 with AgentTracer for turn-by-turn analysis
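# For example, an illustrative sketch (the wrapper name and return shape are
# assumptions; it reuses evaluate_progressive_turns and the imports above):
def run_traced_progressive_test(
    agent_function,
    reference_transcript: list[dict],
    turn_criteria
):
    with AgentTracer("test_agent") as tracer:
        # Replay the reference transcript against your agent (Approach 2)
        result = evaluate_progressive_turns(
            agent_function=agent_function,
            reference_transcript=reference_transcript,
            criteria=turn_criteria
        )
        # Evaluate the captured execution trace with agent-specific criteria
        trace_results = composo.evaluate_trace(
            tracer.trace,
            criteria=criteria.agent
        )
    return {
        "turn_results": result["turn_results"],
        "average_score": result["average_score"],
        "all_passed": result["all_passed"],
        "agent_metrics": trace_results
    }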