The Challenge
When testing multi-turn agent conversations, teams often model tests as scripted dialogues (sketched in code after the example turns below):
Turn 1: User asks "Run the software compliance monitor"
Turn 2: Agent runs the tool and returns results
Turn 3: User asks "Which applications are unlicensed?"
Turn 4: Agent lists unlicensed applications
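A minimal sketch of what that rigid script looks like in code (the expected replies and assertion style are hypothetical, not part of the Composo SDK):
# Hypothetical rigid script: each agent reply must match a pre-written answer
scripted_turns = [
    ("Run the software compliance monitor",
     "Running compliance monitor... Found 3 unlicensed applications."),
    ("Which applications are unlicensed?",
     "The unlicensed applications are: Adobe Photoshop, Slack, Zoom."),
]

def test_scripted_dialogue(agent_function):
    history = []
    for user_message, expected_reply in scripted_turns:
        history.append({"role": "user", "content": user_message})
        reply = agent_function(history)
        history.append({"role": "assistant", "content": reply})
        # Fails as soon as the agent takes any valid-but-different path
        assert expected_reply in reply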
The problem: agent responses are non-deterministic. The agent might take a valid but different path:
Turn 1: User asks "Run the software compliance monitor"
Turn 2: Agent responds "Can you confirm you want me to run the software compliance monitor?"
Turn 3: [Pre-scripted user message no longer makes sense]
The agent’s response is correct—it’s asking for confirmation—but the rigid test script breaks because it expected a different flow.
This guide covers two approaches to solve this problem.
Approach 1: User Simulation Agent
Instead of testing exact conversation paths, test whether the agent achieves the intended outcome.
How It Works
- Define the test by its goal, not its transcript
- Use an LLM to simulate the user dynamically, adapting to whatever the agent responds
- Evaluate the outcome against your success criteria
Implementation
from composo import Composo
from openai import OpenAI
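# Composo scores the finished conversation; OpenAI powers the simulated user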
composo = Composo()
openai_client = OpenAI()
def run_dynamic_test(
agent_function,
test_goal: str,
initial_user_message: str,
reference_transcript: list[dict] | None = None,
max_turns: int = 10
):
"""
Run a multi-turn test with dynamic user simulation against your live agent.
Args:
agent_function: Your agent's response function (takes message history, returns response string)
test_goal: What the test should achieve (e.g., "Complete software compliance check")
initial_user_message: The first message to send to the agent
reference_transcript: Optional example conversation showing the intended flow
max_turns: Maximum conversation turns before stopping
"""
# Build the user simulator prompt
reference_context = ""
if reference_transcript:
reference_context = f"""
REFERENCE CONVERSATION (for context on what the user is trying to accomplish):
{format_transcript(reference_transcript)}
"""
simulator_system = f"""You are simulating a user in a test scenario.
GOAL: {test_goal}
{reference_context}
Your job:
- Play the user role to help the agent achieve the goal
- Adapt naturally if the agent asks clarifying questions or takes a different path
- Stay focused on the goal—don't introduce unrelated topics
- If the goal is achieved, respond with exactly: [TEST_COMPLETE]
Respond only with what the user would say next."""
# Run the conversation dynamically with the actual agent
conversation = []
conversation.append({"role": "user", "content": initial_user_message})
for turn in range(max_turns):
# Call the ACTUAL agent function being tested
agent_response = agent_function(conversation)
conversation.append({"role": "assistant", "content": agent_response})
# Simulate next user turn
simulator_response = openai_client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": simulator_system},
*conversation
]
)
next_user_message = simulator_response.choices[0].message.content
# Check if test is complete
if "[TEST_COMPLETE]" in next_user_message:
break
conversation.append({"role": "user", "content": next_user_message})
# Evaluate the outcome
result = composo.evaluate(
messages=conversation,
criteria=f"Reward conversations where the agent successfully achieves: {test_goal}"
)
return {
"conversation": conversation,
"goal_achieved": result.score >= 0.8,
"score": result.score,
"explanation": result.explanation
}
def format_transcript(transcript):
return "\n".join([f"{msg['role'].upper()}: {msg['content']}" for msg in transcript])
Example Usage
# Your agent function that you want to test
def my_agent_function(messages: list[dict]) -> str:
"""Your agent implementation that takes message history and returns a response."""
# ... your agent logic here ...
return agent_response_string
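# Illustration only (hypothetical, not from the original guide): a minimal concrete
# agent_function backed by the OpenAI chat API, so the test can run end to end.
# A real agent would typically call tools, retrieve context, and so on.
from openai import OpenAI

example_client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def example_openai_agent(messages: list[dict]) -> str:
    """Bare-bones agent: forwards the conversation history to a chat model."""
    response = example_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are an IT compliance assistant."},
            *messages,
        ],
    )
    return response.choices[0].message.content
# Pass agent_function=example_openai_agent below if you don't have an agent yet.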
# Optional: provide a reference transcript to guide the user simulator
reference_transcript = [
{"role": "user", "content": "Run the software compliance monitor"},
{"role": "assistant", "content": "Running compliance monitor... Found 3 unlicensed applications."},
{"role": "user", "content": "Which applications are unlicensed?"},
{"role": "assistant", "content": "The unlicensed applications are: Adobe Photoshop, Slack, Zoom."}
]
# Define what success looks like
test_goal = "Identify all unlicensed software applications on the user's system"
# Run the test against your ACTUAL agent
result = run_dynamic_test(
agent_function=my_agent_function,
test_goal=test_goal,
initial_user_message="Run the software compliance monitor",
reference_transcript=reference_transcript # Optional
)
print(f"Goal achieved: {result['goal_achieved']}")
print(f"Score: {result['score']}")
Approach 2: Turn-by-Turn Evaluation
If you have a reference conversation flow, you can test your agent’s ability to respond appropriately at each stage by progressively replaying the conversation and evaluating each response independently.
Key difference from Approach 1: Instead of letting the conversation evolve naturally (where the agent’s response affects the next user message), this approach uses a fixed sequence of user messages from a reference transcript. This allows you to test each turn independently without compounding effects.
How It Works
- Take a reference transcript showing the desired conversation flow
- At each user message, generate a fresh response from your agent given the conversation history so far
- Evaluate the generated response against your criteria
- Use the reference assistant response (not your agent’s response) for the conversation history when testing the next turn
- Aggregate scores across all turns
Implementation
from composo import Composo
composo = Composo()
def evaluate_progressive_turns(
agent_function,
reference_transcript: list[dict],
criteria: str | list[str] | dict[int, str | list[str]]
):
"""
Progressively test agent responses at each turn of a reference conversation.
For each user message in the transcript, generates a fresh response from your agent
and evaluates it. This tests how well your agent follows the intended conversation flow.
Example: Given transcript [U1, A1, U2, A2, U3, A3], this will:
- Generate A1' from your agent given [U1], evaluate [U1, A1']
- Generate A2' from your agent given [U1, A1, U2], evaluate [U1, A1, U2, A2']
- Generate A3' from your agent given [U1, A1, U2, A2, U3], evaluate [U1, A1, U2, A2, U3, A3']
Args:
agent_function: Your agent's response function (takes message history, returns response string)
reference_transcript: Reference conversation showing the desired flow
criteria: Evaluation criteria. Can be:
- Single string/list of strings (applied to all turns)
            - Dict mapping the index of an assistant message in reference_transcript to criteria (for turn-specific evaluation)
"""
results = []
conversation_history = []
for i, message in enumerate(reference_transcript):
if message["role"] == "user":
# Add user message to history
conversation_history.append(message)
elif message["role"] == "assistant":
# Generate fresh response from YOUR agent given the conversation so far
agent_response = agent_function(conversation_history)
# Create the conversation with the generated response
conversation_to_evaluate = conversation_history + [
{"role": "assistant", "content": agent_response}
]
# Get criteria for this specific turn (if dict) or use default
turn_criteria = criteria.get(i, criteria) if isinstance(criteria, dict) else criteria
# Evaluate this generated response
result = composo.evaluate(
messages=conversation_to_evaluate,
criteria=turn_criteria
)
results.append({
"turn": i,
"generated_response": agent_response[:100] + "...",
"score": result.score,
"explanation": result.explanation
})
# Use the REFERENCE assistant response for the next turn's context
# (so we're testing each turn independently, not compounding errors)
conversation_history.append(message)
# Calculate aggregate metrics
scores = [r["score"] for r in results if r["score"] is not None]
return {
"turn_results": results,
"average_score": sum(scores) / len(scores) if scores else None,
"min_score": min(scores) if scores else None,
"all_passed": all(s >= 0.8 for s in scores) if scores else False
}
Example Usage
# Your agent function that you want to test
def my_agent_function(messages: list[dict]) -> str:
"""Your agent implementation that takes message history and returns a response."""
# ... your agent logic here ...
return agent_response_string
# Reference conversation showing the desired flow
reference_transcript = [
{"role": "user", "content": "Run the software compliance monitor"},
{"role": "assistant", "content": "I'll run the software compliance monitor now. Scanning your system..."},
{"role": "user", "content": "What did you find?"},
{"role": "assistant", "content": "I found 3 applications without valid licenses: Adobe Photoshop, Slack, and Zoom."}
]
# Test your agent at each turn of the reference conversation
result = evaluate_progressive_turns(
agent_function=my_agent_function,
reference_transcript=reference_transcript,
criteria=[
"Reward responses that accurately execute the user's request",
"Reward responses that are clear and informative"
]
)
print(f"Average score: {result['average_score']}")
print(f"All turns passed: {result['all_passed']}")
for turn in result["turn_results"]:
print(f"Turn {turn['turn']}: {turn['score']:.2f}")
print(f" Generated: {turn['generated_response']}")
Turn-Specific Criteria
Different turns may have different expectations. You can specify criteria per turn and allow for multiple correct behaviors:
# Reference transcript
reference_transcript = [
{"role": "user", "content": "What's the weather in Paris?"},
{"role": "assistant", "content": "I'll check the weather for you."},
{"role": "user", "content": "Thanks"},
{"role": "assistant", "content": "The weather in Paris is currently 18°C and partly cloudy."}
]
# Turn-specific criteria allowing multiple correct behaviors
# (keys are the indices of the assistant messages in reference_transcript)
result = evaluate_progressive_turns(
agent_function=my_agent_function,
reference_transcript=reference_transcript,
criteria={
1: [
"Reward if the agent asks for clarification about which Paris (France, Texas, etc.)",
"Reward if the agent acknowledges and proceeds to check the weather",
"Reward if the agent immediately provides weather information"
],
3: [
"Reward if the agent provides the weather information",
"Reward if the agent confirms the request before providing information"
]
}
)
Adding multiple criteria allows you to specify that clarifying, acknowledging, or directly answering are all acceptable behaviors.
Combining with Agent Tracing
For comprehensive testing, combine either approach with Agent Tracing to capture detailed execution data:
from composo import Composo, ComposoTracer, Instruments, AgentTracer
from composo.models import criteria
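# Capture OpenAI calls made by the agent so they appear in the trace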
ComposoTracer.init(instruments=[Instruments.OPENAI])
composo = Composo()
def run_traced_dynamic_test(
agent_function,
test_goal: str,
initial_user_message: str,
reference_transcript: list[dict] | None = None
):
with AgentTracer("test_agent") as tracer:
# Run dynamic test against your actual agent (from Approach 1)
result = run_dynamic_test(
agent_function=agent_function,
test_goal=test_goal,
initial_user_message=initial_user_message,
reference_transcript=reference_transcript
)
# Evaluate with agent-specific criteria
trace_results = composo.evaluate_trace(
tracer.trace,
criteria=criteria.agent # Uses full agent evaluation framework
)
return {
"conversation": result["conversation"],
"agent_metrics": trace_results,
"goal_achieved": result["goal_achieved"],
"score": result["score"]
}
# You can also wrap Approach 2 with AgentTracer for turn-by-turn analysis
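# For example, an illustrative sketch (the wrapper name and return shape are
# assumptions; it reuses evaluate_progressive_turns and the imports above):
def run_traced_progressive_test(
    agent_function,
    reference_transcript: list[dict],
    turn_criteria
):
    with AgentTracer("test_agent") as tracer:
        # Replay the reference transcript against your agent (Approach 2)
        result = evaluate_progressive_turns(
            agent_function=agent_function,
            reference_transcript=reference_transcript,
            criteria=turn_criteria
        )
        # Evaluate the captured execution trace with agent-specific criteria
        trace_results = composo.evaluate_trace(
            tracer.trace,
            criteria=criteria.agent
        )
    return {
        "turn_results": result["turn_results"],
        "average_score": result["average_score"],
        "all_passed": result["all_passed"],
        "agent_metrics": trace_results
    }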