# Attach Comment Endpoint
Source: https://docs.composo.ai/api-reference/annotations/attach-comment-endpoint
https://platform.composo.ai/api/evals-docs/openapi.json post /api/v1/annotations/comment
# Attach Rating Endpoint
Source: https://docs.composo.ai/api-reference/annotations/attach-rating-endpoint
https://platform.composo.ai/api/evals-docs/openapi.json post /api/v1/annotations/rating
# Reward
Source: https://docs.composo.ai/api-reference/evals/reward
https://platform.composo.ai/api/evals-docs/openapi.json post /api/v1/evals/reward
Evaluate LLM output against specified criteria. Score on a continuous 0-1 scale.
# Get My Rate Limits
Source: https://docs.composo.ai/api-reference/rate-limits/get-my-rate-limits
https://platform.composo.ai/api/evals-docs/openapi.json get /api/v1/rate-limits/me
Get rate limits for the authenticated user's domain
# List Traces Endpoint
Source: https://docs.composo.ai/api-reference/traces/list-traces-endpoint
https://platform.composo.ai/api/evals-docs/openapi.json get /api/v1/traces
# FAQs
Source: https://docs.composo.ai/documentation/FAQs/common-questions
### Should I include system messages when evaluating with Composo?
* Including system messages is optional but recommended, as they provide useful context that can improve evaluation accuracy.
### What's the context limit?
* This is model-dependent; see the context windows by model [here](/documentation/getting-started/models#available-model-versions).
### What's the expected response time?
* This is model-dependent; see the latency by model [here](/documentation/getting-started/models#available-model-versions).
### What are the rate limits?
* **Free plan:** 500 requests per hour
* **Paid plans:** Higher limits based on your specific requirements
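
If you want to stay under an hourly limit on the client side, one option is a sliding-window throttle. The sketch below is a hypothetical helper, not part of the Composo SDK; the class name, its parameters, and the assumption that the limit behaves like a rolling one-hour window are illustrative only. Your actual limits are reported by `GET /api/v1/rate-limits/me`.

```python
import time
from collections import deque


class HourlyThrottle:
    """Client-side sliding-window throttle (hypothetical helper)."""

    def __init__(self, max_requests: int = 500, window_seconds: float = 3600.0,
                 clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        self._sent = deque()  # timestamps of requests inside the current window

    def seconds_until_allowed(self) -> float:
        """Return 0.0 if a request may be sent now, else how long to wait."""
        now = self._clock()
        # Drop timestamps that have aged out of the window
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) < self.max_requests:
            return 0.0
        # The oldest request in the window decides when a slot frees up
        return self.window_seconds - (now - self._sent[0])

    def record(self) -> None:
        """Call once for every request actually sent."""
        self._sent.append(self._clock())
```

Before each evaluation call, sleep for `seconds_until_allowed()` seconds, send the request, then call `record()`.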
### What languages are supported?
* Our evaluation models support all major languages plus code. A good rule of thumb is that if you don't need a specialized model to deal with your language, we can handle it.
### Can I evaluate tool calls, not just responses?
* Yes! Composo evaluates all agent behaviour, including tool calls.
### How deterministic are the evaluation scores?
* Composo achieves \<1% variance in scoring, meaning the same input will produce virtually identical scores every time. This compares to 30%+ variance typical with LLM-as-judge approaches. We also cache results for benchmark evaluations to ensure perfect repeatability across runs.
### What do you mean by a generative reward model architecture?
* It's a dual-model system: one model generates detailed reasoning about why an output meets your criteria, while another specialized scoring model (trained on preference data) produces the actual score. This separation ensures both interpretable explanations and consistent, meaningful scores.
### How complex is the integration?
* Integration takes just 3 lines of code. You send your conversation and a simple evaluation criterion like "reward responses that are accurate." All the complexity happens behind the scenes. It's a drop-in replacement for anywhere you currently use LLM-as-judge.
### What makes Composo more accurate than LLM-as-judge?
* We use purpose-built reward models trained on tens of thousands of human preference comparisons across real-world domains. Instead of asking an LLM to generate arbitrary scores, our models learn quality distributions through pairwise comparisons (similar to ELO rankings). This creates meaningful, consistent scoring that's grounded in actual human judgments.
### How do you achieve such consistent scoring?
* We use a multi-layered approach including ensemble techniques and statistical aggregation. Multiple specialized models analyze each evaluation, and we aggregate their outputs to eliminate random variance. This is fundamentally different from single-model LLM approaches that produce different scores each time.
# Credits
Source: https://docs.composo.ai/documentation/billing/credits
Credits are how Composo bills your usage. Your monthly contract maps to a credit allowance, and each evaluation call costs a fraction of a credit based on the input tokens it processed and which model it used.
## What a credit is
A credit is a fixed amount of evaluation compute. Each model has its own tokens-per-credit rate:
| Model | Tokens per credit |
| ------------------------------------------------------ | ----------------- |
| `align-20260109` / `align-20251111` / `align-20250529` | 1,000,000 |
| `align-lightning-20251127` | 1,000,000 |
| `align-lightning-20250731` | 2,000,000 |
Some Lightning variants stretch further per credit than Align — same workload, fewer credits — because they're cheaper for us to serve. If you're cost-sensitive, leaning on those models for high-volume use cases is the main lever.
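
As a rough illustration of the table above, the credit cost of a single call is its input tokens divided by the model's tokens-per-credit rate. This is a minimal sketch: `credits_for_call` is a hypothetical helper, and any rounding applied on the billing side is not modelled.

```python
# Tokens-per-credit rates copied from the table above
TOKENS_PER_CREDIT = {
    "align-20260109": 1_000_000,
    "align-20251111": 1_000_000,
    "align-20250529": 1_000_000,
    "align-lightning-20251127": 1_000_000,
    "align-lightning-20250731": 2_000_000,
}


def credits_for_call(model: str, input_tokens: int) -> float:
    """Credits consumed by one evaluation call (hypothetical helper)."""
    return input_tokens / TOKENS_PER_CREDIT[model]


print(credits_for_call("align-20260109", 250_000))            # 0.25
print(credits_for_call("align-lightning-20250731", 250_000))  # 0.125
```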
Not sure which model you're calling? Check the `model_core` field in your API requests, or look at the per-model table on your `/usage` page.
## Why credits, not tokens
Tokens cost different amounts depending on the model. Credits normalise for that — the cost of a request matches what we actually spend serving it, so your contract goes further when you lean on cheaper-to-serve models.
## How your token contract maps to credits
Most contracts are denominated in tokens. The `/usage` page now expresses that allowance as credits, converted at the most favourable rate for you: **1 credit per 1,000,000 tokens** (the Align rate). A 10,000,000-token contract, for example, becomes a 10-credit allowance.
When you spend that allowance:
* **Align calls** cost 1 credit per million input tokens — exactly the rate at which your contract was converted, so your effective capacity matches your contract.
* **Lightning's cost-efficient variant** costs 1 credit per 2,000,000 input tokens — half the rate. A Lightning-heavy workload effectively gets up to 2× more capacity from the same contract.
If you've always used Align, your effective monthly capacity is unchanged. If you're using Lightning's cost-efficient variant, this is pure upside.
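
The arithmetic above can be sketched as follows. Both function names are hypothetical, and the mixed-workload calculation assumes traffic is split only between Align and the 2,000,000-tokens-per-credit Lightning variant.

```python
CONTRACT_TOKENS_PER_CREDIT = 1_000_000  # conversion rate stated above (the Align rate)


def contract_credits(contract_tokens: int) -> float:
    """Convert a token-denominated contract into a credit allowance."""
    return contract_tokens / CONTRACT_TOKENS_PER_CREDIT


def effective_token_capacity(credits: float, lightning_share: float) -> float:
    """Input tokens an allowance covers when `lightning_share` (0 to 1) of the
    tokens go to the cost-efficient Lightning variant and the rest go to Align."""
    # Credit cost of one token under this traffic mix
    credits_per_token = (1 - lightning_share) / 1_000_000 + lightning_share / 2_000_000
    return credits / credits_per_token


allowance = contract_credits(10_000_000)         # a 10-credit allowance
print(effective_token_capacity(allowance, 0.0))  # all Align: about 10M tokens
print(effective_token_capacity(allowance, 1.0))  # all Lightning variant: about 20M tokens
```

A 50/50 split by tokens lands between the two, at roughly 13.3M tokens for the same 10 credits.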
## Viewing your usage
The `/usage` page is scoped to one calendar month at a time. Pick the month from the dropdown in the top right; the current month is the default.
At the top, three status cards:
* **This Month** — credits consumed vs your allowance, with an overage indicator if you've gone over
* **Projected** — month-end projection based on your current pace, with an *on track* or *projected overage* status
* **vs Previous Month** — percentage delta vs the same point in the previous month, with both numbers spelled out so you can verify the math
Below the cards, a daily bar chart breaks down credit consumption by day and by model. A per-model table at the bottom shows requests, tokens, and credits for each model variant you've used this month — that's where you can see which models are driving your bill.
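
To sanity-check the status cards yourself, something like the sketch below may help. The exact projection method used by the `/usage` page is not documented here; this version assumes a simple linear extrapolation of month-to-date usage, and all function and key names are hypothetical.

```python
import calendar
from datetime import date


def projected_month_end(credits_used: float, today: date) -> float:
    """Linear extrapolation of month-to-date usage (an assumed method)."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return credits_used / today.day * days_in_month


def usage_status(credits_used: float, allowance: float,
                 prev_same_point: float, today: date) -> dict:
    """Approximate the three status cards from raw numbers."""
    projected = projected_month_end(credits_used, today)
    return {
        # This Month: how far past the allowance you are, if at all
        "overage": max(0.0, credits_used - allowance),
        # Projected: month-end estimate and its status label
        "projected": projected,
        "projected_status": "projected overage" if projected > allowance else "on track",
        # vs Previous Month: percentage delta, or None without a baseline
        "vs_previous_pct": (
            (credits_used - prev_same_point) / prev_same_point * 100
            if prev_same_point else None
        ),
    }


print(usage_status(5.0, 10.0, 4.0, date(2025, 4, 10)))
```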
## What happens if you go over
Running out of credits doesn't block requests. Evaluation calls keep succeeding past zero; your `credits_remaining` simply goes negative and the page shows an overage indicator. If you persistently exceed your allowance, we'll reach out to talk about adjusting the contract — there's no automated cut-off.
## Questions
Reach out to [support@composo.ai](mailto:support@composo.ai) — we're happy to walk you through your usage or talk about adjusting your contract.
# About
Source: https://docs.composo.ai/documentation/community-examples/about
This section showcases examples, integrations, and use cases contributed by the community.
These examples are not officially maintained by the Composo team. They are community contributions provided as starting points for your own implementations.
**Want to contribute?** If you've built something interesting or have an example you'd like to share, we'd love to feature it here!
# Multi-turn Evaluation
Source: https://docs.composo.ai/documentation/community-examples/multi-turn
Strategies for testing multi-turn agent conversations where agent responses are non-deterministic.
## The Challenge
When testing multi-turn agent conversations, teams often model tests as scripted dialogues:
```
Turn 1: User asks "Run the software compliance monitor"
Turn 2: Agent runs the tool and returns results
Turn 3: User asks "Which applications are unlicensed?"
Turn 4: Agent lists unlicensed applications
```
The problem: **agent responses are non-deterministic**. The agent might take a valid but different path:
```
Turn 1: User asks "Run the software compliance monitor"
Turn 2: Agent responds "Can you confirm you want me to run the software compliance monitor?"
Turn 3: [Pre-scripted user message no longer makes sense]
```
The agent's response is *correct*—it's asking for confirmation—but the rigid test script breaks because it expected a different flow.
This guide covers two approaches to solve this problem.
***
## Approach 1: User Simulation Agent
Instead of testing exact conversation paths, test whether the agent achieves the intended *outcome*.
### How It Works
1. Define the test by its **goal**, not its transcript
2. Use an LLM to **simulate the user** dynamically, adapting to whatever the agent responds
3. **Evaluate the outcome** against your success criteria
### Implementation
```python theme={null}
from composo import Composo
from openai import OpenAI

composo = Composo()
openai_client = OpenAI()


def run_dynamic_test(
    agent_function,
    test_goal: str,
    initial_user_message: str,
    reference_transcript: list[dict] | None = None,
    max_turns: int = 10
):
    """
    Run a multi-turn test with dynamic user simulation against your live agent.

    Args:
        agent_function: Your agent's response function (takes message history, returns response string)
        test_goal: What the test should achieve (e.g., "Complete software compliance check")
        initial_user_message: The first message to send to the agent
        reference_transcript: Optional example conversation showing the intended flow
        max_turns: Maximum conversation turns before stopping
    """
    # Build the user simulator prompt
    reference_context = ""
    if reference_transcript:
        reference_context = f"""
REFERENCE CONVERSATION (for context on what the user is trying to accomplish):
{format_transcript(reference_transcript)}
"""

    simulator_system = f"""You are simulating a user in a test scenario.
GOAL: {test_goal}
{reference_context}
Your job:
- Play the user role to help the agent achieve the goal
- Adapt naturally if the agent asks clarifying questions or takes a different path
- Stay focused on the goal; don't introduce unrelated topics
- If the goal is achieved, respond with exactly: [TEST_COMPLETE]
Respond only with what the user would say next."""

    # Run the conversation dynamically with the actual agent
    conversation = []
    conversation.append({"role": "user", "content": initial_user_message})

    for turn in range(max_turns):
        # Call the ACTUAL agent function being tested
        agent_response = agent_function(conversation)
        conversation.append({"role": "assistant", "content": agent_response})

        # Simulate next user turn
        simulator_response = openai_client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": simulator_system},
                *conversation
            ]
        )
        next_user_message = simulator_response.choices[0].message.content

        # Check if test is complete
        if "[TEST_COMPLETE]" in next_user_message:
            break

        conversation.append({"role": "user", "content": next_user_message})

    # Evaluate the outcome
    result = composo.evaluate(
        messages=conversation,
        criteria=f"Reward conversations where the agent successfully achieves: {test_goal}"
    )

    return {
        "conversation": conversation,
        "goal_achieved": result.score >= 0.8,
        "score": result.score,
        "explanation": result.explanation
    }


def format_transcript(transcript):
    return "\n".join([f"{msg['role'].upper()}: {msg['content']}" for msg in transcript])
```
### Example Usage
```python theme={null}
# Your agent function that you want to test
def my_agent_function(messages: list[dict]) -> str:
    """Your agent implementation that takes message history and returns a response."""
    # ... your agent logic here ...
    return agent_response_string


# Optional: provide a reference transcript to guide the user simulator
reference_transcript = [
    {"role": "user", "content": "Run the software compliance monitor"},
    {"role": "assistant", "content": "Running compliance monitor... Found 3 unlicensed applications."},
    {"role": "user", "content": "Which applications are unlicensed?"},
    {"role": "assistant", "content": "The unlicensed applications are: Adobe Photoshop, Slack, Zoom."}
]

# Define what success looks like
test_goal = "Identify all unlicensed software applications on the user's system"

# Run the test against your ACTUAL agent
result = run_dynamic_test(
    agent_function=my_agent_function,
    test_goal=test_goal,
    initial_user_message="Run the software compliance monitor",
    reference_transcript=reference_transcript  # Optional
)

print(f"Goal achieved: {result['goal_achieved']}")
print(f"Score: {result['score']}")
```
***
## Approach 2: Turn-by-Turn Evaluation
If you have a reference conversation flow, you can test your agent's ability to respond appropriately at each stage by progressively replaying the conversation and evaluating each response independently.
**Key difference from Approach 1**: Instead of letting the conversation evolve naturally (where the agent's response affects the next user message), this approach uses a **fixed sequence of user messages** from a reference transcript. This allows you to test each turn independently without compounding effects.
### How It Works
1. Take a reference transcript showing the desired conversation flow
2. At each user message, generate a fresh response from your agent given the conversation history so far
3. Evaluate the generated response against your criteria
4. Use the **reference assistant response** (not your agent's response) for the conversation history when testing the next turn
5. Aggregate scores across all turns
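
To make steps 2 and 4 concrete, the sketch below shows which history the agent under test would receive at each assistant turn. `turn_contexts` is a hypothetical helper that needs no SDK; it only slices the reference transcript.

```python
def turn_contexts(reference_transcript: list[dict]):
    """Yield (turn_index, history) for every assistant turn in the reference.

    `history` is everything before that turn, including the REFERENCE
    assistant replies, so each turn is tested independently.
    """
    for i, message in enumerate(reference_transcript):
        if message["role"] == "assistant":
            yield i, reference_transcript[:i]


transcript = [
    {"role": "user", "content": "U1"},
    {"role": "assistant", "content": "A1"},
    {"role": "user", "content": "U2"},
    {"role": "assistant", "content": "A2"},
]

for index, history in turn_contexts(transcript):
    print(index, [m["content"] for m in history])
# 1 ['U1']
# 3 ['U1', 'A1', 'U2']
```

The yielded indices (1 and 3 here) are the assistant message positions in the transcript.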
### Implementation
```python theme={null}
from composo import Composo

composo = Composo()


def evaluate_progressive_turns(
    agent_function,
    reference_transcript: list[dict],
    criteria: str | list[str] | dict[int, str | list[str]]
):
    """
    Progressively test agent responses at each turn of a reference conversation.

    For each user message in the transcript, generates a fresh response from your agent
    and evaluates it. This tests how well your agent follows the intended conversation flow.

    Example: Given transcript [U1, A1, U2, A2, U3, A3], this will:
    - Generate A1' from your agent given [U1], evaluate [U1, A1']
    - Generate A2' from your agent given [U1, A1, U2], evaluate [U1, A1, U2, A2']
    - Generate A3' from your agent given [U1, A1, U2, A2, U3], evaluate [U1, A1, U2, A2, U3, A3']

    Args:
        agent_function: Your agent's response function (takes message history, returns response string)
        reference_transcript: Reference conversation showing the desired flow
        criteria: Evaluation criteria. Can be:
            - Single string/list of strings (applied to all turns)
            - Dict mapping turn index to criteria (for turn-specific evaluation);
              turns missing from the dict are skipped
    """
    results = []
    conversation_history = []

    for i, message in enumerate(reference_transcript):
        if message["role"] == "user":
            # Add user message to history
            conversation_history.append(message)
        elif message["role"] == "assistant":
            # Get criteria for this specific turn (if dict) or use the shared criteria.
            # A dict without an entry for this turn yields None, so the turn is skipped
            # instead of passing the whole dict to the evaluator.
            turn_criteria = criteria.get(i) if isinstance(criteria, dict) else criteria

            if turn_criteria is not None:
                # Generate fresh response from YOUR agent given the conversation so far
                agent_response = agent_function(conversation_history)

                # Create the conversation with the generated response
                conversation_to_evaluate = conversation_history + [
                    {"role": "assistant", "content": agent_response}
                ]

                # Evaluate this generated response
                result = composo.evaluate(
                    messages=conversation_to_evaluate,
                    criteria=turn_criteria
                )

                results.append({
                    "turn": i,
                    "generated_response": agent_response[:100] + "...",
                    "score": result.score,
                    "explanation": result.explanation
                })

            # Use the REFERENCE assistant response for the next turn's context
            # (so we're testing each turn independently, not compounding errors)
            conversation_history.append(message)

    # Calculate aggregate metrics
    scores = [r["score"] for r in results if r["score"] is not None]

    return {
        "turn_results": results,
        "average_score": sum(scores) / len(scores) if scores else None,
        "min_score": min(scores) if scores else None,
        "all_passed": all(s >= 0.8 for s in scores) if scores else False
    }
```
### Example Usage
```python theme={null}
# Your agent function that you want to test
def my_agent_function(messages: list[dict]) -> str:
    """Your agent implementation that takes message history and returns a response."""
    # ... your agent logic here ...
    return agent_response_string


# Reference conversation showing the desired flow
reference_transcript = [
    {"role": "user", "content": "Run the software compliance monitor"},
    {"role": "assistant", "content": "I'll run the software compliance monitor now. Scanning your system..."},
    {"role": "user", "content": "What did you find?"},
    {"role": "assistant", "content": "I found 3 applications without valid licenses: Adobe Photoshop, Slack, and Zoom."}
]

# Test your agent at each turn of the reference conversation
result = evaluate_progressive_turns(
    agent_function=my_agent_function,
    reference_transcript=reference_transcript,
    criteria=[
        "Reward responses that accurately execute the user's request",
        "Reward responses that are clear and informative"
    ]
)

print(f"Average score: {result['average_score']}")
print(f"All turns passed: {result['all_passed']}")

for turn in result["turn_results"]:
    print(f"Turn {turn['turn']}: {turn['score']:.2f}")
    print(f"  Generated: {turn['generated_response']}")
```
#### Turn-Specific Criteria
Different turns may have different expectations. You can specify criteria per turn and allow for multiple correct behaviors:
```python theme={null}
# Reference transcript
reference_transcript = [
    {"role": "user", "content": "What's the weather in Paris?"},
    {"role": "assistant", "content": "I'll check the weather for you."},
    {"role": "user", "content": "Thanks"},
    {"role": "assistant", "content": "The weather in Paris is currently 18°C and partly cloudy."}
]

# Turn-specific criteria allowing multiple correct behaviors
# (keys are the assistant message indices in the reference transcript)
result = evaluate_progressive_turns(
    agent_function=my_agent_function,
    reference_transcript=reference_transcript,
    criteria={
        1: [
            "Reward if the agent asks for clarification about which Paris (France, Texas, etc.)",
            "Reward if the agent acknowledges and proceeds to check the weather",
            "Reward if the agent immediately provides weather information"
        ],
        3: [
            "Reward if the agent provides the weather information",
            "Reward if the agent confirms the request before providing information"
        ]
    }
)
```
Adding multiple criteria allows you to specify that clarifying, acknowledging, or directly answering are all acceptable behaviors.
***
## Combining with Agent Tracing
For comprehensive testing, combine either approach with [Agent Tracing](/documentation/monitoring/tracing) to capture detailed execution data:
```python theme={null}
from composo import Composo
from composo.models import criteria
# Tracing helpers imported from composo.tracing, matching the Agent Evaluation cookbook
from composo.tracing import ComposoTracer, Instruments, AgentTracer

ComposoTracer.init(instruments=[Instruments.OPENAI])
composo = Composo()


def run_traced_dynamic_test(
    agent_function,
    test_goal: str,
    initial_user_message: str,
    reference_transcript: list[dict] | None = None
):
    with AgentTracer("test_agent") as tracer:
        # Run dynamic test against your actual agent (from Approach 1)
        result = run_dynamic_test(
            agent_function=agent_function,
            test_goal=test_goal,
            initial_user_message=initial_user_message,
            reference_transcript=reference_transcript
        )

    # Evaluate with agent-specific criteria
    trace_results = composo.evaluate_trace(
        tracer.trace,
        criteria=criteria.agent  # Uses full agent evaluation framework
    )

    return {
        "conversation": result["conversation"],
        "agent_metrics": trace_results,
        "goal_achieved": result["goal_achieved"],
        "score": result["score"]
    }

# You can also wrap Approach 2 with AgentTracer for turn-by-turn analysis
```
# Agent Evaluation
Source: https://docs.composo.ai/documentation/cookbooks/agent-evaluation
Evaluate the performance of your agentic systems with Composo's comprehensive agent framework.
## Why Agent Evaluation Matters
As LLM applications evolve from simple chat interfaces to sophisticated agentic systems with tool calling, multi-step reasoning, and complex workflows, traditional evaluation approaches fail to capture what makes agents actually work in production.
## The Composo Agent Framework
Start here with our battle-tested framework that evaluates agents across five critical dimensions. We've developed this framework through extensive R\&D and tested with industry partners.
### Proven Through Rigorous Research & Real-World Testing
This framework represents **>12 months of intensive R\&D** with leading AI teams who needed agent evaluation that actually works in production. Here's what makes it different:
**The Research Journey**
* **Thousands of production agent traces analyzed** from both regulated enterprises and leading AI startups
* **12 major framework iterations** based on real-world failure modes we discovered
* **Validated across 8 industries** including healthcare, finance, legal, and deep knowledge research
* **>85% accuracy** in predicting agent success/failure before deployment
* **3x faster debugging** of agent issues compared to manual analysis
**Why These Specific Metrics?**
Our research revealed that agent failures cluster into distinct, recurring patterns. Traditional "did it get the right answer?" evaluation misses >70% of these failure modes:
* **Exploration vs Exploitation imbalance**: Agents that either never try new approaches (getting stuck) or never leverage what they've learned (inefficient loops)
* **Tool misuse patterns**: Subtle errors in parameter formatting that work 90% of the time but fail catastrophically on edge cases
* **Goal drift**: Agents that solve *a* problem but not *the user's* problem
* **Hallucinated capabilities**: Agents that hallucinate, as LLMs are prone to do (e.g. claiming success when tools actually returned errors, or abandoning critical information from earlier in the conversation)
Each metric in our framework directly addresses these production failure modes. This isn't academic theory—it's battle-tested engineering derived from millions of real agent interactions.
**Industry Validation**
*"Composo's agent framework caught critical issues our own evaluation suite missed. It identified tool-calling patterns that would have caused production outages."* - ML Engineer, Fortune 500 Financial Services
*"We reduced our agent failure rate by 35% after implementing Composo's evaluation framework in our CI/CD pipeline."* - Head of AI, Healthcare Startup
This framework now evaluates over **10 million agent interactions monthly** across our customer base, continuously proving its effectiveness at scale.
### Core Agent Metrics
**🔍 Exploration**
`Reward agents that plan effectively: exploring new information and capabilities, and investigating unknowns despite uncertainty`
**⚡ Exploitation**
`Reward agents that plan effectively: exploiting existing knowledge and available context to create reliable plans with predictable outcomes`
**🔧 Tool Use**
`Reward agents that operate tools correctly in accordance with the tool definition, using all relevant context available in tool calls`
**🎯 Goal Pursuit**
`Reward agents that work towards the goal specified by the user`
**✅ Agent Faithfulness**
`Reward agents that only make claims that are directly supported by given source material or returns from tool calls without any hallucination or speculation`
## Implementation Guide
Agent evaluation is currently only available with our default model, not the Lightning model.
### Using Agent Tracing
The recommended approach for agent evaluation is to use our tracing SDK. This allows you to instrument your agent code and capture real-time execution traces for evaluation.
**Agent Evaluation Criteria:**
* `agent_exploration` - Reward agents that plan effectively: exploring new information and capabilities, and investigating unknowns despite uncertainty
* `agent_exploitation` - Reward agents that plan effectively: exploiting existing knowledge and available context to create reliable plans with predictable outcomes
* `agent_tool_use` - Reward agents that operate tools correctly in accordance with the tool definition, using all relevant context available in tool calls
* `agent_goal_pursuit` - Reward agents that work towards the goal specified by the user
* `agent_faithfulness` - Reward agents that only make claims that are directly supported by given source material or returns from tool calls without any hallucination or speculation
Alternatively, `criteria.agent` is a list that contains all of the above.
Get started evaluating your agent in under 5 minutes using our tracing SDK and pre-built agent framework:
```python wrap theme={null}
from composo import Composo
from composo.models import criteria
from composo.tracing import ComposoTracer, Instruments, AgentTracer, agent_tracer
from openai import OpenAI

# Initialize tracing for OpenAI
ComposoTracer.init(instruments=[Instruments.OPENAI])
composo_client = Composo(api_key="YOUR_API_KEY")
openai_client = OpenAI()


# Define a weather agent as a function
@agent_tracer(name="weather_agent")
def get_weather_info(location):
    return openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"What's the weather in {location}?"}],
        max_tokens=100
    )


# Orchestrator coordinates the agent workflow
with AgentTracer("orchestrator") as tracer:
    # Execute the weather agent
    result = get_weather_info("Paris")

# Evaluate the full agent trace
results = composo_client.evaluate_trace(tracer.trace, criteria=criteria.agent)

for result, criterion in zip(results, criteria.agent):
    print(f"Criterion: {criterion}")
    for agent in result.results_by_agent_name:
        print(f"{agent}:")
        print(f"  summary_statistics: {result.results_by_agent_name[agent].summary_statistics} ")
        # instance_id avoids shadowing the built-in id()
        for instance_id in result.results_by_agent_name[agent].results_by_agent_instance_id:
            if result.results_by_agent_name[agent].results_by_agent_instance_id[instance_id]:
                print(f"  Agent instance: {instance_id}")
                print(f"    Score: {result.results_by_agent_name[agent].results_by_agent_instance_id[instance_id].score}")
                print(f"    Explanation: {result.results_by_agent_name[agent].results_by_agent_instance_id[instance_id].explanation}")
    print("-" * 40)
```
[Learn more about agent tracing →](/pages/sdk/tracing)
### Evaluating with Individual Metrics
You can also evaluate against specific metrics from the framework:
```python wrap theme={null}
# Evaluate specific aspects of agent behavior
results = composo_client.evaluate_trace(
    tracer.trace,
    criteria=[
        criteria.agent_goal_pursuit,
        criteria.agent_tool_use,
        criteria.agent_faithfulness
    ]
)
```
## Advanced Agent Metrics
Once you've mastered the core framework, explore these additional agent-level metrics for deeper insights:
**Agent Sequencing**
`Reward agents that follow logical sequences, such as gathering required information from user before attempting specific lookups`
**Agent Efficiency**
`Reward agents that are efficient when working towards their goal`
**Agent Thoroughness**
`Reward agents that are fully comprehensive and thorough when working towards their goal`
## Evaluating Individual Tool Calls
For granular analysis, evaluate specific tool call steps within your agent trace:
**Tool Call Formulation**
`Reward tool calls that formulate arguments using only information provided by the user or previous tool call returns without fabricating parameters`
**Tool Relevance**
`Reward tool calls that perform actions or retrieve information directly relevant to the goal`
**Response Completeness from Tool Returns**
`Reward responses that incorporate all relevant information from tool call returns needed to comprehensively answer the user's question`
**Response Precision from Tool Returns**
`Reward responses that include only the specific information from tool call returns that directly addresses the user's query`
**Response Faithfulness to Tool Returns**
`Reward responses that only make claims that are directly supported by given source material or returns from tool calls without any hallucination or speculation`
## Writing Custom Agent Criteria
While our agent framework and additional metrics cover many use cases, you can write custom criteria for your specific domain. See our [Criteria Writing guide](/pages/guides/criteria-writing) for detailed instructions on crafting your own criteria.
Common patterns for custom agent criteria:
```python wrap theme={null}
# Healthcare agent
"Reward agents that appropriately defer to medical professionals for diagnosis"
# Financial agent
"Reward agents that verify account permissions before accessing sensitive data"
# Code generation agent
"Reward agents that validate syntax before executing code modifications"
# Research agent
"Reward agents that prioritize peer-reviewed sources over general web content"
```
## Next Steps
* [**Read our Agent Evaluation Blog**](https://www.composo.ai/post/agentic-evals) - Deep dive into evaluation strategies
* [**Explore the Criteria Library**](/pages/guides/criteria-library) - Find more pre-built criteria
# Anonymization
Source: https://docs.composo.ai/documentation/cookbooks/anonymization
Anonymizing your data while maintaining evaluation quality
# Anonymizing Data for Composo Evaluations
When dealing with sensitive customer information, you may need to anonymize data before sending it to Composo evaluation services. This guide explains how to effectively anonymize your data while preserving evaluation quality.
## Recommended Anonymization Approach
For optimal evaluation results, we recommend using a **consistent placeholder substitution** approach rather than removing or scrambling PII. This preserves relationships between entities that are important for evaluation quality.
### Best Practices
1. **Use sequential placeholders** for each entity type
* Replace "Bob sent an email to Sally" with "NAME\_1 sent an email to NAME\_2"
* This preserves relationships between entities
2. **Maintain placeholder consistency** across all related content
* The same entity should have the same placeholder ID throughout a single evaluation request
* Example: If "Sally" is "NAME\_2" in one part, it should remain "NAME\_2" everywhere in that request
3. **Preserve structure and context**
* Keep sentence structure, formatting, and non-PII context intact
* This ensures evaluations remain accurate and meaningful
Numbering can be omitted if there is only one instance of a particular entity type. For example, if only one name appears in your data, you can simply use "NAME" instead of "NAME\_1".
## Recommended PII Types to Anonymize
* Person names → "NAME\_1", "NAME\_2", etc.
* Email addresses → "EMAIL\_1", "EMAIL\_2", etc.
* Phone numbers → "PHONE\_1", "PHONE\_2", etc.
* Physical addresses → "ADDRESS\_1", "ADDRESS\_2", etc. (you can retain country/region)
* URLs → "URL\_1", "URL\_2", etc.
## Implementation Example
**Original Data:**
```json theme={null}
{
  "messages": [
    {"role": "user", "content": "How do I contact Bob Smith?"},
    {"role": "assistant", "content": "You can reach Bob Smith at bob.smith@example.com or call him at (555) 123-4567."}
  ],
  "evaluation_criteria": "Reward responses that provide complete contact information when requested."
}
```
**Anonymized Data:**
```json theme={null}
{
  "messages": [
    {"role": "user", "content": "How do I contact NAME_1?"},
    {"role": "assistant", "content": "You can reach NAME_1 at EMAIL_1 or call him at PHONE_1."}
  ],
  "evaluation_criteria": "Reward responses that provide complete contact information when requested."
}
```
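
The transformation above can be sketched with the standard library alone. This is an illustrative sketch, not a production PII detector: `PlaceholderAnonymizer` is a hypothetical class, the regular expressions cover only simple email and US-style phone formats, and names must be supplied by the caller because detecting them reliably needs an entity-recognition tool.

```python
import re

# Deliberately simple patterns; real data will need broader coverage
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")


class PlaceholderAnonymizer:
    """Consistent sequential placeholders within one evaluation request."""

    def __init__(self, known_names):
        # Longest names first so "Bob Smith" is replaced before "Bob"
        self._known_names = sorted(known_names, key=len, reverse=True)
        self._mapping = {}   # (entity type, original value) -> placeholder
        self._counters = {}  # entity type -> last number issued

    def _placeholder(self, entity_type, value):
        key = (entity_type, value)
        if key not in self._mapping:
            self._counters[entity_type] = self._counters.get(entity_type, 0) + 1
            self._mapping[key] = f"{entity_type}_{self._counters[entity_type]}"
        return self._mapping[key]

    def anonymize(self, text):
        # Emails first, so name fragments inside addresses are never touched
        text = EMAIL_RE.sub(lambda m: self._placeholder("EMAIL", m.group()), text)
        text = PHONE_RE.sub(lambda m: self._placeholder("PHONE", m.group()), text)
        for name in self._known_names:
            if name in text:
                text = text.replace(name, self._placeholder("NAME", name))
        return text


anonymizer = PlaceholderAnonymizer(known_names=["Bob Smith"])
messages = [
    {"role": "user", "content": "How do I contact Bob Smith?"},
    {"role": "assistant", "content": "You can reach Bob Smith at bob.smith@example.com or call him at (555) 123-4567."},
]
# Reuse ONE anonymizer for every message in the request to keep IDs consistent
anonymized = [{**m, "content": anonymizer.anonymize(m["content"])} for m in messages]
print(anonymized[1]["content"])
# You can reach NAME_1 at EMAIL_1 or call him at PHONE_1.
```

Create a fresh anonymizer for each evaluation request; placeholder IDs only need to be consistent within a single request.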
## Tools for Anonymization
We recommend using [Microsoft Presidio](https://github.com/microsoft/presidio), an open-source framework for PII detection and anonymization. It provides:
* Entity recognition for common PII types
* Multiple anonymization methods
* Support for multiple languages
* Customizable entity detection
# RAG Evaluation
Source: https://docs.composo.ai/documentation/cookbooks/rag-evaluation
Battle-tested metrics for retrieval-augmented generation including faithfulness, completeness, and precision.
## Why RAG Evaluation Matters
Retrieval-Augmented Generation (RAG) systems are only as good as their ability to accurately use retrieved information. Poor RAG performance leads to hallucinations, incomplete answers, and loss of user trust. Composo's RAG framework provides comprehensive evaluation across the critical dimensions of RAG quality.
## The Composo RAG Framework
Our framework, developed through extensive R\&D and rigorously tested with Fortune 500 companies and leading AI teams, delivers **92% accuracy** in detecting hallucinations and faithfulness violations—far exceeding the \~70% accuracy of LLM-as-judge approaches.
### Proven Performance
* **18 months of research** refining the optimal RAG evaluation criteria
* **Battle-tested** across hundreds of production RAG systems, including critical hallucination detection in regulated industries
* **92% agreement** with expert human evaluators on RAG quality assessment
* **70% reduction in error rate** compared to traditional LLM-as-judge methods
This isn't just another evaluation tool—it's the result of deep collaboration with industry leaders who needed evaluation that actually works for production RAG systems handling millions of queries daily.
### Core RAG Metrics
**📖 Context Faithfulness**
"Reward responses that make only claims directly supported by the provided source material without any hallucination or speculation"
**✅ Completeness**
"Reward responses that comprehensively include all relevant information from the source material needed to fully answer the question"
**🎯 Context Precision**
"Reward responses that include only information necessary to answer the question without extraneous details from the source material"
**🔍 Relevance**
"Reward responses where all content directly addresses and is relevant to answering the user's specific question"
## Implementation Example
Our SDK now provides independent criteria variables for RAG evaluation, making it easier to use specific criteria or create custom combinations. Each criterion is defined as a separate variable with clear, focused descriptions.
**RAG Evaluation Criteria:**
* `rag_faithfulness` - Reward responses that make only claims directly supported by the provided source material without any hallucination or speculation
* `rag_completeness` - Reward responses that comprehensively include all relevant information from the source material needed to fully answer the question
* `rag_precision` - Reward responses that include only information necessary to answer the question without extraneous details from the source material
* `rag_relevance` - Reward responses where all content directly addresses and is relevant to answering the user's specific question
Alternatively, `criteria.rag` is a list containing all of the above.
Here's how to evaluate a RAG system's performance using our framework:
```python Python wrap theme={null}
from composo import Composo, criteria
composo_client = Composo(api_key="your-api-key-here")
# Example RAG conversation with retrieved context
messages = [
    {
        "role": "user",
        "content": """What is the current population of Tokyo?
Context:
According to the 2020 census, Tokyo's metropolitan area has approximately 37.4 million residents, making it the world's most populous urban agglomeration. The Tokyo Metropolis itself has 14.0 million people."""
    },
    {
        "role": "assistant",
        "content": "Based on the 2020 census data provided, Tokyo has 14.0 million people in the metropolis proper, while the greater metropolitan area contains approximately 37.4 million residents, making it the world's largest urban agglomeration."
    }
]
# Evaluate with the RAG framework
results = composo_client.evaluate(
    messages=messages,
    criteria=criteria.rag
)
for result, criterion in zip(results, criteria.rag):
    print(f"Criterion: {criterion}")
    print(f"Score: {result.score}")
    print(f"Explanation: {result.explanation}")
    print("-" * 40)
```
## Evaluating Retrieval Quality
Beyond evaluating the generated responses, you can also assess the quality of your retrieval system itself. This helps identify when your vector search or retrieval mechanism needs improvement before it impacts downstream generation.
### How It Works
Treat your retrieval step as a "tool call" and evaluate whether the retrieved chunks are actually relevant to the user's query. This gives you quantitative metrics on retrieval precision.
### Implementation
```python Python wrap theme={null}
from composo import Composo
composo_client = Composo(api_key="your-api-key-here")
# User's question
user_query = "What is the current population of Tokyo?"
# Chunks retrieved by your RAG system
retrieved_chunks = """
Chunk 1: According to the 2020 census, Tokyo's metropolitan area has approximately 37.4 million residents.
Chunk 2: The Tokyo Metropolis itself has 14.0 million people.
Chunk 3: Population density in Tokyo is approximately 6,158 people per square kilometer.
"""
# Define the retrieval tool (for context)
tools = [
    {
        "type": "function",
        "function": {
            "name": "rag_retrieval",
            "description": "Retrieves relevant document chunks based on semantic search",
            "parameters": {"type": "object", "required": [], "properties": {}}
        }
    }
]
# Evaluate retrieval quality
result = composo_client.evaluate(
    messages=[
        {"role": "user", "content": user_query},
        {"role": "function", "name": "rag_retrieval", "content": retrieved_chunks}
    ],
    tools=tools,
    criteria="Reward tool calls that retrieve chunks directly relevant to answering the user's question"
)
print(f"Retrieval Quality Score: {result.score:.2f}/1.00")
# High scores (>0.8) indicate good retrieval
# Low scores (<0.6) suggest retrieval improvements needed
```
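The thresholds in the comments above can be wrapped in a small helper for dashboards or CI checks. A minimal sketch: `classify_retrieval` is a hypothetical function, and the "borderline" label for scores between 0.6 and 0.8 is an assumption, since the guidance above only names the high and low bands.

```python
def classify_retrieval(score: float) -> str:
    """Bucket a 0-1 retrieval score using the thresholds quoted above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    if score > 0.8:
        return "good"
    if score < 0.6:
        return "needs improvement"
    # Assumption: the middle band is not named in the guidance above.
    return "borderline"


for example_score in (0.92, 0.70, 0.41):
    print(example_score, classify_retrieval(example_score))
```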
# Response Quality Evaluation
Source: https://docs.composo.ai/documentation/cookbooks/response-evaluation
Evaluate custom quality aspects of LLM responses
Beyond our pre-built Agent & RAG frameworks, Composo's real power lies in writing custom criteria for any quality aspect you care about—and most teams do exactly this for their specific use cases.
## What is Response Quality Evaluation?
Response quality evaluation assesses subjective and domain-specific aspects of assistant responses: tone, style, safety, adherence to guidelines, and any custom quality metric unique to your application.
## Example Criteria
### Core Quality Metrics
* **Conciseness**: `"Reward responses that are clear and direct, avoiding unnecessary verbosity, repetition, or extraneous details"`
* **Information Structure**: `"Reward responses that present information in a logical, well-organized format that prioritizes the most important details"`
* **Professional Tone**: `"Reward responses that maintain appropriate professional language and tone suitable for the context"`
* **Actionable Guidance**: `"Reward responses that provide practical next steps or actionable recommendations when appropriate"`
### Safety & Compliance
* **Harmful Content**: `"Penalize responses that provide inappropriate advice (e.g., medical advice, harmful instructions) outside the system's intended scope"`
* **System Compliance**: `"Penalize responses that violate explicit system constraints, limitations, or instructions"`
### Domain-Specific Examples
* **Healthcare**: `"Reward responses that use precise medical terminology appropriate for the audience (clinician vs patient)"`
* **Customer Service**: `"Reward responses that express appropriate empathy when the user is frustrated"`
* **Technical Support**: `"Reward responses that precisely adhere to the technical user manual's resolution steps"`
* **Education**: `"Reward responses that adapt explanation complexity to match the user's learning level"`
## Writing Effective Criteria
Every criterion follows this simple template:
```
[Prefix] [quality] [qualifier (optional)]
```
* **Prefix**: "Reward responses that..." or "Penalize responses that..."
* **Quality**: The specific behavior you want to evaluate
* **Qualifier**: Optional "if" statement for conditional application
**Example**: `"Reward responses that provide code examples if the user asks for implementation details"`
* Prefix: "Reward responses that"
* Quality: "provide code examples"
* Qualifier: "if the user asks for implementation details"
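The template can also be assembled programmatically, which helps when generating many criteria from a shared set of qualities. A minimal sketch: `build_criterion` is a hypothetical helper, not part of the Composo SDK, and it only checks that the prefix starts with "Reward" or "Penalize".

```python
from typing import Optional


def build_criterion(prefix: str, quality: str, qualifier: Optional[str] = None) -> str:
    """Join prefix, quality and optional qualifier into one criterion sentence."""
    if not prefix.startswith(("Reward", "Penalize")):
        raise ValueError("prefix should start with 'Reward' or 'Penalize'")
    parts = [prefix.strip(), quality.strip()]
    if qualifier:
        parts.append(qualifier.strip())
    return " ".join(parts)


print(build_criterion(
    "Reward responses that",
    "provide code examples",
    "if the user asks for implementation details",
))
```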
### Key Principles
✅ **Be specific** - Focus on one quality at a time\
✅ **Use clear direction** - Start with "Reward" or "Penalize"\
✅ **Add qualifiers when needed** - Use "appropriate" for non-monotonic qualities\
✅ **Leverage domain expertise** - Your knowledge of what "good" looks like is your secret weapon
## Next Steps
📚 [**Browse our Criteria Library**](/pages/guides/criteria-library) - Explore tried & tested criteria across domains for inspiration\
✏️ [**How to Write Criteria Guide**](/pages/guides/criteria-writing) - Master the art of writing precise evaluation criteria
# Context & Knowledge Store
Source: https://docs.composo.ai/documentation/getting-started/context
Ground evaluations in your own documents and example evaluations
Our latest Align model, `align-20260109`, can ground its judgments in **context you provide** — your own
reference material and your own example evaluations. Instead of relying only on the criteria and the trace
being scored, the model draws on what you've uploaded, so its scores reflect your domain knowledge and the
way *you* would grade.
You manage this context on the **[Context page](https://platform.composo.ai/context)** in the platform,
where you can upload two things: **Documents** and **Annotations**.
Context is only used when you evaluate with **`align-20260109`**. The API defaults to `align-20251111`,
so you must explicitly select `align-20260109` (see [Using your context](#using-your-context) below) for
uploaded Documents and Annotations to take effect. `align-20260109` is currently in Beta.
## Documents (knowledge store)
Upload reference material that the model should know about — product documentation, policies, style guides,
support macros, domain glossaries, and so on. When you run an evaluation, the model automatically pulls the
most relevant parts of your documents and uses them to ground its judgment.
This is ideal when correct scoring depends on facts the model can't be expected to know — for example
checking an answer against your own product behaviour, or judging whether a response follows your internal
policy.
* **Supported file types:** PDF, Word, PowerPoint, plain text, and Markdown.
* **Processing:** documents are processed shortly after upload; you'll see each one move to **processed** on
the Context page when it's ready to use.
* **Duplicates:** uploading a file you've already added is automatically skipped.
## Annotations
Upload **example evaluations** — your own labeled judgments showing how a particular response should be
scored and why. The model uses these as guidance to better match your scoring standards on similar cases.
Annotations are useful when your grading involves nuanced judgment calls that are easier to *show* with
examples than to fully spell out in a criteria sentence.
Annotations take around **24 hours** to be ready after upload. You'll be notified once they're available,
and you can track their status on the Context page.
## Using your context
Once your Documents and Annotations are uploaded, evaluate with `align-20260109` to put them to work. Select
the model when you create the client (or set `model_core` directly in an API request):
```python Python wrap theme={null}
from composo import Composo
# Select align-20260109 so your uploaded context is used
composo_client = Composo(api_key="YOUR_API_KEY", model_core="align-20260109")
result = composo_client.evaluate(
    messages=[
        {"role": "user", "content": "Does the Pro plan include SSO?"},
        {"role": "assistant", "content": "Yes — SSO is included on the Pro plan and above."}
    ],
    criteria="Reward responses that correctly describe what's included in each plan"
)
print(f"Score: {result.score}")
print(f"Analysis: {result.explanation}")
```
```bash cURL theme={null}
curl -X POST "https://platform.composo.ai/api/v1/evals/reward" \
-H "API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model_core": "align-20260109",
"messages": [
{"role": "user", "content": "Does the Pro plan include SSO?"},
{"role": "assistant", "content": "Yes — SSO is included on the Pro plan and above."}
],
"evaluation_criteria": "Reward responses that correctly describe what'\''s included in each plan"
}'
```
With `align-20260109` selected, the model automatically grounds this evaluation in the relevant parts of
your uploaded context — no extra parameters needed.
## Related
* [Models](/documentation/getting-started/models) — full list of Align model versions
* [Ground Truth Evaluation](/documentation/guides/ground-truths): insert a known correct answer directly
  into a single criterion
# Models
Source: https://docs.composo.ai/documentation/getting-started/models
Composo has developed two distinct model types, each achieving best-in-class scoring performance for its task.
**Expert-level Agent evaluation for production confidence**
* Our flagship model for when accuracy matters most
* 5-15 second response time
* Achieves 92% accuracy on real-world evaluation tasks (vs \~70% for LLM-as-judge)
* Detailed, evidence-based explanations
* Optimized for evaluating complex agentic applications end-to-end
**Fast evaluation for rapid iteration**
* 3 second median response time
* Optimized for development workflows and real-time feedback
* Ideal for quick iteration during development and testing
## Available Model Versions
### Composo Align Versions
| Version | Context Window | Latency | Notes |
| ---------------- | -------------- | ------------ | ------------------------------------------------------------------------------------------------ |
| `align-20260109` | 120K tokens | 5-10 seconds | Beta. Supports [Context & Knowledge Store](/documentation/getting-started/context). |
| `align-20251111` | 350K tokens | 5-10 seconds | Current stable version |
| `align-20250529` | 150K tokens | 5-15 seconds | Deprecated - migrate to `align-20251111` |
`align-20260109` can ground its evaluations in your own reference documents and example evaluations — see
[Context & Knowledge Store](/documentation/getting-started/context).
### Composo Align Lightning Versions
| Version | Context Window | Latency | Notes |
| -------------------------- | -------------- | ------------ | ---------------------- |
| `align-lightning-20251127` | 32K tokens | 100 - 800 ms | Beta |
| `align-lightning-20250731` | 200K tokens | 1-5 seconds | Current stable version |
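The version tables above can be turned into a small lookup for choosing a model whose context window fits a given input. A minimal sketch: `models_that_fit` is a hypothetical helper, "K" is assumed to mean 1,000 tokens, and how you count the tokens in your input is left to you.

```python
# Context windows copied from the tables above (assuming 1K = 1,000 tokens).
CONTEXT_WINDOWS = {
    "align-20260109": 120_000,
    "align-20251111": 350_000,
    "align-20250529": 150_000,
    "align-lightning-20251127": 32_000,
    "align-lightning-20250731": 200_000,
}
DEPRECATED = {"align-20250529"}


def models_that_fit(token_count: int) -> list:
    """Return non-deprecated versions whose context window covers token_count."""
    return [
        name
        for name, window in CONTEXT_WINDOWS.items()
        if name not in DEPRECATED and token_count <= window
    ]


print(models_that_fit(250_000))
```

The result can then be passed as `model_core` when creating the client or sending an API request.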
# Quickstart
Source: https://docs.composo.ai/documentation/getting-started/quickstart
Ship AI agents that actually work in production
Composo delivers deterministic, accurate evaluation for LLM applications through purpose-built generative reward models. Unlike unreliable LLM-as-judge approaches, our specialized models provide consistent, precise scores you can trust, using just a single-sentence criterion.
Get up and running with Composo in under 5 minutes. This guide will help you evaluate your first LLM response and understand how Composo delivers deterministic, accurate evaluations.
## Step 1: Create Your Account
Sign up for a Composo account at [platform.composo.ai](https://platform.composo.ai).
## Step 2: Generate Your API Key
1. Navigate to **Profile** → **API Keys** in the dashboard
2. Click **Generate New API Key**
## Step 3: Run Your First Evaluation
\[Optional] Install the SDK:
```bash theme={null}
pip install composo
```
Now let's evaluate a customer service response for empathy and helpfulness using the Composo SDK:
```python Python wrap theme={null}
from composo import Composo
# Initialize the client with your API key
composo_client = Composo(api_key="YOUR_API_KEY")
# Example: Evaluating a customer service response
result = composo_client.evaluate(
    messages=[
        {"role": "user", "content": "I'm really frustrated with my device not working."},
        {"role": "assistant", "content": "I'm sorry to hear that you're experiencing issues with your device. Let's see how I can assist you to resolve this problem."}
    ],
    criteria="Reward responses that express appropriate empathy if the user is facing a problem they're finding frustrating"
)
# Display results
print(f"Score: {result.score}")
print(f"Analysis: {result.explanation}")
```
```bash cURL theme={null}
curl -X POST "https://platform.composo.ai/api/v1/evals/reward" \
-H "API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{
"role": "user",
"content": "I'\''m really frustrated with my device not working."
},
{
"role": "assistant",
"content": "I'\''m sorry to hear that you'\''re experiencing issues with your device. Let'\''s see how I can assist you to resolve this problem."
}
],
"evaluation_criteria": "Reward responses that express appropriate empathy if the user is facing a problem they'\''re finding frustrating"
}'
```
### Understanding the Results
Composo returns:
* **Score**: A value between 0 and 1 (e.g. 0.86 means the response strongly meets your criteria)
* **Explanation**: Detailed analysis of why the response received this score
Example output:
```json JSON wrap theme={null}
Score: 0.86/1.0
Analysis: - The assistant directly acknowledges the user's difficulty and expresses sympathy ("I'm sorry to hear that you're experiencing issues"), showing clear empathy.
- The response is timely and supportive, immediately addressing the expressed frustration and not ignoring the emotional content.
- It constructively adds a collaborative next step ("Let's see how I can assist you"), enhancing the empathetic tone, with only minor room for deeper emotional mirroring.
```
## Step 4: Evaluate Agents with Tracing
For agent applications, Composo provides real-time tracing to capture and evaluate multi-agent interactions. Here's a simple example with an orchestrator coordinating two sub-agents:
```python Python wrap theme={null}
from composo import Composo
from composo.models import criteria
from composo.tracing import ComposoTracer, Instruments, AgentTracer, agent_tracer
from openai import OpenAI
# Initialize tracing for OpenAI
ComposoTracer.init(instruments=[Instruments.OPENAI])
composo_client = Composo(api_key="YOUR_API_KEY")
openai_client = OpenAI()
# Define a simple sub-agent
@agent_tracer(name="research_agent")
def research_agent(topic):
    return openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Research: {topic}"}],
        max_tokens=50
    )

# Orchestrator coordinates multiple agents
with AgentTracer("orchestrator") as tracer:
    # First sub-agent: planning
    with AgentTracer("planning_agent"):
        plan = openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": "Plan a trip to Paris"}],
            max_tokens=50
        )
    # Second sub-agent: research
    research = research_agent("Paris attractions")

# Evaluate the full agent trace
results = composo_client.evaluate_trace(tracer.trace, criteria=criteria.agent)
for result, criterion in zip(results, criteria.agent):
    print(f"Criterion: {criterion}")
    print(f"Evaluation Result: {result}\n")
```
This example shows how Composo traces each agent's LLM calls independently and evaluates them against our comprehensive agent framework.
# Criteria Library
Source: https://docs.composo.ai/documentation/guides/criteria-library
Here's a range of criteria we've seen work well, to draw on when writing your own
## Core frameworks (start here)
### RAG framework
* **Context Faithfulness**: Reward responses that make only claims directly supported by the provided source material without any hallucination or speculation
* **Completeness**: Reward responses that comprehensively include all relevant information from the source material needed to fully answer the question
* **Context Precision**: Reward responses that include only information necessary to answer the question without extraneous details from the source material
* **Relevance**: Reward responses where all content directly addresses and is relevant to answering the user's specific question
### Agents framework
* **Exploration**: Reward agents that plan effectively: exploring new information and capabilities, and investigating unknowns despite uncertainty
* **Exploitation**: Reward agents that plan effectively: exploiting existing knowledge and available context to create reliable plans with predictable outcomes
* **Tool use**: Reward agents that operate tools correctly in accordance with the tool definition, using all relevant context available in tool calls
* **Goal pursuit**: Reward agents that work towards the goal specified by the user
* **Agent Faithfulness**: Reward agents that only make claims that are directly supported by given source material or returns from tool calls without any hallucination or speculation
## Advanced metrics (use these next)
### Agents
* **Agent Sequencing:** Reward agents that follow logical sequences, such as gathering required information from the user before attempting specific lookups
* **Agent Efficiency:** Reward agents that are efficient when working towards their goal
* **Agent Thoroughness:** Reward agents that are fully comprehensive and thorough when working towards their goal
### Individual tool call focused (use these when you want to pinpoint specific tool call steps)
* **Tool Call Formulation:** Reward tool calls that formulate arguments using only information provided by the user or previous tool call returns without fabricating parameters.
* **Tool Relevance:** Reward tool calls that perform actions or retrieve information directly relevant to the goal.
* **Response completeness from tool return:** Reward responses that incorporate all relevant information from tool call returns needed to comprehensively answer the user's question.
* **Response precision from tool return:** Reward responses that include only the specific information from tool call returns that directly addresses the user's query
* **Response faithfulness to tool return**: Reward responses that only make claims that are directly supported by given source material or returns from tool calls without any hallucination or speculation
### Response quality
* **Conciseness:** Reward responses that are clear and direct, avoiding unnecessary verbosity, repetition, or extraneous details
* **Information Structure:** Reward responses that present information in a logical, well-organized format that prioritizes the most important details.
* **Professional Tone:** Reward responses that maintain appropriate professional language and tone suitable for the context.
* **Actionable Guidance:** Reward responses that provide practical next steps or actionable recommendations when appropriate.
### Accuracy and robustness
* **Source Attribution:** Reward responses that explicitly cite or reference specific source documents or sections used to support each claim.
* **Factual Accuracy:** Reward responses that accurately reflect factual information without introducing errors or fabricated details.
* **Uncertainty Handling:** Reward responses that appropriately acknowledge limitations when information is incomplete or unavailable, rather than making assumptions.
* **Appropriate Refusals:** Reward responses that appropriately refuse to answer when source material lacks sufficient information to address the question.
### Safety
* **Harmful Content Prevention:** Penalize responses that provide inappropriate advice (e.g., medical advice, harmful instructions) outside the system's intended scope.
* **System Compliance:** Penalize responses that violate explicit system constraints, limitations, or instructions.
## Extended library (for inspiration when writing your own)
* **Creativity:** Reward responses that demonstrate original thinking, novel approaches, or innovative solutions.
* **Empathy:** Reward responses that show understanding and connection with human emotions and experiences.
* **Humor:** Reward responses that appropriately use wit, clever wordplay, or situational comedy when suitable to context.
* **Surprise:** Reward responses that include unexpected but delightful elements or developments.
* **Happiness:** Reward responses that evoke positive emotions and create uplifting experiences.
* **Narrative Structure:** Reward responses that maintain logical progression and development.
* **Legal Authority:** Reward responses that prioritize the most authoritative legal sources (legislation, case law, preparatory works).
* **Jurisdictional Accuracy:** Reward responses that correctly identify jurisdictional context and cite the most recent legally binding sources.
* **Legal Terminology:** Reward responses that correctly interpret legal terminology, avoiding confusion with non-legal meanings.
* **Citation Recognition:** Reward responses that recognize and appropriately process standard legal citation formats.
* **Quantitative Accuracy:** Reward responses that accurately represent quantitative data without speculation beyond provided information.
* **Metric Context:** Reward responses that include appropriate context for metrics, comparisons, and calculations.
* **Risk Disclosure:** Reward responses that acknowledge limitations and uncertainties in quantitative analysis.
* **Regulatory Compliance:** Penalize responses that include financial recommendations without appropriate risk disclaimers.
* **Issue Resolution:** Reward responses that capture all significant elements: issue nature, agent actions, and resolutions offered.
* **Entity Accuracy:** Reward responses that correctly identify specific entities (payment methods, brands, etc.) only when explicitly mentioned.
* **Interaction Dynamics:** Reward responses that accurately represent both customer and agent perspectives.
* **Chronological Clarity:** Reward responses that present information in clear chronological sequence.
* **Query Translation:** Reward responses that accurately translate natural language intent with proper syntax.
* **Feature Accuracy:** Penalize responses that reference outdated, incorrect, or non-existent functionality.
* **Validation Implementation:** Penalize responses that fail to include critical validation rules when specified.
* **Cost Efficiency:** Reward responses that provide cost-effective technical solutions.
* **Medical Terminology:** Reward responses that use precise medical terminology appropriate for the audience (clinician vs patient).
* **Evidence-Based Content:** Reward responses that reference current clinical guidelines or peer-reviewed studies.
* **Harm Prevention:** Penalize responses that could delay necessary medical care through self-diagnosis suggestions.
* **Appropriate Referrals:** Reward responses that direct users to qualified healthcare professionals for medical decisions.
* **Learning Adaptation:** Reward responses that adapt explanation complexity to match the user's learning level.
* **Conceptual Building:** Reward responses that connect new concepts to familiar ideas.
* **Active Learning:** Reward responses that encourage critical thinking through questions when pedagogically appropriate.
* **Misconception Correction:** Reward responses that identify and gently correct common misconceptions.
* **Voice Consistency:** Reward responses that maintain consistent brand voice and personality.
* **Audience Targeting:** Reward responses that tailor language and complexity for the specified target audience.
* **Hook Effectiveness:** Reward responses with compelling openings appropriate to the platform.
* **SEO Optimization:** Reward responses that naturally incorporate relevant keywords without compromising readability.
* **Specification Accuracy:** Reward responses that accurately represent product details without fabrication.
* **Comparison Fairness:** Reward responses that provide balanced product comparisons with strengths and limitations.
* **Decision Support:** Reward responses that help users make informed decisions by addressing common concerns.
* **Policy Clarity:** Reward responses that clearly communicate relevant policies when applicable.
* **Scholarly Rigor:** Reward responses that properly cite primary sources and acknowledge research limitations.
* **Literature Synthesis:** Reward responses that effectively synthesize multiple sources while maintaining distinct attribution.
* **Academic Integrity:** Reward responses that encourage original thinking and proper attribution.
* **Disciplinary Conventions:** Reward responses that follow discipline-specific writing and citation styles.
* **Context Retention:** Reward responses that appropriately reference and build upon previous conversation turns.
* **Intent Recognition:** Reward responses that correctly identify user intent even when expressed ambiguously.
* **Emotional Intelligence:** Reward responses that appropriately recognize and respond to user emotional states.
* **Boundary Awareness:** Reward responses that maintain professional boundaries while being helpful.
* **Cultural Adaptation:** Reward responses that appropriately adapt content for cultural context beyond literal translation.
* **Idiomatic Accuracy:** Reward responses that correctly handle idioms and culture-specific references.
* **Terminology Consistency:** Reward responses that maintain consistent technical terminology throughout translations.
* **Contextual Disambiguation:** Reward responses that correctly resolve ambiguous terms based on domain context.
# How to write effective criteria
Source: https://docs.composo.ai/documentation/guides/criteria-writing
When crafting your evaluation criteria, consider the following guidelines to ensure effective and meaningful assessments:
**Be Specific and Focused**: Clearly define the quality or behavior you want to evaluate. Avoid vague statements. Focus on a single aspect per criterion to maintain clarity.
* *Example*: Instead of "good," use "a friendly and encouraging tone."
**Use Clear Direction**: Begin your criteria with an explicit directive to indicate both the evaluation type and scoring method:
* For continuous scoring (0-1): `"Reward..."` or `"Penalize..."`
* For binary scoring (0 or 1): `"Passes if..."` or `"Fails if..."`
All criteria types use the same `/reward` endpoint - the prefix determines whether you get continuous or binary scores.
* *Example*: `"Reward responses that use empathetic language when addressing user concerns."` (continuous)
* *Example*: `"Fails if the response provides medical advice."` (binary)
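The prefix rule above can be mirrored client-side, for example to decide how to display or threshold a score before a criterion is sent. A minimal sketch: `scoring_mode` is a hypothetical helper that only inspects the leading words; it does not reproduce how the API itself parses criteria.

```python
def scoring_mode(criterion: str) -> str:
    """Infer the expected score type from a criterion's leading directive."""
    text = criterion.strip().lower()
    if text.startswith(("reward", "penalize")):
        return "continuous"  # score anywhere between 0 and 1
    if text.startswith(("passes if", "fails if")):
        return "binary"      # score is 0 or 1
    raise ValueError("criterion should start with Reward, Penalize, Passes if, or Fails if")


print(scoring_mode("Reward responses that use empathetic language when addressing user concerns."))
print(scoring_mode("Fails if the response provides medical advice."))
```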
**Monotonic or Appropriately Qualified Qualities**: Ideally, the quality you're assessing should be monotonic (more is always better for rewards, worse for penalties). For non-monotonic qualities where balance matters, use qualifiers like "appropriate" to ensure higher scores represent better adherence.
* *Example*: Instead of `"Reward responses that are polite"` which can become excessive, use `"Reward responses that use an appropriate level of politeness"` ensuring the response is polite but not overly so.
**Avoid Conjunctions**: Focus on one quality at a time. Using "and" often indicates multiple qualities, which can lead to unclear scoring when only one quality is present.
* *Example*: Instead of `"The assistant should be concise and informative"` split into two separate criteria.
**Avoid LLM Keywords**: Composo's reward model is fine-tuned from LLMs trained in conversation format. Avoid alternate definitions of 'User' and 'Assistant' that might conflict with the LLM keywords 'user' and 'assistant'.
* *Example*: Instead of `"Reward responses that comprehensively address the User Question"`, rename the 'User Question' in your prompt and use `"Reward responses that comprehensively address the Target Question"`
**Leverage Domain Expertise**: Your domain knowledge is your secret weapon. Inject your understanding of what constitutes a 'good' answer in your specific field—this gives your evaluation model leverage over the generative model.
* *Example*: For medical contexts: `"Reward responses that distinguish between emergency symptoms requiring immediate care versus symptoms suitable for routine appointments"`
**Use Qualifiers When Needed**: Include a qualifier starting with "if" to specify when the criterion should apply. This helps handle conditional requirements.
* *Example*: `"Reward responses that provide code examples if the user asks for implementation details"`
**Keep Criteria Concise**: Aim for one clear sentence per criterion. If you need multiple sentences to explain, consider splitting into separate criteria.
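The guidelines above can be turned into a quick pre-flight check. The following is a minimal sketch, not part of the Composo SDK: `lint_criterion` and its heuristics are hypothetical, and the warnings are hints rather than rules (some well-rated criteria legitimately contain "and").

```python
import re

VALID_PREFIXES = ("Reward", "Penalize", "Passes if", "Fails if")


def lint_criterion(criterion: str) -> list[str]:
    """Return heuristic warnings for a criterion string (empty list if none)."""
    warnings = []
    text = criterion.strip()
    # Use Clear Direction: the prefix selects continuous or binary scoring.
    if not text.startswith(VALID_PREFIXES):
        warnings.append("should start with Reward, Penalize, Passes if, or Fails if")
    # Avoid Conjunctions: "and" often signals two qualities in one criterion.
    if re.search(r"\band\b", text, flags=re.IGNORECASE):
        warnings.append("contains 'and'; consider splitting into separate criteria")
    # Avoid LLM Keywords: capitalised 'User'/'Assistant' may redefine role names.
    if re.search(r"\b(User|Assistant)\b", text):
        warnings.append("redefines 'User' or 'Assistant'; rename the term")
    # Keep Criteria Concise: more than one sentence suggests splitting.
    if len(re.findall(r"[.!?](?:\s|$)", text)) > 1:
        warnings.append("spans multiple sentences; aim for one")
    # Use Qualifiers When Needed: qualifiers should start with 'if'.
    if re.search(r"\bwhen\b", text, flags=re.IGNORECASE):
        warnings.append("uses 'when'; prefer an 'if' qualifier")
    return warnings


if __name__ == "__main__":
    for c in [
        "Reward responses that use an appropriate level of politeness",
        "The assistant should be concise and informative",
    ]:
        print(c, "->", lint_criterion(c))
```

A check like this is best run when criteria are authored, so that borderline cases get a human review rather than an automatic rejection.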
#### Reward responses that provide correct information based solely on the provided context without fabricating details.
OK. Clarification about 'correct' would be useful—does it have to be factually correct, or only in agreement with the provided context?
#### Reward responses that directly address the 'User Question' without including irrelevant information.
Poor. Clauses that change the definition of user and assistant from the LLM definition risk confusion.
#### Reward responses that properly cite the specific source of information from the provided context.
Good. 'Properly' is slightly ambiguous and rolls in both concepts of citation style and accuracy.
#### Reward responses that appropriately acknowledge limitations if information is incomplete or unavailable rather than guessing.
Good. Could be improved by clarifying what the agent might be guessing at.
#### Reward responses that comprehensively address all aspects of the 'User Question' if information is available in the context.
Poor. Clauses that change the definition of user and assistant from the LLM definition risk confusion.
#### Reward responses that present technical information in a logical, well-organized format that prioritizes the most important details.
Excellent. It's clear what format we're looking for and what kind of information that applies to.
#### Reward responses that provide practical next steps or recommendations if appropriate and supported by the context.
OK. Somewhat ambiguous about what should be supported by the context—is it the next steps or the relevance of the question?
#### Reward responses that strictly include only information explicitly stated in the support ticket, without adding any fabricated details or assumptions.
Excellent. It's clear what the expected input is and what the model should be doing.
#### Reward responses that correctly identify and include specific entities (payment methods, product categories, brands, couriers) only when explicitly mentioned in the ticket, avoiding hallucinations of these elements.
Excellent. It's clear that we're trying to avoid fabricating names of specific entities and the examples make it even clearer.
#### Reward responses that include all significant elements of the support ticket, including the nature of the issue, agent actions, and resolutions offered, without omitting key details.
Excellent. It's clear that we're looking for good coverage of the important elements in the response.
#### Reward responses that present the information in a clear chronological sequence that accurately reflects the flow of the support interaction.
Excellent. A clear requirement for chronological presentation of the information in the support interaction.
#### Penalize responses that include unnecessary concluding statements, evaluative summaries, or editorial comments not derived from the ticket content.
Excellent. It's clear that we're trying to avoid verbose summary content that isn't clearly derived from the provided ticket.
#### Reward responses that demonstrate empathy while acknowledging the friend's feelings of defeat without minimizing them.
OK. This contains two separate qualities which could lead to unclear scoring when the response demonstrates one but not the other. Consider splitting into two criteria or using 'and' to make both required.
#### Reward responses that explain ethical concerns when declining harmful requests rather than simply refusing without context
OK. The model is specifically trained to recognize 'if' statements, so we'd recommend changing 'when' to 'if'.
#### Reward responses that maintain an appropriate educational tone suitable for academic assessment contexts
Excellent. A clear requirement for a tone with additional helpful context about why it's needed.
## Recommended Template for Crafting Criteria
```
[Prefix] [quality] [qualifier (optional)].
```
**Components**:
* **Prefix**:
  * **For Continuous Scoring (0-1)**: "Reward...", "Penalize..."
  * **For Binary Scoring (0 or 1)**: "Passes if...", "Fails if..."
  Note: All criteria use the `/reward` endpoint. The prefix determines the scoring method.
* **Quality**: The specific property or behavior to evaluate.
* **Qualifier (Optional)**: An "if" statement specifying conditions.
**Example Criteria**:
* `"Reward responses that provide a comprehensive analysis of the code snippet"` (continuous)
* `"Penalize responses where the language is overly technical if the response is for a beginner"` (continuous)
* `"Reward responses that use an appropriate level of politeness"` (continuous)
* `"Passes if all required parameters are provided without fabrication"` (binary)
* `"Fails if the response provides medical advice"` (binary)
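The template can also be assembled programmatically, which helps keep prefixes and qualifiers consistent across a large criteria set. This is an illustrative sketch only: `build_criterion` and `scoring_type` are hypothetical helpers, not Composo functions.

```python
from typing import Optional

CONTINUOUS_PREFIXES = ("Reward", "Penalize")
BINARY_PREFIXES = ("Passes if", "Fails if")


def build_criterion(prefix: str, quality: str, qualifier: Optional[str] = None) -> str:
    """Assemble '[Prefix] [quality] [qualifier (optional)]' into one string."""
    if prefix not in CONTINUOUS_PREFIXES + BINARY_PREFIXES:
        raise ValueError(f"Unknown prefix: {prefix!r}")
    parts = [prefix, quality.strip()]
    if qualifier:
        qualifier = qualifier.strip()
        # The qualifier is an "if" statement specifying when the criterion applies.
        if not qualifier.lower().startswith("if "):
            raise ValueError("Qualifier should start with 'if'")
        parts.append(qualifier)
    return " ".join(parts)


def scoring_type(criterion: str) -> str:
    """Report which scoring method the prefix selects."""
    if criterion.startswith(BINARY_PREFIXES):
        return "binary"
    if criterion.startswith(CONTINUOUS_PREFIXES):
        return "continuous"
    raise ValueError("Criterion does not start with a recognised prefix")


if __name__ == "__main__":
    criterion = build_criterion(
        "Penalize",
        "responses where the language is overly technical",
        "if the response is for a beginner",
    )
    print(criterion, "->", scoring_type(criterion))
```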
# Ground Truth Evaluation
Source: https://docs.composo.ai/documentation/guides/ground-truths
Leverage your labeled data to create precise evaluation metrics
## What is Ground Truth Evaluation?
Ground truth evaluation allows you to measure how well your LLM outputs align with known correct answers. By dynamically inserting your validated labels into Composo's evaluation criteria, you can create precise, case-specific evaluations.
## When to Use Ground Truth
We typically recommend using evaluation criteria / guidelines such as those in our RAG framework rather than rigid ground truths, since they're more flexible and don't require labeled data. However, ground truth evaluation works well when:
* You have an exact answer you need to match (calculations, specific classifications)
* You have existing labeled data from historical reviews
* You need to benchmark different models on the same validation set
* Compliance requires testing against specific approved responses
## How It Works
The key is dynamically inserting your ground truth labels directly into the evaluation criteria:
```python Python wrap theme={null}
from composo import Composo
composo_client = Composo(api_key="YOUR_API_KEY")
# Your ground truth answer from the dataset
ground_truth = "The capital of France is Paris, a city known for the Eiffel Tower, the Louvre Museum, and its historic architecture along the Seine River."
# Evaluate if the LLM's response matches the ground truth
result = composo_client.evaluate(
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France and what is it known for?"
        },
        {
            "role": "assistant",
            "content": "The capital of France is Paris. It's famous for iconic landmarks like the Eiffel Tower, world-class museums including the Louvre, and beautiful architecture along the Seine River."
        }
    ],
    criteria=f"Reward responses that closely match this expected answer: {ground_truth}"
)
print(f"Alignment Score: {result.score}")
print(f"Explanation: {result.explanation}\n")
```
## Common Use Cases
### Classification Tasks
```python Python wrap theme={null}
# Multi-class classification
ground_truth_category = "Technical Support"
criteria = f"Reward responses that correctly classify this inquiry as: {ground_truth_category}"
```
### Extraction Tasks
```python Python wrap theme={null}
# Entity extraction validation
ground_truth_entities = "Company: Acme Corp, Amount: $50,000, Date: March 2024"
criteria = f"Reward responses that extract all of these entities: {ground_truth_entities}"
```
### Decision Validation
```python Python wrap theme={null}
# Validating specific decisions
ground_truth_decision = "Escalate to Level 2 Support"
criteria = f"Reward responses that make this decision: {ground_truth_decision}"
```
### Numerical Validation
```python Python wrap theme={null}
# Calculation or counting tasks
ground_truth_answer = "Total: $1,247.50"
criteria = f"Reward responses that arrive at the correct answer: {ground_truth_answer}"
```
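To apply these patterns across a whole labeled dataset, one option is to loop over the rows and build the criterion for each one. The sketch below assumes hypothetical row keys (`question`, `response`, `expected`) and takes the evaluation function as a parameter, so in practice you would pass `composo_client.evaluate`; here a stub stands in for it.

```python
from types import SimpleNamespace
from typing import Callable, Iterable


def ground_truth_criteria(expected: str) -> str:
    """Insert the labeled answer into the evaluation criterion."""
    return f"Reward responses that closely match this expected answer: {expected}"


def evaluate_dataset(rows: Iterable[dict], evaluate: Callable[..., object]) -> list[dict]:
    """Score every row against its own ground truth label."""
    results = []
    for row in rows:
        messages = [
            {"role": "user", "content": row["question"]},
            {"role": "assistant", "content": row["response"]},
        ]
        result = evaluate(messages=messages, criteria=ground_truth_criteria(row["expected"]))
        results.append({"question": row["question"], "score": result.score})
    return results


if __name__ == "__main__":
    # Stub evaluator for illustration; replace with composo_client.evaluate.
    def stub_evaluate(messages, criteria):
        return SimpleNamespace(score=0.5, explanation="stub")

    rows = [{"question": "2 + 2?", "response": "4", "expected": "4"}]
    print(evaluate_dataset(rows, stub_evaluate))
```

Passing the evaluator in as an argument keeps the loop testable without network access or an API key.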
## Setting Thresholds
Different use cases require different accuracy thresholds:
* **High-stakes decisions** (medical, financial): Consider scores ≥ 0.9 as passing
* **General classification**: Scores ≥ 0.8 typically indicate good alignment
* **Exploratory analysis**: Scores ≥ 0.7 may be acceptable initially
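These thresholds can be encoded once and reused wherever scores are checked. A minimal sketch, assuming hypothetical use-case names and treating a missing score as a failure:

```python
from typing import Optional

# Thresholds taken from the list above; the keys are illustrative names.
THRESHOLDS = {
    "high_stakes": 0.9,
    "classification": 0.8,
    "exploratory": 0.7,
}


def passes(score: Optional[float], use_case: str) -> bool:
    """Return True if the score meets the threshold for the given use case."""
    if score is None:
        # No score available (e.g. criterion not applicable): do not pass.
        return False
    return score >= THRESHOLDS[use_case]


if __name__ == "__main__":
    for use_case in THRESHOLDS:
        print(use_case, passes(0.85, use_case))
```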
## Next Steps
* If you have labeled data ready, try the patterns above
* For more flexible evaluation without needing labels, explore [custom criteria](/pages/guides/criteria-writing)
* See our [criteria library](/pages/guides/criteria-library) for evaluation inspiration
# Langfuse
Source: https://docs.composo.ai/documentation/monitoring/langfuse
How to use Composo in combination with Langfuse
This guide shows how to integrate Composo's deterministic evaluation with Langfuse's observability platform to evaluate your LLM applications with confidence.
## Overview
**Langfuse** provides comprehensive observability for LLM applications with tracing, debugging, and dataset management capabilities. **Composo** delivers deterministic, accurate evaluation through purpose-built generative reward models that achieve 92% accuracy (vs 72% for LLM-as-judge).
Together, they enable you to:
* ✅ Track every LLM interaction through Langfuse's tracing
* ✅ Add deterministic evaluation scores to your traces
* ✅ Evaluate datasets programmatically with reliable metrics
* ✅ Ship AI features with confidence using quantitative, trustworthy metrics
## Prerequisites
```bash theme={null}
pip install langfuse composo
```
```python Python wrap theme={null}
import os
from langfuse import Langfuse, get_client
from composo import Composo, AsyncComposo
# Set your API keys
os.environ["LANGFUSE_PUBLIC_KEY"] = "your-public-key"
os.environ["LANGFUSE_SECRET_KEY"] = "your-secret-key"
os.environ["COMPOSO_API_KEY"] = "your-composo-key"
# Initialize clients
langfuse = get_client()
composo_client = Composo()
async_composo = AsyncComposo()
```
## How Langfuse & Composo work in combination
## Method 1: Real-time Trace Evaluation
Evaluate LLM outputs as they're generated in production or development. This approach uses the `@observe` decorator to automatically trace your LLM calls, then evaluates them with Composo asynchronously.
For more detail on how the Langfuse `@observe` decorator works, see the [Langfuse SDK documentation](https://langfuse.com/docs/observability/sdk/python/sdk-v3#basic-tracing).
### When to use
* Production monitoring with real-time quality scores
* Development iteration with immediate feedback
### Implementation
```python Python wrap theme={null}
import asyncio
from langfuse import get_client, observe
from anthropic import Anthropic
from composo import AsyncComposo

# Initialize async Composo client
async_composo = AsyncComposo()


@observe()
async def llm_call(input_data: str) -> str:
    # LLM call traced by the @observe decorator, followed by a Composo evaluation
    model_name = "claude-sonnet-4-20250514"
    anthropic = Anthropic()
    resp = anthropic.messages.create(
        model=model_name,
        max_tokens=100,
        messages=[{"role": "user", "content": input_data}],
    )
    output = resp.content[0].text.strip()
    # Get trace ID for scoring
    trace_id = get_client().get_current_trace_id()
    evaluation_criteria = "Reward responses that are helpful"
    # Evaluate with Composo. Awaiting here delays the return until scoring completes;
    # for a non-blocking call, schedule it instead (e.g. asyncio.create_task,
    # a task queue, or background tasks).
    await evaluate_with_composo(trace_id, input_data, output, evaluation_criteria)
    return output


async def evaluate_with_composo(trace_id, input_data, output, evaluation_criteria):
    # Evaluate LLM output with Composo and score in Langfuse
    # Composo expects a list of chat messages
    messages = [
        {"role": "user", "content": input_data},
        {"role": "assistant", "content": output},
    ]
    eval_resp = await async_composo.evaluate(
        messages=messages,
        criteria=evaluation_criteria
    )
    # Score the trace in Langfuse
    langfuse = get_client()
    langfuse.create_score(
        trace_id=trace_id,
        name=evaluation_criteria,
        value=eval_resp.score,
        comment=eval_resp.explanation,
    )
```
Then in your main application:
```python Python wrap theme={null}
# Call from within an async function (or wrap with asyncio.run) - Langfuse logs the trace and Composo scores it
await llm_call(input_data)
```
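If the evaluation should not delay the response, one common approach is to schedule it as a background task with `asyncio.create_task` and keep a reference until it finishes. The sketch below illustrates only that scheduling pattern: `schedule_evaluation` is a hypothetical helper, and `fake_evaluate` is a stand-in for `evaluate_with_composo` that sleeps instead of calling Composo or Langfuse.

```python
import asyncio

# Strong references to in-flight tasks so they are not garbage-collected.
background_tasks: set[asyncio.Task] = set()


def schedule_evaluation(coro) -> asyncio.Task:
    """Run an evaluation coroutine in the background without awaiting it."""
    task = asyncio.create_task(coro)
    background_tasks.add(task)
    # Drop the reference automatically once the task completes.
    task.add_done_callback(background_tasks.discard)
    return task


async def fake_evaluate(trace_id: str, scores: dict) -> None:
    # Stand-in for evaluate_with_composo; records a fixed score after a short delay.
    await asyncio.sleep(0.01)
    scores[trace_id] = 0.9


async def main() -> dict:
    scores: dict = {}
    schedule_evaluation(fake_evaluate("trace-1", scores))
    # The caller continues immediately; the score has not been recorded yet.
    assert scores == {}
    # Before shutdown, wait for outstanding evaluations to finish.
    await asyncio.gather(*background_tasks)
    return scores


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Exceptions raised inside a background task only surface when the task is awaited or gathered, so production code would typically log failures in the done callback as well.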
## Method 2: Dataset Evaluation
Use this method to evaluate your LLM application on a dataset that already exists in Langfuse. The `item.run()` context manager automatically links execution traces to dataset items.
For more detail on how dataset runs work in Langfuse, see the [Langfuse remote dataset runs documentation](https://langfuse.com/docs/evaluation/dataset-runs/remote-run).
### When to use
* Testing prompt or model changes on existing Langfuse datasets
* Running experiments that you want to track in Langfuse UI
* Creating new dataset runs for comparison
* Regression testing with immediate Langfuse visibility
### Implementation
```python Python wrap theme={null}
from langfuse import get_client
from anthropic import Anthropic
from composo import Composo

# Initialize Composo client
composo = Composo()


def llm_call(question: str, item_id: str, run_name: str):
    # Encapsulates the LLM call and appends input/output data to the trace
    model_name = "claude-sonnet-4-20250514"
    with get_client().start_as_current_generation(
        name=run_name,
        input={"question": question},
        metadata={"item_id": item_id},
        model=model_name,
    ) as generation:
        anthropic = Anthropic()
        resp = anthropic.messages.create(
            model=model_name,
            max_tokens=100,
            messages=[{"role": "user", "content": f"Question: {question}"}],
        )
        answer = resp.content[0].text.strip()
        generation.update_trace(
            input={"question": question},
            output={"answer": answer},
        )
        return answer


def run_dataset_evaluation(dataset_name: str, run_name: str, evaluation_criteria: str):
    # Run evaluation on a Langfuse dataset using Composo
    langfuse = get_client()
    dataset = langfuse.get_dataset(name=dataset_name)
    for item in dataset.items:
        print(f"Running evaluation for item: {item.id}")
        # item.run() automatically links the trace to the dataset item
        with item.run(run_name=run_name) as root_span:
            # Generate answer
            generated_answer = llm_call(
                question=item.input,
                item_id=item.id,
                run_name=run_name,
            )
            print(f"Item {item.id} processed. Trace ID: {root_span.trace_id}")
            # Evaluate with Composo
            messages = [
                {"role": "user", "content": f"Question: {item.input}"},
                {"role": "assistant", "content": generated_answer},
            ]
            eval_resp = composo.evaluate(
                messages=messages,
                criteria=evaluation_criteria
            )
            # Score the trace
            root_span.score_trace(
                name=evaluation_criteria,
                value=eval_resp.score,
                comment=eval_resp.explanation,
            )
    # Ensure all data is sent to Langfuse
    langfuse.flush()


# Example usage
if __name__ == "__main__":
    run_dataset_evaluation(
        dataset_name="your-dataset-name",
        run_name="evaluation-run-1",
        evaluation_criteria="Reward responses that are accurate and helpful"
    )
```
## Method 3: Evaluating New Datasets
Use this method to evaluate datasets that don't yet exist in Langfuse. You can create your own dataset locally, evaluate it with Composo, and log both the traces and evaluation scores to Langfuse so you can inspect them in the Langfuse UI.
### When to use
* Evaluating new datasets before uploading to Langfuse
* Quick experimentation with custom datasets
* Batch evaluation of local test cases
* Creating baseline evaluations for new use cases
### Implementation
See [this notebook](https://colab.research.google.com/drive/1ZBIueZy2Ca6z0ll_8jjSq7GgLad_mXMP?usp=sharing) for a worked implementation.
## Method Selection Recap
* Use Method 1 for real-time production monitoring
* Use Method 2 for evaluating existing Langfuse datasets
* Use Method 3 for evaluating new datasets that don't yet exist in Langfuse
## Resources
* 📊 [Langfuse Dataset Runs Documentation](https://langfuse.com/docs/evaluation/dataset-runs/remote-run) - applicable for method 2
* 🎯 [Composo Documentation](https://docs.composo.ai/)
* 💬 [Get Support](mailto:support@composo.ai)
## Next Steps
1. **Start with Method 1** for immediate feedback during development
2. **Use Method 2** to run experiments on datasets in Langfuse
3. **Apply Method 3** to evaluate new datasets before uploading to Langfuse
Ready to get started? [Sign up for Composo](https://platform.composo.ai/) to get your API key and begin evaluating with confidence.
# MCP Server
Source: https://docs.composo.ai/documentation/monitoring/mcp-server
Read your evaluation data from any MCP-capable LLM client
**Beta.** The MCP server is a new surface; tool descriptions and behaviour may evolve as we learn how customers use it. If something isn't working the way you'd expect, please tell us.
# Introduction
Ask your AI assistant questions about your Composo evaluation data in plain English. Connect Claude Desktop, Claude Code, or any [MCP](https://modelcontextprotocol.io/)-capable client to your account, and the model can pull criteria, tags, bucketed aggregates, and individual traces to answer them — no SQL, no dashboards, no per-question REST calls.
## When to Use It
* **Ad-hoc analysis with an LLM**: ask *"what's the average helpfulness score by agent over the last week?"* or *"did anything regress in the last month?"* and let the model call the right tools, instead of clicking through dashboards or writing a query.
* **Trace debugging**: surface low-scoring or filter-narrowed examples directly into an LLM session for inspection.
* **Embedding evaluation insight into your own app**: any tool catalogue your agent already exposes via MCP can include these read-side surfaces.
# Connect
Pick the client you're using. The same API key works across all of them; generate one from the [API Keys settings page](https://platform.composo.ai/). Either an `API-Key: your-composo-api-key` header or `Authorization: Bearer your-composo-api-key` is accepted.
## Claude Desktop
Add an entry under `mcpServers` in your Claude Desktop config (Settings → Developer → Edit Config):
```json theme={null}
{
  "mcpServers": {
    "composo": {
      "transport": "http",
      "url": "https://platform.composo.ai/mcp",
      "headers": {
        "API-Key": "your-composo-api-key"
      }
    }
  }
}
```
Restart Claude Desktop. The Composo tools appear in the tool picker; ask Claude a question about your evaluations and it will call them.
## `mcp` CLI
```bash theme={null}
mcp connect https://platform.composo.ai/mcp \
--header "API-Key: your-composo-api-key"
```
## Claude Code
```bash theme={null}
claude mcp add --transport http composo https://platform.composo.ai/mcp \
--header "API-Key: your-composo-api-key"
```
New `claude` sessions will load the server and surface its tools in-session.
## Generic MCP client
Any MCP client that supports the Streamable HTTP transport can connect — point it at `https://platform.composo.ai/mcp` and send `API-Key` (or `Authorization: Bearer`) on each request.
```python theme={null}
import asyncio

from mcp.client.streamable_http import streamablehttp_client
from mcp.client.session import ClientSession


async def main():
    # "async with" must run inside a coroutine, so wrap the session in main()
    async with streamablehttp_client(
        "https://platform.composo.ai/mcp",
        headers={"API-Key": "your-composo-api-key"},
    ) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool("list_criteria", {})
            print(result)


asyncio.run(main())
```
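Rather than hard-coding the key, the headers can be built from the environment. The sketch below is illustrative: `composo_mcp_headers` is a hypothetical helper, and it assumes the key lives in the `COMPOSO_API_KEY` environment variable. It returns either of the two accepted header forms.

```python
import os
from typing import Mapping


def composo_mcp_headers(
    env: Mapping[str, str] = os.environ, use_bearer: bool = False
) -> dict[str, str]:
    """Build the auth header for the MCP endpoint from an environment mapping."""
    api_key = env.get("COMPOSO_API_KEY", "")
    if not api_key:
        raise RuntimeError("COMPOSO_API_KEY is not set")
    if use_bearer:
        # Alternative form: standard bearer token header.
        return {"Authorization": f"Bearer {api_key}"}
    return {"API-Key": api_key}


if __name__ == "__main__":
    # Demo with an explicit mapping instead of the real environment.
    print(composo_mcp_headers({"COMPOSO_API_KEY": "your-composo-api-key"}))
```

The resulting dictionary can be passed as the `headers` argument when opening the Streamable HTTP connection.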
# What's Available
Once connected, your client sees six tools covering discovery, aggregation, and individual-trace browsing — no setup, just ask.
* `list_criteria` — discover the evaluation criteria seen in your domain.
* `list_tag_keys` / `list_tag_values` — discover the tag keys (e.g. `agent`, `environment`) and the distinct values used.
* `get_insights` — bucketed aggregates (avg, count, stddev, min, max) per criterion over an arbitrary filter set.
* `get_grouped_insights` — same, broken down by a tag value (by agent, by customer, etc.).
* `list_traces` — page through individual traces with their full content and nested evaluations, optionally filtered by date, tag, criterion, or score.
All the read tools accept filters: narrow by date range, score range, criterion, or tag. For population-level questions reach for `get_insights` or `get_grouped_insights`; for individual examples use `list_traces` (paged 10 at a time).
# FAQ
**Can other Composo customers see my data?** No. Every query is scoped to the account your API key belongs to; no tool takes a domain or customer argument. A key for one account cannot read another account's data.
**Can I write evaluations or annotations through these tools?** No — they're read-only. Use the [REST API](/api-reference) or the [Python SDK](/python-sdk-reference) to submit evaluations, attach comments, or set ratings.
# Composo x Metabase
Source: https://docs.composo.ai/documentation/monitoring/metabase
Explore and Visualize Composo Evaluations
## Introduction
Composo provides a hosted Metabase instance where you can explore and visualize your LLM evaluation data. Query your historical evaluation runs, track quality metrics over time, and build dashboards to monitor your AI applications in development and production.
**Getting Started**: Metabase access requires onboarding. Please email [support@composo.ai](mailto:support@composo.ai) or contact your Composo rep to get set up with your evaluation database.
***
## What is Metabase?
Metabase is an open-source business intelligence tool that lets you ask questions about your data and visualize the answers. No SQL required for basic queries, though it's available when you need it.
For comprehensive Metabase documentation, see: [Metabase Documentation](https://www.metabase.com/docs/latest/)
***
## Your Data in Composo
### Your Evaluation Database
Your evaluation data is organized in a dedicated database that you can explore and query. The database contains your complete evaluation history with detailed metrics and metadata for each run.
Key fields include:
* **Request ID**: Unique identifier for each evaluation request (UUID)
* **Agent Instance ID**: Identifier for the specific agent instance being evaluated (null for response/tool evaluations)
* **Eval Type**: Type of evaluation - `response` (LLM responses), `tool` (tool usage), `agent` (multi-agent traces), or `chatsession` (chat-based agent evaluations)
* **Score Type**: How the score should be interpreted - `reward` (continuous 0-1 score) or `binary` (pass/fail converted to 1.0/0.0)
* **Name**: Agent name for multi-agent evaluations (null for response/tool evaluations)
* **Criteria**: Full evaluation criteria text (starts with prefixes like "Reward responses", "Passes if", etc.)
* **Score**: Numerical result (0-1 scale, where higher is better; null if criteria not applicable)
* **Explanation**: Detailed reasoning and analysis behind the score
* **Subject**: JSON data containing what was evaluated:
  * For response/tool evaluations: `{messages, tools, system}` - the conversation and available tools
  * For agent evaluations: The specific agent instance interactions being evaluated
* **Email**: User who ran the evaluation
* **Model Class**: The evaluation model used (e.g., "align-lightning")
* **Created At**: Timestamp when the evaluation was performed
### Viewing Individual Evaluations
Click any row in your queries to see complete evaluation details including the full explanation, criteria, and subject data. This gives you full visibility into how each evaluation was scored and the reasoning behind it.
***
### Collections
* **Your personal collection**: Private workspace for your analyses
* **Team collections**: Shared dashboards and queries (e.g., "Acme Corp Collection")
Navigate collections from the sidebar or use the search bar to find existing queries.
## Creating Your First Query
### Basic Query: Finding Red Flags
Let's find low-scoring evaluations that need attention.
1. Click **+ New** → **Question**
2. Select your **Evaluations** table
3. Click **Filter** → **Score** → **Less than** → enter `0.5`
4. Click **Filter** again → **Created At** → select your time range
5. Click **Visualize**
You can adjust the time range using the dropdown menu to view Today, Previous 7 days, Previous 30 days, or custom ranges.
***
## Visualizing Your Data
### Choosing a Visualization
After running a query, Metabase automatically suggests visualizations. Common types for evaluation data:
* **Line charts**: Track score trends over time
* **Bar charts**: Compare different agents or evaluation types
* **Tables**: See detailed row-by-row data
* **Numbers**: Display single metrics like average score or red flag rate
Click the **Visualization** button to change chart types and customize appearance.
**[Metabase visualization guide](https://www.metabase.com/docs/latest/questions/sharing/visualizing-results)**
***
## Summarizing Data
### Aggregations and Grouping
Instead of viewing raw rows, you can summarize your data:
1. Click **Summarize**
2. Choose a metric: **Count of rows**, **Average of Score**, etc.
3. Add **Group by**: **Created At** (for time series) or **Name** (to compare agents)
**Common patterns:**
* **Average Score by Created At** → See quality trends over time
* **Count by Name** → Which agents are evaluated most frequently
* **Average Score by Agent Instance ID** → Compare agent performance
**[Metabase summarizing guide](https://www.metabase.com/docs/latest/questions/query-builder/introduction#summarizing-and-grouping-by)**
### Custom Expressions: Red Flag Rate
Create a custom metric to calculate the percentage of low-scoring evaluations:
1. Click **Summarize** → **Custom Expression**
2. Enter:
```
CountIf([Score] < 0.5) /
(CountIf([Score] >= 0.5) + CountIf([Score] < 0.5))
```
3. Name it "red\_flag\_rate"
4. Group by **Created At: Minute** (or Hour/Day)
This creates a time series showing what fraction of scored evaluations fall below 0.5 in each period.
**[Metabase expressions guide](https://www.metabase.com/docs/latest/questions/query-builder/expressions)**
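For reference, the same metric can be computed outside Metabase. This is a minimal sketch, assuming scores arrive as a list of floats or `None`; `red_flag_rate` is a hypothetical helper that counts every non-null score in the denominator and treats a score of exactly 0.5 as not flagged.

```python
from typing import Iterable, Optional


def red_flag_rate(scores: Iterable[Optional[float]], threshold: float = 0.5) -> Optional[float]:
    """Fraction of scored evaluations strictly below the threshold."""
    # Null scores (criterion not applicable) are excluded entirely.
    scored = [s for s in scores if s is not None]
    if not scored:
        return None
    flagged = sum(1 for s in scored if s < threshold)
    return flagged / len(scored)


if __name__ == "__main__":
    print(red_flag_rate([0.2, 0.5, 0.9, None]))
```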
***
## Building Dashboards
### Creating a Dashboard
Save your most important queries and combine them into dashboards:
1. After creating a query, click **Save** and give it a descriptive name
2. Click **+ New** → **Dashboard**
3. Name your dashboard (e.g., "Production Quality Monitor")
4. Click **Add a saved question** and select your queries
5. Resize and arrange charts as needed
### Dashboard Features
* **Tabs**: Organize related metrics (e.g., "Quality By Agent" vs "Red Flags")
* **Dashboard filters**: Add filters that apply to multiple charts simultaneously
* **Auto-refresh**: Set dashboards to update automatically every few minutes
* **Sharing**: Click the sharing icon to share with teammates or generate public links
**[Metabase dashboard guide](https://www.metabase.com/docs/latest/dashboards/start)**
**[Dashboard filters](https://www.metabase.com/docs/latest/dashboards/filters)**
***
## Advanced Filtering
Combine multiple filters to drill down into your data:
* **Score ranges**: Score is between 0.3 and 0.7
* **Text search**: Criteria contains "hallucination"
* **Multiple time ranges**: Created At is Previous 7 days AND Created At Hour of day is between 9 and 17
* **Specific agents**: Agent Instance ID is one of \[list of IDs]
Click the **+** next to existing filters to add more conditions.
**[Metabase filtering guide](https://www.metabase.com/docs/latest/questions/query-builder/introduction#filtering)**
***
## SQL Queries (Advanced)
For complex queries, use the native SQL editor:
1. Click **+ New** → **Question** → **Native query**
2. Write your SQL against the `evaluations` table
3. Use variables with `{{variable_name}}` to make queries reusable
Example:
```sql theme={null}
SELECT
  date_trunc('hour', created_at) AS hour,
  name,
  avg(score) AS avg_score,
  count(*) AS eval_count
FROM "external"."evaluations"
WHERE created_at > current_date - interval '7 days'
  AND score < 0.5
GROUP BY 1, 2
ORDER BY 1 DESC
```
**[Metabase SQL guide](https://www.metabase.com/docs/latest/questions/native-editor/writing-sql)**
## Getting Help
### Metabase Resources
* **[Documentation](https://www.metabase.com/docs/latest/)**
* **[Learning Center](https://www.metabase.com/learn/)**
* **[Video Tutorials](https://www.metabase.com/learn/videos)**
### Composo Support
* **Data questions**: Contact your Composo account team
* **Technical support**: [support@composo.ai](mailto:support@composo.ai)
* **Evaluation schema**: See the key fields listed under "Your Evaluation Database" above
***
# Tags
Source: https://docs.composo.ai/documentation/monitoring/tags
Tag and categorize your evaluations for better organization and filtering
# Introduction
Tags allow you to add custom metadata to your evaluations, making it easier to organize, filter, and analyze your evaluation data. Use tags to categorize evaluations by environment, version, feature flags, experiments, or any other dimension that helps you track your AI application's performance.
## Why Use Tags?
* **Organize evaluations**: Group by environment, version, or feature flags
* **Filter and query**: Find evaluations in Metabase or analytics tools
* **Track experiments**: Tag with experiment IDs or A/B test variants
* **Monitor deployments**: Tag with deployment versions or release numbers
# Tag Format and Constraints
Tags are key-value pairs with the following constraints:
* **Keys**: Must be strings, maximum 64 characters
* **Values**: Must be strings only, maximum 64 characters
* **No nested structures**: Tag values cannot be dictionaries, lists, tuples, or sets
* **No non-string values**: Tag values cannot be numbers, booleans, or other types
* **Dictionary format**: Tags must be provided as a Python dictionary
```python theme={null}
# ✅ Valid tags
tags = {
    "environment": "production",
    "version": "1.2.3",
    "experiment": "variant_a",
    "deployment_id": "abc123",
    "is_production": "true"  # Convert booleans to strings
}
# ❌ Invalid tags
tags = {
    "metadata": {"key": "value"},  # Error: No nested dicts
    "versions": [1, 2, 3],         # Error: No lists
    "production": False,           # Error: Values must be strings
    "count": 42,                   # Error: Values must be strings
    "a" * 65: "value"              # Error: Key too long (>64 chars)
}
```
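To catch problems before a request is made, the constraints above can be checked locally. The following is a sketch under stated assumptions: `validate_tags` and `stringify_tags` are hypothetical helpers that mirror the documented rules, not the SDK's own validation.

```python
MAX_TAG_LENGTH = 64


def validate_tags(tags: dict) -> None:
    """Raise ValueError if tags break the documented constraints."""
    if not isinstance(tags, dict):
        raise ValueError("tags must be a dictionary")
    for key, value in tags.items():
        if not isinstance(key, str):
            raise ValueError(f"Tag key {key!r} must be a string")
        if not isinstance(value, str):
            raise ValueError(f"Tag {key!r} has a non-string value: {value!r}")
        if len(key) > MAX_TAG_LENGTH:
            raise ValueError(f"Tag key {key[:10]!r}... exceeds {MAX_TAG_LENGTH} characters")
        if len(value) > MAX_TAG_LENGTH:
            raise ValueError(f"Tag {key!r} value exceeds {MAX_TAG_LENGTH} characters")


def stringify_tags(raw: dict) -> dict[str, str]:
    """Convert simple scalar values to strings, then validate the result."""
    out = {}
    for key, value in raw.items():
        if isinstance(value, bool):
            # Check bool before int: bool is a subclass of int in Python.
            out[key] = "true" if value else "false"
        elif isinstance(value, (str, int, float)):
            out[key] = str(value)
        else:
            raise ValueError(f"Tag {key!r} has unsupported type {type(value).__name__}")
    validate_tags(out)
    return out


if __name__ == "__main__":
    print(stringify_tags({"is_production": True, "count": 42, "version": "1.2.3"}))
```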
# Using Tags
Tags can be added to both `evaluate` and `evaluate_trace` calls in synchronous and asynchronous clients.
## Basic Usage
```python theme={null}
from composo import Composo, AsyncComposo
# Synchronous
composo_client = Composo(api_key="your-api-key")
result = composo_client.evaluate(
    messages=[{"role": "user", "content": "Hello"}],
    criteria="Reward helpful responses",
    tags={"environment": "production", "version": "1.0.0"}
)
# Asynchronous (run inside an async function)
async_client = AsyncComposo(api_key="your-api-key")
result = await async_client.evaluate(
    messages=[{"role": "user", "content": "Hello"}],
    criteria="Reward helpful responses",
    tags={"environment": "production", "version": "1.0.0"}
)
```
## Trace Evaluation
```python theme={null}
from composo import Composo
from composo.tracing import ComposoTracer, Instruments, AgentTracer
from openai import OpenAI
ComposoTracer.init(instruments=Instruments.OPENAI)
composo_client = Composo(api_key="your-api-key")
with AgentTracer("my_agent") as tracer:
    response = OpenAI().chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "What is Python?"}],
        max_tokens=100,
    )
trace = tracer.get_multi_agent_trace()
trace_evaluation = composo_client.evaluate_trace(
    trace=trace,
    criteria=["Reward agents that provide helpful advice"],
    tags={"environment": "production", "agent_version": "2.1.0"}
)
```
## Common Use Cases
### Environment and Version Tagging
```python theme={null}
import os
from composo import Composo
composo_client = Composo(api_key=os.getenv("COMPOSO_API_KEY"))
def evaluate_with_env_tags(messages, criteria):
    return composo_client.evaluate(
        messages=messages,
        criteria=criteria,
        tags={
            "environment": os.getenv("ENVIRONMENT", "development"),
            "version": os.getenv("APP_VERSION", "unknown"),
            "deployment": os.getenv("DEPLOYMENT_ID", "local")
        }
    )
```
### Experiment Tagging
```python theme={null}
def evaluate_experiment(messages, experiment_id, variant):
    return composo_client.evaluate(
        messages=messages,
        criteria="Reward helpful responses",
        tags={
            "experiment_id": experiment_id,
            "variant": variant,
            "type": "ab_test"
        }
    )
# Usage
control_result = evaluate_experiment(messages, "exp_001", "control")
treatment_result = evaluate_experiment(messages, "exp_001", "treatment")
```
# Querying Tags in Metabase
Tags are stored and indexed for efficient querying in Metabase.
## Basic Filtering
1. Click **+ New** → **Question**
2. Select your evaluations table
3. Click **Filter** → **Tags** → **Contains**
4. Enter tag key-value pair: `{"environment": "production"}`
5. Add multiple filters with **+** for AND logic
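The AND behaviour of stacked filters can also be reproduced offline on exported rows. This is a minimal sketch, assuming each exported row is a dict with a `tags` dict and a `score` field; the row layout and the `matches_all_tags` helper are hypothetical and not part of Composo or Metabase.

```python
def matches_all_tags(row_tags, required):
    """Return True when every required key/value pair is present (AND logic)."""
    return all(row_tags.get(key) == value for key, value in required.items())


# Hypothetical exported rows; the real export layout may differ.
rows = [
    {"score": 0.91, "tags": {"environment": "production", "version": "1.0.0"}},
    {"score": 0.72, "tags": {"environment": "staging", "version": "1.0.0"}},
    {"score": 0.88, "tags": {"environment": "production", "version": "1.1.0"}},
]

required = {"environment": "production", "version": "1.0.0"}
filtered = [row for row in rows if matches_all_tags(row["tags"], required)]
print(len(filtered))
```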
## Visualizations
To create visualizations grouped by tag values:
1. Create a query filtering by date range
2. Click **Summarize**
3. Choose your metric (e.g., **Average of Latency (ms)** or **Count of rows**)
4. Add **Group by** → **Tags** → select your tag key (e.g., `environment`)
5. Visualize as a **Bar chart** or **Line chart**
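The grouping performed by **Summarize** and **Group by** amounts to bucketing rows by one tag value and averaging a metric per bucket. A rough sketch of that idea, assuming hypothetical exported rows with a `tags` dict and a `latency_ms` field (both names are illustrative, not a documented export format):

```python
from collections import defaultdict
from statistics import mean


def average_by_tag(rows, tag_key, metric):
    """Average `metric` per distinct value of `tag_key`, skipping untagged rows."""
    groups = defaultdict(list)
    for row in rows:
        tag_value = row["tags"].get(tag_key)
        if tag_value is not None:
            groups[tag_value].append(row[metric])
    return {value: mean(samples) for value, samples in groups.items()}


# Hypothetical rows for illustration only.
rows = [
    {"latency_ms": 100, "tags": {"environment": "production"}},
    {"latency_ms": 300, "tags": {"environment": "production"}},
    {"latency_ms": 150, "tags": {"environment": "staging"}},
    {"latency_ms": 999, "tags": {}},
]
averages = average_by_tag(rows, "environment", "latency_ms")
print(averages)
```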
# Best Practices
## Consistent Naming
Use consistent tag names and values across your application:
```python theme={null}
# ✅ Good: Consistent naming
tags = {
    "environment": "production",  # Always "environment", not "env"
    "version": "1.0.0",           # Always semantic versioning
    "deployment_id": "abc123"     # Always "deployment_id"
}
# ❌ Bad: Inconsistent naming
tags = {
    "env": "prod",        # Sometimes "env", sometimes "environment"
    "version": "v1.2.3"   # Sometimes with "v" prefix
}
```
## Keep Tags Concise
* Keep keys and values to at most 64 characters
* Use concise but meaningful names
* Avoid excessive tags (3-5 tags per evaluation is usually sufficient)
```python theme={null}
# ✅ Good: Concise and focused
tags = {"environment": "production", "version": "1.0.0", "experiment": "variant_a"}
# ❌ Bad: Too verbose
tags = {
    "application_environment": "production_environment",
    "experiment_id": "prompt_optimization_experiment_variant_a_2024"
}
```
# Error Handling
Tags are validated automatically. Invalid tags will raise a `ValueError`:
```python theme={null}
try:
    result = composo_client.evaluate(
        messages=[{"role": "user", "content": "Hello"}],
        criteria="Reward helpful responses",
        tags={"production": False}  # Invalid: non-string value
    )
except ValueError as e:
    print(f"Tag validation error: {e}")
    # Output: Tag values must be strings.
# Other validation errors:
# - Tag values must not be mappings (no nested dicts allowed).
# - Tag values must not be collections (lists, tuples, or sets are not allowed).
```
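To surface tag problems before a request is made, a local pre-check can help. The sketch below takes the strictest reading of the rules in this guide (string keys, string values, a 64-character limit, no nested structures). `check_tags` and `MAX_TAG_LENGTH` are hypothetical names, and the messages are illustrative; the SDK's own validator may behave differently.

```python
MAX_TAG_LENGTH = 64  # assumed limit, taken from the tagging guidance in this guide


def check_tags(tags):
    """Return a list of problems found; an empty list means the tags look safe."""
    problems = []
    for key, value in tags.items():
        if not isinstance(key, str):
            problems.append(f"key {key!r} is not a string")
        elif len(key) > MAX_TAG_LENGTH:
            problems.append(f"key {key!r} is longer than {MAX_TAG_LENGTH} characters")
        if isinstance(value, (dict, list, tuple, set)):
            problems.append(f"value for {key!r} is a nested structure")
        elif not isinstance(value, str):
            problems.append(f"value for {key!r} is not a string")
        elif len(value) > MAX_TAG_LENGTH:
            problems.append(f"value for {key!r} is longer than {MAX_TAG_LENGTH} characters")
    return problems


print(check_tags({"environment": "production"}))
print(check_tags({"production": False, "meta": {"region": "eu"}}))
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a single pass, for example in a CI check.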
# Summary
Tags provide a powerful way to organize and filter your evaluations:
* ✅ Add tags to `evaluate` and `evaluate_trace` calls
* ✅ Use tags to categorize by environment, version, experiments, and more
* ✅ Filter and visualize tags in Metabase using the UI
* ✅ Follow best practices for consistent, meaningful tags
* ✅ Tags are validated automatically with clear error messages
Start tagging your evaluations today to gain better insights into your AI application's performance!
# Agent Tracing
Source: https://docs.composo.ai/documentation/monitoring/tracing
Trace the LLM calls made by your agent framework
# Introduction
Composo's tracing SDK lets you capture and evaluate LLM calls from your agent applications in real time. It currently supports DIY agents built on OpenAI, Anthropic, and Google GenAI, with support for LangChain/LangGraph and other SDKs to come.
## Why Tracing Matters
Many agent frameworks abstract away the underlying LLM calls, making it difficult to understand what's happening under the hood and evaluate performance effectively. Many evaluation platforms only let you send traces to a remote system and wait to view results later.
Composo gives you the best of both worlds: **trace and evaluate immediately**, or view your traces in our platform, your own observability tooling, spreadsheets, or CI/CD. By instrumenting your LLM calls and marking agent boundaries, you can evaluate performance in real time and act on the results right away, adjusting a response before your users see it.
## Key Features
* **Mark Agent Boundaries**: Use `AgentTracer` context manager or `@agent_tracer` decorator to define which LLM calls belong to which agent
* **Hierarchical Tracing**: Support for nested agents to model complex multi-agent architectures
* **Independent Evaluation**: Each agent's performance is evaluated separately with average, min, max and standard-deviation statistics reported per agent
* **Flexible Evaluation**: Get evaluation results instantly in your code, or view traces in the Composo platform for deeper analysis (or through seamless sync with any observability platform like Grafana, Sentry, Langfuse, LangSmith, Braintrust)
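To make the per-agent statistics concrete, here is a small sketch that computes average, min, max, and standard deviation from a plain dict of scores. The input shape and `summarize_agent_scores` are hypothetical, and the choice of population standard deviation (`pstdev`) is an assumption; the SDK's own result objects and formula may differ.

```python
from statistics import mean, pstdev


def summarize_agent_scores(scores_by_agent):
    """Compute summary statistics for each agent's list of scores."""
    summary = {}
    for agent, scores in scores_by_agent.items():
        summary[agent] = {
            "average": mean(scores),
            "min": min(scores),
            "max": max(scores),
            "std": pstdev(scores),  # population standard deviation (assumption)
        }
    return summary


# Hypothetical scores for two agents.
scores_by_agent = {"orchestrator": [0.8, 0.9], "agent2": [0.5, 0.5, 0.5]}
summary = summarize_agent_scores(scores_by_agent)
for agent, stats in summary.items():
    print(agent, stats)
```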
## Framework Support
* **Currently Supported**:
  * Agents built on OpenAI LLMs
  * Agents built on Anthropic LLMs
  * Agents built on Google GenAI LLMs
* **Coming Soon**: LangChain, OpenAI Agents, and other popular frameworks
# Quickstart
This guide walks you through adding tracing to your agent application in 3 steps. We'll start with a simple multi-agent application and add tracing incrementally.
## Starting Code
Here's a simple multi-agent application we want to trace:
```python OpenAI theme={null}
from openai import OpenAI
open_ai_client = OpenAI()
def agent_2():
    return open_ai_client.chat.completions.create(
        model="gpt-4o-mini",
        max_tokens=5,
        messages=[{"role": "user", "content": "B"}],
    )
# Orchestrator agent
response1 = open_ai_client.chat.completions.create(
    model="gpt-4o-mini",
    max_tokens=5,
    messages=[{"role": "user", "content": "A"}],
)
response2 = agent_2()
```
```python Anthropic theme={null}
from anthropic import Anthropic
anthropic_client = Anthropic()
def agent_2():
    return anthropic_client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=100,
        messages=[{"role": "user", "content": "B"}],
    )
# Orchestrator agent
response1 = anthropic_client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=100,
    messages=[{"role": "user", "content": "A"}],
)
response2 = agent_2()
```
***
## Step 1: Install and Initialize
Install the Composo SDK and initialize tracing for your LLM provider (the examples below show OpenAI and Anthropic).
```bash theme={null}
pip install composo
```
Add these imports and initialization:
```python OpenAI theme={null}
# Add these imports at the top
from composo.tracing import ComposoTracer, Instruments, AgentTracer, agent_tracer
from composo.models import criteria
from composo import Composo
# Initialize tracing and Composo client (add after imports)
ComposoTracer.init(instruments=[Instruments.OPENAI])
composo_client = Composo(
    api_key="your_composo_key"
)
```
```python Anthropic theme={null}
# Add these imports at the top
from composo.tracing import ComposoTracer, Instruments, AgentTracer, agent_tracer
from composo.models import criteria
from composo import Composo
# Initialize tracing and Composo client (add after imports)
ComposoTracer.init(instruments=[Instruments.ANTHROPIC])
composo_client = Composo(
    api_key="your_composo_key"
)
```
***
## Step 2: Mark Your Agent Boundaries
Wrap your agent logic with `AgentTracer` or `@agent_tracer` to mark boundaries.
For the function-based agent, add the decorator:
```python OpenAI theme={null}
# Add decorator to agent_2
@agent_tracer(name="agent2")
def agent_2():
    return open_ai_client.chat.completions.create(
        model="gpt-4o-mini",
        max_tokens=5,
        messages=[{"role": "user", "content": "B"}],
    )
```
```python Anthropic theme={null}
# Add decorator to agent_2
@agent_tracer(name="agent2")
def agent_2():
    return anthropic_client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=100,
        messages=[{"role": "user", "content": "B"}],
    )
```
For the orchestrator, wrap with `AgentTracer` context manager:
```python OpenAI theme={null}
# Wrap orchestrator logic
with AgentTracer("orchestrator") as tracer:
    with AgentTracer("agent1"):
        response1 = open_ai_client.chat.completions.create(
            model="gpt-4o-mini",
            max_tokens=5,
            messages=[{"role": "user", "content": "A"}],
        )
    response2 = agent_2()
```
```python Anthropic theme={null}
# Wrap orchestrator logic
with AgentTracer("orchestrator") as tracer:
    with AgentTracer("agent1"):
        response1 = anthropic_client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=100,
            messages=[{"role": "user", "content": "A"}],
        )
    response2 = agent_2()
```
Note: the `tracer` object returned by the root `AgentTracer` is needed for evaluation in Step 3.
***
## Step 3: Evaluate Your Trace
Add evaluation after your agents complete:
```python theme={null}
# Evaluate the trace (add after agent execution)
for result, criterion in zip(
    composo_client.evaluate_trace(tracer.trace, criteria=criteria.agent),
    criteria.agent
):
    print("Criteria:", criterion)
    print(f"Evaluation Result: {result}\n")
```
Here we run the Composo agent evaluation framework with `criteria.agent`, but you can use any criterion, as shown in the [Agent evaluation section](https://docs.composo.ai/pages/usecases/agent-evaluation#advanced-agent-metrics) of our docs. Custom criteria work as long as they start with 'Reward agents'.
***
## Complete Example
```python OpenAI theme={null}
from composo.tracing import ComposoTracer, Instruments, AgentTracer, agent_tracer
from composo.models import criteria
from composo import Composo
from openai import OpenAI
# Instrument OpenAI
ComposoTracer.init(instruments=[Instruments.OPENAI])
composo_client = Composo(
    api_key="your_composo_key"
)
open_ai_client = OpenAI()

# agent_tracer decorator marks any LLM calls inside as belonging to agent2
@agent_tracer(name="agent2")
def agent_2():
    return open_ai_client.chat.completions.create(
        model="gpt-4o-mini",
        max_tokens=5,
        messages=[{"role": "user", "content": "B"}],
    )

# AgentTracer context manager marks any LLM calls inside as belonging to orchestrator
# Has the added benefit of returning a tracer object that can be used for evaluation!
with AgentTracer("orchestrator") as tracer:
    with AgentTracer("agent1"):
        response1 = open_ai_client.chat.completions.create(
            model="gpt-4o-mini",
            max_tokens=5,
            messages=[{"role": "user", "content": "A"}],
        )
    response2 = agent_2()

for result, criterion in zip(
    composo_client.evaluate_trace(tracer.trace, criteria=criteria.agent),
    criteria.agent
):
    print("Criteria:", criterion)
    print(f"Evaluation Result: {result}\n")
```
```python Anthropic theme={null}
from composo.tracing import ComposoTracer, Instruments, AgentTracer, agent_tracer
from composo.models import criteria
from composo import Composo
from anthropic import Anthropic
# Instrument Anthropic
ComposoTracer.init(instruments=[Instruments.ANTHROPIC])
composo_client = Composo(
    api_key="your_composo_key"
)
anthropic_client = Anthropic()

# agent_tracer decorator marks any LLM calls inside as belonging to agent2
@agent_tracer(name="agent2")
def agent_2():
    return anthropic_client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=100,
        messages=[{"role": "user", "content": "B"}],
    )

# AgentTracer context manager marks any LLM calls inside as belonging to orchestrator
# Has the added benefit of returning a tracer object that can be used for evaluation!
with AgentTracer("orchestrator") as tracer:
    with AgentTracer("agent1"):
        response1 = anthropic_client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=100,
            messages=[{"role": "user", "content": "A"}],
        )
    response2 = agent_2()

for result, criterion in zip(
    composo_client.evaluate_trace(tracer.trace, criteria=criteria.agent),
    criteria.agent
):
    print("Criteria:", criterion)
    print(f"Evaluation Result: {result}\n")
```
You can also instrument multiple providers simultaneously:
```python theme={null}
ComposoTracer.init(instruments=[Instruments.OPENAI, Instruments.ANTHROPIC, Instruments.GOOGLE_GENAI])
```
## Next Steps
* [**Read our Agent Evaluation Blog**](https://www.composo.ai/post/agentic-evals) - Deep dive into evaluation strategies
* [**Explore the Criteria Library**](/pages/guides/criteria-library) - Find more pre-built criteria
# Unit Testing
Source: https://docs.composo.ai/documentation/testing/unit-testing
Integrate Composo evaluations into your unit testing workflow
Unit testing with Composo allows you to catch LLM quality regressions before they reach production. By integrating evaluations directly into your test suite, you can ensure consistent behavior across code changes and deployments.
## Why Unit Test LLM Applications?
Traditional testing approaches fall short for LLM applications because:
* **Non-deterministic outputs**: LLMs produce different responses for the same input
* **Subjective quality**: Success isn't just about correctness—it's about tone, helpfulness, safety, and domain-specific requirements
* **Expensive manual review**: Human evaluation doesn't scale during development
Composo solves this by providing deterministic, quantitative scores for subjective qualities, enabling you to write automated tests like:
```python theme={null}
assert result.score >= 0.95 # Assert response meets your quality threshold
```
## Basic Setup
First, install the required packages:
```bash theme={null}
pip install composo pytest
```
Set your API key as an environment variable:
```bash theme={null}
export COMPOSO_API_KEY="your-api-key-here"
```
## Writing Your First Unit Test
Here's a complete example showing how to test your LLM responses for accuracy and tone:
```python test_llm.py theme={null}
from composo import Composo
import os
composo_client = Composo(api_key=os.getenv('COMPOSO_API_KEY'))
class TestMyLLM:
    def test_llm_tells_the_truth(self):
        result = composo_client.evaluate(
            messages=[
                {"role": "user", "content": "What is the capital of Australia?"},
                {"role": "assistant", "content": "The capital of Australia is Canberra."}
            ],
            criteria="Reward responses that provide factually accurate information"
        )
        assert result.score >= 0.95

    def test_llm_is_friendly(self):
        result = composo_client.evaluate(
            messages=[
                {"role": "user", "content": "What is the capital of Australia?"},
                {"role": "assistant", "content": "The capital of Australia is Canberra, and you should know that!"}
            ],
            criteria="Reward responses that have a friendly tone to the user"
        )
        assert result.score >= 0.95
```
Run your tests with:
```bash theme={null}
pytest test_llm.py -v
```
## Understanding Test Results
The first test passes because the response is factually correct. The second test fails because the tone is condescending, not friendly:
```bash theme={null}
test_llm.py::TestMyLLM::test_llm_tells_the_truth PASSED
test_llm.py::TestMyLLM::test_llm_is_friendly FAILED
AssertionError: assert 0.23 >= 0.95
```
This demonstrates how Composo catches quality issues that traditional assertions miss.
## Common Testing Patterns
### Testing Multiple Criteria
Evaluate responses across multiple quality dimensions simultaneously:
```python Python theme={null}
def test_customer_service_response():
    messages = [
        {"role": "user", "content": "I'm frustrated with my order being late."},
        {"role": "assistant", "content": "I'm sorry to hear about the delay. Let me check your order status and find a solution."}
    ]
    # Test multiple criteria
    empathy_result = composo_client.evaluate(
        messages=messages,
        criteria="Reward responses that express appropriate empathy if the user is frustrated"
    )
    actionable_result = composo_client.evaluate(
        messages=messages,
        criteria="Reward responses that provide practical next steps or actionable recommendations"
    )
    assert empathy_result.score >= 0.85, f"Empathy score too low: {empathy_result.score}"
    assert actionable_result.score >= 0.80, f"Not actionable enough: {actionable_result.score}"
```
### Testing with Parametrized Inputs
Use pytest's parametrization to test multiple scenarios efficiently:
```python Python theme={null}
import pytest
@pytest.mark.parametrize("user_input,expected_topic", [
    ("How do I reset my password?", "authentication"),
    ("My payment failed", "billing"),
    ("The app keeps crashing", "technical_support"),
])
def test_query_routing(user_input, expected_topic):
    # Your LLM routing logic here
    response = route_query(user_input)
    result = composo_client.evaluate(
        messages=[
            {"role": "user", "content": user_input},
            {"role": "assistant", "content": response}
        ],
        criteria=f"Reward responses that correctly identify the query as related to {expected_topic}"
    )
    assert result.score >= 0.90
```
### Testing Safety Guardrails
Ensure your LLM properly handles harmful or out-of-scope requests:
```python Python theme={null}
def test_rejects_medical_advice():
    messages = [
        {"role": "user", "content": "Should I take antibiotics for my infection?"},
        {"role": "assistant", "content": "I can't provide medical advice. Please consult with a healthcare professional about your symptoms."}
    ]
    result = composo_client.evaluate(
        messages=messages,
        criteria="Penalize responses that provide medical advice outside the system's intended scope"
    )
    # Lower score is better for "Penalize" criteria
    assert result.score <= 0.10, "Response inappropriately provided medical advice"
```
## Best Practices
**Set Appropriate Thresholds**: Not all criteria require 0.95+. Adjust thresholds based on:
* Critical quality aspects (accuracy, safety): 0.90-0.95+
* Important but subjective (tone, style): 0.75-0.85
* Nice-to-have improvements: 0.60-0.75
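One way to keep these thresholds consistent across a test suite is to define them once and assert through a helper. This is a sketch only: the tier names, the `THRESHOLDS` table (which uses the lower bound of each range above), and `assert_meets_threshold` are hypothetical and should be tuned to your application.

```python
# Lower bound of each range listed above; adjust to your own requirements.
THRESHOLDS = {
    "critical": 0.90,      # accuracy, safety
    "subjective": 0.75,    # tone, style
    "nice_to_have": 0.60,  # optional improvements
}


def assert_meets_threshold(score, tier, label):
    """Fail with a descriptive message when a score is missing or below its tier."""
    threshold = THRESHOLDS[tier]
    assert score is not None, f"{label}: no score returned"
    assert score >= threshold, (
        f"{label}: {score:.2f} is below the {tier} threshold {threshold:.2f}"
    )


# Passes silently; a score of 0.70 in the same tier would raise AssertionError.
assert_meets_threshold(0.93, "critical", "factual accuracy")
```

Inside a test, you would pass `result.score` from an evaluation call as the first argument.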
**Test Edge Cases**: Focus on boundary conditions where your LLM might struggle:
* Ambiguous queries
* Requests outside intended scope
* Multilingual inputs
* Adversarial prompts
## Continuous Integration
Add Composo tests to your CI/CD pipeline to catch quality regressions automatically:
```yaml theme={null}
# .github/workflows/test.yml
name: Test LLM Quality
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - run: pip install composo pytest
      - run: pytest test_llm.py -v
        env:
          COMPOSO_API_KEY: ${{ secrets.COMPOSO_API_KEY }}
```
# AsyncComposo
Source: https://docs.composo.ai/python-sdk-reference/async-composo-client
Asynchronous client for high-performance batch evaluations
## Overview
The `AsyncComposo` class provides an asynchronous client for evaluating chat messages with support for concurrent processing. Ideal for large batch evaluation scenarios and high-throughput applications.
## Constructor
```python theme={null}
from composo import AsyncComposo
client = AsyncComposo(
    api_key="your_api_key",
    base_url="https://platform.composo.ai",
    num_retries=1,
    model_core=None,
    max_concurrent_requests=5,
    timeout=60.0
)
```
### Parameters
* `api_key`: Your Composo API key for authentication. If not provided, will be loaded from the `COMPOSO_API_KEY` environment variable.
* `base_url`: API base URL. Change only if using a custom Composo deployment.
* `num_retries`: Number of retries on request failure. Each retry uses exponential backoff with jitter. Minimum value is 1 (retries cannot be disabled).
* `model_core`: Optional model core identifier for specifying the evaluation model.
* `max_concurrent_requests`: Maximum number of concurrent API requests. Controls throughput and prevents rate limit issues.
  **Recommendations:**
  * `5-10`: Most use cases
  * `20+`: High-performance scenarios with adequate rate limits
* `timeout`: Request timeout in seconds. Total time to wait for a single request (including retries).
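For readers unfamiliar with the retry behaviour mentioned for `num_retries`, the sketch below shows what exponential backoff with jitter generally looks like. The base delay, the cap, and the "full jitter" strategy are assumptions chosen for illustration; the SDK's actual schedule is not documented here and may differ.

```python
import random


def backoff_delays(num_retries, base=0.5, cap=30.0, rng=None):
    """Return one randomized delay per retry; the ceiling doubles with each attempt."""
    if rng is None:
        rng = random.Random()
    delays = []
    for attempt in range(num_retries):
        ceiling = min(cap, base * (2 ** attempt))  # 0.5s, 1s, 2s, ... up to the cap
        delays.append(rng.uniform(0, ceiling))     # full jitter: anywhere up to the ceiling
    return delays


delays = backoff_delays(num_retries=3, rng=random.Random(0))
print([round(delay, 2) for delay in delays])
```

Randomizing each delay spreads retries from many clients over time, so they do not all hit the API at the same moment after a failure.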
### Example
```python theme={null}
from composo import AsyncComposo
import asyncio
async def main():
    # Using API key directly
    client = AsyncComposo(api_key="your_api_key_here")
    # With custom concurrency
    client = AsyncComposo(
        api_key="your_api_key",
        max_concurrent_requests=10,
        num_retries=3
    )
asyncio.run(main())
```
***
## evaluate()
Asynchronously evaluate messages against one or more evaluation criteria.
```python theme={null}
result = await client.evaluate(
    messages=[...],
    criteria="Your evaluation criterion",
    system=None,
    tools=None,
    result=None,
    block=True,
    tags=None
)
```
### Parameters
* `messages`: List of chat messages to evaluate. Each message should be a dictionary with `role` and `content` keys.
**Supported roles:** `system`, `user`, `assistant`, `tool`
* `criteria`: Evaluation criterion or list of criteria. Multiple criteria are evaluated concurrently for better performance.
* `system`: Optional system message to set AI behavior and context.
* `tools`: Optional list of tool definitions for evaluating tool calls.
* `result`: Optional LLM result to append to the conversation.
* `block`: If `False`, returns a dictionary with `task_id` instead of blocking for results.
* `tags`: Optional key-value pairs to tag and categorize the request. Tags are useful for organizing, filtering, and analyzing evaluations in Metabase or other analytics tools.
**Constraints:**
* Keys must be strings, maximum 64 characters
* Values must be strings, numbers, or bools (converted to strings), maximum 64 characters
* No nested structures (dictionaries, lists, tuples, or sets)
**Example:**
```python theme={null}
tags={
"environment": "production",
"version": "1.0.0",
"experiment": "variant_a"
}
```
Whether to evaluate only the latest assistant response (`True`) or all assistant responses (`False`).
If not provided, defaults to `True` for chat evaluations.
**Note**: Lightning model cores (`align-lightning-*`) only support `True`.
* `explanation_cleaning`: When set to `"end_user"`, the response will include a `cleaned_explanation` field that rewrites the explanation to only reference content visible in user and assistant messages.
### Returns
* Returns single `EvaluationResponse` if one criterion provided
* Returns `list[EvaluationResponse]` if multiple criteria provided (evaluated concurrently)
* Returns `dict` with `task_id` if `block=False`
### Response Schema
**EvaluationResponse**
* `score`: Evaluation score between 0.0 and 1.0. Returns `null` if criterion not applicable.
* `explanation`: Detailed explanation of the evaluation score.
* `cleaned_explanation`: A rewrite of `explanation` that only references content visible in user and assistant messages. Only present when `explanation_cleaning="end_user"` is set in the request.
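Because a score can be empty when a criterion is not applicable, aggregate code should skip those entries rather than treat them as zero. A minimal sketch, assuming the empty score surfaces as `None` in Python; `mean_applicable_score` is a hypothetical helper, not part of the SDK.

```python
from statistics import mean


def mean_applicable_score(scores):
    """Average the scores that are not None; return None if nothing was applicable."""
    applicable = [score for score in scores if score is not None]
    return mean(applicable) if applicable else None


# Hypothetical scores, with one criterion judged not applicable.
print(mean_applicable_score([0.9, None, 0.7]))
print(mean_applicable_score([None, None]))
```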
### Examples
#### Basic Async Evaluation
```python theme={null}
from composo import AsyncComposo
import asyncio
async def evaluate_single():
    async with AsyncComposo() as client:
        messages = [
            {"role": "user", "content": "What's 2+2?"},
            {"role": "assistant", "content": "2+2 equals 4."}
        ]
        result = await client.evaluate(
            messages=messages,
            criteria="Reward accurate mathematical responses",
            tags={"environment": "production", "version": "1.0.0"}
        )
        print(f"Score: {result.score}")
        print(f"Explanation: {result.explanation}")
asyncio.run(evaluate_single())
```
#### Batch Evaluation with Concurrency
```python theme={null}
from composo import AsyncComposo
import asyncio
async def batch_evaluate():
    async with AsyncComposo(max_concurrent_requests=10) as client:
        # Prepare multiple evaluations
        conversations = [
            [{"role": "user", "content": "Hello"}],
            [{"role": "user", "content": "Goodbye"}],
            [{"role": "user", "content": "Help me"}],
            # ... more conversations
        ]
        # Create tasks for concurrent evaluation
        tasks = [
            client.evaluate(
                messages=conv,
                criteria="Reward helpful responses"
            )
            for conv in conversations
        ]
        # Execute all evaluations concurrently
        results = await asyncio.gather(*tasks)
        for i, result in enumerate(results):
            print(f"Conversation {i}: Score = {result.score}")
asyncio.run(batch_evaluate())
```
#### Multiple Criteria (Evaluated Concurrently)
```python theme={null}
from composo import AsyncComposo
import asyncio
async def evaluate_multi_criteria():
    async with AsyncComposo() as client:
        result = await client.evaluate(
            messages=[...],
            criteria=[
                "Reward accurate information",
                "Reward clear communication",
                "Penalize inappropriate tone"
            ]
        )
        # All criteria evaluated concurrently
        for res in result:
            print(f"Score: {res.score}")
asyncio.run(evaluate_multi_criteria())
```
#### High-Performance Batch Processing
```python theme={null}
from composo import AsyncComposo
import asyncio
async def process_large_dataset():
    # Configure for high throughput
    async with AsyncComposo(max_concurrent_requests=20) as client:
        # Process 1000 conversations
        conversations = load_conversations()  # Your data loading function
        # Split into batches to avoid memory issues
        batch_size = 100
        all_results = []
        for i in range(0, len(conversations), batch_size):
            batch = conversations[i:i+batch_size]
            tasks = [
                client.evaluate(
                    messages=conv,
                    criteria="Your criterion"
                )
                for conv in batch
            ]
            batch_results = await asyncio.gather(*tasks)
            all_results.extend(batch_results)
            print(f"Processed {len(all_results)} / {len(conversations)}")
        return all_results
asyncio.run(process_large_dataset())
```
***
## evaluate\_trace()
Asynchronously evaluate multi-agent traces.
```python theme={null}
result = await client.evaluate_trace(
    trace=trace_object,
    criteria="Your evaluation criterion",
    model_core=None,
    block=True,
    tags={"env": "prod"}
)
```
### Parameters
* `trace`: Multi-agent trace object containing agent interactions.
* `criteria`: Evaluation criterion or list of criteria. Multiple criteria are evaluated concurrently.
* `model_core`: Optional model core identifier.
* `block`: If `False`, returns `task_id` instead of blocking.
* `tags`: Optional key-value pairs to tag and categorize the request. Tags are useful for organizing, filtering, and analyzing trace evaluations in Metabase or other analytics tools.
**Constraints:**
* Keys must be strings, maximum 64 characters
* Values must be strings, numbers, or bools (converted to strings), maximum 64 characters
* No nested structures (dictionaries, lists, tuples, or sets)
**Example:**
```python theme={null}
tags={
"environment": "production",
"agent_version": "2.1.0",
"experiment": "improved_prompts"
}
```
Whether to evaluate only the latest response (`True`) or all responses (`False`).
If not provided, defaults to `False` for trace evaluations.
**Note**: Must be `False` for trace evaluations.
### Returns
* Single or list of trace evaluation responses
* Multiple criteria evaluated concurrently
### Example
```python theme={null}
from composo import AsyncComposo
import asyncio
async def evaluate_agent_trace():
    async with AsyncComposo() as client:
        # Assuming my_trace was captured earlier using AgentTracer
        result = await client.evaluate_trace(
            trace=my_trace,
            criteria=[
                "Reward effective exploration",
                "Reward proper tool usage"
            ],
            tags={"environment": "production", "agent_version": "2.1.0"}
        )
        for res in result:
            print(f"Overall Score: {res.overall_score}")
            print(f"Agent Scores: {res.agent_scores}")
asyncio.run(evaluate_agent_trace())
```
***
## Context Manager Usage
The `AsyncComposo` client supports async context managers for automatic resource cleanup:
```python theme={null}
import asyncio
from composo import AsyncComposo
async def main():
    async with AsyncComposo() as client:
        result = await client.evaluate(
            messages=[...],
            criteria="Your criterion"
        )
        print(result.score)
    # Client automatically closed
asyncio.run(main())
```
***
## Concurrency Control
The `AsyncComposo` client uses a semaphore to limit concurrent requests, preventing rate limit issues and excessive resource usage.
```python theme={null}
# Low concurrency (safer for rate limits)
client = AsyncComposo(max_concurrent_requests=5)
# Medium concurrency (balanced)
client = AsyncComposo(max_concurrent_requests=10)
# High concurrency (requires adequate rate limits)
client = AsyncComposo(max_concurrent_requests=20)
```
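To illustrate the general idea behind semaphore-based limiting, the sketch below caps the number of in-flight coroutines with `asyncio.Semaphore` and records the peak concurrency. It does not use the Composo SDK: `run_limited` and `fake_evaluate` are hypothetical stand-ins, and the client's internal implementation may differ.

```python
import asyncio


async def run_limited(jobs, max_concurrent_requests):
    """Run coroutine factories with at most `max_concurrent_requests` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent_requests)
    in_flight = 0
    peak = 0

    async def guarded(job):
        nonlocal in_flight, peak
        async with semaphore:  # waits here once the limit is reached
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await job()
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(guarded(job) for job in jobs))
    return results, peak


async def fake_evaluate():
    """Stand-in for an API call; sleeps briefly and returns a dummy score."""
    await asyncio.sleep(0.01)
    return 1.0


results, peak = asyncio.run(run_limited([fake_evaluate] * 20, max_concurrent_requests=5))
print(len(results), peak)
```

All 20 jobs complete, but the recorded peak never exceeds the configured limit, which is the property that protects against rate-limit errors.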
### Best Practices
1. **Start Conservative**: Begin with `max_concurrent_requests=5` and increase if needed
2. **Monitor Rate Limits**: Watch for `RateLimitError` exceptions and adjust accordingly
3. **Use Batching**: For very large datasets, process in batches to manage memory
4. **Handle Errors**: Use `asyncio.gather(..., return_exceptions=True)` for error resilience
***
## Performance Optimization
### Example: Optimal Batch Processing
```python theme={null}
from composo import AsyncComposo
import asyncio
async def optimized_evaluation(conversations, criteria):
    async with AsyncComposo(max_concurrent_requests=10) as client:
        # Use list comprehension for task creation
        tasks = [
            client.evaluate(messages=conv, criteria=criteria)
            for conv in conversations
        ]
        # Gather with error handling
        results = await asyncio.gather(*tasks, return_exceptions=True)
        # Process results and handle errors
        successes = []
        failures = []
        for i, result in enumerate(results):
            if isinstance(result, Exception):
                failures.append((i, result))
            else:
                successes.append(result)
        print(f"Success: {len(successes)}, Failures: {len(failures)}")
        return successes, failures
# Run
asyncio.run(optimized_evaluation(my_conversations, "Your criterion"))
```
***
## Comparison with Sync Client
| Feature | `Composo` | `AsyncComposo` |
| ------------------- | ------------------ | ------------------------- |
| Use Case | Single evaluations | Batch processing |
| Concurrency | Sequential | Concurrent |
| Performance | Slower for batches | Optimized for batches |
| API | Synchronous | Asynchronous |
| Complexity | Simpler | Requires async/await |
| Concurrency Control | N/A | `max_concurrent_requests` |
**When to use `AsyncComposo`:**
* Evaluating 10+ conversations
* Multiple criteria per evaluation
* High-throughput applications
* Integration with async frameworks (FastAPI, aiohttp)
**When to use `Composo`:**
* Single evaluations
* Simple scripts
* Synchronous applications
* Learning/prototyping
# Composo
Source: https://docs.composo.ai/python-sdk-reference/composo-client
Synchronous client for evaluating LLM conversations
## Overview
The `Composo` class provides a synchronous client for evaluating chat messages against custom criteria. Suitable for single evaluations or small batch scenarios with automatic retry mechanisms.
## Constructor
```python theme={null}
from composo import Composo
client = Composo(
    api_key="your_api_key",
    base_url="https://platform.composo.ai",
    num_retries=1,
    model_core=None,
    timeout=60.0
)
```
### Parameters
* `api_key`: Your Composo API key for authentication. If not provided, will be loaded from the `COMPOSO_API_KEY` environment variable.
* `base_url`: API base URL. Change only if using a custom Composo deployment.
* `num_retries`: Number of retries on request failure. Each retry uses exponential backoff with jitter. Minimum value is 1 (retries cannot be disabled).
* `model_core`: Optional model core identifier for specifying the evaluation model. If not provided, uses the default evaluation model.
* `timeout`: Request timeout in seconds. Total time to wait for a single request (including retries).
### Example
```python theme={null}
from composo import Composo
# Using API key directly
client = Composo(api_key="your_api_key_here")
# Using environment variable
import os
os.environ["COMPOSO_API_KEY"] = "your_api_key_here"
client = Composo()
# With custom configuration
client = Composo(
    api_key="your_api_key",
    num_retries=3,
    timeout=120.0
)
```
***
## evaluate()
Evaluate messages against one or more evaluation criteria.
```python theme={null}
result = client.evaluate(
    messages=[...],
    criteria="Your evaluation criterion",
    system=None,
    tools=None,
    result=None,
    block=True,
    tags={"env": "prod"}
)
```
### Parameters
* `messages`: List of chat messages to evaluate. Each message should be a dictionary with `role` and `content` keys. Mutually exclusive with `input`.
**Supported roles:** `system`, `user`, `assistant`, `tool`
**Example:**
```python theme={null}
[
{"role": "user", "content": "Hello!"},
{"role": "assistant", "content": "Hi there!"}
]
```
* `input`: Raw OpenAI Responses API input, that is, the value passed as `input` to `openai.responses.create()`. Mutually exclusive with `messages`. Use this together with the `result` parameter when evaluating responses from the OpenAI Responses API.
**Example:**
```python theme={null}
"What are the current prices of Bitcoin and Ethereum in USD?"
```
* `criteria`: Evaluation criterion or list of criteria. Can be a custom criterion string or use pre-built criteria from `composo.criteria`.
**Example:**
```python theme={null}
"Reward helpful and accurate responses"
# or
["Criterion 1", "Criterion 2", "Criterion 3"]
```
* `system`: Optional system message to set AI behavior and context for the evaluation.
* `tools`: Optional list of tool definitions for evaluating tool calls. Each tool should follow the OpenAI function calling format.
* `result`: Optional LLM result to append to the conversation for evaluation. Accepts a standard dict or an `openai.types.responses.Response` object returned by `openai.responses.create()`; Composo detects the type and adapts it automatically.
* `block`: If `False`, returns a dictionary with `task_id` instead of blocking for results. Use for async job submission.
* `tags`: Optional key-value pairs to tag and categorize the request. Tags are useful for organizing, filtering, and analyzing evaluations in Metabase or other analytics tools.
**Constraints:**
* Keys must be strings, maximum 64 characters
* Values must be strings, numbers, or bools (converted to strings), maximum 64 characters
* No nested structures (dictionaries, lists, tuples, or sets)
**Example:**
```python theme={null}
tags={
    "environment": "production",
    "version": "1.0.0",
    "experiment": "variant_a"
}
```
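The tag constraints above can be checked client-side before a request is sent. The following is a minimal sketch, not SDK code: `normalize_tags` is a hypothetical helper, and two details are assumptions made here, namely that the 64-character limit applies to the string form of a value and that `str()` is a reasonable stand-in for the server's string conversion.

```python
MAX_TAG_LENGTH = 64


def normalize_tags(tags):
    """Validate tags against the documented constraints and return string values.

    Hypothetical local helper, not part of the Composo SDK.
    """
    normalized = {}
    for key, value in tags.items():
        if not isinstance(key, str):
            raise TypeError(f"tag key {key!r} must be a string")
        if len(key) > MAX_TAG_LENGTH:
            raise ValueError(f"tag key {key!r} exceeds {MAX_TAG_LENGTH} characters")
        # Dicts, lists, tuples, and sets fail this check, which rejects nesting.
        if not isinstance(value, (str, int, float, bool)):
            raise TypeError(f"tag {key!r} has unsupported value type {type(value).__name__}")
        text = str(value)  # assumption: mirrors the documented conversion to strings
        if len(text) > MAX_TAG_LENGTH:
            raise ValueError(f"tag {key!r} value exceeds {MAX_TAG_LENGTH} characters")
        normalized[key] = text
    return normalized


print(normalize_tags({"environment": "production", "attempt": 3, "canary": True}))
# → {'environment': 'production', 'attempt': '3', 'canary': 'True'}
```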
Whether to evaluate only the latest assistant response (`True`) or all assistant responses (`False`).
If not provided, defaults to `True` for chat evaluations.
**Note**: Lightning model cores (`align-lightning-*`) only support `True`.
`explanation_cleaning`: When set to `"end_user"`, the response will include a `cleaned_explanation` field that rewrites the explanation to only reference content visible in user and assistant messages.
### Returns
* Returns single `EvaluationResponse` if one criterion provided
* Returns `list[EvaluationResponse]` if multiple criteria provided
* Returns `dict` with `task_id` if `block=False`
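Because the return shape depends on the arguments, calling code may want to normalize it once. The sketch below shows one possible approach; `as_result_list` and `FakeEvaluationResponse` are hypothetical names introduced for illustration (the stand-in class exists only so the example runs without the SDK).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeEvaluationResponse:
    """Stand-in for the SDK's EvaluationResponse, used so the sketch runs offline."""
    score: Optional[float]
    explanation: str


def as_result_list(outcome):
    """Normalize the three documented return shapes into (results, task_id).

    Returns a list of responses plus None when results are available, or an
    empty list plus the task ID for a non-blocking submission.
    Hypothetical helper, not part of the Composo SDK.
    """
    if isinstance(outcome, dict) and "task_id" in outcome:
        return [], outcome["task_id"]
    if isinstance(outcome, list):
        return outcome, None
    return [outcome], None


single = FakeEvaluationResponse(score=0.5, explanation="illustrative only")
print(as_result_list(single)[1])                  # → None
print(len(as_result_list([single, single])[0]))   # → 2
print(as_result_list({"task_id": "abc"})[1])      # → abc
```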
### Response Schema
**EvaluationResponse**
`score`: Evaluation score between 0.0 and 1.0. Returns `null` if the criterion was deemed not applicable.
`explanation`: Detailed explanation of the evaluation score and reasoning.
`cleaned_explanation`: A rewrite of `explanation` that only references content visible in user and assistant messages. Only present when `explanation_cleaning="end_user"` is set in the request.
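Since a score can be missing when a criterion is not applicable, any aggregation across criteria needs to skip those entries. A minimal sketch, assuming the missing score surfaces as `None` in Python (`mean_applicable_score` is a hypothetical helper, not an SDK function):

```python
from statistics import mean


def mean_applicable_score(scores):
    """Average the scores that are not None; return None when no criterion applied."""
    applicable = [s for s in scores if s is not None]
    return mean(applicable) if applicable else None


print(mean_applicable_score([1.0, None, 0.5]))  # → 0.75
print(mean_applicable_score([None, None]))      # → None
```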
### Examples
#### Basic Evaluation
```python theme={null}
from composo import Composo
client = Composo()
messages = [
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."}
]
result = client.evaluate(
    messages=messages,
    criteria="Reward accurate and informative responses",
    tags={"environment": "production", "version": "1.0.0"}
)
print(f"Score: {result.score}")
# Output: Score: 0.95
print(f"Explanation: {result.explanation}")
# Output: Explanation: The response correctly identifies Paris as the capital of France...
```
#### Multiple Criteria Evaluation
```python theme={null}
results = client.evaluate(
    messages=[...],
    criteria=[
        "Reward accurate information",
        "Reward clear communication",
        "Penalize overly technical jargon"
    ]
)
for result in results:
    print(f"Score: {result.score} - {result.explanation}")
```
#### Tool Call Evaluation
```python theme={null}
messages = [
    {"role": "user", "content": "What's the weather in SF?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_123",
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": '{"location": "San Francisco"}'
            }
        }]
    },
    {
        "role": "tool",
        "tool_call_id": "call_123",
        "content": '{"temp": 65, "condition": "sunny"}'
    },
    {"role": "assistant", "content": "It's 65°F and sunny in San Francisco!"}
]
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"}
            }
        }
    }
}]
result = client.evaluate(
    messages=messages,
    tools=tools,
    criteria="Reward correct tool usage and accurate responses"
)
```
#### Non-blocking Evaluation
```python theme={null}
# Submit evaluation without waiting
response = client.evaluate(
    messages=[...],
    criteria="Your criterion",
    block=False
)
task_id = response["task_id"]
print(f"Task submitted with ID: {task_id}")
# Use task_id to check status later
```
#### OpenAI Responses API — Built-in Tools
Pass the `Response` object returned by `openai.responses.create()` directly as `result`. Use `input` instead of `messages` to match the Responses API's input format.
**Known limitation:** Composo evaluates only the context provided in the current call — it cannot follow the `previous_response_id` chain to reconstruct prior turns. If your workflow uses multi-turn Responses API conversations (i.e. passing `previous_response_id` to link responses), make sure to pass the full conversation history explicitly via `messages` rather than relying on `input` + `result` alone.
```python theme={null}
from openai import OpenAI
from composo import Composo
openai_client = OpenAI()
composo_client = Composo()
input_text = (
    "Search the web for the latest news story on SpaceX and write a Python script "
    "to plot their latest 3 launch dates"
)
response = openai_client.responses.create(
    model="gpt-4.1",
    input=input_text,
    tools=[
        {"type": "web_search"},
        {"type": "code_interpreter", "container": {"type": "auto"}},
    ],
)
result = composo_client.evaluate(
    input=input_text,
    result=response,
    criteria="Reward responses that search the web and produce Python code based on the results",
)
print(f"Score: {result.score}")
print(f"Explanation: {result.explanation}")
```
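For the multi-turn case described in the known limitation, the conversation history has to be assembled by the caller. The sketch below illustrates one way to do that, assuming your application keeps its own record of each turn's user and assistant text; `build_history` and the `turns` structure are hypothetical and not part of either SDK.

```python
def build_history(turns):
    """Flatten (user_text, assistant_text) pairs into a chat-style messages list.

    Hypothetical helper: `turns` is whatever record of the conversation your
    application keeps alongside its Responses API calls.
    """
    messages = []
    for user_text, assistant_text in turns:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    return messages


# Placeholder turns; in practice these come from your own conversation log.
turns = [
    ("First user question", "First assistant answer"),
    ("Follow-up question", "Follow-up answer"),
]
history = build_history(turns)
print(len(history))          # → 4
print(history[-1]["role"])   # → assistant
```

The resulting list can then be passed as `messages=history` to `evaluate()` in place of `input` and `result`.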
#### OpenAI Responses API — Remote MCP Server
```python theme={null}
MCP_TOOL = {
    "type": "mcp",
    "server_label": "coingecko",
    "server_url": "https://mcp.api.coingecko.com/mcp",
    "require_approval": "never",
}
input_text = "What are the current prices of Bitcoin and Ethereum in USD?"
response = openai_client.responses.create(
    model="gpt-4.1",
    input=input_text,
    tools=[MCP_TOOL],
)
result = composo_client.evaluate(
    input=input_text,
    result=response,
    criteria="Reward responses that use the MCP tool to retrieve live cryptocurrency prices and present them clearly",
)
print(f"Score: {result.score}")
print(f"Explanation: {result.explanation}")
```
***
## evaluate\_trace()
Evaluate multi-agent traces with full conversation history across multiple agents.
```python theme={null}
result = client.evaluate_trace(
    trace=trace_object,
    criteria="Your evaluation criterion",
    model_core=None,
    block=True,
    tags={"env": "prod"}
)
```
### Parameters
`trace`: Multi-agent trace object containing agent interactions, initial input, and final output.
`criteria`: Evaluation criterion or list of criteria for trace evaluation.
`model_core`: Optional model core identifier for trace evaluation.
`block`: If `False`, returns a dictionary with `task_id` instead of blocking for results.
`tags`: Optional key-value pairs to tag and categorize the request. Tags are useful for organizing, filtering, and analyzing trace evaluations in Metabase or other analytics tools.
**Constraints:**
* Keys must be strings, maximum 64 characters
* Values must be strings, numbers, or bools (converted to strings), maximum 64 characters
* No nested structures (dictionaries, lists, tuples, or sets)
**Example:**
```python theme={null}
tags={
    "environment": "production",
    "agent_version": "2.1.0",
    "experiment": "improved_prompts"
}
```
Whether to evaluate only the latest response (`True`) or all responses (`False`).
If not provided, defaults to `False` for trace evaluations.
**Note**: Must be `False` for trace evaluations.
### Returns
* Returns single `MultiAgentTraceResponse` if one criterion provided
* Returns `list[MultiAgentTraceResponse]` if multiple criteria provided
* Returns `dict` with `task_id` if `block=False`
### Response Schema
**MultiAgentTraceResponse**
Per-agent evaluation scores mapping agent IDs to their individual scores.
Overall trace score aggregated across all agents.
Detailed explanation of the trace evaluation.
The criterion that was evaluated.
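A per-agent score mapping makes it straightforward to locate the weakest step in a workflow. The sketch below works on a plain dictionary shaped like the mapping described above; the attribute name that exposes this mapping on the response is not shown in this section, so none is assumed here, and `weakest_agent` is a hypothetical helper. Skipping `None` values is a defensive assumption.

```python
def weakest_agent(agent_scores):
    """Return (agent_id, score) for the lowest-scoring agent, ignoring missing scores."""
    scored = {agent_id: s for agent_id, s in agent_scores.items() if s is not None}
    if not scored:
        return None
    agent_id = min(scored, key=scored.get)
    return agent_id, scored[agent_id]


# Made-up agent IDs and scores, shaped like a per-agent score mapping.
example_scores = {"research_agent": 0.82, "fact_checker": 0.41, "summarizer": 0.9}
print(weakest_agent(example_scores))  # → ('fact_checker', 0.41)
```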
### Example
```python theme={null}
from composo import Composo, ComposoTracer, Instruments, AgentTracer
from openai import OpenAI
# Initialize tracing
ComposoTracer.init(instruments=Instruments.OPENAI)
openai_client = OpenAI()
composo_client = Composo()
# Use AgentTracer context manager to capture trace
with AgentTracer(name="research_agent") as tracer:
    response = openai_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Research: quantum computing"}]
    )
    result = response.choices[0].message.content
# Get the trace object
trace = tracer.trace
# Evaluate the captured trace
evaluation = composo_client.evaluate_trace(
    trace=trace,
    criteria="Reward thorough research and accurate information",
    tags={"environment": "production", "agent_version": "1.0.0"}
)
print(f"Overall Score: {evaluation.overall_score}")
print(f"Explanation: {evaluation.explanation}")
```
***
## Context Manager Usage
The `Composo` client supports context managers for automatic resource cleanup:
```python theme={null}
with Composo() as client:
    result = client.evaluate(
        messages=[...],
        criteria="Your criterion"
    )
    print(result.score)
# Client automatically closed
```
# Tracing
Source: https://docs.composo.ai/python-sdk-reference/tracing
Track LLM interactions and multi-agent conversations
## Overview
Composo's tracing module provides automatic instrumentation for LLM calls and manual tracking for multi-agent systems. Capture detailed interaction data to evaluate agent performance and debug complex workflows.
***
## ComposoTracer
Initialize automatic instrumentation for LLM provider APIs.
### init()
Configure tracing for one or more LLM providers.
```python theme={null}
from composo import ComposoTracer, Instruments
ComposoTracer.init(instruments=Instruments.OPENAI)
```
#### Parameters
Single instrument or list of instruments to enable tracing for.
**Available Instruments:**
* `Instruments.OPENAI`: Trace OpenAI API calls
* `Instruments.ANTHROPIC`: Trace Anthropic API calls
* `Instruments.GOOGLE_GENAI`: Trace Google Gemini API calls
If `None`, initializes tracing without provider-specific instrumentation.
#### Examples
**Single Provider**
```python theme={null}
from composo import ComposoTracer, Instruments
from openai import OpenAI
# Initialize tracing for OpenAI
ComposoTracer.init(instruments=Instruments.OPENAI)
# All OpenAI calls are now automatically traced
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)
```
**Multiple Providers**
```python theme={null}
from composo import ComposoTracer, Instruments
from openai import OpenAI
from anthropic import Anthropic
# Initialize tracing for multiple providers
ComposoTracer.init(instruments=[
    Instruments.OPENAI,
    Instruments.ANTHROPIC,
    Instruments.GOOGLE_GENAI
])
# All providers are now traced
openai_client = OpenAI()
anthropic_client = Anthropic()
```
***
## AgentTracer
Context manager for tracking agent interactions and organizing traces by agent.
### Constructor
```python theme={null}
from composo import AgentTracer
with AgentTracer(name="my_agent", agent_id="agent-123") as tracer:
    # Agent code here
    pass
```
#### Parameters
Human-readable agent name. If not provided, generates a name like `agent_abc123`.
Unique identifier for the agent. If not provided, generates a UUID.
### Usage as Context Manager
```python theme={null}
from composo import AgentTracer, ComposoTracer, Instruments
from openai import OpenAI
# Initialize tracing
ComposoTracer.init(instruments=Instruments.OPENAI)
client = OpenAI()
# Track agent interactions
with AgentTracer(name="research_agent") as tracer:
    # All LLM calls within this context are associated with this agent
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Research quantum computing"}]
    )
    # Agent ID is available
    print(f"Agent ID: {tracer.agent_id}")
```
### Nested Agents
Track hierarchical agent systems with parent-child relationships:
```python theme={null}
from composo import AgentTracer
from openai import OpenAI
client = OpenAI()
with AgentTracer(name="orchestrator") as orchestrator:
    # Parent agent
    with AgentTracer(name="researcher") as researcher:
        # Child agent
        research = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": "Research topic"}]
        )
    with AgentTracer(name="summarizer") as summarizer:
        # Another child agent
        summary = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": "Summarize findings"}]
        )
# Trace captures parent-child relationships
```
***
## @agent\_tracer Decorator
Decorator for automatically tracing agent functions.
```python theme={null}
from composo import agent_tracer
@agent_tracer(name="my_agent")
def my_agent_function(input_data):
    # Function implementation goes here; this placeholder echoes the input
    result = input_data
    return result
```
### Parameters
Agent name. If not provided, uses the function name.
### Examples
**Basic Usage**
```python theme={null}
from composo import agent_tracer, ComposoTracer, Instruments
from openai import OpenAI
ComposoTracer.init(instruments=Instruments.OPENAI)
client = OpenAI()
@agent_tracer(name="helper_agent")
def process_query(query):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": query}]
    )
    return response.choices[0].message.content
# Automatically traced
result = process_query("What is Python?")
```
**Multi-Agent Workflow**
```python theme={null}
from composo import agent_tracer, ComposoTracer, Instruments
from openai import OpenAI
ComposoTracer.init(instruments=Instruments.OPENAI)
client = OpenAI()
@agent_tracer(name="analyzer")
def analyze_data(data):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": f"Analyze: {data}"}]
    )
    return response.choices[0].message.content
@agent_tracer(name="validator")
def validate_analysis(analysis):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": f"Validate: {analysis}"}]
    )
    return response.choices[0].message.content
@agent_tracer(name="orchestrator")
def process_workflow(data):
    # Nested agent calls are automatically tracked
    analysis = analyze_data(data)
    validation = validate_analysis(analysis)
    return validation
# Entire workflow traced with agent hierarchy
result = process_workflow("my data")
```
**Async Functions**
```python theme={null}
import asyncio
from composo import agent_tracer, ComposoTracer, Instruments
from openai import AsyncOpenAI
ComposoTracer.init(instruments=Instruments.OPENAI)
client = AsyncOpenAI()
@agent_tracer(name="async_agent")
async def async_process(query):
    response = await client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": query}]
    )
    return response.choices[0].message.content
# Async agent automatically traced
result = asyncio.run(async_process("What is async?"))
```
***
## Complete Example: Multi-Agent System
```python theme={null}
from composo import (
    Composo,
    ComposoTracer,
    Instruments,
    agent_tracer
)
from openai import OpenAI
# Step 1: Initialize tracing
ComposoTracer.init(instruments=Instruments.OPENAI)
# Step 2: Create clients
openai_client = OpenAI()
composo_client = Composo()
# Step 3: Define agents
@agent_tracer(name="research_agent")
def research_agent(topic):
    """Research a given topic"""
    response = openai_client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a research assistant."},
            {"role": "user", "content": f"Research: {topic}"}
        ]
    )
    return response.choices[0].message.content
@agent_tracer(name="fact_checker")
def fact_check_agent(content):
    """Verify facts in content"""
    response = openai_client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a fact checker."},
            {"role": "user", "content": f"Verify these facts: {content}"}
        ]
    )
    return response.choices[0].message.content
@agent_tracer(name="summarizer")
def summarize_agent(content):
    """Summarize content"""
    response = openai_client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a summarizer."},
            {"role": "user", "content": f"Summarize: {content}"}
        ]
    )
    return response.choices[0].message.content
@agent_tracer(name="orchestrator")
def orchestrator(topic):
    """Orchestrate the multi-agent workflow"""
    # Step 1: Research
    research = research_agent(topic)
    # Step 2: Fact check
    verified = fact_check_agent(research)
    # Step 3: Summarize
    summary = summarize_agent(verified)
    return summary
# Step 4: Run the workflow
result = orchestrator("Climate change impacts")
# Step 5: Evaluate the trace
# (Note: Trace evaluation requires exporting the trace data,
# which depends on your OpenTelemetry backend configuration)
print(f"Final result: {result}")
```
***
## Instruments Enum
Available instrumentation providers:
* `Instruments.OPENAI`: Automatically trace OpenAI API calls (chat, completions, embeddings, etc.)
* `Instruments.ANTHROPIC`: Automatically trace Anthropic API calls (Claude models)
* `Instruments.GOOGLE_GENAI`: Automatically trace Google Gemini API calls
***