Why Composo?
Engineering & product teams building enterprise AI applications tell us they need to:
“Test & iterate faster during development”
“Rapidly find and fix edge cases in production”
“Have 100% confidence in quality when we ship”
Manual evals don’t scale. LLM-as-judge is unreliable, with 30%+ variance. Composo’s purpose-built evaluation models deliver:
- 92% accuracy vs 72% for LLM-as-judge
- Deterministic scoring - same input always produces same output
- 70% reduction in error rate over alternatives
- Simple integration - just write a single sentence to create any custom criteria
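One way to read the error-rate figure, assuming it is derived from the two accuracy numbers above (the source does not say how it was computed): 92% accuracy corresponds to an 8% error rate, and 72% accuracy to a 28% error rate. The sketch below computes the relative reduction under that assumption. `relative_error_reduction` is a hypothetical helper name, not part of any Composo SDK.

```python
def relative_error_reduction(baseline_accuracy: float, new_accuracy: float) -> float:
    """Return the fractional drop in error rate when moving from baseline to new accuracy."""
    baseline_error = 1.0 - baseline_accuracy
    new_error = 1.0 - new_accuracy
    if baseline_error <= 0:
        raise ValueError("baseline accuracy must be below 1.0")
    return (baseline_error - new_error) / baseline_error


# Figures quoted above: 72% (LLM-as-judge) vs 92% (Composo).
reduction = relative_error_reduction(0.72, 0.92)
print(f"{reduction:.0%}")
```

Under this reading the reduction comes out at roughly 71%, which is in line with the quoted 70%.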
Evaluation Frameworks (start here)
Composo provides industry-leading frameworks to get you started immediately:
🤖 Agent Framework
Our comprehensive agent evaluation framework covers planning, tool use, and goal achievement. Learn more →
📚 RAG Framework
Battle-tested metrics for retrieval-augmented generation including faithfulness, completeness, and precision. Learn more →
🎯 Criteria Library
The real power of Composo is writing your own custom criteria in plain English - and most teams do exactly this for their specific use cases. For inspiration, browse our extensive library of pre-built criteria for common evaluation scenarios. View library →
What Are Evaluation Criteria?
Evaluation criteria are simple, single-sentence instructions that tell Composo exactly what to evaluate in your LLM outputs.
Three Types of Evaluation
- Response Evaluation - Evaluates the latest assistant response
- Tool Call Evaluation - Evaluates the latest tool call and its parameters
- Agent Evaluation - Evaluates the full end-to-end agent trace
Two Scoring Methods
Each evaluation type supports two scoring methods:
- Reward Score Evaluation: For continuous scoring (recommended for most use cases).
- Binary Evaluation: Use for simple pass/fail assessments against specific rules or policies. Perfect for content moderation and clear-cut criteria.
Reward score criteria begin with:
- "Reward responses that..." - Positive response evaluation
- "Penalize responses that..." - Negative response evaluation
- "Reward tool calls that..." - Positive tool call evaluation
- "Penalize tool calls that..." - Negative tool call evaluation
- "Reward agents that..." - Positive agent evaluation
- "Penalize agents that..." - Negative agent evaluation
Binary criteria begin with:
- "Response passes if..." / "Response fails if..." - Response evaluation
- "Tool call passes if..." / "Tool call fails if..." - Tool call evaluation
- "Agent passes if..." / "Agent fails if..." - Agent evaluation
For example:
- Input: A customer service conversation
- Criteria: "Reward responses that express appropriate empathy when the user is frustrated"
- Result: Composo analyzes the response and returns a score from 0-1 based on how well it meets this criterion
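The criteria prefixes listed above determine both what is evaluated (response, tool call, or agent) and how it is scored (reward or binary). As a rough illustration, the sketch below infers both from the opening words of a criterion. It is a local helper only and does not call Composo: `CriterionKind` and `classify_criterion` are hypothetical names, and the case-insensitive matching is an assumption rather than documented behavior.

```python
from typing import NamedTuple


class CriterionKind(NamedTuple):
    target: str   # "response", "tool call", or "agent"
    scoring: str  # "reward" or "binary"


# Prefixes taken from the criteria lists above, mapped to (target, scoring).
_PREFIXES = {
    "reward responses that": ("response", "reward"),
    "penalize responses that": ("response", "reward"),
    "reward tool calls that": ("tool call", "reward"),
    "penalize tool calls that": ("tool call", "reward"),
    "reward agents that": ("agent", "reward"),
    "penalize agents that": ("agent", "reward"),
    "response passes if": ("response", "binary"),
    "response fails if": ("response", "binary"),
    "tool call passes if": ("tool call", "binary"),
    "tool call fails if": ("tool call", "binary"),
    "agent passes if": ("agent", "binary"),
    "agent fails if": ("agent", "binary"),
}


def classify_criterion(criterion: str) -> CriterionKind:
    """Infer evaluation target and scoring method from a criterion's opening words."""
    # Assumption: matching ignores case and collapses repeated whitespace.
    normalized = " ".join(criterion.lower().split())
    for prefix, (target, scoring) in _PREFIXES.items():
        if normalized.startswith(prefix):
            return CriterionKind(target, scoring)
    raise ValueError(f"Unrecognized criterion prefix: {criterion!r}")


kind = classify_criterion(
    "Reward responses that express appropriate empathy when the user is frustrated"
)
print(kind.target, kind.scoring)
```

A check like this can catch a mistyped prefix before a criterion is submitted for evaluation.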
Composo’s models
Composo offers two purpose-built evaluation models to match your needs:
Composo Lightning
Fast evaluation for rapid iteration
- 3 second median response time
- Optimized for development workflows and real-time feedback
- Ideal for quick iteration during development and testing
- Works with LLM outputs & retrieval, not tool calling or agentic examples
Composo Align
Expert-level evaluation for production confidence
- 5-15 second response time
- Achieves 92% accuracy on real-world evaluation tasks (vs ~70% for LLM-as-judge)
- 70% reduction in error rate compared to alternatives
- Our flagship model for when accuracy matters most
- A custom-trained reasoning model that analyzes inputs against criteria
- A specialized scoring model that produces calibrated, deterministic scores
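The trade-off between the two models can be summarized as a small selection rule. This is a sketch only: `choose_model` and its string labels are hypothetical and are not Composo SDK identifiers. The one rule taken from the text is that Lightning does not cover tool calling or agentic examples.

```python
def choose_model(evaluation_type: str, accuracy_critical: bool) -> str:
    """Pick a model label from the constraints described above."""
    valid_types = {"response", "tool call", "agent"}
    if evaluation_type not in valid_types:
        raise ValueError(f"evaluation_type must be one of {sorted(valid_types)}")
    # Lightning works with LLM outputs and retrieval only,
    # so tool call and agent evaluations go to Align.
    if evaluation_type != "response":
        return "Composo Align"
    # For response evaluation, trade speed (Lightning) against accuracy (Align).
    return "Composo Align" if accuracy_critical else "Composo Lightning"


print(choose_model("response", accuracy_critical=False))
print(choose_model("agent", accuracy_critical=False))
```

In practice this means Lightning suits fast feedback during development, while Align suits tool call or agent evaluations and cases where accuracy matters most.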
Key Differences from LLM-as-Judge
- Deterministic: Same inputs always produce identical scores
- Calibrated: Scores meaningfully distributed across 0-1 range
- Consistent: Robust to minor wording changes in criteria
- Accurate: Trained specifically for evaluation, not general text generation
Message Format
Both endpoints accept the same message format:
Get Started with Composo
Ready to see how Composo compares to your current evaluation approach? Get started in 15 minutes with 500 free credits:
- Sign up at platform.composo.ai
- Install the SDK:
pip install composo
- Start getting eval results in <15 minutes