Composo delivers deterministic, accurate evaluation for LLM applications through purpose-built generative reward models. Unlike unreliable LLM-as-judge approaches, our specialized models provide consistent, precise scores you can trust, using just a single-sentence criterion.

Why Composo?

Engineering & product teams building enterprise AI applications tell us they need to:
“Test & iterate faster during development”
“Rapidly find and fix edge cases in production”
“Have 100% confidence in quality when we ship”
Manual evals don’t scale. LLM-as-judge is unreliable with 30%+ variance. Composo’s purpose-built evaluation models deliver:
  • 92% accuracy vs 72% for LLM-as-judge
  • Deterministic scoring - same input always produces same output
  • 70% reduction in error rate over alternatives
  • Simple integration - just write a single sentence to define any custom criterion

Evaluation Frameworks (start here)

Composo provides industry-leading frameworks to get you started immediately:
  • 🤖 Agent Framework - Our comprehensive agent evaluation framework covers planning, tool use, and goal achievement. Built for real-time tracing of multi-agent systems, it captures and evaluates agent interactions as they happen with our SDK. For agent applications, we recommend using the tracing SDK to instrument your code and evaluate agents in real time.
  • 📚 RAG Framework - Battle-tested metrics for retrieval-augmented generation, including faithfulness, completeness, and precision.
  • 🎯 Criteria Library - The real power of Composo is writing your own custom criteria in plain English. Write your own, or browse our extensive library of pre-built criteria for common evaluation scenarios.

What Are Evaluation Criteria?

Evaluation criteria are simple, single-sentence instructions that tell Composo exactly what to evaluate in your LLM outputs.

Three Types of Evaluation

  1. Response Evaluation - Evaluates the latest assistant response
  2. Tool Call Evaluation - Evaluates the latest tool call and its parameters
  3. Agent Evaluation - Evaluates the full end-to-end agent trace

Two Scoring Methods

Each evaluation type supports two scoring methods:
  1. Reward Score Evaluation: For continuous scoring (recommended for most use cases).
  2. Binary Evaluation: Use for simple pass/fail assessments against specific rules or policies. Perfect for content moderation and clear-cut criteria.
You specify what to evaluate through the wording of your criteria (example criteria for each pattern are sketched after these lists). For reward score evaluation:
  • "Reward responses that..." - Positive response evaluation
  • "Penalize responses that..." - Negative response evaluation
  • "Reward tool calls that..." - Positive tool call evaluation
  • "Penalize tool calls that..." - Negative tool call evaluation
  • "Reward agents that..." - Positive agent evaluation
  • "Penalize agents that..." - Negative agent evaluation
And for binary evaluation:
  • "Response passes if..." / "Response fails if..." - Response evaluation
  • "Tool call passes if..." / "Tool call fails if..." - Tool call evaluation
  • "Agent passes if..." / "Agent fails if..." - Agent evaluation

For Example:

  • Input: A customer service conversation
  • Criteria: "Reward responses that express appropriate empathy when the user is frustrated"
  • Result: Composo analyzes the response and returns a score from 0-1 based on how well it meets the criterion
This single sentence is all you need - no complex rubrics, no prompt engineering, no unreliable LLM judges. Just describe what good (or bad) looks like, and Composo handles the rest.
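
As a rough sketch of that flow in code: the snippet below assumes a hypothetical Python client named Composo with an evaluate method, an api_key argument, and score/explanation response fields. The real SDK's import path, method names, and fields may differ, so treat this as an illustration of the request shape rather than the actual interface.

```python
# Minimal sketch of a reward-score evaluation.
# Import path, client class, method, and response fields are assumptions.
from composo import Composo  # hypothetical import

client = Composo(api_key="YOUR_API_KEY")  # hypothetical constructor

messages = [
    {"role": "user", "content": "I've been charged twice and nobody is helping me!"},
    {"role": "assistant", "content": "I'm really sorry about the double charge - that's frustrating. Let me find the duplicate and get it refunded for you right away."},
]

result = client.evaluate(  # hypothetical method
    messages=messages,
    criteria="Reward responses that express appropriate empathy when the user is frustrated",
)

print(result.score)        # expected: a deterministic score between 0 and 1
print(result.explanation)  # expected: the model's reasoning (hypothetical field)
```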

Composo’s Models

Composo offers two purpose-built evaluation models to match your needs:

Composo Lightning

Fast evaluation for rapid iteration
  • 3 second median response time
  • Optimized for development workflows and real-time feedback
  • Ideal for quick iteration during development and testing
  • Works with LLM outputs & retrieval; does not support tool calling or agentic examples

Composo Align

Expert-level evaluation for production confidence
  • 5-15 second response time
  • Achieves 92% accuracy on real-world evaluation tasks (vs ~70% for LLM-as-judge)
  • 70% reduction in error rate compared to alternatives
  • Our flagship model for when accuracy matters most
Both models use our generative reward model architecture that combines:
  • A custom-trained reasoning model that analyzes inputs against criteria
  • A specialized scoring model that produces calibrated, deterministic scores
This dual-model approach lets you choose between speed and power: use Lightning for rapid development cycles, and Align for production deployments where maximum accuracy is critical.
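
Assuming the SDK lets you choose the model per call (the model parameter and the identifiers "composo-lightning" / "composo-align" below are assumptions for illustration, continuing the hypothetical client from the earlier sketch), switching between the two might look like this:

```python
import os

from composo import Composo  # hypothetical import, as in the earlier sketch

client = Composo(api_key="YOUR_API_KEY")

# Hypothetical model identifiers and `model` parameter: the fast model while
# iterating locally, the flagship model when accuracy matters most.
model_name = "composo-lightning" if os.environ.get("ENV") == "dev" else "composo-align"

result = client.evaluate(
    messages=[
        {"role": "user", "content": "Can I get a refund on last month's invoice?"},
        {"role": "assistant", "content": "Yes, I've already processed a full refund for you."},
    ],
    criteria="Penalize responses that promise refunds without first checking eligibility",
    model=model_name,  # hypothetical parameter
)
print(result.score)
```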

Key Differences from LLM-as-Judge

  1. Deterministic: Same inputs always produce identical scores
  2. Calibrated: Scores meaningfully distributed across 0-1 range
  3. Consistent: Robust to minor wording changes in criteria
  4. Accurate: Trained specifically for evaluation, not general text generation

Get Started with Composo

Ready to see how Composo compares to your current evaluation approach? Get started in 15 minutes with 500 free credits:
  • Sign up at platform.composo.ai
  • Install the SDK: pip install composo
  • Start getting eval results in <15 minutes (see the quickstart sketch below)
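
A minimal first call might look like the quickstart sketch below. It reuses the same hypothetical client interface as the earlier sketches and shows a binary pass/fail criterion; the real SDK's names and response fields may differ.

```python
# pip install composo
# Quickstart sketch with a binary (pass/fail) criterion.
# Import path, client class, method, and response field are assumptions.
from composo import Composo

client = Composo(api_key="YOUR_API_KEY")

result = client.evaluate(
    messages=[
        {"role": "user", "content": "Read me back my full card number."},
        {"role": "assistant", "content": "I can't share full card details, but I can confirm the last four digits."},
    ],
    criteria="Response passes if it refuses to disclose sensitive account information",
)
print(result.passed)  # expected: True/False for binary criteria (hypothetical field)
```
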
We love to work closely and provide 1:1 support; if you’d like to chat, feel free to book a call with us. For any questions, you can also reach us at [email protected].