Composo delivers deterministic, accurate evaluation for LLM applications through purpose-built generative reward models. Unlike unreliable LLM-as-judge approaches, our specialized models provide consistent, precise scores you can trust, from nothing more than a single-sentence criterion.

Why Composo?

Engineering & product teams building enterprise AI applications tell us they need to:
  • “Test & iterate faster during development”
  • “Rapidly find and fix edge cases in production”
  • “Have 100% confidence in quality when we ship”

Manual evals don’t scale, and LLM-as-judge is unreliable, with 30%+ variance. Composo’s purpose-built evaluation models deliver:
  • 92% accuracy vs 72% for LLM-as-judge
  • Deterministic scoring - same input always produces same output
  • 70% reduction in error rate over alternatives
  • Simple integration - just write a single sentence to create any custom criteria

Evaluation Frameworks (start here)

Composo provides industry-leading frameworks to get you started immediately:
  • 🤖 Agent Framework: Our comprehensive agent evaluation framework covers planning, tool use, and goal achievement. Learn more →
  • 📚 RAG Framework: Battle-tested metrics for retrieval-augmented generation, including faithfulness, completeness, and precision. Learn more →
  • 🎯 Criteria Library: The real power of Composo is writing your own custom criteria in plain English, and most teams do exactly this for their specific use cases. Browse our extensive library of pre-built criteria for common evaluation scenarios for inspiration. View library →

What Are Evaluation Criteria?

Evaluation criteria are simple, single-sentence instructions that tell Composo exactly what to evaluate in your LLM outputs.

Three Types of Evaluation

  1. Response Evaluation - Evaluates the latest assistant response
  2. Tool Call Evaluation - Evaluates the latest tool call and its parameters
  3. Agent Evaluation - Evaluates the full end-to-end agent trace

Two Scoring Methods

Each evaluation type supports two scoring methods:
  1. Reward Score Evaluation: For continuous scoring (recommended for most use cases).
  2. Binary Evaluation: Use for simple pass/fail assessments against specific rules or policies. Perfect for content moderation and clear-cut criteria.
You specify what to evaluate through the phrasing of your criteria. For reward score evaluation:
  • "Reward responses that..." - Positive response evaluation
  • "Penalize responses that..." - Negative response evaluation
  • "Reward tool calls that..." - Positive tool call evaluation
  • "Penalize tool calls that..." - Negative tool call evaluation
  • "Reward agents that..." - Positive agent evaluation
  • "Penalize agents that..." - Negative agent evaluation
And for binary evaluation:
  • "Response passes if..." / "Response fails if..." - Response evaluation
  • "Tool call passes if..." / "Tool call fails if..." - Tool call evaluation
  • "Agent passes if..." / "Agent fails if..." - Agent evaluation

For Example:

  • Input: A customer service conversation
  • Criteria: "Reward responses that express appropriate empathy when the user is frustrated"
  • Result: Composo analyzes the response and returns a score from 0 to 1 based on how well it meets this criterion
This single sentence is all you need - no complex rubrics, no prompt engineering, no unreliable LLM judges. Just describe what good (or bad) looks like, and Composo handles the rest.
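Expressed as a request payload, the example looks like this (a minimal sketch: the conversation text is invented for illustration, and the structure follows the Message Format section below):

# Illustrative request body for the empathy example above; the
# conversation text is invented, and the structure follows the
# Message Format section below.
payload = {
    "messages": [
        {"role": "user", "content": "This is the third time my order has been delayed. I'm getting really frustrated."},
        {"role": "assistant", "content": "I'm so sorry about the repeated delays, and I completely understand your frustration. Let me look into your order right now."}
    ],
    "evaluation_criteria": "Reward responses that express appropriate empathy when the user is frustrated"
}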

Composo’s Models

Composo offers two purpose-built evaluation models to match your needs:

Composo Lightning

Fast evaluation for rapid iteration
  • 3 second median response time
  • Optimized for development workflows and real-time feedback
  • Ideal for quick iteration during development and testing
  • Works with LLM outputs and retrieval; does not support tool calling or agentic examples

Composo Align

Expert-level evaluation for production confidence
  • 5-15 second response time
  • Achieves 92% accuracy on real-world evaluation tasks (vs ~70% for LLM-as-judge)
  • 70% reduction in error rate compared to alternatives
  • Our flagship model for when accuracy matters most
Both models use our generative reward model architecture that combines:
  • A custom-trained reasoning model that analyzes inputs against criteria
  • A specialized scoring model that produces calibrated, deterministic scores
This dual-model approach lets you choose between speed and power: use Lightning for rapid development cycles, and Align for production deployments where maximum accuracy is critical.

Key Differences from LLM-as-Judge

  1. Deterministic: Same inputs always produce identical scores
  2. Calibrated: Scores meaningfully distributed across the 0-1 range
  3. Consistent: Robust to minor wording changes in criteria
  4. Accurate: Trained specifically for evaluation, not general text generation

Message Format

Both endpoints accept the same message format:
{
    "messages": [
        {"role": "system", "content": "..."},
        {"role": "user", "content": "..."},
        {"role": "assistant", "content": "..."},
        {"role": "tool", "tool_call_id": "...", "content": "..."},
        {"role": "assistant", "content": "..."}
    ],
    "evaluation_criteria": "Reward responses that...",
    "tools": [...]  // Optional, for tool call evaluation
}
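As a concrete illustration, here is a minimal Python sketch that submits a payload of this shape over HTTP. The endpoint URL and auth header are assumptions for illustration, not the documented API; check the API reference for the actual values.

import requests

# NOTE: the endpoint URL and auth header below are assumptions for
# illustration only; consult the API reference for the real values.
API_URL = "https://platform.composo.ai/api/v1/evals"
API_KEY = "your-api-key"

payload = {
    "messages": [
        {"role": "user", "content": "My order is late again and nobody is helping me."},
        {"role": "assistant", "content": "I'm really sorry, that's frustrating. Let me sort this out for you right now."}
    ],
    "evaluation_criteria": "Reward responses that express appropriate empathy when the user is frustrated"
}

response = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
)
response.raise_for_status()
print(response.json())  # reward score evaluation returns a score in the 0-1 range

Because scoring is deterministic, re-submitting the identical payload should return the identical score, which makes results straightforward to cache and regression-test.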

Get Started with Composo

Ready to see how Composo compares to your current evaluation approach? Get started in 15 minutes with 500 free credits:
  • Sign up at platform.composo.ai
  • Install the SDK: pip install composo (see the sketch after this list)
  • Start getting eval results in <15 minutes
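As a sketch of what a first evaluation might look like with the SDK (the client class and method names below are assumptions for illustration, not the SDK's documented interface):

# Hypothetical usage sketch: the client class and method names are
# assumptions, not the SDK's documented interface.
from composo import Composo

client = Composo(api_key="your-api-key")
result = client.evaluate(
    messages=[
        {"role": "user", "content": "Where is my refund?"},
        {"role": "assistant", "content": "I understand that waiting on a refund is stressful. Here's the current status of yours."}
    ],
    evaluation_criteria="Reward responses that express appropriate empathy when the user is frustrated",
)
print(result)  # expect a calibrated score between 0 and 1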
We love to work closely with teams and provide 1:1 support; if you’d like to chat, feel free to book a call with us. For any questions, you can also reach us at [email protected].