Composo delivers deterministic, accurate evaluation for LLM applications through purpose-built generative reward models. Unlike unreliable LLM-as-judge approaches, our specialized models provide consistent, precise scores you can trust, with just a single-sentence criterion.

Why Composo?

Composo is designed to empower engineering & product teams to ship 10x faster, with complete confidence. What we hear from our customers is that they're looking to:

“Know exactly where our AI fails or hallucinates, and how to fix it”

“Instantly identify how each prompt and model change impacts quality”

“Prove reliability with quantitative metrics our customers trust”

This isn't possible with manual evals, which don't scale, or with LLM-based evals, which are extremely unreliable. Composo's purpose-built models provide evals that are accurate, precise & deterministic. You simply plug into any LLM generation, retrieval or tool call, and write a single sentence to create any custom criterion.

Check out our blog for more on the theory behind it, our detailed validation results & in-depth guides on how to evaluate a range of use cases, based on customers we've worked with.

What are Evaluation Criteria?

Evaluation criteria are simple, single-sentence instructions that tell Composo exactly what to evaluate in your LLM outputs. You write them in plain English, starting with either “Reward responses that…” or “Penalize responses that…”.

Example:

  • Input: A customer service conversation
  • Criteria: "Reward responses that express appropriate empathy when the user is frustrated"
  • Result: Composo analyzes the response and returns a score from 0-1 based on how well it meets this criterion

This single sentence is all you need - no complex rubrics, no prompt engineering, no unreliable LLM judges. Just describe what good (or bad) looks like, and Composo handles the rest.

Technical Architecture

Composo uses a generative reward model architecture that combines:

  • A custom-trained reasoning model that analyzes inputs against criteria
  • A specialized scoring model that produces calibrated, deterministic scores

This approach achieves 89% accuracy on real-world evaluation tasks compared to 72% for LLM-as-judge methods, with a 60% reduction in error rate.

Key Differences from LLM-as-Judge

  1. Deterministic: Same inputs always produce identical scores
  2. Calibrated: Scores meaningfully distributed across 0-1 range
  3. Consistent: Robust to minor wording changes in criteria
  4. Accurate: Trained specifically for evaluation, not general text generation

Core Capabilities

Composo offers two types of evaluation metrics:

  1. Response Evaluation - Assess the quality of LLM outputs
  2. Tool Call Evaluation - Measure how well LLMs use functions and tools

Each evaluation type supports two scoring methods:

  • Reward: Continuous scoring from 0 to 1
  • Binary: Pass/fail assessment

How It Works

The Evals API offers two primary evaluation mechanisms:

  1. Reward Score Evaluation: For continuous scoring (recommended for most use cases). Learn more »
  2. Binary Evaluation: Use for simple pass/fail assessments against specific rules or policies. Perfect for content moderation and clear-cut criteria. Learn more »

You specify what to evaluate through your criteria; a short sketch of both phrasings follows the lists below.

So for the reward endpoint:

  • "Reward responses that..." - Positive response evaluation
  • "Penalize responses that..." - Negative response evaluation

And for the binary endpoint:

  • "Response passes if..." - Positive response evaluation
  • "Response fails if..." - Negative response evaluation

Message Format

Both endpoints accept the same message format:

{
    "messages": [
        {"role": "system", "content": "..."},
        {"role": "user", "content": "..."},
        {"role": "assistant", "content": "..."},
        {"role": "function", "name": "...", "content": "..."},
        {"role": "assistant", "content": "..."}
    ],
    "evaluation_criteria": "Reward responses that..."
}
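
As a minimal sketch of calling the API from Python (the endpoint URL and authentication header below are illustrative placeholders, not documented values; see the API reference for the exact URL and auth scheme):

import requests

# Placeholder URL and auth header - check the API reference for the real values.
COMPOSO_URL = "https://api.composo.ai/evals/reward"
API_KEY = "your-api-key"

payload = {
    "messages": [
        {"role": "user", "content": "My order arrived damaged and I'm really frustrated."},
        {"role": "assistant", "content": "I'm so sorry about that - let's get a replacement sent out today."},
    ],
    "evaluation_criteria": "Reward responses that express appropriate empathy when the user is frustrated",
}

response = requests.post(
    COMPOSO_URL,
    json=payload,
    headers={"API-Key": API_KEY},  # placeholder header name
    timeout=60,
)
response.raise_for_status()
print(response.json())  # the reward endpoint scores continuously from 0 to 1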

FAQs

Should I include the system message when evaluating with Composo?

  • Including the system message in an evaluation API call is optional, but it provides useful context, so we recommend including it.

What’s the context limit?

  • 120k tokens

What’s an expected response time?

  • 5-15s per API call

Can I run parallel requests?

  • Yes, but we recommend limiting to 5 parallel API calls (see the sketch below)
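
For example, a minimal sketch in Python using a thread pool capped at 5 workers, where evaluate_one simply re-uses the placeholder request from the Message Format section above:

from concurrent.futures import ThreadPoolExecutor
import requests

def evaluate_one(payload):
    # Re-uses the placeholder URL and header from the earlier sketch.
    resp = requests.post(
        "https://api.composo.ai/evals/reward",
        json=payload,
        headers={"API-Key": "your-api-key"},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()

payloads = []  # fill with {"messages": ..., "evaluation_criteria": ...} dicts

# Cap concurrency at 5 parallel API calls, per the recommendation above.
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(evaluate_one, payloads))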

What languages are supported?

  • Our evaluation models support all major languages plus code. A good rule of thumb is that if you don't need a specialised model to handle your language, we can handle it.

Get Started with Composo

Ready to see how Composo compares to your current evaluation approach? Book in for a quick demo here.

For any questions or to learn more about how Composo can support your needs, reach out to us at [email protected].
