Should I include system messages when evaluating with Composo?

  • Including system messages is optional but recommended, as they provide useful context that can improve evaluation accuracy.

What’s the context limit?

  • 200k tokens

What’s the expected response time?

  • Composo Align: 5-15 seconds per API call
  • Composo Lightning: 3 seconds per API call

Can I run parallel requests?

  • Yes, we recommend limiting to 5 parallel API calls for optimal performance
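
One way to enforce that cap is an asyncio semaphore. A minimal sketch follows; the endpoint URL and auth header name are illustrative assumptions, so check the API reference for the exact values.

```python
import asyncio
import httpx

API_URL = "https://platform.composo.ai/api/v1/evals/reward"  # illustrative endpoint
HEADERS = {"API-Key": "YOUR_API_KEY"}  # illustrative auth header name

async def evaluate(client: httpx.AsyncClient, sem: asyncio.Semaphore, payload: dict) -> dict:
    # The semaphore caps in-flight requests at the recommended limit of 5.
    async with sem:
        resp = await client.post(API_URL, json=payload, headers=HEADERS)
        resp.raise_for_status()
        return resp.json()

async def evaluate_all(payloads: list[dict]) -> list[dict]:
    sem = asyncio.Semaphore(5)  # at most 5 parallel API calls
    async with httpx.AsyncClient(timeout=60.0) as client:
        return await asyncio.gather(*(evaluate(client, sem, p) for p in payloads))
```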

What are the rate limits?

  • Free plan: 500 requests per hour
  • Paid plans: Higher limits based on your specific plan

What languages are supported?

  • Our evaluation models support all major languages plus code. A good rule of thumb is that if you don’t need a specialized model to deal with your language, we can handle it.

What’s the difference between reward and binary evaluation?

  • Reward evaluation: Returns a continuous score from 0-1 measuring how well the output meets your criteria
  • Binary evaluation: Returns a simple pass/fail result for clear-cut criteria or policy compliance
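
As an illustration, the two modes suit different kinds of criteria; the criterion wording below is hypothetical.

```python
# Reward evaluation: a graded quality, scored continuously from 0 to 1
reward_criteria = "Reward responses that are accurate and grounded in the provided context."

# Binary evaluation: a clear-cut policy rule, scored pass/fail
binary_criteria = "Fail if the response reveals any personally identifiable information."
```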

Can I evaluate tool calls and agents, not just responses?

  • Yes! Composo evaluates three types of outputs:
    • Responses: The assistant’s latest response
    • Tool calls: Individual tool call parameters and selection
    • Agents: Complete end-to-end agent traces
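
For instance, a tool call can be evaluated by including the assistant's tool call in the conversation you send. The sketch below uses OpenAI-style message fields as an assumption; treat the field names and criterion wording as illustrative.

```python
messages = [
    {"role": "user", "content": "What's the weather in Paris?"},
    {
        "role": "assistant",
        # The tool call under evaluation: both the tool selection and its arguments
        "tool_calls": [{
            "id": "call_1",
            "type": "function",
            "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
        }],
    },
]
criteria = "Reward tool calls that select the correct tool with complete arguments."
```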

How deterministic are the evaluation scores?

  • Composo achieves <1% variance in scoring, meaning the same input will produce virtually identical scores every time. This compares to 30%+ variance typical with LLM-as-judge approaches. We also cache results for benchmark evaluations to ensure perfect repeatability across runs.

What do you mean by a generative reward model architecture?

  • It’s a dual-model system: one model generates detailed reasoning about why an output meets your criteria, while another specialized scoring model (trained on preference data) produces the actual score. This separation ensures both interpretable explanations and consistent, meaningful scores.
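
Conceptually, the flow looks like the sketch below. This is an illustrative stub of the two-stage design described above, not Composo's internal code.

```python
class ReasoningModel:
    """Stand-in for the model that writes the critique (illustrative stub)."""
    def generate(self, conversation: str, criteria: str) -> str:
        return f"The response satisfies '{criteria}' because ..."

class ScoringModel:
    """Stand-in for the preference-trained scorer (illustrative stub)."""
    def score(self, conversation: str, criteria: str, explanation: str) -> float:
        return 0.87  # placeholder: a real scorer returns a calibrated 0-1 value

def evaluate(conversation: str, criteria: str) -> tuple[str, float]:
    # Step 1: generate interpretable reasoning about the output vs. the criteria.
    explanation = ReasoningModel().generate(conversation, criteria)
    # Step 2: a separate scorer trained on preference data produces the score.
    score = ScoringModel().score(conversation, criteria, explanation)
    return explanation, score
```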

How complex is the integration?

  • Not complex at all: integration takes just 3 lines of code. You send your conversation and a simple evaluation criterion like “reward responses that are accurate.” All the complexity happens behind the scenes. It’s a drop-in replacement for anywhere you currently use LLM-as-judge.
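
A minimal sketch of what that looks like; the endpoint URL, header name, and field names are assumptions for illustration, so check the API reference for the exact schema.

```python
import requests

response = requests.post(
    "https://platform.composo.ai/api/v1/evals/reward",  # illustrative endpoint
    headers={"API-Key": "YOUR_API_KEY"},  # illustrative auth header
    json={
        "messages": [
            {"role": "user", "content": "What is your refund policy?"},
            {"role": "assistant", "content": "We offer full refunds within 30 days."},
        ],
        "evaluation_criteria": "Reward responses that are accurate.",
    },
)
print(response.json())  # e.g. a score and an explanation
```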

What makes Composo more accurate than LLM-as-judge?

  • We use purpose-built reward models trained on tens of thousands of human preference comparisons across real-world domains. Instead of asking an LLM to generate arbitrary scores, our models learn quality distributions through pairwise comparisons (similar to ELO rankings). This creates meaningful, consistent scoring that’s grounded in actual human judgments.
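
To make the ELO analogy concrete: pairwise preference training typically models the probability that output A is preferred over output B as a function of their score difference (the Bradley-Terry model). The toy sketch below illustrates that idea; it is not Composo's training code.

```python
import math

def preference_probability(score_a: float, score_b: float) -> float:
    # Bradley-Terry / ELO-style model: P(A preferred over B) = sigmoid(score_a - score_b)
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

# Training nudges scores so predicted probabilities match observed human choices:
# if humans preferred A in a comparison, score_a is pushed up relative to score_b.
print(preference_probability(1.2, 0.4))  # ~0.69
```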

How do you achieve such consistent scoring?

  • We use a multi-layered approach including ensemble techniques and statistical aggregation. Multiple specialized models analyze each evaluation, and we aggregate their outputs to eliminate random variance. This is fundamentally different from single-model LLM approaches that produce different scores each time.
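
Illustratively, aggregating an ensemble's scores with a robust statistic such as the median damps per-model noise. This is a toy sketch of the general idea, not the actual aggregation Composo uses.

```python
from statistics import median

def aggregate(scores: list[float]) -> float:
    # The median is robust to a single outlier judgment from one ensemble member.
    return median(scores)

print(aggregate([0.82, 0.85, 0.84]))  # 0.84
```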