Should I include system messages when evaluating with Composo?

  • Including system messages is optional but recommended, as they provide useful context that can improve evaluation accuracy.

What’s the context limit?

  • This is model dependent; see the context windows by model here.

What’s the expected response time?

  • This is model dependent; see the latency by model here.

What are the rate limits?

  • Free plan: 500 requests per hour
  • Paid plans: Higher limits based on your specific requirements
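If you expect to bump against the free-plan limit, a simple client-side backoff keeps your evaluation loop running. The sketch below is generic, not part of any Composo SDK: `RateLimitError` is a stand-in for whatever exception your HTTP client raises on an HTTP 429 response.

```python
import time


class RateLimitError(Exception):
    """Stand-in for the error your HTTP client raises on HTTP 429."""


def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry `call` with exponential backoff on rate-limit errors."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error to the caller
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

Wrap each evaluation request in `with_backoff` so transient 429s are retried instead of failing the run.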

What languages are supported?

  • Our evaluation models support all major languages plus code. A good rule of thumb is that if you don’t need a specialized model to deal with your language, we can handle it.

Can I evaluate tool calls, not just responses?

  • Yes! Composo evaluates all agent behaviour, including tool calls.

How deterministic are the evaluation scores?

  • Composo achieves <1% variance in scoring, meaning the same input will produce virtually identical scores every time. This compares to 30%+ variance typical with LLM-as-judge approaches. We also cache results for benchmark evaluations to ensure perfect repeatability across runs.

What do you mean by a generative reward model architecture?

  • It’s a dual-model system: one model generates detailed reasoning about why an output meets your criteria, while another specialized scoring model (trained on preference data) produces the actual score. This separation ensures both interpretable explanations and consistent, meaningful scores.
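The two-stage flow described above can be sketched as a pipeline. The interfaces here are hypothetical illustrations of the idea, not Composo's internal API: `reasoner` and `scorer` stand in for the reasoning and scoring models.

```python
def evaluate(conversation, criterion, reasoner, scorer):
    """Two-stage generative reward model sketch (hypothetical interfaces).

    1. `reasoner` writes a free-text explanation of how the conversation
       meets the criterion -- this is the interpretable part.
    2. `scorer` (trained on preference data) maps the inputs plus that
       explanation to a number -- this is the consistent part.
    """
    explanation = reasoner(conversation, criterion)
    score = scorer(conversation, criterion, explanation)
    return {"score": score, "explanation": explanation}
```

Separating the stages means the score never has to be parsed out of free text, which is one reason the numbers stay stable.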

How complex is the integration?

  • Integration takes just 3 lines of code. You send your conversation and a simple evaluation criterion like “reward responses that are accurate.” All the complexity happens behind the scenes. It’s a drop-in replacement for anywhere you currently use LLM-as-judge.
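To make the shape of the request concrete, here is a minimal sketch of packaging a conversation with a criterion. The field names, endpoint, and auth details are assumptions for illustration; the real ones come from Composo's API reference.

```python
def build_eval_request(messages, criterion):
    """Package a conversation and one natural-language criterion.

    Field names here are hypothetical -- check the API reference for the
    real request schema before sending this with your HTTP client.
    """
    return {
        "messages": messages,                 # the conversation to score
        "evaluation_criteria": [criterion],   # e.g. "reward responses that are accurate"
    }


request = build_eval_request(
    [{"role": "user", "content": "What is the capital of France?"},
     {"role": "assistant", "content": "Paris."}],
    "Reward responses that are accurate.",
)
```

The payload then goes to the evaluation endpoint with your API key, the same place you would otherwise call an LLM-as-judge prompt.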

What makes Composo more accurate than LLM-as-judge?

  • We use purpose-built reward models trained on tens of thousands of human preference comparisons across real-world domains. Instead of asking an LLM to generate arbitrary scores, our models learn quality distributions through pairwise comparisons (similar to Elo rankings). This creates meaningful, consistent scoring that’s grounded in actual human judgments.
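The Elo analogy above can be made concrete. This is the standard Elo rating update from a single pairwise preference, shown purely to illustrate how pairwise comparisons induce a quality scale; it is not Composo's training code.

```python
def elo_update(r_a, r_b, a_wins, k=32):
    """One Elo-style update from a single pairwise preference.

    The logistic model gives A's expected win probability:
        E_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    Each observed comparison nudges both ratings toward consistency
    with the outcome, scaled by the K-factor.
    """
    e_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    s_a = 1.0 if a_wins else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1 - s_a) - (1 - e_a))
```

Many such updates over human preference pairs yield a calibrated quality scale, which is the intuition behind grounding scores in comparisons rather than asking a model for an absolute number.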

How do you achieve such consistent scoring?

  • We use a multi-layered approach including ensemble techniques and statistical aggregation. Multiple specialized models analyze each evaluation, and we aggregate their outputs to eliminate random variance. This is fundamentally different from single-model LLM approaches that produce different scores each time.
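As a toy illustration of the aggregation step, combining several judges' scores with a robust statistic damps random variance. This is the general idea only, not Composo's actual aggregation method.

```python
import statistics


def ensemble_score(member_scores):
    """Combine scores from several evaluator models into one estimate.

    The median is robust to a single outlier judge; with roughly
    independent judges, aggregating n scores also shrinks random
    variance compared to trusting any single model's output.
    """
    return statistics.median(member_scores)
```

For example, one noisy judge reading 0.95 barely moves the result when the other four agree around 0.80.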