Should I include system messages when evaluating with Composo?

  • Including system messages is optional but recommended, as they provide useful context that can improve evaluation accuracy.

What’s the context limit?

  • 200k tokens

What’s the expected response time?

  • Composo Align (flagship model): 5-15 seconds per API call
  • Composo Lightning: 3 seconds per API call

Can I run parallel requests?

  • Yes. We recommend running no more than 5 API calls in parallel for optimal performance.
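
As a minimal sketch of how the 5-call cap could be enforced on the client side, the snippet below uses a thread pool from the Python standard library. The `evaluate_one` function is a hypothetical placeholder, not a Composo SDK call; swap in whatever function sends your actual evaluation request.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL_CALLS = 5  # the recommended cap from the answer above


def evaluate_one(item):
    """Hypothetical placeholder: replace with your real evaluation request."""
    return {"input": item, "score": None}


def evaluate_all(items, evaluate=evaluate_one):
    # The pool never runs more than MAX_PARALLEL_CALLS calls at once.
    # map() returns results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL_CALLS) as pool:
        return list(pool.map(evaluate, items))
```

Calling `evaluate_all(my_items, evaluate=my_request_fn)` would then process any number of items while keeping at most five requests in flight.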

What are the rate limits?

  • Free plan: 500 requests per hour
  • Paid plans: Higher limits based on your specific plan
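
To stay under an hourly limit such as the Free plan's 500 requests, one option is a small client-side throttle. The sketch below is a sliding-window limiter; the class name and parameters are illustrative assumptions, not part of any Composo SDK, and how the server actually counts requests is not specified here.

```python
import time
from collections import deque


class HourlyRateLimiter:
    """Client-side sliding-window limiter (illustrative sketch only)."""

    def __init__(self, max_requests=500, window_seconds=3600.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        self._sleep = sleep
        self._sent = deque()  # timestamps of requests inside the window

    def acquire(self):
        """Block until one more request fits in the window, then record it."""
        while True:
            now = self._clock()
            # Drop timestamps that have aged out of the window.
            while self._sent and now - self._sent[0] >= self.window_seconds:
                self._sent.popleft()
            if len(self._sent) < self.max_requests:
                self._sent.append(now)
                return
            # Wait until the oldest recorded request leaves the window.
            self._sleep(self.window_seconds - (now - self._sent[0]))
```

Call `limiter.acquire()` immediately before each request. This version is not thread-safe, so guard `acquire()` with a lock if several threads share one limiter.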

What languages are supported?

  • Our evaluation models support all major languages, plus code. As a rule of thumb, if you don’t need a specialized model to handle your language, we can evaluate it.

What’s the difference between reward and binary evaluation?

  • Reward evaluation: Returns a continuous score from 0-1 measuring how well the output meets your criteria
  • Binary evaluation: Returns a simple pass/fail result for clear-cut criteria or policy compliance

Can I evaluate tool calls and agents, not just responses?

  • Yes! Composo evaluates three types of outputs:
    • Responses: The assistant’s latest response
    • Tool calls: Individual tool call parameters and selection
    • Agents: Complete end-to-end agent traces

How deterministic are the evaluation scores?

  • Composo scores vary by less than 1% across repeated runs on the same input, so results are effectively reproducible. LLM-as-judge approaches, by contrast, have more than 30% variance.
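
If you want to check reproducibility on your own data, one approach is to score the same input several times and summarize the spread. The helper below is a hypothetical sketch that only summarizes a list of scores you have already collected. It reports the range and standard deviation as two possible readings of "variance", since the exact metric behind the figures above is not stated here.

```python
from statistics import mean, pstdev


def score_spread(scores):
    """Summarize repeated scores for one input (scores on the 0-1 scale)."""
    return {
        "mean": mean(scores),
        "range": max(scores) - min(scores),  # highest minus lowest score
        "stdev": pstdev(scores),             # population standard deviation
    }
```

A range near zero across repeated runs indicates that the scores for that input are stable.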