Composo delivers deterministic, accurate evaluation for LLM applications through purpose-built generative reward models. Unlike LLM-as-judge approaches, which can be unreliable, our specialized models provide consistent, precise scores you can trust, using just a single-sentence criterion.

Quickstart

Get up and running with Composo in under 5 minutes. This guide walks you through evaluating your first LLM response and reading the score and explanation that Composo returns.

Step 1: Create Your Account

Step 2: Generate Your API Key

Navigate to Profile → API Keys in the dashboard
Click Generate New API Key
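
Rather than pasting the generated key into source code, one common pattern is to read it from an environment variable. The sketch below assumes a variable named COMPOSO_API_KEY; that name and the load_api_key helper are illustrative choices for this guide, not something Composo prescribes.

```python
import os


def load_api_key(env_var: str = "COMPOSO_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    The variable name is an assumption for this sketch; use whatever
    name fits your deployment.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set {env_var} to the API key generated in the dashboard")
    return key
```

The returned value can be passed wherever this guide uses the "YOUR_API_KEY" placeholder.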

Step 3: Run Your First Evaluation

Install the SDK (optional if you only call the REST API):

pip install composo

Now let’s evaluate a customer service response for empathy using the Composo SDK:

from composo import Composo

# Initialize the client with your API key
composo_client = Composo(api_key="YOUR_API_KEY")

# Example: Evaluating a customer service response
result = composo_client.evaluate(
    messages=[
        {"role": "user", "content": "I'm really frustrated with my device not working."},
        {"role": "assistant", "content": "I'm sorry to hear that you're experiencing issues with your device. Let's see how I can assist you to resolve this problem."}
    ],
    criteria="Reward responses that express appropriate empathy if the user is facing a problem they're finding frustrating"
)

# Display results
print(f"Score: {result.score}")
print(f"Analysis: {result.explanation}")

The same evaluation can be sent directly to the REST API with curl:

curl -X POST "https://platform.composo.ai/api/v1/evals/reward" \
  -H "API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "I'\''m really frustrated with my device not working."
      },
      {
        "role": "assistant",
        "content": "I'\''m sorry to hear that you'\''re experiencing issues with your device. Let'\''s see how I can assist you to resolve this problem."
      }
    ],
    "evaluation_criteria": "Reward responses that express appropriate empathy if the user is facing a problem they'\''re finding frustrating"
  }'
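
Where neither the SDK nor curl is convenient, the same request can be assembled with Python's standard library. This is a sketch that reuses the endpoint, the API-Key header, and the JSON field names from the curl command above; the build_reward_request helper is hypothetical, and nothing is assumed here about the shape of the response body.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
REWARD_URL = "https://platform.composo.ai/api/v1/evals/reward"


def build_reward_request(api_key: str, messages: list, criteria: str) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {"messages": messages, "evaluation_criteria": criteria}
    return urllib.request.Request(
        REWARD_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


request = build_reward_request(
    "YOUR_API_KEY",
    [
        {"role": "user", "content": "I'm really frustrated with my device not working."},
        {"role": "assistant", "content": "I'm sorry to hear that you're experiencing issues with your device."},
    ],
    "Reward responses that express appropriate empathy if the user is facing a problem they're finding frustrating",
)
# To send it, call urllib.request.urlopen(request) and decode the JSON body.
```

Building the request separately from sending it keeps the payload easy to inspect before any network call is made.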

Understanding the Results

Composo returns:

Score: A value between 0 and 1 (e.g. 0.86 means the response strongly meets your criteria)
Explanation: Detailed analysis of why the response received this score

Example output:

Score: 0.86/1.0
Analysis: - The assistant directly acknowledges the user's difficulty and expresses sympathy ("I'm sorry to hear that you're experiencing issues"), showing clear empathy.
- The response is timely and supportive, immediately addressing the expressed frustration and not ignoring the emotional content.
- It constructively adds a collaborative next step ("Let's see how I can assist you"), enhancing the empathetic tone, with only minor room for deeper emotional mirroring.
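
Because the score is a number between 0 and 1, it is straightforward to turn it into a pass/fail check, for example inside a test suite. The sketch below is one possible approach; the passes helper and the 0.7 threshold are illustrative assumptions, not Composo recommendations.

```python
def passes(score: float, threshold: float = 0.7) -> bool:
    """Return True when a 0-1 evaluation score meets the chosen threshold."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    return score >= threshold


print(passes(0.86))  # True: the example score clears a 0.7 threshold
print(passes(0.40))  # False
```

Pick the threshold per criterion by looking at scores for responses you already consider good and bad.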

Step 4: Evaluate Agents with Tracing

For agent applications, Composo provides real-time tracing to capture and evaluate multi-agent interactions. Here’s a simple example with an orchestrator coordinating two sub-agents:

Python

from composo import Composo
from composo.models import criteria
from composo.tracing import ComposoTracer, Instruments, AgentTracer, agent_tracer
from openai import OpenAI

# Initialize tracing for OpenAI
ComposoTracer.init(instruments=[Instruments.OPENAI])
composo_client = Composo(api_key="YOUR_API_KEY")
openai_client = OpenAI()

# Define a simple sub-agent
@agent_tracer(name="research_agent")
def research_agent(topic):
    return openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Research: {topic}"}],
        max_tokens=50
    )

# Orchestrator coordinates multiple agents
with AgentTracer("orchestrator") as tracer:
    # First sub-agent: planning
    with AgentTracer("planning_agent"):
        plan = openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": "Plan a trip to Paris"}],
            max_tokens=50
        )

    # Second sub-agent: research
    research = research_agent("Paris attractions")

# Evaluate the full agent trace
results = composo_client.evaluate_trace(tracer.trace, criteria=criteria.agent)

for result, criterion in zip(results, criteria.agent):
    print(f"Criterion: {criterion}")
    print(f"Evaluation Result: {result}\n")

This example shows how Composo traces each agent’s LLM calls independently and evaluates them against our comprehensive agent framework.
