Get up and running with Composo in under 5 minutes. This guide will help you evaluate your first LLM response and understand how Composo delivers deterministic, accurate evaluations.

What You’ll Build

In this 5-minute quickstart, you’ll:
  • Set up your Composo account and API access
  • Evaluate an LLM response for quality and accuracy
  • Understand how to interpret Composo’s scores and explanations
  • Learn the difference between reward (0-1 scoring) and binary (pass/fail) evaluations

Step 1: Create Your Account

Sign up for a Composo account at platform.composo.ai.

Step 2: Generate Your API Key

  1. Navigate to Profile → API Keys in the dashboard
  2. Click Create New API Key
If your organization has a fine-tuned model with Composo, all API keys created under organization accounts automatically route to that fine-tuned model.
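Rather than pasting the key into source files, you may prefer to read it from the environment. The sketch below assumes an environment variable named COMPOSO_API_KEY. Both that name and the load_api_key helper are illustrative choices for this guide, not something the SDK is confirmed to require.

```python
import os


def load_api_key(env_var: str = "COMPOSO_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise if it is missing.

    The variable name is an assumption made for illustration only.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set {env_var} before running the examples in this guide.")
    return api_key
```

The returned value can then be passed as api_key wherever the examples in this guide use "YOUR_API_KEY".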

Step 3: Run Your First Evaluation

First, install the SDK:
pip install composo
Now let’s evaluate a customer service response for empathy using the Composo SDK:
from composo import Composo

# Initialize the client with your API key
composo_client = Composo(api_key="YOUR_API_KEY")

# Example: Evaluating a customer service response
result = composo_client.evaluate(
    messages=[
        {"role": "user", "content": "I'm really frustrated with my device not working."},
        {"role": "assistant", "content": "I'm sorry to hear that you're experiencing issues with your device. Let's see how I can assist you to resolve this problem."}
    ],
    criteria="Reward responses that express appropriate empathy if the user is facing a problem they're finding frustrating"
)

# Display results
print(f"Score: {result.score}")
print(f"Analysis: {result.explanation}")

Understanding the Results

Composo returns:
  • Score: A value between 0 and 1 (e.g. 0.86 means the response strongly meets your criteria)
  • Explanation: Detailed analysis of why the response received this score
Example output:
Score: 1.0
Analysis: The assistant expresses appropriate empathy and support in response to the user's frustration.
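If you want to act on a reward score in code, one option is to map it to a coarse label. The sketch below is a local helper only: describe_score and its 0.8 and 0.5 cut-offs are hypothetical values chosen for illustration, not thresholds defined by Composo, so tune them to your own use case.

```python
def describe_score(score: float, strong: float = 0.8, weak: float = 0.5) -> str:
    """Map a 0-1 reward score to a coarse label.

    The `strong` and `weak` thresholds are illustrative assumptions.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    if score >= strong:
        return "strongly meets the criteria"
    if score >= weak:
        return "partially meets the criteria"
    return "does not meet the criteria"


label = describe_score(0.86)
```

With these example thresholds, the 0.86 score mentioned above falls in the top band.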

Step 4: Try the Composo Agent Evaluation Framework

Our lightning model is not yet optimised for agent evaluation, so make sure you aren’t using it for this example.
from composo import Composo, criteria

composo_client = Composo(api_key="YOUR_API_KEY")

# Simple weather agent example
messages = [
    {"role": "user", "content": "What's the weather in Paris?"},
    {"role": "assistant", "content": None, "tool_calls": [
        {
            "id": "call_123",
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": "{\"location\": \"Paris, France\"}"
            }
        }
    ]},
    {"role": "tool", "tool_call_id": "call_123", "content": "Currently 15°C with clear skies"},
    {"role": "assistant", "content": "The weather in Paris is currently 15°C with clear skies."}
]

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City and country"}
                },
                "required": ["location"]
            }
        }
    }
]

# Evaluate with the agents framework
results = composo_client.evaluate(
    messages=messages,
    tools=tools,
    criteria=criteria.agent
)

for result in results:
    print(f"Score: {result.score}/1.00")
    print(f"Explanation: {result.explanation}\n")
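In the messages above, the tool call's arguments field is a JSON-encoded string rather than a dictionary, which is why the quotes are escaped by hand. A minimal sketch of building that message with json.dumps instead is shown below. tool_call_message is a hypothetical helper for this guide, not part of the Composo SDK.

```python
import json
from typing import Any


def tool_call_message(call_id: str, name: str, arguments: dict[str, Any]) -> dict[str, Any]:
    """Build an assistant message containing one function tool call.

    `arguments` is serialized with json.dumps so no manual escaping is needed.
    """
    return {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": call_id,
                "type": "function",
                "function": {"name": name, "arguments": json.dumps(arguments)},
            }
        ],
    }


message = tool_call_message("call_123", "get_weather", {"location": "Paris, France"})
```

The resulting dictionary has the same shape as the second entry of the messages list in the example above.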

Step 5: Try Different Evaluation Types

Binary Evaluation (Pass/Fail)

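For simple pass/fail checks, use a binary evaluation. With the SDK, the example below calls the same evaluate method and phrases the criterion as a fail condition: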
from composo import Composo

composo_client = Composo(api_key="YOUR_API_KEY")
result = composo_client.evaluate(
    messages=[
        {"role": "user", "content": "I'm having headaches every morning. What should I do?"},
        {"role": "assistant", "content": "You should consult a healthcare professional for proper advice."}
    ],
    criteria="Response fails if the assistant provides medical advice."
)

print(f"Passed: {result.score}")
print(f"Explanation: {result.explanation}")
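The two custom criteria in this guide open differently: the reward example begins with "Reward responses that ..." and the binary example with "Response fails if ...". The sketch below is a small local check that labels a criterion by its opening words. It assumes those openings are what separate the two styles, and the "Penalize" and "Response passes if" counterparts are further assumptions. criterion_kind is a hypothetical helper and makes no API calls.

```python
def criterion_kind(criterion: str) -> str:
    """Guess whether a criterion reads as a reward or a binary check.

    Based only on the opening words; the recognized prefixes are assumptions.
    """
    text = criterion.strip().lower()
    if text.startswith(("response passes if", "response fails if")):
        return "binary"
    if text.startswith(("reward", "penalize")):
        return "reward"
    return "unknown"


kind = criterion_kind("Response fails if the assistant provides medical advice.")
```

A check like this can catch a mistyped criterion before it is sent for evaluation.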

RAG Accuracy Evaluation

Evaluate how faithfully an LLM uses retrieved context:
from composo import Composo, criteria

composo_client = Composo(api_key="YOUR_API_KEY")

# Example RAG conversation with retrieved context
messages = [
    {
        "role": "user",
        "content": """What is the current population of Tokyo?

Context:
According to the 2020 census, Tokyo's metropolitan area has approximately 37.4 million residents, making it the world's most populous urban agglomeration. The Tokyo Metropolis itself has 14.0 million people."""
    },
    {
        "role": "assistant",
        "content": "Based on the 2020 census data provided, Tokyo has 14.0 million people in the metropolis proper, while the greater metropolitan area contains approximately 37.4 million residents, making it the world's largest urban agglomeration."
    }
]

# Evaluate with the RAG framework
results = composo_client.evaluate(
    messages=messages,
    criteria=criteria.rag
)

for result in results:
    print(f"Score: {result.score}/1.00")
    print(f"Explanation: {result.explanation}\n")
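The RAG example places the retrieved context inside the user message, after the question and a "Context:" label. A minimal sketch of assembling that message from a list of retrieved passages follows. build_rag_user_message is a hypothetical helper, and joining passages with blank lines is an assumption rather than a documented requirement.

```python
def build_rag_user_message(question: str, passages: list[str]) -> dict[str, str]:
    """Combine a question and retrieved passages into one user message.

    Empty passages are skipped; the rest are separated by blank lines.
    """
    context = "\n\n".join(p.strip() for p in passages if p.strip())
    return {
        "role": "user",
        "content": f"{question.strip()}\n\nContext:\n{context}",
    }


user_message = build_rag_user_message(
    "What is the current population of Tokyo?",
    ["The Tokyo Metropolis itself has 14.0 million people."],
)
```

The result can be used as the first entry in messages, followed by the assistant response you want to evaluate.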

What’s Next?

Now that you’ve run your first evaluation, explore more advanced features:
  1. SDK Documentation - Learn how to use the Python SDK
  2. Writing Effective Criteria - Learn how to craft precise evaluation criteria for your use case
  3. Criteria Library - Browse pre-built criteria for common evaluation scenarios
  4. Use Cases - See examples for RAG, customer service, content generation, and more