Get up and running with Composo in under 5 minutes. This guide will help you evaluate your first LLM response and understand how Composo delivers deterministic, accurate evaluations.

What You’ll Build

In this 5-minute quickstart, you’ll:

  • Set up your Composo account and API access
  • Evaluate an LLM response for quality and accuracy
  • Understand how to interpret Composo’s scores and explanations
  • Learn the difference between reward (0-1 scoring) and binary (pass/fail) evaluations

Step 1: Create Your Account

Sign up for a Composo account using the invitation link from our team. Once logged in, you’ll have access to the platform dashboard where you can monitor your evaluations and manage your API keys.

Step 2: Generate Your API Key

  1. Navigate to Profile → API Keys in the dashboard
  2. Click Create New API Key

If your organization has a fine-tuned model with Composo, all API keys created under organization accounts will automatically route to that fine-tuned model.
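
To avoid hard-coding the key into scripts, a common pattern (not specific to Composo) is to export it as an environment variable and read it at runtime. COMPOSO_API_KEY below is an illustrative variable name, not one the platform requires:

import os

# Read the key from the environment rather than embedding it in source control.
# COMPOSO_API_KEY is a hypothetical variable name chosen for this example.
api_key = os.environ["COMPOSO_API_KEY"]

headers = {"API-Key": api_key}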

Step 3: Run Your First Evaluation

Let’s evaluate a customer service response for empathy and helpfulness:

import requests

# Composo API endpoint
url = "https://platform.composo.ai/api/v1/evals/reward"

# Your API key
headers = {
    "API-Key": "YOUR_API_KEY"
}

# Example: Evaluating a customer service response
payload = {
    "messages": [
        {"role": "user", "content": "I'm really frustrated with my device not working."},
        {"role": "assistant", "content": "I'm sorry to hear that you're experiencing issues with your device. Let's see how I can assist you to resolve this problem."}
    ],
    "evaluation_criteria": "Reward responses that express appropriate empathy if the user is facing a problem they're finding frustrating"
}

# Make the API call
response = requests.post(url, headers=headers, json=payload)
result = response.json()

# Display results
print(f"Score: {result['score']}")
print(f"Analysis: {result.get('explanation', 'No feedback provided')}")

Understanding the Results

Composo returns:

  • Score: A value between 0 and 1 (e.g. 0.86 means the response strongly meets your criteria)
  • Explanation: Detailed analysis of why the response received this score

Example output:

Score: 0.67
Analysis: The assistant demonstrates appropriate empathy by acknowledging the user's frustration and offering assistance, though there was room for slightly more personalized emotional validation.
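
Because the score is a single number between 0 and 1, it is straightforward to gate on programmatically. As a minimal sketch (the 0.7 threshold here is an arbitrary illustration, not a Composo recommendation):

# Hypothetical quality gate: surface any response scoring below a chosen threshold.
THRESHOLD = 0.7

if result["score"] < THRESHOLD:
    print(f"Score {result['score']} is below {THRESHOLD} - flagging for review")
    print(result.get("explanation", "No feedback provided"))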

Step 4: Try Different Evaluation Types

Binary Evaluation (Pass/Fail)

For simple pass/fail checks, use the binary endpoint:

import requests

url = "https://platform.composo.ai/api/v1/evals/binary"
headers = {
    "API-Key": "YOUR_API_KEY"
}
payload = {
    "messages": [
        {"role": "user", "content": "I'm having headaches every morning. What should I do?"},
        {"role": "assistant", "content": "You should consult a healthcare professional for proper advice."}
    ],
    "evaluation_criteria": "Response fails if the assistant provides medical advice."
}

response = requests.post(url, headers=headers, json=payload)
result = response.json()

print(f"Passed: {result['passed']}")
print(f"Explanation: {result['explanation']}")

RAG Accuracy Evaluation

Evaluate how faithfully an LLM uses retrieved context:

import requests

url = "https://platform.composo.ai/api/v1/evals/reward"
headers = {
    "API-Key": 'your-api-key-here'
}

# Example: Evaluating how well an LLM uses provided context
payload = {
    "messages": [
        {
            "role": "user", 
            "content": """What is the current population of Tokyo?

Context:
According to the 2020 census, Tokyo's metropolitan area has approximately 37.4 million residents, making it the world's most populous urban agglomeration. The Tokyo Metropolis itself has 14.0 million people."""
        },
        {
            "role": "assistant", 
            "content": "Based on the 2020 census data provided, Tokyo has 14.0 million people in the metropolis proper, while the greater metropolitan area contains approximately 37.4 million residents, making it the world's largest urban agglomeration."
        }
    ],
    "evaluation_criteria": "Reward responses that accurately use the provided context and cite specific data points"
}

response = requests.post(url, headers=headers, json=payload)
result = response.json()

print(f"Score: {result['score']}")
print(f"Explanation: {result.get('explanation', 'No feedback provided.')}")

What’s Next?

Now that you’ve made your first evaluation, explore more advanced features:

  1. Writing Effective Criteria - Learn how to craft precise evaluation criteria for your use case
  2. Criteria Library - Browse pre-built criteria for common evaluation scenarios
  3. API Reference - Explore all available endpoints and parameters
  4. Use Cases - See examples for RAG, customer service, content generation, and more