Get up and running with Composo in under 5 minutes. This guide will help you evaluate your first LLM response and understand how Composo delivers deterministic, accurate evaluations.
What You’ll Build
In this 5-minute quickstart, you’ll:
- Set up your Composo account and API access
- Evaluate an LLM response for quality and accuracy
- Understand how to interpret Composo’s scores and explanations
- Learn the difference between reward (0-1 scoring) and binary (pass/fail) evaluations
Step 1: Create Your Account
Sign up for a Composo account using the invitation link from our team. Once logged in, you’ll have access to the platform dashboard where you can monitor your evaluations and manage your API keys.
Step 2: Generate Your API Key
- Navigate to Profile → API Keys in the dashboard
- Click Create New API Key
If your organization has a fine-tuned model with Composo, all API keys created under your organization’s accounts will automatically route to that fine-tuned model.
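Rather than pasting the key into your code, you may prefer to load it from the environment. A minimal sketch, assuming you’ve exported the key under a variable name of your choosing (COMPOSO_API_KEY here is illustrative, not a Composo convention):
import os

# Load the key from the environment so it stays out of source control.
# COMPOSO_API_KEY is an illustrative name, not a Composo requirement.
api_key = os.environ["COMPOSO_API_KEY"]
headers = {"API-Key": api_key}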
Step 3: Run Your First Evaluation
Let’s evaluate a customer service response for empathy and helpfulness:
import requests

# Composo API endpoint
url = "https://platform.composo.ai/api/v1/evals/reward"

# Your API key
headers = {
    "API-Key": "YOUR_API_KEY"
}

# Example: Evaluating a customer service response
payload = {
    "messages": [
        {"role": "user", "content": "I'm really frustrated with my device not working."},
        {"role": "assistant", "content": "I'm sorry to hear that you're experiencing issues with your device. Let's see how I can assist you to resolve this problem."}
    ],
    "evaluation_criteria": "Reward responses that express appropriate empathy if the user is facing a problem they're finding frustrating"
}

# Make the API call
response = requests.post(url, headers=headers, json=payload)
result = response.json()

# Display results
print(f"Score: {result['score']}")
print(f"Analysis: {result.get('explanation', 'No feedback provided')}")
Understanding the Results
Composo returns:
- Score: A value between 0 and 1 (e.g. 0.86 means the response strongly meets your criteria)
- Explanation: Detailed analysis of why the response received this score
Example output:
Score: 0.67
Analysis: The assistant demonstrates appropriate empathy by acknowledging the user's frustration and offering assistance, though there was room for slightly more personalized emotional validation.
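Because the score is continuous, you can gate on it programmatically, for example in a test suite. A minimal sketch, assuming a result dict like the one returned above (the 0.7 threshold is an arbitrary illustration, not a Composo recommendation):
result = {"score": 0.67}  # stand-in for a parsed API response

THRESHOLD = 0.7  # arbitrary quality bar for this example; tune it for your use case
if result["score"] >= THRESHOLD:
    print("Response meets the bar")
else:
    print(f"Below threshold: {result['score']:.2f} < {THRESHOLD}; review the explanation")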
Step 4: Try Different Evaluation Types
Binary Evaluation (Pass/Fail)
For simple pass/fail checks, use the binary endpoint:
import requests

url = "https://platform.composo.ai/api/v1/evals/binary"

headers = {
    "API-Key": "YOUR_API_KEY"
}

payload = {
    "messages": [
        {"role": "user", "content": "I'm having headaches every morning. What should I do?"},
        {"role": "assistant", "content": "You should consult a healthcare professional for proper advice."}
    ],
    "evaluation_criteria": "Response fails if the assistant provides medical advice."
}

response = requests.post(url, headers=headers, json=payload)
result = response.json()

print(f"Passed: {result['passed']}")
print(f"Explanation: {result['explanation']}")
RAG Accuracy Evaluation
Evaluate how faithfully an LLM uses retrieved context:
import requests

url = "https://platform.composo.ai/api/v1/evals/reward"

headers = {
    "API-Key": "YOUR_API_KEY"
}

# Example: Evaluating how well an LLM uses provided context
payload = {
    "messages": [
        {
            "role": "user",
            "content": """What is the current population of Tokyo?
Context:
According to the 2020 census, Tokyo's metropolitan area has approximately 37.4 million residents, making it the world's most populous urban agglomeration. The Tokyo Metropolis itself has 14.0 million people."""
        },
        {
            "role": "assistant",
            "content": "Based on the 2020 census data provided, Tokyo has 14.0 million people in the metropolis proper, while the greater metropolitan area contains approximately 37.4 million residents, making it the world's largest urban agglomeration."
        }
    ],
    "evaluation_criteria": "Reward responses that accurately use the provided context and cite specific data points"
}

response = requests.post(url, headers=headers, json=payload)
result = response.json()

print(f"Score: {result['score']}")
print(f"Explanation: {result.get('explanation', 'No feedback provided.')}")
What’s Next?
Now that you’ve run your first evaluation, explore more advanced features:
- Writing Effective Criteria - Learn how to craft precise evaluation criteria for your use case
- Criteria Library - Browse pre-built criteria for common evaluation scenarios
- API Reference - Explore all available endpoints and parameters
- Use Cases - See examples for RAG, customer service, content generation, and more