Get up and running with Composo in under 5 minutes. This guide will help you evaluate your first LLM response and understand how Composo delivers deterministic, accurate evaluations.
What You’ll Build
In this 5-minute quickstart, you’ll:
- Set up your Composo account and API access
- Evaluate an LLM response for quality and accuracy
- Understand how to interpret Composo’s scores and explanations
- Learn the difference between reward (0-1 scoring) and binary (pass/fail) evaluations
Step 1: Create Your Account
Sign up for a Composo account at platform.composo.ai.
Step 2: Generate Your API Key
- Navigate to Profile → API Keys in the dashboard
- Click Create New API Key
If your organization has a fine-tuned model with Composo, every API key created from an organization account automatically routes to that fine-tuned model.
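Rather than pasting the key into source files, you may prefer to read it from the environment. The sketch below is one way to do that. The variable name COMPOSO_API_KEY and the helper load_api_key are assumptions made for this illustration, not names defined by Composo.

```python
import os
from typing import Mapping, Optional

# Hypothetical variable name chosen for this sketch; not defined by the guide.
API_KEY_ENV_VAR = "COMPOSO_API_KEY"


def load_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key from the environment, or raise if it is missing."""
    env = os.environ if env is None else env
    key = env.get(API_KEY_ENV_VAR, "").strip()
    if not key:
        raise RuntimeError(f"Set {API_KEY_ENV_VAR} before running the examples.")
    return key
```

You could then build the client with Composo(api_key=load_api_key()) instead of a hard-coded string.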
Step 3: Run Your First Evaluation
First, install the SDK:
Now let’s evaluate a customer service response for empathy and helpfulness using the Composo SDK:
from composo import Composo
# Initialize the client with your API key
composo_client = Composo(api_key="YOUR_API_KEY")
# Example: Evaluating a customer service response
result = composo_client.evaluate(
    messages=[
        {"role": "user", "content": "I'm really frustrated with my device not working."},
        {"role": "assistant", "content": "I'm sorry to hear that you're experiencing issues with your device. Let's see how I can assist you to resolve this problem."}
    ],
    criteria="Reward responses that express appropriate empathy if the user is facing a problem they're finding frustrating"
)
# Display results
print(f"Score: {result.score}")
print(f"Analysis: {result.explanation}")
Understanding the Results
Composo returns:
- Score: A value between 0 and 1 (e.g. 0.86 means the response strongly meets your criteria)
- Explanation: Detailed analysis of why the response received this score
Example output:
Score: 1.0
Analysis: The assistant expresses appropriate empathy and support in response to the user's frustration.
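When you log many evaluations, it can help to turn each score into a short, readable line. The sketch below assumes only what the example above shows: a result with a score between 0 and 1 and an explanation. EvalResult is a stand-in class written for this illustration, and the 0.8 and 0.5 cut-offs are arbitrary choices, not thresholds recommended by Composo.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Stand-in for the SDK result object; only the two fields used above."""
    score: float
    explanation: str


def summarize(result: EvalResult) -> str:
    """Format a reward score with a rough, illustrative label."""
    if not 0.0 <= result.score <= 1.0:
        raise ValueError(f"score out of range: {result.score}")
    # Illustrative cut-offs only; pick bands that suit your own criteria.
    if result.score >= 0.8:
        label = "strongly meets criteria"
    elif result.score >= 0.5:
        label = "partially meets criteria"
    else:
        label = "does not meet criteria"
    return f"{result.score:.2f} ({label}): {result.explanation}"


print(summarize(EvalResult(0.86, "Expresses appropriate empathy.")))
```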
Step 4: Try the Composo Agent Evaluation Framework
Our lightning model is not yet optimised for agent evaluation, so make sure you are not using it for this example.
from composo import Composo, criteria
composo_client = Composo(api_key="YOUR_API_KEY")
# Simple weather agent example
messages = [
    {"role": "user", "content": "What's the weather in Paris?"},
    {"role": "assistant", "content": None, "tool_calls": [
        {
            "id": "call_123",
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": "{\"location\": \"Paris, France\"}"
            }
        }
    ]},
    {"role": "tool", "tool_call_id": "call_123", "content": "Currently 15°C with clear skies"},
    {"role": "assistant", "content": "The weather in Paris is currently 15°C with clear skies."}
]
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City and country"}
                },
                "required": ["location"]
            }
        }
    }
]
# Evaluate with the agents framework
results = composo_client.evaluate(
    messages=messages,
    tools=tools,
    criteria=criteria.agent
)
# The agent framework returns one result per criterion
for result in results:
    print(f"Score: {result.score}/1.00")
    print(f"Explanation: {result.explanation}\n")
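Agent traces are easy to get subtly wrong, for example a tool response whose tool_call_id matches no earlier call, or an arguments string that is not valid JSON. The sketch below is a local sanity check you could run before sending a trace for evaluation. It assumes the message shape used in the example above; check_tool_trace is a hypothetical helper and not part of the Composo SDK.

```python
import json
from typing import Any, Dict, List, Set


def check_tool_trace(messages: List[Dict[str, Any]]) -> List[str]:
    """Return a list of problems found in a tool-call message trace."""
    problems: List[str] = []
    pending: Set[Any] = set()  # tool call ids still waiting for a tool response
    for i, msg in enumerate(messages):
        for call in msg.get("tool_calls") or []:
            try:
                json.loads(call["function"]["arguments"])
            except (KeyError, TypeError, json.JSONDecodeError):
                problems.append(f"message {i}: arguments are not valid JSON")
            pending.add(call.get("id"))
        if msg.get("role") == "tool":
            call_id = msg.get("tool_call_id")
            if call_id in pending:
                pending.discard(call_id)
            else:
                problems.append(f"message {i}: unknown tool_call_id {call_id!r}")
    for call_id in sorted(pending, key=str):
        problems.append(f"tool call {call_id!r} has no response")
    return problems


# Same shape as the weather example above
trace = [
    {"role": "user", "content": "What's the weather in Paris?"},
    {"role": "assistant", "content": None, "tool_calls": [
        {"id": "call_123", "type": "function",
         "function": {"name": "get_weather",
                      "arguments": "{\"location\": \"Paris, France\"}"}}
    ]},
    {"role": "tool", "tool_call_id": "call_123", "content": "Currently 15°C with clear skies"},
    {"role": "assistant", "content": "The weather in Paris is currently 15°C with clear skies."},
]
print(check_tool_trace(trace))
```

An empty list means the trace passed these two checks; it says nothing about whether the agent behaved well, which is what the evaluation itself is for.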
Step 5: Try Different Evaluation Types
Binary Evaluation (Pass/Fail)
For simple pass/fail checks, phrase the criterion as a pass or fail condition (here, "Response fails if ...") and call evaluate as before:
from composo import Composo
composo_client = Composo(api_key="YOUR_API_KEY")
result = composo_client.evaluate(
    messages=[
        {"role": "user", "content": "I'm having headaches every morning. What should I do?"},
        {"role": "assistant", "content": "You should consult a healthcare professional for proper advice."}
    ],
    criteria="Response fails if the assistant provides medical advice."
)
print(f"Passed: {result.score}")
print(f"Explanation: {result.explanation}")
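In a test suite or CI job you will usually want a failed check to stop the run rather than just print a value. The sketch below shows one way to do that. It assumes a passing binary evaluation reports a score of 1 and a failing one reports 0, which this guide does not state explicitly, so verify it against your own results first. EvaluationFailed and require_pass are hypothetical names introduced for this example.

```python
class EvaluationFailed(Exception):
    """Raised when an evaluation score is below the pass threshold."""


def require_pass(score: float, explanation: str, threshold: float = 1.0) -> None:
    """Raise EvaluationFailed unless the score reaches the threshold."""
    # Assumption: a passing binary evaluation yields a score >= threshold.
    if score < threshold:
        raise EvaluationFailed(f"score {score} < {threshold}: {explanation}")


# A passing score returns silently; a failing one raises.
require_pass(1.0, "The assistant deferred to a healthcare professional.")
```

With the snippet above you would call require_pass(result.score, result.explanation) after the evaluation.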
RAG Accuracy Evaluation
Evaluate how faithfully an LLM uses retrieved context:
from composo import Composo, criteria
composo_client = Composo(api_key="YOUR_API_KEY")
# Example RAG conversation with retrieved context
messages = [
    {
        "role": "user",
        "content": """What is the current population of Tokyo?
Context:
According to the 2020 census, Tokyo's metropolitan area has approximately 37.4 million residents, making it the world's most populous urban agglomeration. The Tokyo Metropolis itself has 14.0 million people."""
    },
    {
        "role": "assistant",
        "content": "Based on the 2020 census data provided, Tokyo has 14.0 million people in the metropolis proper, while the greater metropolitan area contains approximately 37.4 million residents, making it the world's largest urban agglomeration."
    }
]
# Evaluate with the RAG framework
results = composo_client.evaluate(
    messages=messages,
    criteria=criteria.rag
)
# The RAG framework returns one result per criterion
for result in results:
    print(f"Score: {result.score}/1.00")
    print(f"Explanation: {result.explanation}\n")
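In a real pipeline the retrieved passages come from your retriever rather than a hand-written string. The sketch below assembles messages in the same layout as the example above, with the question followed by a "Context:" section in the user turn. build_rag_messages is a hypothetical helper written for this illustration, and joining passages with a blank line is a formatting choice, not a Composo requirement.

```python
from typing import Dict, List


def build_rag_messages(question: str, passages: List[str], answer: str) -> List[Dict[str, str]]:
    """Build a user/assistant message pair with retrieved context inlined."""
    # Drop empty passages and separate the rest with a blank line.
    context = "\n\n".join(p.strip() for p in passages if p.strip())
    user_content = f"{question.strip()}\nContext:\n{context}"
    return [
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": answer},
    ]


rag_messages = build_rag_messages(
    "What is the current population of Tokyo?",
    ["The Tokyo Metropolis itself has 14.0 million people."],
    "Tokyo has 14.0 million people in the metropolis proper.",
)
print(rag_messages[0]["content"])
```

The returned list can be passed as the messages argument in the evaluation call shown above.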
What’s Next?
Now that you’ve made your first evaluation, explore more advanced features:
- SDK Documentation - Learn how to use the Python SDK
- Writing Effective Criteria - Learn how to craft precise evaluation criteria for your use case
- Criteria Library - Browse pre-built criteria for common evaluation scenarios
- Use Cases - See examples for RAG, customer service, content generation, and more