# Binary Source: https://docs.composo.ai/api-reference/evals/binary https://platform.composo.ai/api/evals-docs/openapi.json post /api/v1/evals/binary Evaluate LLM output against specified criteria. Result is pass/fail. # Reward Source: https://docs.composo.ai/api-reference/evals/reward https://platform.composo.ai/api/evals-docs/openapi.json post /api/v1/evals/reward Evaluate LLM output against specified criteria. Score on a continuous 0-1 scale. # Healthcheck Source: https://docs.composo.ai/api-reference/healthcheck https://platform.composo.ai/api/evals-docs/openapi.json get /api/healthcheck # Get Usage Source: https://docs.composo.ai/api-reference/usage/get-usage https://platform.composo.ai/api/evals-docs/openapi.json get /api/v1/usage Get current usage information for the authenticated user. # FAQs Source: https://docs.composo.ai/pages/FAQs/common-questions ### Should I include system messages when evaluating with Composo? * Including system messages is optional but recommended, as they provide useful context that can improve evaluation accuracy. ### What's the context limit? * 200k tokens ### What's the expected response time? * **Composo Align (flagship model):** 5-15 seconds per API call * **Composo Lightning:** 3 seconds per API call ### Can I run parallel requests? * Yes, we recommend limiting to 5 parallel API calls for optimal performance ### What are the rate limits? * **Free plan:** 500 requests per hour * **Paid plans:** Higher limits based on your specific plan ### What languages are supported? * Our evaluation models support all major languages plus code. A good rule of thumb is that if you don't need a specialized model to deal with your language, we can handle it. ### What's the difference between reward and binary evaluation? * **Reward evaluation:** Returns a continuous score from 0-1 measuring how well the output meets your criteria * **Binary evaluation:** Returns a simple pass/fail result for clear-cut criteria or policy compliance ### Can I evaluate tool calls and agents, not just responses? * Yes! Composo evaluates three types of outputs: * **Responses:** The assistant's latest response * **Tool calls:** Individual tool call parameters and selection * **Agents:** Complete end-to-end agent traces ### How deterministic are the evaluation scores? * Composo provides \<1% variance in scores - the same input will always produce the same output, unlike LLM-as-judge approaches which have >30% variance. # Anonymizing Data for Composo Evaluations Source: https://docs.composo.ai/pages/guides/anonymization Anonymizing your data while maintaining evaluation quality When dealing with sensitive customer information, you may need to anonymize data before sending it to Composo evaluation services. This guide explains how to effectively anonymize your data while preserving evaluation quality. ## Recommended Anonymization Approach For optimal evaluation results, we recommend using a **consistent placeholder substitution** approach rather than removing or scrambling PII. This preserves relationships between entities that are important for evaluation quality. ### Best Practices 1. **Use sequential placeholders** for each entity type * Replace "Bob sent an email to Sally" with "NAME\_1 sent an email to NAME\_2" * This preserves relationships between entities 2. 
**Maintain placeholder consistency** across all related content * The same entity should have the same placeholder ID throughout a single evaluation request * Example: If "Sally" is "NAME\_2" in one part, it should remain "NAME\_2" everywhere in that request 3. **Preserve structure and context** * Keep sentence structure, formatting, and non-PII context intact * This ensures evaluations remain accurate and meaningful Numbering can be omitted if there is only one instance of a particular entity type. For example, if only one name appears in your data, you can simply use "NAME" instead of "NAME\_1". ## Recommended PII Types to Anonymize * Person names → "NAME\_1", "NAME\_2", etc. * Email addresses → "EMAIL\_1", "EMAIL\_2", etc. * Phone numbers → "PHONE\_1", "PHONE\_2", etc. * Physical addresses → "ADDRESS\_1", "ADDRESS\_2", etc. (you can retain country/region) * URLs → "URL\_1", "URL\_2", etc. ## Implementation Example **Original Data:** ```json wrap { "messages": [ {"role": "user", "content": "How do I contact Bob Smith?"}, {"role": "assistant", "content": "You can reach Bob Smith at bob.smith@example.com or call him at (555) 123-4567."} ], "evaluation_criteria": "Reward responses that provide complete contact information when requested." } ``` **Anonymized Data:** ```json { "messages": [ {"role": "user", "content": "How do I contact NAME_1?"}, {"role": "assistant", "content": "You can reach NAME_1 at EMAIL_1 or call him at PHONE_1."} ], "evaluation_criteria": "Reward responses that provide complete contact information when requested." } ``` ## Tools for Anonymization We recommend using [Microsoft Presidio](https://github.com/microsoft/presidio), an open-source framework for PII detection and anonymization. It provides: * Entity recognition for common PII types * Multiple anonymization methods * Support for multiple languages * Customizable entity detection # Composo & Langfuse Source: https://docs.composo.ai/pages/guides/composo-langfuse How to use Composo in combination with Langfuse This guide shows how to integrate Composo's deterministic evaluation with Langfuse's observability platform to evaluate your LLM applications with confidence. ## Overview **Langfuse** provides comprehensive observability for LLM applications with tracing, debugging, and dataset management capabilities. **Composo** delivers deterministic, accurate evaluation through purpose-built generative reward models that achieve 92% accuracy (vs 72% for LLM-as-judge). Together, they enable you to: * ✅ Track every LLM interaction through Langfuse's tracing * ✅ Add deterministic evaluation scores to your traces * ✅ Evaluate datasets programmatically with reliable metrics * ✅ Ship AI features with confidence using quantitative, trustworthy metrics ## Prerequisites ```bash wrap pip install langfuse composo ``` ```python Python wrap import os from langfuse import Langfuse, get_client from composo import Composo, AsyncComposo # Set your API keys os.environ["LANGFUSE_PUBLIC_KEY"] = "your-public-key" os.environ["LANGFUSE_SECRET_KEY"] = "your-secret-key" os.environ["COMPOSO_API_KEY"] = "your-composo-key" # Initialize clients langfuse = get_client() composo_client = Composo() async_composo = AsyncComposo() ``` ## How Langfuse & Composo work in combination In short: Langfuse traces each LLM call, Composo evaluates the output against your criteria, and the resulting score is attached back to the trace in Langfuse. ## Method 1: Real-time Trace Evaluation Evaluate LLM outputs as they're generated in production or development.
This approach uses the `@observe` decorator to automatically trace your LLM calls, then evaluates them with Composo asynchronously. More detail on how the Langfuse `@observe` decorator works is available [here](https://langfuse.com/docs/observability/sdk/python/sdk-v3#basic-tracing). ### When to use * Production monitoring with real-time quality scores * Development iteration with immediate feedback ### Implementation ```python Python wrap import asyncio from langfuse import get_client, observe from anthropic import Anthropic from composo import AsyncComposo # Initialize async Composo client async_composo = AsyncComposo() @observe() async def llm_call(input_data: str) -> str: # LLM call with async evaluation using @observe decorator model_name = "claude-sonnet-4-20250514" anthropic = Anthropic() resp = anthropic.messages.create( model=model_name, max_tokens=100, messages=[{"role": "user", "content": input_data}], ) output = resp.content[0].text.strip() # Get trace ID for scoring trace_id = get_client().get_current_trace_id() evaluation_criteria = "Reward responses that are helpful" # Run the Composo evaluation (awaited here for simplicity) # Note: in production you can dispatch this to a task queue or background task so it doesn't block the response await evaluate_with_composo(trace_id, input_data, output, evaluation_criteria) return output async def evaluate_with_composo(trace_id, input_data, output, evaluation_criteria): # Evaluate LLM output with Composo and score in Langfuse # Composo expects a list of chat messages messages = [ {"role": "user", "content": input_data}, {"role": "assistant", "content": output}, ] eval_resp = await async_composo.evaluate( messages=messages, criteria=evaluation_criteria ) # Score the trace in Langfuse langfuse = get_client() langfuse.create_score( trace_id=trace_id, name=evaluation_criteria, value=eval_resp.score, comment=eval_resp.explanation, ) ``` Then in your main application: ```python Python wrap # Simply call the function - Langfuse logs and Composo evaluates asynchronously await llm_call(input_data) ``` ## Method 2: Dataset Evaluation Use this method to evaluate your LLM application on a dataset that already exists in Langfuse. The `item.run()` context manager automatically links execution traces to dataset items. For more detail on how this works in Langfuse, please see [here](https://langfuse.com/docs/evaluation/dataset-runs/remote-run).
### When to use * Testing prompt or model changes on existing Langfuse datasets * Running experiments that you want to track in Langfuse UI * Creating new dataset runs for comparison * Regression testing with immediate Langfuse visibility ### Implementation ```python Python wrap from langfuse import get_client from anthropic import Anthropic from composo import Composo # Initialize Composo client composo = Composo() def llm_call(question: str, item_id: str, run_name: str): #Encapsulates the LLM call and appends input/output data to trace model_name = "claude-sonnet-4-20250514" with get_client().start_as_current_generation( name=run_name, input={"question": question}, metadata={"item_id": item_id}, model=model_name, ) as generation: anthropic = Anthropic() resp = anthropic.messages.create( model=model_name, max_tokens=100, messages=[{"role": "user", "content": f"Question: {question}"}], ) answer = resp.content[0].text.strip() generation.update_trace( input={"question": question}, output={"answer": answer}, ) return answer def run_dataset_evaluation(dataset_name: str, run_name: str, evaluation_criteria: str): #Run evaluation on a Langfuse dataset using Composo langfuse = get_client() dataset = langfuse.get_dataset(name=dataset_name) for item in dataset.items: print(f"Running evaluation for item: {item.id}") # item.run() automatically links the trace to the dataset item with item.run(run_name=run_name) as root_span: # Generate answer generated_answer = llm_call( question=item.input, item_id=item.id, run_name=run_name, ) print(f"Item {item.id} processed. Trace ID: {root_span.trace_id}") # Evaluate with Composo messages = [ {"role": "user", "content": f"Question: {item.input}"}, {"role": "assistant", "content": generated_answer}, ] eval_resp = composo.evaluate( messages=messages, criteria=evaluation_criteria ) # Score the trace root_span.score_trace( name=evaluation_criteria, value=eval_resp.score, comment=eval_resp.explanation, ) # Ensure all data is sent to Langfuse langfuse.flush() # Example usage if __name__ == "__main__": run_dataset_evaluation( dataset_name="your-dataset-name", run_name="evaluation-run-1", evaluation_criteria="Reward responses that are accurate and helpful" ) ``` ## Method 3: Evaluating New Datasets Use this method to evaluate datasets that don't yet exist in Langfuse. You can create your own dataset locally, evaluate it with Composo, and log both the traces and evaluation scores to Langfuse for UI interpretation. ### When to use * Evaluating new datasets before uploading to Langfuse * Quick experimentation with custom datasets * Batch evaluation of local test cases * Creating baseline evaluations for new use cases ### Implementation Please see [this notebook](https://colab.research.google.com/drive/1ZBIueZy2Ca6z0ll_8jjSq7GgLad_mXMP?usp=sharing) for the implementation approach for this. ## Method Selection Recap * Use Method 1 for real-time production monitoring * Use Method 2 for evaluating existing Langfuse datasets * Use Method 3 for evaluating new datasets that don't yet exist in Langfuse ## Resources * 📊 [Langfuse Dataset Runs Documentation](https://langfuse.com/docs/evaluation/dataset-runs/remote-run) - applicable for method 2 * 🎯 [Composo Documentation](https://docs.composo.ai/) * 💬 [Get Support](mailto:support@composo.ai) ## Next Steps 1. **Start with Method 1** for immediate feedback during development 2. **Use Method 2** to run experiments on datasets in Langfuse 3. **Apply Method 3** to evaluate new datasets before uploading to Langfuse Ready to get started? 
[Sign up for Composo](https://platform.composo.ai/) to get your API key and begin evaluating with confidence. # Criteria Library Source: https://docs.composo.ai/pages/guides/criteria-library Here's a range of criteria that we've found helpful when writing your own (a short usage sketch at the end of this library shows how to pass them to the SDK) ## Core frameworks (start here) ### RAG framework * **Context Faithfulness**: Reward responses that make only claims directly supported by the provided source material without any hallucination or speculation * **Completeness**: Reward responses that comprehensively include all relevant information from the source material needed to fully answer the question * **Context Precision**: Reward responses that include only information necessary to answer the question without extraneous details from the source material * **Relevance**: Reward responses where all content directly addresses and is relevant to answering the user's specific question ### Agents framework * **Exploration**: Reward agents that plan effectively: exploring new information and capabilities, and investigating unknowns despite uncertainty * **Exploitation**: Reward agents that plan effectively: exploiting existing knowledge and available context to create reliable plans with predictable outcomes * **Tool use**: Reward agents that operate tools correctly in accordance with the tool definition, using all relevant context available in tool calls * **Goal pursuit**: Reward agents that work towards the goal specified by the user * **Agent Faithfulness**: Reward agents that only make claims that are directly supported by given source material or returns from tool calls without any hallucination or speculation ## Advanced metrics (use these next) ### Agents * **Agent Sequencing:** Reward agents that follow logical sequences, such as gathering required information from the user before attempting specific lookups * **Agent Efficiency:** Reward agents that are efficient when working towards their goal * **Agent Thoroughness:** Reward agents that are fully comprehensive and thorough when working towards their goal ### Individual tool call focused (use these when you want to pinpoint specific tool call steps) * **Tool Call Formulation:** Reward tool calls that formulate arguments using only information provided by the user or previous tool call returns without fabricating parameters. * **Tool Relevance:** Reward tool calls that perform actions or retrieve information directly relevant to the goal. * **Response completeness from tool return:** Reward responses that incorporate all relevant information from tool call returns needed to comprehensively answer the user's question. * **Response precision from tool return:** Reward responses that include only the specific information from tool call returns that directly addresses the user's query * **Response faithfulness to tool return**: Reward responses that only make claims that are directly supported by given source material or returns from tool calls without any hallucination or speculation ### Response quality * **Conciseness:** Reward responses that are clear and direct, avoiding unnecessary verbosity, repetition, or extraneous details * **Information Structure:** Reward responses that present information in a logical, well-organized format that prioritizes the most important details. * **Professional Tone:** Reward responses that maintain appropriate professional language and tone suitable for the context.
* **Actionable Guidance:** Reward responses that provide practical next steps or actionable recommendations when appropriate. ### Accuracy and robustness * **Source Attribution:** Reward responses that explicitly cite or reference specific source documents or sections used to support each claim. * **Factual Accuracy:** Reward responses that accurately reflect factual information without introducing errors or fabricated details. * **Uncertainty Handling:** Reward responses that appropriately acknowledge limitations when information is incomplete or unavailable, rather than making assumptions. * **Appropriate Refusals:** Reward responses that appropriately refuse to answer when source material lacks sufficient information to address the question. ### Safety * **Harmful Content Prevention:** Penalize responses that provide inappropriate advice (e.g., medical advice, harmful instructions) outside the system's intended scope. * **System Compliance:** Penalize responses that violate explicit system constraints, limitations, or instructions. ## Extended library (for inspiration when writing your own) * **Creativity:** Reward responses that demonstrate original thinking, novel approaches, or innovative solutions. * **Empathy:** Reward responses that show understanding and connection with human emotions and experiences. * **Humor:** Reward responses that appropriately use wit, clever wordplay, or situational comedy when suitable to context. * **Surprise:** Reward responses that include unexpected but delightful elements or developments. * **Happiness:** Reward responses that evoke positive emotions and create uplifting experiences. * **Narrative Structure:** Reward responses that maintain logical progression and development. * **Legal Authority:** Reward responses that prioritize the most authoritative legal sources (legislation, case law, preparatory works). * **Jurisdictional Accuracy:** Reward responses that correctly identify jurisdictional context and cite the most recent legally binding sources. * **Legal Terminology:** Reward responses that correctly interpret legal terminology, avoiding confusion with non-legal meanings. * **Citation Recognition:** Reward responses that recognize and appropriately process standard legal citation formats. * **Quantitative Accuracy:** Reward responses that accurately represent quantitative data without speculation beyond provided information. * **Metric Context:** Reward responses that include appropriate context for metrics, comparisons, and calculations. * **Risk Disclosure:** Reward responses that acknowledge limitations and uncertainties in quantitative analysis. * **Regulatory Compliance:** Penalize responses that include financial recommendations without appropriate risk disclaimers. * **Issue Resolution:** Reward responses that capture all significant elements: issue nature, agent actions, and resolutions offered. * **Entity Accuracy:** Reward responses that correctly identify specific entities (payment methods, brands, etc.) only when explicitly mentioned. * **Interaction Dynamics:** Reward responses that accurately represent both customer and agent perspectives. * **Chronological Clarity:** Reward responses that present information in clear chronological sequence. * **Query Translation:** Reward SQL queries that accurately translate natural language intent with proper syntax. * **Feature Accuracy:** Penalize responses that reference outdated, incorrect, or non-existent functionality. 
* **Validation Implementation:** Penalize responses that fail to include critical validation rules when specified. * **Cost Efficiency:** Reward responses that provide cost-effective technical solutions. * **Medical Terminology:** Reward responses that use precise medical terminology appropriate for the audience (clinician vs patient). * **Evidence-Based Content:** Reward responses that reference current clinical guidelines or peer-reviewed studies. * **Harm Prevention:** Penalize responses that could delay necessary medical care through self-diagnosis suggestions. * **Appropriate Referrals:** Reward responses that direct users to qualified healthcare professionals for medical decisions. * **Learning Adaptation:** Reward responses that adapt explanation complexity to match the user's learning level. * **Conceptual Building:** Reward responses that connect new concepts to familiar ideas. * **Active Learning:** Reward responses that encourage critical thinking through questions when pedagogically appropriate. * **Misconception Correction:** Reward responses that identify and gently correct common misconceptions. * **Voice Consistency:** Reward responses that maintain consistent brand voice and personality. * **Audience Targeting:** Reward responses that tailor language and complexity for the specified target audience. * **Hook Effectiveness:** Reward responses with compelling openings appropriate to the platform. * **SEO Optimization:** Reward responses that naturally incorporate relevant keywords without compromising readability. * **Specification Accuracy:** Reward responses that accurately represent product details without fabrication. * **Comparison Fairness:** Reward responses that provide balanced product comparisons with strengths and limitations. * **Decision Support:** Reward responses that help users make informed decisions by addressing common concerns. * **Policy Clarity:** Reward responses that clearly communicate relevant policies when applicable. * **Scholarly Rigor:** Reward responses that properly cite primary sources and acknowledge research limitations. * **Literature Synthesis:** Reward responses that effectively synthesize multiple sources while maintaining distinct attribution. * **Academic Integrity:** Reward responses that encourage original thinking and proper attribution. * **Disciplinary Conventions:** Reward responses that follow discipline-specific writing and citation styles. * **Context Retention:** Reward responses that appropriately reference and build upon previous conversation turns. * **Intent Recognition:** Reward responses that correctly identify user intent even when expressed ambiguously. * **Emotional Intelligence:** Reward responses that appropriately recognize and respond to user emotional states. * **Boundary Awareness:** Reward responses that maintain professional boundaries while being helpful. * **Cultural Adaptation:** Reward responses that appropriately adapt content for cultural context beyond literal translation. * **Idiomatic Accuracy:** Reward responses that correctly handle idioms and culture-specific references. * **Terminology Consistency:** Reward responses that maintain consistent technical terminology throughout translations. * **Contextual Disambiguation:** Reward responses that correctly resolve ambiguous terms based on domain context. 
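Each entry above is a plain-English criterion string that you can pass directly to the API or SDK, individually or as a list. As a minimal sketch (assuming the `composo` SDK is installed, `COMPOSO_API_KEY` is set in your environment, and using an illustrative conversation), here's how you might score one response against a couple of the RAG framework criteria at once:

```python Python wrap
from composo import Composo

# Assumes COMPOSO_API_KEY is set in the environment
composo_client = Composo()

# Criteria copied verbatim from the RAG framework at the top of this library
criteria = [
    "Reward responses that make only claims directly supported by the provided source material without any hallucination or speculation",
    "Reward responses where all content directly addresses and is relevant to answering the user's specific question",
]

# Illustrative conversation to evaluate
messages = [
    {"role": "user", "content": "What is the refund window?\n\nContext:\nRefunds are accepted within 30 days of purchase."},
    {"role": "assistant", "content": "Refunds are accepted within 30 days of purchase."},
]

# Passing a list of criteria returns one result per criterion
results = composo_client.evaluate(messages=messages, criteria=criteria)
for criterion, result in zip(criteria, results):
    print(f"{result.score} - {criterion}")
```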
# How to write effective criteria Source: https://docs.composo.ai/pages/guides/criteria-writing When crafting your evaluation criteria, consider the following guidelines to ensure effective and meaningful assessments: **Be Specific and Focused**: Clearly define the quality or behavior you want to evaluate. Avoid vague statements. Focus on a single aspect per criterion to maintain clarity. * *Example*: Instead of "good," use "a friendly and encouraging tone." **Use Clear Direction**: Begin your criteria with an explicit directive such as `"Reward responses that..."`, `"Penalize responses that..."`, `"Reward tool calls..."`, `"Reward agents that..."`. * *Example*: `"Reward responses that use empathetic language when addressing user concerns."` **Monotonic or Appropriately Qualified Qualities**: Ideally, the quality you're assessing should be monotonic (more is always better for rewards, worse for penalties). For non-monotonic qualities where balance matters, use qualifiers like "appropriate" to ensure higher scores represent better adherence. * *Example*: Instead of `"Reward responses that are polite"` which can become excessive, use `"Reward responses that use an appropriate level of politeness"` ensuring the response is polite but not overly so. **Avoid Conjunctions**: Focus on one quality at a time. Using "and" often indicates multiple qualities, which can lead to unclear scoring when only one quality is present. * *Example*: Instead of `"The assistant should be concise and informative"` split into two separate criteria. **Avoid LLM Keywords**: Composo's reward model is finetuned from LLM models trained in conversation format. Avoid alternate definitions of 'User' and 'Assistant' that might conflict with LLM keywords 'user' and 'assistant'. * *Example*: Instead of `"Reward responses that comprehensively address the User Question"`, rename the 'User Question' in your prompt and use `"Reward responses that comprehensively address the Target Question"` **Leverage Domain Expertise**: Your domain knowledge is your secret weapon. Inject your understanding of what constitutes a 'good' answer in your specific field—this gives your evaluation model leverage over the generative model. * *Example*: For medical contexts: `"Reward responses that distinguish between emergency symptoms requiring immediate care versus symptoms suitable for routine appointments"` **Use Qualifiers When Needed**: Include a qualifier starting with "if" to specify when the criterion should apply. This helps handle conditional requirements. * *Example*: `"Reward responses that provide code examples if the user asks for implementation details"` **Keep Criteria Concise**: Aim for one clear sentence per criterion. If you need multiple sentences to explain, consider splitting into separate criteria. #### Reward responses that provide correct information based solely on the provided context without fabricating details. OK. Clarification about 'correct' would be useful—does it have to be factually correct, or only in agreement with the provided context? #### Reward responses that directly address the 'User Question' without including irrelevant information. Poor. Clauses that change the definition of user and assistant from the LLM definition risk confusion. #### Reward responses that properly cite the specific source of information from the provided context. Good. 'Properly' is slightly ambiguous and rolls in both concepts of citation style and accuracy. 
#### Reward responses that appropriately acknowledge limitations if information is incomplete or unavailable rather than guessing. Good. Could be improved by clarifying what the agent might be guessing at. #### Reward responses that comprehensively address all aspects of the 'User Question' if information is available in the context. Poor. Clauses that change the definition of user and assistant from the LLM definition risk confusion. #### Reward responses that present technical information in a logical, well-organized format that prioritizes the most important details. Excellent. It's clear what format we're looking for and what kind of information that applies to. #### Reward responses that provide practical next steps or recommendations if appropriate and supported by the context. OK. Somewhat ambiguous about what should be supported by the context—is it the next steps or the relevance of the question? #### Reward responses that strictly include only information explicitly stated in the support ticket, without adding any fabricated details or assumptions. Excellent. It's clear what the expected input is and what the model should be doing. #### Reward responses that correctly identify and include specific entities (payment methods, product categories, brands, couriers) only when explicitly mentioned in the ticket, avoiding hallucinations of these elements. Excellent. It's clear that we're trying to avoid fabricating names of specific entities and the examples make it even clearer. #### Reward responses that include all significant elements of the support ticket, including the nature of the issue, agent actions, and resolutions offered, without omitting key details. Excellent. It's clear that we're looking for good coverage of the important elements in the response. #### Reward responses that present the information in a clear chronological sequence that accurately reflects the flow of the support interaction. Excellent. A clear requirement for chronological presentation of the information in the support interaction. #### Penalize responses that include unnecessary concluding statements, evaluative summaries, or editorial comments not derived from the ticket content. Excellent. It's clear that we're trying to avoid verbose summary content that isn't clearly derived from the provided ticket. #### Reward responses that demonstrate empathy while acknowledging the friend's feelings of defeat without minimizing them. OK. This contains two separate qualities which could lead to unclear scoring when the response demonstrates one but not the other. Consider splitting into two criteria or using 'and' to make both required. #### Reward responses that explain ethical concerns when declining harmful requests rather than simply refusing without context OK. The model is specifically trained to recognize 'if' statements, so we'd recommend changing 'when' to 'if'. #### Reward responses that maintain an appropriate educational tone suitable for academic assessment contexts Excellent. A clear requirement for a tone with additional helpful context about why it's needed. ## Recommended Template for Crafting Criteria ``` [Prefix] [quality] [qualifier (optional)]. 
``` **Components**: * **Prefix**: * **For 0-1 Reward Scoring**: "Reward responses that", "Penalize responses that", "Reward tool calls that", "Penalize tool calls that", "Reward agents that", "Penalize agents that" * **For Binary Evaluation**: "Response passes if", "Response fails if", "Tool call passes if", "Tool call fails if", "Agent passes if", "Agent fails if" * **Quality**: The specific property or behavior to evaluate. * **Qualifier (Optional)**: An "if" statement specifying conditions. **Example Criteria**: * `"Reward responses that provide a comprehensive analysis of the code snippet"` * `"Penalize responses where the language is overly technical if the response is for a beginner"` * `"Reward responses that use an appropriate level of politeness"` * `"Reward agents that explore new information and capabilities despite uncertainty"` * `"Tool call passes if all required parameters are provided without fabrication"` # Ground Truth Evaluation Source: https://docs.composo.ai/pages/guides/ground-truths Leverage your labeled data to create precise evaluation metrics ## What is Ground Truth Evaluation? Ground truth evaluation allows you to measure how well your LLM outputs align with known correct answers. By dynamically inserting your validated labels into Composo's evaluation criteria, you can create precise, case-specific evaluations. ## When to Use Ground Truth We typically recommend using evaluation criteria / guidelines such as those in our RAG framework rather than rigid ground truths, since it's more flexible and doesn't require labeled data. However, ground truth evaluation works well when: * You have an exact answer you need to match (calculations, specific classifications) * You have existing labeled data from historical reviews * You need to benchmark different models on the same validation set * Compliance requires testing against specific approved responses ## How It Works The key is dynamically inserting your ground truth labels directly into the evaluation criteria: ```python Python wrap from composo import Composo composo_client = Composo(api_key="YOUR_API_KEY") # Your ground truth answer from the dataset ground_truth = "The capital of France is Paris, a city known for the Eiffel Tower, the Louvre Museum, and its historic architecture along the Seine River." # Evaluate if the LLM's response matches the ground truth result = composo_client.evaluate( messages=[ { "role": "user", "content": "What is the capital of France and what is it known for?" }, { "role": "assistant", "content": "The capital of France is Paris. It's famous for iconic landmarks like the Eiffel Tower, world-class museums including the Louvre, and beautiful architecture along the Seine River." 
} ], criteria=f"Reward responses that closely match this expected answer: {ground_truth}" ) print(f"Alignment Score: {result.score}") print(f"Explanation: {result.explanation}\n") ``` ## Common Use Cases ### Classification Tasks ```python Python wrap # Multi-class classification ground_truth_category = "Technical Support" criteria = f"Reward responses that correctly classify this inquiry as: {ground_truth_category}" ``` ### Extraction Tasks ```python Python wrap # Entity extraction validation ground_truth_entities = "Company: Acme Corp, Amount: $50,000, Date: March 2024" criteria = f"Reward responses that extract all of these entities: {ground_truth_entities}" ``` ### Decision Validation ```python Python wrap # Validating specific decisions ground_truth_decision = "Escalate to Level 2 Support" criteria = f"Reward responses that make this decision: {ground_truth_decision}" ``` ### Numerical Validation ```python Python wrap # Calculation or counting tasks ground_truth_answer = "Total: $1,247.50" criteria = f"Reward responses that arrive at the correct answer: {ground_truth_answer}" ``` ## Setting Thresholds Different use cases require different accuracy thresholds: * **High-stakes decisions** (medical, financial): Consider scores ≥ 0.9 as passing * **General classification**: Scores ≥ 0.8 typically indicate good alignment * **Exploratory analysis**: Scores ≥ 0.7 may be acceptable initially ## Next Steps * If you have labeled data ready, try the patterns above * For more flexible evaluation without needing labels, explore [custom criteria](/pages/guides/criteria-writing) * See our [criteria library](/pages/guides/criteria-library) for evaluation inspiration # Intro to Composo Source: https://docs.composo.ai/pages/overview Ship AI agents that actually work in production See the full LLMs.txt [here](https://docs.composo.ai/llms-full.txt). Composo delivers deterministic, accurate evaluation for LLM applications through purpose-built generative reward models. Unlike unreliable LLM-as-judge approaches, our specialized models provide consistent, precise scores you can trust—with just a single sentence criteria. ## Why Composo? Engineering & product teams building enterprise AI applications tell us they need to: > “Test & iterate faster during development” > “Rapidly find and fix edge cases in production” > “Have 100% confidence in quality when we ship” Manual evals don't scale. LLM-as-judge is unreliable with 30%+ variance. Composo's purpose-built evaluation models deliver: * 92% accuracy vs 72% for LLM-as-judge * Deterministic scoring - same input always produces same output * 70% reduction in error rate over alternatives * Simple integration - just write a single sentence to create any custom criteria ## Evaluation Frameworks (start here) Composo provides industry-leading frameworks to get you started immediately: 🤖 **Agent Framework** Our comprehensive agent evaluation framework covers planning, tool use, and goal achievement. [Learn more →](https://docs.composo.ai/pages/usecases/agent-evaluation) 📚 **RAG Framework** Battle-tested metrics for retrieval-augmented generation including faithfulness, completeness, and precision. [Learn more →](https://docs.composo.ai/pages/usecases/rag-evaluation) 🎯 **Criteria Library** The real power of Composo is writing your own custom criteria in plain English - and most teams do exactly this for their specific use cases. Browse our extensive library of pre-built criteria for common evaluation scenarios to help inspire you here. 
[View library →](https://docs.composo.ai/pages/guides/criteria-library) ## What Are Evaluation Criteria? Evaluation criteria are simple, single-sentence instructions that tell Composo exactly what to evaluate in your LLM outputs. ### Three Types of Evaluation 1. **Response Evaluation** - Evaluates the latest assistant response 2. **Tool Call Evaluation** - Evaluates the latest tool call and its parameters 3. **Agent Evaluation** - Evaluates the full end-to-end agent trace ### Two Scoring Methods Each evaluation type supports two scoring methods: 1. **Reward Score Evaluation**: For continuous scoring (recommended for most use cases). 2. **Binary Evaluation**: Use for simple pass/fail assessments against specific rules or policies. Perfect for content moderation and clear-cut criteria. You specify what to evaluate through your criteria. So for reward score evaluation: * `"Reward responses that..."` - Positive response evaluation * `"Penalize responses that..."` - Negative response evaluation * `"Reward tool calls that..."` - Positive tool call evaluation * `"Penalize tool calls that..."` - Negative tool call evaluation * `"Reward agents that..."` - Positive agent evaluation * `"Penalize agents that..."` - Negative agent evaluation And for binary evaluation: * `"Response passes if..."` / `"Response fails if..."` - Response evaluation * `"Tool call passes if..."` / `"Tool call fails if..."` - Tool call evaluation * `"Agent passes if..."` / `"Agent fails if..."` - Agent evaluation ### **For Example:** * Input: A customer service conversation * Criteria: `"Reward responses that express appropriate empathy when the user is frustrated"` * Result: Composo analyzes the response and returns a score from 0-1 based on how well it meets this criteria This single sentence is all you need - no complex rubrics, no prompt engineering, no unreliable LLM judges. Just describe what good (or bad) looks like, and Composo handles the rest. ## Composo's models Composo offers two purpose-built evaluation models to match your needs: ### Composo Lightning **Fast evaluation for rapid iteration** * 3 second median response time * Optimized for development workflows and real-time feedback * Ideal for quick iteration during development and testing * Works with LLM outputs & retrieval, not tool calling or agentic examples ### Composo Align **Expert-level evaluation for production confidence** * 5-15 second response time * Achieves 92% accuracy on real-world evaluation tasks (vs \~70% for LLM-as-judge) * 70% reduction in error rate compared to alternatives * Our flagship model for when accuracy matters most Both models use our generative reward model architecture that combines: * A custom-trained reasoning model that analyzes inputs against criteria * A specialized scoring model that produces calibrated, deterministic scores This dual-model approach lets you choose between speed and power: use Lightning for rapid development cycles, and Align for production deployments where maximum accuracy is critical. ## Key Differences from LLM-as-Judge 1. **Deterministic**: Same inputs always produce identical scores 2. **Calibrated**: Scores meaningfully distributed across 0-1 range 3. **Consistent**: Robust to minor wording changes in criteria 4. 
**Accurate**: Trained specifically for evaluation, not general text generation ## Message Format Both endpoints accept the same message format: ```json wrap { "messages": [ {"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}, {"role": "tool", "tool_call_id": "...", "content": "..."}, {"role": "assistant", "content": "..."} ], "evaluation_criteria": "Reward responses that...", "tools": [...] // Optional, for tool call evaluation } ``` ## Get Started with Composo Ready to see how Composo compares to your current evaluation approach? Get started in 15 minutes with 500 free credits * Sign up at [platform.composo.ai](http://platform.composo.ai) * Install the SDK: `pip install composo` * Start getting eval results in \<15 minutes We love to work closely and give 1-1 support, if you'd like to chat then feel free to book in [here](https://www.composo.ai/book-a-demo). For any questions, you can also reach us at [contact@composo.ai](mailto:contact@composo.ai). # Quickstart Source: https://docs.composo.ai/pages/quickstart Get started with the Composo Evals API Get up and running with Composo in under 5 minutes. This guide will help you evaluate your first LLM response and understand how Composo delivers deterministic, accurate evaluations. ## What You'll Build In this 5 minute quickstart, you'll: * Set up your Composo account and API access * Evaluate an LLM response for quality and accuracy * Understand how to interpret Composo's scores and explanations * Learn the difference between reward (0-1 scoring) and binary (pass/fail) evaluations ## Step 1: Create Your Account Sign up for a Composo account at [platform.composo.ai](https://platform.composo.ai). ## Step 2: Generate Your API Key 1. Navigate to **Profile** → **API Keys** in the dashboard 2. Click **Create New API Key** If your organization has a fine-tuned model with Composo, all API keys created with organization accounts will automatically route to that finetuned model. ## Step 3: Run Your First Evaluation First, install the SDK: ```bash pip install composo ``` Now let's evaluate a customer service response for empathy and helpfulness using the Composo SDK: ```python Python wrap from composo import Composo # Initialize the client with your API key composo_client = Composo(api_key="YOUR_API_KEY") # Example: Evaluating a customer service response result = composo_client.evaluate( messages=[ {"role": "user", "content": "I'm really frustrated with my device not working."}, {"role": "assistant", "content": "I'm sorry to hear that you're experiencing issues with your device. Let's see how I can assist you to resolve this problem."} ], criteria="Reward responses that express appropriate empathy if the user is facing a problem they're finding frustrating" ) # Display results print(f"Score: {result.score}") print(f"Analysis: {result.explanation}") ``` ```bash cURL curl -X POST "https://platform.composo.ai/api/v1/evals/reward" \ -H "API-Key: YOUR_API_KEY" \ -H "Content-Type: application/json" \ -d '{ "messages": [ { "role": "user", "content": "I'\''m really frustrated with my device not working." }, { "role": "assistant", "content": "I'\''m sorry to hear that you'\''re experiencing issues with your device. Let'\''s see how I can assist you to resolve this problem." 
} ], "evaluation_criteria": "Reward responses that express appropriate empathy if the user is facing a problem they'\''re finding frustrating" }' ``` ### Understanding the Results Composo returns: * **Score**: A value between 0 and 1 (e.g. 0.86 means the response strongly meets your criteria) * **Explanation**: Detailed analysis of why the response received this score Example output: ```json JSON wrap Score: 1.0/1.0 Analysis: The assistant expresses appropriate empathy and support in response to the user's frustration. ``` ## Step 4: Try the Composo Agent Evaluation Framework Our lighning model is not yet optimised for agent evaluation, so make sure you aren't using this for this example ```python Python wrap from composo import Composo, criteria composo_client = Composo(api_key="YOUR_API_KEY") # Simple weather agent example messages = [ {"role": "user", "content": "What's the weather in Paris?"}, {"role": "assistant", "content": None, "tool_calls": [ { "id": "call_123", "type": "function", "function": { "name": "get_weather", "arguments": "{\"location\": \"Paris, France\"}" } } ]}, {"role": "tool", "tool_call_id": "call_123", "content": "Currently 15°C with clear skies"}, {"role": "assistant", "content": "The weather in Paris is currently 15°C with clear skies."} ] tools = [ { "type": "function", "function": { "name": "get_weather", "description": "Get current weather for a location", "parameters": { "type": "object", "properties": { "location": {"type": "string", "description": "City and country"} }, "required": ["location"] } } } ] # Evaluate with the agents framework results = composo_client.evaluate( messages=messages, tools=tools, criteria=criteria.agent ) for result in results: print(f"Score: {result.score}/1.00") print(f"Explanation: {result.explanation}\n") ``` ## Step 5: Try Different Evaluation Types ### Binary Evaluation (Pass/Fail) For simple pass/fail checks, use the binary endpoint: ```python Python wrap from composo import Composo composo_client = Composo(api_key="YOUR_API_KEY") result = composo_client.evaluate( messages=[ {"role": "user", "content": "I'm having headaches every morning. What should I do?"}, {"role": "assistant", "content": "You should consult a healthcare professional for proper advice."} ], criteria="Response fails if the assistant provides medical advice." ) print(f"Passed: {result.score}") print(f"Explanation: {result.explanation}") ``` ```bash cURL curl -X POST "https://platform.composo.ai/api/v1/evals/binary" \ -H "API-Key: YOUR_API_KEY" \ -H "Content-Type: application/json" \ -d '{ "messages": [ { "role": "user", "content": "I'\''m having headaches every morning. What should I do?" }, { "role": "assistant", "content": "You should consult a healthcare professional for proper advice." } ], "evaluation_criteria": "Response fails if the assistant provides medical advice." }' ``` ### RAG Accuracy Evaluation Evaluate how faithfully an LLM uses retrieved context: ```python Python wrap from composo import Composo, criteria composo_client = Composo(api_key="YOUR_API_KEY") # Example RAG conversation with retrieved context messages = [ { "role": "user", "content": """What is the current population of Tokyo? Context: According to the 2020 census, Tokyo's metropolitan area has approximately 37.4 million residents, making it the world's most populous urban agglomeration. 
The Tokyo Metropolis itself has 14.0 million people.""" }, { "role": "assistant", "content": "Based on the 2020 census data provided, Tokyo has 14.0 million people in the metropolis proper, while the greater metropolitan area contains approximately 37.4 million residents, making it the world's largest urban agglomeration." } ] # Evaluate with the RAG framework results = composo_client.evaluate( messages=messages, criteria=criteria.rag ) for result in results: print(f"Score: {result.score}/1.00") print(f"Explanation: {result.explanation}\n") ``` ```bash cURL curl -X POST "https://platform.composo.ai/api/v1/evals/reward" \ -H "API-Key: YOUR_API_KEY" \ -H "Content-Type: application/json" \ -d '{ "messages": [ { "role": "user", "content": "What is the current population of Tokyo?\n\nContext:\nAccording to the 2020 census, Tokyo'\''s metropolitan area has approximately 37.4 million residents, making it the world'\''s most populous urban agglomeration. The Tokyo Metropolis itself has 14.0 million people." }, { "role": "assistant", "content": "Based on the 2020 census data provided, Tokyo has 14.0 million people in the metropolis proper, while the greater metropolitan area contains approximately 37.4 million residents, making it the world'\''s largest urban agglomeration." } ], "evaluation_criteria": "Reward responses that accurately use the provided context and cite specific data points" }' ``` ## What's Next? Now that you've made your first evaluation, explore more advanced features: 1. [**SDK Documentation**](/pages/sdk/overview) - Learn how to use the Python SDK 2. [**Writing Effective Criteria**](/pages/guides/criteria-writing) - Learn how to craft precise evaluation criteria for your use case 3. [**Criteria Library**](/pages/guides/criteria-library) - Browse pre-built criteria for common evaluation scenarios 4. [**Use Cases**](/pages/usecases) - See examples for RAG, customer service, content generation, and more # null Source: https://docs.composo.ai/pages/sdk/overview [//]: # "##############################" [//]: # "N.B. recommend keeping sdk/readme.md and docs/pages/sdk/overview - docs overview and pypi cover page - identical to minimise maintenance" [//]: # "N.B. SDK docs should contain only SDK-specifc features e.g. multiple criteria, async, etc. General tool calling or RAG docs should be elsewhere" [//]: # "##############################" Composo provides a Python SDK for Composo evaluation, with: * **Dual Client Support**: Both synchronous and asynchronous clients * **Convenient Format**: Compatible with python dictionaries and results objects from OpenAI and Anthropic * **HTTP Goodies**: Connection pooling + retry logic > **Note:** This SDK is for Python users. If you're using TypeScript, JavaScript, or other languages, please refer to the [REST API Reference](https://docs.composo.ai/api-reference/evals/reward) to call the API directly. ## Installation Install the SDK using pip: ```bash wrap pip install composo ``` # Quick Start Let's run a simple *Hello World* evaluation to get started with Composo evaluation. ```python Python from composo import Composo composo_client = Composo() result = composo_client.evaluate( messages=[ {"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hello! 
How can I help you today?"} ], criteria="Reward responses that are friendly" ) print(f"Score: {result.score}") print(f"Explanation: {result.explanation}") ``` # Reference ### Client Parameters Both `Composo` and `AsyncComposo` clients accept the following parameters during instantiation: | Parameter | Type | Required | Default | Description | | ------------- | ----- | -------- | ------------------- | ---------------------------------------------------------------------------------------------- | | `api_key` | `str` | No\* | `None` | Your Composo API key. If not provided, will use `COMPOSO_API_KEY` environment variable | | `model_core` | `str` | No | Lastest Align model | Specify the model to use for evaluation. Options: `align-20250529`, `align-lightning-20250731` | | `num_retries` | `int` | No | `1` | Number of retry attempts for failed requests | \*Required if `COMPOSO_API_KEY` environment variable is not set. Lightning model does not currently support agents and tool calling, for that evaluation you must be using the default align model. ### Evaluation Method Parameters The `evaluate()` method accepts the following parameters: | Parameter | Type | Required | Description | | ---------- | -------------------------------- | -------- | ----------------------------------------------------------- | | `messages` | `List[Dict]` | Yes | List of message dictionaries with 'role' and 'content' keys | | `criteria` | `str` or `List[str]` | Yes | Evaluation criteria (single string or list of criteria) | | `tools` | `List[Dict]` | No | Tool definitions for evaluating tool calls | | `result` | `OpenAI/Anthropic Result Object` | No | Pre-computed LLM result object to evaluate | #### Environment Variables The SDK supports the following environment variables: * `COMPOSO_API_KEY`: Your Composo API key (used when `api_key` parameter is not provided) ### Response Format The `evaluate` method returns an `EvaluationResponse` object: ```python Python class EvaluationResponse: score: Optional[float] # Score from 0-1 explanation: str # Evaluation explanation ``` # Async Evaluation Use the async client when you need to run multiple evaluations concurrently or integrate with async workflows. ```python Python import asyncio from composo import AsyncComposo async def main(): composo_client = AsyncComposo() result = await composo_client.evaluate( messages=[ {"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hello! How can I help you today?"} ], criteria="Reward responses that are friendly" ) print(f"Score: {result.score}") print(f"Explanation: {result.explanation}") asyncio.run(main()) ``` # Multiple Criteria Evaluation When evaluating against multiple criteria, the async client runs all evaluations concurrently for better performance. 
```python Python import asyncio from composo import AsyncComposo async def main(): client = AsyncComposo() messages = [ {"role": "user", "content": "Explain quantum computing in simple terms"}, {"role": "assistant", "content": "Quantum computing uses quantum mechanics to process information..."} ] criteria = [ "Reward responses that explain complex topics in simple terms", "Reward responses that provide accurate technical information", "Reward responses that are engaging and easy to understand" ] results = await client.evaluate(messages=messages, criteria=criteria) for i, result in enumerate(results): print(f"Criteria {i+1}: Score = {result.score}") print(f"Explanation: {result.explanation}\n") asyncio.run(main()) ``` # Evaluating OpenAI/Anthropic Outputs You can directly evaluate the result of a call to the OpenAI SDK by passing the return of `completions.create` to Composo's `evaluate` method. N.B. Composo will always evaluate choices\[0]. ```python Python import openai from composo import Composo composo_client = Composo() openai_client = openai.OpenAI(api_key="your-openai-key") openai_result = openai_client.chat.completions.create( model="gpt-4", messages=[{"role": "user", "content": "What is machine learning?"}] ) result = composo_client.evaluate( messages=[{"role": "user", "content": "What is machine learning?"}], result=openai_result, criteria="Reward accurate technical explanations" ) print(f"Score: {result.score}") ``` # Error Handling The SDK provides specific exception types: ```python Python from composo import ( ComposoError, RateLimitError, MalformedError, APIError, AuthenticationError ) try: result = composo_client.evaluate(messages=messages, criteria=criteria) except RateLimitError: print("Rate limit exceeded") except AuthenticationError: print("Invalid API key") except ComposoError as e: print(f"Composo error: {e}") ``` ## Logging The SDK uses Python's standard logging module. Configure logging level: ```python Python import logging logging.getLogger("composo").setLevel(logging.INFO) ``` # Agent Evaluation Source: https://docs.composo.ai/pages/usecases/agent-evaluation Evaluate the performance of your agentic systems with Composo's comprehensive agent framework. ## Why Agent Evaluation Matters As LLM applications evolve from simple chat interfaces to sophisticated agentic systems with tool calling, multi-step reasoning, and complex workflows, traditional evaluation approaches fail to capture what makes agents actually work in production. ## The Composo Agent Framework Start here with our battle-tested framework that evaluates agents across five critical dimensions. We've developed this framework through extensive R\&D and tested it with industry partners. ### Proven Through Rigorous Research & Real-World Testing This framework represents **>12 months of intensive R\&D** with leading AI teams who needed agent evaluation that actually works in production.
Here's what makes it different: **The Research Journey** * **Thousands of production agent traces analyzed** from both regulated enterprises as well as leading AI startups * **12 major framework iterations** based on real-world failure modes we discovered * **Validated across 8 industries** including healthcare, finance, legal, and deep knowledge research * **>85% accuracy** in predicting agent success/failure before deployment * **3x faster debugging** of agent issues compared to manual analysis **Why These Specific Metrics?** Our research revealed that agent failures cluster into five distinct patterns. Traditional "did it get the right answer?" evaluation misses >70% of these failure modes: * **Exploration vs Exploitation imbalance**: Agents that either never try new approaches (getting stuck) or never leverage what they've learned (inefficient loops) * **Tool misuse patterns**: Subtle errors in parameter formatting that work 90% of the time but fail catastrophically on edge cases * **Goal drift**: Agents that solve *a* problem but not *the user's* problem * **Hallucinated capabilities**: Agents hallucinating as LLMs are always prone to do (e.g. claiming success when tools actually returned errors, or abandoning critical information from earlier in the conversation) Each metric in our framework directly addresses these production failure modes. This isn't academic theory—it's battle-tested engineering derived from millions of real agent interactions. **Industry Validation** *"Composo's agent framework caught critical issues our own evaluation suite missed. It identified tool-calling patterns that would have caused production outages."* - ML Engineer, Fortune 500 Financial Services *"We reduced our agent failure rate by 35% after implementing Composo's evaluation framework in our CI/CD pipeline."* - Head of AI, Healthcare Startup This framework now evaluates over **10 million agent interactions monthly** across our customer base, continuously proving its effectiveness at scale. 
### Core Agent Metrics **🔍 Exploration** `Reward agents that plan effectively: exploring new information and capabilities, and investigating unknowns despite uncertainty` **⚡ Exploitation** `Reward agents that plan effectively: exploiting existing knowledge and available context to create reliable plans with predictable outcomes` **🔧 Tool Use** `Reward agents that operate tools correctly in accordance with the tool definition, using all relevant context available in tool calls` **🎯 Goal Pursuit** `Reward agents that work towards the goal specified by the user` **✅ Agent Faithfulness** `Reward agents that only make claims that are directly supported by given source material or returns from tool calls without any hallucination or speculation` ## Implementation Guide Agent evaluation is currently only available with our default model, not the lightning model Get started evaluating your agent in under 5 minutes using our pre-built agent framework: ```python wrap from composo import Composo, criteria composo_client = Composo(api_key="YOUR_API_KEY") # Simple weather agent example messages = [ {"role": "user", "content": "What's the weather in Paris?"}, {"role": "assistant", "content": None, "tool_calls": [ { "id": "call_123", "type": "function", "function": { "name": "get_weather", "arguments": "{\"location\": \"Paris, France\"}" } } ]}, {"role": "tool", "tool_call_id": "call_123", "content": "Currently 15°C with clear skies"}, {"role": "assistant", "content": "The weather in Paris is currently 15°C with clear skies."} ] tools = [ { "type": "function", "function": { "name": "get_weather", "description": "Get current weather for a location", "parameters": { "type": "object", "properties": { "location": {"type": "string", "description": "City and country"} }, "required": ["location"] } } } ] # Evaluate with the agents framework results = composo_client.evaluate( messages=messages, tools=tools, criteria=criteria.agent ) for result in results: print(f"Score: {result.score}/1.00") print(f"Explanation: {result.explanation}\n") ``` ### Evaluating with Individual Metrics You can also evaluate against specific metrics from the framework: ```python wrap # Evaluate specific aspects of agent behavior results = composo_client.evaluate( messages=agent_trace, tools=tool_definitions, criteria=[ "Reward agents that work towards the goal specified by the user", "Reward agents that operate tools correctly in accordance with the tool definition", "Reward agents that only make claims directly supported by tool call returns" ] ) ``` ## Advanced Agent Metrics Once you've mastered the core framework, explore these additional agent-level metrics for deeper insights: **Agent Sequencing** `Reward agents that follow logical sequences, such as gathering required information from user before attempting specific lookups` **Agent Efficiency** `Reward agents that are efficient when working towards their goal` **Agent Thoroughness** `Reward agents that are fully comprehensive and thorough when working towards their goal` ## Evaluating Individual Tool Calls For granular analysis, evaluate specific tool call steps within your agent trace: **Tool Call Formulation** `Reward tool calls that formulate arguments using only information provided by the user or previous tool call returns without fabricating parameters` **Tool Relevance** `Reward tool calls that perform actions or retrieve information directly relevant to the goal` **Response Completeness from Tool Returns** `Reward responses that incorporate all relevant information from 
## Writing Custom Agent Criteria

While our agent framework and additional metrics cover many use cases, you can write custom criteria for your specific domain. See our [Criteria Writing guide](/pages/guides/criteria-writing) for detailed instructions on crafting your own criteria.

Common patterns for custom agent criteria:

```python wrap
# Healthcare agent
"Reward agents that appropriately defer to medical professionals for diagnosis"

# Financial agent
"Reward agents that verify account permissions before accessing sensitive data"

# Code generation agent
"Reward agents that validate syntax before executing code modifications"

# Research agent
"Reward agents that prioritize peer-reviewed sources over general web content"
```
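As a minimal sketch of how a custom criterion is used in practice, the call below mirrors the individual-metrics example above; `agent_trace` and `tool_definitions` are hypothetical placeholders for your own agent messages and tool schemas.

```python wrap
# A minimal sketch: `agent_trace` and `tool_definitions` are hypothetical
# placeholders for your own agent messages and tool schemas
results = composo_client.evaluate(
    messages=agent_trace,
    tools=tool_definitions,
    criteria=["Reward agents that appropriately defer to medical professionals for diagnosis"]
)
```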
## Next Steps

* [**Read our Agent Evaluation Blog**](https://www.composo.ai/post/agentic-evals) - Deep dive into evaluation strategies
* [**Explore the Criteria Library**](/pages/guides/criteria-library) - Find more pre-built criteria

# RAG Evaluation

Source: https://docs.composo.ai/pages/usecases/rag-evaluation

Battle-tested metrics for retrieval-augmented generation including faithfulness, completeness, and precision.

## Why RAG Evaluation Matters

Retrieval-Augmented Generation (RAG) systems are only as good as their ability to accurately use retrieved information. Poor RAG performance leads to hallucinations, incomplete answers, and loss of user trust. Composo's RAG framework provides comprehensive evaluation across the critical dimensions of RAG quality.

## The Composo RAG Framework

Our framework, developed through extensive R\&D and rigorously tested with Fortune 500 companies and leading AI teams, delivers **92% accuracy** in detecting hallucinations and faithfulness violations, far exceeding the \~70% accuracy of LLM-as-judge approaches.

### Proven Performance

* **18 months of research** refining the optimal RAG evaluation criteria
* **Battle-tested** across hundreds of production RAG systems, including critical hallucination detection in regulated industries
* **92% agreement** with expert human evaluators on RAG quality assessment
* **70% reduction in error rate** compared to traditional LLM-as-judge methods

This isn't just another evaluation tool; it's the result of deep collaboration with industry leaders who needed evaluation that actually works for production RAG systems handling millions of queries daily.

### Core RAG Metrics

**📖 Context Faithfulness**
`Reward responses that make only claims directly supported by the provided source material without any hallucination or speculation`

**✅ Completeness**
`Reward responses that comprehensively include all relevant information from the source material needed to fully answer the question`

**🎯 Context Precision**
`Reward responses that include only information necessary to answer the question without extraneous details from the source material`

**🔍 Relevance**
`Reward responses where all content directly addresses and is relevant to answering the user's specific question`

## Implementation Example

Here's how to evaluate a RAG system's performance using our framework:

```python Python wrap
from composo import Composo, criteria

composo_client = Composo(api_key="your-api-key-here")

# Example RAG conversation with retrieved context
messages = [
    {
        "role": "user",
        "content": """What is the current population of Tokyo?

Context: According to the 2020 census, Tokyo's metropolitan area has approximately 37.4 million residents, making it the world's most populous urban agglomeration. The Tokyo Metropolis itself has 14.0 million people."""
    },
    {
        "role": "assistant",
        "content": "Based on the 2020 census data provided, Tokyo has 14.0 million people in the metropolis proper, while the greater metropolitan area contains approximately 37.4 million residents, making it the world's largest urban agglomeration."
    }
]

# Evaluate with the RAG framework
results = composo_client.evaluate(
    messages=messages,
    criteria=criteria.rag
)

for result in results:
    print(f"Score: {result.score}/1.00")
    print(f"Explanation: {result.explanation}\n")
```

## Evaluating Retrieval Quality

Beyond evaluating the generated responses, you can also assess the quality of your retrieval system itself. This helps identify when your vector search or retrieval mechanism needs improvement before it impacts downstream generation.

### How It Works

Treat your retrieval step as a "tool call" and evaluate whether the retrieved chunks are actually relevant to the user's query. This gives you quantitative metrics on retrieval precision.

### Implementation

```python Python wrap
from composo import Composo

composo_client = Composo(api_key="your-api-key-here")

# User's question
user_query = "What is the current population of Tokyo?"

# Chunks retrieved by your RAG system
retrieved_chunks = """
Chunk 1: According to the 2020 census, Tokyo's metropolitan area has approximately 37.4 million residents.

Chunk 2: The Tokyo Metropolis itself has 14.0 million people.

Chunk 3: Population density in Tokyo is approximately 6,158 people per square kilometer.
"""

# Define the retrieval tool (for context)
tools = [
    {
        "type": "function",
        "function": {
            "name": "rag_retrieval",
            "description": "Retrieves relevant document chunks based on semantic search",
            "parameters": {"type": "object", "required": [], "properties": {}}
        }
    }
]

# Evaluate retrieval quality
result = composo_client.evaluate(
    messages=[
        {"role": "user", "content": user_query},
        {"role": "function", "name": "rag_retrieval", "content": retrieved_chunks}
    ],
    tools=tools,
    criteria="Reward tool calls that retrieve chunks directly relevant to answering the user's question"
)

print(f"Retrieval Quality Score: {result.score:.2f}/1.00")

# High scores (>0.8) indicate good retrieval
# Low scores (<0.6) suggest retrieval improvements needed
```
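To localize retrieval problems further, you can score each chunk on its own with the same criterion. The sketch below is illustrative: `chunks` is a hypothetical list of the individual chunk strings returned for this query, and the code reuses `composo_client`, `user_query`, and `tools` from the example above.

```python Python wrap
# A minimal sketch: score each retrieved chunk separately to see which
# chunks are pulling retrieval quality down. `chunks` is a hypothetical
# list of the individual chunk strings returned for this query.
chunks = [
    "According to the 2020 census, Tokyo's metropolitan area has approximately 37.4 million residents.",
    "The Tokyo Metropolis itself has 14.0 million people.",
    "Population density in Tokyo is approximately 6,158 people per square kilometer."
]

for i, chunk in enumerate(chunks, start=1):
    result = composo_client.evaluate(
        messages=[
            {"role": "user", "content": user_query},
            {"role": "function", "name": "rag_retrieval", "content": chunk}
        ],
        tools=tools,
        criteria="Reward tool calls that retrieve chunks directly relevant to answering the user's question"
    )
    print(f"Chunk {i} relevance: {result.score:.2f}/1.00")
```

Chunks that consistently score low are candidates for better chunking, filtering, or re-ranking before they reach the generation step.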
""" # Define the retrieval tool (for context) tools = [ { "type": "function", "function": { "name": "rag_retrieval", "description": "Retrieves relevant document chunks based on semantic search", "parameters": {"type": "object", "required": [], "properties": {}} } } ] # Evaluate retrieval quality result = composo_client.evaluate( messages=[ {"role": "user", "content": user_query}, {"role": "function", "name": "rag_retrieval", "content": retrieved_chunks} ], tools=tools, criteria="Reward tool calls that retrieve chunks directly relevant to answering the user's question" ) print(f"Retrieval Quality Score: {result.score:.2f}/1.00") # High scores (>0.8) indicate good retrieval # Low scores (<0.6) suggest retrieval improvements needed ``` # Response Quality Evaluation Source: https://docs.composo.ai/pages/usecases/response-evaluation Evaluate custom quality aspects of LLM responses Beyond our pre-built Agent & RAG frameworks, Composo's real power lies in writing custom criteria for any quality aspect you care about—and most teams do exactly this for their specific use cases. ## What is Response Quality Evaluation? Response quality evaluation assesses subjective and domain-specific aspects of assistant responses: tone, style, safety, adherence to guidelines, and any custom quality metric unique to your application. ## Example Criteria ### Core Quality Metrics * **Conciseness**: `"Reward responses that are clear and direct, avoiding unnecessary verbosity, repetition, or extraneous details"` * **Information Structure**: `"Reward responses that present information in a logical, well-organized format that prioritizes the most important details"` * **Professional Tone**: `"Reward responses that maintain appropriate professional language and tone suitable for the context"` * **Actionable Guidance**: `"Reward responses that provide practical next steps or actionable recommendations when appropriate"` ### Safety & Compliance * **Harmful Content**: `"Penalize responses that provide inappropriate advice (e.g., medical advice, harmful instructions) outside the system's intended scope"` * **System Compliance**: `"Penalize responses that violate explicit system constraints, limitations, or instructions"` ### Domain-Specific Examples * **Healthcare**: `"Reward responses that use precise medical terminology appropriate for the audience (clinician vs patient)"` * **Customer Service**: `"Reward responses that express appropriate empathy when the user is frustrated"` * **Technical Support**: `"Reward responses that precisely adhere to the technical user manual's resolution steps"` * **Education**: `"Reward responses that adapt explanation complexity to match the user's learning level"` ## Writing Effective Criteria Every criterion follows this simple template: ``` [Prefix] [quality] [qualifier (optional)] ``` * **Prefix**: "Reward responses that..." or "Penalize responses that..." 
## Writing Effective Criteria

Every criterion follows this simple template:

```
[Prefix] [Quality] [Qualifier (optional)]
```

* **Prefix**: "Reward responses that..." or "Penalize responses that..."
* **Quality**: The specific behavior you want to evaluate
* **Qualifier**: Optional "if" statement for conditional application

**Example**: `"Reward responses that provide code examples if the user asks for implementation details"`

* Prefix: "Reward responses that"
* Quality: "provide code examples"
* Qualifier: "if the user asks for implementation details"

### Key Principles

✅ **Be specific** - Focus on one quality at a time\
✅ **Use clear direction** - Start with "Reward" or "Penalize"\
✅ **Add qualifiers when needed** - Use "appropriate" for non-monotonic qualities\
✅ **Leverage domain expertise** - Your knowledge of what "good" looks like is your secret weapon

## Next Steps

📚 [**Browse our Criteria Library**](/pages/guides/criteria-library) - Explore tried & tested criteria across domains for inspiration\
✏️ [**How to Write Criteria Guide**](/pages/guides/criteria-writing) - Master the art of writing precise evaluation criteria