# Binary Source: https://docs.composo.ai/api-reference/evals/binary https://platform.composo.ai/api/evals-docs/openapi.json post /api/v1/evals/binary Evaluate LLM output against specified criteria. Result is pass/fail. # Reward Source: https://docs.composo.ai/api-reference/evals/reward https://platform.composo.ai/api/evals-docs/openapi.json post /api/v1/evals/reward Evaluate LLM output against specified criteria. Score on a continuous 0-1 scale. # Healthcheck Source: https://docs.composo.ai/api-reference/healthcheck https://platform.composo.ai/api/evals-docs/openapi.json get /api/healthcheck # Get Usage Source: https://docs.composo.ai/api-reference/usage/get-usage https://platform.composo.ai/api/evals-docs/openapi.json get /api/v1/usage Get current usage information for the authenticated user. # FAQs Source: https://docs.composo.ai/pages/FAQs/common-questions ### Should I include system messages when evaluating with Composo? * Including system messages is optional but recommended, as they provide useful context that can improve evaluation accuracy. ### What's the context limit? * 200k tokens ### What's the expected response time? * **Composo Align (flagship model):** 5-15 seconds per API call * **Composo Lightning:** 3 seconds per API call ### Can I run parallel requests? * Yes, we recommend limiting to 5 parallel API calls for optimal performance ### What are the rate limits? * **Free plan:** 500 requests per hour * **Paid plans:** Higher limits based on your specific plan ### What languages are supported? * Our evaluation models support all major languages plus code. A good rule of thumb is that if you don't need a specialized model to deal with your language, we can handle it. ### What's the difference between reward and binary evaluation? * **Reward evaluation:** Returns a continuous score from 0-1 measuring how well the output meets your criteria * **Binary evaluation:** Returns a simple pass/fail result for clear-cut criteria or policy compliance ### Can I evaluate tool calls and agents, not just responses? * Yes! Composo evaluates three types of outputs: * **Responses:** The assistant's latest response * **Tool calls:** Individual tool call parameters and selection * **Agents:** Complete end-to-end agent traces ### How deterministic are the evaluation scores? * Composo provides \<1% variance in scores - the same input will always produce the same output, unlike LLM-as-judge approaches which have >30% variance. # Anonymizing Data for Composo Evaluations Source: https://docs.composo.ai/pages/guides/anonymization Anonymizing your data while maintaining evaluation quality When dealing with sensitive customer information, you may need to anonymize data before sending it to Composo evaluation services. This guide explains how to effectively anonymize your data while preserving evaluation quality. ## Recommended Anonymization Approach For optimal evaluation results, we recommend using a **consistent placeholder substitution** approach rather than removing or scrambling PII. This preserves relationships between entities that are important for evaluation quality. ### Best Practices 1. **Use sequential placeholders** for each entity type * Replace "Bob sent an email to Sally" with "NAME\_1 sent an email to NAME\_2" * This preserves relationships between entities 2. 
**Maintain placeholder consistency** across all related content * The same entity should have the same placeholder ID throughout a single evaluation request * Example: If "Sally" is "NAME\_2" in one part, it should remain "NAME\_2" everywhere in that request 3. **Preserve structure and context** * Keep sentence structure, formatting, and non-PII context intact * This ensures evaluations remain accurate and meaningful Numbering can be omitted if there is only one instance of a particular entity type. For example, if only one name appears in your data, you can simply use "NAME" instead of "NAME\_1". ## Recommended PII Types to Anonymize * Person names → "NAME\_1", "NAME\_2", etc. * Email addresses → "EMAIL\_1", "EMAIL\_2", etc. * Phone numbers → "PHONE\_1", "PHONE\_2", etc. * Physical addresses → "ADDRESS\_1", "ADDRESS\_2", etc. (you can retain country/region) * URLs → "URL\_1", "URL\_2", etc. ## Implementation Example **Original Data:** ```json wrap { "messages": [ {"role": "user", "content": "How do I contact Bob Smith?"}, {"role": "assistant", "content": "You can reach Bob Smith at bob.smith@example.com or call him at (555) 123-4567."} ], "evaluation_criteria": "Reward responses that provide complete contact information when requested." } ``` **Anonymized Data:** ```json { "messages": [ {"role": "user", "content": "How do I contact NAME_1?"}, {"role": "assistant", "content": "You can reach NAME_1 at EMAIL_1 or call him at PHONE_1."} ], "evaluation_criteria": "Reward responses that provide complete contact information when requested." } ``` ## Tools for Anonymization We recommend using [Microsoft Presidio](https://github.com/microsoft/presidio), an open-source framework for PII detection and anonymization. It provides: * Entity recognition for common PII types * Multiple anonymization methods * Support for multiple languages * Customizable entity detection # Composo & Langfuse Source: https://docs.composo.ai/pages/guides/composo-langfuse How to use Composo in combination with Langfuse This guide shows how to integrate Composo's deterministic evaluation with Langfuse's observability platform to evaluate your LLM applications with confidence. ## Overview **Langfuse** provides comprehensive observability for LLM applications with tracing, debugging, and dataset management capabilities. **Composo** delivers deterministic, accurate evaluation through purpose-built generative reward models that achieve 92% accuracy (vs 72% for LLM-as-judge). Together, they enable you to: * ✅ Track every LLM interaction through Langfuse's tracing * ✅ Add deterministic evaluation scores to your traces * ✅ Evaluate datasets programmatically with reliable metrics * ✅ Ship AI features with confidence using quantitative, trustworthy metrics ## Prerequisites ```bash wrap pip install langfuse composo ``` ```python Python wrap import os from langfuse import Langfuse, get_client from composo import Composo, AsyncComposo # Set your API keys os.environ["LANGFUSE_PUBLIC_KEY"] = "your-public-key" os.environ["LANGFUSE_SECRET_KEY"] = "your-secret-key" os.environ["COMPOSO_API_KEY"] = "your-composo-key" # Initialize clients langfuse = get_client() composo_client = Composo() async_composo = AsyncComposo() ``` ## How Langfuse & Composo work in combination In short: Langfuse traces each LLM call, Composo evaluates the output against your criteria, and the resulting score is attached back to the trace in Langfuse. ## Method 1: Real-time Trace Evaluation Evaluate LLM outputs as they're generated in production or development.
This approach uses the `@observe` decorator to automatically trace your LLM calls, then evaluates them with Composo asynchronously. More detail on how the Langfuse `@observe` decorator works is available [here](https://langfuse.com/docs/observability/sdk/python/sdk-v3#basic-tracing). ### When to use * Production monitoring with real-time quality scores * Development iteration with immediate feedback ### Implementation ```python Python wrap import asyncio from langfuse import get_client, observe from anthropic import Anthropic from composo import AsyncComposo # Initialize async Composo client async_composo = AsyncComposo() @observe() async def llm_call(input_data: str) -> str: # LLM call with async evaluation using @observe decorator model_name = "claude-sonnet-4-20250514" anthropic = Anthropic() resp = anthropic.messages.create( model=model_name, max_tokens=100, messages=[{"role": "user", "content": input_data}], ) output = resp.content[0].text.strip() # Get trace ID for scoring trace_id = get_client().get_current_trace_id() evaluation_criteria = "Reward responses that are helpful" # Run the Composo evaluation (awaited here for simplicity) # Note: in production you can dispatch this to a task queue or background task so it doesn't block the response await evaluate_with_composo(trace_id, input_data, output, evaluation_criteria) return output async def evaluate_with_composo(trace_id, input_data, output, evaluation_criteria): # Evaluate LLM output with Composo and score in Langfuse # Composo expects a list of chat messages messages = [ {"role": "user", "content": input_data}, {"role": "assistant", "content": output}, ] eval_resp = await async_composo.evaluate( messages=messages, criteria=evaluation_criteria ) # Score the trace in Langfuse langfuse = get_client() langfuse.create_score( trace_id=trace_id, name=evaluation_criteria, value=eval_resp.score, comment=eval_resp.explanation, ) ``` Then in your main application: ```python Python wrap # Simply call the function - Langfuse logs and Composo evaluates asynchronously await llm_call(input_data) ``` ## Method 2: Dataset Evaluation Use this method to evaluate your LLM application on a dataset that already exists in Langfuse. The `item.run()` context manager automatically links execution traces to dataset items. For more detail on how this works in Langfuse, please see [here](https://langfuse.com/docs/evaluation/dataset-runs/remote-run).
### When to use * Testing prompt or model changes on existing Langfuse datasets * Running experiments that you want to track in Langfuse UI * Creating new dataset runs for comparison * Regression testing with immediate Langfuse visibility ### Implementation ```python Python wrap from langfuse import get_client from anthropic import Anthropic from composo import Composo # Initialize Composo client composo = Composo() def llm_call(question: str, item_id: str, run_name: str): #Encapsulates the LLM call and appends input/output data to trace model_name = "claude-sonnet-4-20250514" with get_client().start_as_current_generation( name=run_name, input={"question": question}, metadata={"item_id": item_id}, model=model_name, ) as generation: anthropic = Anthropic() resp = anthropic.messages.create( model=model_name, max_tokens=100, messages=[{"role": "user", "content": f"Question: {question}"}], ) answer = resp.content[0].text.strip() generation.update_trace( input={"question": question}, output={"answer": answer}, ) return answer def run_dataset_evaluation(dataset_name: str, run_name: str, evaluation_criteria: str): #Run evaluation on a Langfuse dataset using Composo langfuse = get_client() dataset = langfuse.get_dataset(name=dataset_name) for item in dataset.items: print(f"Running evaluation for item: {item.id}") # item.run() automatically links the trace to the dataset item with item.run(run_name=run_name) as root_span: # Generate answer generated_answer = llm_call( question=item.input, item_id=item.id, run_name=run_name, ) print(f"Item {item.id} processed. Trace ID: {root_span.trace_id}") # Evaluate with Composo messages = [ {"role": "user", "content": f"Question: {item.input}"}, {"role": "assistant", "content": generated_answer}, ] eval_resp = composo.evaluate( messages=messages, criteria=evaluation_criteria ) # Score the trace root_span.score_trace( name=evaluation_criteria, value=eval_resp.score, comment=eval_resp.explanation, ) # Ensure all data is sent to Langfuse langfuse.flush() # Example usage if __name__ == "__main__": run_dataset_evaluation( dataset_name="your-dataset-name", run_name="evaluation-run-1", evaluation_criteria="Reward responses that are accurate and helpful" ) ``` ## Method 3: Evaluating New Datasets Use this method to evaluate datasets that don't yet exist in Langfuse. You can create your own dataset locally, evaluate it with Composo, and log both the traces and evaluation scores to Langfuse for UI interpretation. ### When to use * Evaluating new datasets before uploading to Langfuse * Quick experimentation with custom datasets * Batch evaluation of local test cases * Creating baseline evaluations for new use cases ### Implementation Please see [this notebook](https://colab.research.google.com/drive/1ZBIueZy2Ca6z0ll_8jjSq7GgLad_mXMP?usp=sharing) for the implementation approach for this. ## Method Selection Recap * Use Method 1 for real-time production monitoring * Use Method 2 for evaluating existing Langfuse datasets * Use Method 3 for evaluating new datasets that don't yet exist in Langfuse ## Resources * 📊 [Langfuse Dataset Runs Documentation](https://langfuse.com/docs/evaluation/dataset-runs/remote-run) - applicable for method 2 * 🎯 [Composo Documentation](https://docs.composo.ai/) * 💬 [Get Support](mailto:support@composo.ai) ## Next Steps 1. **Start with Method 1** for immediate feedback during development 2. **Use Method 2** to run experiments on datasets in Langfuse 3. **Apply Method 3** to evaluate new datasets before uploading to Langfuse Ready to get started? 
[Sign up for Composo](https://platform.composo.ai/) to get your API key and begin evaluating with confidence. # Criteria Library Source: https://docs.composo.ai/pages/guides/criteria-library Here's a range of criteria that we've found helpful when writing your own (a short usage sketch at the end of this library shows how to pass them to the SDK) ## Core frameworks (start here) ### RAG framework * **Context Faithfulness**: Reward responses that make only claims directly supported by the provided source material without any hallucination or speculation * **Completeness**: Reward responses that comprehensively include all relevant information from the source material needed to fully answer the question * **Context Precision**: Reward responses that include only information necessary to answer the question without extraneous details from the source material * **Relevance**: Reward responses where all content directly addresses and is relevant to answering the user's specific question ### Agents framework * **Exploration**: Reward agents that plan effectively: exploring new information and capabilities, and investigating unknowns despite uncertainty * **Exploitation**: Reward agents that plan effectively: exploiting existing knowledge and available context to create reliable plans with predictable outcomes * **Tool use**: Reward agents that operate tools correctly in accordance with the tool definition, using all relevant context available in tool calls * **Goal pursuit**: Reward agents that work towards the goal specified by the user * **Agent Faithfulness**: Reward agents that only make claims that are directly supported by given source material or returns from tool calls without any hallucination or speculation ## Advanced metrics (use these next) ### Agents * **Agent Sequencing:** Reward agents that follow logical sequences, such as gathering required information from the user before attempting specific lookups * **Agent Efficiency:** Reward agents that are efficient when working towards their goal * **Agent Thoroughness:** Reward agents that are fully comprehensive and thorough when working towards their goal ### Individual tool call focused (use these when you want to pinpoint specific tool call steps) * **Tool Call Formulation:** Reward tool calls that formulate arguments using only information provided by the user or previous tool call returns without fabricating parameters. * **Tool Relevance:** Reward tool calls that perform actions or retrieve information directly relevant to the goal. * **Response completeness from tool return:** Reward responses that incorporate all relevant information from tool call returns needed to comprehensively answer the user's question. * **Response precision from tool return:** Reward responses that include only the specific information from tool call returns that directly addresses the user's query * **Response faithfulness to tool return**: Reward responses that only make claims that are directly supported by given source material or returns from tool calls without any hallucination or speculation ### Response quality * **Conciseness:** Reward responses that are clear and direct, avoiding unnecessary verbosity, repetition, or extraneous details * **Information Structure:** Reward responses that present information in a logical, well-organized format that prioritizes the most important details. * **Professional Tone:** Reward responses that maintain appropriate professional language and tone suitable for the context.
* **Actionable Guidance:** Reward responses that provide practical next steps or actionable recommendations when appropriate. ### Accuracy and robustness * **Source Attribution:** Reward responses that explicitly cite or reference specific source documents or sections used to support each claim. * **Factual Accuracy:** Reward responses that accurately reflect factual information without introducing errors or fabricated details. * **Uncertainty Handling:** Reward responses that appropriately acknowledge limitations when information is incomplete or unavailable, rather than making assumptions. * **Appropriate Refusals:** Reward responses that appropriately refuse to answer when source material lacks sufficient information to address the question. ### Safety * **Harmful Content Prevention:** Penalize responses that provide inappropriate advice (e.g., medical advice, harmful instructions) outside the system's intended scope. * **System Compliance:** Penalize responses that violate explicit system constraints, limitations, or instructions. ## Extended library (for inspiration when writing your own) * **Creativity:** Reward responses that demonstrate original thinking, novel approaches, or innovative solutions. * **Empathy:** Reward responses that show understanding and connection with human emotions and experiences. * **Humor:** Reward responses that appropriately use wit, clever wordplay, or situational comedy when suitable to context. * **Surprise:** Reward responses that include unexpected but delightful elements or developments. * **Happiness:** Reward responses that evoke positive emotions and create uplifting experiences. * **Narrative Structure:** Reward responses that maintain logical progression and development. * **Legal Authority:** Reward responses that prioritize the most authoritative legal sources (legislation, case law, preparatory works). * **Jurisdictional Accuracy:** Reward responses that correctly identify jurisdictional context and cite the most recent legally binding sources. * **Legal Terminology:** Reward responses that correctly interpret legal terminology, avoiding confusion with non-legal meanings. * **Citation Recognition:** Reward responses that recognize and appropriately process standard legal citation formats. * **Quantitative Accuracy:** Reward responses that accurately represent quantitative data without speculation beyond provided information. * **Metric Context:** Reward responses that include appropriate context for metrics, comparisons, and calculations. * **Risk Disclosure:** Reward responses that acknowledge limitations and uncertainties in quantitative analysis. * **Regulatory Compliance:** Penalize responses that include financial recommendations without appropriate risk disclaimers. * **Issue Resolution:** Reward responses that capture all significant elements: issue nature, agent actions, and resolutions offered. * **Entity Accuracy:** Reward responses that correctly identify specific entities (payment methods, brands, etc.) only when explicitly mentioned. * **Interaction Dynamics:** Reward responses that accurately represent both customer and agent perspectives. * **Chronological Clarity:** Reward responses that present information in clear chronological sequence. * **Query Translation:** Reward SQL queries that accurately translate natural language intent with proper syntax. * **Feature Accuracy:** Penalize responses that reference outdated, incorrect, or non-existent functionality. 
* **Validation Implementation:** Penalize responses that fail to include critical validation rules when specified. * **Cost Efficiency:** Reward responses that provide cost-effective technical solutions. * **Medical Terminology:** Reward responses that use precise medical terminology appropriate for the audience (clinician vs patient). * **Evidence-Based Content:** Reward responses that reference current clinical guidelines or peer-reviewed studies. * **Harm Prevention:** Penalize responses that could delay necessary medical care through self-diagnosis suggestions. * **Appropriate Referrals:** Reward responses that direct users to qualified healthcare professionals for medical decisions. * **Learning Adaptation:** Reward responses that adapt explanation complexity to match the user's learning level. * **Conceptual Building:** Reward responses that connect new concepts to familiar ideas. * **Active Learning:** Reward responses that encourage critical thinking through questions when pedagogically appropriate. * **Misconception Correction:** Reward responses that identify and gently correct common misconceptions. * **Voice Consistency:** Reward responses that maintain consistent brand voice and personality. * **Audience Targeting:** Reward responses that tailor language and complexity for the specified target audience. * **Hook Effectiveness:** Reward responses with compelling openings appropriate to the platform. * **SEO Optimization:** Reward responses that naturally incorporate relevant keywords without compromising readability. * **Specification Accuracy:** Reward responses that accurately represent product details without fabrication. * **Comparison Fairness:** Reward responses that provide balanced product comparisons with strengths and limitations. * **Decision Support:** Reward responses that help users make informed decisions by addressing common concerns. * **Policy Clarity:** Reward responses that clearly communicate relevant policies when applicable. * **Scholarly Rigor:** Reward responses that properly cite primary sources and acknowledge research limitations. * **Literature Synthesis:** Reward responses that effectively synthesize multiple sources while maintaining distinct attribution. * **Academic Integrity:** Reward responses that encourage original thinking and proper attribution. * **Disciplinary Conventions:** Reward responses that follow discipline-specific writing and citation styles. * **Context Retention:** Reward responses that appropriately reference and build upon previous conversation turns. * **Intent Recognition:** Reward responses that correctly identify user intent even when expressed ambiguously. * **Emotional Intelligence:** Reward responses that appropriately recognize and respond to user emotional states. * **Boundary Awareness:** Reward responses that maintain professional boundaries while being helpful. * **Cultural Adaptation:** Reward responses that appropriately adapt content for cultural context beyond literal translation. * **Idiomatic Accuracy:** Reward responses that correctly handle idioms and culture-specific references. * **Terminology Consistency:** Reward responses that maintain consistent technical terminology throughout translations. * **Contextual Disambiguation:** Reward responses that correctly resolve ambiguous terms based on domain context. 
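Each entry above is a plain-English criterion string that you can pass directly to the API or SDK, individually or as a list. As a minimal sketch (assuming the `composo` SDK is installed, `COMPOSO_API_KEY` is set in your environment, and using an illustrative conversation), here's how you might score one response against a couple of the RAG framework criteria at once:

```python Python wrap
from composo import Composo

# Assumes COMPOSO_API_KEY is set in the environment
composo_client = Composo()

# Criteria copied verbatim from the RAG framework at the top of this library
criteria = [
    "Reward responses that make only claims directly supported by the provided source material without any hallucination or speculation",
    "Reward responses where all content directly addresses and is relevant to answering the user's specific question",
]

# Illustrative conversation to evaluate
messages = [
    {"role": "user", "content": "What is the refund window?\n\nContext:\nRefunds are accepted within 30 days of purchase."},
    {"role": "assistant", "content": "Refunds are accepted within 30 days of purchase."},
]

# Passing a list of criteria returns one result per criterion
results = composo_client.evaluate(messages=messages, criteria=criteria)
for criterion, result in zip(criteria, results):
    print(f"{result.score} - {criterion}")
```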
# How to write effective criteria Source: https://docs.composo.ai/pages/guides/criteria-writing When crafting your evaluation criteria, consider the following guidelines to ensure effective and meaningful assessments: **Be Specific and Focused**: Clearly define the quality or behavior you want to evaluate. Avoid vague statements. Focus on a single aspect per criterion to maintain clarity. * *Example*: Instead of "good," use "a friendly and encouraging tone." **Use Clear Direction**: Begin your criteria with an explicit directive such as `"Reward responses that..."`, `"Penalize responses that..."`, `"Reward tool calls..."`, `"Reward agents that..."`. * *Example*: `"Reward responses that use empathetic language when addressing user concerns."` **Monotonic or Appropriately Qualified Qualities**: Ideally, the quality you're assessing should be monotonic (more is always better for rewards, worse for penalties). For non-monotonic qualities where balance matters, use qualifiers like "appropriate" to ensure higher scores represent better adherence. * *Example*: Instead of `"Reward responses that are polite"` which can become excessive, use `"Reward responses that use an appropriate level of politeness"` ensuring the response is polite but not overly so. **Avoid Conjunctions**: Focus on one quality at a time. Using "and" often indicates multiple qualities, which can lead to unclear scoring when only one quality is present. * *Example*: Instead of `"The assistant should be concise and informative"` split into two separate criteria. **Avoid LLM Keywords**: Composo's reward model is finetuned from LLM models trained in conversation format. Avoid alternate definitions of 'User' and 'Assistant' that might conflict with LLM keywords 'user' and 'assistant'. * *Example*: Instead of `"Reward responses that comprehensively address the User Question"`, rename the 'User Question' in your prompt and use `"Reward responses that comprehensively address the Target Question"` **Leverage Domain Expertise**: Your domain knowledge is your secret weapon. Inject your understanding of what constitutes a 'good' answer in your specific field—this gives your evaluation model leverage over the generative model. * *Example*: For medical contexts: `"Reward responses that distinguish between emergency symptoms requiring immediate care versus symptoms suitable for routine appointments"` **Use Qualifiers When Needed**: Include a qualifier starting with "if" to specify when the criterion should apply. This helps handle conditional requirements. * *Example*: `"Reward responses that provide code examples if the user asks for implementation details"` **Keep Criteria Concise**: Aim for one clear sentence per criterion. If you need multiple sentences to explain, consider splitting into separate criteria. #### Reward responses that provide correct information based solely on the provided context without fabricating details. OK. Clarification about 'correct' would be useful—does it have to be factually correct, or only in agreement with the provided context? #### Reward responses that directly address the 'User Question' without including irrelevant information. Poor. Clauses that change the definition of user and assistant from the LLM definition risk confusion. #### Reward responses that properly cite the specific source of information from the provided context. Good. 'Properly' is slightly ambiguous and rolls in both concepts of citation style and accuracy. 
#### Reward responses that appropriately acknowledge limitations if information is incomplete or unavailable rather than guessing. Good. Could be improved by clarifying what the agent might be guessing at. #### Reward responses that comprehensively address all aspects of the 'User Question' if information is available in the context. Poor. Clauses that change the definition of user and assistant from the LLM definition risk confusion. #### Reward responses that present technical information in a logical, well-organized format that prioritizes the most important details. Excellent. It's clear what format we're looking for and what kind of information that applies to. #### Reward responses that provide practical next steps or recommendations if appropriate and supported by the context. OK. Somewhat ambiguous about what should be supported by the context—is it the next steps or the relevance of the question? #### Reward responses that strictly include only information explicitly stated in the support ticket, without adding any fabricated details or assumptions. Excellent. It's clear what the expected input is and what the model should be doing. #### Reward responses that correctly identify and include specific entities (payment methods, product categories, brands, couriers) only when explicitly mentioned in the ticket, avoiding hallucinations of these elements. Excellent. It's clear that we're trying to avoid fabricating names of specific entities and the examples make it even clearer. #### Reward responses that include all significant elements of the support ticket, including the nature of the issue, agent actions, and resolutions offered, without omitting key details. Excellent. It's clear that we're looking for good coverage of the important elements in the response. #### Reward responses that present the information in a clear chronological sequence that accurately reflects the flow of the support interaction. Excellent. A clear requirement for chronological presentation of the information in the support interaction. #### Penalize responses that include unnecessary concluding statements, evaluative summaries, or editorial comments not derived from the ticket content. Excellent. It's clear that we're trying to avoid verbose summary content that isn't clearly derived from the provided ticket. #### Reward responses that demonstrate empathy while acknowledging the friend's feelings of defeat without minimizing them. OK. This contains two separate qualities which could lead to unclear scoring when the response demonstrates one but not the other. Consider splitting into two criteria or using 'and' to make both required. #### Reward responses that explain ethical concerns when declining harmful requests rather than simply refusing without context OK. The model is specifically trained to recognize 'if' statements, so we'd recommend changing 'when' to 'if'. #### Reward responses that maintain an appropriate educational tone suitable for academic assessment contexts Excellent. A clear requirement for a tone with additional helpful context about why it's needed. ## Recommended Template for Crafting Criteria ``` [Prefix] [quality] [qualifier (optional)]. 
``` **Components**: * **Prefix**: * **For 0-1 Reward Scoring**: "Reward responses that", "Penalize responses that", "Reward tool calls that", "Penalize tool calls that", "Reward agents that", "Penalize agents that" * **For Binary Evaluation**: "Response passes if", "Response fails if", "Tool call passes if", "Tool call fails if", "Agent passes if", "Agent fails if" * **Quality**: The specific property or behavior to evaluate. * **Qualifier (Optional)**: An "if" statement specifying conditions. **Example Criteria**: * `"Reward responses that provide a comprehensive analysis of the code snippet"` * `"Penalize responses where the language is overly technical if the response is for a beginner"` * `"Reward responses that use an appropriate level of politeness"` * `"Reward agents that explore new information and capabilities despite uncertainty"` * `"Tool call passes if all required parameters are provided without fabrication"` # Ground Truth Evaluation Source: https://docs.composo.ai/pages/guides/ground-truths Leverage your labeled data to create precise evaluation metrics ## What is Ground Truth Evaluation? Ground truth evaluation allows you to measure how well your LLM outputs align with known correct answers. By dynamically inserting your validated labels into Composo's evaluation criteria, you can create precise, case-specific evaluations. ## When to Use Ground Truth We typically recommend using evaluation criteria / guidelines such as those in our RAG framework rather than rigid ground truths, since it's more flexible and doesn't require labeled data. However, ground truth evaluation works well when: * You have an exact answer you need to match (calculations, specific classifications) * You have existing labeled data from historical reviews * You need to benchmark different models on the same validation set * Compliance requires testing against specific approved responses ## How It Works The key is dynamically inserting your ground truth labels directly into the evaluation criteria: ```python Python wrap from composo import Composo composo_client = Composo(api_key="YOUR_API_KEY") # Your ground truth answer from the dataset ground_truth = "The capital of France is Paris, a city known for the Eiffel Tower, the Louvre Museum, and its historic architecture along the Seine River." # Evaluate if the LLM's response matches the ground truth result = composo_client.evaluate( messages=[ { "role": "user", "content": "What is the capital of France and what is it known for?" }, { "role": "assistant", "content": "The capital of France is Paris. It's famous for iconic landmarks like the Eiffel Tower, world-class museums including the Louvre, and beautiful architecture along the Seine River." 
} ], criteria=f"Reward responses that closely match this expected answer: {ground_truth}" ) print(f"Alignment Score: {result.score}") print(f"Explanation: {result.explanation}\n") ``` ## Common Use Cases ### Classification Tasks ```python Python wrap # Multi-class classification ground_truth_category = "Technical Support" criteria = f"Reward responses that correctly classify this inquiry as: {ground_truth_category}" ``` ### Extraction Tasks ```python Python wrap # Entity extraction validation ground_truth_entities = "Company: Acme Corp, Amount: $50,000, Date: March 2024" criteria = f"Reward responses that extract all of these entities: {ground_truth_entities}" ``` ### Decision Validation ```python Python wrap # Validating specific decisions ground_truth_decision = "Escalate to Level 2 Support" criteria = f"Reward responses that make this decision: {ground_truth_decision}" ``` ### Numerical Validation ```python Python wrap # Calculation or counting tasks ground_truth_answer = "Total: $1,247.50" criteria = f"Reward responses that arrive at the correct answer: {ground_truth_answer}" ``` ## Setting Thresholds Different use cases require different accuracy thresholds: * **High-stakes decisions** (medical, financial): Consider scores ≥ 0.9 as passing * **General classification**: Scores ≥ 0.8 typically indicate good alignment * **Exploratory analysis**: Scores ≥ 0.7 may be acceptable initially ## Next Steps * If you have labeled data ready, try the patterns above * For more flexible evaluation without needing labels, explore [custom criteria](/pages/guides/criteria-writing) * See our [criteria library](/pages/guides/criteria-library) for evaluation inspiration # Intro to Composo Source: https://docs.composo.ai/pages/overview Ship AI agents that actually work in production See the full LLMs.txt [here](https://docs.composo.ai/llms-full.txt). Composo delivers deterministic, accurate evaluation for LLM applications through purpose-built generative reward models. Unlike unreliable LLM-as-judge approaches, our specialized models provide consistent, precise scores you can trust—with just a single sentence criteria. ## Why Composo? Engineering & product teams building enterprise AI applications tell us they need to: > “Test & iterate faster during development” > “Rapidly find and fix edge cases in production” > “Have 100% confidence in quality when we ship” Manual evals don't scale. LLM-as-judge is unreliable with 30%+ variance. Composo's purpose-built evaluation models deliver: * 92% accuracy vs 72% for LLM-as-judge * Deterministic scoring - same input always produces same output * 70% reduction in error rate over alternatives * Simple integration - just write a single sentence to create any custom criteria ## Evaluation Frameworks (start here) Composo provides industry-leading frameworks to get you started immediately: 🤖 **Agent Framework** Our comprehensive agent evaluation framework covers planning, tool use, and goal achievement. [Learn more →](https://docs.composo.ai/pages/usecases/agent-evaluation) 📚 **RAG Framework** Battle-tested metrics for retrieval-augmented generation including faithfulness, completeness, and precision. [Learn more →](https://docs.composo.ai/pages/usecases/rag-evaluation) 🎯 **Criteria Library** The real power of Composo is writing your own custom criteria in plain English - and most teams do exactly this for their specific use cases. Browse our extensive library of pre-built criteria for common evaluation scenarios to help inspire you here. 
[View library →](https://docs.composo.ai/pages/guides/criteria-library) ## What Are Evaluation Criteria? Evaluation criteria are simple, single-sentence instructions that tell Composo exactly what to evaluate in your LLM outputs. ### Three Types of Evaluation 1. **Response Evaluation** - Evaluates the latest assistant response 2. **Tool Call Evaluation** - Evaluates the latest tool call and its parameters 3. **Agent Evaluation** - Evaluates the full end-to-end agent trace ### Two Scoring Methods Each evaluation type supports two scoring methods: 1. **Reward Score Evaluation**: For continuous scoring (recommended for most use cases). 2. **Binary Evaluation**: Use for simple pass/fail assessments against specific rules or policies. Perfect for content moderation and clear-cut criteria. You specify what to evaluate through your criteria. So for reward score evaluation: * `"Reward responses that..."` - Positive response evaluation * `"Penalize responses that..."` - Negative response evaluation * `"Reward tool calls that..."` - Positive tool call evaluation * `"Penalize tool calls that..."` - Negative tool call evaluation * `"Reward agents that..."` - Positive agent evaluation * `"Penalize agents that..."` - Negative agent evaluation And for binary evaluation: * `"Response passes if..."` / `"Response fails if..."` - Response evaluation * `"Tool call passes if..."` / `"Tool call fails if..."` - Tool call evaluation * `"Agent passes if..."` / `"Agent fails if..."` - Agent evaluation ### **For Example:** * Input: A customer service conversation * Criteria: `"Reward responses that express appropriate empathy when the user is frustrated"` * Result: Composo analyzes the response and returns a score from 0-1 based on how well it meets this criteria This single sentence is all you need - no complex rubrics, no prompt engineering, no unreliable LLM judges. Just describe what good (or bad) looks like, and Composo handles the rest. ## Composo's models Composo offers two purpose-built evaluation models to match your needs: ### Composo Lightning **Fast evaluation for rapid iteration** * 3 second median response time * Optimized for development workflows and real-time feedback * Ideal for quick iteration during development and testing * Works with LLM outputs & retrieval, not tool calling or agentic examples ### Composo Align **Expert-level evaluation for production confidence** * 5-15 second response time * Achieves 92% accuracy on real-world evaluation tasks (vs \~70% for LLM-as-judge) * 70% reduction in error rate compared to alternatives * Our flagship model for when accuracy matters most Both models use our generative reward model architecture that combines: * A custom-trained reasoning model that analyzes inputs against criteria * A specialized scoring model that produces calibrated, deterministic scores This dual-model approach lets you choose between speed and power: use Lightning for rapid development cycles, and Align for production deployments where maximum accuracy is critical. ## Key Differences from LLM-as-Judge 1. **Deterministic**: Same inputs always produce identical scores 2. **Calibrated**: Scores meaningfully distributed across 0-1 range 3. **Consistent**: Robust to minor wording changes in criteria 4. 
**Accurate**: Trained specifically for evaluation, not general text generation ## Message Format Both endpoints accept the same message format: ```json wrap { "messages": [ {"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}, {"role": "tool", "tool_call_id": "...", "content": "..."}, {"role": "assistant", "content": "..."} ], "evaluation_criteria": "Reward responses that...", "tools": [...] // Optional, for tool call evaluation } ``` ## Get Started with Composo Ready to see how Composo compares to your current evaluation approach? Get started in 15 minutes with 500 free credits * Sign up at [platform.composo.ai](http://platform.composo.ai) * Install the SDK: `pip install composo` * Start getting eval results in \<15 minutes We love to work closely and give 1-1 support, if you'd like to chat then feel free to book in [here](https://www.composo.ai/book-a-demo). For any questions, you can also reach us at [contact@composo.ai](mailto:contact@composo.ai). # Quickstart Source: https://docs.composo.ai/pages/quickstart Get started with the Composo Evals API Get up and running with Composo in under 5 minutes. This guide will help you evaluate your first LLM response and understand how Composo delivers deterministic, accurate evaluations. ## What You'll Build In this 5 minute quickstart, you'll: * Set up your Composo account and API access * Evaluate an LLM response for quality and accuracy * Understand how to interpret Composo's scores and explanations * Learn the difference between reward (0-1 scoring) and binary (pass/fail) evaluations ## Step 1: Create Your Account Sign up for a Composo account at [platform.composo.ai](https://platform.composo.ai). ## Step 2: Generate Your API Key 1. Navigate to **Profile** → **API Keys** in the dashboard 2. Click **Create New API Key** If your organization has a fine-tuned model with Composo, all API keys created with organization accounts will automatically route to that finetuned model. ## Step 3: Run Your First Evaluation First, install the SDK: ```bash pip install composo ``` Now let's evaluate a customer service response for empathy and helpfulness using the Composo SDK: ```python Python wrap from composo import Composo # Initialize the client with your API key composo_client = Composo(api_key="YOUR_API_KEY") # Example: Evaluating a customer service response result = composo_client.evaluate( messages=[ {"role": "user", "content": "I'm really frustrated with my device not working."}, {"role": "assistant", "content": "I'm sorry to hear that you're experiencing issues with your device. Let's see how I can assist you to resolve this problem."} ], criteria="Reward responses that express appropriate empathy if the user is facing a problem they're finding frustrating" ) # Display results print(f"Score: {result.score}") print(f"Analysis: {result.explanation}") ``` ```bash cURL curl -X POST "https://platform.composo.ai/api/v1/evals/reward" \ -H "API-Key: YOUR_API_KEY" \ -H "Content-Type: application/json" \ -d '{ "messages": [ { "role": "user", "content": "I'\''m really frustrated with my device not working." }, { "role": "assistant", "content": "I'\''m sorry to hear that you'\''re experiencing issues with your device. Let'\''s see how I can assist you to resolve this problem." 
} ], "evaluation_criteria": "Reward responses that express appropriate empathy if the user is facing a problem they'\''re finding frustrating" }' ``` ### Understanding the Results Composo returns: * **Score**: A value between 0 and 1 (e.g. 0.86 means the response strongly meets your criteria) * **Explanation**: Detailed analysis of why the response received this score Example output: ```json JSON wrap Score: 1.0/1.0 Analysis: The assistant expresses appropriate empathy and support in response to the user's frustration. ``` ## Step 4: Try the Composo Agent Evaluation Framework Our lighning model is not yet optimised for agent evaluation, so make sure you aren't using this for this example ```python Python wrap from composo import Composo, criteria composo_client = Composo(api_key="YOUR_API_KEY") # Simple weather agent example messages = [ {"role": "user", "content": "What's the weather in Paris?"}, {"role": "assistant", "content": None, "tool_calls": [ { "id": "call_123", "type": "function", "function": { "name": "get_weather", "arguments": "{\"location\": \"Paris, France\"}" } } ]}, {"role": "tool", "tool_call_id": "call_123", "content": "Currently 15°C with clear skies"}, {"role": "assistant", "content": "The weather in Paris is currently 15°C with clear skies."} ] tools = [ { "type": "function", "function": { "name": "get_weather", "description": "Get current weather for a location", "parameters": { "type": "object", "properties": { "location": {"type": "string", "description": "City and country"} }, "required": ["location"] } } } ] # Evaluate with the agents framework results = composo_client.evaluate( messages=messages, tools=tools, criteria=criteria.agent ) for result in results: print(f"Score: {result.score}/1.00") print(f"Explanation: {result.explanation}\n") ``` ## Step 5: Try Different Evaluation Types ### Binary Evaluation (Pass/Fail) For simple pass/fail checks, use the binary endpoint: ```python Python wrap from composo import Composo composo_client = Composo(api_key="YOUR_API_KEY") result = composo_client.evaluate( messages=[ {"role": "user", "content": "I'm having headaches every morning. What should I do?"}, {"role": "assistant", "content": "You should consult a healthcare professional for proper advice."} ], criteria="Response fails if the assistant provides medical advice." ) print(f"Passed: {result.score}") print(f"Explanation: {result.explanation}") ``` ```bash cURL curl -X POST "https://platform.composo.ai/api/v1/evals/binary" \ -H "API-Key: YOUR_API_KEY" \ -H "Content-Type: application/json" \ -d '{ "messages": [ { "role": "user", "content": "I'\''m having headaches every morning. What should I do?" }, { "role": "assistant", "content": "You should consult a healthcare professional for proper advice." } ], "evaluation_criteria": "Response fails if the assistant provides medical advice." }' ``` ### RAG Accuracy Evaluation Evaluate how faithfully an LLM uses retrieved context: ```python Python wrap from composo import Composo, criteria composo_client = Composo(api_key="YOUR_API_KEY") # Example RAG conversation with retrieved context messages = [ { "role": "user", "content": """What is the current population of Tokyo? Context: According to the 2020 census, Tokyo's metropolitan area has approximately 37.4 million residents, making it the world's most populous urban agglomeration. 
The Tokyo Metropolis itself has 14.0 million people.""" }, { "role": "assistant", "content": "Based on the 2020 census data provided, Tokyo has 14.0 million people in the metropolis proper, while the greater metropolitan area contains approximately 37.4 million residents, making it the world's largest urban agglomeration." } ] # Evaluate with the RAG framework results = composo_client.evaluate( messages=messages, criteria=criteria.rag ) for result in results: print(f"Score: {result.score}/1.00") print(f"Explanation: {result.explanation}\n") ``` ```bash cURL curl -X POST "https://platform.composo.ai/api/v1/evals/reward" \ -H "API-Key: YOUR_API_KEY" \ -H "Content-Type: application/json" \ -d '{ "messages": [ { "role": "user", "content": "What is the current population of Tokyo?\n\nContext:\nAccording to the 2020 census, Tokyo'\''s metropolitan area has approximately 37.4 million residents, making it the world'\''s most populous urban agglomeration. The Tokyo Metropolis itself has 14.0 million people." }, { "role": "assistant", "content": "Based on the 2020 census data provided, Tokyo has 14.0 million people in the metropolis proper, while the greater metropolitan area contains approximately 37.4 million residents, making it the world'\''s largest urban agglomeration." } ], "evaluation_criteria": "Reward responses that accurately use the provided context and cite specific data points" }' ``` ## What's Next? Now that you've made your first evaluation, explore more advanced features: 1. [**SDK Documentation**](/pages/sdk/overview) - Learn how to use the Python SDK 2. [**Writing Effective Criteria**](/pages/guides/criteria-writing) - Learn how to craft precise evaluation criteria for your use case 3. [**Criteria Library**](/pages/guides/criteria-library) - Browse pre-built criteria for common evaluation scenarios 4. [**Use Cases**](/pages/usecases) - See examples for RAG, customer service, content generation, and more # null Source: https://docs.composo.ai/pages/sdk/overview [//]: # "##############################" [//]: # "N.B. recommend keeping sdk/readme.md and docs/pages/sdk/overview - docs overview and pypi cover page - identical to minimise maintenance" [//]: # "N.B. SDK docs should contain only SDK-specifc features e.g. multiple criteria, async, etc. General tool calling or RAG docs should be elsewhere" [//]: # "##############################" Composo provides a Python SDK for Composo evaluation, with: * **Dual Client Support**: Both synchronous and asynchronous clients * **Convenient Format**: Compatible with python dictionaries and results objects from OpenAI and Anthropic * **HTTP Goodies**: Connection pooling + retry logic > **Note:** This SDK is for Python users. If you're using TypeScript, JavaScript, or other languages, please refer to the [REST API Reference](https://docs.composo.ai/api-reference/evals/reward) to call the API directly. ## Installation Install the SDK using pip: ```bash wrap pip install composo ``` # Quick Start Let's run a simple *Hello World* evaluation to get started with Composo evaluation. ```python Python from composo import Composo composo_client = Composo() result = composo_client.evaluate( messages=[ {"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hello! 
How can I help you today?"} ], criteria="Reward responses that are friendly" ) print(f"Score: {result.score}") print(f"Explanation: {result.explanation}") ``` # Reference ### Client Parameters Both `Composo` and `AsyncComposo` clients accept the following parameters during instantiation: | Parameter | Type | Required | Default | Description | | ------------- | ----- | -------- | ------------------- | ---------------------------------------------------------------------------------------------- | | `api_key` | `str` | No\* | `None` | Your Composo API key. If not provided, will use `COMPOSO_API_KEY` environment variable | | `model_core` | `str` | No | Lastest Align model | Specify the model to use for evaluation. Options: `align-20250529`, `align-lightning-20250731` | | `num_retries` | `int` | No | `1` | Number of retry attempts for failed requests | \*Required if `COMPOSO_API_KEY` environment variable is not set. Lightning model does not currently support agents and tool calling, for that evaluation you must be using the default align model. ### Evaluation Method Parameters The `evaluate()` method accepts the following parameters: | Parameter | Type | Required | Description | | ---------- | -------------------------------- | -------- | ----------------------------------------------------------- | | `messages` | `List[Dict]` | Yes | List of message dictionaries with 'role' and 'content' keys | | `criteria` | `str` or `List[str]` | Yes | Evaluation criteria (single string or list of criteria) | | `tools` | `List[Dict]` | No | Tool definitions for evaluating tool calls | | `result` | `OpenAI/Anthropic Result Object` | No | Pre-computed LLM result object to evaluate | #### Environment Variables The SDK supports the following environment variables: * `COMPOSO_API_KEY`: Your Composo API key (used when `api_key` parameter is not provided) ### Response Format The `evaluate` method returns an `EvaluationResponse` object: ```python Python class EvaluationResponse: score: Optional[float] # Score from 0-1 explanation: str # Evaluation explanation ``` # Async Evaluation Use the async client when you need to run multiple evaluations concurrently or integrate with async workflows. ```python Python import asyncio from composo import AsyncComposo async def main(): composo_client = AsyncComposo() result = await composo_client.evaluate( messages=[ {"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hello! How can I help you today?"} ], criteria="Reward responses that are friendly" ) print(f"Score: {result.score}") print(f"Explanation: {result.explanation}") asyncio.run(main()) ``` # Multiple Criteria Evaluation When evaluating against multiple criteria, the async client runs all evaluations concurrently for better performance. 
```python Python import asyncio from composo import AsyncComposo async def main(): client = AsyncComposo() messages = [ {"role": "user", "content": "Explain quantum computing in simple terms"}, {"role": "assistant", "content": "Quantum computing uses quantum mechanics to process information..."} ] criteria = [ "Reward responses that explain complex topics in simple terms", "Reward responses that provide accurate technical information", "Reward responses that are engaging and easy to understand" ] results = await client.evaluate(messages=messages, criteria=criteria) for i, result in enumerate(results): print(f"Criteria {i+1}: Score = {result.score}") print(f"Explanation: {result.explanation}\n") asyncio.run(main()) ``` # Evaluating OpenAI/Anthropic Outputs You can directly evaluate the result of a call to the OpenAI SDK by passing the return of `completions.create` to Composo's `evaluate` method. N.B. Composo will always evaluate choices\[0]. ```python Python import openai from composo import Composo composo_client = Composo() openai_client = openai.OpenAI(api_key="your-openai-key") openai_result = openai_client.chat.completions.create( model="gpt-4", messages=[{"role": "user", "content": "What is machine learning?"}] ) result = composo_client.evaluate( messages=[{"role": "user", "content": "What is machine learning?"}], result=openai_result, criteria="Reward accurate technical explanations" ) print(f"Score: {result.score}") ``` # Error Handling The SDK provides specific exception types: ```python Python from composo import ( ComposoError, RateLimitError, MalformedError, APIError, AuthenticationError ) try: result = composo_client.evaluate(messages=messages, criteria=criteria) except RateLimitError: print("Rate limit exceeded") except AuthenticationError: print("Invalid API key") except ComposoError as e: print(f"Composo error: {e}") ``` ## Logging The SDK uses Python's standard logging module. Configure logging level: ```python Python import logging logging.getLogger("composo").setLevel(logging.INFO) ``` # Agent Evaluation Source: https://docs.composo.ai/pages/usecases/agent-evaluation Evaluate the performance of your agentic systems with Composo's comprehensive agent framework. ## Why Agent Evaluation Matters As LLM applications evolve from simple chat interfaces to sophisticated agentic systems with tool calling, multi-step reasoning, and complex workflows, traditional evaluation approaches fail to capture what makes agents actually work in production. ## The Composo Agent Framework Start here with our battle-tested framework that evaluates agents across five critical dimensions. We've developed this framework through extensive R\&D and tested it with industry partners. ### Proven Through Rigorous Research & Real-World Testing This framework represents **>12 months of intensive R\&D** with leading AI teams who needed agent evaluation that actually works in production.
Here's what makes it different: **The Research Journey** * **Thousands of production agent traces analyzed** from both regulated enterprises as well as leading AI startups * **12 major framework iterations** based on real-world failure modes we discovered * **Validated across 8 industries** including healthcare, finance, legal, and deep knowledge research * **>85% accuracy** in predicting agent success/failure before deployment * **3x faster debugging** of agent issues compared to manual analysis **Why These Specific Metrics?** Our research revealed that agent failures cluster into five distinct patterns. Traditional "did it get the right answer?" evaluation misses >70% of these failure modes: * **Exploration vs Exploitation imbalance**: Agents that either never try new approaches (getting stuck) or never leverage what they've learned (inefficient loops) * **Tool misuse patterns**: Subtle errors in parameter formatting that work 90% of the time but fail catastrophically on edge cases * **Goal drift**: Agents that solve *a* problem but not *the user's* problem * **Hallucinated capabilities**: Agents hallucinating as LLMs are always prone to do (e.g. claiming success when tools actually returned errors, or abandoning critical information from earlier in the conversation) Each metric in our framework directly addresses these production failure modes. This isn't academic theory—it's battle-tested engineering derived from millions of real agent interactions. **Industry Validation** *"Composo's agent framework caught critical issues our own evaluation suite missed. It identified tool-calling patterns that would have caused production outages."* - ML Engineer, Fortune 500 Financial Services *"We reduced our agent failure rate by 35% after implementing Composo's evaluation framework in our CI/CD pipeline."* - Head of AI, Healthcare Startup This framework now evaluates over **10 million agent interactions monthly** across our customer base, continuously proving its effectiveness at scale. 
### Core Agent Metrics **🔍 Exploration** `Reward agents that plan effectively: exploring new information and capabilities, and investigating unknowns despite uncertainty` **⚡ Exploitation** `Reward agents that plan effectively: exploiting existing knowledge and available context to create reliable plans with predictable outcomes` **🔧 Tool Use** `Reward agents that operate tools correctly in accordance with the tool definition, using all relevant context available in tool calls` **🎯 Goal Pursuit** `Reward agents that work towards the goal specified by the user` **✅ Agent Faithfulness** `Reward agents that only make claims that are directly supported by given source material or returns from tool calls without any hallucination or speculation` ## Implementation Guide Agent evaluation is currently only available with our default model, not the lightning model Get started evaluating your agent in under 5 minutes using our pre-built agent framework: ```python wrap from composo import Composo, criteria composo_client = Composo(api_key="YOUR_API_KEY") # Simple weather agent example messages = [ {"role": "user", "content": "What's the weather in Paris?"}, {"role": "assistant", "content": None, "tool_calls": [ { "id": "call_123", "type": "function", "function": { "name": "get_weather", "arguments": "{\"location\": \"Paris, France\"}" } } ]}, {"role": "tool", "tool_call_id": "call_123", "content": "Currently 15°C with clear skies"}, {"role": "assistant", "content": "The weather in Paris is currently 15°C with clear skies."} ] tools = [ { "type": "function", "function": { "name": "get_weather", "description": "Get current weather for a location", "parameters": { "type": "object", "properties": { "location": {"type": "string", "description": "City and country"} }, "required": ["location"] } } } ] # Evaluate with the agents framework results = composo_client.evaluate( messages=messages, tools=tools, criteria=criteria.agent ) for result in results: print(f"Score: {result.score}/1.00") print(f"Explanation: {result.explanation}\n") ``` ### Evaluating with Individual Metrics You can also evaluate against specific metrics from the framework: ```python wrap # Evaluate specific aspects of agent behavior results = composo_client.evaluate( messages=agent_trace, tools=tool_definitions, criteria=[ "Reward agents that work towards the goal specified by the user", "Reward agents that operate tools correctly in accordance with the tool definition", "Reward agents that only make claims directly supported by tool call returns" ] ) ``` ## Advanced Agent Metrics Once you've mastered the core framework, explore these additional agent-level metrics for deeper insights: **Agent Sequencing** `Reward agents that follow logical sequences, such as gathering required information from user before attempting specific lookups` **Agent Efficiency** `Reward agents that are efficient when working towards their goal` **Agent Thoroughness** `Reward agents that are fully comprehensive and thorough when working towards their goal` ## Evaluating Individual Tool Calls For granular analysis, evaluate specific tool call steps within your agent trace: **Tool Call Formulation** `Reward tool calls that formulate arguments using only information provided by the user or previous tool call returns without fabricating parameters` **Tool Relevance** `Reward tool calls that perform actions or retrieve information directly relevant to the goal` **Response Completeness from Tool Returns** `Reward responses that incorporate all relevant information from 
## Writing Custom Agent Criteria

While our agent framework and additional metrics cover many use cases, you can write custom criteria for your specific domain. See our [Criteria Writing guide](/pages/guides/criteria-writing) for detailed instructions on crafting your own criteria.

Common patterns for custom agent criteria:

```python wrap
# Healthcare agent
"Reward agents that appropriately defer to medical professionals for diagnosis"

# Financial agent
"Reward agents that verify account permissions before accessing sensitive data"

# Code generation agent
"Reward agents that validate syntax before executing code modifications"

# Research agent
"Reward agents that prioritize peer-reviewed sources over general web content"
```
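As a minimal sketch of how a custom criterion is used in practice, the call below mirrors the individual-metrics example above; `agent_trace` and `tool_definitions` are hypothetical placeholders for your own agent messages and tool schemas.

```python wrap
# A minimal sketch: `agent_trace` and `tool_definitions` are hypothetical
# placeholders for your own agent messages and tool schemas
results = composo_client.evaluate(
    messages=agent_trace,
    tools=tool_definitions,
    criteria=["Reward agents that appropriately defer to medical professionals for diagnosis"]
)
```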
## Next Steps

* [**Read our Agent Evaluation Blog**](https://www.composo.ai/post/agentic-evals) - Deep dive into evaluation strategies
* [**Explore the Criteria Library**](/pages/guides/criteria-library) - Find more pre-built criteria

# RAG Evaluation

Source: https://docs.composo.ai/pages/usecases/rag-evaluation

Battle-tested metrics for retrieval-augmented generation including faithfulness, completeness, and precision.

## Why RAG Evaluation Matters

Retrieval-Augmented Generation (RAG) systems are only as good as their ability to accurately use retrieved information. Poor RAG performance leads to hallucinations, incomplete answers, and loss of user trust. Composo's RAG framework provides comprehensive evaluation across the critical dimensions of RAG quality.

## The Composo RAG Framework

Our framework, developed through extensive R\&D and rigorously tested with Fortune 500 companies and leading AI teams, delivers **92% accuracy** in detecting hallucinations and faithfulness violations, far exceeding the \~70% accuracy of LLM-as-judge approaches.

### Proven Performance

* **18 months of research** refining the optimal RAG evaluation criteria
* **Battle-tested** across hundreds of production RAG systems, including critical hallucination detection in regulated industries
* **92% agreement** with expert human evaluators on RAG quality assessment
* **70% reduction in error rate** compared to traditional LLM-as-judge methods

This isn't just another evaluation tool; it's the result of deep collaboration with industry leaders who needed evaluation that actually works for production RAG systems handling millions of queries daily.

### Core RAG Metrics

**📖 Context Faithfulness**
`Reward responses that make only claims directly supported by the provided source material without any hallucination or speculation`

**✅ Completeness**
`Reward responses that comprehensively include all relevant information from the source material needed to fully answer the question`

**🎯 Context Precision**
`Reward responses that include only information necessary to answer the question without extraneous details from the source material`

**🔍 Relevance**
`Reward responses where all content directly addresses and is relevant to answering the user's specific question`

## Implementation Example

Here's how to evaluate a RAG system's performance using our framework:

```python Python wrap
from composo import Composo, criteria

composo_client = Composo(api_key="your-api-key-here")

# Example RAG conversation with retrieved context
messages = [
    {
        "role": "user",
        "content": """What is the current population of Tokyo?

Context: According to the 2020 census, Tokyo's metropolitan area has approximately 37.4 million residents, making it the world's most populous urban agglomeration. The Tokyo Metropolis itself has 14.0 million people."""
    },
    {
        "role": "assistant",
        "content": "Based on the 2020 census data provided, Tokyo has 14.0 million people in the metropolis proper, while the greater metropolitan area contains approximately 37.4 million residents, making it the world's largest urban agglomeration."
    }
]

# Evaluate with the RAG framework
results = composo_client.evaluate(
    messages=messages,
    criteria=criteria.rag
)

for result in results:
    print(f"Score: {result.score}/1.00")
    print(f"Explanation: {result.explanation}\n")
```

## Evaluating Retrieval Quality

Beyond evaluating the generated responses, you can also assess the quality of your retrieval system itself. This helps identify when your vector search or retrieval mechanism needs improvement before it impacts downstream generation.

### How It Works

Treat your retrieval step as a "tool call" and evaluate whether the retrieved chunks are actually relevant to the user's query. This gives you quantitative metrics on retrieval precision.

### Implementation

```python Python wrap
from composo import Composo

composo_client = Composo(api_key="your-api-key-here")

# User's question
user_query = "What is the current population of Tokyo?"

# Chunks retrieved by your RAG system
retrieved_chunks = """
Chunk 1: According to the 2020 census, Tokyo's metropolitan area has approximately 37.4 million residents.

Chunk 2: The Tokyo Metropolis itself has 14.0 million people.

Chunk 3: Population density in Tokyo is approximately 6,158 people per square kilometer.
"""

# Define the retrieval tool (for context)
tools = [
    {
        "type": "function",
        "function": {
            "name": "rag_retrieval",
            "description": "Retrieves relevant document chunks based on semantic search",
            "parameters": {"type": "object", "required": [], "properties": {}}
        }
    }
]

# Evaluate retrieval quality
result = composo_client.evaluate(
    messages=[
        {"role": "user", "content": user_query},
        {"role": "function", "name": "rag_retrieval", "content": retrieved_chunks}
    ],
    tools=tools,
    criteria="Reward tool calls that retrieve chunks directly relevant to answering the user's question"
)

print(f"Retrieval Quality Score: {result.score:.2f}/1.00")

# High scores (>0.8) indicate good retrieval
# Low scores (<0.6) suggest retrieval improvements needed
```
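To localize retrieval problems further, you can score each chunk on its own with the same criterion. The sketch below is illustrative: `chunks` is a hypothetical list of the individual chunk strings returned for this query, and the code reuses `composo_client`, `user_query`, and `tools` from the example above.

```python Python wrap
# A minimal sketch: score each retrieved chunk separately to see which
# chunks are pulling retrieval quality down. `chunks` is a hypothetical
# list of the individual chunk strings returned for this query.
chunks = [
    "According to the 2020 census, Tokyo's metropolitan area has approximately 37.4 million residents.",
    "The Tokyo Metropolis itself has 14.0 million people.",
    "Population density in Tokyo is approximately 6,158 people per square kilometer."
]

for i, chunk in enumerate(chunks, start=1):
    result = composo_client.evaluate(
        messages=[
            {"role": "user", "content": user_query},
            {"role": "function", "name": "rag_retrieval", "content": chunk}
        ],
        tools=tools,
        criteria="Reward tool calls that retrieve chunks directly relevant to answering the user's question"
    )
    print(f"Chunk {i} relevance: {result.score:.2f}/1.00")
```

Chunks that consistently score low are candidates for better chunking, filtering, or re-ranking before they reach the generation step.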
""" # Define the retrieval tool (for context) tools = [ { "type": "function", "function": { "name": "rag_retrieval", "description": "Retrieves relevant document chunks based on semantic search", "parameters": {"type": "object", "required": [], "properties": {}} } } ] # Evaluate retrieval quality result = composo_client.evaluate( messages=[ {"role": "user", "content": user_query}, {"role": "function", "name": "rag_retrieval", "content": retrieved_chunks} ], tools=tools, criteria="Reward tool calls that retrieve chunks directly relevant to answering the user's question" ) print(f"Retrieval Quality Score: {result.score:.2f}/1.00") # High scores (>0.8) indicate good retrieval # Low scores (<0.6) suggest retrieval improvements needed ``` # Response Quality Evaluation Source: https://docs.composo.ai/pages/usecases/response-evaluation Evaluate custom quality aspects of LLM responses Beyond our pre-built Agent & RAG frameworks, Composo's real power lies in writing custom criteria for any quality aspect you care about—and most teams do exactly this for their specific use cases. ## What is Response Quality Evaluation? Response quality evaluation assesses subjective and domain-specific aspects of assistant responses: tone, style, safety, adherence to guidelines, and any custom quality metric unique to your application. ## Example Criteria ### Core Quality Metrics * **Conciseness**: `"Reward responses that are clear and direct, avoiding unnecessary verbosity, repetition, or extraneous details"` * **Information Structure**: `"Reward responses that present information in a logical, well-organized format that prioritizes the most important details"` * **Professional Tone**: `"Reward responses that maintain appropriate professional language and tone suitable for the context"` * **Actionable Guidance**: `"Reward responses that provide practical next steps or actionable recommendations when appropriate"` ### Safety & Compliance * **Harmful Content**: `"Penalize responses that provide inappropriate advice (e.g., medical advice, harmful instructions) outside the system's intended scope"` * **System Compliance**: `"Penalize responses that violate explicit system constraints, limitations, or instructions"` ### Domain-Specific Examples * **Healthcare**: `"Reward responses that use precise medical terminology appropriate for the audience (clinician vs patient)"` * **Customer Service**: `"Reward responses that express appropriate empathy when the user is frustrated"` * **Technical Support**: `"Reward responses that precisely adhere to the technical user manual's resolution steps"` * **Education**: `"Reward responses that adapt explanation complexity to match the user's learning level"` ## Writing Effective Criteria Every criterion follows this simple template: ``` [Prefix] [quality] [qualifier (optional)] ``` * **Prefix**: "Reward responses that..." or "Penalize responses that..." 
## Writing Effective Criteria

Every criterion follows this simple template:

```
[Prefix] [Quality] [Qualifier (optional)]
```

* **Prefix**: "Reward responses that..." or "Penalize responses that..."
* **Quality**: The specific behavior you want to evaluate
* **Qualifier**: Optional "if" statement for conditional application

**Example**: `"Reward responses that provide code examples if the user asks for implementation details"`

* Prefix: "Reward responses that"
* Quality: "provide code examples"
* Qualifier: "if the user asks for implementation details"

### Key Principles

✅ **Be specific** - Focus on one quality at a time\
✅ **Use clear direction** - Start with "Reward" or "Penalize"\
✅ **Add qualifiers when needed** - Use "appropriate" for non-monotonic qualities\
✅ **Leverage domain expertise** - Your knowledge of what "good" looks like is your secret weapon

## Next Steps

📚 [**Browse our Criteria Library**](/pages/guides/criteria-library) - Explore tried & tested criteria across domains for inspiration\
✏️ [**How to Write Criteria Guide**](/pages/guides/criteria-writing) - Master the art of writing precise evaluation criteria