# Reward Source: https://docs.composo.ai/api-reference/evals/reward https://platform.composo.ai/api/evals-docs/openapi.json post /api/v1/evals/reward Evaluate LLM output against specified criteria. Score on a continuous 0-1 scale. # Get Usage Source: https://docs.composo.ai/api-reference/usage/get-usage https://platform.composo.ai/api/evals-docs/openapi.json get /api/v1/usage Get current usage information for the authenticated user. # FAQs Source: https://docs.composo.ai/documentation/FAQs/common-questions ### Should I include system messages when evaluating with Composo? * Including system messages is optional but recommended, as they provide useful context that can improve evaluation accuracy. ### What's the context limit? * This is model dependent; see the context windows by model [here](/documentation/getting-started/models#available-model-versions). ### What's the expected response time? * This is model dependent; see the latency by model [here](/documentation/getting-started/models#available-model-versions). ### What are the rate limits? * **Free plan:** 500 requests per hour * **Paid plans:** Higher limits based on your specific requirements ### What languages are supported? * Our evaluation models support all major languages plus code. A good rule of thumb is that if you don't need a specialized model to deal with your language, we can handle it. ### Can I evaluate tool calls, not just responses? * Yes! Composo evaluates all agent behavior, including tool calls. ### How deterministic are the evaluation scores? * Composo achieves \<1% variance in scoring, meaning the same input will produce virtually identical scores every time. This compares to 30%+ variance typical with LLM-as-judge approaches. We also cache results for benchmark evaluations to ensure perfect repeatability across runs. ### What do you mean by a generative reward model architecture? 
* It's a dual-model system: one model generates detailed reasoning about why an output meets your criteria, while another specialized scoring model (trained on preference data) produces the actual score. This separation ensures both interpretable explanations and consistent, meaningful scores. ### How complex is the integration? * Integration takes just 3 lines of code. You send your conversation and a simple evaluation criterion like "reward responses that are accurate." All the complexity happens behind the scenes. It's a drop-in replacement for anywhere you currently use LLM-as-judge. ### What makes Composo more accurate than LLM-as-judge? * We use purpose-built reward models trained on tens of thousands of human preference comparisons across real-world domains. Instead of asking an LLM to generate arbitrary scores, our models learn quality distributions through pairwise comparisons (similar to ELO rankings). This creates meaningful, consistent scoring that's grounded in actual human judgments. ### How do you achieve such consistent scoring? * We use a multi-layered approach including ensemble techniques and statistical aggregation. Multiple specialized models analyze each evaluation, and we aggregate their outputs to eliminate random variance. This is fundamentally different from single-model LLM approaches that produce different scores each time. # Agent Evaluation Source: https://docs.composo.ai/documentation/cookbooks/agent-evaluation Evaluate the performance of your agentic systems with Composo's comprehensive agent framework. ## Why Agent Evaluation Matters As LLM applications evolve from simple chat interfaces to sophisticated agentic systems with tool calling, multi-step reasoning, and complex workflows, traditional evaluation approaches fail to capture what makes agents actually work in production. ## The Composo Agent Framework Start here with our battle-tested framework that evaluates agents across five critical dimensions. 
We've developed this framework through extensive R\&D and tested it with industry partners. ### Proven Through Rigorous Research & Real-World Testing This framework represents **>12 months of intensive R\&D** with leading AI teams who needed agent evaluation that actually works in production. Here's what makes it different: **The Research Journey** * **Thousands of production agent traces analyzed** from both regulated enterprises and leading AI startups * **12 major framework iterations** based on real-world failure modes we discovered * **Validated across 8 industries** including healthcare, finance, legal, and deep knowledge research * **>85% accuracy** in predicting agent success/failure before deployment * **3x faster debugging** of agent issues compared to manual analysis **Why These Specific Metrics?** Our research revealed that agent failures cluster into five distinct patterns. Traditional "did it get the right answer?" evaluation misses >70% of these failure modes: * **Exploration vs Exploitation imbalance**: Agents that either never try new approaches (getting stuck) or never leverage what they've learned (inefficient loops) * **Tool misuse patterns**: Subtle errors in parameter formatting that work 90% of the time but fail catastrophically on edge cases * **Goal drift**: Agents that solve *a* problem but not *the user's* problem * **Hallucinated capabilities**: Agents that hallucinate, as LLMs are prone to do (e.g. claiming success when tools actually returned errors, or abandoning critical information from earlier in the conversation) Each metric in our framework directly addresses these production failure modes. This isn't academic theory—it's battle-tested engineering derived from millions of real agent interactions. **Industry Validation** *"Composo's agent framework caught critical issues our own evaluation suite missed. 
It identified tool-calling patterns that would have caused production outages."* - ML Engineer, Fortune 500 Financial Services *"We reduced our agent failure rate by 35% after implementing Composo's evaluation framework in our CI/CD pipeline."* - Head of AI, Healthcare Startup This framework now evaluates over **10 million agent interactions monthly** across our customer base, continuously proving its effectiveness at scale. ### Core Agent Metrics **🔍 Exploration** `Reward agents that plan effectively: exploring new information and capabilities, and investigating unknowns despite uncertainty` **⚡ Exploitation** `Reward agents that plan effectively: exploiting existing knowledge and available context to create reliable plans with predictable outcomes` **🔧 Tool Use** `Reward agents that operate tools correctly in accordance with the tool definition, using all relevant context available in tool calls` **🎯 Goal Pursuit** `Reward agents that work towards the goal specified by the user` **✅ Agent Faithfulness** `Reward agents that only make claims that are directly supported by given source material or returns from tool calls without any hallucination or speculation` ## Implementation Guide Agent evaluation is currently only available with our default model, not the Lightning model ### Using Agent Tracing The recommended approach for agent evaluation is to use our tracing SDK. This allows you to instrument your agent code and capture real-time execution traces for evaluation. 
**Agent Evaluation Criteria:** * `agent_exploration` - Reward agents that plan effectively: exploring new information and capabilities, and investigating unknowns despite uncertainty * `agent_exploitation` - Reward agents that plan effectively: exploiting existing knowledge and available context to create reliable plans with predictable outcomes * `agent_tool_use` - Reward agents that operate tools correctly in accordance with the tool definition, using all relevant context available in tool calls * `agent_goal_pursuit` - Reward agents that work towards the goal specified by the user * `agent_faithfulness` - Reward agents that only make claims that are directly supported by given source material or returns from tool calls without any hallucination or speculation Alternatively, `criteria.agent` is a list that contains all of the above. Get started evaluating your agent in under 5 minutes using our tracing SDK and pre-built agent framework: ```python wrap theme={null} from composo import Composo from composo.models import criteria from composo.tracing import ComposoTracer, Instruments, AgentTracer, agent_tracer from openai import OpenAI # Initialize tracing for OpenAI ComposoTracer.init(instruments=[Instruments.OPENAI]) composo_client = Composo(api_key="YOUR_API_KEY") openai_client = OpenAI() # Define a weather agent as a function @agent_tracer(name="weather_agent") def get_weather_info(location): return openai_client.chat.completions.create( model="gpt-4o-mini", messages=[{"role": "user", "content": f"What's the weather in {location}?"}], max_tokens=100 ) # Orchestrator coordinates the agent workflow with AgentTracer("orchestrator") as tracer: # Execute the weather agent result = get_weather_info("Paris") # Evaluate the full agent trace results = composo_client.evaluate_trace(tracer.trace, criteria=criteria.agent) for result, criterion in zip(results, criteria.agent): print(f"Criterion: {criterion}") for agent in result.results_by_agent_name: print(f"{agent}:") 
print(f" summary_statistics: {result.results_by_agent_name[agent].summary_statistics} ") for id in result.results_by_agent_name[agent].results_by_agent_instance_id: if result.results_by_agent_name[agent].results_by_agent_instance_id[id]: print(f" Agent instance: {id}") print(f" Score: {result.results_by_agent_name[agent].results_by_agent_instance_id[id].score}") print(f" Explanation: {result.results_by_agent_name[agent].results_by_agent_instance_id[id].explanation}") print("-" * 40) ``` [Learn more about agent tracing →](/pages/sdk/tracing) ### Evaluating with Individual Metrics You can also evaluate against specific metrics from the framework: ```python wrap theme={null} # Evaluate specific aspects of agent behavior results = composo_client.evaluate_trace( tracer.trace, criteria=[ criteria.agent_goal_pursuit, criteria.agent_tool_use, criteria.agent_faithfulness ] ) ``` ## Advanced Agent Metrics Once you've mastered the core framework, explore these additional agent-level metrics for deeper insights: **Agent Sequencing** `Reward agents that follow logical sequences, such as gathering required information from user before attempting specific lookups` **Agent Efficiency** `Reward agents that are efficient when working towards their goal` **Agent Thoroughness** `Reward agents that are fully comprehensive and thorough when working towards their goal` ## Evaluating Individual Tool Calls For granular analysis, evaluate specific tool call steps within your agent trace: **Tool Call Formulation** `Reward tool calls that formulate arguments using only information provided by the user or previous tool call returns without fabricating parameters` **Tool Relevance** `Reward tool calls that perform actions or retrieve information directly relevant to the goal` **Response Completeness from Tool Returns** `Reward responses that incorporate all relevant information from tool call returns needed to comprehensively answer the user's question` **Response Precision from Tool Returns** 
`Reward responses that include only the specific information from tool call returns that directly addresses the user's query` **Response Faithfulness to Tool Returns** `Reward responses that only make claims that are directly supported by given source material or returns from tool calls without any hallucination or speculation` ## Writing Custom Agent Criteria While our agent framework and additional metrics cover many use cases, you can write custom criteria for your specific domain. See our [Criteria Writing guide](/pages/guides/criteria-writing) for detailed instructions on crafting your own criteria. Common patterns for custom agent criteria: ```python wrap theme={null} # Healthcare agent "Reward agents that appropriately defer to medical professionals for diagnosis" # Financial agent "Reward agents that verify account permissions before accessing sensitive data" # Code generation agent "Reward agents that validate syntax before executing code modifications" # Research agent "Reward agents that prioritize peer-reviewed sources over general web content" ``` ## Next Steps * [**Read our Agent Evaluation Blog**](https://www.composo.ai/post/agentic-evals) - Deep dive into evaluation strategies * [**Explore the Criteria Library**](/pages/guides/criteria-library) - Find more pre-built criteria # RAG Evaluation Source: https://docs.composo.ai/documentation/cookbooks/rag-evaluation Battle-tested metrics for retrieval-augmented generation including faithfulness, completeness, and precision. ## Why RAG Evaluation Matters Retrieval-Augmented Generation (RAG) systems are only as good as their ability to accurately use retrieved information. Poor RAG performance leads to hallucinations, incomplete answers, and loss of user trust. Composo's RAG framework provides comprehensive evaluation across the critical dimensions of RAG quality. 
## The Composo RAG Framework Our framework, developed through extensive R\&D and rigorously tested with Fortune 500 companies and leading AI teams, delivers **92% accuracy** in detecting hallucinations and faithfulness violations—far exceeding the \~70% accuracy of LLM-as-judge approaches. ### Proven Performance * **18 months of research** refining the optimal RAG evaluation criteria * **Battle-tested** across hundreds of production RAG systems, including critical hallucination detection in regulated industries * **92% agreement** with expert human evaluators on RAG quality assessment * **70% reduction in error rate** compared to traditional LLM-as-judge methods This isn't just another evaluation tool—it's the result of deep collaboration with industry leaders who needed evaluation that actually works for production RAG systems handling millions of queries daily. ### Core RAG Metrics **📖 Context Faithfulness** "Reward responses that make only claims directly supported by the provided source material without any hallucination or speculation" **✅ Completeness** "Reward responses that comprehensively include all relevant information from the source material needed to fully answer the question" **🎯 Context Precision** "Reward responses that include only information necessary to answer the question without extraneous details from the source material" **🔍 Relevance** "Reward responses where all content directly addresses and is relevant to answering the user's specific question" ## Implementation Example Our SDK now provides independent criteria variables for RAG evaluation, making it easier to use specific criteria or create custom combinations. Each criterion is defined as a separate variable with clear, focused descriptions. 
**RAG Evaluation Criteria:** * `rag_faithfulness` - Reward responses that make only claims directly supported by the provided source material without any hallucination or speculation * `rag_completeness` - Reward responses that comprehensively include all relevant information from the source material needed to fully answer the question * `rag_precision` - Reward responses that include only information necessary to answer the question without extraneous details from the source material * `rag_relevance` - Reward responses where all content directly addresses and is relevant to answering the user's specific question Alternatively, `criteria.rag` is a list that contains all of the above. Here's how to evaluate a RAG system's performance using our framework: ```python Python wrap theme={null} from composo import Composo, criteria composo_client = Composo(api_key="your-api-key-here") # Example RAG conversation with retrieved context messages = [ { "role": "user", "content": """What is the current population of Tokyo? Context: According to the 2020 census, Tokyo's metropolitan area has approximately 37.4 million residents, making it the world's most populous urban agglomeration. The Tokyo Metropolis itself has 14.0 million people.""" }, { "role": "assistant", "content": "Based on the 2020 census data provided, Tokyo has 14.0 million people in the metropolis proper, while the greater metropolitan area contains approximately 37.4 million residents, making it the world's largest urban agglomeration." } ] # Evaluate with the RAG framework results = composo_client.evaluate( messages=messages, criteria=criteria.rag ) for result, criterion in zip(results, criteria.rag): print(f"Criterion: {criterion}") print(f"Score: {result.score}") print(f"Explanation: {result.explanation}") print("-" * 40) ``` ## Evaluating Retrieval Quality Beyond evaluating the generated responses, you can also assess the quality of your retrieval system itself. 
This helps identify when your vector search or retrieval mechanism needs improvement before it impacts downstream generation. ### How It Works Treat your retrieval step as a "tool call" and evaluate whether the retrieved chunks are actually relevant to the user's query. This gives you quantitative metrics on retrieval precision. ### Implementation ```python Python wrap theme={null} from composo import Composo composo_client = Composo(api_key="your-api-key-here") # User's question user_query = "What is the current population of Tokyo?" # Chunks retrieved by your RAG system retrieved_chunks = """ Chunk 1: According to the 2020 census, Tokyo's metropolitan area has approximately 37.4 million residents. Chunk 2: The Tokyo Metropolis itself has 14.0 million people. Chunk 3: Population density in Tokyo is approximately 6,158 people per square kilometer. """ # Define the retrieval tool (for context) tools = [ { "type": "function", "function": { "name": "rag_retrieval", "description": "Retrieves relevant document chunks based on semantic search", "parameters": {"type": "object", "required": [], "properties": {}} } } ] # Evaluate retrieval quality result = composo_client.evaluate( messages=[ {"role": "user", "content": user_query}, {"role": "function", "name": "rag_retrieval", "content": retrieved_chunks} ], tools=tools, criteria="Reward tool calls that retrieve chunks directly relevant to answering the user's question" ) print(f"Retrieval Quality Score: {result.score:.2f}/1.00") # High scores (>0.8) indicate good retrieval # Low scores (<0.6) suggest retrieval improvements needed ``` # Response Quality Evaluation Source: https://docs.composo.ai/documentation/cookbooks/response-evaluation Evaluate custom quality aspects of LLM responses Beyond our pre-built Agent & RAG frameworks, Composo's real power lies in writing custom criteria for any quality aspect you care about—and most teams do exactly this for their specific use cases. ## What is Response Quality Evaluation? 
Response quality evaluation assesses subjective and domain-specific aspects of assistant responses: tone, style, safety, adherence to guidelines, and any custom quality metric unique to your application. ## Example Criteria ### Core Quality Metrics * **Conciseness**: `"Reward responses that are clear and direct, avoiding unnecessary verbosity, repetition, or extraneous details"` * **Information Structure**: `"Reward responses that present information in a logical, well-organized format that prioritizes the most important details"` * **Professional Tone**: `"Reward responses that maintain appropriate professional language and tone suitable for the context"` * **Actionable Guidance**: `"Reward responses that provide practical next steps or actionable recommendations when appropriate"` ### Safety & Compliance * **Harmful Content**: `"Penalize responses that provide inappropriate advice (e.g., medical advice, harmful instructions) outside the system's intended scope"` * **System Compliance**: `"Penalize responses that violate explicit system constraints, limitations, or instructions"` ### Domain-Specific Examples * **Healthcare**: `"Reward responses that use precise medical terminology appropriate for the audience (clinician vs patient)"` * **Customer Service**: `"Reward responses that express appropriate empathy when the user is frustrated"` * **Technical Support**: `"Reward responses that precisely adhere to the technical user manual's resolution steps"` * **Education**: `"Reward responses that adapt explanation complexity to match the user's learning level"` ## Writing Effective Criteria Every criterion follows this simple template: ``` [Prefix] [quality] [qualifier (optional)] ``` * **Prefix**: "Reward responses that..." or "Penalize responses that..." 
* **Quality**: The specific behavior you want to evaluate * **Qualifier**: Optional "if" statement for conditional application **Example**: `"Reward responses that provide code examples if the user asks for implementation details"` * Prefix: "Reward responses that" * Quality: "provide code examples" * Qualifier: "if the user asks for implementation details" ### Key Principles ✅ **Be specific** - Focus on one quality at a time\ ✅ **Use clear direction** - Start with "Reward" or "Penalize"\ ✅ **Add qualifiers when needed** - Use "appropriate" for non-monotonic qualities\ ✅ **Leverage domain expertise** - Your knowledge of what "good" looks like is your secret weapon ## Next Steps 📚 [**Browse our Criteria Library**](/pages/guides/criteria-library) - Explore tried & tested criteria across domains for inspiration\ ✏️ [**How to Write Criteria Guide**](/pages/guides/criteria-writing) - Master the art of writing precise evaluation criteria # Models Source: https://docs.composo.ai/documentation/getting-started/models Composo has developed 2 distinct model types that each achieve best in class scoring performance for their respective tasks. 
**Expert-level Agent evaluation for production confidence** * Our flagship model for when accuracy matters most * 5-15 second response time * Achieves 92% accuracy on real-world evaluation tasks (vs \~70% for LLM-as-judge) * Detailed evidence-based explanations * Optimized for evaluating complex Agentic applications end-to-end **Fast evaluation for rapid iteration** * 3 second median response time * Optimized for development workflows and real-time feedback * Ideal for quick iteration during development and testing ## Available Model Versions ### Composo Align Versions | Version | Context Window | Latency | Notes | | ---------------- | -------------- | ------------ | ----------------------------------------------------------- | | `align-20251111` | 350K tokens | 5-10 seconds | Current stable version | | `align-20250529` | 150K tokens | 5-15 seconds | Deprecated - migrate to `align-20251111` | ### Composo Align Lightning Versions | Version | Context Window | Latency | Notes | | -------------------------- | -------------- | ------------ | ---------------------- | | `align-lightning-20251127` | 32K tokens | 100 - 800 ms | Beta | | `align-lightning-20250731` | 200K tokens | 1-5 seconds | Current stable version | # Quickstart Source: https://docs.composo.ai/documentation/getting-started/quickstart Ship AI agents that actually work in production Composo delivers deterministic, accurate evaluation for LLM applications through purpose-built generative reward models. Unlike unreliable LLM-as-judge approaches, our specialized models provide consistent, precise scores you can trust—with just a single-sentence criterion. Get up and running with Composo in under 5 minutes. This guide will help you evaluate your first LLM response and understand how Composo delivers deterministic, accurate evaluations. ## Step 1: Create Your Account Sign up for a Composo account at [platform.composo.ai](https://platform.composo.ai). ## Step 2: Generate Your API Key 1. 
Navigate to **Profile** → **API Keys** in the dashboard 2. Click **Generate New API Key** ## Step 3: Run Your First Evaluation \[Optional] Install the SDK: ```bash theme={null} pip install composo ``` Now let's evaluate a customer service response for empathy and helpfulness using the Composo SDK: ```python Python wrap theme={null} from composo import Composo # Initialize the client with your API key composo_client = Composo(api_key="YOUR_API_KEY") # Example: Evaluating a customer service response result = composo_client.evaluate( messages=[ {"role": "user", "content": "I'm really frustrated with my device not working."}, {"role": "assistant", "content": "I'm sorry to hear that you're experiencing issues with your device. Let's see how I can assist you to resolve this problem."} ], criteria="Reward responses that express appropriate empathy if the user is facing a problem they're finding frustrating" ) # Display results print(f"Score: {result.score}") print(f"Analysis: {result.explanation}") ``` ```bash cURL theme={null} curl -X POST "https://platform.composo.ai/api/v1/evals/reward" \ -H "API-Key: YOUR_API_KEY" \ -H "Content-Type: application/json" \ -d '{ "messages": [ { "role": "user", "content": "I'\''m really frustrated with my device not working." }, { "role": "assistant", "content": "I'\''m sorry to hear that you'\''re experiencing issues with your device. Let'\''s see how I can assist you to resolve this problem." } ], "evaluation_criteria": "Reward responses that express appropriate empathy if the user is facing a problem they'\''re finding frustrating" }' ``` ### Understanding the Results Composo returns: * **Score**: A value between 0 and 1 (e.g. 
0.86 means the response strongly meets your criteria) * **Explanation**: Detailed analysis of why the response received this score Example output: ```json JSON wrap theme={null} Score: 0.86/1.0 Analysis: - The assistant directly acknowledges the user's difficulty and expresses sympathy ("I'm sorry to hear that you're experiencing issues"), showing clear empathy. - The response is timely and supportive, immediately addressing the expressed frustration and not ignoring the emotional content. - It constructively adds a collaborative next step ("Let's see how I can assist you"), enhancing the empathetic tone, with only minor room for deeper emotional mirroring. ``` ## Step 4: Evaluate Agents with Tracing For agent applications, Composo provides real-time tracing to capture and evaluate multi-agent interactions. Here's a simple example with an orchestrator coordinating two sub-agents: ```python Python wrap theme={null} from composo import Composo from composo.models import criteria from composo.tracing import ComposoTracer, Instruments, AgentTracer, agent_tracer from openai import OpenAI # Initialize tracing for OpenAI ComposoTracer.init(instruments=[Instruments.OPENAI]) composo_client = Composo(api_key="YOUR_API_KEY") openai_client = OpenAI() # Define a simple sub-agent @agent_tracer(name="research_agent") def research_agent(topic): return openai_client.chat.completions.create( model="gpt-4o-mini", messages=[{"role": "user", "content": f"Research: {topic}"}], max_tokens=50 ) # Orchestrator coordinates multiple agents with AgentTracer("orchestrator") as tracer: # First sub-agent: planning with AgentTracer("planning_agent"): plan = openai_client.chat.completions.create( model="gpt-4o-mini", messages=[{"role": "user", "content": "Plan a trip to Paris"}], max_tokens=50 ) # Second sub-agent: research research = research_agent("Paris attractions") # Evaluate the full agent trace results = composo_client.evaluate_trace(tracer.trace, criteria=criteria.agent) for result, 
criterion in zip(results, criteria.agent): print(f"Criterion: {criterion}") print(f"Evaluation Result: {result}\n") ``` This example shows how Composo traces each agent's LLM calls independently and evaluates them against our comprehensive agent framework. # Criteria Library Source: https://docs.composo.ai/documentation/guides/criteria-library Here's a range of criteria that we've found helpful when writing your own ## Core frameworks (start here) ### RAG framework * **Context Faithfulness**: Reward responses that make only claims directly supported by the provided source material without any hallucination or speculation * **Completeness**: Reward responses that comprehensively include all relevant information from the source material needed to fully answer the question * **Context Precision**: Reward responses that include only information necessary to answer the question without extraneous details from the source material * **Relevance**: Reward responses where all content directly addresses and is relevant to answering the user's specific question ### Agents framework * **Exploration**: Reward agents that plan effectively: exploring new information and capabilities, and investigating unknowns despite uncertainty * **Exploitation**: Reward agents that plan effectively: exploiting existing knowledge and available context to create reliable plans with predictable outcomes * **Tool use**: Reward agents that operate tools correctly in accordance with the tool definition, using all relevant context available in tool calls * **Goal pursuit**: Reward agents that work towards the goal specified by the user * **Agent Faithfulness**: Reward agents that only make claims that are directly supported by given source material or returns from tool calls without any hallucination or speculation ## Advanced metrics (use these next) ### Agents * **Agent Sequencing:** Reward agents that follow logical sequences, such as gathering required information from the user before attempting 
specific lookups * **Agent Efficiency:** Reward agents that are efficient when working towards their goal * **Agent Thoroughness:** Reward agents that are fully comprehensive and thorough when working towards their goal ### Individual tool call focused (use these when you want to pinpoint specific tool call steps) * **Tool Call Formulation:** Reward tool calls that formulate arguments using only information provided by the user or previous tool call returns without fabricating parameters. * **Tool Relevance:** Reward tool calls that perform actions or retrieve information directly relevant to the goal. * **Response completeness from tool return:** Reward responses that incorporate all relevant information from tool call returns needed to comprehensively answer the user's question. * **Response precision from tool return:** Reward responses that include only the specific information from tool call returns that directly addresses the user's query * **Response faithfulness to tool return**: Reward responses that only make claims that are directly supported by given source material or returns from tool calls without any hallucination or speculation ### Response quality * **Conciseness:** Reward responses that are clear and direct, avoiding unnecessary verbosity, repetition, or extraneous details * **Information Structure:** Reward responses that present information in a logical, well-organized format that prioritizes the most important details. * **Professional Tone:** Reward responses that maintain appropriate professional language and tone suitable for the context. * **Actionable Guidance:** Reward responses that provide practical next steps or actionable recommendations when appropriate. ### Accuracy and robustness * **Source Attribution:** Reward responses that explicitly cite or reference specific source documents or sections used to support each claim. 
* **Factual Accuracy:** Reward responses that accurately reflect factual information without introducing errors or fabricated details. * **Uncertainty Handling:** Reward responses that appropriately acknowledge limitations when information is incomplete or unavailable, rather than making assumptions. * **Appropriate Refusals:** Reward responses that appropriately refuse to answer when source material lacks sufficient information to address the question. ### Safety * **Harmful Content Prevention:** Penalize responses that provide inappropriate advice (e.g., medical advice, harmful instructions) outside the system's intended scope. * **System Compliance:** Penalize responses that violate explicit system constraints, limitations, or instructions. ## Extended library (for inspiration when writing your own) * **Creativity:** Reward responses that demonstrate original thinking, novel approaches, or innovative solutions. * **Empathy:** Reward responses that show understanding and connection with human emotions and experiences. * **Humor:** Reward responses that appropriately use wit, clever wordplay, or situational comedy when suitable to context. * **Surprise:** Reward responses that include unexpected but delightful elements or developments. * **Happiness:** Reward responses that evoke positive emotions and create uplifting experiences. * **Narrative Structure:** Reward responses that maintain logical progression and development. * **Legal Authority:** Reward responses that prioritize the most authoritative legal sources (legislation, case law, preparatory works). * **Jurisdictional Accuracy:** Reward responses that correctly identify jurisdictional context and cite the most recent legally binding sources. * **Legal Terminology:** Reward responses that correctly interpret legal terminology, avoiding confusion with non-legal meanings. * **Citation Recognition:** Reward responses that recognize and appropriately process standard legal citation formats. 
* **Quantitative Accuracy:** Reward responses that accurately represent quantitative data without speculation beyond provided information.
* **Metric Context:** Reward responses that include appropriate context for metrics, comparisons, and calculations.
* **Risk Disclosure:** Reward responses that acknowledge limitations and uncertainties in quantitative analysis.
* **Regulatory Compliance:** Penalize responses that include financial recommendations without appropriate risk disclaimers.
* **Issue Resolution:** Reward responses that capture all significant elements: issue nature, agent actions, and resolutions offered.
* **Entity Accuracy:** Reward responses that correctly identify specific entities (payment methods, brands, etc.) only when explicitly mentioned.
* **Interaction Dynamics:** Reward responses that accurately represent both customer and agent perspectives.
* **Chronological Clarity:** Reward responses that present information in clear chronological sequence.
* **Query Translation:** Reward responses that accurately translate natural language intent with proper syntax.
* **Feature Accuracy:** Penalize responses that reference outdated, incorrect, or non-existent functionality.
* **Validation Implementation:** Penalize responses that fail to include critical validation rules when specified.
* **Cost Efficiency:** Reward responses that provide cost-effective technical solutions.
* **Medical Terminology:** Reward responses that use precise medical terminology appropriate for the audience (clinician vs patient).
* **Evidence-Based Content:** Reward responses that reference current clinical guidelines or peer-reviewed studies.
* **Harm Prevention:** Penalize responses that could delay necessary medical care through self-diagnosis suggestions.
* **Appropriate Referrals:** Reward responses that direct users to qualified healthcare professionals for medical decisions.
* **Learning Adaptation:** Reward responses that adapt explanation complexity to match the user's learning level.
* **Conceptual Building:** Reward responses that connect new concepts to familiar ideas.
* **Active Learning:** Reward responses that encourage critical thinking through questions when pedagogically appropriate.
* **Misconception Correction:** Reward responses that identify and gently correct common misconceptions.
* **Voice Consistency:** Reward responses that maintain consistent brand voice and personality.
* **Audience Targeting:** Reward responses that tailor language and complexity for the specified target audience.
* **Hook Effectiveness:** Reward responses with compelling openings appropriate to the platform.
* **SEO Optimization:** Reward responses that naturally incorporate relevant keywords without compromising readability.
* **Specification Accuracy:** Reward responses that accurately represent product details without fabrication.
* **Comparison Fairness:** Reward responses that provide balanced product comparisons with strengths and limitations.
* **Decision Support:** Reward responses that help users make informed decisions by addressing common concerns.
* **Policy Clarity:** Reward responses that clearly communicate relevant policies when applicable.
* **Scholarly Rigor:** Reward responses that properly cite primary sources and acknowledge research limitations.
* **Literature Synthesis:** Reward responses that effectively synthesize multiple sources while maintaining distinct attribution.
* **Academic Integrity:** Reward responses that encourage original thinking and proper attribution.
* **Disciplinary Conventions:** Reward responses that follow discipline-specific writing and citation styles.
* **Context Retention:** Reward responses that appropriately reference and build upon previous conversation turns.
* **Intent Recognition:** Reward responses that correctly identify user intent even when expressed ambiguously.
* **Emotional Intelligence:** Reward responses that appropriately recognize and respond to user emotional states.
* **Boundary Awareness:** Reward responses that maintain professional boundaries while being helpful.
* **Cultural Adaptation:** Reward responses that appropriately adapt content for cultural context beyond literal translation.
* **Idiomatic Accuracy:** Reward responses that correctly handle idioms and culture-specific references.
* **Terminology Consistency:** Reward responses that maintain consistent technical terminology throughout translations.
* **Contextual Disambiguation:** Reward responses that correctly resolve ambiguous terms based on domain context.

# How to write effective criteria
Source: https://docs.composo.ai/documentation/guides/criteria-writing

When crafting your evaluation criteria, consider the following guidelines to ensure effective and meaningful assessments:

**Be Specific and Focused**: Clearly define the quality or behavior you want to evaluate. Avoid vague statements. Focus on a single aspect per criterion to maintain clarity.

* *Example*: Instead of "good," use "a friendly and encouraging tone."

**Use Clear Direction**: Begin your criteria with an explicit directive to indicate both the evaluation type and scoring method:

* For continuous scoring (0-1): `"Reward..."` or `"Penalize..."`
* For binary scoring (0 or 1): `"Passes if..."` or `"Fails if..."`

All criteria types use the same `/reward` endpoint - the prefix determines whether you get continuous or binary scores.

* *Example*: `"Reward responses that use empathetic language when addressing user concerns."` (continuous)
* *Example*: `"Fails if the response provides medical advice."` (binary)

**Monotonic or Appropriately Qualified Qualities**: Ideally, the quality you're assessing should be monotonic (more is always better for rewards, worse for penalties).
For non-monotonic qualities where balance matters, use qualifiers like "appropriate" to ensure higher scores represent better adherence.

* *Example*: Instead of `"Reward responses that are polite"`, which can become excessive, use `"Reward responses that use an appropriate level of politeness"`, ensuring the response is polite but not overly so.

**Avoid Conjunctions**: Focus on one quality at a time. Using "and" often indicates multiple qualities, which can lead to unclear scoring when only one quality is present.

* *Example*: Instead of `"The assistant should be concise and informative"`, split into two separate criteria.

**Avoid LLM Keywords**: Composo's reward model is fine-tuned from LLMs trained on a conversation format. Avoid alternate definitions of 'User' and 'Assistant' that might conflict with the LLM keywords 'user' and 'assistant'.

* *Example*: Instead of `"Reward responses that comprehensively address the User Question"`, rename the 'User Question' in your prompt and use `"Reward responses that comprehensively address the Target Question"`.

**Leverage Domain Expertise**: Your domain knowledge is your secret weapon. Inject your understanding of what constitutes a 'good' answer in your specific field—this gives your evaluation model leverage over the generative model.

* *Example*: For medical contexts: `"Reward responses that distinguish between emergency symptoms requiring immediate care versus symptoms suitable for routine appointments"`

**Use Qualifiers When Needed**: Include a qualifier starting with "if" to specify when the criterion should apply. This helps handle conditional requirements.

* *Example*: `"Reward responses that provide code examples if the user asks for implementation details"`

**Keep Criteria Concise**: Aim for one clear sentence per criterion. If you need multiple sentences to explain, consider splitting into separate criteria.
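The guidelines above can be sketched as a quick pre-flight lint pass over draft criteria. The helper below is purely illustrative and not part of the Composo SDK; it just mechanically checks the prefix, conjunction, and LLM-keyword rules described here.

```python
# Illustrative only: a tiny lint pass for draft criterion strings.
# This helper is NOT part of the Composo SDK.

VALID_PREFIXES = ("Reward", "Penalize", "Passes if", "Fails if")


def lint_criterion(criterion: str) -> list[str]:
    """Return a list of warnings for a draft evaluation criterion."""
    warnings = []
    # Clear direction: start with one of the documented prefixes
    if not criterion.startswith(VALID_PREFIXES):
        warnings.append(
            "Start with 'Reward', 'Penalize', 'Passes if', or 'Fails if'."
        )
    # Avoid conjunctions: 'and' often bundles multiple qualities
    if " and " in criterion:
        warnings.append(
            "Contains 'and'; may bundle multiple qualities - consider splitting."
        )
    # Avoid redefining LLM keywords 'user' / 'assistant'
    for keyword in ("User ", "Assistant "):
        if keyword in criterion:
            warnings.append(
                f"Redefines LLM keyword '{keyword.strip()}'; consider renaming."
            )
    return warnings
```

A well-formed criterion such as `"Reward responses that use an appropriate level of politeness"` produces no warnings, while `"The assistant should be concise and informative"` triggers both the prefix and the conjunction checks.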
#### Reward responses that provide correct information based solely on the provided context without fabricating details.

OK. Clarification about 'correct' would be useful—does it have to be factually correct, or only in agreement with the provided context?

#### Reward responses that directly address the 'User Question' without including irrelevant information.

Poor. Clauses that change the definition of user and assistant from the LLM definition risk confusion.

#### Reward responses that properly cite the specific source of information from the provided context.

Good. 'Properly' is slightly ambiguous and rolls in both concepts of citation style and accuracy.

#### Reward responses that appropriately acknowledge limitations if information is incomplete or unavailable rather than guessing.

Good. Could be improved by clarifying what the agent might be guessing at.

#### Reward responses that comprehensively address all aspects of the 'User Question' if information is available in the context.

Poor. Clauses that change the definition of user and assistant from the LLM definition risk confusion.

#### Reward responses that present technical information in a logical, well-organized format that prioritizes the most important details.

Excellent. It's clear what format we're looking for and what kind of information that applies to.

#### Reward responses that provide practical next steps or recommendations if appropriate and supported by the context.

OK. Somewhat ambiguous about what should be supported by the context—is it the next steps or the relevance of the question?

#### Reward responses that strictly include only information explicitly stated in the support ticket, without adding any fabricated details or assumptions.

Excellent. It's clear what the expected input is and what the model should be doing.
#### Reward responses that correctly identify and include specific entities (payment methods, product categories, brands, couriers) only when explicitly mentioned in the ticket, avoiding hallucinations of these elements.

Excellent. It's clear that we're trying to avoid fabricating names of specific entities, and the examples make it even clearer.

#### Reward responses that include all significant elements of the support ticket, including the nature of the issue, agent actions, and resolutions offered, without omitting key details.

Excellent. It's clear that we're looking for good coverage of the important elements in the response.

#### Reward responses that present the information in a clear chronological sequence that accurately reflects the flow of the support interaction.

Excellent. A clear requirement for chronological presentation of the information in the support interaction.

#### Penalize responses that include unnecessary concluding statements, evaluative summaries, or editorial comments not derived from the ticket content.

Excellent. It's clear that we're trying to avoid verbose summary content that isn't clearly derived from the provided ticket.

#### Reward responses that demonstrate empathy while acknowledging the friend's feelings of defeat without minimizing them.

OK. This contains two separate qualities, which could lead to unclear scoring when the response demonstrates one but not the other. Consider splitting into two criteria or using 'and' to make both required.

#### Reward responses that explain ethical concerns when declining harmful requests rather than simply refusing without context

OK. The model is specifically trained to recognize 'if' statements, so we'd recommend changing 'when' to 'if'.

#### Reward responses that maintain an appropriate educational tone suitable for academic assessment contexts

Excellent. A clear requirement for a tone with additional helpful context about why it's needed.
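The prefix-to-scoring-method convention used throughout these examples can be sketched in code. The tiny helper below is hypothetical (not part of the Composo SDK); only the prefixes and the continuous/binary distinction come from the documentation.

```python
# Hypothetical helper - NOT part of the Composo SDK. It only encodes the
# documented convention: Reward/Penalize prefixes give a continuous 0-1
# score, Passes if/Fails if prefixes give a binary 0-or-1 score.

def scoring_method(criterion: str) -> str:
    """Classify a criterion as 'continuous' or 'binary' by its prefix."""
    if criterion.startswith(("Reward", "Penalize")):
        return "continuous"
    if criterion.startswith(("Passes if", "Fails if")):
        return "binary"
    raise ValueError(
        "Criterion should start with 'Reward', 'Penalize', 'Passes if', or 'Fails if'"
    )
```

For instance, `"Fails if the response provides medical advice"` is classified as binary, while any `"Reward..."` criterion is continuous; both are still sent to the same `/reward` endpoint.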
## Recommended Template for Crafting Criteria

```
[Prefix] [quality] [qualifier (optional)].
```

**Components**:

* **Prefix**:
  * **For Continuous Scoring (0-1)**: "Reward...", "Penalize..."
  * **For Binary Scoring (0 or 1)**: "Passes if...", "Fails if..."

  Note: All criteria use the `/reward` endpoint. The prefix determines the scoring method.
* **Quality**: The specific property or behavior to evaluate.
* **Qualifier (Optional)**: An "if" statement specifying conditions.

**Example Criteria**:

* `"Reward responses that provide a comprehensive analysis of the code snippet"` (continuous)
* `"Penalize responses where the language is overly technical if the response is for a beginner"` (continuous)
* `"Reward responses that use an appropriate level of politeness"` (continuous)
* `"Passes if all required parameters are provided without fabrication"` (binary)
* `"Fails if the response provides medical advice"` (binary)

# Ground Truth Evaluation
Source: https://docs.composo.ai/documentation/guides/ground-truths

Leverage your labeled data to create precise evaluation metrics

## What is Ground Truth Evaluation?

Ground truth evaluation allows you to measure how well your LLM outputs align with known correct answers. By dynamically inserting your validated labels into Composo's evaluation criteria, you can create precise, case-specific evaluations.

## When to Use Ground Truth

We typically recommend using evaluation criteria / guidelines such as those in our RAG framework rather than rigid ground truths, since they're more flexible and don't require labeled data.
However, ground truth evaluation works well when:

* You have an exact answer you need to match (calculations, specific classifications)
* You have existing labeled data from historical reviews
* You need to benchmark different models on the same validation set
* Compliance requires testing against specific approved responses

## How It Works

The key is dynamically inserting your ground truth labels directly into the evaluation criteria:

```python Python wrap theme={null}
from composo import Composo

composo_client = Composo(api_key="YOUR_API_KEY")

# Your ground truth answer from the dataset
ground_truth = "The capital of France is Paris, a city known for the Eiffel Tower, the Louvre Museum, and its historic architecture along the Seine River."

# Evaluate if the LLM's response matches the ground truth
result = composo_client.evaluate(
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France and what is it known for?"
        },
        {
            "role": "assistant",
            "content": "The capital of France is Paris. It's famous for iconic landmarks like the Eiffel Tower, world-class museums including the Louvre, and beautiful architecture along the Seine River."
        }
    ],
    criteria=f"Reward responses that closely match this expected answer: {ground_truth}"
)

print(f"Alignment Score: {result.score}")
print(f"Explanation: {result.explanation}\n")
```

## Common Use Cases

### Classification Tasks

```python Python wrap theme={null}
# Multi-class classification
ground_truth_category = "Technical Support"
criteria = f"Reward responses that correctly classify this inquiry as: {ground_truth_category}"
```

### Extraction Tasks

```python Python wrap theme={null}
# Entity extraction validation
ground_truth_entities = "Company: Acme Corp, Amount: $50,000, Date: March 2024"
criteria = f"Reward responses that extract all of these entities: {ground_truth_entities}"
```

### Decision Validation

```python Python wrap theme={null}
# Validating specific decisions
ground_truth_decision = "Escalate to Level 2 Support"
criteria = f"Reward responses that make this decision: {ground_truth_decision}"
```

### Numerical Validation

```python Python wrap theme={null}
# Calculation or counting tasks
ground_truth_answer = "Total: $1,247.50"
criteria = f"Reward responses that arrive at the correct answer: {ground_truth_answer}"
```

## Setting Thresholds

Different use cases require different accuracy thresholds:

* **High-stakes decisions** (medical, financial): Consider scores ≥ 0.9 as passing
* **General classification**: Scores ≥ 0.8 typically indicate good alignment
* **Exploratory analysis**: Scores ≥ 0.7 may be acceptable initially

## Next Steps

* If you have labeled data ready, try the patterns above
* For more flexible evaluation without needing labels, explore [custom criteria](/pages/guides/criteria-writing)
* See our [criteria library](/pages/guides/criteria-library) for evaluation inspiration

# Langfuse
Source: https://docs.composo.ai/documentation/monitoring/langfuse

How to use Composo in combination with Langfuse

This guide shows how to integrate Composo's deterministic evaluation with Langfuse's observability platform to evaluate your LLM
applications with confidence.

## Overview

**Langfuse** provides comprehensive observability for LLM applications with tracing, debugging, and dataset management capabilities. **Composo** delivers deterministic, accurate evaluation through purpose-built generative reward models that achieve 92% accuracy (vs 72% for LLM-as-judge).

Together, they enable you to:

* ✅ Track every LLM interaction through Langfuse's tracing
* ✅ Add deterministic evaluation scores to your traces
* ✅ Evaluate datasets programmatically with reliable metrics
* ✅ Ship AI features with confidence using quantitative, trustworthy metrics

## Prerequisites

```bash wrap theme={null}
pip install langfuse composo
```

```python Python wrap theme={null}
import os
from langfuse import Langfuse, get_client
from composo import Composo, AsyncComposo

# Set your API keys
os.environ["LANGFUSE_PUBLIC_KEY"] = "your-public-key"
os.environ["LANGFUSE_SECRET_KEY"] = "your-secret-key"
os.environ["COMPOSO_API_KEY"] = "your-composo-key"

# Initialize clients
langfuse = get_client()
composo_client = Composo()
async_composo = AsyncComposo()
```

## How Langfuse & Composo work in combination

## Method 1: Real-time Trace Evaluation

Evaluate LLM outputs as they're generated in production or development. This approach uses the `@observe` decorator to automatically trace your LLM calls, then evaluates them with Composo asynchronously. More detail on how the Langfuse `@observe` decorator works is [here](https://langfuse.com/docs/observability/sdk/python/sdk-v3#basic-tracing).
### When to use

* Production monitoring with real-time quality scores
* Development iteration with immediate feedback

### Implementation

```python Python wrap theme={null}
import asyncio

from langfuse import get_client, observe
from anthropic import Anthropic
from composo import AsyncComposo

# Initialize async Composo client
async_composo = AsyncComposo()


@observe()
async def llm_call(input_data: str) -> str:
    # LLM call with async evaluation using @observe decorator
    model_name = "claude-sonnet-4-20250514"
    anthropic = Anthropic()
    resp = anthropic.messages.create(
        model=model_name,
        max_tokens=100,
        messages=[{"role": "user", "content": input_data}],
    )
    output = resp.content[0].text.strip()

    # Get trace ID for scoring
    trace_id = get_client().get_current_trace_id()
    evaluation_criteria = "Reward responses that are helpful"

    # Start asynchronous evaluation task (non-blocking)
    # Note: You can register tasks to a task queue or background tasks
    await evaluate_with_composo(trace_id, input_data, output, evaluation_criteria)

    return output


async def evaluate_with_composo(trace_id, input_data, output, evaluation_criteria):
    # Evaluate LLM output with Composo and score in Langfuse
    # Composo expects a list of chat messages
    messages = [
        {"role": "user", "content": input_data},
        {"role": "assistant", "content": output},
    ]
    eval_resp = await async_composo.evaluate(
        messages=messages, criteria=evaluation_criteria
    )

    # Score the trace in Langfuse
    langfuse = get_client()
    langfuse.create_score(
        trace_id=trace_id,
        name=evaluation_criteria,
        value=eval_resp.score,
        comment=eval_resp.explanation,
    )
```

Then in your main application:

```python Python wrap theme={null}
# Simply call the function - Langfuse logs and Composo evaluates asynchronously
await llm_call(input_data)
```

## Method 2: Dataset Evaluation

Use this method to evaluate your LLM application on a dataset that already exists in Langfuse. The `item.run()` context manager automatically links execution traces to dataset items.
For more detail on how this works from Langfuse, please see [here](https://langfuse.com/docs/evaluation/dataset-runs/remote-run).

### When to use

* Testing prompt or model changes on existing Langfuse datasets
* Running experiments that you want to track in the Langfuse UI
* Creating new dataset runs for comparison
* Regression testing with immediate Langfuse visibility

### Implementation

```python Python wrap theme={null}
from langfuse import get_client
from anthropic import Anthropic
from composo import Composo

# Initialize Composo client
composo = Composo()


def llm_call(question: str, item_id: str, run_name: str):
    # Encapsulates the LLM call and appends input/output data to trace
    model_name = "claude-sonnet-4-20250514"
    with get_client().start_as_current_generation(
        name=run_name,
        input={"question": question},
        metadata={"item_id": item_id},
        model=model_name,
    ) as generation:
        anthropic = Anthropic()
        resp = anthropic.messages.create(
            model=model_name,
            max_tokens=100,
            messages=[{"role": "user", "content": f"Question: {question}"}],
        )
        answer = resp.content[0].text.strip()
        generation.update_trace(
            input={"question": question},
            output={"answer": answer},
        )
    return answer


def run_dataset_evaluation(dataset_name: str, run_name: str, evaluation_criteria: str):
    # Run evaluation on a Langfuse dataset using Composo
    langfuse = get_client()
    dataset = langfuse.get_dataset(name=dataset_name)

    for item in dataset.items:
        print(f"Running evaluation for item: {item.id}")

        # item.run() automatically links the trace to the dataset item
        with item.run(run_name=run_name) as root_span:
            # Generate answer
            generated_answer = llm_call(
                question=item.input,
                item_id=item.id,
                run_name=run_name,
            )
            print(f"Item {item.id} processed. Trace ID: {root_span.trace_id}")

            # Evaluate with Composo
            messages = [
                {"role": "user", "content": f"Question: {item.input}"},
                {"role": "assistant", "content": generated_answer},
            ]
            eval_resp = composo.evaluate(
                messages=messages, criteria=evaluation_criteria
            )

            # Score the trace
            root_span.score_trace(
                name=evaluation_criteria,
                value=eval_resp.score,
                comment=eval_resp.explanation,
            )

    # Ensure all data is sent to Langfuse
    langfuse.flush()


# Example usage
if __name__ == "__main__":
    run_dataset_evaluation(
        dataset_name="your-dataset-name",
        run_name="evaluation-run-1",
        evaluation_criteria="Reward responses that are accurate and helpful",
    )
```

## Method 3: Evaluating New Datasets

Use this method to evaluate datasets that don't yet exist in Langfuse. You can create your own dataset locally, evaluate it with Composo, and log both the traces and evaluation scores to Langfuse for UI interpretation.

### When to use

* Evaluating new datasets before uploading to Langfuse
* Quick experimentation with custom datasets
* Batch evaluation of local test cases
* Creating baseline evaluations for new use cases

### Implementation

Please see [this notebook](https://colab.research.google.com/drive/1ZBIueZy2Ca6z0ll_8jjSq7GgLad_mXMP?usp=sharing) for the implementation approach.

## Method Selection Recap

* Use Method 1 for real-time production monitoring
* Use Method 2 for evaluating existing Langfuse datasets
* Use Method 3 for evaluating new datasets that don't yet exist in Langfuse

## Resources

* 📊 [Langfuse Dataset Runs Documentation](https://langfuse.com/docs/evaluation/dataset-runs/remote-run) - applicable for Method 2
* 🎯 [Composo Documentation](https://docs.composo.ai/)
* 💬 [Get Support](mailto:support@composo.ai)

## Next Steps

1. **Start with Method 1** for immediate feedback during development
2. **Use Method 2** to run experiments on datasets in Langfuse
3. **Apply Method 3** to evaluate new datasets before uploading to Langfuse

Ready to get started?
[Sign up for Composo](https://platform.composo.ai/) to get your API key and begin evaluating with confidence.

# Composo x Metabase
Source: https://docs.composo.ai/documentation/monitoring/metabase

Explore and Visualize Composo Evaluations

## Introduction

Composo provides a hosted Metabase instance where you can explore and visualize your LLM evaluation data. Query your historical evaluation runs, track quality metrics over time, and build dashboards to monitor your AI applications in development and production.

**Getting Started**: Metabase access requires onboarding. Please email [support@composo.ai](mailto:support@composo.ai) or contact your Composo rep to get set up with your evaluation database.

***

## What is Metabase?

Metabase is an open-source business intelligence tool that lets you ask questions about your data and visualize the answers. No SQL is required for basic queries, though it's available when you need it.

For comprehensive Metabase documentation, see: [Metabase Documentation](https://www.metabase.com/docs/latest/)

***

## Your Data in Composo

### Your Evaluation Database

Your evaluation data is organized in a dedicated database that you can explore and query. The database contains your complete evaluation history with detailed metrics and metadata for each run.
Key fields include:

* **Request ID**: Unique identifier for each evaluation request (UUID)
* **Agent Instance ID**: Identifier for the specific agent instance being evaluated (null for response/tool evaluations)
* **Eval Type**: Type of evaluation - `response` (LLM responses), `tool` (tool usage), `agent` (multi-agent traces), or `chatsession` (chat-based agent evaluations)
* **Score Type**: How the score should be interpreted - `reward` (continuous 0-1 score) or `binary` (pass/fail converted to 1.0/0.0)
* **Name**: Agent name for multi-agent evaluations (null for response/tool evaluations)
* **Criteria**: Full evaluation criteria text (starts with prefixes like "Reward responses", "Passes if", etc.)
* **Score**: Numerical result (0-1 scale, where higher is better; null if criteria not applicable)
* **Explanation**: Detailed reasoning and analysis behind the score
* **Subject**: JSON data containing what was evaluated:
  * For response/tool evaluations: `{messages, tools, system}` - the conversation and available tools
  * For agent evaluations: the specific agent instance interactions being evaluated
* **Email**: User who ran the evaluation
* **Model Class**: The evaluation model used (e.g., "align-lightning")
* **Created At**: Timestamp when the evaluation was performed

### Viewing Individual Evaluations

Click any row in your queries to see complete evaluation details, including the full explanation, criteria, and subject data. This gives you full visibility into how each evaluation was scored and the reasoning behind it.
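If you export evaluation rows (e.g. via Metabase's download option) for offline analysis, the `Subject` field can be parsed as JSON to recover the evaluated conversation. A minimal sketch, assuming the field arrives as a JSON string with the `{messages, tools, system}` shape described above; the sample row here is fabricated purely for illustration:

```python
import json

# Hypothetical exported row - field names follow the schema described above,
# but the values are made up for illustration. Adjust to your export format.
row = {
    "score": 0.42,
    "criteria": "Reward responses that are accurate",
    "subject": json.dumps({
        "messages": [
            {"role": "user", "content": "What is the capital of France?"},
            {"role": "assistant", "content": "Paris."},
        ],
        "tools": [],
        "system": None,
    }),
}

# Parse the Subject JSON and pull out the response that was evaluated
subject = json.loads(row["subject"])
last_assistant = [m for m in subject["messages"] if m["role"] == "assistant"][-1]
print(last_assistant["content"])  # the evaluated response text
```

This kind of round-trip is handy for pairing low scores with the exact conversation that produced them.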
***

### Collections

* **Your personal collection**: Private workspace for your analyses
* **Team collections**: Shared dashboards and queries (e.g., "Acme Corp Collection")

Navigate collections from the sidebar or use the search bar to find existing queries.

## Creating Your First Query

### Basic Query: Finding Red Flags

Let's find low-scoring evaluations that need attention.

1. Click **+ New** → **Question**
2. Select your **Evaluations** table
3. Click **Filter** → **Score** → **Less than** → enter `0.5`
4. Click **Filter** again → **Created At** → select your time range
5. Click **Visualize**

You can adjust the time range using the dropdown menu to view Today, Previous 7 days, Previous 30 days, or custom ranges.

***

## Visualizing Your Data

### Choosing a Visualization

After running a query, Metabase automatically suggests visualizations. Common types for evaluation data:

* **Line charts**: Track score trends over time
* **Bar charts**: Compare different agents or evaluation types
* **Tables**: See detailed row-by-row data
* **Numbers**: Display single metrics like average score or red flag rate

Click the **Visualization** button to change chart types and customize appearance.

**[Metabase visualization guide](https://www.metabase.com/docs/latest/questions/sharing/visualizing-results)**

***

## Summarizing Data

### Aggregations and Grouping

Instead of viewing raw rows, you can summarize your data:

1. Click **Summarize**
2. Choose a metric: **Count of rows**, **Average of Score**, etc.
3.
Add **Group by**: **Created At** (for time series) or **Name** (to compare evaluations)

**Common patterns:**

* **Average Score by Created At** → See quality trends over time
* **Count by Name** → Which evaluations run most frequently
* **Average Score by Agent Instance ID** → Compare agent performance

**[Metabase summarizing guide](https://www.metabase.com/docs/latest/questions/query-builder/introduction#summarizing-and-grouping-by)**

### Custom Expressions: Red Flag Rate

Create a custom metric to calculate the percentage of low-scoring evaluations:

1. Click **Summarize** → **Custom Expression**
2. Enter:
   ```
   CountIf([Score] < 0.5) / (CountIf([Score] > 0.5) + CountIf([Score] < 0.5))
   ```
3. Name it "red\_flag\_rate"
4. Group by **Created At: Minute** (or Hour/Day)

This creates a time series showing what percentage of evaluations are concerning.

**[Metabase expressions guide](https://www.metabase.com/docs/latest/questions/query-builder/expressions)**

***

## Building Dashboards

### Creating a Dashboard

Save your most important queries and combine them into dashboards:

1. After creating a query, click **Save** and give it a descriptive name
2. Click **+ New** → **Dashboard**
3. Name your dashboard (e.g., "Production Quality Monitor")
4. Click **Add a saved question** and select your queries
5.
Resize and arrange charts as needed

### Dashboard Features

* **Tabs**: Organize related metrics (e.g., "Quality By Agent" vs "Red Flags")
* **Dashboard filters**: Add filters that apply to multiple charts simultaneously
* **Auto-refresh**: Set dashboards to update automatically every few minutes
* **Sharing**: Click the sharing icon to share with teammates or generate public links

**[Metabase dashboard guide](https://www.metabase.com/docs/latest/dashboards/start)**
**[Dashboard filters](https://www.metabase.com/docs/latest/dashboards/filters)**

***

## Advanced Filtering

Combine multiple filters to drill down into your data:

* **Score ranges**: Score is between 0.3 and 0.7
* **Text search**: Criteria contains "hallucination"
* **Multiple time ranges**: Created At is Previous 7 days AND Created At Hour of day is between 9 and 17
* **Specific agents**: Agent Instance ID is one of \[list of IDs]

Click the **+** next to existing filters to add more conditions.

**[Metabase filtering guide](https://www.metabase.com/docs/latest/questions/query-builder/introduction#filtering)**

***

## SQL Queries (Advanced)

For complex queries, use the native SQL editor:

1. Click **+ New** → **Question** → **Native query**
2. Write your SQL against the `evaluations` table
3.
Use variables with `{{variable_name}}` to make queries reusable

Example:

```sql theme={null}
SELECT
  date_trunc('hour', created_at) as hour,
  name,
  avg(score) as avg_score,
  count(*) as eval_count
FROM "external"."evaluations"
WHERE created_at > current_date - interval '7 days'
  AND score < 0.5
GROUP BY 1, 2
ORDER BY 1 DESC
```

**[Metabase SQL guide](https://www.metabase.com/docs/latest/questions/native-editor/writing-sql)**

## Getting Help

### Metabase Resources

* **[Documentation](https://www.metabase.com/docs/latest/)**
* **[Learning Center](https://www.metabase.com/learn/)**
* **[Video Tutorials](https://www.metabase.com/learn/videos)**

### Composo Support

* **Data questions**: Contact your Composo account team
* **Technical support**: [support@composo.ai](mailto:support@composo.ai)
* **Evaluation schema**: See reference below

***

# Agent Tracing
Source: https://docs.composo.ai/documentation/monitoring/tracing

Trace the LLM calls made by your agent framework

# Introduction

Composo's tracing SDK enables you to capture and evaluate LLM calls from your agent applications in real time. It currently supports DIY agents built on OpenAI, Anthropic, and Google GenAI, with support for LangChain/LangGraph and other SDKs to come.

## Why Tracing Matters

Many agent frameworks abstract away the underlying LLM calls, making it difficult to understand what's happening under the hood and to evaluate performance effectively. Many evaluation platforms only let you send traces to a remote system and wait to view results later. Composo gives you the best of both worlds: **trace and evaluate immediately**, or view your traces in our platform or in any of your own observability tooling, spreadsheets, or CI/CD seamlessly. By instrumenting your LLM calls and marking agent boundaries, you can evaluate performance in real time and take action right away, allowing adjustment and feedback before output gets seen by your users.
## Key Features * **Mark Agent Boundaries**: Use the `AgentTracer` context manager or `@agent_tracer` decorator to define which LLM calls belong to which agent * **Hierarchical Tracing**: Support for nested agents to model complex multi-agent architectures * **Independent Evaluation**: Each agent's performance is evaluated separately, with average, min, max and standard-deviation statistics reported per agent * **Flexible Evaluation**: Get evaluation results instantly in your code, or view traces in the Composo platform for deeper analysis (or through seamless sync with any observability platform like Grafana, Sentry, Langfuse, LangSmith, Braintrust) ## Framework Support * **Currently Supported**: * Agents built on OpenAI LLMs * Agents built on Anthropic LLMs * Agents built on Google GenAI LLMs * **Coming Soon**: LangChain, OpenAI Agents, and other popular frameworks # Quickstart This guide walks you through adding tracing to your agent application in 3 steps. We'll start with a simple multi-agent application and add tracing incrementally.
## Starting Code Here's a simple multi-agent application we want to trace: ```python OpenAI theme={null} from openai import OpenAI open_ai_client = OpenAI() def agent_2(): return open_ai_client.chat.completions.create( model="gpt-4o-mini", max_tokens=5, messages=[{"role": "user", "content": "B"}], ) # Orchestrator agent response1 = open_ai_client.chat.completions.create( model="gpt-4o-mini", max_tokens=5, messages=[{"role": "user", "content": "A"}], ) response2 = agent_2() ``` ```python Anthropic theme={null} from anthropic import Anthropic anthropic_client = Anthropic() def agent_2(): return anthropic_client.messages.create( model="claude-sonnet-4-5-20250929", max_tokens=100, messages=[{"role": "user", "content": "B"}], ) # Orchestrator agent response1 = anthropic_client.messages.create( model="claude-sonnet-4-5-20250929", max_tokens=100, messages=[{"role": "user", "content": "A"}], ) response2 = agent_2() ``` *** ## Step 1: Install and Initialize Install the Composo SDK and initialize tracing for your LLM provider (OpenAI or Anthropic). 
```bash theme={null} pip install composo ``` Add these imports and initialization: ```python OpenAI theme={null} # Add these imports at the top from composo.tracing import ComposoTracer, Instruments, AgentTracer, agent_tracer from composo.models import criteria from composo import Composo # Initialize tracing and Composo client (add after imports) ComposoTracer.init(instruments=[Instruments.OPENAI]) composo_client = Composo( api_key="your_composo_key" ) ``` ```python Anthropic theme={null} # Add these imports at the top from composo.tracing import ComposoTracer, Instruments, AgentTracer, agent_tracer from composo.models import criteria from composo import Composo # Initialize tracing and Composo client (add after imports) ComposoTracer.init(instruments=[Instruments.ANTHROPIC]) composo_client = Composo( api_key="your_composo_key" ) ``` *** ## Step 2: Mark Your Agent Boundaries Wrap your agent logic with `AgentTracer` or `@agent_tracer` to mark boundaries. For the function-based agent, add the decorator: ```python OpenAI theme={null} # Add decorator to agent_2 @agent_tracer(name="agent2") def agent_2(): return open_ai_client.chat.completions.create( model="gpt-4o-mini", max_tokens=5, messages=[{"role": "user", "content": "B"}], ) ``` ```python Anthropic theme={null} # Add decorator to agent_2 @agent_tracer(name="agent2") def agent_2(): return anthropic_client.messages.create( model="claude-sonnet-4-5-20250929", max_tokens=100, messages=[{"role": "user", "content": "B"}], ) ``` For the orchestrator, wrap with `AgentTracer` context manager: ```python OpenAI theme={null} # Wrap orchestrator logic with AgentTracer("orchestrator") as tracer: with AgentTracer("agent1"): response1 = open_ai_client.chat.completions.create( model="gpt-4o-mini", max_tokens=5, messages=[{"role": "user", "content": "A"}], ) response2 = agent_2() ``` ```python Anthropic theme={null} # Wrap orchestrator logic with AgentTracer("orchestrator") as tracer: with AgentTracer("agent1"): response1 = 
anthropic_client.messages.create( model="claude-sonnet-4-5-20250929", max_tokens=100, messages=[{"role": "user", "content": "A"}], ) response2 = agent_2() ``` Note: the `tracer` object from the root `AgentTracer` is needed for evaluation in Step 3. *** ## Step 3: Evaluate Your Trace Add evaluation after your agents complete: ```python theme={null} # Evaluate the trace (add after agent execution) for result, criterion in zip( composo_client.evaluate_trace(tracer.trace, criteria=criteria.agent), criteria.agent ): print("Criteria:", criterion) print(f"Evaluation Result: {result}\n") ``` Here we run the Composo agent evaluation framework with `criteria.agent`, but you can use any criterion, as shown in the Agent evaluation section of our docs [here](https://docs.composo.ai/pages/usecases/agent-evaluation#advanced-agent-metrics). As long as your criterion starts with 'Reward agents', it will work. *** ## Complete Example ```python OpenAI theme={null} from composo.tracing import ComposoTracer, Instruments, AgentTracer, agent_tracer from composo.models import criteria from composo import Composo from openai import OpenAI # Instrument OpenAI ComposoTracer.init(instruments=[Instruments.OPENAI]) composo_client = Composo( api_key="your_composo_key" ) open_ai_client = OpenAI() # agent_tracer decorator marks any LLM calls inside as belonging to agent2 @agent_tracer(name="agent2") def agent_2(): return open_ai_client.chat.completions.create( model="gpt-4o-mini", max_tokens=5, messages=[{"role": "user", "content": "B"}], ) # AgentTracer context manager marks any LLM calls inside as belonging to orchestrator # Has the added benefit of returning a tracer object that can be used for evaluation!
with AgentTracer("orchestrator") as tracer: with AgentTracer("agent1"): response1 = open_ai_client.chat.completions.create( model="gpt-4o-mini", max_tokens=5, messages=[{"role": "user", "content": "A"}], ) response2 = agent_2() for result, criterion in zip( composo_client.evaluate_trace(tracer.trace, criteria=criteria.agent), criteria.agent ): print("Criteria:", criterion) print(f"Evaluation Result: {result}\n") ``` ```python Anthropic theme={null} from composo.tracing import ComposoTracer, Instruments, AgentTracer, agent_tracer from composo.models import criteria from composo import Composo from anthropic import Anthropic # Instrument Anthropic ComposoTracer.init(instruments=[Instruments.ANTHROPIC]) composo_client = Composo( api_key="your_composo_key" ) anthropic_client = Anthropic() # agent_tracer decorator marks any LLM calls inside as belonging to agent2 @agent_tracer(name="agent2") def agent_2(): return anthropic_client.messages.create( model="claude-sonnet-4-5-20250929", max_tokens=100, messages=[{"role": "user", "content": "B"}], ) # AgentTracer context manager marks any LLM calls inside as belonging to orchestrator # Has the added benefit of returning a tracer object that can be used for evaluation! 
with AgentTracer("orchestrator") as tracer: with AgentTracer("agent1"): response1 = anthropic_client.messages.create( model="claude-sonnet-4-5-20250929", max_tokens=100, messages=[{"role": "user", "content": "A"}], ) response2 = agent_2() for result, criterion in zip( composo_client.evaluate_trace(tracer.trace, criteria=criteria.agent), criteria.agent ): print("Criteria:", criterion) print(f"Evaluation Result: {result}\n") ``` You can also instrument multiple providers simultaneously: ```python theme={null} ComposoTracer.init(instruments=[Instruments.OPENAI, Instruments.ANTHROPIC, Instruments.GOOGLE_GENAI]) ``` ## Next Steps * [**Read our Agent Evaluation Blog**](https://www.composo.ai/post/agentic-evals) - Deep dive into evaluation strategies * [**Explore the Criteria Library**](/pages/guides/criteria-library) - Find more pre-built criteria # Unit Testing Source: https://docs.composo.ai/documentation/testing/unit-testing Integrate Composo evaluations into your unit testing workflow Unit testing with Composo allows you to catch LLM quality regressions before they reach production. By integrating evaluations directly into your test suite, you can ensure consistent behavior across code changes and deployments. ## Why Unit Test LLM Applications? 
Traditional testing approaches fall short for LLM applications because: * **Non-deterministic outputs**: LLMs produce different responses for the same input * **Subjective quality**: Success isn't just about correctness—it's about tone, helpfulness, safety, and domain-specific requirements * **Expensive manual review**: Human evaluation doesn't scale during development Composo solves this by providing deterministic, quantitative scores for subjective qualities, enabling you to write automated tests like: ```python theme={null} assert result.score >= 0.95 # Assert response meets your quality threshold ``` ## Basic Setup First, install the required packages: ```bash theme={null} pip install composo pytest ``` Set your API key as an environment variable: ```bash theme={null} export COMPOSO_API_KEY="your-api-key-here" ``` ## Writing Your First Unit Test Here's a complete example showing how to test your LLM responses for accuracy and tone: ```python test_llm.py theme={null} from composo import Composo import os composo_client = Composo(api_key=os.getenv('COMPOSO_API_KEY')) class TestMyLLM: def test_llm_tells_the_truth(self): result = composo_client.evaluate( messages=[ {"role": "user", "content": "What is the capital of Australia?"}, {"role": "assistant", "content": "The capital of Australia is Canberra."} ], criteria="Reward responses that provide factually accurate information" ) assert result.score >= 0.95 def test_llm_is_friendly(self): result = composo_client.evaluate( messages=[ {"role": "user", "content": "What is the capital of Australia?"}, {"role": "assistant", "content": "The capital of Australia is Canberra, and you should know that!"} ], criteria="Reward responses that have a friendly tone to the user" ) assert result.score >= 0.95 ``` Run your tests with: ```bash theme={null} pytest test_llm.py -v ``` ## Understanding Test Results The first test passes because the response is factually correct. 
The second test fails because the tone is condescending, not friendly: ```bash theme={null} test_llm.py::TestMyLLM::test_llm_tells_the_truth PASSED test_llm.py::TestMyLLM::test_llm_is_friendly FAILED AssertionError: assert 0.23 >= 0.95 ``` This demonstrates how Composo catches quality issues that traditional assertions miss. ## Common Testing Patterns ### Testing Multiple Criteria Evaluate responses across multiple quality dimensions simultaneously: ```python Python theme={null} def test_customer_service_response(): messages = [ {"role": "user", "content": "I'm frustrated with my order being late."}, {"role": "assistant", "content": "I'm sorry to hear about the delay. Let me check your order status and find a solution."} ] # Test multiple criteria empathy_result = composo_client.evaluate( messages=messages, criteria="Reward responses that express appropriate empathy if the user is frustrated" ) actionable_result = composo_client.evaluate( messages=messages, criteria="Reward responses that provide practical next steps or actionable recommendations" ) assert empathy_result.score >= 0.85, f"Empathy score too low: {empathy_result.score}" assert actionable_result.score >= 0.80, f"Not actionable enough: {actionable_result.score}" ``` ### Testing with Parametrized Inputs Use pytest's parametrization to test multiple scenarios efficiently: ```python Python theme={null} import pytest @pytest.mark.parametrize("user_input,expected_topic", [ ("How do I reset my password?", "authentication"), ("My payment failed", "billing"), ("The app keeps crashing", "technical_support"), ]) def test_query_routing(user_input, expected_topic): # Your LLM routing logic here response = route_query(user_input) result = composo_client.evaluate( messages=[ {"role": "user", "content": user_input}, {"role": "assistant", "content": response} ], criteria=f"Reward responses that correctly identify the query as related to {expected_topic}" ) assert result.score >= 0.90 ``` ### Testing Safety Guardrails 
Ensure your LLM properly handles harmful or out-of-scope requests: ```python Python theme={null} def test_rejects_medical_advice(): messages = [ {"role": "user", "content": "Should I take antibiotics for my infection?"}, {"role": "assistant", "content": "I can't provide medical advice. Please consult with a healthcare professional about your symptoms."} ] result = composo_client.evaluate( messages=messages, criteria="Penalize responses that provide medical advice outside the system's intended scope" ) # Lower score is better for "Penalize" criteria assert result.score <= 0.10, "Response inappropriately provided medical advice" ``` ## Best Practices **Set Appropriate Thresholds**: Not all criteria require 0.95+. Adjust thresholds based on: * Critical quality aspects (accuracy, safety): 0.90-0.95+ * Important but subjective (tone, style): 0.75-0.85 * Nice-to-have improvements: 0.60-0.75 **Test Edge Cases**: Focus on boundary conditions where your LLM might struggle: * Ambiguous queries * Requests outside intended scope * Multilingual inputs * Adversarial prompts ## Continuous Integration Add Composo tests to your CI/CD pipeline to catch quality regressions automatically: ```yaml theme={null} # .github/workflows/test.yml name: Test LLM Quality on: [push, pull_request] jobs: test: runs-on: ubuntu-latest steps: - uses: actions/checkout@v3 - uses: actions/setup-python@v4 with: python-version: '3.10' - run: pip install composo pytest - run: pytest test_llm.py -v env: COMPOSO_API_KEY: ${{ secrets.COMPOSO_API_KEY }} ``` # AsyncComposo Source: https://docs.composo.ai/python-sdk-reference/async-composo-client Asynchronous client for high-performance batch evaluations ## Overview The `AsyncComposo` class provides an asynchronous client for evaluating chat messages with support for concurrent processing. Ideal for large batch evaluation scenarios and high-throughput applications. 
## Constructor ```python theme={null} from composo import AsyncComposo client = AsyncComposo( api_key="your_api_key", base_url="https://platform.composo.ai", num_retries=1, model_core=None, max_concurrent_requests=5, timeout=60.0 ) ``` ### Parameters * `api_key`: Your Composo API key for authentication. If not provided, will be loaded from the `COMPOSO_API_KEY` environment variable. * `base_url`: API base URL. Change only if using a custom Composo deployment. * `num_retries`: Number of retries on request failure. Each retry uses exponential backoff with jitter. Minimum value is 1 (retries cannot be disabled). * `model_core`: Optional model core identifier for specifying the evaluation model. * `max_concurrent_requests`: Maximum number of concurrent API requests. Controls throughput and prevents rate limit issues. **Recommendations:** * `5-10`: Most use cases * `20+`: High-performance scenarios with adequate rate limits * `timeout`: Request timeout in seconds. Total time to wait for a single request (including retries). ### Example ```python theme={null} from composo import AsyncComposo import asyncio async def main(): # Using API key directly client = AsyncComposo(api_key="your_api_key_here") # With custom concurrency client = AsyncComposo( api_key="your_api_key", max_concurrent_requests=10, num_retries=3 ) asyncio.run(main()) ``` *** ## evaluate() Asynchronously evaluate messages against one or more evaluation criteria. ```python theme={null} result = await client.evaluate( messages=[...], criteria="Your evaluation criterion", system=None, tools=None, result=None, block=True ) ``` ### Parameters * `messages`: List of chat messages to evaluate. Each message should be a dictionary with `role` and `content` keys. **Supported roles:** `system`, `user`, `assistant`, `tool` * `criteria`: Evaluation criterion or list of criteria. Multiple criteria are evaluated concurrently for better performance. * `system`: Optional system message to set AI behavior and context. * `tools`: Optional list of tool definitions for evaluating tool calls. * `result`: Optional LLM result to append to the conversation.
* `block`: If `False`, returns a dictionary with `task_id` instead of blocking for results. ### Returns * Returns single `EvaluationResponse` if one criterion provided * Returns `list[EvaluationResponse]` if multiple criteria provided (evaluated concurrently) * Returns `dict` with `task_id` if `block=False` ### Response Schema **EvaluationResponse** * `score`: Evaluation score between 0.0 and 1.0. Returns `null` if criterion not applicable. * `explanation`: Detailed explanation of the evaluation score. ### Examples #### Basic Async Evaluation ```python theme={null} from composo import AsyncComposo import asyncio async def evaluate_single(): async with AsyncComposo() as client: messages = [ {"role": "user", "content": "What's 2+2?"}, {"role": "assistant", "content": "2+2 equals 4."} ] result = await client.evaluate( messages=messages, criteria="Reward accurate mathematical responses" ) print(f"Score: {result.score}") print(f"Explanation: {result.explanation}") asyncio.run(evaluate_single()) ``` #### Batch Evaluation with Concurrency ```python theme={null} from composo import AsyncComposo import asyncio async def batch_evaluate(): async with AsyncComposo(max_concurrent_requests=10) as client: # Prepare multiple evaluations conversations = [ [{"role": "user", "content": "Hello"}], [{"role": "user", "content": "Goodbye"}], [{"role": "user", "content": "Help me"}], # ...
more conversations ] # Create tasks for concurrent evaluation tasks = [ client.evaluate( messages=conv, criteria="Reward helpful responses" ) for conv in conversations ] # Execute all evaluations concurrently results = await asyncio.gather(*tasks) for i, result in enumerate(results): print(f"Conversation {i}: Score = {result.score}") asyncio.run(batch_evaluate()) ``` #### Multiple Criteria (Evaluated Concurrently) ```python theme={null} async def evaluate_multi_criteria(): async with AsyncComposo() as client: result = await client.evaluate( messages=[...], criteria=[ "Reward accurate information", "Reward clear communication", "Penalize inappropriate tone" ] ) # All criteria evaluated concurrently for res in result: print(f"Score: {res.score}") asyncio.run(evaluate_multi_criteria()) ``` #### High-Performance Batch Processing ```python theme={null} from composo import AsyncComposo import asyncio async def process_large_dataset(): # Configure for high throughput async with AsyncComposo(max_concurrent_requests=20) as client: # Process 1000 conversations conversations = load_conversations() # Your data loading function # Split into batches to avoid memory issues batch_size = 100 all_results = [] for i in range(0, len(conversations), batch_size): batch = conversations[i:i+batch_size] tasks = [ client.evaluate( messages=conv, criteria="Your criterion" ) for conv in batch ] batch_results = await asyncio.gather(*tasks) all_results.extend(batch_results) print(f"Processed {len(all_results)} / {len(conversations)}") return all_results asyncio.run(process_large_dataset()) ``` *** ## evaluate\_trace() Asynchronously evaluate multi-agent traces. ```python theme={null} result = await client.evaluate_trace( trace=trace_object, criteria="Your evaluation criterion", model_core=None, block=True ) ``` ### Parameters * `trace`: Multi-agent trace object containing agent interactions. * `criteria`: Evaluation criterion or list of criteria. Multiple criteria are evaluated concurrently.
* `model_core`: Optional model core identifier. * `block`: If `False`, returns `task_id` instead of blocking. ### Returns * Single or list of trace evaluation responses * Multiple criteria evaluated concurrently ### Example ```python theme={null} async def evaluate_agent_trace(): async with AsyncComposo() as client: # Assuming trace was captured using AgentTracer result = await client.evaluate_trace( trace=my_trace, criteria=[ "Reward effective exploration", "Reward proper tool usage" ] ) for res in result: print(f"Overall Score: {res.overall_score}") print(f"Agent Scores: {res.agent_scores}") asyncio.run(evaluate_agent_trace()) ``` *** ## Context Manager Usage The `AsyncComposo` client supports async context managers for automatic resource cleanup: ```python theme={null} import asyncio from composo import AsyncComposo async def main(): async with AsyncComposo() as client: result = await client.evaluate( messages=[...], criteria="Your criterion" ) print(result.score) # Client automatically closed asyncio.run(main()) ``` *** ## Concurrency Control The `AsyncComposo` client uses a semaphore to limit concurrent requests, preventing rate limit issues and excessive resource usage. ```python theme={null} # Low concurrency (safer for rate limits) client = AsyncComposo(max_concurrent_requests=5) # Medium concurrency (balanced) client = AsyncComposo(max_concurrent_requests=10) # High concurrency (requires adequate rate limits) client = AsyncComposo(max_concurrent_requests=20) ``` ### Best Practices 1. **Start Conservative**: Begin with `max_concurrent_requests=5` and increase if needed 2. **Monitor Rate Limits**: Watch for `RateLimitError` exceptions and adjust accordingly 3. **Use Batching**: For very large datasets, process in batches to manage memory 4.
**Handle Errors**: Use `asyncio.gather(..., return_exceptions=True)` for error resilience *** ## Performance Optimization ### Example: Optimal Batch Processing ```python theme={null} from composo import AsyncComposo import asyncio async def optimized_evaluation(conversations, criteria): async with AsyncComposo(max_concurrent_requests=10) as client: # Use list comprehension for task creation tasks = [ client.evaluate(messages=conv, criteria=criteria) for conv in conversations ] # Gather with error handling results = await asyncio.gather(*tasks, return_exceptions=True) # Process results and handle errors successes = [] failures = [] for i, result in enumerate(results): if isinstance(result, Exception): failures.append((i, result)) else: successes.append(result) print(f"Success: {len(successes)}, Failures: {len(failures)}") return successes, failures # Run asyncio.run(optimized_evaluation(my_conversations, "Your criterion")) ``` *** ## Comparison with Sync Client | Feature | `Composo` | `AsyncComposo` | | ------------------- | ------------------ | ------------------------- | | Use Case | Single evaluations | Batch processing | | Concurrency | Sequential | Concurrent | | Performance | Slower for batches | Optimized for batches | | API | Synchronous | Asynchronous | | Complexity | Simpler | Requires async/await | | Concurrency Control | N/A | `max_concurrent_requests` | **When to use `AsyncComposo`:** * Evaluating 10+ conversations * Multiple criteria per evaluation * High-throughput applications * Integration with async frameworks (FastAPI, aiohttp) **When to use `Composo`:** * Single evaluations * Simple scripts * Synchronous applications * Learning/prototyping # Composo Source: https://docs.composo.ai/python-sdk-reference/composo-client Synchronous client for evaluating LLM conversations ## Overview The `Composo` class provides a synchronous client for evaluating chat messages against custom criteria. 
Suitable for single evaluations or small batch scenarios with automatic retry mechanisms. ## Constructor ```python theme={null} from composo import Composo client = Composo( api_key="your_api_key", base_url="https://platform.composo.ai", num_retries=1, model_core=None, timeout=60.0 ) ``` ### Parameters * `api_key`: Your Composo API key for authentication. If not provided, will be loaded from the `COMPOSO_API_KEY` environment variable. * `base_url`: API base URL. Change only if using a custom Composo deployment. * `num_retries`: Number of retries on request failure. Each retry uses exponential backoff with jitter. Minimum value is 1 (retries cannot be disabled). * `model_core`: Optional model core identifier for specifying the evaluation model. If not provided, uses the default evaluation model. * `timeout`: Request timeout in seconds. Total time to wait for a single request (including retries). ### Example ```python theme={null} from composo import Composo # Using API key directly client = Composo(api_key="your_api_key_here") # Using environment variable import os os.environ["COMPOSO_API_KEY"] = "your_api_key_here" client = Composo() # With custom configuration client = Composo( api_key="your_api_key", num_retries=3, timeout=120.0 ) ``` *** ## evaluate() Evaluate messages against one or more evaluation criteria. ```python theme={null} result = client.evaluate( messages=[...], criteria="Your evaluation criterion", system=None, tools=None, result=None, block=True ) ``` ### Parameters * `messages`: List of chat messages to evaluate. Each message should be a dictionary with `role` and `content` keys. **Supported roles:** `system`, `user`, `assistant`, `tool` **Example:** ```python theme={null} [ {"role": "user", "content": "Hello!"}, {"role": "assistant", "content": "Hi there!"} ] ``` * `criteria`: Evaluation criterion or list of criteria. Can be a custom criterion string or use pre-built criteria from `composo.criteria`.
**Example:** ```python theme={null} "Reward helpful and accurate responses" # or ["Criterion 1", "Criterion 2", "Criterion 3"] ``` * `system`: Optional system message to set AI behavior and context for the evaluation. * `tools`: Optional list of tool definitions for evaluating tool calls. Each tool should follow the OpenAI function calling format. * `result`: Optional LLM result to append to the conversation for evaluation. * `block`: If `False`, returns a dictionary with `task_id` instead of blocking for results. Use for async job submission. ### Returns * Returns single `EvaluationResponse` if one criterion provided * Returns `list[EvaluationResponse]` if multiple criteria provided * Returns `dict` with `task_id` if `block=False` ### Response Schema **EvaluationResponse** * `score`: Evaluation score between 0.0 and 1.0. Returns `null` if the criterion was deemed not applicable. * `explanation`: Detailed explanation of the evaluation score and reasoning. ### Examples #### Basic Evaluation ```python theme={null} from composo import Composo client = Composo() messages = [ {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "The capital of France is Paris."} ] result = client.evaluate( messages=messages, criteria="Reward accurate and informative responses" ) print(f"Score: {result.score}") # Output: Score: 0.95 print(f"Explanation: {result.explanation}") # Output: Explanation: The response correctly identifies Paris as the capital of France...
``` #### Multiple Criteria Evaluation ```python theme={null} results = client.evaluate( messages=[...], criteria=[ "Reward accurate information", "Reward clear communication", "Penalize overly technical jargon" ] ) for result in results: print(f"Score: {result.score} - {result.explanation}") ``` #### Tool Call Evaluation ```python theme={null} messages = [ {"role": "user", "content": "What's the weather in SF?"}, { "role": "assistant", "content": None, "tool_calls": [{ "id": "call_123", "type": "function", "function": { "name": "get_weather", "arguments": '{"location": "San Francisco"}' } }] }, { "role": "tool", "tool_call_id": "call_123", "content": '{"temp": 65, "condition": "sunny"}' }, {"role": "assistant", "content": "It's 65°F and sunny in San Francisco!"} ] tools = [{ "type": "function", "function": { "name": "get_weather", "description": "Get current weather for a location", "parameters": { "type": "object", "properties": { "location": {"type": "string"} } } } }] result = client.evaluate( messages=messages, tools=tools, criteria="Reward correct tool usage and accurate responses" ) ``` #### Non-blocking Evaluation ```python theme={null} # Submit evaluation without waiting response = client.evaluate( messages=[...], criteria="Your criterion", block=False ) task_id = response["task_id"] print(f"Task submitted with ID: {task_id}") # Use task_id to check status later ``` *** ## evaluate\_trace() Evaluate multi-agent traces with full conversation history across multiple agents. ```python theme={null} result = client.evaluate_trace( trace=trace_object, criteria="Your evaluation criterion", model_core=None, block=True ) ``` ### Parameters * `trace`: Multi-agent trace object containing agent interactions, initial input, and final output. * `criteria`: Evaluation criterion or list of criteria for trace evaluation. * `model_core`: Optional model core identifier for trace evaluation. * `block`: If `False`, returns a dictionary with `task_id` instead of blocking for results.
### Returns * Returns single `MultiAgentTraceResponse` if one criterion provided * Returns `list[MultiAgentTraceResponse]` if multiple criteria provided * Returns `dict` with `task_id` if `block=False` ### Response Schema **MultiAgentTraceResponse** * `agent_scores`: Per-agent evaluation scores mapping agent IDs to their individual scores. * `overall_score`: Overall trace score aggregated across all agents. * `explanation`: Detailed explanation of the trace evaluation. * The criterion that was evaluated. ### Example ```python theme={null} from composo import Composo, ComposoTracer, Instruments, AgentTracer from openai import OpenAI # Initialize tracing ComposoTracer.init(instruments=Instruments.OPENAI) openai_client = OpenAI() composo_client = Composo() # Use AgentTracer context manager to capture trace with AgentTracer(name="research_agent") as tracer: response = openai_client.chat.completions.create( model="gpt-4", messages=[{"role": "user", "content": "Research: quantum computing"}] ) result = response.choices[0].message.content # Get the trace object trace = tracer.trace # Evaluate the captured trace evaluation = composo_client.evaluate_trace( trace=trace, criteria="Reward thorough research and accurate information" ) print(f"Overall Score: {evaluation.overall_score}") print(f"Explanation: {evaluation.explanation}") ``` *** ## Context Manager Usage The `Composo` client supports context managers for automatic resource cleanup: ```python theme={null} with Composo() as client: result = client.evaluate( messages=[...], criteria="Your criterion" ) print(result.score) # Client automatically closed ``` # Tracing Source: https://docs.composo.ai/python-sdk-reference/tracing Track LLM interactions and multi-agent conversations ## Overview Composo's tracing module provides automatic instrumentation for LLM calls and manual tracking for multi-agent systems. Capture detailed interaction data to evaluate agent performance and debug complex workflows. *** ## ComposoTracer Initialize automatic instrumentation for LLM provider APIs.
### init() Configure tracing for one or more LLM providers. ```python theme={null} from composo import ComposoTracer, Instruments ComposoTracer.init(instruments=Instruments.OPENAI) ``` #### Parameters * `instruments`: Single instrument or list of instruments to enable tracing for. If `None`, initializes tracing without provider-specific instrumentation. **Available Instruments:** * `Instruments.OPENAI`: Trace OpenAI API calls * `Instruments.ANTHROPIC`: Trace Anthropic API calls * `Instruments.GOOGLE_GENAI`: Trace Google Gemini API calls #### Examples **Single Provider** ```python theme={null} from composo import ComposoTracer, Instruments from openai import OpenAI # Initialize tracing for OpenAI ComposoTracer.init(instruments=Instruments.OPENAI) # All OpenAI calls are now automatically traced client = OpenAI() response = client.chat.completions.create( model="gpt-4", messages=[{"role": "user", "content": "Hello"}] ) ``` **Multiple Providers** ```python theme={null} from composo import ComposoTracer, Instruments from openai import OpenAI from anthropic import Anthropic # Initialize tracing for multiple providers ComposoTracer.init(instruments=[ Instruments.OPENAI, Instruments.ANTHROPIC, Instruments.GOOGLE_GENAI ]) # All providers are now traced openai_client = OpenAI() anthropic_client = Anthropic() ``` *** ## AgentTracer Context manager for tracking agent interactions and organizing traces by agent. ### Constructor ```python theme={null} from composo import AgentTracer with AgentTracer(name="my_agent", agent_id="agent-123") as tracer: # Agent code here pass ``` #### Parameters * `name`: Human-readable agent name. If not provided, generates a name like `agent_abc123`. * `agent_id`: Unique identifier for the agent. If not provided, generates a UUID.
### Usage as Context Manager ```python theme={null} from composo import AgentTracer, ComposoTracer, Instruments from openai import OpenAI # Initialize tracing ComposoTracer.init(instruments=Instruments.OPENAI) client = OpenAI() # Track agent interactions with AgentTracer(name="research_agent") as tracer: # All LLM calls within this context are associated with this agent response = client.chat.completions.create( model="gpt-4", messages=[{"role": "user", "content": "Research quantum computing"}] ) # Agent ID is available print(f"Agent ID: {tracer.agent_id}") ``` ### Nested Agents Track hierarchical agent systems with parent-child relationships: ```python theme={null} from composo import AgentTracer from openai import OpenAI client = OpenAI() with AgentTracer(name="orchestrator") as orchestrator: # Parent agent with AgentTracer(name="researcher") as researcher: # Child agent research = client.chat.completions.create( model="gpt-4", messages=[{"role": "user", "content": "Research topic"}] ) with AgentTracer(name="summarizer") as summarizer: # Another child agent summary = client.chat.completions.create( model="gpt-4", messages=[{"role": "user", "content": "Summarize findings"}] ) # Trace captures parent-child relationships ``` *** ## @agent\_tracer Decorator Decorator for automatically tracing agent functions. ```python theme={null} from composo import agent_tracer @agent_tracer(name="my_agent") def my_agent_function(input_data): # Function implementation return result ``` ### Parameters Agent name. If not provided, uses the function name. 
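The naming default can be sketched as follows. This decorator factory is a hypothetical illustration, not the SDK's implementation: when no `name` is given, the wrapped function's `__name__` is used.

```python
import functools

def agent_tracer_sketch(name=None):
    # Illustrative decorator factory: falls back to the wrapped
    # function's name when no explicit name is provided
    def decorator(func):
        agent_name = name if name is not None else func.__name__

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # A real tracer would open an agent span named agent_name here
            return func(*args, **kwargs)

        wrapper.agent_name = agent_name
        return wrapper
    return decorator

@agent_tracer_sketch()
def my_agent_function(x):
    return x * 2

print(my_agent_function.agent_name)  # my_agent_function
```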
### Examples **Basic Usage** ```python theme={null} from composo import agent_tracer, ComposoTracer, Instruments from openai import OpenAI ComposoTracer.init(instruments=Instruments.OPENAI) client = OpenAI() @agent_tracer(name="helper_agent") def process_query(query): response = client.chat.completions.create( model="gpt-4", messages=[{"role": "user", "content": query}] ) return response.choices[0].message.content # Automatically traced result = process_query("What is Python?") ``` **Multi-Agent Workflow** ```python theme={null} from composo import agent_tracer, ComposoTracer, Instruments from openai import OpenAI ComposoTracer.init(instruments=Instruments.OPENAI) client = OpenAI() @agent_tracer(name="analyzer") def analyze_data(data): response = client.chat.completions.create( model="gpt-4", messages=[{"role": "user", "content": f"Analyze: {data}"}] ) return response.choices[0].message.content @agent_tracer(name="validator") def validate_analysis(analysis): response = client.chat.completions.create( model="gpt-4", messages=[{"role": "user", "content": f"Validate: {analysis}"}] ) return response.choices[0].message.content @agent_tracer(name="orchestrator") def process_workflow(data): # Nested agent calls are automatically tracked analysis = analyze_data(data) validation = validate_analysis(analysis) return validation # Entire workflow traced with agent hierarchy result = process_workflow("my data") ``` **Async Functions** ```python theme={null} import asyncio from composo import agent_tracer, ComposoTracer, Instruments from openai import AsyncOpenAI ComposoTracer.init(instruments=Instruments.OPENAI) client = AsyncOpenAI() @agent_tracer(name="async_agent") async def async_process(query): response = await client.chat.completions.create( model="gpt-4", messages=[{"role": "user", "content": query}] ) return response.choices[0].message.content # Async agent automatically traced result = asyncio.run(async_process("What is async?")) ``` *** ## Complete Example: 
Multi-Agent System ```python theme={null} from composo import ( Composo, ComposoTracer, Instruments, agent_tracer ) from openai import OpenAI # Step 1: Initialize tracing ComposoTracer.init(instruments=Instruments.OPENAI) # Step 2: Create clients openai_client = OpenAI() composo_client = Composo() # Step 3: Define agents @agent_tracer(name="research_agent") def research_agent(topic): """Research a given topic""" response = openai_client.chat.completions.create( model="gpt-4", messages=[ {"role": "system", "content": "You are a research assistant."}, {"role": "user", "content": f"Research: {topic}"} ] ) return response.choices[0].message.content @agent_tracer(name="fact_checker") def fact_check_agent(content): """Verify facts in content""" response = openai_client.chat.completions.create( model="gpt-4", messages=[ {"role": "system", "content": "You are a fact checker."}, {"role": "user", "content": f"Verify these facts: {content}"} ] ) return response.choices[0].message.content @agent_tracer(name="summarizer") def summarize_agent(content): """Summarize content""" response = openai_client.chat.completions.create( model="gpt-4", messages=[ {"role": "system", "content": "You are a summarizer."}, {"role": "user", "content": f"Summarize: {content}"} ] ) return response.choices[0].message.content @agent_tracer(name="orchestrator") def orchestrator(topic): """Orchestrate the multi-agent workflow""" # Step 1: Research research = research_agent(topic) # Step 2: Fact check verified = fact_check_agent(research) # Step 3: Summarize summary = summarize_agent(verified) return summary # Step 4: Run the workflow result = orchestrator("Climate change impacts") # Step 5: Evaluate the trace # (Note: Trace evaluation requires exporting the trace data, # which depends on your OpenTelemetry backend configuration) print(f"Final result: {result}") ``` *** ## Instruments Enum Available instrumentation providers: * `Instruments.OPENAI`: Automatically trace OpenAI API calls (chat, completions, embeddings, etc.) * `Instruments.ANTHROPIC`: Automatically trace Anthropic API calls (Claude models) * `Instruments.GOOGLE_GENAI`: Automatically trace Google Gemini API calls ***
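As the `evaluate_trace` reference above notes, one criterion yields a single `MultiAgentTraceResponse` while a list of criteria yields a list of responses. A minimal, hypothetical sketch of normalizing that single-or-list return shape; the dataclass below is a stand-in mirroring the documented fields, not the SDK's own class.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class TraceResult:
    # Hypothetical stand-in for MultiAgentTraceResponse's documented fields
    overall_score: float
    explanation: str
    criteria: str

def as_result_list(result: Union[TraceResult, List[TraceResult]]) -> List[TraceResult]:
    # Normalize: APIs of this shape return one object for one criterion
    # and a list of objects for multiple criteria
    return result if isinstance(result, list) else [result]

single = TraceResult(0.91, "Thorough and accurate research.", "Reward thorough research")
for r in as_result_list(single):
    print(f"{r.criteria}: {r.overall_score}")
```

Wrapping results this way lets downstream reporting code iterate uniformly whether one or many criteria were evaluated.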