This guide shows how to integrate Composo’s deterministic evaluation with Langfuse’s observability platform to evaluate your LLM applications with confidence.

Overview

Langfuse provides comprehensive observability for LLM applications with tracing, debugging, and dataset management capabilities. Composo delivers deterministic, accurate evaluation through purpose-built generative reward models that achieve 92% accuracy (vs 72% for LLM-as-judge). Together, they enable you to:
  • ✅ Track every LLM interaction through Langfuse’s tracing
  • ✅ Add deterministic evaluation scores to your traces
  • ✅ Evaluate datasets programmatically with reliable metrics
  • ✅ Ship AI features with confidence using quantitative, trustworthy metrics

Prerequisites

Bash
pip install langfuse composo
Python
import os
from langfuse import get_client
from composo import Composo, AsyncComposo

# Set your API keys
os.environ["LANGFUSE_PUBLIC_KEY"] = "your-public-key"
os.environ["LANGFUSE_SECRET_KEY"] = "your-secret-key"
os.environ["COMPOSO_API_KEY"] = "your-composo-key"

# Initialize clients
langfuse = get_client()
composo_client = Composo()
async_composo = AsyncComposo()
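To confirm your setup, you can run a quick standalone evaluation. This is a minimal sketch with a hard-coded example exchange; it uses the same evaluate call and the .score / .explanation fields that appear throughout this guide.
Python
# Quick sanity check: evaluate a single hard-coded exchange with Composo
result = composo_client.evaluate(
    messages=[
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ],
    criteria="Reward responses that are helpful",
)
print(result.score, result.explanation)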

How Langfuse & Composo work in combination

(Diagram: your application's LLM calls are traced in Langfuse; each output is sent to Composo for evaluation, and the resulting score and explanation are attached back to the Langfuse trace.)

Method 1: Real-time Trace Evaluation

Evaluate LLM outputs as they’re generated in production or development. This approach uses the @observe decorator to automatically trace your LLM calls, then evaluates them asynchronously with Composo. More detail on how the Langfuse @observe decorator works is available here.

When to use

  • Production monitoring with real-time quality scores
  • Development iteration with immediate feedback

Implementation

Python
import asyncio
from langfuse import get_client, observe
from anthropic import Anthropic
from composo import AsyncComposo

# Initialize async Composo client
async_composo = AsyncComposo()

@observe()
async def llm_call(input_data: str) -> str:
    # LLM call with async evaluation using @observe decorator
    model_name = "claude-sonnet-4-20250514"
    
    anthropic = Anthropic()
    resp = anthropic.messages.create(
        model=model_name,
        max_tokens=100,
        messages=[{"role": "user", "content": input_data}],
    )
    output = resp.content[0].text.strip()
    
    # Get trace ID for scoring
    trace_id = get_client().get_current_trace_id()
    evaluation_criteria = "Reward responses that are helpful"
    
    # Evaluate the output with Composo and attach the score to the trace.
    # Awaited here for simplicity; in production you can hand this off to a
    # task queue or background task so it doesn't block the response.
    await evaluate_with_composo(trace_id, input_data, output, evaluation_criteria)
    
    return output

async def evaluate_with_composo(trace_id, input_data, output, evaluation_criteria):
    # Evaluate LLM output with Composo and score in Langfuse
    # Composo expects a list of chat messages
    messages = [
        {"role": "user", "content": input_data},
        {"role": "assistant", "content": output},
    ]
    
    eval_resp = await async_composo.evaluate(
        messages=messages, 
        criteria=evaluation_criteria
    )
    
    # Score the trace in Langfuse
    langfuse = get_client()
    langfuse.create_score(
        trace_id=trace_id,
        name=evaluation_criteria,
        value=eval_resp.score,
        comment=eval_resp.explanation,
    )
Then in your main application:
Python
# Simply call the function from an async context - Langfuse logs the trace and
# Composo evaluates asynchronously (use asyncio.run(llm_call(...)) from a
# synchronous script)
await llm_call(input_data)
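If you prefer the evaluation to be fully non-blocking, one option (a sketch, not the only approach; the _background_tasks set and schedule_evaluation helper below are hypothetical, not part of either SDK) is to schedule it with asyncio.create_task so the LLM output is returned immediately:
Python
import asyncio

# Hypothetical helper: keep references so pending evaluation tasks aren't
# garbage-collected before they finish
_background_tasks = set()

def schedule_evaluation(trace_id, input_data, output, evaluation_criteria):
    # Fire-and-forget: the evaluation runs in the background while the caller
    # returns the LLM output immediately
    task = asyncio.create_task(
        evaluate_with_composo(trace_id, input_data, output, evaluation_criteria)
    )
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)
Inside llm_call you would then call schedule_evaluation(...) instead of awaiting evaluate_with_composo directly.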

Method 2: Dataset Evaluation

Use this method to evaluate your LLM application on a dataset that already exists in Langfuse. The item.run() context manager automatically links execution traces to dataset items. For more detail on how this works in Langfuse, please see here.

When to use

  • Testing prompt or model changes on existing Langfuse datasets
  • Running experiments that you want to track in Langfuse UI
  • Creating new dataset runs for comparison
  • Regression testing with immediate Langfuse visibility

Implementation

Python
from langfuse import get_client
from anthropic import Anthropic
from composo import Composo

# Initialize Composo client
composo = Composo()

def llm_call(question: str, item_id: str, run_name: str):
    # Encapsulates the LLM call and appends input/output data to the trace
    model_name = "claude-sonnet-4-20250514"
    
    with get_client().start_as_current_generation(
        name=run_name,
        input={"question": question},
        metadata={"item_id": item_id},
        model=model_name,
    ) as generation:
        anthropic = Anthropic()
        resp = anthropic.messages.create(
            model=model_name,
            max_tokens=100,
            messages=[{"role": "user", "content": f"Question: {question}"}],
        )
        answer = resp.content[0].text.strip()
        
        generation.update_trace(
            input={"question": question},
            output={"answer": answer},
        )
        return answer

def run_dataset_evaluation(dataset_name: str, run_name: str, evaluation_criteria: str):
    # Run evaluation on a Langfuse dataset using Composo
    langfuse = get_client()
    dataset = langfuse.get_dataset(name=dataset_name)
    
    for item in dataset.items:
        print(f"Running evaluation for item: {item.id}")
        
        # item.run() automatically links the trace to the dataset item
        with item.run(run_name=run_name) as root_span:
            # Generate answer
            generated_answer = llm_call(
                question=item.input,
                item_id=item.id,
                run_name=run_name,
            )
            
            print(f"Item {item.id} processed. Trace ID: {root_span.trace_id}")
            
            # Evaluate with Composo
            messages = [
                {"role": "user", "content": f"Question: {item.input}"},
                {"role": "assistant", "content": generated_answer},
            ]
            
            eval_resp = composo.evaluate(
                messages=messages, 
                criteria=evaluation_criteria
            )
            
            # Score the trace
            root_span.score_trace(
                name=evaluation_criteria,
                value=eval_resp.score,
                comment=eval_resp.explanation,
            )
    
    # Ensure all data is sent to Langfuse
    langfuse.flush()

# Example usage
if __name__ == "__main__":
    run_dataset_evaluation(
        dataset_name="your-dataset-name",
        run_name="evaluation-run-1",
        evaluation_criteria="Reward responses that are accurate and helpful"
    )
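If you want to score each dataset item against several criteria, one straightforward option (a sketch; the criteria_list values are just examples) is to loop over the criteria inside the item.run() block and create one Langfuse score per criterion:
Python
criteria_list = [
    "Reward responses that are accurate and helpful",
    "Reward responses that are concise",
]

# Inside the `with item.run(...)` block, after `messages` has been built:
for criterion in criteria_list:
    eval_resp = composo.evaluate(messages=messages, criteria=criterion)
    root_span.score_trace(
        name=criterion,
        value=eval_resp.score,
        comment=eval_resp.explanation,
    )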

Method 3: Evaluating New Datasets

Use this method to evaluate datasets that don’t yet exist in Langfuse. You can create your own dataset locally, evaluate it with Composo, and log both the traces and evaluation scores to Langfuse for inspection in the Langfuse UI.

When to use

  • Evaluating new datasets before uploading to Langfuse
  • Quick experimentation with custom datasets
  • Batch evaluation of local test cases
  • Creating baseline evaluations for new use cases

Implementation

Please see this notebook for the full implementation approach; a minimal sketch of the pattern is shown below.
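This sketch makes a few assumptions: the local_test_cases list and the run_local_evaluation helper are hypothetical, and the Langfuse span methods (start_as_current_span, update_trace, score_trace) are used exactly as in Method 2.
Python
from langfuse import get_client
from anthropic import Anthropic
from composo import Composo

composo = Composo()

# Hypothetical local dataset - not yet uploaded to Langfuse
local_test_cases = [
    {"question": "What is the capital of France?"},
    {"question": "Explain retrieval-augmented generation in one sentence."},
]

def run_local_evaluation(test_cases, evaluation_criteria: str):
    langfuse = get_client()
    anthropic = Anthropic()
    
    for case in test_cases:
        # Trace each call in Langfuse
        with langfuse.start_as_current_span(name="local-dataset-eval") as span:
            resp = anthropic.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=100,
                messages=[{"role": "user", "content": case["question"]}],
            )
            answer = resp.content[0].text.strip()
            span.update_trace(input=case, output={"answer": answer})
            
            # Evaluate with Composo and attach the score to the trace
            eval_resp = composo.evaluate(
                messages=[
                    {"role": "user", "content": case["question"]},
                    {"role": "assistant", "content": answer},
                ],
                criteria=evaluation_criteria,
            )
            span.score_trace(
                name=evaluation_criteria,
                value=eval_resp.score,
                comment=eval_resp.explanation,
            )
    
    # Ensure all data is sent to Langfuse
    langfuse.flush()

run_local_evaluation(local_test_cases, "Reward responses that are accurate and helpful")
Once you’re happy with the results, you can upload the same test cases as a Langfuse dataset and switch to Method 2 for subsequent runs.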

Method Selection Recap

  • Use Method 1 for real-time production monitoring
  • Use Method 2 for evaluating existing Langfuse datasets
  • Use Method 3 for evaluating new datasets that don’t yet exist in Langfuse

Next Steps

  1. Start with Method 1 for immediate feedback during development
  2. Use Method 2 to run experiments on datasets in Langfuse
  3. Apply Method 3 to evaluate new datasets before uploading to Langfuse
Ready to get started? Sign up for Composo to get your API key and begin evaluating with confidence.