Retrieval-Augmented Generation (RAG) systems are only as good as their ability to accurately use retrieved information. Poor RAG performance leads to hallucinations, incomplete answers, and loss of user trust. Composo’s RAG framework provides comprehensive evaluation across the critical dimensions of RAG quality.
Our framework, developed through extensive R&D and rigorously tested with Fortune 500 companies and leading AI teams, delivers 92% accuracy in detecting hallucinations and faithfulness violations—far exceeding the ~70% accuracy of LLM-as-judge approaches.
- 18 months of research refining the optimal RAG evaluation criteria
- Battle-tested across hundreds of production RAG systems, including critical hallucination detection in regulated industries
- 92% agreement with expert human evaluators on RAG quality assessment
- 70% reduction in error rate compared to traditional LLM-as-judge methods
This isn’t just another evaluation tool—it’s the result of deep collaboration with industry leaders who needed evaluation that actually works for production RAG systems handling millions of queries daily.
- 📖 **Context Faithfulness**: "Reward responses that make only claims directly supported by the provided source material without any hallucination or speculation"
- ✅ **Completeness**: "Reward responses that comprehensively include all relevant information from the source material needed to fully answer the question"
- 🎯 **Context Precision**: "Reward responses that include only information necessary to answer the question without extraneous details from the source material"
- 🔍 **Relevance**: "Reward responses where all content directly addresses and is relevant to answering the user's specific question"
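Each criterion is a natural-language reward statement, and you can also score a response against a single one rather than the full bundle. The sketch below assumes `evaluate` accepts a criterion string directly, as the retrieval example further down this page does; the messages are illustrative.

```python
from composo import Composo

composo_client = Composo(api_key="your-api-key-here")

# Score a single response against just the Context Faithfulness criterion.
# Passing a criterion string follows the retrieval example later on this page.
faithfulness_criterion = (
    "Reward responses that make only claims directly supported by the provided "
    "source material without any hallucination or speculation"
)

result = composo_client.evaluate(
    messages=[
        {
            "role": "user",
            "content": "What is the current population of Tokyo?\n\nContext:\nThe Tokyo Metropolis itself has 14.0 million people."
        },
        {
            "role": "assistant",
            "content": "According to the provided context, the Tokyo Metropolis has 14.0 million people."
        }
    ],
    criteria=faithfulness_criterion
)

print(f"Faithfulness Score: {result.score:.2f}/1.00")
```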
Here’s how to evaluate a RAG system’s performance using our framework:
```python
from composo import Composo, criteria

composo_client = Composo(api_key="your-api-key-here")

# Example RAG conversation with retrieved context
messages = [
    {
        "role": "user",
        "content": """What is the current population of Tokyo?

Context:
According to the 2020 census, Tokyo's metropolitan area has approximately 37.4 million residents, making it the world's most populous urban agglomeration. The Tokyo Metropolis itself has 14.0 million people."""
    },
    {
        "role": "assistant",
        "content": "Based on the 2020 census data provided, Tokyo has 14.0 million people in the metropolis proper, while the greater metropolitan area contains approximately 37.4 million residents, making it the world's largest urban agglomeration."
    }
]

# Evaluate with the RAG framework
results = composo_client.evaluate(
    messages=messages,
    criteria=criteria.rag
)

for result in results:
    print(f"Score: {result.score}/1.00")
    print(f"Explanation: {result.explanation}\n")
```
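In a CI or regression suite you will usually want to turn these scores into a pass/fail signal. Here is a minimal sketch reusing the `results` list from the example above; the 0.8 cutoff is an illustrative choice (in line with the score bands noted in the retrieval example below), not a built-in default.

```python
# Flag any RAG criterion that scores below a chosen threshold.
# The 0.8 cutoff is illustrative, not a library default.
THRESHOLD = 0.8

failures = [r for r in results if r.score < THRESHOLD]
if failures:
    for r in failures:
        print(f"FAIL ({r.score:.2f}): {r.explanation}")
else:
    print("All RAG criteria passed.")
```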
Beyond evaluating the generated responses, you can also assess the quality of your retrieval system itself. This helps identify when your vector search or retrieval mechanism needs improvement before it impacts downstream generation.
Treat your retrieval step as a “tool call” and evaluate whether the retrieved chunks are actually relevant to the user’s query. This gives you quantitative metrics on retrieval precision.
```python
from composo import Composo

composo_client = Composo(api_key="your-api-key-here")

# User's question
user_query = "What is the current population of Tokyo?"

# Chunks retrieved by your RAG system
retrieved_chunks = """Chunk 1: According to the 2020 census, Tokyo's metropolitan area has approximately 37.4 million residents.

Chunk 2: The Tokyo Metropolis itself has 14.0 million people.

Chunk 3: Population density in Tokyo is approximately 6,158 people per square kilometer."""

# Define the retrieval tool (for context)
tools = [
    {
        "type": "function",
        "function": {
            "name": "rag_retrieval",
            "description": "Retrieves relevant document chunks based on semantic search",
            "parameters": {"type": "object", "required": [], "properties": {}}
        }
    }
]

# Evaluate retrieval quality
result = composo_client.evaluate(
    messages=[
        {"role": "user", "content": user_query},
        {"role": "function", "name": "rag_retrieval", "content": retrieved_chunks}
    ],
    tools=tools,
    criteria="Reward tool calls that retrieve chunks directly relevant to answering the user's question"
)

print(f"Retrieval Quality Score: {result.score:.2f}/1.00")

# High scores (>0.8) indicate good retrieval
# Low scores (<0.6) suggest retrieval improvements needed
```
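To monitor a pipeline end to end, you can run both checks on the same query: score the retrieved chunks first, then score the generated answer against the full RAG framework. The sketch below combines the two calls shown above into one helper; `evaluate_rag_turn` and its arguments are hypothetical names, and the tool definition is repeated from the previous example so the snippet stands alone.

```python
from composo import Composo, criteria

composo_client = Composo(api_key="your-api-key-here")

# Same retrieval tool definition as in the example above
tools = [{"type": "function", "function": {
    "name": "rag_retrieval",
    "description": "Retrieves relevant document chunks based on semantic search",
    "parameters": {"type": "object", "required": [], "properties": {}}
}}]


def evaluate_rag_turn(user_query: str, retrieved_chunks: str, answer: str) -> dict:
    """Score one RAG turn: retrieval relevance plus the full generation criteria."""
    # 1. Retrieval quality, using the tool-call pattern shown above
    retrieval_result = composo_client.evaluate(
        messages=[
            {"role": "user", "content": user_query},
            {"role": "function", "name": "rag_retrieval", "content": retrieved_chunks}
        ],
        tools=tools,
        criteria="Reward tool calls that retrieve chunks directly relevant to answering the user's question"
    )

    # 2. Generation quality against the full RAG framework
    generation_results = composo_client.evaluate(
        messages=[
            {"role": "user", "content": f"{user_query}\n\nContext:\n{retrieved_chunks}"},
            {"role": "assistant", "content": answer}
        ],
        criteria=criteria.rag
    )

    return {
        "retrieval_score": retrieval_result.score,
        "generation_scores": [r.score for r in generation_results]
    }
```

Logging both numbers per query makes it easier to tell whether a weak answer came from poor retrieval or from the generation step.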