Why RAG Evaluation Matters

Retrieval-Augmented Generation (RAG) systems are only as good as their ability to accurately use retrieved information. Poor RAG performance leads to hallucinations, incomplete answers, and loss of user trust. Composo’s RAG framework provides comprehensive evaluation across the critical dimensions of RAG quality.

The Composo RAG Framework

Our framework, developed through extensive R&D and rigorously tested with Fortune 500 companies and leading AI teams, delivers 92% accuracy in detecting hallucinations and faithfulness violations—far exceeding the ~70% accuracy of LLM-as-judge approaches.

Proven Performance

  • 18 months of research refining the optimal RAG evaluation criteria
  • Battle-tested across hundreds of production RAG systems, including critical hallucination detection in regulated industries
  • 92% agreement with expert human evaluators on RAG quality assessment
  • 70% reduction in error rate compared to traditional LLM-as-judge methods

This isn’t just another evaluation tool: it’s the result of deep collaboration with industry leaders who needed evaluation that actually works for production RAG systems handling millions of queries daily.

Core RAG Metrics

  • 📖 Context Faithfulness: “Reward responses that make only claims directly supported by the provided source material without any hallucination or speculation”
  • ✅ Completeness: “Reward responses that comprehensively include all relevant information from the source material needed to fully answer the question”
  • 🎯 Context Precision: “Reward responses that include only information necessary to answer the question without extraneous details from the source material”
  • 🔍 Relevance: “Reward responses where all content directly addresses and is relevant to answering the user’s specific question”
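
Each of these criteria can also be applied on its own. The sketch below scores just Context Faithfulness by passing the criterion text directly as a string, the same calling pattern used in the retrieval-quality example later on this page; the question, context, and answer are placeholder data.
Python
from composo import Composo

composo_client = Composo(api_key="your-api-key-here")

# Minimal RAG exchange with the retrieved context embedded in the user message
messages = [
    {
        "role": "user",
        "content": """What is the capital of France?

Context:
Paris is the capital and most populous city of France."""
    },
    {
        "role": "assistant",
        "content": "According to the provided context, the capital of France is Paris."
    }
]

# Evaluate against a single criterion passed as a string
result = composo_client.evaluate(
    messages=messages,
    criteria="Reward responses that make only claims directly supported by the provided source material without any hallucination or speculation"
)

print(f"Context Faithfulness: {result.score:.2f}/1.00")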

Implementation Example

Here’s how to evaluate a RAG system’s performance using our framework:
Python
from composo import Composo, criteria

composo_client = Composo(api_key="your-api-key-here")

# Example RAG conversation with retrieved context
messages = [
    {
        "role": "user", 
        "content": """What is the current population of Tokyo?

Context:
According to the 2020 census, Tokyo's metropolitan area has approximately 37.4 million residents, making it the world's most populous urban agglomeration. The Tokyo Metropolis itself has 14.0 million people."""
    },
    {
        "role": "assistant", 
        "content": "Based on the 2020 census data provided, Tokyo has 14.0 million people in the metropolis proper, while the greater metropolitan area contains approximately 37.4 million residents, making it the world's largest urban agglomeration."
    }
]

# Evaluate with the RAG framework
results = composo_client.evaluate(
    messages=messages,
    criteria=criteria.rag
)

for result in results:
    print(f"Score: {result.score}/1.00")
    print(f"Explanation: {result.explanation}\n")

Evaluating Retrieval Quality

Beyond evaluating the generated responses, you can also assess the quality of your retrieval system itself. This helps identify when your vector search or retrieval mechanism needs improvement before it impacts downstream generation.

How It Works

Treat your retrieval step as a “tool call” and evaluate whether the retrieved chunks are actually relevant to the user’s query. This gives you quantitative metrics on retrieval precision.

Implementation

Python
from composo import Composo

composo_client = Composo(api_key="your-api-key-here")

# User's question
user_query = "What is the current population of Tokyo?"

# Chunks retrieved by your RAG system
retrieved_chunks = """
Chunk 1: According to the 2020 census, Tokyo's metropolitan area has approximately 37.4 million residents.
Chunk 2: The Tokyo Metropolis itself has 14.0 million people.
Chunk 3: Population density in Tokyo is approximately 6,158 people per square kilometer.
"""

# Define the retrieval tool (for context)
tools = [
    {
        "type": "function",
        "function": {
            "name": "rag_retrieval",
            "description": "Retrieves relevant document chunks based on semantic search",
            "parameters": {"type": "object", "required": [], "properties": {}}
        }
    }
]

# Evaluate retrieval quality
result = composo_client.evaluate(
    messages=[
        {"role": "user", "content": user_query},
        {"role": "function", "name": "rag_retrieval", "content": retrieved_chunks}
    ],
    tools=tools,
    criteria="Reward tool calls that retrieve chunks directly relevant to answering the user's question"
)

print(f"Retrieval Quality Score: {result.score:.2f}/1.00")
# High scores (>0.8) indicate good retrieval
# Low scores (<0.6) suggest retrieval improvements needed
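
You can turn this score into an automated gate in your retrieval pipeline, for example by re-running retrieval with a broader query before generation sees weak context. The sketch below reuses the result from the example above; the threshold mirrors the guidance in the comments, and the fallback hook is a hypothetical function in your own pipeline, not part of the Composo API.
Python
# Illustrative gate: act on the retrieval-quality score before generation.
RETRIEVAL_SCORE_THRESHOLD = 0.6  # mirrors the "low score" guidance above

if result.score < RETRIEVAL_SCORE_THRESHOLD:
    # Hypothetical fallback in your own pipeline, e.g. query rewriting or a
    # broader search; rerun_retrieval is not a Composo function.
    # retrieved_chunks = rerun_retrieval(user_query)
    print("Retrieval below threshold; consider query rewriting or broader search.")
else:
    print("Retrieval quality acceptable; proceeding to generation.")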