Retrieval-Augmented Generation (RAG) systems are only as good as their ability to accurately use retrieved information. Poor RAG performance leads to hallucinations, incomplete answers, and loss of user trust. Composo’s RAG framework provides comprehensive evaluation across the critical dimensions of RAG quality.
Our framework, developed through extensive R&D and rigorously tested with Fortune 500 companies and leading AI teams, delivers 92% accuracy in detecting hallucinations and faithfulness violations—far exceeding the ~70% accuracy of LLM-as-judge approaches.
- 18 months of research refining the optimal RAG evaluation criteria
- Battle-tested across hundreds of production RAG systems, including critical hallucination detection in regulated industries
- 92% agreement with expert human evaluators on RAG quality assessment
- 70% reduction in error rate compared to traditional LLM-as-judge methods
This isn’t just another evaluation tool—it’s the result of deep collaboration with industry leaders who needed evaluation that actually works for production RAG systems handling millions of queries daily.
- 📖 **Context Faithfulness**: "Reward responses that make only claims directly supported by the provided source material without any hallucination or speculation"
- ✅ **Completeness**: "Reward responses that comprehensively include all relevant information from the source material needed to fully answer the question"
- 🎯 **Context Precision**: "Reward responses that include only information necessary to answer the question without extraneous details from the source material"
- 🔍 **Relevance**: "Reward responses where all content directly addresses and is relevant to answering the user's specific question"
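Each criterion is a natural-language reward statement, and you can also score a response against a single one rather than the full bundle. The sketch below assumes `evaluate` accepts a criterion string directly, as the retrieval example further down this page does; the messages are illustrative.

```python
from composo import Composo

composo_client = Composo(api_key="your-api-key-here")

# Score a single response against just the Context Faithfulness criterion.
# Passing a criterion string follows the retrieval example later on this page.
faithfulness_criterion = (
    "Reward responses that make only claims directly supported by the provided "
    "source material without any hallucination or speculation"
)

result = composo_client.evaluate(
    messages=[
        {
            "role": "user",
            "content": "What is the current population of Tokyo?\n\nContext:\nThe Tokyo Metropolis itself has 14.0 million people."
        },
        {
            "role": "assistant",
            "content": "According to the provided context, the Tokyo Metropolis has 14.0 million people."
        }
    ],
    criteria=faithfulness_criterion
)

print(f"Faithfulness Score: {result.score:.2f}/1.00")
```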
Here’s how to evaluate a RAG system’s performance using our framework:
```python
from composo import Composo, criteria

composo_client = Composo(api_key="your-api-key-here")

# Example RAG conversation with retrieved context
messages = [
    {
        "role": "user",
        "content": """What is the current population of Tokyo?

Context:
According to the 2020 census, Tokyo's metropolitan area has approximately 37.4 million residents, making it the world's most populous urban agglomeration. The Tokyo Metropolis itself has 14.0 million people."""
    },
    {
        "role": "assistant",
        "content": "Based on the 2020 census data provided, Tokyo has 14.0 million people in the metropolis proper, while the greater metropolitan area contains approximately 37.4 million residents, making it the world's largest urban agglomeration."
    }
]

# Evaluate with the RAG framework
results = composo_client.evaluate(
    messages=messages,
    criteria=criteria.rag
)

for result in results:
    print(f"Score: {result.score}/1.00")
    print(f"Explanation: {result.explanation}\n")
```
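In a CI or regression suite you will usually want to turn these scores into a pass/fail signal. Here is a minimal sketch reusing the `results` list from the example above; the 0.8 cutoff is an illustrative choice (in line with the score bands noted in the retrieval example below), not a built-in default.

```python
# Flag any RAG criterion that scores below a chosen threshold.
# The 0.8 cutoff is illustrative, not a library default.
THRESHOLD = 0.8

failures = [r for r in results if r.score < THRESHOLD]
if failures:
    for r in failures:
        print(f"FAIL ({r.score:.2f}): {r.explanation}")
else:
    print("All RAG criteria passed.")
```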
Beyond evaluating the generated responses, you can also assess the quality of your retrieval system itself. This helps identify when your vector search or retrieval mechanism needs improvement before it impacts downstream generation.
Treat your retrieval step as a “tool call” and evaluate whether the retrieved chunks are actually relevant to the user’s query. This gives you quantitative metrics on retrieval precision.
```python
from composo import Composo

composo_client = Composo(api_key="your-api-key-here")

# User's question
user_query = "What is the current population of Tokyo?"

# Chunks retrieved by your RAG system
retrieved_chunks = """Chunk 1: According to the 2020 census, Tokyo's metropolitan area has approximately 37.4 million residents.

Chunk 2: The Tokyo Metropolis itself has 14.0 million people.

Chunk 3: Population density in Tokyo is approximately 6,158 people per square kilometer."""

# Define the retrieval tool (for context)
tools = [
    {
        "type": "function",
        "function": {
            "name": "rag_retrieval",
            "description": "Retrieves relevant document chunks based on semantic search",
            "parameters": {"type": "object", "required": [], "properties": {}}
        }
    }
]

# Evaluate retrieval quality
result = composo_client.evaluate(
    messages=[
        {"role": "user", "content": user_query},
        {"role": "function", "name": "rag_retrieval", "content": retrieved_chunks}
    ],
    tools=tools,
    criteria="Reward tool calls that retrieve chunks directly relevant to answering the user's question"
)

print(f"Retrieval Quality Score: {result.score:.2f}/1.00")

# High scores (>0.8) indicate good retrieval
# Low scores (<0.6) suggest retrieval improvements needed
```
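To monitor a pipeline end to end, you can run both checks on the same query: score the retrieved chunks first, then score the generated answer against the full RAG framework. The sketch below combines the two calls shown above into one helper; `evaluate_rag_turn` and its arguments are hypothetical names, and the tool definition is repeated from the previous example so the snippet stands alone.

```python
from composo import Composo, criteria

composo_client = Composo(api_key="your-api-key-here")

# Same retrieval tool definition as in the example above
tools = [{"type": "function", "function": {
    "name": "rag_retrieval",
    "description": "Retrieves relevant document chunks based on semantic search",
    "parameters": {"type": "object", "required": [], "properties": {}}
}}]


def evaluate_rag_turn(user_query: str, retrieved_chunks: str, answer: str) -> dict:
    """Score one RAG turn: retrieval relevance plus the full generation criteria."""
    # 1. Retrieval quality, using the tool-call pattern shown above
    retrieval_result = composo_client.evaluate(
        messages=[
            {"role": "user", "content": user_query},
            {"role": "function", "name": "rag_retrieval", "content": retrieved_chunks}
        ],
        tools=tools,
        criteria="Reward tool calls that retrieve chunks directly relevant to answering the user's question"
    )

    # 2. Generation quality against the full RAG framework
    generation_results = composo_client.evaluate(
        messages=[
            {"role": "user", "content": f"{user_query}\n\nContext:\n{retrieved_chunks}"},
            {"role": "assistant", "content": answer}
        ],
        criteria=criteria.rag
    )

    return {
        "retrieval_score": retrieval_result.score,
        "generation_scores": [r.score for r in generation_results]
    }
```

Logging both numbers per query makes it easier to tell whether a weak answer came from poor retrieval or from the generation step.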