Why RAG Evaluation Matters
Retrieval-Augmented Generation (RAG) systems are only as good as their ability to accurately use retrieved information. Poor RAG performance leads to hallucinations, incomplete answers, and loss of user trust. Composo’s RAG framework provides comprehensive evaluation across the critical dimensions of RAG quality.

The Composo RAG Framework
Our framework, developed through extensive R&D and rigorously tested with Fortune 500 companies and leading AI teams, delivers 92% accuracy in detecting hallucinations and faithfulness violations, far exceeding the ~70% accuracy of LLM-as-judge approaches.

Proven Performance
- 18 months of research refining the optimal RAG evaluation criteria
- Battle-tested across hundreds of production RAG systems, including critical hallucination detection in regulated industries
- 92% agreement with expert human evaluators on RAG quality assessment
- 70% reduction in error rate compared to traditional LLM-as-judge methods
Core RAG Metrics
- 📖 Context Faithfulness: “Reward responses that make only claims directly supported by the provided source material without any hallucination or speculation”
- ✅ Completeness: “Reward responses that comprehensively include all relevant information from the source material needed to fully answer the question”
- 🎯 Context Precision: “Reward responses that include only information necessary to answer the question without extraneous details from the source material”
- 🔍 Relevance: “Reward responses where all content directly addresses and is relevant to answering the user’s specific question”

Implementation Example
Our SDK now provides independent criteria variables for RAG evaluation, making it easier to use specific criteria or create custom combinations. Each criterion is defined as a separate variable with clear, focused descriptions.

RAG Evaluation Criteria:

rag_faithfulness
- Reward responses that make only claims directly supported by the provided source material without any hallucination or speculation

rag_completeness
- Reward responses that comprehensively include all relevant information from the source material needed to fully answer the question

rag_precision
- Reward responses that include only information necessary to answer the question without extraneous details from the source material

rag_relevance
- Reward responses where all content directly addresses and is relevant to answering the user’s specific question
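The example below is a minimal sketch of how these criteria can be passed to an evaluation call. The criterion strings mirror the definitions above; the composo import path, the Client class, the evaluate method and its messages/evaluation_criteria parameters, and the result object are illustrative assumptions rather than the exact SDK interface, so check the SDK reference for the precise names.

```python
# Minimal sketch: the criterion strings mirror the definitions above.
# NOTE: the import path, Client class, evaluate() signature, and result
# object below are illustrative assumptions and may differ from the
# actual Composo SDK.
from composo import Client  # assumed import path

# Independent criteria variables for RAG evaluation
rag_faithfulness = (
    "Reward responses that make only claims directly supported by the "
    "provided source material without any hallucination or speculation"
)
rag_completeness = (
    "Reward responses that comprehensively include all relevant information "
    "from the source material needed to fully answer the question"
)
rag_precision = (
    "Reward responses that include only information necessary to answer the "
    "question without extraneous details from the source material"
)
rag_relevance = (
    "Reward responses where all content directly addresses and is relevant "
    "to answering the user's specific question"
)

client = Client(api_key="YOUR_API_KEY")

# Include the retrieved chunks in the conversation so the evaluator can
# check the response against the source material.
retrieved_context = (
    "Returns policy: items damaged in transit may be returned within 30 days "
    "for a full refund. Gift cards are non-refundable."
)
messages = [
    {
        "role": "user",
        "content": f"Context:\n{retrieved_context}\n\n"
                   "Question: What is our refund policy for damaged items?",
    },
    {
        "role": "assistant",
        "content": "Items damaged in transit can be returned within 30 days "
                   "for a full refund.",
    },
]

# Score the response against each criterion independently, or pass any
# custom combination that matters for your application.
for criterion in (rag_faithfulness, rag_completeness, rag_precision, rag_relevance):
    result = client.evaluate(messages=messages, evaluation_criteria=criterion)
    print(f"{criterion[:40]}... -> {result.score}")
```

Because each criterion is an independent variable, you can score a single dimension (for example, only rag_faithfulness for hallucination detection) or combine several, as in the loop above.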
Evaluating Retrieval Quality
Beyond evaluating the generated responses, you can also assess the quality of your retrieval system itself. This helps identify when your vector search or retrieval mechanism needs improvement before it impacts downstream generation.

How It Works
Treat your retrieval step as a “tool call” and evaluate whether the retrieved chunks are actually relevant to the user’s query. This gives you quantitative metrics on retrieval precision.

Implementation
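A hedged sketch of this pattern is shown below: the retrieval step is represented as an OpenAI-style tool call whose result contains the retrieved chunks, and the evaluation criterion targets the retrieval output rather than the final answer. The message schema, the example criterion wording, and the composo Client/evaluate call are illustrative assumptions, not the exact SDK interface.

```python
# Sketch: treat retrieval as a tool call and score whether the retrieved
# chunks are relevant to the user's query.
# NOTE: the client import/method, the message schema, and the criterion
# wording are illustrative assumptions and may differ from the actual
# Composo SDK and docs.
import json

from composo import Client  # assumed import path

client = Client(api_key="YOUR_API_KEY")

user_query = "What is our refund policy for damaged items?"
retrieved_chunks = [
    "Items damaged in transit may be returned within 30 days for a full refund.",
    "Gift cards are non-refundable.",
]

# Represent the retrieval step as an OpenAI-style tool call plus its result,
# so the evaluator judges the retrieved chunks directly.
messages = [
    {"role": "user", "content": user_query},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": "call_1",
                "type": "function",
                "function": {
                    "name": "retrieve_documents",
                    "arguments": json.dumps({"query": user_query}),
                },
            }
        ],
    },
    {
        "role": "tool",
        "tool_call_id": "call_1",
        "content": json.dumps(retrieved_chunks),
    },
]

# Example criterion aimed at the retrieval step rather than the final answer.
retrieval_relevance = (
    "Reward tool call results where the retrieved chunks are directly "
    "relevant to answering the user's query"
)

result = client.evaluate(messages=messages, evaluation_criteria=retrieval_relevance)
print("Retrieval relevance score:", result.score)
```

Running this over a sample of production queries gives you a quantitative view of retrieval precision that you can track separately from generation quality.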