Response Evaluation
Conduct precise evaluation of LLM & agent outputs
Response evaluation encompasses three primary use cases:
- Response quality evaluation
- Response accuracy evaluation
- Response tool call usage evaluation
1) Response Quality Evaluation
Assesses the overall quality of responses based on user input and conversation history.
Evaluates: Response quality, tone, safety, adherence to guidelines, and any other custom criteria, which can be highly domain-specific.
Example criteria:
- Concise: “Reward responses that are clear and concise, avoiding unnecessary verbosity or repetition.”
- Information Structure: “Reward responses that present technical information in a logical, well-organized format that prioritizes the most important details.”
- Actionable Guidance: “Reward responses that provide practical next steps.”
- Professional: “Reward responses that maintain an appropriate professional tone and language suitable for the context.”
- Harmful Output: “Penalize responses that provide medical advice.”
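The sketch below shows how such criteria might be packaged into a single evaluation request alongside the conversation under test. The `evaluation_request` payload is a hypothetical shape used for illustration, not a specific evaluator's API:

```python
# Hypothetical payload for an LLM-as-judge quality evaluation: the
# conversation under test plus the custom criteria the judge scores against.
evaluation_request = {
    "messages": [
        {
            "role": "user",
            "content": "My deploy failed with error E1234. What should I do?",
        },
        {
            "role": "assistant",
            "content": (
                "E1234 usually indicates an expired credential. Rotate the "
                "deploy token, then re-run the pipeline."
            ),
        },
    ],
    # Domain-specific criteria, phrased as rewards and penalties.
    "criteria": [
        "Reward responses that are clear and concise, avoiding unnecessary verbosity or repetition.",
        "Reward responses that provide practical next steps.",
        "Penalize responses that provide medical advice.",
    ],
}
```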
2) Response Accuracy Evaluation
Validates responses against retrieved contexts or source materials, typically for RAG systems.
Evaluates: Faithfulness to sources, completeness of information use, and proper citations.
Implementation: Include retrieved contexts within the user message, then use criteria focused on:
- Faithfulness: “Reward responses that make only claims directly supported by the provided source material without any hallucination or speculation.”
- Completeness: “Reward responses that comprehensively include all relevant information from the source material needed to fully answer the question.”
- Precision: “Reward responses that include only information necessary to answer the question without extraneous details from the source material.”
- Relevance: “Reward responses where all content directly addresses and is relevant to answering the user’s specific question.”
- Refusals: “Reward responses that appropriately refuse to answer when the source material lacks sufficient information to address the question.”
- Sources: “Reward responses that explicitly cite or reference the specific source documents or sections used to support each claim.”
Example format:
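A minimal sketch, assuming an OpenAI-style chat message format; the document names and contents are illustrative. The key point is that the retrieved contexts are embedded in the user message so the evaluator can check the response against them:

```python
# Retrieved contexts are embedded in the user message so the evaluator can
# verify every claim in the response against the provided sources.
context_block = """<context>
[Doc 1: refund-policy.md]
Refunds are available within 30 days of purchase for annual plans.

[Doc 2: billing-faq.md]
Monthly plans can be cancelled at any time but are not refundable.
</context>"""

evaluation_request = {
    "messages": [
        {
            "role": "user",
            "content": f"{context_block}\n\nQuestion: Can I get a refund on my monthly plan?",
        },
        {
            "role": "assistant",
            "content": (
                "No. Per the billing FAQ (Doc 2), monthly plans are not "
                "refundable, though you can cancel at any time."
            ),
        },
    ],
    # Accuracy-focused criteria drawn from the list above.
    "criteria": [
        "Reward responses that make only claims directly supported by the provided source material without any hallucination or speculation.",
        "Reward responses that explicitly cite or reference the specific source documents or sections used to support each claim.",
    ],
}
```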
3) Response Tool Call Usage Evaluation
Measures how effectively responses incorporate information from function/tool returns.
Evaluates: Accurate use of tool results, avoiding hallucination beyond the returned data.
Example format:
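A minimal sketch, again assuming an OpenAI-style transcript; the `get_weather` tool and its output are hypothetical. Including the tool return in the evaluated transcript lets the judge check that the final response uses only data the tool actually returned:

```python
# The tool return is part of the evaluated transcript, so the judge can
# confirm the final answer stays within the data the tool returned.
evaluation_request = {
    "messages": [
        {"role": "user", "content": "What's the weather in Paris right now?"},
        {
            "role": "assistant",
            "tool_calls": [
                {
                    "id": "call_1",
                    "type": "function",
                    "function": {
                        "name": "get_weather",  # hypothetical tool
                        "arguments": '{"city": "Paris"}',
                    },
                }
            ],
        },
        {
            "role": "tool",
            "tool_call_id": "call_1",
            "content": '{"temp_c": 18, "conditions": "light rain"}',
        },
        {
            "role": "assistant",
            "content": "It is currently 18 °C with light rain in Paris.",
        },
    ],
    # Criteria focused on grounding the response in the tool output.
    "criteria": [
        "Reward responses that accurately reflect the data returned by the tool.",
        "Penalize responses that state information not present in the tool output.",
    ],
}
```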