Response evaluation encompasses three primary use cases:
- Response quality evaluation
- Response accuracy evaluation
- Response tool call usage evaluation
1) Response Quality Evaluation
Assesses the overall quality of responses based on user input and conversation history.
Evaluates: Response quality, tone, safety, adherence to guidelines, and any other custom criteria, which can be highly domain-specific.
Example criteria (one of these is used in the payload sketch after this list):
- Concise: “Reward responses that are clear and concise, avoiding unnecessary verbosity or repetition.”
- Information Structure: “Reward responses that present technical information in a logical, well-organized format that prioritizes the most important details.”
- Actionable Guidance: “Reward responses that provide practical next steps”
- Professional: “Reward responses that maintain appropriate professional tone and language suitable for the context”
- Harmful output: “Penalize responses that provide medical advice”
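For instance, an actionable-guidance check can be expressed as a request payload in the same format used by the examples below (a hypothetical sketch; the conversation content is illustrative, and the criterion is taken from the list above):
{
  "messages": [
    {"role": "user", "content": "My deployment keeps failing with a timeout error. What should I do?"},
    {"role": "assistant", "content": "Check the deployment logs for the exact timeout, increase the timeout value in your deployment configuration, and retry. If it still fails, verify network connectivity to the target environment."}
  ],
  "evaluation_criteria": "Reward responses that provide practical next steps"
}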
2) Response Accuracy Evaluation
Validates responses against retrieved contexts or source materials, typically for RAG systems.
Evaluates: Faithfulness to sources, completeness of information use, and proper citations
Implementation: Include retrieved contexts within the user message, then use criteria focused on:
- Faithfulness: “Reward responses that make only claims directly supported by the provided source material without any hallucination or speculation”
- Completeness: “Reward responses that comprehensively include all relevant information from the source material needed to fully answer the question”
- Precision: “Reward responses that include only information necessary to answer the question without extraneous details from the source material”
- Relevance: “Reward responses where all content directly addresses and is relevant to answering the user’s specific question”
- Refusals: “Reward responses that appropriately refuse to answer when the source material lacks sufficient information to address the question”
- Sources: “Reward responses that explicitly cite or reference the specific source documents or sections used to support each claim”
Example request:
import requests
url = "<https://platform.composo.ai/api/v1/evals/reward>"
headers = {
"API-Key": 'your-api-key-here'
}
# Example: Evaluating how well an LLM uses provided context
payload = {
"messages": [
{
"role": "user",
"content": """What is the current population of Tokyo?
Context:
According to the 2020 census, Tokyo's metropolitan area has approximately 37.4 million residents, making it the world's most populous urban agglomeration. The Tokyo Metropolis itself has 14.0 million people."""
},
{
"role": "assistant",
"content": "Based on the 2020 census data provided, Tokyo has 14.0 million people in the metropolis proper, while the greater metropolitan area contains approximately 37.4 million residents, making it the world's largest urban agglomeration."
}
],
"evaluation_criteria": "Reward responses that accurately use the provided context and cite specific data points"
}
response = requests.post(url, headers=headers, json=payload)
result = response.json()
print(f"Score: {result['score']}")
print(f"Explanation: {result.get('explanation', 'No feedback provided.')}")
3) Response Tool Call Usage Evaluation
Measures how effectively responses incorporate information from function/tool returns.
Evaluates: Accurate use of tool results, avoiding hallucination beyond returned data
Example format:
{
"messages": [
{"role": "system", "content": "You are a helpful AI assistant. Be concise and accurate."},
{"role": "user", "content": "What's the weather like in London today?"},
{"role": "assistant", "content": "I'll check the current weather in London for you."},
{"role": "function", "name": "get_weather", "content": "Current weather: 15°C, partly cloudy, light rain expected"},
{"role": "assistant", "content": "It's currently 15°C in London with partly cloudy skies. Light rain is expected later today."}
],
"evaluation_criteria": "Reward responses that only contain information supported by the provided function results"
}
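This payload can be scored with the same request pattern as the accuracy example above (a minimal sketch assuming the same /evals/reward endpoint, API-Key header, and score/explanation response fields; the conversation content is illustrative):
import requests

url = "https://platform.composo.ai/api/v1/evals/reward"
headers = {
    "API-Key": "your-api-key-here"
}

# Conversation including the function/tool return that the final answer must stay faithful to
payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful AI assistant. Be concise and accurate."},
        {"role": "user", "content": "What's the weather like in London today?"},
        {"role": "assistant", "content": "I'll check the current weather in London for you."},
        {"role": "function", "name": "get_weather", "content": "Current weather: 15°C, partly cloudy, light rain expected"},
        {"role": "assistant", "content": "It's currently 15°C in London with partly cloudy skies. Light rain is expected later today."}
    ],
    "evaluation_criteria": "Reward responses that only contain information supported by the provided function results"
}

response = requests.post(url, headers=headers, json=payload)
result = response.json()
print(f"Score: {result['score']}")
print(f"Explanation: {result.get('explanation', 'No feedback provided.')}")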