Ground truth evaluation allows you to measure how well your LLM outputs align with known correct answers. By dynamically inserting your validated labels into Composo’s evaluation criteria, you can create precise, case-specific evaluations.
We typically recommend using evaluation criteria / guidelines, such as those in our RAG framework, rather than rigid ground truths, since they're more flexible and don't require labeled data. However, ground truth evaluation works well when:
- You have an exact answer you need to match (calculations, specific classifications)
- You have existing labeled data from historical reviews
- You need to benchmark different models on the same validation set (see the sketch after this list)
- Compliance requires testing against specific approved responses
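The benchmarking scenario, for example, reduces to looping a labeled validation set through the same evaluation call used throughout this page. Below is a minimal sketch that assumes model outputs have already been collected; the dataset shape, field names, and model names are illustrative assumptions, not part of the Composo API.

```python
from composo import Composo

composo_client = Composo(api_key="YOUR_API_KEY")

# Illustrative labeled validation set (fields are assumptions for this sketch)
validation_set = [
    {
        "question": "What is the capital of France and what is it known for?",
        "ground_truth": "The capital of France is Paris, known for the Eiffel Tower and the Louvre.",
    },
    # ... more labeled examples
]

# Pre-collected responses per model, aligned index-by-index with validation_set
model_outputs = {
    "model_a": ["The capital of France is Paris, home to the Eiffel Tower and the Louvre."],
    "model_b": ["Paris is the capital of France."],
}

# Score every model against the same ground truth labels
for model_name, responses in model_outputs.items():
    scores = []
    for example, response in zip(validation_set, responses):
        result = composo_client.evaluate(
            messages=[
                {"role": "user", "content": example["question"]},
                {"role": "assistant", "content": response},
            ],
            criteria=f"Reward responses that closely match this expected answer: {example['ground_truth']}",
        )
        scores.append(result.score)
    print(f"{model_name}: mean score {sum(scores) / len(scores):.2f}")
```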
The key is dynamically inserting your ground truth labels directly into the evaluation criteria:
```python
from composo import Composo

composo_client = Composo(api_key="YOUR_API_KEY")

# Your ground truth answer from the dataset
ground_truth = "The capital of France is Paris, a city known for the Eiffel Tower, the Louvre Museum, and its historic architecture along the Seine River."

# Evaluate if the LLM's response matches the ground truth
result = composo_client.evaluate(
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France and what is it known for?"
        },
        {
            "role": "assistant",
            "content": "The capital of France is Paris. It's famous for iconic landmarks like the Eiffel Tower, world-class museums including the Louvre, and beautiful architecture along the Seine River."
        }
    ],
    criteria=f"Reward responses that closely match this expected answer: {ground_truth}"
)

print(f"Alignment Score: {result.score}")
print(f"Explanation: {result.explanation}\n")
```
The same pattern extends to other ground truth formats:

```python
# Entity extraction validation
ground_truth_entities = "Company: Acme Corp, Amount: $50,000, Date: March 2024"
criteria = f"Reward responses that extract all of these entities: {ground_truth_entities}"

# Validating specific decisions
ground_truth_decision = "Escalate to Level 2 Support"
criteria = f"Reward responses that make this decision: {ground_truth_decision}"

# Calculation or counting tasks
ground_truth_answer = "Total: $1,247.50"
criteria = f"Reward responses that arrive at the correct answer: {ground_truth_answer}"
```
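Any of these criteria plug straight into the same `evaluate` call shown above. Here is a minimal sketch using the entity extraction pattern; the user prompt and assistant response are made-up examples for illustration.

```python
# Illustrative only: the prompt and assistant response below are example data
ground_truth_entities = "Company: Acme Corp, Amount: $50,000, Date: March 2024"

result = composo_client.evaluate(
    messages=[
        {
            "role": "user",
            "content": "Extract the company, contract amount, and signing date from the agreement text."
        },
        {
            "role": "assistant",
            "content": "Company: Acme Corp; Amount: $50,000; Date: March 2024"
        }
    ],
    criteria=f"Reward responses that extract all of these entities: {ground_truth_entities}"
)

print(f"Score: {result.score}")
print(f"Explanation: {result.explanation}")
```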