What is Ground Truth Evaluation?

Ground truth evaluation allows you to measure how well your LLM outputs align with known correct answers. By dynamically inserting your validated labels into Composo’s evaluation criteria, you can create precise, case-specific evaluations.

When to Use Ground Truth

We typically recommend using evaluation criteria / guidelines, such as those in our RAG framework, over rigid ground truths: they're more flexible and don't require labeled data. However, ground truth evaluation works well when:
  • You have an exact answer you need to match (calculations, specific classifications)
  • You have existing labeled data from historical reviews
  • You need to benchmark different models on the same validation set
  • Compliance requires testing against specific approved responses

How It Works

The key is to insert your ground truth labels directly into the evaluation criteria:
Python
from composo import Composo

composo_client = Composo(api_key="YOUR_API_KEY")

# Your ground truth answer from the dataset
ground_truth = "The capital of France is Paris, a city known for the Eiffel Tower, the Louvre Museum, and its historic architecture along the Seine River."

# Evaluate if the LLM's response matches the ground truth
result = composo_client.evaluate(
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France and what is it known for?"
        },
        {
            "role": "assistant",
            "content": "The capital of France is Paris. It's famous for iconic landmarks like the Eiffel Tower, world-class museums including the Louvre, and beautiful architecture along the Seine River."
        }
    ],
    criteria=f"Reward responses that closely match this expected answer: {ground_truth}"
)

print(f"Alignment Score: {result.score}")
print(f"Explanation: {result.explanation}\n")

Common Use Cases

Classification Tasks

Python
# Multi-class classification
ground_truth_category = "Technical Support"

criteria = f"Reward responses that correctly classify this inquiry as: {ground_truth_category}"

Extraction Tasks

Python
# Entity extraction validation
ground_truth_entities = "Company: Acme Corp, Amount: $50,000, Date: March 2024"

criteria = f"Reward responses that extract all of these entities: {ground_truth_entities}"

Decision Validation

Python
# Validating specific decisions
ground_truth_decision = "Escalate to Level 2 Support"

criteria = f"Reward responses that make this decision: {ground_truth_decision}"

Numerical Validation

Python
# Calculation or counting tasks
ground_truth_answer = "Total: $1,247.50"

criteria = f"Reward responses that arrive at the correct answer: {ground_truth_answer}"

Setting Thresholds

Different use cases require different accuracy thresholds:
  • High-stakes decisions (medical, financial): Consider scores ≥ 0.9 as passing
  • General classification: Scores ≥ 0.8 typically indicate good alignment
  • Exploratory analysis: Scores ≥ 0.7 may be acceptable initially
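Once you've chosen a cutoff, turning scores into a pass rate is a one-liner. This sketch uses illustrative score values and the high-stakes 0.9 cutoff from the list above:
Python
# Sketch: convert alignment scores into a pass rate with a chosen threshold
scores = [0.95, 0.88, 0.92, 0.76]  # illustrative values; use your own evaluation results

THRESHOLD = 0.9  # e.g. the high-stakes cutoff from the list above
pass_rate = sum(s >= THRESHOLD for s in scores) / len(scores)
print(f"Pass rate at {THRESHOLD}: {pass_rate:.0%}")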

Next Steps

  • If you have labeled data ready, try the patterns above
  • For more flexible evaluation without needing labels, explore custom criteria
  • See our criteria library for evaluation inspiration