What is Ground Truth Evaluation?

Ground truth evaluation allows you to measure how well your LLM outputs align with known correct answers. By dynamically inserting your validated labels into Composo’s evaluation criteria, you can create precise, case-specific evaluations.

When to Use Ground Truth

We typically recommend using evaluation criteria / guidelines, such as those in our RAG framework, over rigid ground truths: they're more flexible and don't require labeled data. However, ground truth evaluation works well when:
  • You have an exact answer you need to match (calculations, specific classifications)
  • You have existing labeled data from historical reviews
  • You need to benchmark different models on the same validation set
  • Compliance requires testing against specific approved responses

How It Works

The key is to insert your ground truth labels directly into the evaluation criteria:
Python
from composo import Composo

composo_client = Composo(api_key="YOUR_API_KEY")

# Your ground truth answer from the dataset
ground_truth = "The capital of France is Paris, a city known for the Eiffel Tower, the Louvre Museum, and its historic architecture along the Seine River."

# Evaluate if the LLM's response matches the ground truth
result = composo_client.evaluate(
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France and what is it known for?"
        },
        {
            "role": "assistant",
            "content": "The capital of France is Paris. It's famous for iconic landmarks like the Eiffel Tower, world-class museums including the Louvre, and beautiful architecture along the Seine River."
        }
    ],
    criteria=f"Reward responses that closely match this expected answer: {ground_truth}"
)

print(f"Alignment Score: {result.score}")
print(f"Explanation: {result.explanation}\n")

Common Use Cases

Classification Tasks

Python
# Multi-class classification
ground_truth_category = "Technical Support"

criteria = f"Reward responses that correctly classify this inquiry as: {ground_truth_category}"

Extraction Tasks

Python
# Entity extraction validation
ground_truth_entities = "Company: Acme Corp, Amount: $50,000, Date: March 2024"

criteria = f"Reward responses that extract all of these entities: {ground_truth_entities}"

Decision Validation

Python
# Validating specific decisions
ground_truth_decision = "Escalate to Level 2 Support"

criteria = f"Reward responses that make this decision: {ground_truth_decision}"

Numerical Validation

Python
# Calculation or counting tasks
ground_truth_answer = "Total: $1,247.50"

criteria = f"Reward responses that arrive at the correct answer: {ground_truth_answer}"

Setting Thresholds

Different use cases require different accuracy thresholds:
  • High-stakes decisions (medical, financial): Consider scores ≥ 0.9 as passing
  • General classification: Scores ≥ 0.8 typically indicate good alignment
  • Exploratory analysis: Scores ≥ 0.7 may be acceptable initially
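Once you've chosen a cutoff, turning scores into a pass rate is a one-liner. This sketch uses illustrative score values and the high-stakes 0.9 cutoff from the list above:
Python
# Sketch: convert alignment scores into a pass rate with a chosen threshold
scores = [0.95, 0.88, 0.92, 0.76]  # illustrative values; use your own evaluation results

THRESHOLD = 0.9  # e.g. the high-stakes cutoff from the list above
pass_rate = sum(s >= THRESHOLD for s in scores) / len(scores)
print(f"Pass rate at {THRESHOLD}: {pass_rate:.0%}")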

Next Steps

  • If you have labeled data ready, try the patterns above
  • For more flexible evaluation without needing labels, explore custom criteria
  • See our criteria library for evaluation inspiration