What is Ground Truth Evaluation?
Ground truth evaluation allows you to measure how well your LLM outputs align with known correct answers. By dynamically inserting your validated labels into Composo’s evaluation criteria, you can create precise, case-specific evaluations.
When to Use Ground Truth
We typically recommend using evaluation criteria / guidelines such as those in our RAG framework rather than rigid ground truths, since they’re more flexible and don’t require labeled data. However, ground truth evaluation works well when:
- You have an exact answer you need to match (calculations, specific classifications)
- You have existing labeled data from historical reviews
- You need to benchmark different models on the same validation set
- Compliance requires testing against specific approved responses
How It Works
The key is dynamically inserting your ground truth labels directly into the evaluation criteria.
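Here is a minimal sketch of the pattern. The endpoint path, payload keys, and response shape in `evaluate_against_ground_truth` are assumptions for illustration only; check the Composo API reference for the exact request format. The part that matters is the f-string that injects your validated label into the criteria.

```python
import os
import requests

# NOTE: the endpoint path, payload keys, and response shape below are assumptions
# for illustration; consult Composo's API reference for the exact call.
COMPOSO_URL = "https://platform.composo.ai/api/v1/evals/reward"

def evaluate_against_ground_truth(messages: list[dict], criteria: str) -> float:
    """Send one LLM interaction plus criteria to Composo and return the score."""
    response = requests.post(
        COMPOSO_URL,
        headers={"API-Key": os.environ["COMPOSO_API_KEY"]},
        json={"messages": messages, "evaluation_criteria": criteria},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["score"]

# The key pattern: inject the validated label into the criteria with an f-string.
ground_truth = "billing_dispute"  # your validated label for this case

criteria = (
    f"Reward responses that classify the ticket as '{ground_truth}'. "
    "Penalize responses that assign any other category."
)

messages = [
    {"role": "user", "content": "I was charged twice for the same order."},
    {"role": "assistant", "content": "billing_dispute"},  # the LLM output under test
]

score = evaluate_against_ground_truth(messages, criteria)
print(f"Alignment with ground truth: {score:.2f}")
```

Because the criteria are built per case, every example in your validation set is scored against its own label rather than a generic rubric.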
Common Use Cases
Classification Tasks
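A sketch of scoring classification outputs against historical labels. `run_classifier` is a hypothetical stand-in for the model under test, and each `messages` / `criteria` pair would be sent to your Composo evaluation call as in the sketch above:

```python
# Illustrative labeled data from a historical review.
labeled_data = [
    {"ticket": "Please reset my password", "label": "account_access"},
    {"ticket": "I was charged twice this month", "label": "billing"},
]

def run_classifier(ticket: str) -> str:
    """Hypothetical stand-in for the LLM classifier under test."""
    return "billing"

eval_cases = []
for example in labeled_data:
    prediction = run_classifier(example["ticket"])
    # Inject the known label into the criteria for this specific case.
    criteria = (
        f"Reward responses that classify the ticket as '{example['label']}'. "
        "Penalize responses that assign any other category."
    )
    messages = [
        {"role": "user", "content": example["ticket"]},
        {"role": "assistant", "content": prediction},
    ]
    eval_cases.append((messages, criteria))

for messages, criteria in eval_cases:
    print(criteria)  # each case carries its own label-specific criteria
```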
Extraction Tasks
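A sketch of validating extracted fields against known values. The field names and document content are illustrative; pass `messages` and `criteria` to Composo as in the How It Works sketch:

```python
# Ground truth record for one document (illustrative values).
expected = {
    "invoice_number": "INV-2041",
    "total": "$1,250.00",
    "due_date": "2024-07-31",
}

# Build case-specific criteria from the ground truth record.
fields = ", ".join(f"{name} = {value}" for name, value in expected.items())
criteria = (
    f"Reward responses that extract exactly these values: {fields}. "
    "Penalize responses with missing, extra, or altered values."
)

messages = [
    {"role": "user", "content": "Extract the invoice fields from the attached document."},
    {"role": "assistant", "content": str(expected)},  # the LLM extraction under test
]

print(criteria)
```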
Decision Validation
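A sketch of checking a model's recommended action against a reviewer-approved decision from your historical data; the decision label and conversation are illustrative:

```python
# Decision approved in a past human review (illustrative).
approved_decision = "escalate_to_human"

criteria = (
    f"Reward responses whose recommended action is '{approved_decision}'. "
    "Penalize responses that recommend a different action or give no clear recommendation."
)

messages = [
    {"role": "user", "content": "The customer is threatening legal action over a denied refund."},
    {"role": "assistant", "content": "Recommended action: escalate_to_human"},
]

print(criteria)  # send messages + criteria to Composo as in the sketch above
```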
Numerical Validation
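A sketch of validating a calculated figure, with the expected value and an explicit tolerance written into the criteria; the numbers are illustrative:

```python
# Expected result of the calculation and the tolerance you accept (illustrative).
expected_total = 1437.50
tolerance = 0.01  # acceptable absolute difference, stated in the criteria

criteria = (
    f"Reward responses that state a total of {expected_total:.2f} "
    f"(within ±{tolerance} of that figure). "
    "Penalize any other figure or a missing total."
)

messages = [
    {"role": "user", "content": "What is the order total including tax?"},
    {"role": "assistant", "content": "The total including tax is 1437.50."},
]

print(criteria)  # send messages + criteria to Composo as in the sketch above
```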
Setting Thresholds
Different use cases require different accuracy thresholds:
- High-stakes decisions (medical, financial): Consider scores ≥ 0.9 as passing
- General classification: Scores ≥ 0.8 typically indicate good alignment
- Exploratory analysis: Scores ≥ 0.7 may be acceptable initially
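A small sketch of applying these thresholds to evaluation scores; the use-case names and the `passes` helper are illustrative, and the values mirror the guidance above:

```python
# Pass thresholds per use case (tune to your own risk tolerance).
THRESHOLDS = {
    "high_stakes": 0.9,
    "general_classification": 0.8,
    "exploratory": 0.7,
}

def passes(score: float, use_case: str) -> bool:
    """Return True if the score clears the threshold for the given use case."""
    return score >= THRESHOLDS[use_case]

print(passes(0.85, "general_classification"))  # True
print(passes(0.85, "high_stakes"))  # False
```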
Next Steps
- If you have labeled data ready, try the patterns above
- For more flexible evaluation without needing labels, explore custom criteria
- See our criteria library for evaluation inspiration