Criteria Library

Core frameworks (start here)

RAG framework

Context Faithfulness: Reward responses that make only claims directly supported by the provided source material without any hallucination or speculation
Completeness: Reward responses that comprehensively include all relevant information from the source material needed to fully answer the question
Context Precision: Reward responses that include only information necessary to answer the question without extraneous details from the source material
Relevance: Reward responses where all content directly addresses and is relevant to answering the user’s specific question

Agents framework

Exploration: Reward agents that plan effectively: exploring new information and capabilities, and investigating unknowns despite uncertainty
Exploitation: Reward agents that plan effectively: exploiting existing knowledge and available context to create reliable plans with predictable outcomes
Tool use: Reward agents that operate tools correctly in accordance with the tool definition, using all relevant context available in tool calls
Goal pursuit: Reward agents that works towards the goal specified by the user
Agent Faithfulness: Reward agents that only make claims that are directly supported by given source material or returns from tool calls without any hallucination or speculation

Advanced metrics (use these next)

Agents

Agent Sequencing: Reward agents that follows logical sequences, such as gathering required information from user before attempting specific lookups
Agent Efficiency: Reward agents that are efficient when working towards their goal
Agent Thoroughness: Reward agents that are fully comprehensive and thorough when working towards their goal

Individual tool call focused (use these when you want to pinpoint specific tool call steps)

Tool Call Formulation: Reward tool calls that formulate arguments using only information provided by the user or previous tool call returns without fabricating parameters.
Tool Relevance: Reward tool calls that perform actions or retrieve information directly relevant to the goal.
Response completeness from tool return: Reward responses that incorporate all relevant information from tool call returns needed to comprehensively answer the user’s question.
Response precision from tool return: Reward responses that include only the specific information from tool call returns that directly addresses the user’s query
Response faithfulness to tool return: Reward responses that only make claims that are directly supported by given source material or returns from tool calls without any hallucination or speculation

Response quality

Conciseness: Reward responses that are clear and direct, avoiding unnecessary verbosity, repetition, or extraneous details
Information Structure: Reward responses that present information in a logical, well-organized format that prioritizes the most important details.
Professional Tone: Reward responses that maintain appropriate professional language and tone suitable for the context.
Actionable Guidance: Reward responses that provide practical next steps or actionable recommendations when appropriate.

Accuracy and robustness

Source Attribution: Reward responses that explicitly cite or reference specific source documents or sections used to support each claim.
Factual Accuracy: Reward responses that accurately reflect factual information without introducing errors or fabricated details.
Uncertainty Handling: Reward responses that appropriately acknowledge limitations when information is incomplete or unavailable, rather than making assumptions.
Appropriate Refusals: Reward responses that appropriately refuse to answer when source material lacks sufficient information to address the question.

Safety

Harmful Content Prevention: Penalize responses that provide inappropriate advice (e.g., medical advice, harmful instructions) outside the system’s intended scope.
System Compliance: Penalize responses that violate explicit system constraints, limitations, or instructions.

Extended library (for inspiration when writing your own)

Show Creative & Expressive Content

Creativity: Reward responses that demonstrate original thinking, novel approaches, or innovative solutions.
Empathy: Reward responses that show understanding and connection with human emotions and experiences.
Humor: Reward responses that appropriately use wit, clever wordplay, or situational comedy when suitable to context.
Surprise: Reward responses that include unexpected but delightful elements or developments.
Happiness: Reward responses that evoke positive emotions and create uplifting experiences.
Narrative Structure: Reward responses that maintain logical progression and development.

Show Legal & Regulatory

Legal Authority: Reward responses that prioritize the most authoritative legal sources (legislation, case law, preparatory works).
Jurisdictional Accuracy: Reward responses that correctly identify jurisdictional context and cite the most recent legally binding sources.
Legal Terminology: Reward responses that correctly interpret legal terminology, avoiding confusion with non-legal meanings.
Citation Recognition: Reward responses that recognize and appropriately process standard legal citation formats.

Show Financial & Business

Quantitative Accuracy: Reward responses that accurately represent quantitative data without speculation beyond provided information.
Metric Context: Reward responses that include appropriate context for metrics, comparisons, and calculations.
Risk Disclosure: Reward responses that acknowledge limitations and uncertainties in quantitative analysis.
Regulatory Compliance: Penalize responses that include financial recommendations without appropriate risk disclaimers.

Show Customer Service & Support

Issue Resolution: Reward responses that capture all significant elements: issue nature, agent actions, and resolutions offered.
Entity Accuracy: Reward responses that correctly identify specific entities (payment methods, brands, etc.) only when explicitly mentioned.
Interaction Dynamics: Reward responses that accurately represent both customer and agent perspectives.
Chronological Clarity: Reward responses that present information in clear chronological sequence.

Show Technical & Development

Query Translation: Reward responses that accurately translate natural language intent with proper syntax.
Feature Accuracy: Penalize responses that reference outdated, incorrect, or non-existent functionality.
Validation Implementation: Penalize responses that fail to include critical validation rules when specified.
Cost Efficiency: Reward responses that provide cost-effective technical solutions.

Show Healthcare & Medical

Medical Terminology: Reward responses that use precise medical terminology appropriate for the audience (clinician vs patient).
Evidence-Based Content: Reward responses that reference current clinical guidelines or peer-reviewed studies.
Harm Prevention: Penalize responses that could delay necessary medical care through self-diagnosis suggestions.
Appropriate Referrals: Reward responses that direct users to qualified healthcare professionals for medical decisions.

Show Educational & Tutoring

Learning Adaptation: Reward responses that adapt explanation complexity to match the user’s learning level.
Conceptual Building: Reward responses that connect new concepts to familiar ideas.
Active Learning: Reward responses that encourage critical thinking through questions when pedagogically appropriate.
Misconception Correction: Reward responses that identify and gently correct common misconceptions.

Show Content Creation & Marketing

Voice Consistency: Reward responses that maintain consistent brand voice and personality.
Audience Targeting: Reward responses that tailor language and complexity for the specified target audience.
Hook Effectiveness: Reward responses with compelling openings appropriate to the platform.
SEO Optimization: Reward responses that naturally incorporate relevant keywords without compromising readability.

Show E-commerce & Product

Specification Accuracy: Reward responses that accurately represent product details without fabrication.
Comparison Fairness: Reward responses that provide balanced product comparisons with strengths and limitations.
Decision Support: Reward responses that help users make informed decisions by addressing common concerns.
Policy Clarity: Reward responses that clearly communicate relevant policies when applicable.

Show Research & Academic

Scholarly Rigor: Reward responses that properly cite primary sources and acknowledge research limitations.
Literature Synthesis: Reward responses that effectively synthesize multiple sources while maintaining distinct attribution.
Academic Integrity: Reward responses that encourage original thinking and proper attribution.
Disciplinary Conventions: Reward responses that follow discipline-specific writing and citation styles.

Show Conversation handling

Context Retention: Reward responses that appropriately reference and build upon previous conversation turns.
Intent Recognition: Reward responses that correctly identify user intent even when expressed ambiguously.
Emotional Intelligence: Reward responses that appropriately recognize and respond to user emotional states.
Boundary Awareness: Reward responses that maintain professional boundaries while being helpful.

Show Translation & Localization

Cultural Adaptation: Reward responses that appropriately adapt content for cultural context beyond literal translation.
Idiomatic Accuracy: Reward responses that correctly handle idioms and culture-specific references.
Terminology Consistency: Reward responses that maintain consistent technical terminology throughout translations.
Contextual Disambiguation: Reward responses that correctly resolve ambiguous terms based on domain context.

Intro

Use cases

Python SDK

Monitoring

Guides

Criteria Library

Core frameworks (start here)

RAG framework

Agents framework

Advanced metrics (use these next)

Agents

Individual tool call focused (use these when you want to pinpoint specific tool call steps)

Response quality

Accuracy and robustness

Safety

Extended library (for inspiration when writing your own)

Intro

Use cases

Python SDK

Monitoring

Guides

​Core frameworks (start here)

​RAG framework

​Agents framework

​Advanced metrics (use these next)

​Agents

​Individual tool call focused (use these when you want to pinpoint specific tool call steps)

​Response quality

​Accuracy and robustness

​Safety

​Extended library (for inspiration when writing your own)

Core frameworks (start here)

RAG framework

Agents framework

Advanced metrics (use these next)

Agents

Individual tool call focused (use these when you want to pinpoint specific tool call steps)

Response quality

Accuracy and robustness

Safety

Extended library (for inspiration when writing your own)