When crafting your evaluation criteria, consider the following guidelines to ensure effective and meaningful assessments:
Be Specific and Focused: Clearly define the quality or behavior you want to evaluate. Avoid vague statements. Focus on a single aspect per criterion to maintain clarity.
  • Example: Instead of “good,” use “a friendly and encouraging tone.”
Use Clear Direction: Begin your criteria with an explicit directive such as "Reward responses that...", "Penalize responses that...", "Reward tool calls...", "Reward agents that...".
  • Example: "Reward responses that use empathetic language when addressing user concerns."
Monotonic or Appropriately Qualified Qualities: Ideally, the quality you’re assessing should be monotonic (more is always better for rewards, worse for penalties). For non-monotonic qualities where balance matters, use qualifiers like “appropriate” to ensure higher scores represent better adherence.
  • Example: Instead of "Reward responses that are polite" which can become excessive, use "Reward responses that use an appropriate level of politeness" ensuring the response is polite but not overly so.
Avoid Conjunctions: Focus on one quality at a time. Using “and” often indicates multiple qualities, which can lead to unclear scoring when only one quality is present.
  • Example: Instead of "The assistant should be concise and informative" split into two separate criteria.
Avoid LLM Keywords: Composo’s reward model is finetuned from LLMs trained in conversation format. Avoid alternate definitions of ‘User’ and ‘Assistant’ that might conflict with the LLM keywords ‘user’ and ‘assistant’.
  • Example: Instead of "Reward responses that comprehensively address the User Question", rename the ‘User Question’ in your prompt and use "Reward responses that comprehensively address the Target Question"
Leverage Domain Expertise: Your domain knowledge is your secret weapon. Inject your understanding of what constitutes a ‘good’ answer in your specific field—this gives your evaluation model leverage over the generative model.
  • Example: For medical contexts: "Reward responses that distinguish between emergency symptoms requiring immediate care versus symptoms suitable for routine appointments"
Use Qualifiers When Needed: Include a qualifier starting with “if” to specify when the criterion should apply. This helps handle conditional requirements.
  • Example: "Reward responses that provide code examples if the user asks for implementation details"
Keep Criteria Concise: Aim for one clear sentence per criterion. If you need multiple sentences to explain, consider splitting into separate criteria.
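As a rough illustration (not part of the Composo SDK), the sketch below lints a draft criterion against a few of the guidelines above; the helper name, prefix list, and checks are illustrative assumptions, not an official validator.

```python
# Hypothetical helper -- illustrative only, not Composo code -- that flags draft
# criteria which break some of the guidelines above.
import re

APPROVED_PREFIXES = (
    "Reward responses that", "Penalize responses that",
    "Reward tool calls that", "Penalize tool calls that",
    "Reward agents that", "Penalize agents that",
)

def lint_criterion(criterion: str) -> list[str]:
    """Return guideline warnings for a draft 0-1 reward criterion."""
    warnings = []
    if not criterion.startswith(APPROVED_PREFIXES):
        warnings.append("Start with an explicit directive, e.g. 'Reward responses that...'.")
    if " and " in criterion.lower():
        warnings.append("Possible conjunction: consider splitting into separate criteria.")
    if re.search(r"\bwhen\b", criterion, flags=re.IGNORECASE):
        warnings.append("Prefer an 'if' qualifier over 'when' for conditional requirements.")
    if re.search(r"\b(User|Assistant) [A-Z]", criterion):
        warnings.append("Avoid redefining the LLM keywords 'user'/'assistant' (e.g. 'User Question').")
    return warnings

print(lint_criterion("The assistant should be concise and informative"))
# -> warnings about the missing directive prefix and the 'and' conjunction
```

The example criteria below, each followed by a short assessment, show these guidelines applied in practice.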

Reward responses that provide correct information based solely on the provided context without fabricating details.

OK. Clarification about ‘correct’ would be useful—does it have to be factually correct, or only in agreement with the provided context?

Reward responses that directly address the ‘User Question’ without including irrelevant information.

Poor. Phrases like ‘User Question’ redefine ‘user’ away from its LLM keyword meaning, which risks confusing the evaluation model.

Reward responses that properly cite the specific source of information from the provided context.

Good. ‘Properly’ is slightly ambiguous, since it conflates citation style with citation accuracy.

Reward responses that appropriately acknowledge limitations if information is incomplete or unavailable rather than guessing.

Good. Could be improved by clarifying what the agent might be guessing at.

Reward responses that comprehensively address all aspects of the ‘User Question’ if information is available in the context.

Poor. As above, redefining ‘user’ via ‘User Question’ conflicts with the LLM keyword and risks confusion.

Reward responses that present technical information in a logical, well-organized format that prioritizes the most important details.

Excellent. It’s clear what format we’re looking for and what kind of information that applies to.

Reward responses that provide practical next steps or recommendations if appropriate and supported by the context.

OK. Somewhat ambiguous about what should be supported by the context: the next steps themselves, or the judgment that providing them is appropriate?

Reward responses that strictly include only information explicitly stated in the support ticket, without adding any fabricated details or assumptions.

Excellent. It’s clear what the expected input is and what the model should be doing.

Reward responses that correctly identify and include specific entities (payment methods, product categories, brands, couriers) only when explicitly mentioned in the ticket, avoiding hallucinations of these elements.

Excellent. It’s clear that we’re trying to avoid fabricating names of specific entities and the examples make it even clearer.

Reward responses that include all significant elements of the support ticket, including the nature of the issue, agent actions, and resolutions offered, without omitting key details.

Excellent. It’s clear that we’re looking for good coverage of the important elements in the response.

Reward responses that present the information in a clear chronological sequence that accurately reflects the flow of the support interaction.

Excellent. A clear requirement for chronological presentation of the information in the support interaction.

Penalize responses that include unnecessary concluding statements, evaluative summaries, or editorial comments not derived from the ticket content.

Excellent. It’s clear that we’re trying to avoid verbose summary content that isn’t clearly derived from the provided ticket.

Reward responses that demonstrate empathy while acknowledging the friend’s feelings of defeat without minimizing them.

OK. This contains two separate qualities, which could lead to unclear scoring when the response demonstrates one but not the other. Consider splitting it into two separate criteria.

Reward responses that explain ethical concerns when declining harmful requests rather than simply refusing without context.

OK. The model is specifically trained to recognize ‘if’ statements, so we’d recommend changing ‘when’ to ‘if’.

Reward responses that maintain an appropriate educational tone suitable for academic assessment contexts.

Excellent. A clear requirement for a tone with additional helpful context about why it’s needed.
Criteria follow the structure: [Prefix] [quality] [qualifier (optional)].
Components:
  • Prefix:
    • For 0-1 Reward Scoring: “Reward responses that”, “Penalize responses that”, “Reward tool calls that”, “Penalize tool calls that”, “Reward agents that”, “Penalize agents that”
    • For Binary Evaluation: “Response passes if”, “Response fails if”, “Tool call passes if”, “Tool call fails if”, “Agent passes if”, “Agent fails if”
  • Quality: The specific property or behavior to evaluate.
  • Qualifier (Optional): An “if” statement specifying conditions.
Example Criteria:
  • "Reward responses that provide a comprehensive analysis of the code snippet"
  • "Penalize responses where the language is overly technical if the response is for a beginner"
  • "Reward responses that use an appropriate level of politeness"
  • "Reward agents that explore new information and capabilities despite uncertainty"
  • "Tool call passes if all required parameters are provided without fabrication"