When crafting your evaluation criteria, consider the following guidelines to ensure effective and meaningful assessments:
Be Specific and Focused: Clearly define the quality or behavior you want to evaluate. Avoid vague statements. Focus on a single aspect per criterion to maintain clarity.
Use Clear Direction: Begin your criteria with an explicit directive such as "Reward responses that..." or "Penalize responses where...". Example: "Reward responses that use empathetic language when addressing user concerns."
Monotonic or Appropriately Qualified Qualities: Ideally, the quality you’re assessing should be monotonic, i.e. more of the quality is better (for rewards) or worse (for penalties). However, when dealing with non-monotonic qualities (where more is not always better), use qualifiers such as “appropriate” to ensure that higher scores represent better adherence to the desired quality.
"Reward responses that are polite"
which can become excessive, use "Reward responses that use an appropriate level of politeness"
ensuring that the response is polite but not overly so.Avoid Conjunctions: Focus on one quality at a time. Using conjunctions like “and” might indicate multiple qualities, which can lead to poorly defined behavior if one of the two qualities is poorly adhered to.
"The assistant should be concise and informative"
split into separate criteria.Avoid LLM Keywords: Composo’s reward model is finetuned from LLM models trained in conversation format. Avoid alternate definitions of ‘User’ and ‘Assistant’ that might conflict with LLM keywords ‘user’ and ‘assistant’
"Reward responses that comprehensively address the User Question"
, rename the ‘User Question’ in your prompt and consider "Reward responses that comprehensively address the Target Question"
Domain-Specific: Domain expertise is your secret weapon in getting good evaluation quality. Injecting your own domain knowledge and understanding of what a ‘good’ answer is improves the leverage your evaluation model has over the generative model.
Qualifiers (Optional): If the criterion applies only to certain situations, include a qualifier starting with “if” to specify when it should be applied.
"Reward responses that provide code examples if the user asks for implementation details"
Example Clauses and Recommendations for Improvements
OK. Clarification about ‘correct’ would be useful: does it have to be factually correct, or only in agreement with the provided context?
Poor. Clauses that change the definition of user and assistant from the LLM definition risk confusion.
Good. ‘Properly’ is slightly ambiguous and conflates two concepts: citation style and accuracy.
Good. Could be made better by clarifying what the agent might be guessing at.
Poor. Clauses that change the definition of user and assistant from the LLM definition risk confusion.
Excellent. It’s clear what format we’re looking for and what kind of information that applies to.
OK. Somewhat ambiguous about what should be supported by the context: is it the next steps, or whether the question is relevant to the context?
Excellent. It’s clear what the expected input is and what the model should be doing.
Excellent. It’s clear that we’re trying to avoid fabricating names of specific entities and the examples make it even clearer.
Excellent. It’s clear that we’re looking for good coverage of the important elements in the response.
Excellent. A clear requirement for chronological presentation of the information in the support interaction.
Excellent. It’s clear that we’re trying to avoid verbose summary content that isn’t clearly derived from the provided ticket.
OK. The concern with this is that there are two separate criteria, and the behaviour will be poorly conditioned when the response does one but not the other, e.g. demonstrates empathy but doesn’t acknowledge the friend’s feelings. Changing to ‘and’ would be better, but two criteria would be even better.
OK. The model is specifically trained to recognise ‘if’ statements, so we’d recommend changing ‘when’ to ‘if’.
Excellent. A clear requirement for tone, with additional helpful context about why it’s needed.
Components: an explicit directive ("Reward responses that..." or "Penalize responses where..."), the single quality or behavior being assessed, and an optional "if" qualifier.
Example Criteria:
"Reward responses that provide a comprehensive analysis of the code snippet"
"Penalize responses where the language is overly technical if the response is for a beginner"
"Reward responses that use an appropriate level of politeness"