Continuous Evaluation provides nuanced, hyper-personalized scoring for your LLM outputs based on custom criteria.

When to Use Continuous Evaluation

Use Continuous Evaluation when you need fine-grained assessments of responses based on complex, subjective criteria. This method is ideal for:

  • Optimizing model outputs during development to align with desired qualities such as tone, style, or adherence to brand guidelines.
  • Tailoring responses to match specific user preferences or brand voices.
  • Continuously monitoring model performance in production environments.

Guidelines for Writing Evaluation Criteria

When crafting your evaluation criteria, consider the following guidelines to ensure effective and meaningful assessments:

  • Be Specific and Focused: Clearly define the quality or behavior you want to evaluate. Avoid vague statements. Focus on a single aspect per criterion to maintain clarity.

    • Example: Instead of “The assistant should be good,” use “The assistant should provide responses in a friendly and encouraging tone.”
  • Use Clear Direction: Begin your criteria with an explicit directive such as “Reward responses that…” or “Penalize responses where…”.

    • Example: “Reward responses that use empathetic language when addressing user concerns.”
  • Monotonic or Appropriately Qualified Qualities: Ideally, the quality you’re assessing should be monotonic—more of the quality is better (for rewards) or worse (for penalties). However, when dealing with non-monotonic qualities (where more is not always better), use qualifiers such as “appropriate” to ensure that higher scores represent better adherence to the desired quality.

    • Example: Instead of “Reward responses that are polite,” which can become excessive, use “Reward responses that use an appropriate level of politeness,” ensuring that the response is polite but not overly so.
  • Avoid Conjunctions: Focus on one quality at a time. Using conjunctions like “and” might indicate multiple qualities, which can dilute the evaluation.

    • Example: Instead of “The assistant should be concise and informative,” split it into two criteria, e.g. “The assistant should be concise” and “The assistant should be informative.”
  • Qualifiers (Optional): If the criterion applies only to certain situations, include a qualifier starting with “if” to specify when it should be applied.

    • Example: “Reward responses that provide code examples if the user asks for implementation details.”
  • Achievability: Ensure that the criteria are achievable and realistic for the assistant to meet.

  • Domain-Specific: Tailor the criteria to the specific domain or context of your application for more relevant evaluations.

Template for Crafting Criteria:

[Direction] responses [quality] [qualifier (optional)].

Components:

  • Direction: “Reward” or “Penalize”.
  • Quality: The specific property or behavior to evaluate.
  • Qualifier (Optional): An “if” statement specifying conditions.

Example Criteria:

  • “Reward responses that provide thorough explanations.”
  • “Penalize responses where the language is overly technical if the user is a beginner.”
  • “Reward responses that use an appropriate level of politeness.”

Example: Evaluating Tone and Style

Suppose you are developing a customer support chatbot and want to ensure the responses are empathetic and helpful.

import requests

# Composo Continuous Evaluation endpoint
url = "https://platform.composo.ai/api/v1/evals/reward"
headers = {
    "API-Key": "YOUR_API_KEY"
}

# The conversation to evaluate and the criterion to score it against
payload = {
    "messages": [
        {"role": "user", "content": "I'm really frustrated with my device not working."},
        {"role": "assistant", "content": "I'm sorry to hear that you're experiencing issues with your device. Let's see how I can assist you to resolve this problem."}
    ],
    "evaluation_criteria": "Reward responses that express appropriate empathy and offer helpful solutions to the user's problem."
}

response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()  # surface HTTP errors instead of failing later on missing keys
result = response.json()

print(f"Score: {result['score']}")
print(f"Feedback: {result.get('feedback', 'No feedback provided.')}")

Note: When evaluating non-monotonic qualities with qualifiers (e.g., “appropriate”), the score reflects how well the response meets the optimal level of that quality. Higher scores indicate better adherence to what is considered appropriate or optimal in the context.
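
To assess several qualities of the same exchange, one simple pattern is to send one request per criterion, keeping each criterion focused on a single quality. The sketch below reuses the endpoint and payload shape from the example above; the evaluate helper and the two criteria strings are illustrative, not part of the API:

import requests

URL = "https://platform.composo.ai/api/v1/evals/reward"
HEADERS = {"API-Key": "YOUR_API_KEY"}

def evaluate(messages, criterion):
    """Score one conversation against a single evaluation criterion."""
    payload = {"messages": messages, "evaluation_criteria": criterion}
    response = requests.post(URL, headers=HEADERS, json=payload)
    response.raise_for_status()
    return response.json()

messages = [
    {"role": "user", "content": "I'm really frustrated with my device not working."},
    {"role": "assistant", "content": "I'm sorry to hear that you're experiencing issues with your device. Let's see how I can assist you to resolve this problem."},
]

# Illustrative criteria, each focused on a single quality as recommended above.
criteria = [
    "Reward responses that express an appropriate level of empathy.",
    "Reward responses that offer helpful solutions to the user's problem.",
]

for criterion in criteria:
    result = evaluate(messages, criterion)
    print(f"{result['score']:.2f}  {criterion}")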

Interpreting the Score

  • Score: A continuous value between 0 and 1 indicating how well the response meets the evaluation criteria.

    • 0: Does not meet the criteria at all.
    • Values between 0 and 1: Partially meets the criteria.
    • 1: Fully meets the criteria.
  • Feedback: Optional detailed feedback providing insights into the evaluation.
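
In practice you will often apply a threshold to the score, for example to flag weak responses during production monitoring or to fail a check during development. A minimal sketch; the 0.7 threshold is an arbitrary illustration rather than a recommended value:

PASS_THRESHOLD = 0.7  # illustrative threshold, not an API recommendation

def meets_bar(result, threshold=PASS_THRESHOLD):
    """Return True if the evaluated response clears the quality bar."""
    score = result["score"]            # continuous value between 0 and 1
    feedback = result.get("feedback")  # optional explanation of the score
    if score < threshold:
        print(f"Below threshold ({score:.2f}): {feedback or 'no feedback provided'}")
        return False
    return True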

Continuous Evaluation helps fine-tune your model’s responses to align closely with your application’s goals.