Anonymizing Data for Composo Evaluations

When dealing with sensitive customer information, you may need to anonymize data before sending it to Composo evaluation services. This guide explains how to effectively anonymize your data while preserving evaluation quality.

For the best evaluation results, replace PII with consistent placeholders rather than removing or scrambling it. Consistent placeholders keep the relationships between entities intact, and those relationships matter for evaluation quality.

Best Practices

  1. Use sequential placeholders for each entity type

    • Replace “Bob sent an email to Sally” with “NAME_1 sent an email to NAME_2”
    • Distinct numbers keep the two people distinguishable, so the relationship between them is preserved
  2. Maintain placeholder consistency across all related content

    • The same entity should have the same placeholder ID throughout a single evaluation request
    • Example: If “Sally” is “NAME_2” in one part, it should remain “NAME_2” everywhere in that request
  3. Preserve structure and context

    • Keep sentence structure, formatting, and non-PII context intact
    • This ensures evaluations remain accurate and meaningful

Numbering can be omitted if there is only one instance of a particular entity type. For example, if only one name appears in your data, you can simply use “NAME” instead of “NAME_1”.

Use the following placeholder format for each entity type:

  • Person names → “NAME_1”, “NAME_2”, etc.
  • Email addresses → “EMAIL_1”, “EMAIL_2”, etc.
  • Phone numbers → “PHONE_1”, “PHONE_2”, etc.
  • Physical addresses → “ADDRESS_1”, “ADDRESS_2”, etc. (you can retain country/region)
  • URLs → “URL_1”, “URL_2”, etc.
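
The placeholder rules above can be sketched in Python. This is a minimal illustration, not part of any Composo tooling: `PlaceholderMap`, `anonymize`, and the regular expressions are hypothetical names and deliberately simple patterns, and person names are assumed to be supplied by the caller because regular expressions alone cannot reliably detect them.

```python
import re
from collections import defaultdict

# Deliberately simple patterns for illustration; real PII detection needs more care.
PATTERNS = {
    "EMAIL": r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",
    "URL": r"https?://\S+",
    "PHONE": r"\(\d{3}\)\s*\d{3}-\d{4}",
}


class PlaceholderMap:
    """Gives each distinct value one stable placeholder, numbered per entity type."""

    def __init__(self):
        self._seen = defaultdict(dict)  # entity type -> {original value: placeholder}

    def placeholder_for(self, entity_type, value):
        seen = self._seen[entity_type]
        if value not in seen:
            seen[value] = f"{entity_type}_{len(seen) + 1}"
        return seen[value]


def anonymize(text, known_names, mapping):
    """Replace emails, URLs, phone numbers, and the given names in a single pass."""
    parts = [f"(?P<{etype}>{pattern})" for etype, pattern in PATTERNS.items()]
    if known_names:
        # Longest names first so "Bob Smith" wins over "Bob".
        longest_first = sorted(known_names, key=len, reverse=True)
        names = "|".join(re.escape(n) for n in longest_first)
        parts.append(rf"(?P<NAME>\b(?:{names})\b)")
    combined = re.compile("|".join(parts))
    # m.lastgroup is the name of the alternative that matched, i.e. the entity type.
    return combined.sub(lambda m: mapping.placeholder_for(m.lastgroup, m.group()), text)


mapping = PlaceholderMap()
print(anonymize("Bob sent an email to Sally", ["Bob", "Sally"], mapping))
# prints: NAME_1 sent an email to NAME_2
```

Reusing one `PlaceholderMap` for every string in a request keeps “Sally” mapped to “NAME_2” throughout. The sketch always appends a number, which remains valid because omitting the number is optional.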

Implementation Example

Original Data:

{
  "messages": [
    {"role": "user", "content": "How do I contact Bob Smith?"},
    {"role": "assistant", "content": "You can reach Bob Smith at bob.smith@example.com or call him at (555) 123-4567."}
  ],
  "evaluation_criteria": "Reward responses that provide complete contact information when requested."
}

Anonymized Data:

{
  "messages": [
    {"role": "user", "content": "How do I contact NAME_1?"},
    {"role": "assistant", "content": "You can reach NAME_1 at EMAIL_1 or call him at PHONE_1."}
  ],
  "evaluation_criteria": "Reward responses that provide complete contact information when requested."
}
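
One way to produce the anonymized payload from the original is sketched below. It assumes the PII values have already been identified and are passed in as `(entity_type, value)` pairs; `anonymize_request` and the `entities` argument are hypothetical names used for illustration, and the payload shape is simply the one shown in this example.

```python
import copy
import re


def anonymize_request(request, entities):
    """Replace each known PII value in every message with one shared placeholder."""
    counters, placeholders = {}, {}
    for entity_type, value in entities:
        if value not in placeholders:
            counters[entity_type] = counters.get(entity_type, 0) + 1
            placeholders[value] = f"{entity_type}_{counters[entity_type]}"

    anonymized = copy.deepcopy(request)  # leave the caller's data untouched
    if not placeholders:
        return anonymized

    # Longest values first so a full name is replaced before any shorter overlapping value.
    ordered = sorted(placeholders, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(v) for v in ordered))
    for message in anonymized["messages"]:
        message["content"] = pattern.sub(lambda m: placeholders[m.group()], message["content"])
    return anonymized


request = {
    "messages": [
        {"role": "user", "content": "How do I contact Bob Smith?"},
        {"role": "assistant", "content": "You can reach Bob Smith at bob.smith@example.com or call him at (555) 123-4567."},
    ],
    "evaluation_criteria": "Reward responses that provide complete contact information when requested.",
}
entities = [
    ("NAME", "Bob Smith"),
    ("EMAIL", "bob.smith@example.com"),
    ("PHONE", "(555) 123-4567"),
]
result = anonymize_request(request, entities)
```

Because the placeholder table is built once per request, “Bob Smith” becomes “NAME_1” in both the user and assistant messages, and the evaluation criteria pass through unchanged.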

Tools for Anonymization

We recommend using Microsoft Presidio, an open-source framework for PII detection and anonymization. It provides:

  • Entity recognition for common PII types
  • Multiple anonymization methods
  • Support for multiple languages
  • Customizable entity detection
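
As a rough sketch of how Presidio could feed the placeholder scheme in this guide, the code below uses Presidio's `AnalyzerEngine` only for detection and performs the numbered replacement itself. The helper names `detect_spans` and `replace_spans` and the `PREFIXES` table are assumptions made for illustration, as is the choice to map Presidio's `LOCATION` entity to “ADDRESS”. Running `detect_spans` assumes the `presidio-analyzer` package and a spaCy English model are installed.

```python
# Assumed mapping from Presidio entity names to the placeholder prefixes in this guide.
PREFIXES = {
    "PERSON": "NAME",
    "EMAIL_ADDRESS": "EMAIL",
    "PHONE_NUMBER": "PHONE",
    "LOCATION": "ADDRESS",
    "URL": "URL",
}


def detect_spans(text):
    """Return (entity_type, start, end) tuples found by Presidio."""
    # Imported lazily so replace_spans stays usable without Presidio installed.
    from presidio_analyzer import AnalyzerEngine

    results = AnalyzerEngine().analyze(text=text, language="en")
    return [(r.entity_type, r.start, r.end) for r in results]


def replace_spans(text, spans, seen=None):
    """Replace detected spans with sequential placeholders, reusing IDs for repeated values."""
    # (prefix, value) -> placeholder; pass the same dict for every text in one request.
    seen = {} if seen is None else seen
    pieces, cursor = [], 0
    # Left to right; where detections overlap, keep the one that starts first (longest on ties).
    for entity_type, start, end in sorted(spans, key=lambda s: (s[1], s[1] - s[2])):
        # Entity types missing from PREFIXES are left untouched; extend the table as needed.
        if start < cursor or entity_type not in PREFIXES:
            continue
        prefix = PREFIXES[entity_type]
        key = (prefix, text[start:end])
        if key not in seen:
            count = sum(1 for p, _ in seen if p == prefix) + 1
            seen[key] = f"{prefix}_{count}"
        pieces.append(text[cursor:start])
        pieces.append(seen[key])
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)
```

With Presidio available, `replace_spans(text, detect_spans(text))` ties the two steps together. Detection quality depends on the recognizers and language model in use, so spot-check the output before sending data for evaluation.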