Anonymization
Anonymizing your data while maintaining evaluation quality
Anonymizing Data for Composo Evaluations
When dealing with sensitive customer information, you may need to anonymize data before sending it to Composo evaluation services. This guide explains how to effectively anonymize your data while preserving evaluation quality.
Recommended Anonymization Approach
For the best evaluation results, we recommend consistent placeholder substitution rather than removing or scrambling personally identifiable information (PII). Substitution preserves the relationships between entities, which evaluation quality depends on.
Best Practices
- Use sequential placeholders for each entity type
  - Replace “Bob sent an email to Sally” with “NAME_1 sent an email to NAME_2”
  - This preserves relationships between entities
- Maintain placeholder consistency across all related content
  - The same entity should have the same placeholder ID throughout a single evaluation request
  - Example: If “Sally” is “NAME_2” in one part, it should remain “NAME_2” everywhere in that request
- Preserve structure and context
  - Keep sentence structure, formatting, and non-PII context intact
  - This ensures evaluations remain accurate and meaningful
Numbering can be omitted if there is only one instance of a particular entity type. For example, if only one name appears in your data, you can simply use “NAME” instead of “NAME_1”.
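The practices above can be sketched in a few lines of Python. This is a minimal illustration, not Composo's implementation: the function `anonymize_request`, the field names `input` and `output`, and the assumption that you already know which names to replace are all hypothetical. It assigns IDs in order of first appearance, reuses the same ID across every field of one request, and drops the number when only one name is present.

```python
import re

def anonymize_request(fields, names):
    """Replace known names with stable placeholders across all fields of one request.

    `fields` maps a field label to its text; `names` lists the PII strings to replace.
    Both inputs are hypothetical shapes chosen for this sketch.
    """
    names = list(dict.fromkeys(names))  # de-duplicate while keeping order
    combined = "\n".join(fields.values())

    def first_position(name):
        match = re.search(rf"\b{re.escape(name)}\b", combined)
        return match.start() if match else None

    # Number entities by where they first appear in the request.
    positions = {name: first_position(name) for name in names}
    present = sorted((n for n in names if positions[n] is not None), key=positions.get)
    if not present:
        return dict(fields), {}

    # Omit the numeric suffix when only one entity of this type appears.
    if len(present) == 1:
        mapping = {present[0]: "NAME"}
    else:
        mapping = {name: f"NAME_{i}" for i, name in enumerate(present, start=1)}

    # Longest names first so a longer name is never shadowed by a shorter one.
    alternatives = "|".join(re.escape(n) for n in sorted(mapping, key=len, reverse=True))
    pattern = re.compile(rf"\b(?:{alternatives})\b")
    anonymized = {
        key: pattern.sub(lambda m: mapping[m.group(0)], value)
        for key, value in fields.items()
    }
    return anonymized, mapping

request = {
    "input": "Bob sent an email to Sally",
    "output": "Sally replied to Bob",
}
anonymized, mapping = anonymize_request(request, ["Sally", "Bob"])
print(anonymized["input"])   # NAME_1 sent an email to NAME_2
print(anonymized["output"])  # NAME_2 replied to NAME_1
```

Building one mapping from the whole request before substituting is what keeps “Sally” as NAME_2 in both fields.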
Recommended PII Types to Anonymize
- Person names → “NAME_1”, “NAME_2”, etc.
- Email addresses → “EMAIL_1”, “EMAIL_2”, etc.
- Phone numbers → “PHONE_1”, “PHONE_2”, etc.
- Physical addresses → “ADDRESS_1”, “ADDRESS_2”, etc. (you can retain country/region)
- URLs → “URL_1”, “URL_2”, etc.
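Structured PII such as email addresses, URLs, and phone numbers can often be found with patterns. The sketch below is an assumption-laden illustration: the regular expressions are deliberately simple and will miss or over-match some real-world formats, and the function name `replace_structured_pii` is hypothetical. It always numbers placeholders, which the guidance above permits even when only one instance exists.

```python
import re

# Simplified patterns for illustration only; production detection needs broader coverage.
PATTERNS = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    # The final character class keeps trailing punctuation out of the URL match.
    ("URL", re.compile(r"https?://[^\s<>\"']+[^\s<>\"'.,;:!?)]")),
    ("PHONE", re.compile(r"\+?\d[\d\s().-]{7,}\d")),
]

def replace_structured_pii(text):
    """Replace emails, URLs, and phone numbers with numbered placeholders.

    Returns the rewritten text and a mapping of original value -> placeholder.
    """
    mapping = {}
    for prefix, pattern in PATTERNS:
        seen = {}  # original value -> placeholder, for this entity type

        def substitute(match, prefix=prefix, seen=seen):
            # Reuse the placeholder if this exact value was seen before.
            return seen.setdefault(match.group(0), f"{prefix}_{len(seen) + 1}")

        text = pattern.sub(substitute, text)
        mapping.update(seen)
    return text, mapping

sample = (
    "Contact sally@example.com or call +1 415 555 0100. "
    "Docs: https://example.com/help, then reply to sally@example.com."
)
anonymized, mapping = replace_structured_pii(sample)
print(anonymized)
```

Emails are replaced before URLs and phone numbers so that a later pattern never matches inside an address that has already been substituted.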
Implementation Example
Original Data:
Anonymized Data:
Tools for Anonymization
We recommend using Microsoft Presidio, an open-source framework for PII detection and anonymization. It provides:
- Entity recognition for common PII types
- Multiple anonymization methods
- Support for multiple languages
- Customizable entity detection
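One way to combine Presidio with the numbered placeholders described in this guide is to use Presidio only for detection and apply the numbering yourself. The sketch below makes several assumptions: the `PREFIXES` mapping and both function names are hypothetical, Presidio's `LOCATION` entity is only an approximation of a physical address, and `detect_spans` requires the `presidio-analyzer` package plus a spaCy language model to be installed. `apply_placeholders` has no dependencies and can be tried with hand-written spans.

```python
# Hypothetical mapping from Presidio entity names to this guide's placeholder prefixes.
PREFIXES = {
    "PERSON": "NAME",
    "EMAIL_ADDRESS": "EMAIL",
    "PHONE_NUMBER": "PHONE",
    "LOCATION": "ADDRESS",  # approximation: LOCATION is broader than street addresses
    "URL": "URL",
}

def detect_spans(text, language="en"):
    """Detect PII with Presidio and return (start, end, entity_type) tuples."""
    # Imported lazily so the rest of this sketch runs without Presidio installed.
    from presidio_analyzer import AnalyzerEngine

    results = AnalyzerEngine().analyze(
        text=text, entities=list(PREFIXES), language=language
    )
    return [(r.start, r.end, r.entity_type) for r in results]

def apply_placeholders(text, spans):
    """Replace each detected span with a consistent, sequentially numbered placeholder."""
    ids = {}     # (prefix, original value) -> placeholder
    counts = {}  # prefix -> number of distinct values seen so far
    pieces, cursor = [], 0
    # Process left to right; at equal starts, prefer the longer span.
    for start, end, entity_type in sorted(spans, key=lambda s: (s[0], -s[1])):
        if start < cursor:
            continue  # overlaps a span that was already replaced
        prefix = PREFIXES.get(entity_type, entity_type)
        key = (prefix, text[start:end])
        if key not in ids:
            counts[prefix] = counts.get(prefix, 0) + 1
            ids[key] = f"{prefix}_{counts[prefix]}"
        pieces.append(text[cursor:start])
        pieces.append(ids[key])
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)

text = "Bob sent an email to Sally"
spans = [(0, 3, "PERSON"), (21, 26, "PERSON")]  # hand-written stand-in for detect_spans(text)
print(apply_placeholders(text, spans))  # NAME_1 sent an email to NAME_2
```

With Presidio installed, `apply_placeholders(text, detect_spans(text))` would run the full pipeline; detection quality then depends on the recognizers and language model you configure.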