Recommended Anonymization Approach
For optimal evaluation results, we recommend using a consistent placeholder substitution approach rather than removing or scrambling PII. This preserves relationships between entities that are important for evaluation quality.
Best Practices
- Use sequential placeholders for each entity type
  - Replace “Bob sent an email to Sally” with “NAME_1 sent an email to NAME_2”
  - This preserves relationships between entities
- Maintain placeholder consistency across all related content
  - The same entity should have the same placeholder ID throughout a single evaluation request
  - Example: If “Sally” is “NAME_2” in one part, it should remain “NAME_2” everywhere in that request
- Preserve structure and context
  - Keep sentence structure, formatting, and non-PII context intact
  - This ensures evaluations remain accurate and meaningful
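The practices above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `PlaceholderMap` class and the `anonymize` function are hypothetical names, and the sketch assumes the PII values are already known to the caller (detection is a separate step). One `PlaceholderMap` is meant to be shared across every field of a single evaluation request so that IDs stay consistent.

```python
import re
from collections import defaultdict


class PlaceholderMap:
    """Hypothetical helper: one stable placeholder per (entity type, value)."""

    def __init__(self):
        self._ids = {}                     # (entity_type, value) -> placeholder
        self._counters = defaultdict(int)  # entity_type -> last number issued

    def placeholder(self, entity_type, value):
        key = (entity_type, value)
        if key not in self._ids:
            self._counters[entity_type] += 1
            self._ids[key] = f"{entity_type}_{self._counters[entity_type]}"
        return self._ids[key]


def anonymize(text, known_values, mapping):
    """Replace known PII values in `text`, leaving everything else intact.

    `known_values` maps each PII string to its entity type,
    for example {"Bob": "NAME"}. Assumed to be supplied by the caller.
    """
    if not known_values:
        return text
    # Longest values first so a longer value wins over a shorter one it contains.
    alternatives = sorted(known_values, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(v) for v in alternatives) + r")\b"
    )
    # Matches are visited left to right, so numbering follows first appearance.
    return pattern.sub(
        lambda m: mapping.placeholder(known_values[m.group(0)], m.group(0)),
        text,
    )


# Share one mapping across all fields of the same evaluation request.
mapping = PlaceholderMap()
known = {"Bob": "NAME", "Sally": "NAME"}
first = anonymize("Bob sent an email to Sally", known, mapping)
second = anonymize("Sally replied to Bob", known, mapping)
```

Because both calls reuse the same `mapping`, “Sally” is `NAME_2` in both strings, and the surrounding sentence structure is left untouched.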
Recommended PII Types to Anonymize
- Person names → “NAME_1”, “NAME_2”, etc.
- Email addresses → “EMAIL_1”, “EMAIL_2”, etc.
- Phone numbers → “PHONE_1”, “PHONE_2”, etc.
- Physical addresses → “ADDRESS_1”, “ADDRESS_2”, etc. (you can retain country/region)
- URLs → “URL_1”, “URL_2”, etc.
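Some of these types have a regular enough shape that a pattern-based pass can catch many instances. The sketch below covers only email addresses and URLs, using deliberately simple regular expressions that are assumptions for illustration and will miss or over-match some real-world values (for example, a URL followed directly by punctuation). Names, phone numbers, and physical addresses generally need a proper entity recognizer. The function name `replace_structured_pii` is hypothetical.

```python
import re

# Simplified, illustrative patterns (assumptions, not exhaustive).
# URLs are handled first so that an email embedded in a URL is not split out.
PATTERNS = [
    ("URL", re.compile(r"https?://[^\s<>]+")),
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
]


def replace_structured_pii(text):
    """Replace emails and URLs with sequential, consistent placeholders."""
    seen = {}      # (label, value) -> placeholder
    counters = {}  # label -> last number issued

    def make_substitution(label):
        def substitute(match):
            key = (label, match.group(0))
            if key not in seen:
                counters[label] = counters.get(label, 0) + 1
                seen[key] = f"{label}_{counters[label]}"
            return seen[key]
        return substitute

    for label, pattern in PATTERNS:
        text = pattern.sub(make_substitution(label), text)
    return text


sample = (
    "Mail a@example.com or b@example.org, "
    "then a@example.com again via https://example.com/x"
)
result = replace_structured_pii(sample)
```

The repeated address maps to the same `EMAIL_1` placeholder each time it appears, while the second address gets `EMAIL_2`.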
Implementation Example
Original Data:
Tools for Anonymization
We recommend using Microsoft Presidio, an open-source framework for PII detection and anonymization. It provides:
- Entity recognition for common PII types
- Multiple anonymization methods
- Support for multiple languages
- Customizable entity detection
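As a rough sketch of how Presidio's detection could feed the sequential placeholder scheme described earlier: the code below uses Presidio's `AnalyzerEngine` only to find PII spans, then numbers them with a small custom function. The `PREFIXES` table, `apply_placeholders`, and `anonymize_with_presidio` are hypothetical names introduced here, and mapping Presidio's `LOCATION` entity to `ADDRESS` is an assumption that may not match full street addresses. Running the Presidio part requires the `presidio-analyzer` package and a spaCy language model to be installed.

```python
from collections import defaultdict

# Assumed mapping from Presidio entity names to this guide's placeholder prefixes.
PREFIXES = {
    "PERSON": "NAME",
    "EMAIL_ADDRESS": "EMAIL",
    "PHONE_NUMBER": "PHONE",
    "LOCATION": "ADDRESS",
    "URL": "URL",
}


def apply_placeholders(text, spans):
    """Replace (start, end, entity_type) spans with sequential placeholders."""
    ids = {}
    counters = defaultdict(int)
    pieces = []
    cursor = 0
    for start, end, entity_type in sorted(spans):
        # Skip unmapped entity types and spans overlapping one already replaced.
        if entity_type not in PREFIXES or start < cursor:
            continue
        key = (entity_type, text[start:end])
        if key not in ids:
            counters[entity_type] += 1
            ids[key] = f"{PREFIXES[entity_type]}_{counters[entity_type]}"
        pieces.append(text[cursor:start])
        pieces.append(ids[key])
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def anonymize_with_presidio(text, language="en"):
    """Detect PII with Presidio, then apply sequential placeholders."""
    # Imported lazily so apply_placeholders works without Presidio installed.
    from presidio_analyzer import AnalyzerEngine

    analyzer = AnalyzerEngine()
    results = analyzer.analyze(text=text, language=language)
    spans = [(r.start, r.end, r.entity_type) for r in results]
    return apply_placeholders(text, spans)


# The numbering step can be exercised on its own with hand-written spans.
sentence = "Bob sent an email to Sally"
sally = sentence.index("Sally")
demo = apply_placeholders(
    sentence, [(0, 3, "PERSON"), (sally, sally + 5, "PERSON")]
)
```

Keeping detection and numbering separate means the placeholder logic can be checked without loading any models, and a different detector could be swapped in later.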