Enterprise-grade reversible anonymization using Google Cloud DLP
Project description
Reversible Anonymizer
An enterprise-grade Python package for reversible text anonymization using Google Cloud services.
Overview
Reversible Anonymizer provides a powerful, scalable solution for detecting and replacing sensitive information in text with realistic-looking fake data, while maintaining the ability to reverse the process and recover the original information.
Key features:
- Complete Google DLP InfoType Support: Over 100 pre-configured detectors for PII, financial data, healthcare information, and more
- Consistent Anonymization: Name parts and repeated information anonymized consistently
- Multiple Cache Strategies: In-memory LRU cache and Google Cloud Memcache support
- Asynchronous Storage: High throughput processing with asynchronous storage updates
- Realistic or Token-based Replacement: Choose between human-readable fake data or systematic tokens
- Batch Operations: Process multiple texts efficiently in parallel
- Production-ready Resilience: Comprehensive error handling, fallbacks, and operational modes
Installation
pip install reversible-anonymizer
# With Memcache support
pip install reversible-anonymizer[memcache]
Prerequisites
- Python 3.8+
- Google Cloud project with the following services enabled:
- Cloud DLP API (dlp.googleapis.com)
- Cloud Firestore API (firestore.googleapis.com)
- [Optional] Memorystore for Memcached (memcache.googleapis.com)
- Google Cloud credentials configured
Quick Start
from reversible_anonymizer import ReversibleAnonymizer
# Initialize the anonymizer
anonymizer = ReversibleAnonymizer(project="your-project-id")
# Anonymize text
original_text = "Hello, my name is John Smith. Please contact me at john.smith@example.com."
anonymized_text = anonymizer.anonymize(original_text)
print(f"Anonymized: {anonymized_text}")
# Output: "Hello, my name is Michael Johnson. Please contact me at robert.brown@domain.net."
# De-anonymize back to original
recovered_text = anonymizer.deanonymize(anonymized_text)
print(f"De-anonymized: {recovered_text}")
# Output:
"Hello, my name is John Smith. Please contact me at john.smith@example.com."
Usage Examples
Basic Configuration
anonymizer = ReversibleAnonymizer(
project="your-project-id",
info_types=["PERSON_NAME", "EMAIL_ADDRESS", "PHONE_NUMBER"]
)
Realistic vs Token-Based Anonymization
# Realistic fake data (default)
anonymizer = ReversibleAnonymizer(
project="your-project-id",
use_realistic_fake_data=True
)
# Token-based anonymization
anonymizer = ReversibleAnonymizer(
project="your-project-id",
use_realistic_fake_data=False
)
Detailed Results with Statistics
result = anonymizer.anonymize(text, detailed_result=True)
print(f"Anonymized: {result['anonymized_text']}")
print(f"Findings: {len(result['findings'])}")
print(f"Duration: {result['stats']['duration_ms']} ms")
print(f"Cache hits: {result['stats']['cache_hits']}")
Batch Processing
texts = [
"John Smith lives in New York.",
"Jane Doe is from Seattle.",
"Contact John Smith at john@example.com."
]
# Process in parallel
anonymized_texts = anonymizer.anonymize_batch(texts)
Modes of Operation
# Strict mode (default) - raises exceptions on errors
anonymizer = ReversibleAnonymizer(
project="your-project-id",
mode="strict"
)
# Tolerant mode - continues despite errors
anonymizer = ReversibleAnonymizer(
project="your-project-id",
mode="tolerant"
)
# Audit mode - detects but doesn't replace sensitive information
anonymizer = ReversibleAnonymizer(
project="your-project-id",
mode="audit"
)
Caching Strategies
In-Memory Cache (Default)
anonymizer = ReversibleAnonymizer(
project="your-project-id",
cache_type="memory",
cache_config={
"capacity": 10000, # Maximum items in cache
"ttl": 3600 # Time-to-live in seconds
}
)
Google Cloud Memcache
# Connect to existing Memcache instance by host
anonymizer = ReversibleAnonymizer(
project="your-project-id",
cache_type="memcache",
cache_config={
"host": "10.0.0.1", # Memcache IP address
"port": 11211, # Memcache port
"ttl": 86400 # Cache TTL in seconds
}
)
# Or connect using instance name and let the adapter discover endpoints
anonymizer = ReversibleAnonymizer(
project="your-project-id",
cache_type="memcache",
cache_config={
"instance_id": "anonymizer-cache",
"region": "us-central1",
"create_if_missing": True # Auto-create if not exists
}
)
Asynchronous Storage Updates
# Enable async storage updates for better performance
anonymizer = ReversibleAnonymizer(
project="your-project-id",
async_storage_updates=True
)
Supported InfoTypes
Reversible Anonymizer supports all Google DLP InfoTypes, organized into categories:
Person Information
- PERSON_NAME
- FIRST_NAME, LAST_NAME
- EMAIL_ADDRESS
- PHONE_NUMBER
- AGE, DATE_OF_BIRTH, GENDER
Financial Information
- CREDIT_CARD_NUMBER
- BANK_ACCOUNT, IBAN_CODE
- SWIFT_CODE, CURRENCY
Government IDs
- US_SOCIAL_SECURITY_NUMBER
- PASSPORT_NUMBER
- DRIVER_LICENSE_NUMBER
Location Information
- STREET_ADDRESS
- CITY, STATE, ZIPCODE
- GPS_COORDINATES And many more!
List available InfoTypes:
# Get all supported info types
info_types = anonymizer.get_supported_infotypes()
# Get info types by category
health_info_types = anonymizer.get_infotypes_by_category("Health Information")
Configuration Options
Environment Variables
# Core configuration
ANONYMIZER_PROJECT=your-project-id
ANONYMIZER_INFO_TYPES=PERSON_NAME,EMAIL_ADDRESS,PHONE_NUMBER
ANONYMIZER_COLLECTION=custom_mappings
ANONYMIZER_MODE=tolerant
ANONYMIZER_USE_REALISTIC_FAKE_DATA=true
# Cache configuration
ANONYMIZER_CACHE_TYPE=memcache
ANONYMIZER_MEMCACHE_HOST=10.0.0.1
ANONYMIZER_MEMCACHE_PORT=11211
ANONYMIZER_CACHE_TTL=3600
Configuration File
# Load from file
anonymizer = ReversibleAnonymizer.from_config("config.json")
Example config.json:
{
"project": "your-project-id",
"info_types": ["PERSON_NAME", "EMAIL_ADDRESS", "PHONE_NUMBER"],
"collection_name": "custom_mappings",
"mode": "strict",
"use_realistic_fake_data": true,
"cache_type": "memcache",
"cache_config": {
"host": "10.0.0.1",
"port": 11211,
"ttl": 3600
}
}
CLI Usage
# List supported info types
anonymizer --project your-project-id list-info-types
# Anonymize a file
anonymizer --project your-project-id anonymize --input input.txt --output anonymized.txt
# De-anonymize a file
anonymizer --project your-project-id deanonymize --input anonymized.txt --output original.txt
Memcache Setup
# Enable the API
python -m tools.memcache_setup --project your-project-id enable-api
# Create a new Memcache instance
python -m tools.memcache_setup --project your-project-id create \
--name anonymizer-cache \
--region us-central1 \
--node-count 1 \
--node-memory 1
Error Handling
The package provides custom exceptions:
- ServiceNotEnabledError: Raised when required Google Cloud services are not enabled
- AnonymizationError: Raised when anonymization fails
- DeAnonymizationError: Raised when de-anonymization fails
- InfoTypeNotSupportedError: Raised when an unsupported info type is requested
- StorageError: Raised when there's an error with the storage adapter
- ConfigurationError: Raised when there's a configuration error
Security Best Practices
- Use secure storage: Enable encryption for stored mappings
- Limit access: Use IAM to restrict access to the Firestore collection
- Set appropriate TTLs: Configure cache TTLs to minimize data exposure
- Enable audit logging: Use detailed_result to log anonymization operations
Performance Considerations
- Use Memcache: For high-throughput applications
- Enable async storage: Reduce latency by updating storage asynchronously
- Batch processing: Use batch methods for multiple texts
- Optimize info types: Select only the info types you need
Contributing
Contributions are welcome! Please feel free to submit a pull request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgements
- Google Sensitive Data Protection (formerly Cloud DLP) for the powerful detection capabilities
- Faker library for generating realistic fake data
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file reversible_anonymizer-1.0.13.tar.gz.
File metadata
- Download URL: reversible_anonymizer-1.0.13.tar.gz
- Upload date:
- Size: 55.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
13337f173a399a8642628a9d8bb6214ec27366af1a4a6fa71eb9af6c326937b6
|
|
| MD5 |
11ccc249ee8405d61ce306eb10a21d90
|
|
| BLAKE2b-256 |
708021a92b74ebd2479203f641756fee830d98d40c7c47335264414d027b3fec
|
File details
Details for the file reversible_anonymizer-1.0.13-py3-none-any.whl.
File metadata
- Download URL: reversible_anonymizer-1.0.13-py3-none-any.whl
- Upload date:
- Size: 56.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
3b6abf9ce1b0370b1ccbf2a2a4f81d3c69b9e6e7724e154168c4655a43b306bb
|
|
| MD5 |
f98f91e5701233956a553918fd46b9ad
|
|
| BLAKE2b-256 |
b224e59505b7fd0acf78533091baafd58e058b756e2f3164e59f113a90260743
|