Enterprise-grade reversible anonymization using Google Cloud DLP

Reversible Anonymizer

An enterprise-grade Python package for reversible text anonymization using Google Cloud services.

Overview

Reversible Anonymizer detects sensitive information in text and replaces it with realistic-looking fake data, while keeping the mappings needed to reverse the process and recover the original text.

Key features:

Complete Google DLP InfoType Support: Over 100 pre-configured detectors for PII, financial data, healthcare information, and more
Consistent Anonymization: Name parts and repeated values are anonymized consistently
Multiple Cache Strategies: In-memory LRU cache and Google Cloud Memcache support
Asynchronous Storage: High throughput processing with asynchronous storage updates
Realistic or Token-based Replacement: Choose between human-readable fake data or systematic tokens
Batch Operations: Process multiple texts efficiently in parallel
Production-ready Resilience: Comprehensive error handling, fallbacks, and operational modes
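
To make the idea of reversible, consistent replacement concrete, here is a minimal sketch that does not use the package or Google Cloud at all. It assumes a simple regular expression for email detection and an in-memory mapping; the class name TinyReversibleMapper and the <EMAIL_n> token format are hypothetical and chosen only for illustration.

```python
import itertools
import re

# Simplified email pattern, for illustration only (real detection is done by DLP).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
TOKEN_RE = re.compile(r"<EMAIL_\d+>")


class TinyReversibleMapper:
    """Toy stand-in: maps each detected email to a stable token and back."""

    def __init__(self):
        self._forward = {}  # original value -> token
        self._reverse = {}  # token -> original value
        self._counter = itertools.count(1)

    def _token_for(self, original):
        # Reuse the existing token so repeated values stay consistent.
        if original not in self._forward:
            token = f"<EMAIL_{next(self._counter)}>"
            self._forward[original] = token
            self._reverse[token] = original
        return self._forward[original]

    def anonymize(self, text):
        return EMAIL_RE.sub(lambda m: self._token_for(m.group(0)), text)

    def deanonymize(self, text):
        # Unknown tokens are left untouched rather than guessed.
        return TOKEN_RE.sub(lambda m: self._reverse.get(m.group(0), m.group(0)), text)


mapper = TinyReversibleMapper()
masked = mapper.anonymize("Mail a@example.com, then a@example.com and b@example.org.")
print(masked)  # Mail <EMAIL_1>, then <EMAIL_1> and <EMAIL_2>.
print(mapper.deanonymize(masked))
```

The same value always maps to the same token, and the reverse dictionary is what makes recovery possible; losing it makes the anonymization one-way.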

Installation

pip install reversible-anonymizer

# With Memcache support
pip install "reversible-anonymizer[memcache]"

Prerequisites

Python 3.8+
Google Cloud project with the following services enabled:
Cloud DLP API (dlp.googleapis.com)
Cloud Firestore API (firestore.googleapis.com)
[Optional] Memorystore for Memcached (memcache.googleapis.com)
Google Cloud credentials configured

Quick Start

from reversible_anonymizer import ReversibleAnonymizer

# Initialize the anonymizer
anonymizer = ReversibleAnonymizer(project="your-project-id")

# Anonymize text
original_text = "Hello, my name is John Smith. Please contact me at john.smith@example.com."
anonymized_text = anonymizer.anonymize(original_text)
print(f"Anonymized: {anonymized_text}")
# Output: "Hello, my name is Michael Johnson. Please contact me at robert.brown@domain.net."

# De-anonymize back to original
recovered_text = anonymizer.deanonymize(anonymized_text)
print(f"De-anonymized: {recovered_text}")
# Output: "Hello, my name is John Smith. Please contact me at john.smith@example.com."

Usage Examples

Basic Configuration

anonymizer = ReversibleAnonymizer(
    project="your-project-id",
    info_types=["PERSON_NAME", "EMAIL_ADDRESS", "PHONE_NUMBER"]
)

Realistic vs Token-Based Anonymization

# Realistic fake data (default)
anonymizer = ReversibleAnonymizer(
    project="your-project-id",
    use_realistic_fake_data=True
)

# Token-based anonymization
anonymizer = ReversibleAnonymizer(
    project="your-project-id",
    use_realistic_fake_data=False
)
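
The difference between the two replacement styles can be sketched without the package. The snippet below is an assumption-laden illustration, not how the package generates values (it credits Faker for that): the helper names, the tiny name pool, and the hash-based selection are all hypothetical.

```python
import hashlib

# Illustrative pool only; a real setup would draw from a much larger source.
FAKE_NAMES = ["Michael Johnson", "Linda Brown", "Maria Garcia", "Robert Lee"]


def token_replacement(value, info_type):
    """Return a systematic token such as [PERSON_NAME_1a2b3c4d]."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]
    return f"[{info_type}_{digest}]"


def realistic_replacement(value, pool=FAKE_NAMES):
    """Pick a human-readable stand-in that is stable for a given input."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return pool[int(digest, 16) % len(pool)]


print(token_replacement("John Smith", "PERSON_NAME"))
print(realistic_replacement("John Smith"))
```

Tokens are easy to spot and unambiguous to reverse, while realistic values read naturally in downstream text. With a small pool, two originals can land on the same fake value, so a scheme like this would need collision handling before it could be reversed reliably.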

Detailed Results with Statistics

text = "Hello, my name is John Smith. Please contact me at john.smith@example.com."
result = anonymizer.anonymize(text, detailed_result=True)
print(f"Anonymized: {result['anonymized_text']}")
print(f"Findings: {len(result['findings'])}")
print(f"Duration: {result['stats']['duration_ms']} ms")
print(f"Cache hits: {result['stats']['cache_hits']}")
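
A detailed result can be turned into a compact log line. The sketch below works on a plain dictionary shaped like the keys used above (anonymized_text, findings, stats); the helper name summarize_result and the fields inside each finding are assumptions made for illustration.

```python
def summarize_result(result):
    """Build a one-line summary that contains counts and timings, not text."""
    stats = result.get("stats", {})
    return (
        f"findings={len(result.get('findings', []))} "
        f"duration_ms={stats.get('duration_ms', 'n/a')} "
        f"cache_hits={stats.get('cache_hits', 'n/a')}"
    )


# Sample shaped like the keys shown above; the finding fields are hypothetical.
sample = {
    "anonymized_text": "Hello, my name is Michael Johnson.",
    "findings": [{"info_type": "PERSON_NAME"}],
    "stats": {"duration_ms": 12, "cache_hits": 0},
}
print(summarize_result(sample))  # findings=1 duration_ms=12 cache_hits=0
```

Logging only counts and timings keeps both the original and the anonymized text out of log files.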

Batch Processing

texts = [
    "John Smith lives in New York.",
    "Jane Doe is from Seattle.",
    "Contact John Smith at john@example.com."
]

# Process in parallel
anonymized_texts = anonymizer.anonymize_batch(texts)
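
How anonymize_batch parallelizes work internally is not documented here. As a rough sketch of the general technique, a thread pool can fan out network-bound calls while preserving input order; fake_anonymize and anonymize_many are hypothetical names, and the stand-in function does a plain string replacement instead of calling any service.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_anonymize(text):
    # Stand-in for a network-bound anonymization call.
    return text.replace("John Smith", "<PERSON_1>")


def anonymize_many(texts, anonymize, max_workers=4):
    """Run anonymize over texts in parallel, returning results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, not completion order.
        return list(pool.map(anonymize, texts))


results = anonymize_many(
    ["John Smith lives in New York.", "Jane Doe is from Seattle."],
    fake_anonymize,
)
print(results)
```

Threads suit this workload because the time is spent waiting on remote calls rather than on CPU-bound work.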

Modes of Operation

# Strict mode (default) - raises exceptions on errors
anonymizer = ReversibleAnonymizer(
    project="your-project-id",
    mode="strict"
)

# Tolerant mode - continues despite errors
anonymizer = ReversibleAnonymizer(
    project="your-project-id",
    mode="tolerant"
)

# Audit mode - detects but doesn't replace sensitive information
anonymizer = ReversibleAnonymizer(
    project="your-project-id",
    mode="audit"
)
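
The three modes can be pictured with a toy pipeline. This is only a sketch of the semantics described in the comments above: the function process, the SSN-like pattern, and in particular the fallback to a generic [REDACTED] mask in tolerant mode are assumptions, not documented behavior of the package.

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def process(text, mode="strict", replacer=None):
    """Toy pipeline: detect SSN-like strings, then act according to mode."""
    if mode not in {"strict", "tolerant", "audit"}:
        raise ValueError(f"unknown mode: {mode}")
    replacer = replacer or (lambda value: "<SSN>")
    findings = SSN_RE.findall(text)
    if mode == "audit":
        # Detect only; the text is returned untouched.
        return {"text": text, "findings": findings}
    try:
        replaced = SSN_RE.sub(lambda m: replacer(m.group(0)), text)
    except Exception:
        if mode == "strict":
            raise  # surface the failure to the caller
        # Tolerant (assumed behavior): keep going with a generic mask.
        replaced = SSN_RE.sub("[REDACTED]", text)
    return {"text": replaced, "findings": findings}


def broken_replacer(value):
    raise RuntimeError("mapping store unavailable")


text = "SSN on file: 123-45-6789."
print(process(text, mode="audit"))
print(process(text, mode="tolerant", replacer=broken_replacer))
```

Whatever a tolerant mode does on failure, it is worth checking that it never lets the unmodified sensitive value through.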

Caching Strategies

In-Memory Cache (Default)

anonymizer = ReversibleAnonymizer(
    project="your-project-id",
    cache_type="memory",
    cache_config={
        "capacity": 10000,  # Maximum items in cache
        "ttl": 3600         # Time-to-live in seconds
    }
)
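
The two settings above, capacity and ttl, correspond to the usual behavior of an LRU cache with expiry. The following is a minimal standalone sketch of that technique, not the package's cache class; LruTtlCache and its injectable clock parameter are hypothetical.

```python
import time
from collections import OrderedDict


class LruTtlCache:
    """Least-recently-used cache whose entries also expire after ttl seconds."""

    def __init__(self, capacity=10000, ttl=3600, clock=time.monotonic):
        self.capacity = capacity
        self.ttl = ttl
        self._clock = clock
        self._items = OrderedDict()  # key -> (expires_at, value)

    def get(self, key, default=None):
        entry = self._items.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired: drop it and report a miss
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key, value):
        self._items[key] = (self._clock() + self.ttl, value)
        self._items.move_to_end(key)
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used


# A fake clock makes the expiry behavior easy to demonstrate.
now = [0.0]
cache = LruTtlCache(capacity=2, ttl=10, clock=lambda: now[0])
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")      # touch "a" so that "b" becomes the eviction candidate
cache.set("c", 3)   # over capacity: "b" is evicted
print(cache.get("b"), cache.get("a"))  # None 1
```

A shorter ttl limits how long original values sit in process memory, at the cost of more lookups against the backing store.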

Google Cloud Memcache

# Connect to existing Memcache instance by host
anonymizer = ReversibleAnonymizer(
    project="your-project-id",
    cache_type="memcache",
    cache_config={
        "host": "10.0.0.1",          # Memcache IP address
        "port": 11211,               # Memcache port
        "ttl": 86400                 # Cache TTL in seconds
    }
)

# Or connect using instance name and let the adapter discover endpoints
anonymizer = ReversibleAnonymizer(
    project="your-project-id",
    cache_type="memcache",
    cache_config={
        "instance_id": "anonymizer-cache",
        "region": "us-central1",
        "create_if_missing": True     # Auto-create if not exists
    }
)

Asynchronous Storage Updates

# Enable async storage updates for better performance
anonymizer = ReversibleAnonymizer(
    project="your-project-id",
    async_storage_updates=True
)
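
The general pattern behind asynchronous storage updates is a queue drained by a background worker, so the caller does not wait for each write. The sketch below illustrates that pattern under stated assumptions: AsyncMappingWriter is a hypothetical class, and a plain dictionary stands in for the Firestore collection.

```python
import queue
import threading


class AsyncMappingWriter:
    """Queue mapping writes and persist them on a background thread."""

    def __init__(self, store):
        self._store = store  # dict standing in for a remote collection
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, token, original):
        # Returns immediately; the write happens later on the worker.
        self._queue.put((token, original))

    def _run(self):
        while True:
            item = self._queue.get()
            try:
                if item is None:  # sentinel: stop the worker
                    return
                token, original = item
                self._store[token] = original
            finally:
                self._queue.task_done()

    def close(self):
        """Flush pending writes and stop the worker."""
        self._queue.put(None)
        self._queue.join()
        self._worker.join()


store = {}
writer = AsyncMappingWriter(store)
writer.submit("<EMAIL_1>", "a@example.com")
writer.submit("<EMAIL_2>", "b@example.org")
writer.close()
print(store)
```

The trade-off is durability: a mapping that is still queued when the process dies is lost, and text anonymized with it can no longer be reversed, so flushing on shutdown matters.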

Supported InfoTypes

Reversible Anonymizer supports all Google DLP InfoTypes, organized into categories:

Person Information

PERSON_NAME
FIRST_NAME, LAST_NAME
EMAIL_ADDRESS
PHONE_NUMBER
AGE, DATE_OF_BIRTH, GENDER

Financial Information

CREDIT_CARD_NUMBER
BANK_ACCOUNT, IBAN_CODE
SWIFT_CODE, CURRENCY

Government IDs

US_SOCIAL_SECURITY_NUMBER
PASSPORT_NUMBER
DRIVER_LICENSE_NUMBER

Location Information

STREET_ADDRESS
CITY, STATE, ZIPCODE
GPS_COORDINATES

And many more!

List available InfoTypes:

# Get all supported info types
info_types = anonymizer.get_supported_infotypes()

# Get info types by category
health_info_types = anonymizer.get_infotypes_by_category("Health Information")

Configuration Options

Environment Variables

# Core configuration
ANONYMIZER_PROJECT=your-project-id
ANONYMIZER_INFO_TYPES=PERSON_NAME,EMAIL_ADDRESS,PHONE_NUMBER
ANONYMIZER_COLLECTION=custom_mappings
ANONYMIZER_MODE=tolerant
ANONYMIZER_USE_REALISTIC_FAKE_DATA=true

# Cache configuration
ANONYMIZER_CACHE_TYPE=memcache
ANONYMIZER_MEMCACHE_HOST=10.0.0.1
ANONYMIZER_MEMCACHE_PORT=11211
ANONYMIZER_CACHE_TTL=3600
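
How the package parses these variables is not shown here. As an illustration of one reasonable approach, the sketch below reads a subset of the variable names listed above into a dictionary; load_env_config is a hypothetical helper, and the accepted boolean spellings and the 3600-second TTL default are assumptions. The strict mode, memory cache, and realistic-data defaults follow the defaults stated elsewhere in this document.

```python
import os


def load_env_config(environ=None):
    """Read ANONYMIZER_* variables into a plain configuration dictionary."""
    env = os.environ if environ is None else environ
    raw_types = env.get("ANONYMIZER_INFO_TYPES", "")
    raw_flag = env.get("ANONYMIZER_USE_REALISTIC_FAKE_DATA", "true")
    return {
        "project": env.get("ANONYMIZER_PROJECT"),
        "mode": env.get("ANONYMIZER_MODE", "strict"),
        "cache_type": env.get("ANONYMIZER_CACHE_TYPE", "memory"),
        # Comma-separated list; empty entries are ignored.
        "info_types": [t.strip() for t in raw_types.split(",") if t.strip()],
        # Assumed truthy spellings: 1, true, yes (case-insensitive).
        "use_realistic_fake_data": raw_flag.strip().lower() in {"1", "true", "yes"},
        # Assumed default TTL of one hour.
        "cache_ttl": int(env.get("ANONYMIZER_CACHE_TTL", "3600")),
    }


example_env = {
    "ANONYMIZER_PROJECT": "your-project-id",
    "ANONYMIZER_INFO_TYPES": "PERSON_NAME, EMAIL_ADDRESS,PHONE_NUMBER",
    "ANONYMIZER_MODE": "tolerant",
    "ANONYMIZER_USE_REALISTIC_FAKE_DATA": "false",
}
config = load_env_config(example_env)
print(config)
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.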

Configuration File

# Load from file
anonymizer = ReversibleAnonymizer.from_config("config.json")

Example config.json:

{
  "project": "your-project-id",
  "info_types": ["PERSON_NAME", "EMAIL_ADDRESS", "PHONE_NUMBER"],
  "collection_name": "custom_mappings",
  "mode": "strict",
  "use_realistic_fake_data": true,
  "cache_type": "memcache",
  "cache_config": {
    "host": "10.0.0.1",
    "port": 11211,
    "ttl": 3600
  }
}
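
Before handing a file like this to from_config, it can help to check it for obvious mistakes. The following is a sketch of such a pre-check using only the standard library; validate_config is a hypothetical helper, the allowed values are taken from the modes and cache types named in this document, and treating project as required is an assumption.

```python
import json

ALLOWED_MODES = {"strict", "tolerant", "audit"}
ALLOWED_CACHE_TYPES = {"memory", "memcache"}


def validate_config(raw_json):
    """Parse a JSON config string and raise ValueError listing any problems."""
    config = json.loads(raw_json)
    errors = []
    if not config.get("project"):
        errors.append("project is required")  # assumption: no default project
    if config.get("mode", "strict") not in ALLOWED_MODES:
        errors.append(f"unknown mode: {config['mode']}")
    if config.get("cache_type", "memory") not in ALLOWED_CACHE_TYPES:
        errors.append(f"unknown cache_type: {config['cache_type']}")
    if not isinstance(config.get("info_types", []), list):
        errors.append("info_types must be a list")
    if errors:
        raise ValueError("; ".join(errors))
    return config


raw = """
{
  "project": "your-project-id",
  "info_types": ["PERSON_NAME", "EMAIL_ADDRESS"],
  "mode": "strict",
  "cache_type": "memcache",
  "cache_config": {"host": "10.0.0.1", "port": 11211, "ttl": 3600}
}
"""
checked = validate_config(raw)
print(checked["cache_config"]["port"])  # 11211
```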

CLI Usage

# List supported info types
anonymizer --project your-project-id list-info-types

# Anonymize a file
anonymizer --project your-project-id anonymize --input input.txt --output anonymized.txt

# De-anonymize a file
anonymizer --project your-project-id deanonymize --input anonymized.txt --output original.txt

Memcache Setup

# Enable the API
python -m tools.memcache_setup --project your-project-id enable-api

# Create a new Memcache instance
python -m tools.memcache_setup --project your-project-id create \
  --name anonymizer-cache \
  --region us-central1 \
  --node-count 1 \
  --node-memory 1

Error Handling

The package provides custom exceptions:

ServiceNotEnabledError: Raised when required Google Cloud services are not enabled
AnonymizationError: Raised when anonymization fails
DeAnonymizationError: Raised when de-anonymization fails
InfoTypeNotSupportedError: Raised when an unsupported info type is requested
StorageError: Raised when there's an error with the storage adapter
ConfigurationError: Raised when there's a configuration error
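
A common way to use exceptions like these is to fail closed, so that raw text is never passed on when anonymization breaks. The sketch below defines local stand-ins for two of the exception names listed above, because the import path for the real classes is not stated here; anonymize_or_withhold and flaky are hypothetical helpers.

```python
class AnonymizationError(Exception):
    """Local stand-in for the package's exception of the same name."""


class StorageError(Exception):
    """Local stand-in for the package's exception of the same name."""


def anonymize_or_withhold(anonymize, text, placeholder="[WITHHELD]"):
    """Return anonymized text, or a placeholder if anonymization fails."""
    try:
        return anonymize(text)
    except (AnonymizationError, StorageError) as exc:
        # Fail closed: never pass the raw text downstream on error.
        print(f"anonymization failed: {type(exc).__name__}")
        return placeholder


def flaky(text):
    # Simulates a backend failure for demonstration purposes.
    raise StorageError("mapping store unreachable")


print(anonymize_or_withhold(flaky, "Call John Smith"))  # [WITHHELD]
```

In real code the two stand-in classes would be replaced by imports from the package, and unrelated exceptions would still propagate.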

Security Best Practices

Use secure storage: Enable encryption for stored mappings
Limit access: Use IAM to restrict access to the Firestore collection
Set appropriate TTLs: Configure cache TTLs to minimize data exposure
Enable audit logging: Use detailed_result to log anonymization operations

Performance Considerations

Use Memcache: For high-throughput applications
Enable async storage: Reduce latency by updating storage asynchronously
Batch processing: Use batch methods for multiple texts
Optimize info types: Select only the info types you need

Contributing

Contributions are welcome! Please feel free to submit a pull request.

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements

Google Sensitive Data Protection (formerly Cloud DLP) for the powerful detection capabilities
Faker library for generating realistic fake data

Release history

1.0.13 (this version): Mar 10, 2025
1.0.12: Mar 9, 2025
1.0.11: Mar 9, 2025
1.0.10: Mar 9, 2025
1.0.9: Mar 9, 2025
1.0.8: Mar 8, 2025
1.0.7: Mar 8, 2025
1.0.6: Mar 8, 2025
1.0.5: Mar 8, 2025
1.0.4: Mar 8, 2025
1.0.3: Mar 8, 2025
1.0.2: Mar 8, 2025
1.0.1: Mar 8, 2025
1.0.0: Mar 8, 2025
0.0.1: Feb 9, 2025

Download files

Source Distribution

reversible_anonymizer-1.0.13.tar.gz (55.3 kB)

Uploaded Mar 10, 2025 Source

Built Distribution

reversible_anonymizer-1.0.13-py3-none-any.whl (56.1 kB)

Uploaded Mar 10, 2025 Python 3

File details

Details for the file reversible_anonymizer-1.0.13.tar.gz.

File metadata

Download URL: reversible_anonymizer-1.0.13.tar.gz
Upload date: Mar 10, 2025
Size: 55.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.12

File hashes

Hashes for reversible_anonymizer-1.0.13.tar.gz
Algorithm	Hash digest
SHA256	`13337f173a399a8642628a9d8bb6214ec27366af1a4a6fa71eb9af6c326937b6`
MD5	`11ccc249ee8405d61ce306eb10a21d90`
BLAKE2b-256	`708021a92b74ebd2479203f641756fee830d98d40c7c47335264414d027b3fec`

File details

Details for the file reversible_anonymizer-1.0.13-py3-none-any.whl.

File metadata

Download URL: reversible_anonymizer-1.0.13-py3-none-any.whl
Upload date: Mar 10, 2025
Size: 56.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.12

File hashes

Hashes for reversible_anonymizer-1.0.13-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3b6abf9ce1b0370b1ccbf2a2a4f81d3c69b9e6e7724e154168c4655a43b306bb`
MD5	`f98f91e5701233956a553918fd46b9ad`
BLAKE2b-256	`b224e59505b7fd0acf78533091baafd58e058b756e2f3164e59f113a90260743`
