doubletake is a module to scrub PII from datasets

doubletake

Intelligent PII Detection and Replacement for Python

doubletake is a powerful, flexible library for automatically detecting and replacing Personally Identifiable Information (PII) in your data structures. Whether you're anonymizing datasets for testing, protecting sensitive information in logs, or ensuring GDPR compliance, doubletake makes it effortless.

✨ Key Features

  • 🚀 High Performance: Choose between fast JSON-based processing or flexible tree traversal
  • 🎯 Smart Detection: Built-in patterns for emails, phones, SSNs, credit cards, IPs, and URLs
  • 🔧 Highly Configurable: Custom patterns, callbacks, and replacement strategies
  • 📊 Realistic Fake Data: Generate believable replacements using the Faker library
  • 🌳 Deep Traversal: Handle complex nested data structures automatically
  • ⚡ Lightweight: Minimal external requirements
  • 🛡️ Type Safe: Full type hints for a better development experience
  • 📋 Path Targeting: Precisely target specific data paths for replacement

🎯 Why doubletake?

The Problem: You have sensitive data in complex structures that needs to be anonymized for testing, logging, or compliance, but existing solutions are either too rigid, too slow, or don't handle your specific use cases.

The Solution: doubletake provides intelligent PII detection with multiple processing strategies, letting you choose the perfect balance of performance and flexibility for your needs.

🚀 Quick Start

Installation

pip install doubletake
# or
pipenv install doubletake
# or
poetry add doubletake

Basic Usage

from doubletake import DoubleTake

# Initialize with default settings
db = DoubleTake()

# Your data with PII
data = [
    {
        "user_id": 12345,
        "name": "John Doe",
        "email": "john.doe@company.com",
        "phone": "555-123-4567",
        "ssn": "123-45-6789"
    },
    {
        "customer": {
            "contact": "jane@example.org",
            "billing": {
                "card": "4532-1234-5678-9012",
                "address": "123 Main St"
            }
        }
    }
]

# Replace PII automatically
masked_data = db.mask_data(data)

print(masked_data)
# Output:
# [
#   {
#     "user_id": 12345,
#     "name": "John Doe", 
#     "email": "****@******.***",
#     "phone": "***-***-****",
#     "ssn": "***-**-****"
#   },
#   ...
# ]

🔧 Advanced Configuration

Using Realistic Fake Data

from doubletake import DoubleTake

# Generate realistic fake data instead of asterisks
db = DoubleTake(use_faker=True)

masked_data = db.mask_data(data)
# Emails become: sarah.johnson@example.net
# Phones become: +1-555-234-5678  
# SSNs become: 987-65-4321

Custom Replacement Logic

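from typing import Any, List

from doubletake import DoubleTake

def custom_replacer(pattern_key: str, replacement: str, item: Any, key: str, breadcrumbs: List[str]) -> str:
    """Custom replacement with full context"""
    if pattern_key == 'email':
        return "***REDACTED_EMAIL***"
    if pattern_key == 'ssn':
        return "XXX-XX-XXXX"
    # str() guards against a TypeError when the value is not a string
    if 'secret' in str(item[key]):
        return "***CLASSIFIED***"
    # Fall back to the default replacement chosen by the library
    return replacement

db = DoubleTake(callback=custom_replacer)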
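
Because the callback is a plain function, it can be exercised on its own before being passed to `DoubleTake`. The sketch below uses only the signature shown above; the function `keep_last_four` and the sample record are hypothetical, and it assumes that `breadcrumbs` holds the keys leading to `item`, which this README does not spell out.

```python
from typing import Any, List

def keep_last_four(pattern_key: str, replacement: str, item: Any, key: str, breadcrumbs: List[str]) -> str:
    """Show the last four SSN digits under 'billing'; otherwise use the default mask."""
    if pattern_key == 'ssn' and 'billing' in breadcrumbs:
        digits = [c for c in str(item[key]) if c.isdigit()]
        return '***-**-' + ''.join(digits[-4:])
    return replacement

record = {'ssn': '123-45-6789'}
# Called by hand here, with the arguments a traversal would supply.
print(keep_last_four('ssn', '***-**-****', record, 'ssn', ['customer', 'billing']))  # ***-**-6789
print(keep_last_four('ssn', '***-**-****', record, 'ssn', ['customer']))             # ***-**-****
```

Testing the callback this way keeps the replacement rules separate from the traversal that invokes them.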

Targeting Specific Patterns

# Only replace certain types, allow others through
db = DoubleTake(
    allowed=['email'],  # Don't replace emails
    extras=[r'CUST-\d+', r'REF-[A-Z]{3}-\d{4}']  # Custom patterns
)
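
The `extras` entries are ordinary regular expressions, so a custom pattern can be checked with Python's `re` module before it is handed to `DoubleTake`. This is a standard-library sketch only: the helper `find_matches` and the sample string are made up for illustration, and it assumes the library searches for a pattern anywhere inside a string.

```python
import re
from typing import List

# The two custom patterns from the example above.
EXTRAS = [r'CUST-\d+', r'REF-[A-Z]{3}-\d{4}']

def find_matches(text: str, patterns: List[str]) -> List[str]:
    """Return every substring of `text` matched by any pattern, in pattern order."""
    found = []
    for pattern in patterns:
        found.extend(re.findall(pattern, text))
    return found

sample = "Order for CUST-00123 filed under REF-ABC-2024 (see REF-ab-1)"
print(find_matches(sample, EXTRAS))  # ['CUST-00123', 'REF-ABC-2024']
```

The lowercase `REF-ab-1` is not matched, which shows how strict the second pattern is about case and digit count.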

Precise Path Targeting

# Only replace PII at specific data paths
db = DoubleTake(
    known_paths=[
        'customer.email',
        'billing.ssn', 
        'contacts.emergency.phone'
    ]
)
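
To make the idea of a dotted path concrete, here is a minimal standard-library sketch that masks values at such paths in a nested dictionary. It is not the library's implementation: the helper `mask_at_paths` is hypothetical, and it assumes each path is resolved key by key from the root of a record.

```python
from copy import deepcopy
from typing import Any, Dict, List

def mask_at_paths(record: Dict[str, Any], paths: List[str], mask: str = "***") -> Dict[str, Any]:
    """Return a copy of `record` with the value at each dotted path replaced by `mask`."""
    result = deepcopy(record)
    for path in paths:
        *parents, leaf = path.split(".")
        node: Any = result
        for key in parents:
            # Stop early when the path leaves the structure.
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if isinstance(node, dict) and leaf in node:
            node[leaf] = mask
    return result

record = {
    "customer": {"email": "jane@example.org", "note": "contact jane@example.org"},
    "billing": {"ssn": "123-45-6789"},
}
masked = mask_at_paths(record, ["customer.email", "billing.ssn", "contacts.emergency.phone"])
print(masked)
```

Only the targeted values change: the email inside `customer.note` is left alone, and the path that does not exist in the record is skipped.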

🏗️ Architecture

doubletake offers three complementary processing strategies:

🚀 JSONGrepper (High Performance)

  • Best for: Large datasets, simple replacement needs
  • Speed: ⚡ Fastest option
  • Method: JSON serialization + regex replacement
  • Trade-offs: Less flexibility, no custom callbacks
# Automatically chosen when no custom logic needed
db = DoubleTake()  # Uses JSONGrepper internally
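
The "JSON serialization + regex replacement" method can be illustrated with the standard library alone. The sketch below is an assumption about the general technique, not JSONGrepper's actual code: the function `grep_and_mask` is hypothetical and the two regexes are simplified stand-ins.

```python
import json
import re
from typing import Any

# Simplified stand-in patterns (assumptions, not the library's own regexes).
EMAIL = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')
SSN = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')

def grep_and_mask(data: Any) -> Any:
    """Serialize to JSON, mask matches in the text, then parse the result back."""
    text = json.dumps(data)
    text = EMAIL.sub('****@******.***', text)
    text = SSN.sub('***-**-****', text)
    return json.loads(text)

records = [
    {"user_id": 12345, "email": "john.doe@company.com", "ssn": "123-45-6789"},
    {"customer": {"contact": "jane@example.org"}},
]
print(grep_and_mask(records))
```

One pass over a single string is fast, but the regexes see only text: they cannot tell a key from a value or know where in the structure a match sits, which fits the trade-offs listed above.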

🔧 StringReplacer (Basic Functionality)

  • Best for: Simple string processing, single-level data structures
  • Speed: 🐰 Moderate performance for straightforward replacements
  • Method: Direct string pattern matching and replacement
  • Features: Basic pattern detection, simple replacements, lightweight processing
  • Trade-offs: No deep traversal, limited to string-to-string operations
# Used for basic string replacement scenarios
db = DoubleTake(use_faker=True)  # Uses StringReplacer for simple string input
db = DoubleTake(replace_with='x')  # Uses StringReplacer for simple string input
# Example of simple string input:
# ['some log with your phone: 111-333-4444', 'some log with your ssn: 123-45-6789']

🌳 DataWalker (Maximum Flexibility)

  • Best for: Complex logic, custom callbacks, path targeting
  • Speed: 🐢 Slower but more capable
  • Method: Recursive tree traversal
  • Features: Full context, breadcrumbs, custom callbacks
# Automatically chosen when using advanced features
db = DoubleTake(use_faker=True)  # Uses DataWalker
db = DoubleTake(callback=my_func)  # Uses DataWalker
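
Recursive tree traversal with breadcrumbs can be sketched as follows. This is an illustration of the technique under stated assumptions, not DataWalker's code: the function `walk` is hypothetical, the single SSN regex is a stand-in, and whether the library's breadcrumbs include the current key is not stated in this README (here they do not).

```python
import re
from typing import Any, Callable, List, Optional

SSN = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')  # simplified stand-in pattern

def walk(item: Any, callback: Optional[Callable] = None,
         breadcrumbs: Optional[List[str]] = None) -> None:
    """Mask SSN-like strings in place, visiting nested dicts and lists recursively."""
    crumbs = breadcrumbs or []
    keys = list(item.keys()) if isinstance(item, dict) else list(range(len(item)))
    for key in keys:
        value = item[key]
        if isinstance(value, (dict, list)):
            # Descend, extending the breadcrumb trail with the current key.
            walk(value, callback, crumbs + [str(key)])
        elif isinstance(value, str) and SSN.search(value):
            default = SSN.sub('***-**-****', value)
            # The callback sees the same arguments as the signature documented above.
            item[key] = callback('ssn', default, item, key, crumbs) if callback else default

data = {"patient": {"ids": ["123-45-6789", "n/a"], "name": "Ann"}}
seen = []

def record_path(pattern_key, replacement, item, key, breadcrumbs):
    seen.append((pattern_key, breadcrumbs + [str(key)]))
    return replacement

walk(data, record_path)
print(data)
print(seen)
```

Visiting every node one at a time costs more than a single regex pass over serialized text, but each match arrives with its container, key, and path, which is what makes callbacks and path targeting possible.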

📊 Built-in PII Patterns

Pattern      | Description                | Example
email        | Email addresses            | user@domain.com
phone        | Phone numbers (US formats) | 555-123-4567, (555) 123-4567
ssn          | Social Security Numbers    | 123-45-6789, 123456789
credit_card  | Credit card numbers        | 4532-1234-5678-9012
ip_address   | IPv4 addresses             | 192.168.1.1
url          | HTTP/HTTPS URLs            | https://example.com/path
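
To get a feel for what each pattern type covers, the examples in the table can be checked against approximate regular expressions. The regexes below are written for illustration only and are an assumption: the library's own patterns are not shown in this README and may accept more or fewer formats.

```python
import re

# Approximate regexes, keyed by the pattern names from the table.
APPROX_PATTERNS = {
    'email': r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+',
    'phone': r'(?:\(\d{3}\)\s?|\b\d{3}[-.\s])\d{3}[-.\s]\d{4}\b',
    'ssn': r'\b\d{3}-?\d{2}-?\d{4}\b',
    'credit_card': r'\b(?:\d{4}[- ]?){3}\d{4}\b',
    'ip_address': r'\b(?:\d{1,3}\.){3}\d{1,3}\b',
    'url': r'https?://\S+',
}

# The example values listed in the table.
TABLE_EXAMPLES = {
    'email': ['user@domain.com'],
    'phone': ['555-123-4567', '(555) 123-4567'],
    'ssn': ['123-45-6789', '123456789'],
    'credit_card': ['4532-1234-5678-9012'],
    'ip_address': ['192.168.1.1'],
    'url': ['https://example.com/path'],
}

def matches_fully(pattern_key: str, text: str) -> bool:
    """True when the whole of `text` matches the approximate pattern."""
    return re.fullmatch(APPROX_PATTERNS[pattern_key], text) is not None

for key, examples in TABLE_EXAMPLES.items():
    for example in examples:
        print(key, example, matches_fully(key, example))
```

Values that fall outside these shapes, such as internal customer IDs, are the case the `extras` option is meant for.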

🎛️ Configuration Options

db = DoubleTake(
    use_faker=False,           # Use fake data vs asterisks
    callback=None,             # Custom replacement function
    allowed=[],                # Pattern types to skip
    extras=[],                 # Additional regex patterns  
    known_paths=[],            # Specific paths to target
    replace_with='*',          # Character for replacements
    maintain_length=False      # Preserve original string length
)

🧪 Real-World Examples

API Response Sanitization

# Sanitize API responses for logging
api_response = {
    "status": "success",
    "data": {
        "users": [
            {"id": 1, "email": "user1@corp.com", "role": "admin"},
            {"id": 2, "email": "user2@corp.com", "role": "user"}
        ]
    },
    "metadata": {"request_ip": "203.0.113.42"}
}

db = DoubleTake()
safe_response = db.mask_data([api_response])[0]
# Safe to log without exposing PII

Database Export Anonymization

# Anonymize database exports for development
db_records = [
    {"patient_id": "PT001", "ssn": "123-45-6789", "email": "patient@email.com"},
    {"patient_id": "PT002", "ssn": "987-65-4321", "email": "another@email.com"}
]

db = DoubleTake(
    use_faker=True,
    allowed=[],  # Replace all PII types
)

anonymized_records = db.mask_data(db_records)
# Safe for development environments

Configuration File Sanitization

# Mask email addresses at known paths in a config structure
config = {
    "database": {
        "host": "db.company.com",
        "admin_email": "admin@company.com"
    },
    "api_keys": {
        "stripe": "sk_live_abcd1234...",
        "support_email": "support@company.com"
    }
}

db = DoubleTake(known_paths=['database.admin_email', 'api_keys.support_email'])
sanitized_config = db.mask_data([config])[0]

Log Sanitization

# Remove PII from log lines
logs = [
    "Please contact our support team at support@company.com or call +1-555-SUPPORT",
    "Your SSN 123-45-6789 has been verified. Email confirmation sent to user@domain.com",
]

db = DoubleTake()
sanitized_logs = db.mask_data(logs)  # logs is already a list of strings

🔬 Performance & Testing

doubletake includes comprehensive tests with 100% coverage:

# Run tests
pipenv run test

# Run with coverage
pipenv run pytest --cov=doubletake tests/

Performance Benchmarks (10,000 records):

  • JSONGrepper: ~0.1s (simple patterns)
  • DataWalker: ~0.3s (with fake data generation)

🤝 Contributing

We welcome contributions! Please see our contributing guidelines for details.

# Development setup
git clone https://github.com/paulcruse3/doubletake.git
cd doubletake
pipenv install --dev
pipenv run test

📄 License

MIT License - see LICENSE file for details.

Made with ❤️ for data privacy and security

Download files

Source Distribution

doubletake-1.0.2.tar.gz (42.0 kB view details)

Uploaded Source

Built Distribution

doubletake-1.0.2-py3-none-any.whl (48.9 kB view details)

Uploaded Python 3

File details

Details for the file doubletake-1.0.2.tar.gz.

File metadata

  • Download URL: doubletake-1.0.2.tar.gz
  • Size: 42.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.9.17

File hashes

Hashes for doubletake-1.0.2.tar.gz
Algorithm   | Hash digest
SHA256      | c50c8c845dbd03cd8ec45a0ae17304c6efa4e8798f97e8e5eb0c7d790e007634
MD5         | ce793c87c9f914d4a745bcacea2d1c82
BLAKE2b-256 | 43149ed01945b2225e918ed68a3ff630edfdb7c060282137a8c8ab4366cecaf6

File details

Details for the file doubletake-1.0.2-py3-none-any.whl.

File metadata

  • Download URL: doubletake-1.0.2-py3-none-any.whl
  • Size: 48.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.9.17

File hashes

Hashes for doubletake-1.0.2-py3-none-any.whl
Algorithm   | Hash digest
SHA256      | 530f6d9b59339a45cae009cc96c747a76c0ea273a496a2e041cf70bdc3879210
MD5         | 4aa6f8d7989b6dc6126aa8d053ea32d5
BLAKE2b-256 | 83f7bdaccb9e13ba7e08b071971218d3a663db0f584897c07960435684e48209
