Skip to main content

Secure anonymization/de-anonymization library for PII data

Project description

anonymask

PyPI Version Python Versions License Downloads

Secure anonymization/de-anonymization library for protecting Personally Identifiable Information (PII) in Python applications. Built with Rust for maximum performance.

โœจ Features

  • ๐Ÿš€ Blazing Fast: Rust-powered core with < 5ms processing time
  • ๐Ÿ” Comprehensive Detection: EMAIL, PHONE, SSN, CREDIT_CARD, IP_ADDRESS, URL
  • ๐Ÿ”’ Secure Placeholders: Deterministic UUID-based anonymization
  • ๐Ÿ Pythonic API: Clean, intuitive Python interface
  • โšก Zero Dependencies: No external runtime dependencies
  • ๐Ÿงต Thread-Safe: Safe for concurrent use in multi-threaded applications

๐Ÿ“ฆ Installation

pip install anonymask

๐Ÿš€ Quick Start

from anonymask import Anonymizer

# Initialize with desired entity types
anonymizer = Anonymizer(['email', 'phone', 'ssn'])

# Anonymize text
text = "Contact john@email.com or call 555-123-4567. SSN: 123-45-6789"
result = anonymizer.anonymize(text)

# Result is a tuple: (anonymized_text, mapping, entities)
print(result[0])
# "Contact EMAIL_xxx or call PHONE_xxx. SSN: SSN_xxx"

print(result[1])
# {'EMAIL_xxx': 'john@email.com', 'PHONE_xxx': '555-123-4567', 'SSN_xxx': '123-45-6789'}

print(result[2])
# [
#   {'entity_type': 'email', 'value': 'john@email.com', 'start': 8, 'end': 22},
#   {'entity_type': 'phone', 'value': '555-123-4567', 'start': 31, 'end': 43},
#   {'entity_type': 'ssn', 'value': '123-45-6789', 'start': 50, 'end': 60}
# ]

# Deanonymize back to original
original = anonymizer.deanonymize(result[0], result[1])
print(original)
# "Contact john@email.com or call 555-123-4567. SSN: 123-45-6789"

๐ŸŽฏ Supported Entity Types

Type Description Examples
email Email addresses user@domain.com, john.doe@company.co.uk
phone Phone numbers 555-123-4567, 555-123, (555) 123-4567, 555.123.4567
ssn Social Security Numbers 123-45-6789, 123456789
credit_card Credit card numbers 1234-5678-9012-3456, 1234567890123456
ip_address IP addresses 192.168.1.1, 2001:0db8:85a3::8a2e:0370:7334
url URLs https://example.com, http://sub.domain.org/path

๐Ÿ“š API Reference

Constructor

anonymizer = Anonymizer(entity_types: List[str])
  • entity_types: List of entity types to detect (see supported types above)

Methods

anonymize(text: str) -> Tuple[str, Dict[str, str], List[Dict]]

Anonymizes the input text using automatic detection and returns detailed result.

Returns:

  • str: Text with PII replaced by placeholders
  • Dict[str, str]: Placeholder -> original value mapping
  • List[Dict]: Array of detected entities with metadata

Each entity dictionary contains:

  • entity_type: Type of entity (email, phone, etc.)
  • value: Original detected value
  • start: Start position in original text
  • end: End position in original text

anonymize_with_custom(text: str, custom_entities: Optional[Dict[str, List[str]]] = None) -> Tuple[str, Dict[str, str], List[Dict]]

Anonymizes the input text using both automatic detection and custom entities.

Parameters:

  • text: The input text to anonymize
  • custom_entities: Optional dictionary mapping entity types to lists of custom values

Example:

custom_entities = {
    "email": ["secret@company.com", "admin@internal.org"],
    "phone": ["555-999-0000"]
}

result = anonymizer.anonymize_with_custom(text, custom_entities)

deanonymize(text: str, mapping: Dict[str, str]) -> str

Restores original text using the provided mapping.

๐Ÿ’ก Use Cases

RAG Applications

from anonymask import Anonymizer
import chromadb  # or any vector store

class SecureRAG:
    def __init__(self):
        self.anonymizer = Anonymizer(['email', 'phone', 'ssn', 'credit_card'])
        self.vector_store = chromadb.Client()
    
    def add_document(self, doc_id: str, text: str):
        # Anonymize before storing
        result = self.anonymizer.anonymize(text)
        safe_text = result[0]
        
        # Store anonymized text and mapping
        self.vector_store.add(
            documents=[safe_text],
            metadatas=[{'mapping': result[1], 'entities': result[2]}],
            ids=[doc_id]
        )
    
    def query(self, query: str):
        # Anonymize query
        result = self.anonymizer.anonymize(query)
        safe_query = result[0]
        
        # Search with anonymized query
        results = self.vector_store.query(query_texts=[safe_query])
        
        # Deanonymize results
        deanonymized_results = []
        for doc, metadata in zip(results['documents'][0], results['metadatas'][0]):
            original = self.anonymizer.deanonymize(doc, metadata['mapping'])
            deanonymized_results.append(original)
        
        return deanonymized_results

FastAPI Integration

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from anonymask import Anonymizer

app = FastAPI()
anonymizer = Anonymizer(['email', 'phone', 'ssn'])

class TextInput(BaseModel):
    text: str

class AnonymizedOutput(BaseModel):
    anonymized_text: str
    entities_count: int
    entities: list

@app.post("/anonymize", response_model=AnonymizedOutput)
async def anonymize_text(input_data: TextInput):
    try:
        result = anonymizer.anonymize(input_data.text)
        return AnonymizedOutput(
            anonymized_text=result[0],
            entities_count=len(result[2]),
            entities=result[2]
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.post("/deanonymize")
async def deanonymize_text(anonymized_text: str, mapping: dict):
    try:
        original = anonymizer.deanonymize(anonymized_text, mapping)
        return {"original_text": original}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

Data Processing Pipeline

import pandas as pd
from anonymask import Anonymizer
from typing import List, Dict

class DataProcessor:
    def __init__(self, entity_types: List[str]):
        self.anonymizer = Anonymizer(entity_types)
    
    def process_dataframe(self, df: pd.DataFrame, text_column: str) -> pd.DataFrame:
        """Process a pandas DataFrame with text data"""
        processed_data = []
        
        for idx, row in df.iterrows():
            text = row[text_column]
            result = self.anonymizer.anonymize(text)
            
            processed_row = row.copy()
            processed_row['original_text'] = text
            processed_row['anonymized_text'] = result[0]
            processed_row['pii_mapping'] = result[1]
            processed_row['entities_found'] = len(result[2])
            processed_row['entities'] = result[2]
            
            processed_data.append(processed_row)
        
        return pd.DataFrame(processed_data)
    
    def batch_process(self, texts: List[str]) -> List[Dict]:
        """Process a list of texts"""
        results = []
        
        for text in texts:
            result = self.anonymizer.anonymize(text)
            results.append({
                'original': text,
                'anonymized': result[0],
                'mapping': result[1],
                'entities': result[2],
                'entity_count': len(result[2])
            })
        
        return results

# Usage example
processor = DataProcessor(['email', 'phone', 'ssn'])
df = pd.read_csv('customer_data.csv')
processed_df = processor.process_dataframe(df, 'customer_message')

LLM Integration

from anonymask import Anonymizer
import openai

class SecureLLMClient:
    def __init__(self, api_key: str):
        self.client = openai.OpenAI(api_key=api_key)
        self.anonymizer = Anonymizer(['email', 'phone', 'ssn', 'credit_card'])
    
    def secure_chat_completion(self, messages: list, custom_entities: dict = None) -> str:
        # Anonymize all user messages
        anonymized_messages = []
        mappings = []
        
        for message in messages:
            if message['role'] == 'user':
                if custom_entities:
                    result = self.anonymizer.anonymize_with_custom(message['content'], custom_entities)
                else:
                    result = self.anonymizer.anonymize(message['content'])
                anonymized_messages.append({
                    'role': 'user',
                    'content': result[0]
                })
                mappings.append(result[1])
            else:
                anonymized_messages.append(message)
        
        # Get LLM response
        response = self.client.chat.completions.create(
            model="gpt-4",
            messages=anonymized_messages
        )
        
        llm_response = response.choices[0].message.content
        
        # Deanonymize response using the last mapping
        if mappings:
            safe_response = self.anonymizer.deanonymize(llm_response, mappings[-1])
        else:
            safe_response = llm_response
        
        return safe_response

Custom Entity Anonymization

from anonymask import Anonymizer

# Initialize with basic detection
anonymizer = Anonymizer(['email'])

# Define custom entities to anonymize
custom_entities = {
    'email': ['internal@company.com', 'admin@secure.org'],
    'phone': ['555-999-0000', '555-888-1111'],
    # You can even specify entity types not in the initial list
    'ssn': ['123-45-6789']
}

text = "Contact internal@company.com or call 555-999-0000"
result = anonymizer.anonymize_with_custom(text, custom_entities)

print(result[0])
# "Contact EMAIL_xxx or call PHONE_xxx"

print(result[1])
# {'EMAIL_xxx': 'internal@company.com', 'PHONE_xxx': '555-999-0000'}

๐Ÿงช Testing

# Install development dependencies
pip install pytest

# Run tests
pytest tests/test_anonymask.py -v

# Run tests with coverage
pytest tests/test_anonymask.py --cov=anonymask --cov-report=html

๐Ÿ”ง Development

Building from Source

  1. Prerequisites:

    • Python 3.8+
    • Rust (latest stable)
    • Maturin
  2. Setup:

# Clone the repository
git clone https://github.com/gokul-viswanathan/anonymask.git
cd anonymask/anonymask-py

# Install development dependencies
pip install maturin pytest

# Build the package in development mode
maturin develop

# Run tests
pytest tests/
  1. Build for Release:
# Build wheel and source distribution
maturin build --release --sdist

# The built wheels will be in target/wheels/

Project Structure

anonymask-py/
โ”œโ”€โ”€ src/
โ”‚   โ””โ”€โ”€ lib.rs              # Rust PyO3 bindings
โ”œโ”€โ”€ python/
โ”‚   โ””โ”€โ”€ anonymask/
โ”‚       โ”œโ”€โ”€ __init__.py     # Python package interface
โ”‚       โ””โ”€โ”€ _anonymask.so   # Compiled Rust extension
โ”œโ”€โ”€ tests/
โ”‚   โ””โ”€โ”€ test_anonymask.py   # Test suite
โ”œโ”€โ”€ pyproject.toml          # Python package configuration
โ”œโ”€โ”€ Cargo.toml              # Rust project configuration
โ””โ”€โ”€ README.md

๐Ÿ—๏ธ Architecture

This package uses PyO3 to create high-performance Python bindings from the Rust core library:

Python โ†’ PyO3 โ†’ Rust Core โ†’ Native Performance

The Rust core provides:

  • Memory Safety: No buffer overflows or memory leaks
  • Performance: Near-native execution speed
  • Concurrency: Thread-safe operations
  • Reliability: Robust error handling

๐Ÿ“Š Performance

  • Processing Speed: < 5ms for typical messages (< 500 words)
  • Memory Usage: Minimal footprint with zero-copy operations
  • Startup Time: Fast initialization with lazy loading
  • Concurrency: Safe for use in multi-threaded environments

๐Ÿ”’ Security

  • Cryptographically Secure: UUID v4 for unique placeholder generation
  • Deterministic: Same input always produces same output
  • No Data Leakage: Secure handling of PII throughout the process
  • Input Validation: Comprehensive validation and error handling

๐Ÿ“„ License

MIT License - see LICENSE file for details.

๐Ÿค Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Add tests for new functionality
  4. Ensure all tests pass (pytest)
  5. Follow the existing code style
  6. Submit a pull request

๐Ÿ—บ๏ธ Roadmap

  • Async API support
  • Streaming API for large texts
  • Custom entity pattern support
  • Persistent mapping storage
  • Performance optimizations
  • Additional entity types

๐Ÿ“ž Support


Version: 0.4.5 | Built with โค๏ธ using Rust and PyO3

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

anonymask-1.1.0.tar.gz (22.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

anonymask-1.1.0-cp311-cp311-manylinux_2_34_x86_64.whl (1.0 MB view details)

Uploaded CPython 3.11manylinux: glibc 2.34+ x86-64

File details

Details for the file anonymask-1.1.0.tar.gz.

File metadata

  • Download URL: anonymask-1.1.0.tar.gz
  • Upload date:
  • Size: 22.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for anonymask-1.1.0.tar.gz
Algorithm Hash digest
SHA256 55b0f06780c9975a8683b24908b392ea84b7990cf5693a3141ee18dd13d7be4f
MD5 6f10494195e05273a97ad125fd8443f7
BLAKE2b-256 ecbe22e4445fab88fd0f03d7458fe704c78e6f248177fb5bfad7bf047548bd36

See more details on using hashes here.

File details

Details for the file anonymask-1.1.0-cp311-cp311-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for anonymask-1.1.0-cp311-cp311-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 388669fff2dc96fcc095e1af941fec618e33948fda40fb74a9832aaee94e63fb
MD5 900ba06cb288bb6067a7869185d56707
BLAKE2b-256 424249fa031265fe0273e589a2d01df8cd492eaa99514926bf6faf1098ffa18c

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page