A comprehensive PII redaction and reverse mapping library

SecretStuff

A production-ready Python library for identifying and redacting personally identifiable information (PII) in text documents using a GLiNER NLP model, and for restoring the original values afterwards.

Features

  • PII Identification: Uses a GLiNER model to identify 150+ types of PII, including names, addresses, phone numbers, government IDs, and more
  • Flexible Redaction: Replace identified PII with configurable dummy values while preserving document structure
  • Reverse Mapping: Restore original PII in redacted text using the replacement mapping saved during redaction
  • Modular Architecture: Use components independently or through a unified pipeline
  • Extensive Coverage: Comprehensive support for Indian and international PII types
  • Production Ready: Type hints, comprehensive tests, and robust error handling
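
The redact-then-restore idea behind these features can be illustrated without the library. The following is a minimal sketch, not SecretStuff's implementation: `redact`, `restore`, and the sample mapping are hypothetical, and it assumes the mapping goes from original value to placeholder.

```python
from typing import Dict


def redact(text: str, mapping: Dict[str, str]) -> str:
    """Replace each original value with its placeholder."""
    # Longest values first, so a short value never clobbers part of a longer one.
    for original in sorted(mapping, key=len, reverse=True):
        text = text.replace(original, mapping[original])
    return text


def restore(text: str, mapping: Dict[str, str]) -> str:
    """Invert the mapping and put the original values back."""
    reverse = {placeholder: original for original, placeholder in mapping.items()}
    for placeholder in sorted(reverse, key=len, reverse=True):
        text = text.replace(placeholder, reverse[placeholder])
    return text


# Hypothetical mapping of original value -> placeholder, for illustration only.
mapping = {"John Doe": "[PERSON_A]", "john.doe@email.com": "[EMAIL_A]"}
source = "John Doe can be reached at john.doe@email.com."

redacted = redact(source, mapping)
print(redacted)  # [PERSON_A] can be reached at [EMAIL_A].
print(restore(redacted, mapping) == source)  # True
```

Restoration only works when every placeholder is unique, which is why the mapping has to be kept alongside the redacted text.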

Installation

pip install secretstuff

Quick Start

Simple Pipeline Usage

from secretstuff import SecretStuffPipeline

# Initialize pipeline
pipeline = SecretStuffPipeline()

# Your sensitive text
text = """
Mr. John Doe lives at 123 Main Street, New York.
His phone number is +1-555-123-4567 and email is john.doe@email.com.
His Aadhaar number is 1234 5678 9012 and PAN is ABCDE1234F.
"""

# Identify and redact PII in one step
redacted_text, entities, mapping = pipeline.identify_and_redact(text)

print("Redacted:", redacted_text)
print("Found entities:", entities)

Step-by-Step Process

from secretstuff import SecretStuffPipeline

pipeline = SecretStuffPipeline()

# Step 1: Identify PII (`text` is the sample string from the previous example)
entities = pipeline.identify_pii(text)
print("Identified PII:", entities)

# Step 2: Redact PII
redacted_text = pipeline.redact_pii(text)
print("Redacted text:", redacted_text)

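# Step 3: After cloud LLM processing, reverse the redaction.
# `processed_text` stands for the LLM's output; the redacted text is reused here as a placeholder.
processed_text = redacted_text
restored_text, count, details = pipeline.reverse_redaction(processed_text)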
print("Restored text:", restored_text)

File Processing

# Process files
result = pipeline.process_text_file(
    input_file="document.txt",
    output_redacted="redacted_document.txt",
    output_identified="identified_entities.json",
    output_mapping="replacement_mapping.json"
)

# Later, reverse the redaction
reverse_result = pipeline.reverse_from_files(
    redacted_file="processed_document.txt",  # After LLM processing
    mapping_file="replacement_mapping.json",
    output_file="final_document.txt"
)

Component Usage

Individual Components

from secretstuff import PIIIdentifier, PIIRedactor, ReverseMapper

# Use components individually
identifier = PIIIdentifier()
redactor = PIIRedactor()
reverse_mapper = ReverseMapper()

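# Identify PII (returns a list of entity dicts, per the API reference)
entities = identifier.identify_entities(text)

# Group the entities by label, the Dict form the redactor's signature expects
entity_mapping = identifier.create_entity_mapping(entities)

# Redact PII
redacted = redactor.redact_from_identified_entities(text, entity_mapping)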

# Reverse redaction
reverse_mapper.set_replacement_mapping(redactor.get_replacement_mapping())
restored, count, details = reverse_mapper.reverse_redaction(redacted)

Custom Configuration

from secretstuff import SecretStuffPipeline

# Custom labels and dummy values
custom_labels = ["person", "email", "phone number", "custom_entity"]
custom_dummy_values = {
    "person": ["[PERSON_A]", "[PERSON_B]", "[PERSON_C]"],
    "email": "[EMAIL_REDACTED]",
    "custom_entity": "[CUSTOM_REDACTED]"
}

pipeline = SecretStuffPipeline(
    labels=custom_labels,
    dummy_values=custom_dummy_values
)

# Or configure after initialization
pipeline.configure_labels(custom_labels)
pipeline.configure_dummy_values(custom_dummy_values)

Supported PII Types

SecretStuff identifies 150+ types of PII including:

Personal Information

  • Names, addresses, phone numbers, email addresses
  • Dates of birth, ages, places of birth
  • Family relationships (father's name, mother's name, etc.)

Government IDs (India)

  • Aadhaar numbers, PAN numbers, Voter IDs
  • Passport numbers, driving licenses
  • Various state and central government IDs

Financial Information

  • Bank account numbers, IFSC codes, UPI IDs
  • Credit/debit card numbers, cheque numbers
  • GST numbers, tax identification numbers

Legal & Court Documents

  • Case numbers, FIR numbers, court order numbers
  • CNR numbers, filing numbers, petition numbers

Corporate Information

  • CIN numbers, trade license numbers
  • Professional registration numbers

Technical Identifiers

  • IP addresses, MAC addresses, device serial numbers
  • IMEI numbers, device identifiers

...and more.

API Reference

SecretStuffPipeline

The main interface for all operations:

class SecretStuffPipeline:
    def identify_pii(self, text: str, chunk_size: int = 384) -> Dict[str, List[str]]: ...
    def redact_pii(self, text: str, entities: Optional[Dict] = None) -> str: ...
    def identify_and_redact(self, text: str) -> Tuple[str, Dict, Dict]: ...
    def reverse_redaction(self, redacted_text: str, mapping: Optional[Dict] = None) -> Tuple[str, int, Dict]: ...
    def process_text_file(self, input_file: str, **kwargs) -> Dict: ...
    def reverse_from_files(self, redacted_file: str, mapping_file: str, output_file: str) -> Dict: ...
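
`identify_pii` takes a `chunk_size` (default 384), which suggests that long inputs are split before they reach the model. The unit is not stated above, so the sketch below assumes whitespace-separated words; `chunk_words` is a hypothetical helper, not part of SecretStuff.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 384) -> List[str]:
    """Split text into pieces of at most `chunk_size` whitespace-separated words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


chunks = chunk_words("one two three four five", chunk_size=2)
print(chunks)  # ['one two', 'three four', 'five']
```

A plain split like this can cut an entity in half at a chunk boundary, which is one reason to experiment with the chunk size on long documents.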

PIIIdentifier

class PIIIdentifier:
    def identify_entities(self, text: str, chunk_size: int = 384) -> List[Dict]: ...
    def create_entity_mapping(self, entities: List[Dict]) -> Dict[str, List[str]]: ...
    def add_custom_labels(self, labels: List[str]) -> None: ...
    def set_labels(self, labels: List[str]) -> None: ...
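
The signatures above imply a conversion from a list of entity dicts to a dictionary keyed by label. A minimal sketch of that grouping follows; the `"text"` and `"label"` keys are an assumption about the entity format, and `group_by_label` is a hypothetical stand-in, not the library's `create_entity_mapping`.

```python
from typing import Dict, List


def group_by_label(entities: List[Dict]) -> Dict[str, List[str]]:
    """Group entity texts under their label, dropping repeated values."""
    grouped: Dict[str, List[str]] = {}
    for entity in entities:
        values = grouped.setdefault(entity["label"], [])
        if entity["text"] not in values:
            values.append(entity["text"])
    return grouped


# Hypothetical identifier output, for illustration only.
found = [
    {"text": "John Doe", "label": "person"},
    {"text": "ABCDE1234F", "label": "pan number"},
    {"text": "John Doe", "label": "person"},
]
print(group_by_label(found))
# {'person': ['John Doe'], 'pan number': ['ABCDE1234F']}
```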

PIIRedactor

class PIIRedactor:
    def create_replacement_mapping(self, entities: Dict[str, List[str]]) -> Dict[str, str]: ...
    def redact_text(self, text: str, mapping: Dict[str, str]) -> str: ...
    def redact_from_identified_entities(self, text: str, entities: Dict) -> str: ...
    def set_dummy_values(self, dummy_values: Dict) -> None: ...

ReverseMapper

class ReverseMapper:
    def reverse_redaction(self, redacted_text: str) -> Tuple[str, int, Dict]: ...
    def load_replacement_mapping(self, mapping_file: str) -> None: ...
    def validate_mapping(self) -> bool: ...
    def get_mapping_statistics(self) -> Dict: ...
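
What `validate_mapping` checks is not documented here. As an illustration of the kind of checks a mapping validator might run, the sketch below flags empty entries and placeholders shared by several originals; `find_mapping_problems` is hypothetical and assumes an original-to-placeholder mapping.

```python
from collections import Counter
from typing import Dict, List


def find_mapping_problems(mapping: Dict[str, str]) -> List[str]:
    """Return human-readable problems that would make reversal unreliable."""
    problems: List[str] = []
    for original, placeholder in mapping.items():
        if not original or not placeholder:
            problems.append(f"empty entry: {original!r} -> {placeholder!r}")
    # A placeholder used for two originals cannot be reversed unambiguously.
    for placeholder, count in Counter(mapping.values()).items():
        if placeholder and count > 1:
            problems.append(f"placeholder {placeholder!r} is shared by {count} originals")
    return problems


sample = {"John Doe": "[PERSON_A]", "Jane Roe": "[PERSON_A]", "x": ""}
for problem in find_mapping_problems(sample):
    print(problem)
```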

Advanced Usage

Custom Model

pipeline = SecretStuffPipeline(
    model_name="your-custom-gliner-model"
)

Batch Processing

# Process multiple files
files = ["doc1.txt", "doc2.txt", "doc3.txt"]
results = []

for file in files:
    result = pipeline.process_text_file(file)
    results.append(result)
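
In a longer batch, one unreadable file should not stop the rest. The sketch below wraps the loop above so that failures are recorded and processing continues; `process_batch` is a hypothetical helper, and `fake_process` stands in for `pipeline.process_text_file` so that the example runs without the model.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def process_batch(
    paths: Iterable[str], process_fn: Callable[[str], Dict]
) -> Tuple[List[Dict], Dict[str, str]]:
    """Run process_fn on every path, collecting results and per-file errors."""
    results: List[Dict] = []
    failures: Dict[str, str] = {}
    for path in paths:
        try:
            results.append(process_fn(path))
        except Exception as exc:  # keep going; record the failure for later review
            failures[path] = str(exc)
    return results, failures


def fake_process(path: str) -> Dict:
    """Stand-in for pipeline.process_text_file, used only in this example."""
    if path.endswith(".bad"):
        raise ValueError("unreadable")
    return {"input_file": path}


results, failures = process_batch(["doc1.txt", "doc2.bad"], fake_process)
print(results)   # [{'input_file': 'doc1.txt'}]
print(failures)  # {'doc2.bad': 'unreadable'}
```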

Use Cases

1. Cloud LLM Data Protection

# Before sending to cloud LLM
original_text = "Patient John Doe (DOB: 1985-03-15) visited on..."
redacted_text, entities, mapping = pipeline.identify_and_redact(original_text)

# Send redacted_text to cloud LLM
llm_response = call_cloud_llm(redacted_text)

# Restore original PII in response
final_response, _, _ = pipeline.reverse_redaction(llm_response, mapping)
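
An LLM may drop or rewrite placeholders, and any placeholder it alters cannot be restored. Before reversing, it can help to check which placeholders survived. This is a sketch under the assumption that the mapping goes from original value to placeholder; `missing_placeholders` is a hypothetical helper.

```python
from typing import Dict, List


def missing_placeholders(mapping: Dict[str, str], redacted_text: str, llm_response: str) -> List[str]:
    """Return placeholders that were sent to the LLM but are absent from its response."""
    sent = [p for p in mapping.values() if p in redacted_text]
    return [p for p in sent if p not in llm_response]


# Hypothetical values, for illustration only.
demo_mapping = {"John Doe": "[PERSON_A]", "1985-03-15": "[DOB_A]"}
sent_text = "Patient [PERSON_A] (DOB: [DOB_A]) visited."
response = "Summary: PERSON_A visited the clinic."

print(missing_placeholders(demo_mapping, sent_text, response))
# ['[PERSON_A]', '[DOB_A]']
```

A missing placeholder is not always an error, since a summary may legitimately omit a detail, but a placeholder with altered brackets or casing usually is.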

2. Document Anonymization

# Remove PII from documents permanently
entities = pipeline.identify_pii(document_text)
anonymized = pipeline.redact_pii(document_text, entities)
# Don't save the mapping for permanent anonymization

3. Data Processing Pipeline

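# Part of a larger data processing workflow
# (assumes `pipeline` was created as in the earlier examples)
import os

def process_sensitive_documents(input_dir, output_dir):
    # Make sure the output directory exists before writing to it
    os.makedirs(output_dir, exist_ok=True)
    for filename in os.listdir(input_dir):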
        input_path = os.path.join(input_dir, filename)
        output_path = os.path.join(output_dir, f"redacted_{filename}")
        
        pipeline.process_text_file(
            input_file=input_path,
            output_redacted=output_path
        )

Configuration

Environment Variables

export SECRETSTUFF_MODEL_NAME="aksman18/gliner-multi-pii-domains-v2"
export SECRETSTUFF_CHUNK_SIZE="384"
export SECRETSTUFF_CACHE_DIR="/path/to/cache"
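
If these variables need to be read in application code, a small loader is enough. The sketch below is an assumption-laden example: `load_settings` is hypothetical, and treating the values shown above as defaults is a choice made for illustration, not documented library behavior.

```python
import os
from typing import Dict, Mapping, Optional, Union


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, Union[str, int, Optional[str]]]:
    """Read SecretStuff-related settings from an environment mapping."""
    return {
        "model_name": env.get("SECRETSTUFF_MODEL_NAME", "aksman18/gliner-multi-pii-domains-v2"),
        "chunk_size": int(env.get("SECRETSTUFF_CHUNK_SIZE", "384")),
        "cache_dir": env.get("SECRETSTUFF_CACHE_DIR"),  # None when unset
    }


# An explicit mapping keeps the example deterministic; omit it to read os.environ.
settings = load_settings({"SECRETSTUFF_CHUNK_SIZE": "256"})
print(settings["chunk_size"])  # 256
```

The resulting `model_name` could then be passed to `SecretStuffPipeline(model_name=...)`, the one constructor argument for model selection shown in this document.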

Custom Configuration File

# config.py
CUSTOM_LABELS = ["person", "email", "phone", "custom_field"]
CUSTOM_DUMMY_VALUES = {
    "custom_field": "[CUSTOM_REDACTED]"
}

# main.py
from secretstuff import SecretStuffPipeline
from config import CUSTOM_LABELS, CUSTOM_DUMMY_VALUES
pipeline = SecretStuffPipeline(
    labels=CUSTOM_LABELS,
    dummy_values=CUSTOM_DUMMY_VALUES
)

Performance Considerations

  • Model Caching: The GLiNER model is cached after the first load
  • Batch Processing: Process multiple documents in batches for efficiency

Error Handling

from secretstuff import SecretStuffPipeline
from secretstuff.exceptions import SecretStuffError

try:
    pipeline = SecretStuffPipeline()
    result = pipeline.identify_and_redact(text)
except SecretStuffError as e:
    print(f"SecretStuff error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

Testing

Run the test suite:

# Install dev dependencies
pip install "secretstuff[dev]"

# Run tests
pytest

# Or, with verbose output (please run all tests before raising a PR)
python -m pytest tests/ -v

# Run with coverage
pytest --cov=secretstuff --cov-report=html

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit a pull request

License

MIT License - see LICENSE file for details.

Support

Changelog

v0.0.1

  • Initial release
  • PII identification with GLiNER
  • Flexible redaction system
  • Reverse mapping functionality
  • Comprehensive test suite
  • Production-ready API
