A comprehensive PII redaction and reverse mapping library

SecretStuff

A comprehensive, production-ready Python library for identifying, redacting, and reversing personally identifiable information (PII) in text documents using advanced NLP models.

Features

PII Identification: Uses a GLiNER model to identify 150+ types of PII, including names, addresses, phone numbers, government IDs, and more
Flexible Redaction: Replace identified PII with configurable dummy values while preserving document structure
Reverse Mapping: Restore original PII from redacted text using secure mapping files
Modular Architecture: Use components independently or through a unified pipeline
Extensive Coverage: Comprehensive support for Indian and international PII types
Production Ready: Type hints, comprehensive tests, and robust error handling
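
To make the redact-then-restore idea concrete, here is a minimal, self-contained sketch. It is not SecretStuff's implementation: a single email regex stands in for the GLiNER model, and the names `toy_redact`, `toy_restore`, and the `[EMAIL_n]` placeholder format are hypothetical and chosen only for illustration.

```python
import re
from typing import Dict, Tuple

# Toy detector: one email regex stands in for the NLP model (assumption).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def toy_redact(text: str) -> Tuple[str, Dict[str, str]]:
    """Replace each distinct email with a numbered placeholder."""
    mapping: Dict[str, str] = {}  # original value -> placeholder

    def substitute(match: "re.Match[str]") -> str:
        original = match.group(0)
        if original not in mapping:
            mapping[original] = f"[EMAIL_{len(mapping) + 1}]"
        return mapping[original]

    return EMAIL_RE.sub(substitute, text), mapping


def toy_restore(redacted: str, mapping: Dict[str, str]) -> str:
    """Put the original values back by applying the mapping in reverse."""
    for original, placeholder in mapping.items():
        redacted = redacted.replace(placeholder, original)
    return redacted


sample = "Contact john.doe@email.com or jane@example.org; john.doe@email.com is preferred."
redacted, mapping = toy_redact(sample)
restored = toy_restore(redacted, mapping)
print(redacted)
print(restored == sample)
```

Because every distinct value gets its own placeholder, the mapping can be applied in reverse without ambiguity, which is the property the reverse-mapping feature relies on.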

Installation

pip install secretstuff

Quick Start

Simple Pipeline Usage

from secretstuff import SecretStuffPipeline

# Initialize pipeline
pipeline = SecretStuffPipeline()

# Your sensitive text
text = """
Mr. John Doe lives at 123 Main Street, New York.
His phone number is +1-555-123-4567 and email is john.doe@email.com.
His Aadhaar number is 1234 5678 9012 and PAN is ABCDE1234F.
"""

# Identify and redact PII in one step
redacted_text, entities, mapping = pipeline.identify_and_redact(text)

print("Redacted:", redacted_text)
print("Found entities:", entities)

Step-by-Step Process

from secretstuff import SecretStuffPipeline

pipeline = SecretStuffPipeline()
text = "Mr. John Doe lives at 123 Main Street, New York."

# Step 1: Identify PII
entities = pipeline.identify_pii(text)
print("Identified PII:", entities)

# Step 2: Redact PII
redacted_text = pipeline.redact_pii(text)
print("Redacted text:", redacted_text)

# Step 3: After cloud LLM processing, reverse the redaction.
# processed_text stands for the LLM's output; the redacted text is reused here as a placeholder.
processed_text = redacted_text
restored_text, count, details = pipeline.reverse_redaction(processed_text)
print("Restored text:", restored_text)

File Processing

# Process files
result = pipeline.process_text_file(
    input_file="document.txt",
    output_redacted="redacted_document.txt",
    output_identified="identified_entities.json",
    output_mapping="replacement_mapping.json"
)

# Later, reverse the redaction
reverse_result = pipeline.reverse_from_files(
    redacted_file="processed_document.txt",  # After LLM processing
    mapping_file="replacement_mapping.json",
    output_file="final_document.txt"
)

Component Usage

Individual Components

from secretstuff import PIIIdentifier, PIIRedactor, ReverseMapper

# Use components individually
identifier = PIIIdentifier()
redactor = PIIRedactor()
reverse_mapper = ReverseMapper()

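# Identify PII (per the API reference, this returns a list of entity dicts)
found = identifier.identify_entities(text)
# Group the entities by label, the Dict form the redactor's signature expects
entities = identifier.create_entity_mapping(found)

# Redact PII
redacted = redactor.redact_from_identified_entities(text, entities)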

# Reverse redaction
reverse_mapper.set_replacement_mapping(redactor.get_replacement_mapping())
restored, count, details = reverse_mapper.reverse_redaction(redacted)

Custom Configuration

from secretstuff import SecretStuffPipeline

# Custom labels and dummy values
custom_labels = ["person", "email", "phone number", "custom_entity"]
custom_dummy_values = {
    "person": ["[PERSON_A]", "[PERSON_B]", "[PERSON_C]"],
    "email": "[EMAIL_REDACTED]",
    "custom_entity": "[CUSTOM_REDACTED]"
}

pipeline = SecretStuffPipeline(
    labels=custom_labels,
    dummy_values=custom_dummy_values
)

# Or configure after initialization
pipeline.configure_labels(custom_labels)
pipeline.configure_dummy_values(custom_dummy_values)
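
The configuration above mixes two shapes of dummy value: a list (`"person"`) and a single string (`"email"`). How SecretStuff assigns them internally is not shown here, so the following is only a sketch of one plausible policy: list entries are handed out in order to distinct values, a string is shared by all values of that label, and labels without an entry get a generated fallback. The function name `assign_placeholders` and the fallback format are hypothetical.

```python
from typing import Dict, List, Union

DummySpec = Union[str, List[str]]


def assign_placeholders(
    entities: Dict[str, List[str]],
    dummy_values: Dict[str, DummySpec],
) -> Dict[str, str]:
    """Map each detected value to a placeholder (hypothetical policy)."""
    mapping: Dict[str, str] = {}
    for label, values in entities.items():
        # Generated fallback for labels with no (or an empty) dummy entry.
        fallback = f"[{label.upper().replace(' ', '_')}_REDACTED]"
        spec = dummy_values.get(label) or fallback
        # dict.fromkeys de-duplicates while keeping first-seen order.
        for index, value in enumerate(dict.fromkeys(values)):
            if isinstance(spec, list):
                # Cycle through the list if values outnumber dummies.
                mapping[value] = spec[index % len(spec)]
            else:
                mapping[value] = spec
    return mapping


entities = {
    "person": ["John Doe", "Jane Roe", "John Doe"],
    "email": ["john.doe@email.com"],
    "phone number": ["+1-555-123-4567"],
}
dummy_values = {
    "person": ["[PERSON_A]", "[PERSON_B]", "[PERSON_C]"],
    "email": "[EMAIL_REDACTED]",
}
placeholders = assign_placeholders(entities, dummy_values)
print(placeholders)
```

Under this policy a shared string, or a list shorter than the number of distinct values, gives several originals the same placeholder. Such a mapping cannot be reversed unambiguously, so distinct dummy values per entity are the safer choice whenever the redaction is meant to be undone later.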

Supported PII Types

SecretStuff identifies 150+ types of PII including:

Personal Information

Names, addresses, phone numbers, email addresses
Dates of birth, ages, places of birth
Family relationships (father's name, mother's name, etc.)

Government IDs (India)

Aadhaar numbers, PAN numbers, Voter IDs
Passport numbers, driving licenses
Various state and central government IDs

Financial Information

Bank account numbers, IFSC codes, UPI IDs
Credit/debit card numbers, cheque numbers
GST numbers, tax identification numbers

Legal & Court Documents

Case numbers, FIR numbers, court order numbers
CNR numbers, filing numbers, petition numbers

Corporate Information

CIN numbers, trade license numbers
Professional registration numbers

Technical Identifiers

IP addresses, MAC addresses, device serial numbers
IMEI numbers, device identifiers

...and more.

API Reference

SecretStuffPipeline

The main interface for all operations:

from typing import Dict, List, Optional, Tuple

class SecretStuffPipeline:
    def identify_pii(self, text: str, chunk_size: int = 384) -> Dict[str, List[str]]: ...
    def redact_pii(self, text: str, entities: Optional[Dict] = None) -> str: ...
    def identify_and_redact(self, text: str) -> Tuple[str, Dict, Dict]: ...
    def reverse_redaction(self, redacted_text: str, mapping: Optional[Dict] = None) -> Tuple[str, int, Dict]: ...
    def process_text_file(self, input_file: str, **kwargs) -> Dict: ...
    def reverse_from_files(self, redacted_file: str, mapping_file: str, output_file: str) -> Dict: ...
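
The `chunk_size` parameter suggests that long inputs are split before being passed to the model. The unit (words, tokens, or characters) is not stated here, so the sketch below simply assumes whitespace-separated words. `chunk_words` is a hypothetical helper, not part of the library.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 384) -> List[str]:
    """Split text into chunks of at most chunk_size whitespace-separated words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


chunks = chunk_words("one two three four five", chunk_size=2)
print(chunks)
```

With fixed-size chunks like these, an entity that straddles a chunk boundary can be cut in two, so a smaller `chunk_size` trades lower memory use per model call against a higher chance of missing entities at the boundaries.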

PIIIdentifier

class PIIIdentifier:
    def identify_entities(self, text: str, chunk_size: int = 384) -> List[Dict]: ...
    def create_entity_mapping(self, entities: List[Dict]) -> Dict[str, List[str]]: ...
    def add_custom_labels(self, labels: List[str]) -> None: ...
    def set_labels(self, labels: List[str]) -> None: ...
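
The signatures above imply a conversion from a flat list of entity dicts to a label-keyed dictionary. The sketch below shows what such a grouping step could look like. It assumes each entity dict has `"text"` and `"label"` keys, which is an assumption about the dict shape rather than something this README states, and `group_by_label` is a hypothetical stand-in for `create_entity_mapping`.

```python
from typing import Dict, List


def group_by_label(entities: List[dict]) -> Dict[str, List[str]]:
    """Group entity texts under their label, dropping duplicates in order."""
    grouped: Dict[str, List[str]] = {}
    for entity in entities:
        values = grouped.setdefault(entity["label"], [])
        if entity["text"] not in values:
            values.append(entity["text"])
    return grouped


detected = [
    {"text": "John Doe", "label": "person"},
    {"text": "john.doe@email.com", "label": "email"},
    {"text": "John Doe", "label": "person"},
    {"text": "Jane Roe", "label": "person"},
]
grouped = group_by_label(detected)
print(grouped)
```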

PIIRedactor

class PIIRedactor:
    def create_replacement_mapping(self, entities: Dict[str, List[str]]) -> Dict[str, str]: ...
    def redact_text(self, text: str, mapping: Dict[str, str]) -> str: ...
    def redact_from_identified_entities(self, text: str, entities: Dict) -> str: ...
    def set_dummy_values(self, dummy_values: Dict) -> None: ...

ReverseMapper

class ReverseMapper:
    def reverse_redaction(self, redacted_text: str) -> Tuple[str, int, Dict]: ...
    def load_replacement_mapping(self, mapping_file: str) -> None: ...
    def validate_mapping(self) -> bool: ...
    def get_mapping_statistics(self) -> Dict: ...
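
What `validate_mapping` checks is not documented here. One property worth checking before reversing is that no placeholder is shared by two originals. The sketch below illustrates that check together with a counting reversal that replaces longer placeholders first, so that `PERSON_1` cannot clobber part of `PERSON_10`. Both function names are hypothetical, and the mapping direction (original to placeholder) is an assumption.

```python
from typing import Dict, List, Tuple


def find_ambiguous_placeholders(mapping: Dict[str, str]) -> List[str]:
    """Return placeholders that more than one original value maps to."""
    seen: Dict[str, int] = {}
    for placeholder in mapping.values():
        seen[placeholder] = seen.get(placeholder, 0) + 1
    return sorted(p for p, n in seen.items() if n > 1)


def reverse_with_count(redacted: str, mapping: Dict[str, str]) -> Tuple[str, int]:
    """Restore originals, longest placeholder first, and count replacements."""
    total = 0
    by_length = sorted(mapping.items(), key=lambda item: len(item[1]), reverse=True)
    for original, placeholder in by_length:
        total += redacted.count(placeholder)
        redacted = redacted.replace(placeholder, original)
    return redacted, total


mapping = {"Alice": "PERSON_1", "Bob": "PERSON_10"}
restored, count = reverse_with_count("PERSON_10 met PERSON_1", mapping)
ambiguous = find_ambiguous_placeholders(
    {"a@x.com": "[EMAIL_REDACTED]", "b@x.com": "[EMAIL_REDACTED]", "Al": "[P]"}
)
print(restored, count, ambiguous)
```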

Advanced Usage

Custom Model

pipeline = SecretStuffPipeline(
    model_name="your-custom-gliner-model"
)

Batch Processing

# Process multiple files
files = ["doc1.txt", "doc2.txt", "doc3.txt"]
results = []

for file in files:
    result = pipeline.process_text_file(file)
    results.append(result)
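
In a longer batch, one unreadable file should not abort the whole run. The sketch below wraps any per-file callable and records failures separately. `process_batch` and `fake_process` are hypothetical names, and the fake processor stands in for `pipeline.process_text_file` so that the example runs without the library installed.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def process_batch(
    paths: Iterable[str],
    process_one: Callable[[str], Any],
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Run process_one on every path, collecting results and errors separately."""
    results: Dict[str, Any] = {}
    errors: Dict[str, str] = {}
    for path in paths:
        try:
            results[path] = process_one(path)
        except Exception as exc:  # record the failure and keep going
            errors[path] = f"{type(exc).__name__}: {exc}"
    return results, errors


def fake_process(path: str) -> dict:
    """Stand-in for a real per-file processor (assumption for this demo)."""
    if path == "missing.txt":
        raise FileNotFoundError(path)
    return {"input_file": path, "status": "ok"}


results, errors = process_batch(["doc1.txt", "missing.txt", "doc3.txt"], fake_process)
print(sorted(results), errors)
```

With the real pipeline, the call would presumably look like `process_batch(files, pipeline.process_text_file)`.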

Use Cases

1. Cloud LLM Data Protection

# Before sending to cloud LLM
original_text = "Patient John Doe (DOB: 1985-03-15) visited on..."
redacted_text, entities, mapping = pipeline.identify_and_redact(original_text)

# Send redacted_text to cloud LLM (call_cloud_llm is a placeholder for your own client code)
llm_response = call_cloud_llm(redacted_text)

# Restore original PII in response
final_response, _, _ = pipeline.reverse_redaction(llm_response, mapping)

2. Document Anonymization

# Remove PII from documents permanently
entities = pipeline.identify_pii(document_text)
anonymized = pipeline.redact_pii(document_text, entities)
# Don't save the mapping for permanent anonymization

3. Data Processing Pipeline

# Part of a larger data processing workflow
import os

from secretstuff import SecretStuffPipeline

pipeline = SecretStuffPipeline()

def process_sensitive_documents(input_dir, output_dir):
    for filename in os.listdir(input_dir):
        input_path = os.path.join(input_dir, filename)
        output_path = os.path.join(output_dir, f"redacted_{filename}")

        pipeline.process_text_file(
            input_file=input_path,
            output_redacted=output_path
        )

Configuration

Environment Variables

export SECRETSTUFF_MODEL_NAME="aksman18/gliner-multi-pii-domains-v2"
export SECRETSTUFF_CHUNK_SIZE="384"
export SECRETSTUFF_CACHE_DIR="/path/to/cache"
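
How the library reads these variables is not shown here. As an illustration only, the sketch below collects the three variable names above into a small settings object. The `Settings` class and `load_settings` function are hypothetical, and the fallback of 384 is taken from the `chunk_size` default in the API reference rather than from any documented environment default.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    model_name: Optional[str]
    chunk_size: int
    cache_dir: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the SECRETSTUFF_* variables from env, with simple fallbacks."""
    return Settings(
        model_name=env.get("SECRETSTUFF_MODEL_NAME"),
        chunk_size=int(env.get("SECRETSTUFF_CHUNK_SIZE", "384")),
        cache_dir=env.get("SECRETSTUFF_CACHE_DIR"),
    )


# An explicit mapping is passed so the demo does not depend on the real environment.
settings = load_settings({"SECRETSTUFF_CHUNK_SIZE": "256"})
print(settings)
```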

Custom Configuration File

# config.py
CUSTOM_LABELS = ["person", "email", "phone", "custom_field"]
CUSTOM_DUMMY_VALUES = {
    "custom_field": "[CUSTOM_REDACTED]"
}

# main.py
from secretstuff import SecretStuffPipeline

from config import CUSTOM_LABELS, CUSTOM_DUMMY_VALUES

pipeline = SecretStuffPipeline(
    labels=CUSTOM_LABELS,
    dummy_values=CUSTOM_DUMMY_VALUES
)

Performance Considerations

Model Caching: The GLiNER model is cached after the first load
Batch Processing: Process multiple documents in batches for efficiency
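
The caching behaviour can be pictured with a memoized loader. This is a sketch of the general technique using `functools.lru_cache`, not the library's actual mechanism. `FakeModel` and `get_model` are hypothetical, and the fake class stands in for an expensive model load.

```python
from functools import lru_cache
from typing import List

LOADS: List[str] = []  # records every real (non-cached) load


class FakeModel:
    """Stand-in for an expensive-to-load NLP model (assumption)."""

    def __init__(self, name: str) -> None:
        self.name = name


@lru_cache(maxsize=None)
def get_model(model_name: str) -> FakeModel:
    """Load the model on first use; later calls return the cached object."""
    LOADS.append(model_name)
    return FakeModel(model_name)


first = get_model("demo-model")
second = get_model("demo-model")
print(first is second, LOADS)
```

In practice this means that creating one pipeline and reusing it across documents should avoid paying the load cost again for each document.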

Error Handling

from secretstuff import SecretStuffPipeline
from secretstuff.exceptions import SecretStuffError

try:
    pipeline = SecretStuffPipeline()
    result = pipeline.identify_and_redact(text)
except SecretStuffError as e:
    print(f"SecretStuff error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

Testing

Run the test suite:

# Install dev dependencies (quotes keep shells such as zsh from expanding the brackets)
pip install "secretstuff[dev]"

# Run tests
pytest

# Or, verbosely (please run all the tests before raising a PR)
python -m pytest tests/ -v

# Run with coverage
pytest --cov=secretstuff --cov-report=html

Contributing

Fork the repository
Create a feature branch
Add tests for new functionality
Ensure all tests pass
Submit a pull request

License

MIT License - see LICENSE file for details.

Support

Documentation: https://github.com/adw777/secretStuff/blob/main/README.md
Issues: https://github.com/adw777/secretStuff/issues
Email: amandogra2016@gmail.com

Changelog

v0.0.1

Initial release
PII identification with GLiNER
Flexible redaction system
Reverse mapping functionality
Comprehensive test suite
Production-ready API

Release history

1.0.1 - Oct 16, 2025
1.0.0 - Oct 16, 2025 (this version)
0.0.1 - Sep 1, 2025

Download files

Source Distribution

secretstuff-1.0.0.tar.gz (34.6 kB view details)

Uploaded Oct 16, 2025 Source

Built Distribution

secretstuff-1.0.0-py3-none-any.whl (26.4 kB view details)

Uploaded Oct 16, 2025 Python 3

File details

Details for the file secretstuff-1.0.0.tar.gz.

File metadata

Download URL: secretstuff-1.0.0.tar.gz
Upload date: Oct 16, 2025
Size: 34.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.23

File hashes

Hashes for secretstuff-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`abc38caa25399659795384da01e2f471fcdf5512b884e448b073ab19f1a77f78`
MD5	`0400e71f4e6bb0c573620aed46696dc3`
BLAKE2b-256	`1e5fd29244c5b12ef00066284b75225a328644790ad8eee6c0d39a6a923eb127`

File details

Details for the file secretstuff-1.0.0-py3-none-any.whl.

File metadata

Download URL: secretstuff-1.0.0-py3-none-any.whl
Upload date: Oct 16, 2025
Size: 26.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.23

File hashes

Hashes for secretstuff-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2409d2ce470fe861770480106490a9c70a0f7e7a48f82c43cc7dda887d43a446`
MD5	`336abef131d4e282ed22865621cdc885`
BLAKE2b-256	`40845e5f9afd08d80d9eee949f73a71c18789880fa6c5f8a1ed1b73574f118d3`
