PiiScrub

A blazing-fast, lightweight Python library and CLI tool designed to scrub Personally Identifiable Information (PII) from datasets for LLM training and RAG pipelines.

Features

  • Maximum Speed & Zero Dependencies: Relies exclusively on Python's standard library. No pandas, spaCy, or other heavy external packages.
  • Deterministic Validation: Raw regex matches for high-risk entities (such as credit cards and IPs) must pass algorithmic checks (e.g., the Luhn checksum, octet range validation) before being flagged, eliminating false positives.
  • Pre-compiled Regex: All regular expressions are compiled once at module import using re.compile(), so no compilation cost is paid while scrubbing.
  • Large Dataset Streaming: Provides scrub_stream and extract_stream to process massive datasets chunk by chunk without running out of memory.
  • Multi-Core Parallel Processing: Spreads scrubbing of large files across multiple CPU cores via the --parallel flag.
  • Pre-Bundled Compliance Profiles: Quickly target specific standards like hipaa, pci-dss, or gdpr using the --profile flag.
  • Compliance Auditing & Metric Reports: Generate detailed JSON reports with statistics on redacted entities and execution time using --report.
  • High-Value Secret Detection: Detects critical credentials such as AWS Access Keys, GitHub Tokens, and RSA Private Keys out of the box.
  • Deterministic Hashing: Replace PII with deterministic SHA-256 hashes instead of generic tags to track uniqueness without leaking data.
  • Synthetic Data Generation: Replace real PII with realistic fake values generated by the third-party faker library (beta).
  • Configuration File Support: Manage complex settings via piiscrub.json instead of long CLI commands.
  • Custom Pattern Injection: Dynamically inject your own regex patterns and validators directly into the engine without modifying the core library.
  • Allowlist Support: Explicitly bypass scrubbing for public figures, system emails, or company identifiers to prevent false positives.
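
The deterministic validation step for credit cards is the classic Luhn checksum. A minimal standalone sketch of that check (illustrative code, not piiscrub's internals):

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(d) for d in number if d.isdigit()]
    checksum = 0
    # Walk digits right-to-left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

print(luhn_valid("4111111111111111"))  # True  (standard test card number)
print(luhn_valid("4111111111111112"))  # False (fails the checksum)
```

This is why a random 16-digit number in your data will usually not be flagged as a CREDIT_CARD: roughly nine out of ten such strings fail the checksum.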

Supported Entities

  • Global:
    • EMAIL
    • PHONE_GENERIC (international)
    • CREDIT_CARD (13-16 digits with Luhn algorithm validation)
    • IPV4 (validation ensuring all octets <= 255)
    • IPV6
  • US Specific:
    • US_SSN
  • India Specific:
    • IN_AADHAAR (12 digits, cannot start with 0 or 1)
    • IN_PAN (5 uppercase letters, 4 digits, 1 uppercase letter)
  • Secrets & Credentials (V2):
    • AWS_ACCESS_KEY
    • GITHUB_TOKEN
    • RSA_PRIVATE_KEY
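
The IPV4 octet-range validation mentioned above can be illustrated with a small sketch (the regex and function names here are illustrative, not piiscrub's internals):

```python
import re

# Candidate matcher: any dotted quad of 1-3 digit groups.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def valid_ipv4(candidate: str) -> bool:
    """Deterministic check: every octet must be in the 0-255 range."""
    return all(0 <= int(octet) <= 255 for octet in candidate.split("."))

text = "Servers at 192.168.1.10 and 999.999.999.999"
hits = [m for m in IPV4_RE.findall(text) if valid_ipv4(m)]
print(hits)  # ['192.168.1.10']
```

The two-stage design keeps the regex simple and fast while the cheap post-check filters out look-alikes such as 999.999.999.999.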

Installation

pip install piiscrub

# or, from a source checkout:
pip install .

CLI Usage

Extract PII

piiscrub extract --text "My email is test@example.com"
piiscrub extract --file text.txt

Scrub PII

piiscrub scrub --text "My email is test@example.com"
piiscrub scrub --file text.txt

# Use deterministic hashing instead of standard tags
piiscrub scrub --text "My email is test@example.com" --style hash
# Output: My email is <EMAIL_a1517717>

# Bypass scrubbing for specific public strings
piiscrub scrub --text "Contact support@example.com or user@example.com" --allowlist support@example.com
# Output: Contact support@example.com or <EMAIL>

# Inject Custom Pattern from the CLI
piiscrub scrub --text "This is employee EMP-99881 and email a@b.com" --custom-pattern EMP_ID "\bEMP-\d{5}\b" --entities EMP_ID EMAIL
# Output: This is employee <EMP_ID> and email <EMAIL>

# Synthetic Data Generation
piiscrub scrub --text "Contact me at omkar@example.com" --style synthetic
# Output: Contact me at victoria12@gmail.com
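
The idea behind --style hash can be approximated with plain hashlib. The 8-character truncation below is an assumption, so the tag produced here won't necessarily match the CLI's output:

```python
import hashlib

def hash_tag(entity_type: str, value: str) -> str:
    """Replace a PII value with a short, stable SHA-256 digest so repeated
    values map to the same tag without leaking the original data."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]  # truncation length is assumed
    return f"<{entity_type}_{digest}>"

print(hash_tag("EMAIL", "test@example.com"))
```

Because the hash is deterministic, the same email always maps to the same tag, which preserves record linkage across a dataset.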

Advanced Features

1. Configuration File (piiscrub.json)

You can define a piiscrub.json file in your working directory to simplify your commands:

{
    "style": "hash",
    "entities": ["EMAIL", "PHONE_GENERIC"],
    "allowlist": ["support@mycompany.com"],
    "custom_patterns": {
        "ORDER_ID": "ORD-\\d{5}"
    }
}

Now just run:

piiscrub scrub --file data.txt
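
Conceptually, the config file just pre-fills the same options the CLI flags set. A hypothetical loader sketch (DEFAULTS and load_config are illustrative names, not piiscrub's actual code):

```python
import json
from pathlib import Path

# Assumed defaults; piiscrub's real defaults may differ.
DEFAULTS = {"style": "tag", "entities": None, "allowlist": [], "custom_patterns": {}}

def load_config(path: str = "piiscrub.json") -> dict:
    """Read piiscrub.json from the working directory, falling back to
    defaults for any keys the file does not set."""
    config = dict(DEFAULTS)
    p = Path(path)
    if p.exists():
        config.update(json.loads(p.read_text()))
    return config

print(load_config())
```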

2. Parallel Processing

For large files, use multi-core processing:

piiscrub scrub --file large_dataset.txt --parallel --output cleaned.txt

Tip: Parallel mode automatically handles file I/O efficiently and defaults to using all available CPU cores.
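
The chunked-parallel idea behind --parallel can be sketched with the standard library's multiprocessing.Pool (the regex and worker below are simplified illustrations; piiscrub's actual chunking strategy may differ):

```python
import re
from multiprocessing import Pool

# Simplified email matcher for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def scrub_line(line: str) -> str:
    """Worker function: scrub a single line independently of all others."""
    return EMAIL_RE.sub("<EMAIL>", line)

if __name__ == "__main__":
    lines = ["a@b.com hi", "no pii here", "c@d.org bye"]
    # Each line is self-contained, so the work splits cleanly across cores.
    with Pool(processes=2) as pool:
        print(pool.map(scrub_line, lines))
```

Line-oriented scrubbing parallelizes well precisely because no line depends on any other.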

3. Pre-Bundled Compliance Profiles

Quickly target common privacy standards without remembering every entity name:

# Scrub only PCI-DSS related data (Credit Cards)
piiscrub scrub --file transactions.txt --profile pci-dss

# Scrub HIPAA related data (SSN, Phone, Email, IP)
piiscrub scrub --file medical_records.txt --profile hipaa

Available profiles: pci-dss, hipaa, gdpr, strict.
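
Conceptually, a profile is just a named entity set. The pci-dss and hipaa sets below follow the descriptions above; resolve_entities is an illustrative helper, not piiscrub's API, and the gdpr/strict sets are omitted as unknowns:

```python
# Profile-to-entity mapping as described in the examples above.
PROFILES = {
    "pci-dss": ["CREDIT_CARD"],
    "hipaa": ["US_SSN", "PHONE_GENERIC", "EMAIL", "IPV4"],
}

def resolve_entities(profile: str) -> list:
    """Expand a profile name into the entity list it targets."""
    try:
        return PROFILES[profile]
    except KeyError:
        raise ValueError(f"Unknown profile: {profile}")

print(resolve_entities("pci-dss"))  # ['CREDIT_CARD']
```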

4. Compliance Auditing & Metric Reports

Data compliance teams can generate a statistical summary of the scrubbing process as proof of redaction:

piiscrub scrub --file sensitive_data.txt --report audit.json

Sample audit.json output:

{
    "command": "scrub",
    "total_lines_processed": 5000,
    "execution_time_seconds": 1.25,
    "entities_redacted": {
        "EMAIL": 142,
        "CREDIT_CARD": 12,
        "PHONE_GENERIC": 5
    },
    "style": "tag"
}
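
A report like the sample above is essentially a counter of redactions plus timing. An illustrative sketch that mirrors the JSON shape (not the library's internal code):

```python
import json
import time
from collections import Counter

start = time.perf_counter()
counts = Counter()
# Pretend per-line extraction results for the sake of the example.
scrub_results = [
    {"EMAIL": ["a@b.com"]},
    {"EMAIL": ["c@d.com"], "CREDIT_CARD": ["4111111111111111"]},
]
for entities in scrub_results:
    for name, matches in entities.items():
        counts[name] += len(matches)

report = {
    "command": "scrub",
    "total_lines_processed": len(scrub_results),
    "execution_time_seconds": round(time.perf_counter() - start, 2),
    "entities_redacted": dict(counts),
    "style": "tag",
}
print(json.dumps(report, indent=4))
```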

Stream Processing

For extremely large files (e.g., LLM training corpora or long data logs):

piiscrub scrub --file huge_dataset.jsonl --stream > scrubbed.jsonl
piiscrub extract --file huge_dataset.jsonl --stream > entities.json
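
Streaming boils down to generator-style, line-at-a-time processing so memory stays flat regardless of file size. A minimal sketch of the idea (the real scrub_stream signature may differ, and the regex is a simplified email matcher):

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def scrub_stream(lines):
    """Yield scrubbed lines one at a time; nothing is buffered,
    so a multi-gigabyte file uses the same memory as a tiny one."""
    for line in lines:
        yield EMAIL_RE.sub("<EMAIL>", line)

# Works directly on a file handle too, since files iterate by line.
for out in scrub_stream(["contact a@b.com\n", "clean line\n"]):
    print(out, end="")
```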

Library Usage

from piiscrub.core import PiiScrub
import re

# Initialize with custom generic entities or pattern injection!
custom_patterns = {
    "INTERNAL_ID": re.compile(r"\bEMP-\d{5}\b")
}
cs = PiiScrub(
    entities=["EMAIL", "CREDIT_CARD", "INTERNAL_ID"], 
    custom_patterns=custom_patterns,
    allowlist=["public@example.com"]
)

code = "Contact test@example.com for info on EMP-12345."

# Extract entities
extracted = cs.extract_entities(code)
print(extracted)
# {'EMAIL': ['test@example.com'], 'INTERNAL_ID': ['EMP-12345']}

# Scrub entities using hashing
scrubbed_code = cs.scrub_text(code, replacement_style="hash")
print(scrubbed_code)
# Contact <EMAIL_a1517717> for info on <INTERNAL_ID_b5fb38c3>.

Download files

Source Distribution

piiscrub-0.1.0.tar.gz (16.8 kB)

  • Tags: Source
  • Uploaded via: twine/6.2.0 CPython/3.14.2
  • Trusted Publishing: No

Hashes for piiscrub-0.1.0.tar.gz

  • SHA256: 2a5386db7494a77d403f1cc30be9bb4aaa4e89ae532883d41a363b963fb8a5ad
  • MD5: 5d692a85654164ead5ba2d514ab0aedb
  • BLAKE2b-256: f5f86235d31b8dddad54380cde6162e67f9d9122ae03ea44e4ebb891a624e854

Built Distribution

piiscrub-0.1.0-py3-none-any.whl (12.7 kB)

  • Tags: Python 3
  • Uploaded via: twine/6.2.0 CPython/3.14.2
  • Trusted Publishing: No

Hashes for piiscrub-0.1.0-py3-none-any.whl

  • SHA256: ab2b5e6b352265aebf2327baeca7198849ecb6b4d020d060b1b8302ec76fda4f
  • MD5: 6a77bcc26e6203004ecd7e0c6f31e519
  • BLAKE2b-256: cccbf8c528e9cac98b296883cbcfcf3eb4b13622d05a03ebe1daab31313695ea
