
PiiScrub

A blazing-fast, lightweight Python library and CLI tool designed to scrub Personally Identifiable Information (PII) from datasets for LLM training and RAG pipelines.

Features

  • Maximum Speed & Zero Dependencies: Relies exclusively on Python's standard library. No pandas, spaCy, or other heavy external packages.
  • Deterministic Validation: Raw regex matches for high-risk entities (such as credit cards and IPs) must pass algorithmic checks (e.g., the Luhn checksum, octet range validation) before being flagged, eliminating false positives.
  • Pre-compiled Regex: All regular expressions are compiled once at module load via re.compile(), so no pattern compilation happens at scrub time.
  • Large Dataset Streaming: Provides scrub_stream and extract_stream to process massive datasets chunk by chunk without running out of memory.
  • Multi-Core Parallel Processing: Use --parallel to spread scrubbing of large files across multiple CPU cores.
  • Pre-Bundled Compliance Profiles: Quickly target specific standards like hipaa, pci-dss, or gdpr using the --profile flag.
  • Compliance Auditing & Metric Reports: Generate detailed JSON reports with statistics on redacted entities and execution time using --report.
  • High-Value Secret Detection: Detects critical secrets such as AWS Access Keys, GitHub Tokens, and RSA Private Keys out of the box.
  • Deterministic Hashing: Replace PII with deterministic SHA-256 hashes instead of generic tags to track uniqueness without leaking data.
  • Synthetic Data Generation: Replace real PII with realistic "fake" data using the faker library (beta).
  • Configuration File Support: Manage complex settings via piiscrub.json instead of long CLI commands.
  • Custom Pattern Injection: Dynamically inject your own regex patterns and validators directly into the engine without modifying the core library.
  • Allowlist Support: Explicitly bypass scrubbing for public figures, system emails, or company identifiers to prevent false positives.

Supported Entities

  • Global:
    • EMAIL
    • PHONE_GENERIC (international)
    • CREDIT_CARD (13-16 digits with Luhn algorithm validation)
    • IPV4 (validation ensuring all octets <= 255)
    • IPV6
  • US Specific:
    • US_SSN
  • India Specific:
    • IN_AADHAAR (12 digits, cannot start with 0 or 1)
    • IN_PAN (5 uppercase letters, 4 digits, 1 uppercase letter)
  • Secrets & Credentials (V2):
    • AWS_ACCESS_KEY
    • GITHUB_TOKEN
    • RSA_PRIVATE_KEY

Installation

pip install piiscrub

CLI Usage

Extract PII

piiscrub extract --text "My email is test@example.com"
piiscrub extract --file text.txt

Scrub PII

piiscrub scrub --text "My email is test@example.com"
piiscrub scrub --file text.txt

# Use deterministic hashing instead of standard tags
piiscrub scrub --text "My email is test@example.com" --style hash
# Output: My email is <EMAIL_a1517717>

# Bypass scrubbing for specific public strings
piiscrub scrub --text "Contact support@example.com or user@example.com" --allowlist support@example.com
# Output: Contact support@example.com or <EMAIL>

# Inject Custom Pattern from the CLI
piiscrub scrub --text "This is employee EMP-99881 and email a@b.com" --custom-pattern EMP_ID "\bEMP-\d{5}\b" --entities EMP_ID EMAIL
# Output: This is employee <EMP_ID> and email <EMAIL>

# Synthetic Data Generation
piiscrub scrub --text "Contact me at omkar@example.com" --style synthetic
# Output: Contact me at victoria12@gmail.com

Advanced Features

1. Configuration File (piiscrub.json)

You can define a piiscrub.json file in your working directory to simplify your commands:

{
    "style": "hash",
    "entities": ["EMAIL", "PHONE_GENERIC"],
    "allowlist": ["support@mycompany.com"],
    "custom_patterns": {
        "ORDER_ID": "ORD-\\d{5}"
    }
}

Now just run:

piiscrub scrub --file data.txt

2. Parallel Processing

For large files, use multi-core processing:

piiscrub scrub --file large_dataset.txt --parallel --output cleaned.txt

[!TIP] Parallel mode automatically handles file I/O efficiently and defaults to using all available CPU cores.
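How --parallel might distribute line scrubbing across cores can be sketched with the standard library alone; scrub_line and its simplified email pattern are stand-ins, not the library's actual internals:

```python
import re
from multiprocessing import Pool

# Simplified email pattern for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def scrub_line(line: str) -> str:
    """Scrub one line; top-level so it can be pickled for worker processes."""
    return EMAIL_RE.sub("<EMAIL>", line)


if __name__ == "__main__":
    lines = ["alice@example.com wrote", "no pii here"]
    # Pool() defaults to one worker per available CPU core.
    with Pool() as pool:
        cleaned = pool.map(scrub_line, lines, chunksize=1000)
    print(cleaned)
```

A large chunksize keeps inter-process overhead low, which is what makes per-line work like this profitable to parallelize at all.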

3. Pre-Bundled Compliance Profiles

Quickly target common privacy standards without remembering every entity name:

# Scrub only PCI-DSS related data (Credit Cards)
piiscrub scrub --file transactions.txt --profile pci-dss

# Scrub HIPAA related data (SSN, Phone, Email, IP)
piiscrub scrub --file medical_records.txt --profile hipaa

Available profiles: pci-dss, hipaa, gdpr, strict.
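Conceptually, a profile is just a named set of entities. A minimal sketch, where the pci-dss and hipaa sets follow the examples above (the gdpr and strict sets are omitted rather than guessed):

```python
# Hypothetical profile table; not the library's actual definition.
PROFILES = {
    "pci-dss": ["CREDIT_CARD"],
    "hipaa": ["US_SSN", "PHONE_GENERIC", "EMAIL", "IPV4"],
}


def resolve_entities(profile: str) -> list[str]:
    """Map a profile name to the entity types it targets."""
    try:
        return PROFILES[profile]
    except KeyError:
        raise ValueError(f"Unknown profile: {profile}")
```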

4. Compliance Auditing & Metric Reports

Data compliance teams can generate a statistical summary of the scrubbing process as proof of redaction:

piiscrub scrub --file sensitive_data.txt --report audit.json

Sample audit.json output:

{
    "command": "scrub",
    "total_lines_processed": 5000,
    "execution_time_seconds": 1.25,
    "entities_redacted": {
        "EMAIL": 142,
        "CREDIT_CARD": 12,
        "PHONE_GENERIC": 5
    },
    "style": "tag"
}

Stream Processing

For extremely large files (e.g. LLM corpus data logs):

piiscrub scrub --file huge_dataset.jsonl --stream > scrubbed.jsonl
piiscrub extract --file huge_dataset.jsonl --stream > entities.json
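Chunk-by-chunk processing in the style of scrub_stream can be sketched with a plain generator; the simplified EMAIL_RE pattern is illustrative, not the library's:

```python
import re
from typing import Iterable, Iterator

# Simplified email pattern for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def scrub_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield scrubbed lines one at a time, keeping memory usage flat."""
    for line in lines:
        yield EMAIL_RE.sub("<EMAIL>", line)


# Usage: stream a large file without ever loading it fully into memory.
# with open("huge_dataset.jsonl") as src, open("scrubbed.jsonl", "w") as dst:
#     dst.writelines(scrub_lines(src))
```

Because file objects are themselves line iterators, the generator pulls one line at a time from disk, so peak memory stays at roughly one line regardless of file size.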

Library Usage

from piiscrub.core import PiiScrub
import re

# Initialize with custom generic entities or pattern injection!
custom_patterns = {
    "INTERNAL_ID": re.compile(r"\bEMP-\d{5}\b")
}
cs = PiiScrub(
    entities=["EMAIL", "CREDIT_CARD", "INTERNAL_ID"], 
    custom_patterns=custom_patterns,
    allowlist=["public@example.com"]
)

code = "Contact test@example.com for info on EMP-12345."

# Extract entities
extracted = cs.extract_entities(code)
print(extracted)
# {'EMAIL': ['test@example.com'], 'INTERNAL_ID': ['EMP-12345']}

# Scrub entities using hashing
scrubbed_code = cs.scrub_text(code, replacement_style="hash")
print(scrubbed_code)
# Contact <EMAIL_a1517717> for info on <INTERNAL_ID_b5fb38c3>.
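The hash style shown above can be approximated with hashlib; the 8-character truncation mirrors the sample output but is an assumption, not necessarily the library's exact scheme:

```python
import hashlib


def hash_tag(entity_type: str, value: str, length: int = 8) -> str:
    """Build a deterministic replacement tag from a truncated SHA-256 digest.

    The same value always yields the same tag, so uniqueness is preserved
    across the dataset without leaking the original PII.
    """
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:length]
    return f"<{entity_type}_{digest}>"
```

Because the tag depends only on the value, `test@example.com` maps to one stable token everywhere it appears, which lets downstream analysis count distinct users without seeing their addresses.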
