PiiScrub
A blazing-fast, lightweight Python library and CLI tool designed to scrub Personally Identifiable Information (PII) from datasets for LLM training and RAG pipelines.
Features
- Maximum Speed & Zero Dependencies: Relies exclusively on Python's standard library. No pandas, spaCy, or other heavy external packages.
- Deterministic Validation: Raw regex matches for high-risk entities (such as credit cards and IPs) must pass algorithmic checks (e.g., the Luhn checksum, octet range checks) before being flagged, which cuts false positives.
- Pre-compiled Regex: All regular expressions are compiled once at the module level using re.compile(), so no pattern is recompiled during execution.
- Large Dataset Streaming: Provides scrub_stream and extract_stream to process massive datasets chunk by chunk without running out of memory.
- Multi-Core Parallel Processing: Uses multiple CPU cores to scrub large files with --parallel.
- Pre-Bundled Compliance Profiles: Quickly target specific standards like hipaa, pci-dss, or gdpr using the --profile flag.
- Compliance Auditing & Metric Reports: Generate detailed JSON reports with statistics on redacted entities and execution time using --report.
- High-Value Secret Detection: Detects critical assets such as AWS Access Keys, GitHub Tokens, and RSA Private Keys out of the box.
- Deterministic Hashing: Replace PII with deterministic SHA-256 hashes instead of generic tags to track uniqueness without leaking data.
- Synthetic Data Generation: Replace real PII with realistic "fake" data using the faker library (beta).
- Configuration File Support: Manage complex settings via piiscrub.json instead of long CLI commands.
- Custom Pattern Injection: Dynamically inject your own regex patterns and validators directly into the engine without modifying the core library.
- Allowlist Support: Explicitly bypass scrubbing for public figures, system emails, or company identifiers to prevent false positives.
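The Deterministic Validation feature can be illustrated with a minimal sketch. This is not the library's actual code: the names CARD_RE, IPV4_RE, luhn_valid, ipv4_valid, and find_validated are hypothetical, and the regexes are simplified assumptions. It only shows the general idea of confirming a raw regex match with a checksum or range check before flagging it.

```python
import re

# Hypothetical candidate patterns; the library's real regexes may differ.
CARD_RE = re.compile(r"\b(?:[0-9][ -]?){12,15}[0-9]\b")
IPV4_RE = re.compile(r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b")


def luhn_valid(candidate: str) -> bool:
    """Return True if the digits in `candidate` pass the Luhn checksum."""
    digits = [int(ch) for ch in candidate if ch.isdecimal()]
    if not 13 <= len(digits) <= 16:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def ipv4_valid(candidate: str) -> bool:
    """Return True if `candidate` has four numeric octets, each <= 255."""
    parts = candidate.split(".")
    return len(parts) == 4 and all(
        part.isdecimal() and int(part) <= 255 for part in parts
    )


def find_validated(text: str) -> dict:
    """Keep only regex matches that also pass their validator."""
    cards = [m.group() for m in CARD_RE.finditer(text) if luhn_valid(m.group())]
    ips = [m.group() for m in IPV4_RE.finditer(text) if ipv4_valid(m.group())]
    return {"CREDIT_CARD": cards, "IPV4": ips}
```

With this approach a string such as 999.1.1.1 still matches the loose IPv4 regex but is dropped by the octet check, so it is never reported.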
Supported Entities
- Global:
  - EMAIL
  - PHONE_GENERIC (international)
  - CREDIT_CARD (13-16 digits with Luhn algorithm validation)
  - IPV4 (validation ensuring all octets <= 255)
  - IPV6
- US Specific:
  - US_SSN
- India Specific:
  - IN_AADHAAR (12 digits, cannot start with 0 or 1)
  - IN_PAN (5 uppercase letters, 4 digits, 1 uppercase letter)
- Secrets & Credentials (V2):
  - AWS_ACCESS_KEY
  - GITHUB_TOKEN
  - RSA_PRIVATE_KEY
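The format notes for the India-specific entities translate fairly directly into regular expressions. The sketch below is built only from those notes. The names AADHAAR_RE, PAN_RE, and match_indian_ids are hypothetical, and the library's own patterns may be stricter or may accept separators that this version ignores.

```python
import re

# Illustrative patterns derived from the format notes above,
# not taken from the library itself.
AADHAAR_RE = re.compile(r"\b[2-9][0-9]{11}\b")      # 12 digits, first digit 2-9
PAN_RE = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")   # 5 letters, 4 digits, 1 letter


def match_indian_ids(text: str) -> dict:
    """Return candidate Aadhaar and PAN strings found in `text`."""
    return {
        "IN_AADHAAR": AADHAAR_RE.findall(text),
        "IN_PAN": PAN_RE.findall(text),
    }
```

The word boundaries keep the Aadhaar pattern from matching the tail of a longer digit run, which is why a 12-digit number starting with 1 is skipped entirely rather than partially matched.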
Installation
pip install .
CLI Usage
Extract PII
piiscrub extract --text "My email is test@example.com"
piiscrub extract --file text.txt
Scrub PII
piiscrub scrub --text "My email is test@example.com"
piiscrub scrub --file text.txt
# Use deterministic hashing instead of standard tags
piiscrub scrub --text "My email is test@example.com" --style hash
# Output: My email is <EMAIL_a1517717>
# Bypass scrubbing for specific public strings
piiscrub scrub --text "Contact support@example.com or user@example.com" --allowlist support@example.com
# Output: Contact support@example.com or <EMAIL>
# Inject Custom Pattern from the CLI
piiscrub scrub --text "This is employee EMP-99881 and email a@b.com" --custom-pattern EMP_ID "\bEMP-\d{5}\b" --entities EMP_ID EMAIL
# Output: This is employee <EMP_ID> and email <EMAIL>
# Synthetic Data Generation
piiscrub scrub --text "Contact me at omkar@example.com" --style synthetic
# Output: Contact me at victoria12@gmail.com
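To show why the hash style keeps values trackable without exposing them, here is a minimal sketch of a deterministic tag. It assumes the tag is the entity name plus the first 8 hex characters of an unsalted SHA-256 digest, which is only inferred from the shape of the sample output above. The function name hash_tag is hypothetical, and the library may normalize or salt values differently, so the digests produced here are not expected to match its output.

```python
import hashlib


def hash_tag(entity: str, value: str, length: int = 8) -> str:
    """Build a tag like <EMAIL_xxxxxxxx> from a SHA-256 digest of `value`.

    Assumption: unsalted digest, truncated to `length` hex characters.
    """
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return f"<{entity}_{digest[:length]}>"


# The same input always yields the same tag, so repeated values stay linkable.
first = hash_tag("EMAIL", "test@example.com")
second = hash_tag("EMAIL", "test@example.com")
other = hash_tag("EMAIL", "user@example.com")
print(first == second, first == other)
```

A short, unsalted prefix is convenient for deduplication, but it can be reversed by hashing guessed values, so a salted or keyed variant may be worth considering for low-entropy data.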
Advanced Features
1. Configuration File (piiscrub.json)
You can define a piiscrub.json file in your working directory to simplify your commands:
{
  "style": "hash",
  "entities": ["EMAIL", "PHONE_GENERIC"],
  "allowlist": ["support@mycompany.com"],
  "custom_patterns": {
    "ORDER_ID": "ORD-\\d{5}"
  }
}
Now just run:
piiscrub scrub --file data.txt
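For readers wondering how a file like this might be consumed, the following sketch parses the same keys and compiles the custom patterns. It is an assumption about one reasonable approach, not the library's loader. The names parse_config and load_config are hypothetical.

```python
import json
import re
from pathlib import Path


def parse_config(raw: str) -> dict:
    """Parse config JSON and compile any custom patterns it defines."""
    config = json.loads(raw)
    config["custom_patterns"] = {
        name: re.compile(pattern)
        for name, pattern in config.get("custom_patterns", {}).items()
    }
    return config


def load_config(path: str = "piiscrub.json") -> dict:
    """Read the config from `path`; return an empty dict if it is missing."""
    config_file = Path(path)
    if not config_file.exists():
        return {}
    return parse_config(config_file.read_text(encoding="utf-8"))
```

Note that the doubled backslash in "ORD-\\d{5}" is JSON escaping: after parsing, the regex engine receives ORD-\d{5}.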
2. Parallel Processing
For large files, use multi-core processing:
piiscrub scrub --file large_dataset.txt --parallel --output cleaned.txt
Tip: Parallel mode manages file I/O itself and defaults to using all available CPU cores.
3. Pre-Bundled Compliance Profiles
Quickly target common privacy standards without remembering every entity name:
# Scrub only PCI-DSS related data (Credit Cards)
piiscrub scrub --file transactions.txt --profile pci-dss
# Scrub HIPAA related data (SSN, Phone, Email, IP)
piiscrub scrub --file medical_records.txt --profile hipaa
Available profiles: pci-dss, hipaa, gdpr, strict.
4. Compliance Auditing & Metric Reports
Data compliance teams can generate a statistical summary of the scrubbing process as proof of redaction:
piiscrub scrub --file sensitive_data.txt --report audit.json
Sample audit.json output:
{
  "command": "scrub",
  "total_lines_processed": 5000,
  "execution_time_seconds": 1.25,
  "entities_redacted": {
    "EMAIL": 142,
    "CREDIT_CARD": 12,
    "PHONE_GENERIC": 5
  },
  "style": "tag"
}
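A report with this shape could be assembled while scrubbing by counting substitutions and timing the run. The sketch below assumes a single, simplified email regex and a hypothetical function scrub_with_report. It mirrors the field names in the sample but is not the library's reporting code.

```python
import json
import re
import time
from collections import Counter

# Simplified email pattern, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def scrub_with_report(lines):
    """Scrub emails from `lines` and return (scrubbed_lines, report)."""
    counts = Counter()
    scrubbed = []
    start = time.perf_counter()
    for line in lines:
        # subn returns the new string and the number of replacements made.
        new_line, replaced = EMAIL_RE.subn("<EMAIL>", line)
        counts["EMAIL"] += replaced
        scrubbed.append(new_line)
    report = {
        "command": "scrub",
        "total_lines_processed": len(scrubbed),
        "execution_time_seconds": round(time.perf_counter() - start, 4),
        "entities_redacted": dict(counts),
        "style": "tag",
    }
    return scrubbed, report


cleaned, audit = scrub_with_report(["a@b.com and c@d.org", "nothing here"])
print(json.dumps(audit, indent=2))
```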
Stream Processing
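For extremely large files (e.g. LLM training corpora or data logs), use --stream so the input is processed incrementally instead of being loaded into memory at once: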
piiscrub scrub --file huge_dataset.jsonl --stream > scrubbed.jsonl
piiscrub extract --file huge_dataset.jsonl --stream > entities.json
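The memory benefit of streaming comes from handling one line (or chunk) at a time. The generator below is a minimal sketch of that idea under stated assumptions: scrub_lines is a hypothetical name, the email regex is simplified, and the real scrub_stream and extract_stream functions may have different signatures and chunking behavior.

```python
import io
import re
from typing import Iterable, Iterator

# Simplified email pattern, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def scrub_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield scrubbed lines one at a time so memory use stays bounded."""
    for line in lines:
        yield EMAIL_RE.sub("<EMAIL>", line)


# File objects iterate lazily by line; StringIO stands in for a large file here.
source = io.StringIO("contact a@b.com\nplain line\n")
sink = io.StringIO()
sink.writelines(scrub_lines(source))
print(sink.getvalue(), end="")
```

Because the generator never holds more than one line, the same pattern works with open() on a real file for both the source and the sink.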
Library Usage
import re

from piiscrub.core import PiiScrub

# Initialize with built-in entities plus an injected custom pattern
custom_patterns = {
    "INTERNAL_ID": re.compile(r"\bEMP-\d{5}\b"),
}
cs = PiiScrub(
    entities=["EMAIL", "CREDIT_CARD", "INTERNAL_ID"],
    custom_patterns=custom_patterns,
    allowlist=["public@example.com"],
)
code = "Contact test@example.com for info on EMP-12345."

# Extract entities
extracted = cs.extract_entities(code)
print(extracted)
# {'EMAIL': ['test@example.com'], 'INTERNAL_ID': ['EMP-12345']}

# Scrub entities using hashing
scrubbed_code = cs.scrub_text(code, replacement_style="hash")
print(scrubbed_code)
# Contact <EMAIL_a1517717> for info on <INTERNAL_ID_b5fb38c3>.