Context-aware PII detection for LLM pipelines and data workflows
pii-shield
A CLI tool that detects personally identifiable information (PII) in text, code, logs, and data files using context-aware pattern matching and statistical scoring. Unlike simple regex scanners, pii-shield analyzes surrounding context to reduce false positives, supports 18 PII types, and provides masked/redacted output formats.
Why pii-shield?
- Context-aware detection: Analyzes surrounding text to reduce false positives compared to regex-only tools
- Zero external dependencies: Runs entirely locally with no API calls, so scans are fast and your data never leaves the machine
- Built for developers: Integrates into CI/CD pipelines, pre-commit hooks, and LLM preprocessing
- Comprehensive coverage: Detects 18 PII pattern types including SSNs, credit cards, API keys, emails, phone numbers, and more
Installation
pip install pii-shield
Quick Start
Scan a file for PII:
pii-shield scan input.txt
Output:
Scanning: input.txt
[Line 12] EMAIL (confidence: 95)
john.doe@example.com
Context: "Contact me at john.doe@example.com for"
[Line 23] SSN (confidence: 87)
123-45-6789
Context: "SSN: 123-45-6789 was"
Summary: 2 PII instances found in 1 file
Mask PII and save to a new file:
pii-shield scan --mask partial --output clean.txt input.txt
Process stdin for pipeline integration:
echo 'Email: alice@company.com, SSN: 123-45-6789' | pii-shield scan --stdin --mask full
Output:
Email: [EMAIL_REDACTED], SSN: [SSN_REDACTED]
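The same stdin filter can be called from Python, for example to scrub a prompt before it is sent to an LLM. This is a minimal sketch: the function name `redact` is hypothetical, the default command is copied from the example above, and because the exit code in masking mode is not documented here, the sketch does not check it.

```python
import subprocess
from typing import Optional, Sequence

# Command copied from the Quick Start example above.
DEFAULT_CMD = ("pii-shield", "scan", "--stdin", "--mask", "full")


def redact(text: str, cmd: Optional[Sequence[str]] = None) -> str:
    """Pipe `text` through a stdin/stdout filter and return what it prints.

    `redact` is a hypothetical helper, not part of pii-shield itself.
    `cmd` can be overridden, which is handy for tests or other flags.
    """
    result = subprocess.run(
        list(cmd or DEFAULT_CMD),
        input=text,            # sent to the child process on stdin
        capture_output=True,   # collect stdout/stderr instead of printing
        text=True,             # work with str rather than bytes
    )
    return result.stdout
```

With pii-shield installed, `redact("Email: alice@company.com, SSN: 123-45-6789")` should return the masked line shown above.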
Scan a directory with JSON output for CI/CD:
pii-shield scan --format json --threshold 80 ./logs/
List all supported PII patterns:
pii-shield patterns --list
Features
- 18 PII pattern types: SSNs, credit cards, emails, phone numbers, passports, API keys (AWS, OpenAI, Stripe, GitHub), IBANs, medical IDs, and more
- Multiple masking strategies: Full redaction, partial masking (***-**-1234), or hash replacement
- Fast processing: Uses compiled regex patterns for efficient scanning
- Multiple output formats: Human-readable text, JSON, or masked output files
- Configurable thresholds: Balance precision/recall with adjustable confidence scores (0-100)
- CI/CD integration: Returns a non-zero exit code when PII is detected, so automated pipelines can fail the build
- Custom patterns: Load organization-specific patterns from YAML config files
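To make the three masking strategies concrete, here is a minimal sketch of what each one could produce. The function names, the number of characters kept, and the hash placeholder format are all assumptions for illustration, not pii-shield's actual implementation.

```python
import hashlib


def mask_full(value: str, pii_type: str) -> str:
    """Full redaction: replace the whole value with a typed placeholder."""
    return f"[{pii_type}_REDACTED]"


def mask_partial(value: str, keep: int = 4) -> str:
    """Partial masking: star out all but the last `keep` letters/digits.

    Separators such as '-' are left in place so the shape stays readable.
    """
    total = sum(ch.isalnum() for ch in value)
    to_hide = max(total - keep, 0)
    out = []
    for ch in value:
        if ch.isalnum() and to_hide > 0:
            out.append("*")
            to_hide -= 1
        else:
            out.append(ch)
    return "".join(out)


def mask_hash(value: str, pii_type: str, length: int = 12) -> str:
    """Hash replacement: same input always maps to the same token.

    The "[TYPE:digest]" format is an assumption made for this sketch.
    """
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:length]
    return f"[{pii_type}:{digest}]"


print(mask_full("123-45-6789", "SSN"))   # [SSN_REDACTED]
print(mask_partial("123-45-6789"))       # ***-**-6789
```

Hash replacement keeps records joinable after masking, but an unsalted hash of a low-entropy value such as an SSN can be brute-forced, so a keyed hash may be the safer choice when that matters.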
Usage Examples
Scan with custom threshold
pii-shield scan --threshold 90 sensitive_data.txt
Batch processing
pii-shield scan --format json ./logs/ > pii_report.json
Pre-commit hook integration
Add to .pre-commit-config.yaml:
repos:
  - repo: local
    hooks:
      - id: pii-shield
        name: PII Detection
        entry: pii-shield scan --format json --threshold 70
        language: system
        pass_filenames: true
GitHub Actions
- name: Scan for PII
  run: |
    pip install pii-shield
    pii-shield scan --format json --threshold 80 ./src/
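In CI systems without a native step syntax, the same gate can be written as a small Python script that relies only on the exit code. This is a sketch under assumptions: `gate` is a hypothetical helper, the command is copied from the workflow step above, and the only documented behavior it depends on is that the exit code is non-zero when PII is detected.

```python
import subprocess
import sys
from typing import Sequence

# Command copied from the workflow step above.
SCAN_CMD = ("pii-shield", "scan", "--format", "json", "--threshold", "80", "./src/")


def gate(cmd: Sequence[str] = SCAN_CMD) -> int:
    """Run the scan and return its exit code (non-zero should fail the job).

    `gate` is a hypothetical wrapper, not something pii-shield ships.
    """
    completed = subprocess.run(list(cmd), capture_output=True, text=True)
    if completed.returncode != 0:
        # Surface whatever the scanner printed so it appears in the CI log.
        sys.stderr.write(completed.stdout)
        sys.stderr.write(completed.stderr)
    return completed.returncode
```

A wrapper script would end with `sys.exit(gate())`, so the job's status follows the scanner's.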
Supported PII Types
- Identification: SSN, Passport, Driver's License, National ID
- Financial: Credit Cards, IBAN, Routing Numbers, SWIFT/BIC
- Contact: Email, Phone Numbers, IP Addresses
- Credentials: API Keys, Password Hashes, JWTs
- Medical: Medical Record Numbers, NPI, DEA Numbers
- Personal: DOB, Addresses, ZIP Codes
Configuration
Create a pii-shield.yaml file:
threshold: 70
masking_strategy: partial
enabled_patterns:
  - EMAIL
  - SSN
  - CREDIT_CARD
  - API_KEY
custom_patterns:
  - name: EMPLOYEE_ID
    pattern: 'EMP-\d{6}'
    confidence: 85
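The sketch below shows one plausible reading of how a custom pattern and the threshold interact. It mirrors the config as a plain Python dict to avoid a YAML dependency, and it assumes that a custom pattern's `confidence` is a fixed score compared directly against `threshold`. How pii-shield actually combines it with context scoring is not specified here, and `find_custom` is a hypothetical name.

```python
import re

# Mirrors the relevant parts of the YAML above as a plain dict.
CONFIG = {
    "threshold": 70,
    "custom_patterns": [
        {"name": "EMPLOYEE_ID", "pattern": r"EMP-\d{6}", "confidence": 85},
    ],
}


def find_custom(text, config=CONFIG):
    """Return (name, matched_text, confidence) for each reportable match.

    Assumption: a pattern whose confidence is below the threshold is
    never reported, so it is skipped before matching.
    """
    hits = []
    for spec in config["custom_patterns"]:
        if spec["confidence"] < config["threshold"]:
            continue
        for m in re.finditer(spec["pattern"], text):
            hits.append((spec["name"], m.group(0), spec["confidence"]))
    return hits


print(find_custom("Badge EMP-004217 was issued on Monday"))
# [('EMPLOYEE_ID', 'EMP-004217', 85)]
```

Note that `EMP-\d{6}` also matches the first six digits of a longer number such as `EMP-1234567`; adding `\b` at both ends of the pattern would restrict it to exact six-digit IDs.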
How It Works
pii-shield uses a multi-stage detection pipeline:
- Pattern tokenizer: Splits input into semantic chunks
- Regex matcher: Identifies 18 PII pattern types
- Context analyzer: Examines surrounding text windows before/after matches
- Validators: Applies Luhn algorithm, checksums, and format validation
- Statistical scorer: Combines pattern + context + validation confidence
- Threshold filter: Configurable cutoff to balance precision/recall
- Output formatter: Applies masking strategies while preserving structure
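The stages above can be sketched for a single PII type. The example below covers the regex matcher, validator, context analyzer, scorer, and threshold filter for credit card numbers; the tokenizer and output formatter are omitted. The regex, keyword list, window size, and score weights are illustrative assumptions, not pii-shield's actual values.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
CONTEXT_WORDS = ("card", "credit", "visa", "mastercard", "payment")


def luhn_ok(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def detect_cards(text, threshold=70, window=30):
    """Return findings whose combined score reaches the threshold."""
    findings = []
    for m in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", m.group(0))
        score = 40                      # pattern matched (assumed weight)
        if luhn_ok(digits):
            score += 35                 # validator passed (assumed weight)
        # Look at `window` characters on each side of the match.
        context = text[max(0, m.start() - window): m.end() + window].lower()
        if any(word in context for word in CONTEXT_WORDS):
            score += 25                 # supporting context (assumed weight)
        if score >= threshold:
            findings.append(
                {"type": "CREDIT_CARD", "match": m.group(0), "confidence": score}
            )
    return findings


print(detect_cards("Paid with card 4111 1111 1111 1111 yesterday"))
# [{'type': 'CREDIT_CARD', 'match': '4111 1111 1111 1111', 'confidence': 100}]
print(detect_cards("Order number 1234567890123456 shipped"))
# []
```

The second input matches the regex but fails the Luhn check and has no supporting context, so it scores 40 and falls below the cutoff. This is the kind of false positive a regex-only scanner would report.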
Comparison to Alternatives
| Tool | Setup | False Positives | Privacy |
|---|---|---|---|
| pii-shield | pip install | Low (context-aware) | 100% local |
| Presidio | Complex (models + APIs) | Medium | Requires external calls |
| scrubadub | pip install | High (regex only) | 100% local |
| Enterprise DLP | Hours of config | Low | SaaS/cloud-based |
License
MIT License - Copyright (c) 2026 Intellirim
Contributing
Issues and pull requests welcome! This is an open-source project maintained by Intellirim.
File details
Details for the file pii_shield-1.1.0.tar.gz.
File metadata
- Download URL: pii_shield-1.1.0.tar.gz
- Upload date:
- Size: 22.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6114bdf60e4f074ad7a0849f028ba8568f5cc1770c20cf542e2b3dbe231283b1 |
| MD5 | 504b283d41a96b53f546fd7553e0ef41 |
| BLAKE2b-256 | 20cfbc1bcc5e590aea89d921e6e5c8ca22b2d123717fd1afa45b2a8c2ec4f41a |
File details
Details for the file pii_shield-1.1.0-py3-none-any.whl.
File metadata
- Download URL: pii_shield-1.1.0-py3-none-any.whl
- Upload date:
- Size: 17.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e9a59dec40e423ef548cf802399ac5155c421d5d663766d569d0ea15e381fe9e |
| MD5 | 35ef385aaee058cfe0ef24b92a13f4e0 |
| BLAKE2b-256 | 3237227c64267f9c9920a14b156cda7a02d5e8d860dc23c6305aebadac1fc3af |