pii-shield

Context-aware PII detection for LLM pipelines and data workflows

License: MIT

A CLI tool that detects personally identifiable information (PII) in text, code, logs, and data files using context-aware pattern matching and statistical scoring. Unlike simple regex scanners, pii-shield analyzes surrounding context to reduce false positives, supports 18 PII types, and provides masked/redacted output formats.

Why pii-shield?

  • Context-aware detection: Analyzes surrounding text to reduce false positives compared to regex-only tools
  • Zero external dependencies: Runs entirely locally with no API calls, which keeps scanning fast and privacy-preserving
  • Built for developers: Integrates into CI/CD pipelines, pre-commit hooks, and LLM preprocessing
  • Comprehensive coverage: Detects 18 PII pattern types including SSNs, credit cards, API keys, emails, phone numbers, and more

Installation

pip install pii-shield

Quick Start

Scan a file for PII:

pii-shield scan input.txt

Output:

Scanning: input.txt

[Line 12] EMAIL (confidence: 95)
  john.doe@example.com
  Context: "Contact me at john.doe@example.com for"

[Line 23] SSN (confidence: 87)
  123-45-6789
  Context: "SSN: 123-45-6789 was"

Summary: 2 PII instances found in 1 file

Mask PII and save to a new file:

pii-shield scan --mask partial --output clean.txt input.txt

Process stdin for pipeline integration:

echo 'Email: alice@company.com, SSN: 123-45-6789' | pii-shield scan --stdin --mask full

Output:

Email: [EMAIL_REDACTED], SSN: [SSN_REDACTED]
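
For LLM preprocessing from Python, the stdin mode above could be wrapped with the standard library's subprocess module. The following is a minimal sketch, assuming pii-shield is installed and on PATH. The helper names build_scan_command and mask_text, and the injectable runner argument, are hypothetical and not part of pii-shield; only flags shown in this README are used. Since the tool is described as returning a non-zero exit code when PII is detected, the sketch does not treat a non-zero exit as an error; whether that behavior also applies in masking mode is not stated here.

```python
import subprocess
from typing import Callable, List


def build_scan_command(mask: str = "full") -> List[str]:
    """Build the CLI invocation using only flags shown in this README."""
    return ["pii-shield", "scan", "--stdin", "--mask", mask]


def mask_text(
    text: str,
    mask: str = "full",
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> str:
    """Pipe `text` through pii-shield on stdin and return its stdout.

    `runner` is a hypothetical injection point so the wrapper can be
    exercised in tests without the CLI installed.
    """
    result = runner(
        build_scan_command(mask),
        input=text,
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit may simply mean "PII was found"
    )
    return result.stdout
```

A prompt could then be cleaned with mask_text(prompt) before it is sent to a model.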

Scan a directory with JSON output for CI/CD:

pii-shield scan --format json --threshold 80 ./logs/

List all supported PII patterns:

pii-shield patterns --list

Features

  • 18 PII pattern types: SSNs, credit cards, emails, phone numbers, passports, API keys (AWS, OpenAI, Stripe, GitHub), IBANs, medical IDs, and more
  • Multiple masking strategies: Full redaction, partial masking (***-**-1234), or hash replacement
  • Fast processing: Uses compiled regex patterns for efficient scanning
  • Multiple output formats: Human-readable text, JSON, or masked output files
  • Configurable thresholds: Balance precision/recall with adjustable confidence scores (0-100)
  • CI/CD integration: Returns a non-zero exit code when PII is detected, so automated pipelines can fail the build
  • Custom patterns: Load organization-specific patterns from YAML config files
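
To illustrate the three masking strategies listed above, here is a minimal Python sketch. It is not pii-shield's implementation: the function names, the choice of SHA-256, and the truncated digest length are assumptions made for illustration. The placeholder format and the partial-mask shape follow the examples in this README.

```python
import hashlib


def mask_full(value: str, pii_type: str) -> str:
    """Full redaction: replace the value with a type-tagged placeholder."""
    return f"[{pii_type}_REDACTED]"


def mask_partial(value: str, visible: int = 4) -> str:
    """Partial masking: star out every alphanumeric character except the
    last `visible` ones, leaving separators such as '-' in place."""
    total = sum(ch.isalnum() for ch in value)
    to_hide = max(total - visible, 0)
    out = []
    for ch in value:
        if ch.isalnum() and to_hide > 0:
            out.append("*")
            to_hide -= 1
        else:
            out.append(ch)
    return "".join(out)


def mask_hash(value: str, length: int = 12) -> str:
    """Hash replacement: truncated SHA-256 digest (assumed scheme),
    identical for identical inputs so records can still be joined."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:length]


print(mask_full("123-45-6789", "SSN"))   # [SSN_REDACTED]
print(mask_partial("123-45-6789"))       # ***-**-6789
```

Note that an unsalted hash of a low-entropy value such as an SSN can be reversed by brute force, so hash replacement is better suited to correlation than to strong anonymization.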

Usage Examples

Scan with custom threshold

pii-shield scan --threshold 90 sensitive_data.txt

Batch processing

pii-shield scan --format json ./logs/ > pii_report.json

Pre-commit hook integration

Add to .pre-commit-config.yaml:

repos:
  - repo: local
    hooks:
      - id: pii-shield
        name: PII Detection
        entry: pii-shield scan --format json --threshold 70
        language: system
        pass_filenames: true

GitHub Actions

- name: Scan for PII
  run: |
    pip install pii-shield
    pii-shield scan --format json --threshold 80 ./src/

Supported PII Types

  • Identification: SSN, Passport, Driver's License, National ID
  • Financial: Credit Cards, IBAN, Routing Numbers, SWIFT/BIC
  • Contact: Email, Phone Numbers, IP Addresses
  • Credentials: API Keys, Password Hashes, JWTs
  • Medical: Medical Record Numbers, NPI, DEA Numbers
  • Personal: DOB, Addresses, ZIP Codes

Configuration

Create a pii-shield.yaml file:

threshold: 70
masking_strategy: partial
enabled_patterns:
  - EMAIL
  - SSN
  - CREDIT_CARD
  - API_KEY
custom_patterns:
  - name: EMPLOYEE_ID
    pattern: 'EMP-\d{6}'
    confidence: 85
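
As a rough illustration of how a custom pattern entry like the one above might be applied, here is a Python sketch. It assumes, without confirmation from this README, that a custom pattern's confidence is a fixed score compared against the global threshold. The function find_custom_matches is hypothetical, and a plain dict stands in for the parsed YAML so the example needs no YAML library.

```python
import re

# Mirrors the custom_patterns entry from the pii-shield.yaml example.
config = {
    "threshold": 70,
    "custom_patterns": [
        {"name": "EMPLOYEE_ID", "pattern": r"EMP-\d{6}", "confidence": 85},
    ],
}


def find_custom_matches(text, cfg):
    """Return (name, matched_text, confidence) tuples for custom patterns
    whose confidence meets the threshold (assumed semantics)."""
    hits = []
    for entry in cfg["custom_patterns"]:
        if entry["confidence"] < cfg["threshold"]:
            continue  # this pattern can never pass the cutoff
        for m in re.finditer(entry["pattern"], text):
            hits.append((entry["name"], m.group(0), entry["confidence"]))
    return hits


print(find_custom_matches("Badge EMP-004217 issued", config))
# [('EMPLOYEE_ID', 'EMP-004217', 85)]
```

When writing patterns, keep in mind that an unanchored regex such as EMP-\d{6} also matches the first six digits of a longer number; adding \b at both ends is one way to tighten it.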

How It Works

pii-shield uses a multi-stage detection pipeline:

  1. Pattern tokenizer: Splits input into semantic chunks
  2. Regex matcher: Identifies 18 PII pattern types
  3. Context analyzer: Examines surrounding text windows before/after matches
  4. Validators: Applies Luhn algorithm, checksums, and format validation
  5. Statistical scorer: Combines pattern + context + validation confidence
  6. Threshold filter: Configurable cutoff to balance precision/recall
  7. Output formatter: Applies masking strategies while preserving structure
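
The stages above can be sketched for a single PII type, credit card numbers, in a few lines of Python. This is an illustrative sketch only and not pii-shield's code: the regex, the context keywords, the 30-character window, and the 50/25/25 score weights are all assumptions. The Luhn check itself is the standard checksum used for card numbers.

```python
import re

# Stage 2 (assumed pattern): 13-19 digits, optionally separated by space or '-'.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
# Stage 3 (assumed keywords) that make a card number more plausible.
CONTEXT_HINTS = ("card", "visa", "payment")


def luhn_valid(number: str) -> bool:
    """Stage 4: Luhn checksum over the digits, ignoring separators."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def score_match(text: str, start: int, end: int, window: int = 30) -> int:
    """Stage 5: combine pattern, context, and validation (assumed weights)."""
    context = text[max(0, start - window):end + window].lower()
    score = 50  # the pattern matched
    if any(hint in context for hint in CONTEXT_HINTS):
        score += 25
    if luhn_valid(text[start:end]):
        score += 25
    return score


def detect_cards(text: str, threshold: int = 80):
    """Stage 6: keep only matches scoring at or above the threshold."""
    findings = []
    for m in CARD_RE.finditer(text):
        score = score_match(text, m.start(), m.end())
        if score >= threshold:
            findings.append((m.group(0), score))
    return findings


print(detect_cards("Card number: 4111 1111 1111 1111 on file"))
# [('4111 1111 1111 1111', 100)]
print(detect_cards("Order id 4111111111111112 shipped"))
# []
```

The second input looks like a card number to a bare regex, but it fails the checksum and has no supporting context, so it falls below the cutoff. This is the kind of false positive the context and validation stages are meant to remove.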

Comparison to Alternatives

Tool            Setup                    False Positives      Privacy
pii-shield      pip install              Low (context-aware)  100% local
Presidio        Complex (models + APIs)  Medium               Requires external calls
scrubadub       pip install              High (regex only)    100% local
Enterprise DLP  Hours of config          Low                  SaaS/cloud-based

License

MIT License - Copyright (c) 2026 Intellirim

Contributing

Issues and pull requests welcome! This is an open-source project maintained by Intellirim.
