pii-guard

Local PII firewall for AI coding tools. Tokenize before it leaves your machine.

When you ask an AI tool (Claude Code, Cursor, Aider, Codex, Continue.dev) to analyse data, raw PII travels to the provider's servers. pii-guard intercepts it first: it replaces real values with consistent tokens ([AADHAAR_1], [EMAIL_2]), lets the AI work on the safe version, and restores the real values when you're done. The token-to-value mapping (the session key) never leaves your machine.


Works with every AI tool

  Tool                    How
  ----------------------  -------------------------------------------------------
  Claude Code             PostToolUse hooks (automatic, zero-touch per file read)
  Cursor                  Set OPENAI_BASE_URL=http://localhost:8111/openai/v1
  Aider                   Set OPENAI_API_BASE=http://localhost:8111/openai/v1
  OpenAI Codex CLI        Set OPENAI_BASE_URL=http://localhost:8111/openai/v1
  Continue.dev            Set apiBase in ~/.continue/config.json
  Any OpenAI-SDK app      Set OPENAI_BASE_URL (no code changes)
  Any Anthropic-SDK app   Set ANTHROPIC_BASE_URL (no code changes)
  Any tool, any LLM       Manually: pii-guard tokenize file.csv before sharing

Integration guides: integrations/


How it works — three modes

┌─────────────────────────────────────────────────────────────────────┐
│  Mode 1 · CLI  (any tool, manual)                                   │
│  pii-guard tokenize file.csv → safe file → AI analyses → detokenize │
├─────────────────────────────────────────────────────────────────────┤
│  Mode 2 · Claude Code hooks  (automatic, zero-touch)                │
│  pii-guard install-hooks → hooks fire on every Read + Bash output   │
│  Claude never sees raw PII in the session                           │
├─────────────────────────────────────────────────────────────────────┤
│  Mode 3 · API proxy  (any OpenAI/Anthropic-compatible tool)         │
│  pii-guard proxy → sits between your tool and the upstream API      │
│  One env var. Zero code changes. Works with Cursor, Aider, Codex,  │
│  Continue.dev, LangChain, and any SDK that respects base URL vars.  │
└─────────────────────────────────────────────────────────────────────┘

All three modes use the same tokenization engine and session format. john@acme.com is always [EMAIL_1] within a session, regardless of which mode captured it.


Install

pip install piiwall            # core (plain text, CSV)
pip install 'piiwall[rich]'    # + PDF, Word (.docx), Excel (.xlsx)

Mode 1 — CLI (tool-agnostic, manual)

Works with any AI tool. Tokenize a file first, share the safe version, detokenize results when done.

# Scan — see what PII exists (exits 1 if found)
pii-guard scan customers.csv --show-values

# Tokenize — create customers.safe.csv with tokens
pii-guard tokenize customers.csv -p dpdp

# Analyse customers.safe.csv with whatever AI tool you use
# Then restore real values
pii-guard detokenize result.txt --session ~/.pii-guard/sessions/pii-guard-<timestamp>.json
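
Because scan exits with status 1 when PII is found, a script can gate on it before sharing a file. The following is a minimal sketch, not part of pii-guard: the helper name has_pii is hypothetical, and it assumes (the text above does not say so) that a clean file exits with status 0.

```python
import subprocess
from typing import Sequence


def has_pii(path: str, scan_cmd: Sequence[str] = ("pii-guard", "scan")) -> bool:
    """Return True when the scan command exits with status 1 (PII found).

    Assumption: a clean file exits with status 0. Any other status is
    treated as a failure of the scan itself.
    """
    result = subprocess.run([*scan_cmd, path], capture_output=True, text=True)
    if result.returncode == 1:
        return True
    if result.returncode == 0:
        return False
    raise RuntimeError(f"scan failed ({result.returncode}): {result.stderr.strip()}")
```

A caller would typically tokenize first whenever has_pii("customers.csv") returns True, and share the file directly otherwise.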

Supported file formats

  Format                 Scan  Tokenize  Notes
  ---------------------  ----  --------  -----
  Plain text, CSV, JSON  yes   yes       Core, no extra deps
  PDF (.pdf)             yes   yes       Output as .safe.txt; requires piiwall[rich]
  Word (.docx)           yes   yes       Format preserved; paragraphs and tables tokenized in place; requires piiwall[rich]
  Excel (.xlsx)          yes   yes       Format preserved; all string cells tokenized in place; requires piiwall[rich]

pip install 'piiwall[rich]'                 # install format support
pii-guard scan report.docx -p dpdp            # scan a Word doc
pii-guard tokenize customer_data.xlsx -p dpdp # tokenize an Excel sheet → customer_data.safe.xlsx
pii-guard scan employees.pdf -p hipaa         # scan a PDF

Session stats

pii-guard stats ~/.pii-guard/sessions/pii-guard-<timestamp>.json
Session:  pii-guard-20240115-103000.json
Total tokens: 12

  Type                    Count
  ---------------------- ------
  EMAIL                       4
  AADHAAR                     3
  MOBILE_IN                   3
  PAN                         2

Export session as CSV (for Excel / VLOOKUP)

pii-guard export-session ~/.pii-guard/sessions/pii-guard-<timestamp>.json

Output (pii-guard-<timestamp>_mapping.csv):

token,pii_type,original_value
[EMAIL_1],EMAIL,john@acme.com
[EMAIL_2],EMAIL,jane@acme.com
[AADHAAR_1],AADHAAR,2345 6789 0123
[PAN_1],PAN,ABCDE1234F
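
The exported mapping can also be applied outside the CLI. The sketch below assumes only the three-column layout shown above; the function names load_mapping and restore and the token regex are illustrative, not part of pii-guard.

```python
import csv
import io
import re

# Matches tokens shaped like [EMAIL_1] or [MOBILE_IN_3] (assumed token shape).
TOKEN = re.compile(r"\[[A-Z][A-Z0-9_]*_\d+\]")


def load_mapping(csv_text: str) -> dict:
    """Build a token -> original value lookup from the exported CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["token"]: row["original_value"] for row in reader}


def restore(text: str, mapping: dict) -> str:
    """Replace every known token in one pass; unknown tokens are left as-is."""
    return TOKEN.sub(lambda m: mapping.get(m.group(0), m.group(0)), text)


SAMPLE = """token,pii_type,original_value
[EMAIL_1],EMAIL,john@acme.com
[AADHAAR_1],AADHAAR,2345 6789 0123
"""

mapping = load_mapping(SAMPLE)
restored = restore("Contact [EMAIL_1] about [AADHAAR_1]; [PAN_9] is unknown.", mapping)
print(restored)
```

Leaving unknown tokens untouched makes it obvious when output is being restored with the wrong session's mapping.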

Presets

  Preset    Covers
  --------  ------
  dpdp 🇮🇳   Aadhaar, PAN, Voter ID, Passport, IFSC, GSTIN, UPI VPA, mobile, PIN code
  gdpr 🇪🇺   IBAN, BIC/SWIFT, VAT, EU phone, MAC address, GPS coordinates
  hipaa 🇺🇸  SSN, NPI, DEA, MRN, health plan IDs, US phone, US dates
  pci 💳    Visa, Mastercard, Amex, Discover, Rupay, CVV, card expiry

pii-guard tokenize file.csv -p dpdp -p pci   # combine presets
pii-guard config show-patterns dpdp           # inspect patterns in a preset

Mode 2 — Claude Code hooks (automatic, zero-touch)

One command installs hooks that fire on every file Claude reads and every bash command output:

pip install piiwall
pii-guard install-hooks --global

This writes two PostToolUse hooks into ~/.claude/settings.json. From then on, Claude sees tokens in place of any PII the active patterns detect.

Add the behavioral layer (tells Claude to proactively offer tokenization):

cp integrations/CLAUDE.md ~/.claude/CLAUDE.md

What the hooks do

Claude calls Read("customers.csv")
        ↓
post_read.py intercepts the tool response
        ↓
Scans for PII → finds 20 instances
        ↓
Replaces with tokens, saves session key → ~/.pii-guard/sessions/claude-<session-id>.json
        ↓
Claude sees [EMAIL_1], [AADHAAR_1] — never the real values

All Read and Bash calls in one Claude Code session share one session file. One detokenize pass restores everything.

Restore after Claude session

pii-guard detokenize result.txt --session ~/.pii-guard/sessions/claude-<session-id>.json
# or export as CSV
pii-guard export-session ~/.pii-guard/sessions/claude-<session-id>.json

Control via environment variables

export PII_GUARD_PRESETS=dpdp,pci   # comma-separated presets (default: dpdp)
export PII_GUARD_ENABLED=0          # disable hooks without removing them
export PII_GUARD_MAX_CHARS=200000   # cap bash output scan size (default: 200000)
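
For a wrapper script that wants to honour the same variables, the sketch below shows one way to read them with the documented defaults. It is not the hooks' actual code: the function name read_hook_settings is hypothetical, and treating only the value "0" as disabled is an assumption.

```python
import os
from typing import Mapping


def read_hook_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read the three PII_GUARD_* variables, falling back to the documented defaults."""
    raw_presets = env.get("PII_GUARD_PRESETS", "dpdp")
    presets = [p.strip() for p in raw_presets.split(",") if p.strip()]
    return {
        # Assumption: only the literal value "0" disables the hooks.
        "enabled": env.get("PII_GUARD_ENABLED", "1") != "0",
        "presets": presets,
        "max_chars": int(env.get("PII_GUARD_MAX_CHARS", "200000")),
    }
```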

Mode 3 — API proxy (Cursor, Aider, Codex, Continue.dev, any SDK)

The proxy sits between your tool and the upstream API. It tokenizes every outgoing prompt and detokenizes every response. Your tool and your code are unchanged.

pii-guard proxy --port 8111 --preset dpdp

Set the base URL in your tool

# Anthropic SDK / Claude Code / any Anthropic-compatible tool
export ANTHROPIC_BASE_URL=http://localhost:8111

# OpenAI SDK / Cursor / Aider / Codex CLI / Continue.dev / LangChain
export OPENAI_BASE_URL=http://localhost:8111/openai/v1

Your existing code works unchanged:

import anthropic
client = anthropic.Anthropic()   # routes through pii-guard automatically

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,   # required by the Messages API
    messages=[{"role": "user", "content": "Analyse rajesh@gmail.com, Aadhaar 2345 6789 0123"}]
)
# Anthropic receives: "Analyse [EMAIL_1], Aadhaar [AADHAAR_1]"
# Your app receives the reply with [EMAIL_1] and [AADHAAR_1] restored to
# rajesh@gmail.com and 2345 6789 0123

What the proxy does

Your tool sends prompt with real PII
        ↓
pii-guard proxy on localhost:8111
        ↓
Tokenizes PII → [EMAIL_1], [AADHAAR_1], [PAN_1]
        ↓
Forwards to api.anthropic.com or api.openai.com
        ↓
Gets response with tokens
        ↓
Detokenizes → real values restored
        ↓
Your tool receives response with real values

Anthropic and OpenAI receive only the tokenized text; values matched by the active patterns never reach them.

Proxy options

pii-guard proxy --port 8111                        # default port
pii-guard proxy --preset dpdp,pci                  # multiple presets
pii-guard proxy --pattern "CUST_ID:CUST-\d{6}"    # custom pattern
pii-guard proxy --session session.json             # resume existing session
pii-guard proxy --quiet                            # suppress per-request logs

Restore after proxy session

pii-guard export-session ~/.pii-guard/sessions/<session-id>.json
pii-guard detokenize output.txt --session ~/.pii-guard/sessions/<session-id>.json


Custom patterns

Persistent — ~/.pii-guard/config.yaml

Loaded automatically by the CLI, hooks, and proxy:

custom_patterns:
  CUSTOMER_ID: 'CUST-\d{6}'
  EMPLOYEE_ID: 'EMP\d{5}'
  INTERNAL_REF: 'INT-[A-Z]{3}-\d{4}'

mkdir -p ~/.pii-guard
cp config/pii-guard.example.yaml ~/.pii-guard/config.yaml

Inline — --pattern / -P flag

pii-guard scan file.csv -P "CUSTOMER_ID:CUST-\d{6}" --show-values
pii-guard tokenize file.csv -P "CUSTOMER_ID:CUST-\d{6}" -P "EMPLOYEE_ID:EMP\d{5}"
pii-guard tokenize data.csv -p dpdp -p pci -P "ACCOUNT_REF:ACC-\d{8}"

CUST-123456 becomes [CUSTOMER_ID_1], fully reversible.
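
Before adding a pattern, it can help to try the regex against a few sample strings. This sketch assumes the NAME:REGEX spec is split on its first colon, which is how the examples above read; parse_pattern_spec is a hypothetical helper, not a pii-guard function.

```python
import re


def parse_pattern_spec(spec: str) -> tuple[str, re.Pattern]:
    """Split 'NAME:REGEX' on the first colon and compile the regex part."""
    name, sep, regex = spec.partition(":")
    if not sep or not name or not regex:
        raise ValueError(f"expected NAME:REGEX, got {spec!r}")
    return name, re.compile(regex)


name, pattern = parse_pattern_spec(r"CUSTOMER_ID:CUST-\d{6}")
samples = ["CUST-123456", "CUST-12345", "order CUST-000001 shipped"]
hits = {s: pattern.findall(s) for s in samples}
print(name, hits)
```

Note that CUST-\d{6} also matches the first six digits of a longer number such as CUST-1234567; add \b anchors if that matters for your IDs.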


Use from Python

from pii_guard.presets import load_presets
from pii_guard.scanner.engine import Scanner
from pii_guard.scanner.patterns import BASE_PATTERNS
from pii_guard.tokenizer.engine import tokenize
from pii_guard.tokenizer.session import Session

patterns = {**BASE_PATTERNS, **load_presets(["dpdp"])}
scanner = Scanner(patterns)
session = Session.new()

safe_text, matches = tokenize(raw_text, scanner, session)
session.save()

print(f"Tokenized {len(matches)} PII instances.")
print(f"Session key: {session.path}")

How tokenization works

Same value → same token within a session. Different values → different tokens. Fully reversible.

john@acme.com   →  [EMAIL_1]     (always, within this session)
jane@acme.com   →  [EMAIL_2]
john@acme.com   →  [EMAIL_1]     ← same input, same token
2345 6789 0123  →  [AADHAAR_1]

Session key stays in ~/.pii-guard/sessions/. Never sent anywhere.
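
The consistency rule can be sketched in a few lines of plain Python. This is an illustration of the idea only, not pii-guard's engine: MiniSession, tokenize_emails, detokenize, and the simplified email regex are all made up for the example.

```python
import re

# Deliberately simple email pattern, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


class MiniSession:
    """Keeps both directions of the mapping so tokenization is reversible."""

    def __init__(self):
        self.value_to_token = {}
        self.token_to_value = {}
        self.counters = {}

    def token_for(self, pii_type: str, value: str) -> str:
        key = (pii_type, value)
        if key not in self.value_to_token:
            # First sighting of this value: allocate the next number for its type.
            n = self.counters.get(pii_type, 0) + 1
            self.counters[pii_type] = n
            token = f"[{pii_type}_{n}]"
            self.value_to_token[key] = token
            self.token_to_value[token] = value
        return self.value_to_token[key]


def tokenize_emails(text: str, session: MiniSession) -> str:
    return EMAIL.sub(lambda m: session.token_for("EMAIL", m.group(0)), text)


def detokenize(text: str, session: MiniSession) -> str:
    for token, value in session.token_to_value.items():
        text = text.replace(token, value)
    return text


session = MiniSession()
original = "john@acme.com, jane@acme.com, john@acme.com"
safe = tokenize_emails(original, session)
print(safe)  # [EMAIL_1], [EMAIL_2], [EMAIL_1]
```

Calling detokenize(safe, session) returns the original string, which is why the session data has to be kept for as long as reversal is needed.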


Limitations

  • Regex-based detection — structured formats (Aadhaar, PAN, IBAN, SSN) have near-zero false negatives. Free-form PII (names, addresses in prose) is not detected; combine with a dedicated NER model if needed.
  • DOCX formatting in PII-containing paragraphs — when a PII value spans multiple runs in a Word document (e.g., bold text adjacent to the value), the paragraph is collapsed to a single run after tokenization. Paragraphs with no PII are untouched.
  • Same-session tokens only — tokens from one session cannot be detokenized with a different session key. Keep the session file for as long as you need to reverse.
  • Streaming responses — the proxy detokenizes SSE streams line-by-line. A token that spans two SSE chunks will not be restored; rare but possible with large token strings.
  • Proxy is localhost-only — binds to 127.0.0.1. Not designed to be network-exposed. Treat the session key file as a secret.
  • No key management — session files are plain JSON on disk. Encrypt or delete when no longer needed.
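
If you post-process streamed text yourself, the chunk-boundary limitation can be worked around by holding back an unfinished token until the next chunk arrives. The sketch below shows that buffering idea on plain text chunks; it is not how the proxy behaves, it ignores SSE framing, and the names and token regex are assumptions.

```python
import re
from typing import Dict, Iterable, Iterator

# Complete tokens such as [EMAIL_1] or [MOBILE_IN_3] (assumed token shape).
TOKEN = re.compile(r"\[[A-Z][A-Z0-9_]*_\d+\]")
# A possibly unfinished token at the very end of a chunk: "[" with no closing "]".
PARTIAL_TAIL = re.compile(r"\[[A-Z0-9_]*$")


def detokenize_stream(chunks: Iterable[str], mapping: Dict[str, str]) -> Iterator[str]:
    carry = ""
    for chunk in chunks:
        text = carry + chunk
        tail = PARTIAL_TAIL.search(text)
        if tail:
            # Hold back the unfinished token until the next chunk completes it.
            text, carry = text[: tail.start()], text[tail.start():]
        else:
            carry = ""
        if text:
            yield TOKEN.sub(lambda m: mapping.get(m.group(0), m.group(0)), text)
    if carry:
        # The stream ended mid-token; emit what was held back unchanged.
        yield carry


mapping = {"[EMAIL_1]": "john@acme.com"}
restored = "".join(detokenize_stream(["Mail [EMA", "IL_1] today"], mapping))
print(restored)  # Mail john@acme.com today
```

The trade-off is a small delay: text after an opening bracket is only released once the bracket is closed or the stream ends.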

CI/CD integration

GitHub Actions

Copy integrations/github-actions/pii-scan.yml into .github/workflows/ to fail PRs that introduce raw PII in CSV, JSON, TXT, or log files:

cp integrations/github-actions/pii-scan.yml .github/workflows/pii-scan.yml

pre-commit hook

Add to your .pre-commit-config.yaml:

repos:
  - repo: https://github.com/sunnypuli/pii-guard
    rev: main
    hooks:
      - id: pii-guard-scan
        args: [--preset, dpdp]

Then install hooks with pre-commit install. Commits that include files with detectable PII will be blocked.

Audit log

Every scan and tokenize run appends a line to ~/.pii-guard/audit.log:

2024-01-15T10:30:00  tokenize     customers.csv                   total=12  AADHAAR:3 EMAIL:4 MOBILE_IN:3 PAN:2
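
For reporting, a line in this layout can be split into fields with a few lines of Python. The sketch assumes whitespace-separated columns and file names without spaces; parse_audit_line is a hypothetical helper and the sample line is made up to match the layout above.

```python
def parse_audit_line(line: str) -> dict:
    """Split one audit line into timestamp, action, file, total, and per-type counts."""
    parts = line.split()
    entry = {
        "timestamp": parts[0],
        "action": parts[1],
        "file": parts[2],  # assumes the file name contains no whitespace
        "total": None,
        "counts": {},
    }
    for field in parts[3:]:
        if field.startswith("total="):
            entry["total"] = int(field[len("total="):])
        elif ":" in field:
            pii_type, _, count = field.rpartition(":")
            entry["counts"][pii_type] = int(count)
    return entry


# Illustrative line, not taken from a real log.
sample = "2024-01-15T10:31:00  tokenize     orders.csv     total=5  EMAIL:3 PAN:2"
entry = parse_audit_line(sample)
print(entry["file"], entry["total"], entry["counts"])
```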

Docker (proxy)

docker build -t pii-guard .
docker run -p 8111:8111 pii-guard --preset dpdp,pci

Then set ANTHROPIC_BASE_URL=http://localhost:8111 or OPENAI_BASE_URL=http://localhost:8111/openai/v1.


Contributing

Contributions welcome — especially:

  • New preset patterns (country-specific IDs, sector-specific formats)
  • False positive reports with reproducible examples
  • IDE and tool integrations

git clone https://github.com/sunnypuli/pii-guard
cd pii-guard
python -m venv venv && source venv/bin/activate
pip install -e ".[dev]"
pytest

Pattern PRs should include a test in tests/test_presets.py covering at least one valid and one invalid example.


License

MIT
