pii-guard
Local PII firewall for AI coding tools. Tokenize before it leaves your machine.
When you ask any AI tool — Claude Code, Cursor, Aider, Codex, Continue.dev — to analyse data, raw PII travels to their servers. pii-guard intercepts it first: replaces real values with consistent tokens ([AADHAAR_1], [EMAIL_2]), lets the AI work on the safe version, and reverses it when you're done. The mapping key never leaves your machine.
Works with every AI tool
| Tool | How |
|---|---|
| Claude Code | PostToolUse hooks — automatic, zero-touch per file read |
| Cursor | Set OPENAI_BASE_URL=http://localhost:8111/openai/v1 |
| Aider | Set OPENAI_API_BASE=http://localhost:8111/openai/v1 |
| OpenAI Codex CLI | Set OPENAI_BASE_URL=http://localhost:8111/openai/v1 |
| Continue.dev | Set apiBase in ~/.continue/config.json |
| Any OpenAI-SDK app | Set OPENAI_BASE_URL — no code changes |
| Any Anthropic-SDK app | Set ANTHROPIC_BASE_URL — no code changes |
| Any tool, any LLM | Manually: pii-guard tokenize file.csv before sharing |
Integration guides: integrations/
How it works — three modes
┌─────────────────────────────────────────────────────────────────────┐
│ Mode 1 · CLI (any tool, manual) │
│ pii-guard tokenize file.csv → safe file → AI analyses → detokenize │
├─────────────────────────────────────────────────────────────────────┤
│ Mode 2 · Claude Code hooks (automatic, zero-touch) │
│ pii-guard install-hooks → hooks fire on every Read + Bash output │
│ Claude never sees raw PII in the session │
├─────────────────────────────────────────────────────────────────────┤
│ Mode 3 · API proxy (any OpenAI/Anthropic-compatible tool) │
│ pii-guard proxy → sits between your tool and the upstream API │
│ One env var. Zero code changes. Works with Cursor, Aider, Codex, │
│ Continue.dev, LangChain, and any SDK that respects base URL vars. │
└─────────────────────────────────────────────────────────────────────┘
All three modes use the same tokenization engine and session format. john@acme.com is always [EMAIL_1] within a session, regardless of which mode captured it.
Install
pip install piiwall # core (plain text, CSV)
pip install 'piiwall[rich]' # + PDF, Word (.docx), Excel (.xlsx)
Mode 1 — CLI (tool-agnostic, manual)
Works with any AI tool. Tokenize a file first, share the safe version, detokenize results when done.
# Scan — see what PII exists (exits 1 if found)
pii-guard scan customers.csv --show-values
# Tokenize — create customers.safe.csv with tokens
pii-guard tokenize customers.csv -p dpdp
# Analyse customers.safe.csv with whatever AI tool you use
# Then restore real values
pii-guard detokenize result.txt --session ~/.pii-guard/sessions/pii-guard-<timestamp>.json
Supported file formats
| Format | Scan | Tokenize | Notes |
|---|---|---|---|
| Plain text, CSV, JSON | ✓ | ✓ | Core, no extra deps |
| PDF (.pdf) | ✓ | ✓ | Output as .safe.txt; requires piiwall[rich] |
| Word (.docx) | ✓ | ✓ | Format preserved, paragraphs and tables tokenized in-place; requires piiwall[rich] |
| Excel (.xlsx) | ✓ | ✓ | Format preserved, all string cells tokenized in-place; requires piiwall[rich] |
pip install 'piiwall[rich]' # install format support
pii-guard scan report.docx -p dpdp # scan a Word doc
pii-guard tokenize customer_data.xlsx -p dpdp # tokenize an Excel sheet → customer_data.safe.xlsx
pii-guard scan employees.pdf -p hipaa # scan a PDF
Session stats
pii-guard stats ~/.pii-guard/sessions/pii-guard-<timestamp>.json
Session: pii-guard-20240115-103000.json
Total tokens: 12
Type Count
---------------------- ------
EMAIL 4
AADHAAR 3
MOBILE_IN 3
PAN 2
Export session as CSV (for Excel / VLOOKUP)
pii-guard export-session ~/.pii-guard/sessions/pii-guard-<timestamp>.json
Output (pii-guard-<timestamp>_mapping.csv):
token,pii_type,original_value
[EMAIL_1],EMAIL,john@acme.com
[EMAIL_2],EMAIL,jane@acme.com
[AADHAAR_1],AADHAAR,2345 6789 0123
[PAN_1],PAN,ABCDE1234F
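If you would rather restore values in your own script than with VLOOKUP, the exported mapping can be read back with the standard library. This is a minimal sketch that assumes only the three-column layout shown above; `load_mapping` and `restore` are hypothetical helper names, not part of pii-guard.

```python
import csv
import io

def load_mapping(csv_text):
    """Read token -> original_value pairs from an exported mapping CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["token"]: row["original_value"] for row in reader}

def restore(text, mapping):
    """Replace every token found in `text` with its original value."""
    for token, value in mapping.items():
        text = text.replace(token, value)
    return text

# Sample rows copied from the export shown above.
SAMPLE = """token,pii_type,original_value
[EMAIL_1],EMAIL,john@acme.com
[AADHAAR_1],AADHAAR,2345 6789 0123
"""

mapping = load_mapping(SAMPLE)
print(restore("Contact [EMAIL_1] about [AADHAAR_1].", mapping))
# prints: Contact john@acme.com about 2345 6789 0123.
```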
Presets
| Preset | Covers |
|---|---|
| dpdp | 🇮🇳 Aadhaar, PAN, Voter ID, Passport, IFSC, GSTIN, UPI VPA, mobile, PIN code |
| gdpr | 🇪🇺 IBAN, BIC/SWIFT, VAT, EU phone, MAC address, GPS coordinates |
| hipaa | 🇺🇸 SSN, NPI, DEA, MRN, health plan IDs, US phone, US dates |
| pci | 💳 Visa, Mastercard, Amex, Discover, Rupay, CVV, card expiry |
pii-guard tokenize file.csv -p dpdp -p pci # combine presets
pii-guard config show-patterns dpdp # inspect patterns in a preset
Mode 2 — Claude Code hooks (automatic, zero-touch)
One command installs hooks that fire on every file Claude reads and every bash command output:
pip install piiwall
pii-guard install-hooks --global
This writes two PostToolUse hooks into ~/.claude/settings.json. Claude never sees raw PII in any session.
Add the behavioral layer (tells Claude to proactively offer tokenization):
cp integrations/CLAUDE.md ~/.claude/CLAUDE.md
What the hooks do
Claude calls Read("customers.csv")
↓
post_read.py intercepts the tool response
↓
Scans for PII → finds 20 instances
↓
Replaces with tokens, saves session key → ~/.pii-guard/sessions/claude-<session-id>.json
↓
Claude sees [EMAIL_1], [AADHAAR_1] — never the real values
All Read and Bash calls in one Claude Code session share one session file. One detokenize pass restores everything.
Restore after Claude session
pii-guard detokenize result.txt --session ~/.pii-guard/sessions/claude-<session-id>.json
# or export as CSV
pii-guard export-session ~/.pii-guard/sessions/claude-<session-id>.json
Control via environment variables
export PII_GUARD_PRESETS=dpdp,pci # comma-separated presets (default: dpdp)
export PII_GUARD_ENABLED=0 # disable hooks without removing them
export PII_GUARD_MAX_CHARS=200000 # cap bash output scan size (default: 200000)
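As an illustration of how these three variables could be interpreted, here is a small sketch. The defaults (dpdp, 200000) come from the comments above; treating every value other than "0" as enabled is an assumption, and `read_hook_settings` is a hypothetical name rather than pii-guard's own code.

```python
import os

def read_hook_settings(env=os.environ):
    """Return (enabled, presets, max_chars) from the PII_GUARD_* variables."""
    # Assumption: only the literal "0" disables the hooks.
    enabled = env.get("PII_GUARD_ENABLED", "1") != "0"
    # Comma-separated preset names; stray spaces and empty items are dropped.
    raw_presets = env.get("PII_GUARD_PRESETS", "dpdp")
    presets = [name.strip() for name in raw_presets.split(",") if name.strip()]
    max_chars = int(env.get("PII_GUARD_MAX_CHARS", "200000"))
    return enabled, presets, max_chars

print(read_hook_settings({"PII_GUARD_PRESETS": "dpdp,pci"}))
```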
Mode 3 — API proxy (Cursor, Aider, Codex, Continue.dev, any SDK)
The proxy sits between your tool and the upstream API. It tokenizes every outgoing prompt and detokenizes every response. Your tool and your code are unchanged.
pii-guard proxy --port 8111 --preset dpdp
Set the base URL in your tool
# Anthropic SDK / Claude Code / any Anthropic-compatible tool
export ANTHROPIC_BASE_URL=http://localhost:8111
# OpenAI SDK / Cursor / Aider / Codex CLI / Continue.dev / LangChain
export OPENAI_BASE_URL=http://localhost:8111/openai/v1
Your existing code works unchanged:
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_BASE_URL, so requests route through pii-guard
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API
    messages=[{"role": "user", "content": "Analyse rajesh@gmail.com, Aadhaar 2345 6789 0123"}],
)
# Anthropic receives: "Analyse [EMAIL_1], Aadhaar [AADHAAR_1]"
# Your app receives the response with any tokens restored to the real values
What the proxy does
Your tool sends prompt with real PII
↓
pii-guard proxy on localhost:8111
↓
Tokenizes PII → [EMAIL_1], [AADHAAR_1], [PAN_1]
↓
Forwards to api.anthropic.com or api.openai.com
↓
Gets response with tokens
↓
Detokenizes → real values restored
↓
Your tool receives response with real values
Anthropic and OpenAI never see the real data.
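The flow above can be sketched in a few lines. This is not the proxy's implementation: a fixed lookup table stands in for the detection engine, `fake_upstream` stands in for the remote API, and message content is assumed to be a plain string. All names here are hypothetical.

```python
# Hypothetical stand-in for the detection engine: a fixed lookup table.
MAPPING = {
    "rajesh@gmail.com": "[EMAIL_1]",
    "2345 6789 0123": "[AADHAAR_1]",
}

def tokenize(text):
    for real, token in MAPPING.items():
        text = text.replace(real, token)
    return text

def detokenize(text):
    for real, token in MAPPING.items():
        text = text.replace(token, real)
    return text

def proxy_roundtrip(payload, upstream):
    """Tokenize each message, call upstream, detokenize the reply text."""
    safe_payload = {
        **payload,
        "messages": [
            {**msg, "content": tokenize(msg["content"])}
            for msg in payload["messages"]
        ],
    }
    return detokenize(upstream(safe_payload))

seen_by_upstream = []

def fake_upstream(payload):
    # Records what the "remote API" received and echoes the tokens back.
    content = payload["messages"][0]["content"]
    seen_by_upstream.append(content)
    return f"Summary of: {content}"

request = {
    "model": "example-model",
    "messages": [{"role": "user", "content": "Analyse rajesh@gmail.com, Aadhaar 2345 6789 0123"}],
}
result = proxy_roundtrip(request, fake_upstream)
print(seen_by_upstream[0])  # Analyse [EMAIL_1], Aadhaar [AADHAAR_1]
print(result)               # Summary of: Analyse rajesh@gmail.com, Aadhaar 2345 6789 0123
```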
Proxy options
pii-guard proxy --port 8111 # default port
pii-guard proxy --preset dpdp,pci # multiple presets
pii-guard proxy --pattern "CUST_ID:CUST-\d{6}" # custom pattern
pii-guard proxy --session session.json # resume existing session
pii-guard proxy --quiet # suppress per-request logs
Restore after proxy session
pii-guard export-session ~/.pii-guard/sessions/<session-id>.json
pii-guard detokenize output.txt --session ~/.pii-guard/sessions/<session-id>.json
Per-tool guides: integrations/
Custom patterns
Persistent — ~/.pii-guard/config.yaml
Loaded automatically by the CLI, hooks, and proxy:
custom_patterns:
CUSTOMER_ID: 'CUST-\d{6}'
EMPLOYEE_ID: 'EMP\d{5}'
INTERNAL_REF: 'INT-[A-Z]{3}-\d{4}'
mkdir -p ~/.pii-guard
cp config/pii-guard.example.yaml ~/.pii-guard/config.yaml
Inline — --pattern / -P flag
pii-guard scan file.csv -P "CUSTOMER_ID:CUST-\d{6}" --show-values
pii-guard tokenize file.csv -P "CUSTOMER_ID:CUST-\d{6}" -P "EMPLOYEE_ID:EMP\d{5}"
pii-guard tokenize data.csv -p dpdp -p pci -P "ACCOUNT_REF:ACC-\d{8}"
CUST-123456 becomes [CUSTOMER_ID_1], fully reversible.
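To make the NAME:REGEX form concrete, the sketch below splits a spec on its first colon (so the regex itself may contain colons) and applies it to a line of text. It illustrates the idea only; `parse_pattern_spec` and `tokenize_with` are hypothetical names and not pii-guard's API.

```python
import re

def parse_pattern_spec(spec):
    """Split 'NAME:REGEX' on the first colon and compile the regex."""
    name, sep, regex = spec.partition(":")
    if not sep or not name or not regex:
        raise ValueError(f"expected NAME:REGEX, got {spec!r}")
    return name, re.compile(regex)

def tokenize_with(name, pattern, text):
    """Replace each distinct match with [NAME_n], reusing n for repeated values."""
    seen = {}

    def swap(match):
        # setdefault keeps the first token assigned to a value.
        return seen.setdefault(match.group(0), f"[{name}_{len(seen) + 1}]")

    return pattern.sub(swap, text)

name, pattern = parse_pattern_spec(r"CUSTOMER_ID:CUST-\d{6}")
print(tokenize_with(name, pattern, "CUST-123456 and CUST-654321, then CUST-123456"))
# prints: [CUSTOMER_ID_1] and [CUSTOMER_ID_2], then [CUSTOMER_ID_1]
```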
Use from Python
from pii_guard.presets import load_presets
from pii_guard.scanner.engine import Scanner
from pii_guard.scanner.patterns import BASE_PATTERNS
from pii_guard.tokenizer.engine import tokenize
from pii_guard.tokenizer.session import Session

raw_text = "Contact john@acme.com"  # sample input; replace with your own text

# Merge the base patterns with the chosen presets, then build a scanner
patterns = {**BASE_PATTERNS, **load_presets(["dpdp"])}
scanner = Scanner(patterns)
session = Session.new()

safe_text, matches = tokenize(raw_text, scanner, session)
session.save()  # persist the token mapping so it can be reversed later

print(f"Tokenized {len(matches)} PII instances.")
print(f"Session key: {session.path}")
How tokenization works
Same value → same token within a session. Different values → different tokens. Fully reversible.
john@acme.com → [EMAIL_1] (always, within this session)
jane@acme.com → [EMAIL_2]
john@acme.com → [EMAIL_1] ← same input, same token
2345 6789 0123 → [AADHAAR_1]
Session key stays in ~/.pii-guard/sessions/. Never sent anywhere.
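The consistency rule can be illustrated with a toy session that keeps one counter per PII type and a two-way mapping. The two regexes below are simplified stand-ins chosen for this example, not the patterns pii-guard ships, and `SketchSession` is a hypothetical class.

```python
import re
from collections import defaultdict

# Simplified illustrative patterns (assumptions, not pii-guard's own).
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "AADHAAR": re.compile(r"\b\d{4} \d{4} \d{4}\b"),
}

class SketchSession:
    def __init__(self):
        self.token_for = {}               # (type, value) -> token
        self.value_for = {}               # token -> value
        self.counters = defaultdict(int)  # type -> last number issued

    def token(self, pii_type, value):
        key = (pii_type, value)
        if key not in self.token_for:
            self.counters[pii_type] += 1
            new_token = f"[{pii_type}_{self.counters[pii_type]}]"
            self.token_for[key] = new_token
            self.value_for[new_token] = value
        return self.token_for[key]

def tokenize(text, session):
    for pii_type, pattern in PATTERNS.items():
        text = pattern.sub(lambda m, t=pii_type: session.token(t, m.group(0)), text)
    return text

def detokenize(text, session):
    for token, value in session.value_for.items():
        text = text.replace(token, value)
    return text

session = SketchSession()
original = "john@acme.com, jane@acme.com and john@acme.com again; ID 2345 6789 0123"
safe = tokenize(original, session)
print(safe)  # [EMAIL_1], [EMAIL_2] and [EMAIL_1] again; ID [AADHAAR_1]
```

Calling `detokenize(safe, session)` with the same session returns the original string, which is why the session file has to be kept for as long as you need to reverse.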
Limitations
- Regex-based detection: structured formats (Aadhaar, PAN, IBAN, SSN) have near-zero false negatives. Free-form PII (names, addresses in prose) is not detected; combine with a dedicated NER model if needed.
- DOCX formatting in PII-containing paragraphs: when a PII value spans multiple runs in a Word document (e.g., bold text adjacent to the value), the paragraph is collapsed to a single run after tokenization. Paragraphs with no PII are untouched.
- Same-session tokens only: tokens from one session cannot be detokenized with a different session key. Keep the session file for as long as you need to reverse.
- Streaming responses: the proxy detokenizes SSE streams line-by-line. A token that spans two SSE chunks will not be restored; rare but possible with large token strings.
- Proxy is localhost-only: binds to 127.0.0.1. Not designed to be network-exposed. Treat the session key file as a secret.
- No key management: session files are plain JSON on disk. Encrypt or delete when no longer needed.
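The streaming limitation is easier to see with a small experiment. The sketch below is not how the proxy works; it shows why per-chunk replacement misses a split token and one possible client-side workaround that holds back an unfinished `[TYPE_N` fragment until the next chunk arrives. It assumes token types contain only uppercase letters and underscores, and every name in it is hypothetical.

```python
import re

# An opening bracket followed by a partial token, with no closing bracket yet.
PARTIAL_TOKEN = re.compile(r"\[[A-Z_]*\d*$")

def detokenize_per_chunk(chunks, mapping):
    """Naive approach: replace tokens inside each chunk independently."""
    for chunk in chunks:
        for token, value in mapping.items():
            chunk = chunk.replace(token, value)
        yield chunk

def detokenize_buffered(chunks, mapping):
    """Hold back a trailing partial token until the rest of it arrives."""
    pending = ""
    for chunk in chunks:
        pending += chunk
        cut = PARTIAL_TOKEN.search(pending)
        if cut:
            ready, pending = pending[:cut.start()], pending[cut.start():]
        else:
            ready, pending = pending, ""
        for token, value in mapping.items():
            ready = ready.replace(token, value)
        if ready:
            yield ready
    if pending:
        yield pending  # stream ended mid-fragment: pass it through unchanged

mapping = {"[EMAIL_1]": "john@acme.com"}
chunks = ["Mail [EMA", "IL_1] today"]
print("".join(detokenize_per_chunk(chunks, mapping)))  # Mail [EMAIL_1] today
print("".join(detokenize_buffered(chunks, mapping)))   # Mail john@acme.com today
```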
CI/CD integration
GitHub Actions
Copy integrations/github-actions/pii-scan.yml into .github/workflows/ to fail PRs that introduce raw PII in CSV, JSON, TXT, or log files:
cp integrations/github-actions/pii-scan.yml .github/workflows/pii-scan.yml
pre-commit hook
Add to your .pre-commit-config.yaml:
repos:
- repo: https://github.com/sunnypuli/pii-guard
rev: main
hooks:
- id: pii-guard-scan
args: [--preset, dpdp]
Then install hooks with pre-commit install. Commits that include files with detectable PII will be blocked.
Audit log
Every scan and tokenize run appends a line to ~/.pii-guard/audit.log:
2024-01-15T10:30:00 tokenize customers.csv total=12 AADHAAR:3 EMAIL:4 PAN:2
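Because each entry is a single whitespace-separated line, the log is easy to post-process. The following sketch parses the sample line above. It assumes file names contain no spaces and that every field after the file name is either total=N or TYPE:N; `parse_audit_line` is a hypothetical helper.

```python
from datetime import datetime

def parse_audit_line(line):
    """Parse one audit.log line into a dict (layout inferred from the sample)."""
    timestamp, action, filename, *fields = line.split()
    total = None
    counts = {}
    for field in fields:
        if field.startswith("total="):
            total = int(field[len("total="):])
        else:
            # Split on the last colon so the count is always the final part.
            pii_type, _, count = field.rpartition(":")
            counts[pii_type] = int(count)
    return {
        "timestamp": datetime.fromisoformat(timestamp),
        "action": action,
        "file": filename,
        "total": total,
        "counts": counts,
    }

entry = parse_audit_line(
    "2024-01-15T10:30:00 tokenize customers.csv total=12 AADHAAR:3 EMAIL:4 PAN:2"
)
print(entry["action"], entry["file"], entry["total"], entry["counts"])
```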
Docker (proxy)
docker build -t pii-guard .
docker run -p 8111:8111 pii-guard --preset dpdp,pci
Then set ANTHROPIC_BASE_URL=http://localhost:8111 or OPENAI_BASE_URL=http://localhost:8111/openai/v1.
Contributing
Contributions welcome — especially:
- New preset patterns (country-specific IDs, sector-specific formats)
- False positive reports with reproducible examples
- IDE and tool integrations
git clone https://github.com/sunnypuli/pii-guard
cd pii-guard
python -m venv venv && source venv/bin/activate
pip install -e ".[dev]"
pytest
Pattern PRs should include a test in tests/test_presets.py covering at least one valid and one invalid example.
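A test of that shape might look like the sketch below. The regex is an illustrative PAN pattern (five letters, four digits, one letter) written for this example; a real PR would test the pattern as it is defined in the preset, and the existing layout of tests/test_presets.py may differ.

```python
import re

# Illustrative only: a stand-in for a pattern loaded from a preset.
PAN = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")

def test_pan_matches_valid_example():
    assert PAN.search("PAN: ABCDE1234F") is not None

def test_pan_rejects_invalid_example():
    # Only four leading letters, so this must not match.
    assert PAN.search("PAN: ABCD1234F") is None
```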
License
MIT