Arabic/GCC PII detection, tokenization, and streaming-interception gateway.
arabic-pii-py · apii
pip install apii
Keep Arabic & GCC personal data off the LLM — without changing how you work. Names, IBANs, national IDs, phones, emails, addresses get swapped for reversible tokens on your machine, before anything reaches Claude / GPT. You keep seeing the real values. The model never does.
Everything runs locally. No cloud, no account, no data leaves your laptop (the only network call is a one-time, optional model download you can replace).
Why this exists
If you work with GCC customer data — banks, telcos, government, clinics — you
often legally cannot send that PII to a US-hosted LLM. apii lets you use
Claude / GPT on that data anyway: the personal data stays on your machine, the
model only ever sees placeholders like EMAIL_C7E2…, and the real values are
restored locally for you.
- 🇸🇦 Built for Arabic & the GCC — Saudi / Emirati / Qatari / Kuwaiti / Bahraini / Omani shapes, IBAN ISO-7064 (MOD-97), national-ID checksums, and on-device Arabic + English NER for names & organizations.
- 💻 100% local — no service to run, no API key required, nothing uploaded.
- 🪶 Lightweight — pure Python, no PyTorch; NER runs as int8 ONNX.
- 🔁 Reversible & stable — the same value always maps to the same token, and only your secret can turn it back.
⭐ The headline: transparent PII protection inside Claude Code
This is the part most tools can't do. With one command, apii wires two hooks
into Claude Code:
| | what happens | result |
|---|---|---|
| redact-on-read | when Claude reads a file, PII in it is tokenized before the model sees it | Claude only ever sees EMAIL_…, IBAN_…, PERSON_… |
| restore-on-write | when Claude writes/edits a file, the tokens are turned back into real values before the bytes hit disk | your files come out correct, the chat stays tokens |
| apii watch | a side pane restores Claude's tokenized replies locally | you read the real values; Anthropic still only got tokens |
It's non-blocking — you work normally; Claude just never receives the PII.
```bash
# in the project where you keep customer data:
apii install-claude-hook   # one time: wires both hooks

# open a fresh Claude Code session there (it loads the hook at startup)
claude

# and, in another terminal pane IN THE SAME PROJECT FOLDER, watch real values:
apii watch                 # follows this folder's session; --once dumps it so far
```
Now ask Claude to work on a file with PII. In the chat you'll see tokens; in the
apii watch pane you'll see the real data; and your customer's information never
left your machine.
🤖 Or just hand it to your coding agent
Don't want to wire anything up? Give your agent the skill and it learns to
use apii on its own — redacting PII before it reads a file or calls a model.
skills/apii/SKILL.md follows the portable
Agent Skills format, so the same file works in
Claude Code, Codex, Cursor, and any other tool that reads that format.
```bash
# Claude Code, every project:
cp -r skills/apii ~/.claude/skills/   # then just talk to it (auto-loads), or run /apii

# or only this project:
cp -r skills/apii .claude/skills/

# any other agent: paste skills/apii/SKILL.md into its context or skills folder.
```
Which path do I pick? They stack — use as many as you like:
| | what it is | best when |
|---|---|---|
| Skill: skills/apii/ | teaches the agent to deliberately redact / restore | the agent drives; any tool; zero setup |
| Hook: apii install-claude-hook | automatic redact-on-read + restore-on-write | you want it invisible & enforced (Claude Code only) |
| Proxy: apii serve | a hard transport boundary: the provider literally can't see PII | you don't control the client, or want it provider-wide |
My take: skill + hook is the sweet spot for daily Claude Code work — the hook guarantees protection even when the agent forgets, the skill makes the agent smart about using it. Reach for the proxy when the guarantee has to hold at the wire, for clients you don't control.
Install
```bash
pip install "apii[all]"   # the whole tool: CLI + NER + proxy + documents
```
Requires Python 3.10+. apii[all] is the right choice if you want to use apii
as a tool: the CLI, the Claude Code hook, the proxy. Embedding it as a library
instead? Stay lean and add only what you touch:
| install | what you get |
|---|---|
| pip install apii | core detection (regex + checksums) + the apii CLI |
| pip install "apii[ner]" | + on-device PERSON / ORGANIZATION (names & orgs) |
| pip install "apii[cli]" | + encrypted-vault persistence (--vault) |
| pip install "apii[proxy]" | + the streaming apii serve gateway |
| pip install "apii[documents]" | + PDF text (docx / xlsx / csv / json are built in) |
| pip install "apii[all]" | everything above |
NER models (names & organizations) auto-download once (~210 MB, int8 ONNX)
from Hugging Face and cache under ~/.cache/huggingface. Without them, every
structured kind (email, phone, IBAN, ID, CR, VAT, address) still works — only
PERSON / ORGANIZATION need the models. Point at your own copy any time with
APII_NER_MODEL / APII_NER_EN_MODEL, or change the source repo with
APII_NER_HF_REPO.
To hack on it instead, clone the repo and pip install -e ".[all]" — see
Make it your own below.
Ways to use it — pick what fits
1. Claude Code (above) — the transparent, zero-friction path.
2. CLI — text, files, and folders
```bash
# free text or a .txt file → tokens (mapping saved to a vault), then restore:
echo "call 0501234567, email omar@aajil.sa" | apii redact --vault demo.vault
apii restore answer.txt --vault demo.vault   # the model's tokens → real values
apii detect notes.txt                        # audit only: detections as JSON

# whole folders, format-aware (csv / json / docx / xlsx / pdf→txt):
apii scan-dir ./statements --ext csv --out audit.jsonl
apii redact-dir ./statements --out-dir ./masked --ext csv --vault s.vault
```
apii redact <file> reads the file as text. For documents (pdf/docx/xlsx/json), use redact-dir or the UI; they preserve layout.
3. Local UI — paste-in / paste-out (+ file upload)
```bash
apii ui   # opens http://127.0.0.1:8765; paste text or drop a CSV/Excel,
          # take the tokens to any LLM, paste the reply back to restore.
```
4. As a library — embed it in your own app
```python
from apii.anonymizer import Anonymizer

a = Anonymizer(secret="your-secret", tenant="acme")
r = a.anonymize("Email omar@aajil.sa, IBAN SA0380000000608010167519")

# send_to_llm and show_user are placeholders for your own app's functions.
model_reply = send_to_llm(r.text)        # the model sees only tokens
show_user(a.deanonymize(model_reply))    # real values restored locally
```
5. Drop-in proxy — one local endpoint, every provider 🔌
Run one gateway and point any LLM client at it. apii tokenizes each
request, sends only tokens upstream, and restores the (streamed) reply — the
client never changes, the provider never sees PII, and your own API key just
passes through (apii never stores it).
```bash
pip install "apii[proxy]"
apii serve   # → http://127.0.0.1:8720 (--host / --port to change)
```
One port speaks three wire formats, so the same gateway fronts OpenAI, Anthropic, Codex, and anything OpenAI-compatible — OpenRouter, LiteLLM, Together, vLLM, …:
| your client | point it at | upstream env |
|---|---|---|
| OpenAI SDK / chat apps | base_url = http://127.0.0.1:8720/v1 | APII_OPENAI_BASE (default api.openai.com) |
| Codex CLI (Responses API) | a custom model-provider → :8720/v1, wire_api = "responses" | APII_OPENAI_BASE |
| Claude Code / Anthropic SDK | ANTHROPIC_BASE_URL = http://127.0.0.1:8720 | APII_ANTHROPIC_BASE (default api.anthropic.com) |
| OpenRouter · LiteLLM · any OpenAI-compatible | the OpenAI base_url above | set APII_OPENAI_BASE to that provider |
```bash
# e.g. route OpenAI-style traffic through OpenRouter, PII-safe:
APII_OPENAI_BASE=https://openrouter.ai/api/v1 apii serve
# your app: base_url = http://127.0.0.1:8720/v1 + your OpenRouter key, as usual
```

```toml
# e.g. point the real Codex CLI at apii (~/.codex/config.toml)
model = "gpt-4o-mini"            # any model your upstream serves
model_provider = "apii"

[model_providers.apii]
base_url = "http://127.0.0.1:8720/v1"
wire_api = "responses"
env_key = "OPENAI_API_KEY"       # OPENROUTER_API_KEY if APII_OPENAI_BASE → OpenRouter
```
Every route is verified end-to-end — streaming and non-streaming — against a live provider, including the real Codex CLI: the provider gets only tokens, your client gets the real values back.
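Restoring a streamed reply has one wrinkle worth illustrating: a token can arrive split across two chunks. The sketch below shows one way to handle that by holding back any trailing characters that could still be part of an unfinished token. It is not apii's actual implementation. The token shape (KIND_ plus 8 uppercase hex characters), the `StreamRestorer` class, and the in-memory `vault` dict are all assumptions made for illustration.

```python
import re
import string

# Assumed token shape, for illustration only: KIND_ + 8 uppercase hex chars.
TOKEN_RE = re.compile(r"[A-Z][A-Z_]*_[0-9A-F]{8}")
# Characters a token may contain; a trailing run of these might be a partial token.
TOKEN_CHARS = string.ascii_uppercase + string.digits + "_"


class StreamRestorer:
    """Swap tokens for real values in a chunked stream (hypothetical helper)."""

    def __init__(self, vault: dict[str, str]) -> None:
        self.vault = vault      # token -> real value
        self.pending = ""       # text held back until we know it is complete

    def _restore(self, text: str) -> str:
        # Unknown token-shaped strings are left untouched.
        return TOKEN_RE.sub(lambda m: self.vault.get(m.group(0), m.group(0)), text)

    def feed(self, chunk: str) -> str:
        self.pending += chunk
        # Everything before the trailing run of token characters is safe to emit.
        cut = len(self.pending.rstrip(TOKEN_CHARS))
        ready, self.pending = self.pending[:cut], self.pending[cut:]
        return self._restore(ready)

    def flush(self) -> str:
        # Call once at end of stream to emit whatever was held back.
        ready, self.pending = self.pending, ""
        return self._restore(ready)


restorer = StreamRestorer({"EMAIL_C7E2AAAA": "omar@aajil.sa"})
chunks = ["Contact EMA", "IL_C7E2", "AAAA today"]
restored = "".join(restorer.feed(c) for c in chunks) + restorer.flush()
print(restored)  # Contact omar@aajil.sa today
```

The trade-off in this design is latency: text ending in uppercase letters, digits, or underscores is delayed until the next chunk or the final flush.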
What it detects
| kind | how |
|---|---|
| EMAIL | format |
| PHONE | GCC country codes, Saudi 05X, intl shapes |
| IBAN | ISO-7064 MOD-97 checksum (all 6 GCC countries) |
| TAX_NUMBER | 15-digit Saudi / GCC VAT |
| COMMERCIAL_REGISTRATION | 10-digit CR, label-cued |
| NATIONAL_ID | UAE-784 / Saudi-Iqama / GCC |
| PERSON | on-device NER (no name lists, no regex) |
| ORGANIZATION | on-device NER |
| ADDRESS | PO-box / street regex + NER locations |
Quality is measured against a 1,340-span corpus of real, publicly-sourced
values (tests/eval/) — pytest tests/python -q runs it.
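To make the IBAN row concrete, here is a minimal sketch of an ISO-7064 MOD-97 check restricted to the six GCC country codes. It is an illustration of the standard algorithm, not the code in apii's recognizers. The function name `is_valid_gcc_iban` is hypothetical, and the length table is taken from the published IBAN registry, so verify it before relying on it.

```python
import string

# IBAN lengths per country, as listed in the IBAN registry (verify before use).
GCC_IBAN_LENGTHS = {"SA": 24, "AE": 23, "QA": 29, "KW": 30, "BH": 22, "OM": 23}
ALLOWED = set(string.ascii_uppercase + string.digits)


def is_valid_gcc_iban(raw: str) -> bool:
    """Return True if raw is a GCC IBAN with a correct MOD-97 checksum."""
    iban = "".join(raw.split()).upper()          # drop spaces, normalize case
    if GCC_IBAN_LENGTHS.get(iban[:2]) != len(iban):
        return False                             # unknown country or wrong length
    if not set(iban) <= ALLOWED:
        return False                             # only A-Z and 0-9 are allowed
    rearranged = iban[4:] + iban[:4]             # move country + check digits to the end
    # Letters become two-digit numbers: A=10, B=11, ..., Z=35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


print(is_valid_gcc_iban("SA0380000000608010167519"))  # True
print(is_valid_gcc_iban("SA0480000000608010167519"))  # False (check digits altered)
```

A checksum like this is what lets a detector reject random 24-character strings that merely look like an IBAN.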
Command reference
| command | what it does |
|---|---|
| apii redact [file] | Anonymize text (stdin or a text file) → stdout; save the token↔value map to --vault. |
| apii restore [file] --vault V | Reverse it: tokens → real values, using the vault. |
| apii detect [file] | Audit mode: list detections as JSON, redact nothing. |
| apii scan-dir DIR --out F | Detect across a folder; write per-file JSONL summaries + totals. |
| apii redact-dir DIR --out-dir D | Redact every matching file (format-aware) into --out-dir, merging records into one --vault. |
| apii ui | Local paste-in / paste-out web UI + file upload (127.0.0.1:8765). |
| apii serve | Local anonymizing LLM proxy: /v1/messages, /v1/chat/completions, /v1/responses (needs [proxy]). |
| apii watch | Side-viewer: tail the current folder's Claude session, restoring tokens for your screen. --once dumps the session so far. |
| apii install-claude-hook | Wire redact-on-read + restore-on-write into Claude Code in one command (--global for all projects). |
| apii hook | The per-event hook itself (stdin event JSON → response JSON); used by the installed hooks. |
| apii daemon | Long-lived local hook daemon (POST /hook); avoids a process spawn per event. |
| apii hook-client | Thin bridge that relays a hook event to a running daemon. |
Common flags: --secret (or $APII_SECRET), --tenant, --vault,
--policy strict|balanced|audit, --no-ner. Run apii <cmd> --help for the rest.
Environment variables
| var | purpose |
|---|---|
| APII_SECRET | Vault HMAC / encryption key. Falls back to the managed ~/.apii/secret (auto-created, chmod 600). |
| APII_HOME | Config + vault directory (default ~/.apii). |
| APII_POLICY | Default policy: strict (default) / balanced / audit. |
| APII_NER_CASE_AUG | Lowercase-name recovery: auto (default; fires on fully-lowercase input) / always (mixed-case too) / off. |
| APII_NER_THRESHOLD | NER minimum confidence (default 0.85). |
| APII_NER_MODEL / APII_NER_EN_MODEL | Use your own local Arabic / English ONNX model dirs (override the auto-download). |
| APII_NER_HF_REPO | Hugging Face repo to fetch models from (default aajil-labs-sa/arabic-pii-ner). |
| APII_NER_NO_DOWNLOAD | Set to disable the model auto-download (fully offline). |
| APII_ANTHROPIC_BASE / APII_OPENAI_BASE | Upstream targets for apii serve. |
| APII_SUPPRESS_PHRASES | Path to a phrase file of structural vocabulary to never tokenize. |
| APII_GEO_GAZETTEER | Path to an optional geo gazetteer for address detection. |
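If you wrap apii in your own tooling, it can help to resolve these variables in one place. The following is a rough sketch of how the documented defaults could be read from the environment. The `Settings` dataclass and `load_settings` function are hypothetical names, not part of apii, and treating any non-empty APII_NER_NO_DOWNLOAD value as "disabled" is an assumption.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping

POLICIES = {"strict", "balanced", "audit"}


@dataclass(frozen=True)
class Settings:
    home: Path            # config + vault directory
    policy: str           # strict / balanced / audit
    ner_threshold: float  # minimum NER confidence
    ner_download: bool    # whether the model auto-download is allowed


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Resolve a few APII_* variables, falling back to the documented defaults."""
    policy = env.get("APII_POLICY", "strict")
    if policy not in POLICIES:
        raise ValueError(f"unknown APII_POLICY: {policy!r}")
    return Settings(
        home=Path(env.get("APII_HOME", "~/.apii")).expanduser(),
        policy=policy,
        ner_threshold=float(env.get("APII_NER_THRESHOLD", "0.85")),
        # Assumption: any non-empty value disables the download.
        ner_download=not env.get("APII_NER_NO_DOWNLOAD"),
    )


settings = load_settings({"APII_HOME": "/tmp/apii-demo", "APII_POLICY": "audit"})
print(settings.policy, settings.ner_threshold)  # audit 0.85
```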
How it stays private (the model)
Two separate boundaries — that's the whole trick:
- Privacy boundary = what the LLM receives → only tokens, always.
- Display boundary = what you see → real values, because it's your data on your machine.
The bridge is a local, encrypted vault (~/.apii/default.vault, ChaCha20)
plus a secret (~/.apii/secret, chmod 600). Tokens are
HMAC-SHA256(secret, value) — deterministic, and irreversible without your
secret. Restoration is applied at the last mile (your screen, your files) and
never re-enters the model's context.
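The scheme above can be sketched in a few lines. An HMAC is one-way, so restoration depends on a stored token → value map, which is the role the vault plays. This is a simplified illustration, not apii's code: the `make_token`, `redact`, and `restore` helpers are hypothetical, the truncation to 8 hex characters is an assumption about the token format, and the plain dict stands in for the encrypted vault.

```python
import hashlib
import hmac


def make_token(secret: bytes, kind: str, value: str, hex_len: int = 8) -> str:
    """Derive a deterministic placeholder: the same secret + value gives the same token."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Assumed format: KIND_ followed by a truncated, uppercased hex digest.
    return f"{kind}_{digest[:hex_len].upper()}"


secret = b"demo-secret"          # apii reads this from APII_SECRET or ~/.apii/secret
vault: dict[str, str] = {}       # token -> real value; kept unencrypted here for brevity


def redact(kind: str, value: str) -> str:
    token = make_token(secret, kind, value)
    vault[token] = value         # remember the mapping locally
    return token


def restore(text: str) -> str:
    # Last-mile restoration: runs on your machine, after the model has replied.
    for token, value in vault.items():
        text = text.replace(token, value)
    return text


prompt = f"Draft a reminder to {redact('EMAIL', 'omar@aajil.sa')}"
reply = prompt.replace("Draft a reminder to", "Reminder sent to")  # pretend model output
print(restore(reply))  # Reminder sent to omar@aajil.sa
```

Because the token depends only on the secret and the value, repeated mentions of the same email map to one placeholder, which keeps the model's view of the text consistent.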
Make it your own
This is a normal, self-contained Python package — it's yours to run, change, and ship privately. You never have to publish it anywhere or run a server.
- Customize detection: the recognizers live in apii/recognizers/; edit a regex, tune a checksum, add a country shape.
- Swap the NER models: point APII_NER_MODEL / APII_NER_EN_MODEL at your own ONNX models, or set APII_NER_HF_REPO to your own Hugging Face repo.
- Change token formats, policy, vault location (APII_HOME), tenants, etc.
- Stay fully offline: clone, pip install -e ., bring the NER models locally. No cloud, no PyPI, no service, ever.
It's built to be forked and made internal. Keep it private; it's yours.
NER models & credit
The bundled models are int8-ONNX redistributions of two open models — please keep crediting the original authors:
- Arabic: hatmimoha/arabic-ner by Hatim Mohamed (on asafaya/bert-base-arabic by Ali Safaya).
- English: dslim/bert-base-NER by David S. Lim (MIT, CoNLL-2003).

Hosted (quantized) at aajil-labs-sa/arabic-pii-ner with full provenance + SHAs.
License
© Aajil Labs. Dual-licensed — your choice of MIT or Apache-2.0
(see LICENSE-MIT and LICENSE-APACHE).
You may use, modify, and redistribute this software (including privately and commercially) under either license. You must keep the copyright and license notices in copies and substantial portions. The bundled NER models are redistributed under their original authors' terms — credit them as above.
This is yours to build on — just respect the license.