Arabic/GCC PII detection, tokenization, and streaming-interception gateway.
arabic-pii-py · apii
Keep Arabic & GCC personal data off the LLM — without changing how you work. Names, IBANs, national IDs, phones, emails, addresses get swapped for reversible tokens on your machine, before anything reaches Claude / GPT. You keep seeing the real values. The model never does.
Everything runs locally. No cloud, no account, no data leaves your laptop (the only network call is a one-time, optional model download you can replace).
Why this exists
If you work with GCC customer data — banks, telcos, government, clinics — you
often legally cannot send that PII to a US-hosted LLM. apii lets you use
Claude / GPT on that data anyway: the personal data stays on your machine, the
model only ever sees placeholders like EMAIL_C7E2…, and the real values are
restored locally for you.
- 🇸🇦 Built for Arabic & the GCC — Saudi / Emirati / Qatari / Kuwaiti / Bahraini / Omani shapes, IBAN ISO-7064 (MOD-97), national-ID checksums, and on-device Arabic + English NER for names & organizations.
- 💻 100% local — no service to run, no API key required, nothing uploaded.
- 🪶 Lightweight — pure Python, no PyTorch; NER runs as int8 ONNX.
- 🔁 Reversible & stable — the same value always maps to the same token, and only your secret can turn it back.
⭐ The headline: transparent PII protection inside Claude Code
This is the part most tools can't do. With one command, apii wires two hooks
into Claude Code:
| | what happens | result |
|---|---|---|
| redact-on-read | when Claude reads a file, PII in it is tokenized before the model sees it | Claude only ever sees EMAIL_…, IBAN_…, PERSON_… |
| restore-on-write | when Claude writes/edits a file, the tokens are turned back into real values before the bytes hit disk | your files come out correct, the chat stays tokens |
| apii watch | a side pane restores Claude's tokenized replies locally | you read the real values; Anthropic still only got tokens |
It's non-blocking — you work normally; Claude just never receives the PII.
# in the project where you keep customer data:
apii install-claude-hook # one time — wires both hooks
# open a fresh Claude Code session there (it loads the hook at startup)
claude
# and, in another terminal pane IN THE SAME PROJECT FOLDER, watch real values:
apii watch # follows this folder's session; --once dumps it so far
Now ask Claude to work on a file with PII. In the chat you'll see tokens; in the
apii watch pane you'll see the real data; and your customer's information never
left your machine.
Install (from source — it's yours, no PyPI needed)
Requires Python 3.10+.
git clone https://github.com/Aajil-Labs/arabic-pii-py.git
cd arabic-pii-py
python3 -m venv .venv && source .venv/bin/activate # a 3.10+ interpreter
pip install -e ".[ner,cli,proxy,documents]" # editable, all features
That gives you the apii command. To use it outside the venv, either keep the
venv active, or symlink it onto your PATH:
ln -sf "$PWD/.venv/bin/apii" ~/.local/bin/apii # if ~/.local/bin is on PATH
NER models (names & organizations) auto-download once (~210 MB, int8 ONNX)
from Hugging Face and cache under ~/.cache/huggingface. Without them, every
structured kind (email, phone, IBAN, ID, CR, VAT, address) still works — only
PERSON / ORGANIZATION need the models. Point at your own copy any time with
APII_NER_MODEL / APII_NER_EN_MODEL, or change the source repo with
APII_NER_HF_REPO.
Extras you can pick: ner, cli, proxy (streaming gateway), documents
(pdf/docx/xlsx). Core (pip install -e .) is just regex + checksums.
Ways to use it — pick what fits
1. Claude Code (above) — the transparent, zero-friction path.
2. CLI — text, files, and folders
# free text or a .txt file → tokens (mapping saved to a vault), then restore:
echo "call 0501234567, email omar@aajil.sa" | apii redact --vault demo.vault
apii restore answer.txt --vault demo.vault # the model's tokens → real values
apii detect notes.txt # audit only: detections as JSON
# whole folders, format-aware (csv / json / docx / xlsx / pdf→txt):
apii scan-dir ./statements --ext csv --out audit.jsonl
apii redact-dir ./statements --out-dir ./masked --ext csv --vault s.vault
apii redact <file> reads the file as text. For documents (pdf/docx/xlsx/json), use redact-dir or the UI; they preserve layout.
3. Local UI — paste-in / paste-out (+ file upload)
apii ui # opens http://127.0.0.1:8765 — paste text or drop a CSV/Excel,
# take the tokens to any LLM, paste the reply back to restore.
4. As a library — embed it in your own app
from apii.anonymizer import Anonymizer
# send_to_llm() and show_user() stand in for your own app code
a = Anonymizer(secret="your-secret", tenant="acme")
r = a.anonymize("Email omar@aajil.sa, IBAN SA0380000000608010167519")
model_reply = send_to_llm(r.text) # the model sees only tokens
show_user(a.deanonymize(model_reply)) # real values restored locally
5. Drop-in proxy — protect an app you can't modify
pip install "apii[proxy]"
apii serve # local OpenAI-/Anthropic-compatible gateway on 127.0.0.1:8720
# point your client's base URL at it (e.g. ANTHROPIC_BASE_URL=http://127.0.0.1:8720);
# it anonymizes the request and de-anonymizes the streamed response, transparently.
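De-anonymizing a streamed response has one wrinkle: a token can arrive split across two chunks. The sketch below shows one way to handle that, by holding back any trailing run of characters that could still grow into a token. It is an illustration only, not apii's actual implementation: the token shape (an uppercase kind, an underscore, 8 uppercase hex digits), the token `EMAIL_0A1B2C3D`, the function name `restore_stream`, and the plain-dict vault are all assumptions made for the example.

```python
import re
from typing import Iterable, Iterator

# Assumed token shape for this sketch: KIND_ + 8 uppercase hex digits.
TOKEN_RE = re.compile(r"\b[A-Z][A-Z_]*_[0-9A-F]{8}\b")
# A trailing run of these characters might be the start of a token.
PARTIAL_TAIL_RE = re.compile(r"[A-Z0-9_]+\Z")


def _restore(text: str, vault: dict[str, str]) -> str:
    """Swap known tokens for their values; unknown tokens pass through."""
    return TOKEN_RE.sub(lambda m: vault.get(m.group(0), m.group(0)), text)


def restore_stream(chunks: Iterable[str], vault: dict[str, str]) -> Iterator[str]:
    """Restore tokens in a chunked stream, even when a token is split."""
    pending = ""
    for chunk in chunks:
        pending += chunk
        tail = PARTIAL_TAIL_RE.search(pending)
        cut = tail.start() if tail else len(pending)
        # Everything before the cut is safe to emit; the rest waits
        # for the next chunk in case it completes a token.
        ready, pending = pending[:cut], pending[cut:]
        if ready:
            yield _restore(ready, vault)
    if pending:  # end of stream: flush whatever was held back
        yield _restore(pending, vault)


# Hypothetical token and vault contents, for illustration only.
vault = {"EMAIL_0A1B2C3D": "omar@aajil.sa"}
chunks = ["Contact EMA", "IL_0A1B", "2C3D today", "."]
print("".join(restore_stream(chunks, vault)))
```

The trade-off is latency: text that ends in uppercase letters, digits, or underscores is delayed by one chunk, which is the price of never emitting half a token.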
What it detects
| kind | how |
|---|---|
| EMAIL | format |
| PHONE | GCC country codes, Saudi 05X, intl shapes |
| IBAN | ISO-7064 MOD-97 checksum (all 6 GCC countries) |
| TAX_NUMBER | 15-digit Saudi / GCC VAT |
| COMMERCIAL_REGISTRATION | 10-digit CR, label-cued |
| NATIONAL_ID | UAE-784 / Saudi-Iqama / GCC |
| PERSON | on-device NER (no name lists, no regex) |
| ORGANIZATION | on-device NER |
| ADDRESS | PO-box / street regex + NER locations |
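To make the IBAN row concrete, here is a minimal sketch of the standard MOD-97 check restricted to the six GCC country codes. It is not the code in apii/recognizers/; the function name `is_valid_gcc_iban` and the length table are written for this example (the lengths follow the public IBAN registry).

```python
import re

# IBAN lengths for the six GCC countries, per the public IBAN registry.
GCC_IBAN_LENGTHS = {"SA": 24, "AE": 23, "QA": 29, "KW": 30, "BH": 22, "OM": 23}


def is_valid_gcc_iban(raw: str) -> bool:
    """Return True if `raw` is a GCC IBAN that passes the MOD-97 check."""
    iban = re.sub(r"\s+", "", raw).upper()
    if not re.fullmatch(r"[A-Z]{2}[0-9]{2}[A-Z0-9]+", iban):
        return False
    if GCC_IBAN_LENGTHS.get(iban[:2]) != len(iban):
        return False
    # Move country code + check digits to the end, then map
    # letters to numbers (A=10 ... Z=35) and take the result mod 97.
    rearranged = iban[4:] + iban[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


print(is_valid_gcc_iban("SA03 8000 0000 6080 1016 7519"))  # valid example IBAN
print(is_valid_gcc_iban("SA0380000000608010167518"))       # last digit altered
```

A checksum like this is what lets a detector reject random 24-character strings that merely look like an IBAN.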
Quality is measured against a 1,340-span corpus of real, publicly-sourced
values (tests/eval/) — pytest tests/python -q runs it.
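For readers unfamiliar with span-level scoring, the sketch below shows one common way to score detections against a labeled corpus: exact-match precision, recall, and F1 over (start, end, kind) spans. How tests/eval/ actually scores is not described here, so treat the `Span` type, the `span_scores` function, and the exact-match rule as assumptions.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # character offset where the detection begins
    end: int    # character offset just past the detection
    kind: str   # e.g. "EMAIL", "PHONE", "IBAN"


def span_scores(gold: set[Span], predicted: set[Span]) -> dict[str, float]:
    """Exact-match precision / recall / F1 between two sets of spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels for a single sentence.
gold = {Span(5, 15, "PHONE"), Span(23, 36, "EMAIL")}
predicted = {Span(5, 15, "PHONE"), Span(40, 44, "PERSON")}
print(span_scores(gold, predicted))
```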
Command reference
| command | what it does |
|---|---|
| apii redact [file] | Anonymize text (stdin or a text file) → stdout; save the token↔value map to --vault. |
| apii restore [file] --vault V | Reverse it: tokens → real values, using the vault. |
| apii detect [file] | Audit mode: list detections as JSON, redact nothing. |
| apii scan-dir DIR --out F | Detect across a folder; write per-file JSONL summaries + totals. |
| apii redact-dir DIR --out-dir D | Redact every matching file (format-aware) into --out-dir, merging records into one --vault. |
| apii ui | Local paste-in / paste-out web UI + file upload (127.0.0.1:8765). |
| apii serve | Local anonymizing LLM proxy: /v1/messages, /v1/chat/completions, /v1/responses (needs [proxy]). |
| apii watch | Side-viewer: tail the current folder's Claude session, restoring tokens for your screen. --once dumps the session so far. |
| apii install-claude-hook | Wire redact-on-read + restore-on-write into Claude Code in one command (--global for all projects). |
| apii hook | The per-event hook itself (stdin event JSON → response JSON); used by the installed hooks. |
| apii daemon | Long-lived local hook daemon (POST /hook); avoids a process spawn per event. |
| apii hook-client | Thin bridge that relays a hook event to a running daemon. |
Common flags: --secret (or $APII_SECRET), --tenant, --vault,
--policy strict|balanced|audit, --no-ner. Run apii <cmd> --help for the rest.
Environment variables
| var | purpose |
|---|---|
| APII_SECRET | Vault HMAC / encryption key. Falls back to the managed ~/.apii/secret (auto-created, chmod 600). |
| APII_HOME | Config + vault directory (default ~/.apii). |
| APII_POLICY | Default policy: strict (default) / balanced / audit. |
| APII_NER_CASE_AUG | Lowercase-name recovery: auto (default; fires on fully-lowercase input) / always (mixed-case too) / off. |
| APII_NER_THRESHOLD | NER minimum confidence (default 0.85). |
| APII_NER_MODEL / APII_NER_EN_MODEL | Use your own local Arabic / English ONNX model dirs (override the auto-download). |
| APII_NER_HF_REPO | Hugging Face repo to fetch models from (default aajil-labs-sa/arabic-pii-ner). |
| APII_NER_NO_DOWNLOAD | Set to disable the model auto-download (fully offline). |
| APII_ANTHROPIC_BASE / APII_OPENAI_BASE | Upstream targets for apii serve. |
| APII_SUPPRESS_PHRASES | Path to a phrase file of structural vocabulary to never tokenize. |
| APII_GEO_GAZETTEER | Path to an optional geo gazetteer for address detection. |
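If you wrap apii in your own tooling, it can help to resolve these variables in one place. The sketch below mirrors the documented defaults for a few of them; it is not apii's own config loader. The `Settings` class and `load_settings` function are hypothetical, and treating any non-empty APII_NER_NO_DOWNLOAD value as "set" is an assumption.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional

POLICIES = {"strict", "balanced", "audit"}
CASE_AUG_MODES = {"auto", "always", "off"}


@dataclass(frozen=True)
class Settings:
    home: Path            # APII_HOME, default ~/.apii
    policy: str           # APII_POLICY, default strict
    ner_threshold: float  # APII_NER_THRESHOLD, default 0.85
    ner_case_aug: str     # APII_NER_CASE_AUG, default auto
    ner_download: bool    # False when APII_NER_NO_DOWNLOAD is set


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read the documented variables, falling back to documented defaults."""
    env = os.environ if env is None else env
    policy = env.get("APII_POLICY", "strict")
    if policy not in POLICIES:
        raise ValueError(f"unknown APII_POLICY: {policy!r}")
    case_aug = env.get("APII_NER_CASE_AUG", "auto")
    if case_aug not in CASE_AUG_MODES:
        raise ValueError(f"unknown APII_NER_CASE_AUG: {case_aug!r}")
    return Settings(
        home=Path(env.get("APII_HOME", "~/.apii")).expanduser(),
        policy=policy,
        ner_threshold=float(env.get("APII_NER_THRESHOLD", "0.85")),
        ner_case_aug=case_aug,
        # Assumption: any non-empty value counts as "set".
        ner_download=not env.get("APII_NER_NO_DOWNLOAD"),
    )


print(load_settings({"APII_POLICY": "audit", "APII_NER_NO_DOWNLOAD": "1"}))
```

Passing a plain dict instead of reading os.environ directly keeps the loader easy to test.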
How it stays private (the model)
Two separate boundaries — that's the whole trick:
- Privacy boundary = what the LLM receives → only tokens, always.
- Display boundary = what you see → real values, because it's your data on your machine.
The bridge is a local, encrypted vault (~/.apii/default.vault, ChaCha20)
plus a secret (~/.apii/secret, chmod 600). Tokens are
HMAC-SHA256(secret, value): deterministic, and impossible to invert from the
token alone. Turning a token back into its value takes a lookup in the vault,
which only your secret can open. Restoration is applied at the last mile
(your screen, your files) and never re-enters the model's context.
Make it your own
This is a normal, self-contained Python package — it's yours to run, change, and ship privately. You never have to publish it anywhere or run a server.
- Customize detection: the recognizers live in apii/recognizers/; edit a regex, tune a checksum, add a country shape.
- Swap the NER models: point APII_NER_MODEL / APII_NER_EN_MODEL at your own ONNX models, or set APII_NER_HF_REPO to your own Hugging Face repo.
- Change token formats, policy, vault location (APII_HOME), tenants, etc.
- Stay fully offline: clone, pip install -e ., bring the NER models locally. No cloud, no PyPI, no service, ever.
It's built to be forked and made internal. Keep it private; it's yours.
NER models & credit
The bundled models are int8-ONNX redistributions of two open models — please keep crediting the original authors:
- Arabic: hatmimoha/arabic-ner by Hatim Mohamed (on asafaya/bert-base-arabic by Ali Safaya).
- English: dslim/bert-base-NER by David S. Lim (MIT, CoNLL-2003).

Hosted (quantized) at aajil-labs-sa/arabic-pii-ner with full provenance + SHAs.
License
© Aajil Labs. Dual-licensed — your choice of MIT or Apache-2.0
(see LICENSE-MIT and LICENSE-APACHE).
You may use, modify, and redistribute this software (including privately and commercially) under either license. You must keep the copyright and license notices in copies and substantial portions. The bundled NER models are redistributed under their original authors' terms — credit them as above.
This is yours to build on — just respect the license.