Arabic/GCC PII detection, tokenization, and streaming-interception gateway.

arabic-pii-py · apii

Keep Arabic & GCC personal data away from the LLM without changing how you work. Names, IBANs, national IDs, phone numbers, emails, and addresses are swapped for reversible tokens on your machine, before anything reaches Claude / GPT. You keep seeing the real values. The model never does.

Everything runs locally. No cloud, no account, and no data leaves your laptop. The only network call is a one-time, optional NER model download, which you can replace with your own local copy.


Why this exists

If you work with GCC customer data — banks, telcos, government, clinics — you often legally cannot send that PII to a US-hosted LLM. apii lets you use Claude / GPT on that data anyway: the personal data stays on your machine, the model only ever sees placeholders like EMAIL_C7E2…, and the real values are restored locally for you.

  • 🇸🇦 Built for Arabic & the GCC — Saudi / Emirati / Qatari / Kuwaiti / Bahraini / Omani shapes, IBAN ISO-7064 (MOD-97), national-ID checksums, and on-device Arabic + English NER for names & organizations.
  • 💻 100% local — no service to run, no API key required, nothing uploaded.
  • 🪶 Lightweight — pure Python, no PyTorch; NER runs as int8 ONNX.
  • 🔁 Reversible & stable — the same value always maps to the same token, and only your secret can turn it back.

⭐ The headline: transparent PII protection inside Claude Code

This is the part most tools can't do. With one command, apii wires two hooks into Claude Code:

| | what happens | result |
|---|---|---|
| redact-on-read | when Claude reads a file, PII in it is tokenized before the model sees it | Claude only ever sees EMAIL_…, IBAN_…, PERSON_… |
| restore-on-write | when Claude writes/edits a file, the tokens are turned back into real values before the bytes hit disk | your files come out correct, the chat stays tokens |
| apii watch | a side pane restores Claude's tokenized replies locally | you read the real values; Anthropic still only got tokens |

It's non-blocking — you work normally; Claude just never receives the PII.

# in the project where you keep customer data:
apii install-claude-hook          # one time — wires both hooks

# open a fresh Claude Code session there (it loads the hook at startup)
claude

# and, in another terminal pane IN THE SAME PROJECT FOLDER, watch real values:
apii watch          # follows this folder's session; --once dumps it so far

Now ask Claude to work on a file with PII. In the chat you'll see tokens; in the apii watch pane you'll see the real data; and your customer's information never left your machine.
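
To make the two hooks concrete, here is a minimal sketch of the redact-on-read / restore-on-write round trip. It is not apii's hook code: the function names, the email-only regex, the 8-character token suffix, and the plain dict standing in for the vault are all assumptions made for illustration.

```python
import hashlib
import hmac
import re

# Illustrative patterns only; the real recognizers cover many more kinds.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
TOKEN_RE = re.compile(r"\bEMAIL_[0-9A-F]{8}\b")


def redact_on_read(text: str, secret: bytes, vault: dict) -> str:
    """Swap each email for a stable token and remember the mapping locally."""

    def swap(match: re.Match) -> str:
        value = match.group(0)
        digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
        token = "EMAIL_" + digest[:8].upper()  # suffix length is an assumption
        vault[token] = value
        return token

    return EMAIL_RE.sub(swap, text)


def restore_on_write(text: str, vault: dict) -> str:
    """Turn known tokens back into real values; unknown tokens pass through."""
    return TOKEN_RE.sub(lambda m: vault.get(m.group(0), m.group(0)), text)


vault = {}
seen_by_model = redact_on_read("Contact: omar@aajil.sa", b"demo-secret", vault)
edited_by_model = seen_by_model.replace("Contact", "Primary contact")
written_to_disk = restore_on_write(edited_by_model, vault)
```

In this sketch `seen_by_model` contains only the token, while `written_to_disk` contains the real address again, which mirrors the split between what the model receives and what lands in your files.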


Install (from source — it's yours, no PyPI needed)

Requires Python 3.10+.

git clone https://github.com/Aajil-Labs/arabic-pii-py.git
cd arabic-pii-py

python3 -m venv .venv && source .venv/bin/activate     # a 3.10+ interpreter
pip install -e ".[ner,cli,proxy,documents]"            # editable, all features

That gives you the apii command while the venv is active. To use it from any shell, symlink it onto your PATH:

ln -sf "$PWD/.venv/bin/apii" ~/.local/bin/apii         # if ~/.local/bin is on PATH

NER models (names & organizations) auto-download once (~210 MB, int8 ONNX) from Hugging Face and cache under ~/.cache/huggingface. Without them, every structured kind (email, phone, IBAN, ID, CR, VAT, address) still works — only PERSON / ORGANIZATION need the models. Point at your own copy any time with APII_NER_MODEL / APII_NER_EN_MODEL, or change the source repo with APII_NER_HF_REPO.

Extras you can pick: ner, cli, proxy (streaming gateway), documents (pdf/docx/xlsx). Core (pip install -e .) is just regex + checksums.


Ways to use it — pick what fits

1. Claude Code (above) — the transparent, zero-friction path.

2. CLI — text, files, and folders

# free text or a .txt file → tokens (mapping saved to a vault), then restore:
echo "call 0501234567, email omar@aajil.sa" | apii redact --vault demo.vault
apii restore answer.txt --vault demo.vault   # the model's tokens → real values

apii detect notes.txt                        # audit only: detections as JSON

# whole folders, format-aware (csv / json / docx / xlsx / pdf→txt):
apii scan-dir ./statements --ext csv --out audit.jsonl
apii redact-dir ./statements --out-dir ./masked --ext csv --vault s.vault

apii redact <file> reads the file as text. For documents (pdf/docx/xlsx/json) use redact-dir or the UI — they preserve layout.

3. Local UI — paste-in / paste-out (+ file upload)

apii ui    # opens http://127.0.0.1:8765 — paste text or drop a CSV/Excel,
           # take the tokens to any LLM, paste the reply back to restore.

4. As a library — embed it in your own app

from apii.anonymizer import Anonymizer

a = Anonymizer(secret="your-secret", tenant="acme")
r = a.anonymize("Email omar@aajil.sa, IBAN SA0380000000608010167519")

# send_to_llm and show_user stand in for your own LLM call and display code
model_reply = send_to_llm(r.text)        # the model sees only tokens
show_user(a.deanonymize(model_reply))    # real values restored locally

5. Drop-in proxy — protect an app you can't modify

pip install "apii[proxy]"
apii serve    # local OpenAI-/Anthropic-compatible gateway on 127.0.0.1:8720
# point your client's base URL at it (e.g. ANTHROPIC_BASE_URL=http://127.0.0.1:8720);
# it anonymizes the request and de-anonymizes the streamed response, transparently.
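
De-anonymizing a streamed response has one subtlety: a token can be split across two chunks. The sketch below shows one way to handle that by holding back the trailing run of word characters until the next chunk arrives. It is an illustration, not the gateway's implementation; `restore_stream`, the token pattern, and the dict used as a vault are assumptions.

```python
import re
from typing import Dict, Iterable, Iterator

# Assumed token shape: KIND_ followed by 8 uppercase hex characters.
TOKEN_RE = re.compile(r"\b[A-Z]+(?:_[A-Z]+)*_[0-9A-F]{8}\b")


def _is_word_char(ch: str) -> bool:
    return ch.isalnum() or ch == "_"


def restore_stream(chunks: Iterable[str], vault: Dict[str, str]) -> Iterator[str]:
    """Yield restored text, never emitting a possibly incomplete token."""

    def restore(text: str) -> str:
        return TOKEN_RE.sub(lambda m: vault.get(m.group(0), m.group(0)), text)

    pending = ""
    for chunk in chunks:
        pending += chunk
        # Find where the trailing run of word characters starts; a token
        # that is still arriving can only live inside that run.
        cut = len(pending)
        while cut > 0 and _is_word_char(pending[cut - 1]):
            cut -= 1
        ready, pending = pending[:cut], pending[cut:]
        if ready:
            yield restore(ready)
    if pending:
        # End of stream: whatever is left is complete.
        yield restore(pending)


demo_vault = {
    "EMAIL_C7E2A1B9": "omar@aajil.sa",
    "IBAN_0F3D9A77": "SA0380000000608010167519",
}
demo_chunks = ["Contact EMA", "IL_C7E2A1B9 about IBAN", "_0F3D9A77."]
restored = "".join(restore_stream(demo_chunks, demo_vault))
```

The cost of this approach is a little latency: the last word of each chunk is delayed until the following chunk shows whether it was part of a token.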

What it detects

| kind | how |
|---|---|
| EMAIL | format |
| PHONE | GCC country codes, Saudi 05X, intl shapes |
| IBAN | ISO-7064 MOD-97 checksum (all 6 GCC countries) |
| TAX_NUMBER | 15-digit Saudi / GCC VAT |
| COMMERCIAL_REGISTRATION | 10-digit CR, label-cued |
| NATIONAL_ID | UAE-784 / Saudi-Iqama / GCC |
| PERSON | on-device NER (no name lists, no regex) |
| ORGANIZATION | on-device NER |
| ADDRESS | PO-box / street regex + NER locations |
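
As an illustration of the IBAN row, the standard ISO 7064 MOD 97-10 check can be sketched in a few lines. This is the generic algorithm, not a copy of the recognizer in apii/recognizers/; the function name `iban_is_valid` and the length table are written for this example.

```python
# Country code -> total IBAN length for the six GCC countries.
GCC_IBAN_LENGTHS = {"SA": 24, "AE": 23, "QA": 29, "KW": 30, "BH": 22, "OM": 23}


def iban_is_valid(raw: str) -> bool:
    """Return True if `raw` is a GCC IBAN with a correct MOD-97 checksum."""
    iban = "".join(raw.split()).upper()
    if not (iban.isascii() and iban.isalnum()):
        return False
    if GCC_IBAN_LENGTHS.get(iban[:2]) != len(iban):
        return False
    # Move the country code and check digits to the end, then map
    # letters to numbers (A=10 ... Z=35) and take the remainder mod 97.
    rearranged = iban[4:] + iban[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


ok = iban_is_valid("SA0380000000608010167519")       # True
typo = iban_is_valid("SA0380000000608010167518")     # False: last digit changed
```

The checksum is what lets a detector tell a real IBAN from a random string of the same shape, so a single mistyped digit is rejected.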

Quality is measured against a 1,340-span corpus of real, publicly-sourced values (tests/eval/) — pytest tests/python -q runs it.


Command reference

| command | what it does |
|---|---|
| apii redact [file] | Anonymize text (stdin or a text file) → stdout; save the token↔value map to --vault. |
| apii restore [file] --vault V | Reverse it: tokens → real values, using the vault. |
| apii detect [file] | Audit mode: list detections as JSON, redact nothing. |
| apii scan-dir DIR --out F | Detect across a folder; write per-file JSONL summaries + totals. |
| apii redact-dir DIR --out-dir D | Redact every matching file (format-aware) into --out-dir, merging records into one --vault. |
| apii ui | Local paste-in / paste-out web UI + file upload (127.0.0.1:8765). |
| apii serve | Local anonymizing LLM proxy: /v1/messages, /v1/chat/completions, /v1/responses (needs [proxy]). |
| apii watch | Side-viewer: tail the current folder's Claude session, restoring tokens for your screen. --once dumps the session so far. |
| apii install-claude-hook | Wire redact-on-read + restore-on-write into Claude Code in one command (--global for all projects). |
| apii hook | The per-event hook itself (stdin event JSON → response JSON); used by the installed hooks. |
| apii daemon | Long-lived local hook daemon (POST /hook); avoids a process spawn per event. |
| apii hook-client | Thin bridge that relays a hook event to a running daemon. |

Common flags: --secret (or $APII_SECRET), --tenant, --vault, --policy strict|balanced|audit, --no-ner. Run apii <cmd> --help for the rest.


Environment variables

| var | purpose |
|---|---|
| APII_SECRET | Vault HMAC / encryption key. Falls back to the managed ~/.apii/secret (auto-created, chmod 600). |
| APII_HOME | Config + vault directory (default ~/.apii). |
| APII_POLICY | Default policy: strict (default) / balanced / audit. |
| APII_NER_CASE_AUG | Lowercase-name recovery: auto (default; fires on fully-lowercase input) / always (mixed-case too) / off. |
| APII_NER_THRESHOLD | NER minimum confidence (default 0.85). |
| APII_NER_MODEL / APII_NER_EN_MODEL | Use your own local Arabic / English ONNX model dirs (override the auto-download). |
| APII_NER_HF_REPO | Hugging Face repo to fetch models from (default aajil-labs-sa/arabic-pii-ner). |
| APII_NER_NO_DOWNLOAD | Set to disable the model auto-download (fully offline). |
| APII_ANTHROPIC_BASE / APII_OPENAI_BASE | Upstream targets for apii serve. |
| APII_SUPPRESS_PHRASES | Path to a phrase file of structural vocabulary to never tokenize. |
| APII_GEO_GAZETTEER | Path to an optional geo gazetteer for address detection. |
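
If you wrap apii in your own tooling, it can help to resolve these variables in one place. The sketch below reads a few of them with the defaults listed above. The `Settings` class and `load_settings` function are hypothetical helpers, not part of apii, and treating any non-empty APII_NER_NO_DOWNLOAD value as "set" is an assumption.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping

POLICIES = {"strict", "balanced", "audit"}
CASE_AUG_MODES = {"auto", "always", "off"}


@dataclass(frozen=True)
class Settings:
    home: Path
    policy: str
    ner_threshold: float
    ner_case_aug: str
    ner_download: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Resolve a subset of the APII_* variables, applying documented defaults."""
    policy = env.get("APII_POLICY", "strict")
    if policy not in POLICIES:
        raise ValueError(f"unknown APII_POLICY: {policy!r}")
    case_aug = env.get("APII_NER_CASE_AUG", "auto")
    if case_aug not in CASE_AUG_MODES:
        raise ValueError(f"unknown APII_NER_CASE_AUG: {case_aug!r}")
    return Settings(
        home=Path(env.get("APII_HOME", "~/.apii")).expanduser(),
        policy=policy,
        ner_threshold=float(env.get("APII_NER_THRESHOLD", "0.85")),
        ner_case_aug=case_aug,
        # Assumption: any non-empty value disables the auto-download.
        ner_download=not env.get("APII_NER_NO_DOWNLOAD"),
    )
```

Passing the environment as a mapping keeps the function easy to test: call `load_settings({"APII_POLICY": "audit"})` to see how a single override combines with the defaults.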

How it stays private (the mental model)

Two separate boundaries — that's the whole trick:

  • Privacy boundary = what the LLM receives → only tokens, always.
  • Display boundary = what you see → real values, because it's your data on your machine.

The bridge is a local, encrypted vault (~/.apii/default.vault, ChaCha20) plus a secret (~/.apii/secret, chmod 600). Tokens are computed as HMAC-SHA256(secret, value), so they are deterministic and reveal nothing about the value to anyone who lacks your secret; the token↔value map needed to restore them is kept only in the vault. Restoration is applied at the last mile (your screen, your files) and never re-enters the model's context.
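
The token derivation can be sketched with the standard library. Only HMAC-SHA256 over the value comes from the description above; mixing the kind and tenant into the message, the separator byte, and truncating to 8 hex characters are assumptions made for this example, and `derive_token` is a hypothetical name.

```python
import hashlib
import hmac


def derive_token(kind: str, value: str, secret: bytes,
                 tenant: str = "default", length: int = 8) -> str:
    """Derive a stable placeholder such as EMAIL_XXXXXXXX for a value."""
    # Assumption: tenant and kind are bound into the MAC input so the same
    # value yields different tokens per tenant and per kind.
    message = "\x1f".join((tenant, kind, value)).encode("utf-8")
    digest = hmac.new(secret, message, hashlib.sha256).hexdigest()
    # Truncation keeps tokens short; a shorter suffix means a higher
    # chance of two values colliding on the same token.
    return f"{kind}_{digest[:length].upper()}"


first = derive_token("EMAIL", "omar@aajil.sa", b"secret-one")
again = derive_token("EMAIL", "omar@aajil.sa", b"secret-one")
other = derive_token("EMAIL", "omar@aajil.sa", b"secret-two")
```

Here `first` and `again` are identical, which is what makes tokens stable across runs, while `other` differs because the secret changed. The function has no inverse: getting the value back means looking the token up in the vault.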


Make it your own

This is a normal, self-contained Python package — it's yours to run, change, and ship privately. You never have to publish it anywhere or run a server.

  • Customize detection: the recognizers live in apii/recognizers/ — edit a regex, tune a checksum, add a country shape.
  • Swap the NER models: point APII_NER_MODEL / APII_NER_EN_MODEL at your own ONNX models, or set APII_NER_HF_REPO to your own Hugging Face repo.
  • Change token formats, policy, vault location (APII_HOME), tenants, etc.
  • Stay fully offline: clone, pip install -e ., bring the NER models locally — no cloud, no PyPI, no service, ever.

It's built to be forked and made internal. Keep it private; it's yours.


NER models & credit

The bundled models are int8-ONNX redistributions of two open models — please keep crediting the original authors:

Hosted (quantized) at aajil-labs-sa/arabic-pii-ner with full provenance + SHAs.


License

© Aajil Labs. Dual-licensed — your choice of MIT or Apache-2.0 (see LICENSE-MIT and LICENSE-APACHE).

You may use, modify, and redistribute this software (including privately and commercially) under either license. You must keep the copyright and license notices in copies and substantial portions. The bundled NER models are redistributed under their original authors' terms — credit them as above.

This is yours to build on — just respect the license.
