Skip to main content

Arabic/GCC PII detection, tokenization, and streaming-interception gateway.

Project description

arabic-pii-py · apii

PyPI Python License

pip install apii

Keep Arabic & GCC personal data off the LLM — without changing how you work. Names, IBANs, national IDs, phones, emails, addresses get swapped for reversible tokens on your machine, before anything reaches Claude / GPT. You keep seeing the real values. The model never does.

Everything runs locally. No cloud, no account, no data leaves your laptop (the only network call is a one-time, optional model download you can replace).


Why this exists

If you work with GCC customer data — banks, telcos, government, clinics — you often legally cannot send that PII to a US-hosted LLM. apii lets you use Claude / GPT on that data anyway: the personal data stays on your machine, the model only ever sees placeholders like EMAIL_C7E2…, and the real values are restored locally for you.

  • 🇸🇦 Built for Arabic & the GCC — Saudi / Emirati / Qatari / Kuwaiti / Bahraini / Omani shapes, IBAN ISO-7064 (MOD-97), national-ID checksums, and on-device Arabic + English NER for names & organizations.
  • 💻 100% local — no service to run, no API key required, nothing uploaded.
  • 🪶 Lightweight — pure Python, no PyTorch; NER runs as int8 ONNX.
  • 🔁 Reversible & stable — the same value always maps to the same token, and only your secret can turn it back.

⭐ The headline: transparent PII protection inside Claude Code

This is the part most tools can't do. With one command, apii wires two hooks into Claude Code:

what happens result
redact-on-read when Claude reads a file, PII in it is tokenized before the model sees it Claude only ever sees EMAIL_…, IBAN_…, PERSON_…
restore-on-write when Claude writes/edits a file, the tokens are turned back into real values before the bytes hit disk your files come out correct, the chat stays tokens
apii watch a side pane restores Claude's tokenized replies locally you read the real values; Anthropic still only got tokens

It's non-blocking — you work normally; Claude just never receives the PII.

# in the project where you keep customer data:
apii install-claude-hook          # one time — wires both hooks

# open a fresh Claude Code session there (it loads the hook at startup)
claude

# and, in another terminal pane IN THE SAME PROJECT FOLDER, watch real values:
apii watch          # follows this folder's session; --once dumps it so far

Now ask Claude to work on a file with PII. In the chat you'll see tokens; in the apii watch pane you'll see the real data; and your customer's information never left your machine.


🤖 Or just hand it to your coding agent

Don't want to wire anything up? Give your agent the skill and it learns to use apii on its own — redacting PII before it reads a file or calls a model. skills/apii/SKILL.md is the portable Agent Skills format, so the same file works in Claude Code, Codex, Cursor, and any tool that reads it.

# Claude Code — every project:
cp -r skills/apii ~/.claude/skills/     # then just talk to it (auto-loads), or run /apii
# or only this project:
cp -r skills/apii .claude/skills/
# any other agent: paste skills/apii/SKILL.md into its context or skills folder.

Which path do I pick? They stack — use as many as you like:

what it is best when
Skill skills/apii/ teaches the agent to deliberately redact / restore the agent drives; any tool; zero setup
Hook apii install-claude-hook automatic redact-on-read + restore-on-write you want it invisible & enforced — Claude Code only
Proxy apii serve a hard transport boundary — the provider literally can't see PII you don't control the client, or want it provider-wide

My take: skill + hook is the sweet spot for daily Claude Code work — the hook guarantees protection even when the agent forgets, the skill makes the agent smart about using it. Reach for the proxy when the guarantee has to hold at the wire, for clients you don't control.


Install

pip install "apii[all]"     # the whole tool: CLI + NER + proxy + documents

Requires Python 3.10+. apii[all] is what you want to use apii — the CLI, the Claude Code hook, the proxy. Embedding it as a library instead? Stay lean and add only what you touch:

install what you get
pip install apii core detection (regex + checksums) + the apii CLI
pip install "apii[ner]" + on-device PERSON / ORGANIZATION (names & orgs)
pip install "apii[cli]" + encrypted-vault persistence (--vault)
pip install "apii[proxy]" + the streaming apii serve gateway
pip install "apii[documents]" + PDF text (docx / xlsx / csv / json are built in)
pip install "apii[all]" everything above

NER models (names & organizations) auto-download once (~210 MB, int8 ONNX) from Hugging Face and cache under ~/.cache/huggingface. Without them, every structured kind (email, phone, IBAN, ID, CR, VAT, address) still works — only PERSON / ORGANIZATION need the models. Point at your own copy any time with APII_NER_MODEL / APII_NER_EN_MODEL, or change the source repo with APII_NER_HF_REPO.

To hack on it instead, clone the repo and pip install -e ".[all]" — see Make it your own below.


Ways to use it — pick what fits

1. Claude Code (above) — the transparent, zero-friction path.

2. CLI — text, files, and folders

# free text or a .txt file → tokens (mapping saved to a vault), then restore:
echo "call 0501234567, email omar@aajil.sa" | apii redact --vault demo.vault
apii restore answer.txt --vault demo.vault   # the model's tokens → real values

apii detect notes.txt                        # audit only: detections as JSON

# whole folders, format-aware (csv / json / docx / xlsx / pdf→txt):
apii scan-dir ./statements --ext csv --out audit.jsonl
apii redact-dir ./statements --out-dir ./masked --ext csv --vault s.vault

apii redact <file> reads the file as text. For documents (pdf/docx/xlsx/json) use redact-dir or the UI — they preserve layout.

3. Local UI — paste-in / paste-out (+ file upload)

apii ui    # opens http://127.0.0.1:8765 — paste text or drop a CSV/Excel,
           # take the tokens to any LLM, paste the reply back to restore.

4. As a library — embed it in your own app

from apii.anonymizer import Anonymizer

a = Anonymizer(secret="your-secret", tenant="acme")
r = a.anonymize("Email omar@aajil.sa, IBAN SA0380000000608010167519")
send_to_llm(r.text)                      # the model sees only tokens
show_user(a.deanonymize(model_reply))    # real values restored locally

5. Drop-in proxy — one local endpoint, every provider 🔌

Run one gateway and point any LLM client at it. apii tokenizes each request, sends only tokens upstream, and restores the (streamed) reply — the client never changes, the provider never sees PII, and your own API key just passes through (apii never stores it).

pip install "apii[proxy]"
apii serve                 # → http://127.0.0.1:8720   (--host / --port to change)

One port speaks three wire formats, so the same gateway fronts OpenAI, Anthropic, Codex, and anything OpenAI-compatible — OpenRouter, LiteLLM, Together, vLLM, …:

your client point it at upstream env
OpenAI SDK / chat apps base_url = http://127.0.0.1:8720/v1 APII_OPENAI_BASE (default api.openai.com)
Codex CLI (Responses API) a custom model-provider → :8720/v1, wire_api = "responses" APII_OPENAI_BASE
Claude Code / Anthropic SDK ANTHROPIC_BASE_URL = http://127.0.0.1:8720 APII_ANTHROPIC_BASE (default api.anthropic.com)
OpenRouter · LiteLLM · any OpenAI-compatible the OpenAI base_url above set APII_OPENAI_BASE to that provider
# e.g. route OpenAI-style traffic through OpenRouter, PII-safe:
APII_OPENAI_BASE=https://openrouter.ai/api/v1 apii serve
# your app: base_url = http://127.0.0.1:8720/v1  + your OpenRouter key, as usual
# e.g. point the real Codex CLI at apii — ~/.codex/config.toml
model = "gpt-4o-mini"           # any model your upstream serves
model_provider = "apii"
[model_providers.apii]
base_url = "http://127.0.0.1:8720/v1"
wire_api = "responses"
env_key  = "OPENAI_API_KEY"     # OPENROUTER_API_KEY if APII_OPENAI_BASE → OpenRouter

Every route is verified end-to-end — streaming and non-streaming — against a live provider, including the real Codex CLI: the provider gets only tokens, your client gets the real values back.


What it detects

kind how
EMAIL format
PHONE GCC country codes, Saudi 05X, intl shapes
IBAN ISO-7064 MOD-97 checksum (all 6 GCC countries)
TAX_NUMBER 15-digit Saudi / GCC VAT
COMMERCIAL_REGISTRATION 10-digit CR, label-cued
NATIONAL_ID UAE-784 / Saudi-Iqama / GCC
PERSON on-device NER (no name lists, no regex)
ORGANIZATION on-device NER
ADDRESS PO-box / street regex + NER locations

Quality is measured against a 1,340-span corpus of real, publicly-sourced values (tests/eval/) — pytest tests/python -q runs it.


Command reference

command what it does
apii redact [file] Anonymize text (stdin or a text file) → stdout; save the token↔value map to --vault.
apii restore [file] --vault V Reverse it: tokens → real values, using the vault.
apii detect [file] Audit mode — list detections as JSON, redact nothing.
apii scan-dir DIR --out F Detect across a folder; write per-file JSONL summaries + totals.
apii redact-dir DIR --out-dir D Redact every matching file (format-aware) into --out-dir, merging records into one --vault.
apii ui Local paste-in / paste-out web UI + file upload (127.0.0.1:8765).
apii serve Local anonymizing LLM proxy — /v1/messages, /v1/chat/completions, /v1/responses (needs [proxy]).
apii watch Side-viewer: tail the current folder's Claude session, restoring tokens for your screen. --once dumps the session so far.
apii install-claude-hook Wire redact-on-read + restore-on-write into Claude Code in one command (--global for all projects).
apii hook The per-event hook itself (stdin event JSON → response JSON); used by the installed hooks.
apii daemon Long-lived local hook daemon (POST /hook) — avoids a process spawn per event.
apii hook-client Thin bridge that relays a hook event to a running daemon.

Common flags: --secret (or $APII_SECRET), --tenant, --vault, --policy strict|balanced|audit, --no-ner. Run apii <cmd> --help for the rest.


Environment variables

var purpose
APII_SECRET Vault HMAC / encryption key. Falls back to the managed ~/.apii/secret (auto-created, chmod 600).
APII_HOME Config + vault directory (default ~/.apii).
APII_POLICY Default policy: strict (default) / balanced / audit.
APII_NER_CASE_AUG Lowercase-name recovery: auto (default — fires on fully-lowercase input) / always (mixed-case too) / off.
APII_NER_THRESHOLD NER minimum confidence (default 0.85).
APII_NER_MODEL / APII_NER_EN_MODEL Use your own local Arabic / English ONNX model dirs (override the auto-download).
APII_NER_HF_REPO Hugging Face repo to fetch models from (default aajil-labs-sa/arabic-pii-ner).
APII_NER_NO_DOWNLOAD Set to disable the model auto-download (fully offline).
APII_ANTHROPIC_BASE / APII_OPENAI_BASE Upstream targets for apii serve.
APII_SUPPRESS_PHRASES Path to a phrase file of structural vocabulary to never tokenize.
APII_GEO_GAZETTEER Path to an optional geo gazetteer for address detection.

How it stays private (the model)

Two separate boundaries — that's the whole trick:

  • Privacy boundary = what the LLM receives → only tokens, always.
  • Display boundary = what you see → real values, because it's your data on your machine.

The bridge is a local, encrypted vault (~/.apii/default.vault, ChaCha20) plus a secret (~/.apii/secret, chmod 600). Tokens are HMAC-SHA256(secret, value) — deterministic, and irreversible without your secret. Restoration is applied at the last mile (your screen, your files) and never re-enters the model's context.


Make it your own

This is a normal, self-contained Python package — it's yours to run, change, and ship privately. You never have to publish it anywhere or run a server.

  • Customize detection: the recognizers live in apii/recognizers/ — edit a regex, tune a checksum, add a country shape.
  • Swap the NER models: point APII_NER_MODEL / APII_NER_EN_MODEL at your own ONNX models, or set APII_NER_HF_REPO to your own Hugging Face repo.
  • Change token formats, policy, vault location (APII_HOME), tenants, etc.
  • Stay fully offline: clone, pip install -e ., bring the NER models locally — no cloud, no PyPI, no service, ever.

It's built to be forked and made internal. Keep it private; it's yours.


NER models & credit

The bundled models are int8-ONNX redistributions of two open models — please keep crediting the original authors:

Hosted (quantized) at aajil-labs-sa/arabic-pii-ner with full provenance + SHAs.


License

© Aajil Labs. Dual-licensed — your choice of MIT or Apache-2.0 (see LICENSE-MIT and LICENSE-APACHE).

You may use, modify, and redistribute this software (including privately and commercially) under either license. You must keep the copyright and license notices in copies and substantial portions. The bundled NER models are redistributed under their original authors' terms — credit them as above.

This is yours to build on — just respect the license.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

apii-0.1.1.tar.gz (267.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

apii-0.1.1-py3-none-any.whl (122.7 kB view details)

Uploaded Python 3

File details

Details for the file apii-0.1.1.tar.gz.

File metadata

  • Download URL: apii-0.1.1.tar.gz
  • Upload date:
  • Size: 267.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for apii-0.1.1.tar.gz
Algorithm Hash digest
SHA256 9754a9adc707f8e77e5335f16916dd36fbcecc1847b572139843994a4c467ecb
MD5 945689518ba6791541d31a9900aaa8a7
BLAKE2b-256 127202cf9ef515a46260227d33f53d2477f22e5763506b32f5dc66981c8fa030

See more details on using hashes here.

File details

Details for the file apii-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: apii-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 122.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for apii-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 74b194d27ce809e37d1bba99fb04dd3f2889205669c3a43608a30b1b4c8a8a20
MD5 685fd618de243581c34c67c4f2d69a1f
BLAKE2b-256 5c66b5a5db0e9a338f6063467b62db43a46a5740205d50685a19508994bfd4a1

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page