
Saudi-aware PII detection & redaction for LLM pipelines. Local-first, zero telemetry.


Tabayyan — Saudi-first PII detection & redaction for LLM pipelines


Install · Quick start · Middleware & audit · Plugins · API stability · Changelog

16 detectors · 240+ tests (property · golden-regression · contract · fuzz) · Python 3.9–3.13 · zero-dependency, offline core

Generic PII libraries target international identifiers and either miss Saudi ones or flag them with no validation. Tabayyan adds first-class Saudi & Arabic identifier support — backed by real checksums — while staying offline, extensible, and production-friendly.

pip install tabayyan

Note: A detection aid, not a compliance guarantee. Deploy it as one layer in a defense-in-depth strategy, with human review for LOW-confidence findings. See Scope and honest limits.

How it works

Pipeline: input → Unicode normalization (strip zero-width/bidi characters, fold Arabic-Indic & fullwidth digits) → detection → checksum validation → classification (type · confidence · NDMO level) → redact / hash / tokenize → safe output. Pure-ASCII input is unchanged; matches map back to the original offsets.
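The normalization step can be pictured with a minimal sketch. This is not Tabayyan's implementation: `normalize_with_offsets` and the `_INVISIBLE` set are hypothetical names, and folding every Unicode decimal digit (rather than only Arabic-Indic and fullwidth ones) is a simplifying assumption.

```python
import unicodedata

# Zero-width and bidi control characters that can be used to split identifiers.
# Illustrative set only; a real list may be longer.
_INVISIBLE = {
    "\u200b", "\u200c", "\u200d", "\u2060", "\ufeff",                      # zero-width
    "\u200e", "\u200f", "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # bidi marks/overrides
    "\u2066", "\u2067", "\u2068", "\u2069",                                # bidi isolates
}


def normalize_with_offsets(text):
    """Return (normalized, offsets); offsets[i] is the index in `text`
    of the character that produced normalized[i]."""
    out, offsets = [], []
    for i, ch in enumerate(text):
        if ch in _INVISIBLE:
            continue  # dropped: emits no character, so no offset entry
        value = unicodedata.decimal(ch, None)
        if value is not None:
            ch = str(value)  # fold any Unicode decimal digit to ASCII
        out.append(ch)
        offsets.append(i)
    return "".join(out), offsets
```

A match found at `normalized[a:b]` corresponds to `text[offsets[a]:offsets[b - 1] + 1]` in the original, which is how a redaction can be applied to the untouched input.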

Before / after

scan_and_redact(text, "mask") — detects, then rewrites the original text in place:

Input:

المريض محمد بن عبدالله
National ID: 1158813996
Phone: +966512345678

Output:

المريض [ARABIC_NAME]
National ID: [SAUDI_NATIONAL_ID]
Phone: [SAUDI_MOBILE]

Why Tabayyan?

| Capability | Tabayyan | Generic PII tools |
| --- | --- | --- |
| Saudi National ID / Iqama | ✅ checksum-validated | ❌ missed / unvalidated |
| Saudi IBAN (mod-97) | | ⚠️ partial |
| Saudi VAT · CR · passport · National Address | | |
| Arabic-Indic digits & Unicode-evasion aware | ✅ normalized | ⚠️ often missed |
| Checksum validation (not just format) | | ⚠️ rare |
| Offline · zero-dependency core | | ⚠️ varies |
| LLM guard + PDPL cross-border audit | | |
| Homograph / Arabic+Latin lookalike domains | ✅ (opt-in) | ⚠️ rare |

Works with

Built-in, auto-detecting adapters for OpenAI, Azure OpenAI, and Anthropic; a validated recognizer pack for Microsoft Presidio; and a provider-agnostic building block (Guard.protect_messages / Guard.wrap) that drops into any stack — FastAPI, LangChain, batch jobs, or your own SDK.

Playground (demo web UI)


Try Tabayyan in the browser — paste text, see PII highlighted, classified, and redacted — with zero external calls. It's a demo that consumes the public API only (no core changes):

pip install -e . && pip install -r playground/requirements.txt
uvicorn playground.app:app          # http://127.0.0.1:8000

Two-column editor, per-category highlighting, detection cards, JSON view, redaction preview (mask / remove / partial / hash / tokenize), synthetic Arabic samples, .txt upload, and light/dark themes. See playground/README.md.

Status

Public release (v0.8.0). The pre-1.0 version numbers track development milestones — the CHANGELOG documents each. Expect the API to stabilise toward 1.0. What's covered by versioning and what's still experimental is spelled out in docs/api-stability.md.

Install

pip install tabayyan             # core (zero dependencies)
pip install "tabayyan[crypto]"   # + encrypted tokenize vault
pip install "tabayyan[presidio]" # + Microsoft Presidio recognizers
# from source (dev): pip install -e ".[dev]"

Quick start

from tabayyan import scan, scan_and_redact, RedactionMode

for m in scan("call +966512345678 — National ID 1010864542 on file"):
    print(m.entity_type.value, m.confidence.value, m.category.value)

# Redact in one step
result = scan_and_redact("National ID 1158813996", RedactionMode.MASK)
print(result.text)  # National ID [SAUDI_NATIONAL_ID]

Each result is a Match with entity_type, category, confidence (HIGH / MEDIUM / LOW), character start/end, the matched value, and a .redacted() placeholder.

Windows: if printing Arabic raises UnicodeEncodeError, set PYTHONIOENCODING=utf-8 (a console limitation, not the library) — see the FAQ.

CLI

# detect (table or --json); reads stdin, files, or directories
echo "National ID 1158813996" | tabayyan scan -
tabayyan scan ./docs --json --min-confidence high

# redact: mask | remove | hash | partial
cat note.txt | tabayyan redact - --mode mask
cat note.txt | tabayyan redact - --mode partial --keep-last 4
cat note.txt | tabayyan redact - --mode hash --salt "$SALT"

# CI / pre-commit gate: non-zero exit if anything is found
tabayyan scan ./src --fail-on-find

Filters: --min-confidence {low,medium,high}, --only TYPE..., --exclude TYPE....

Redaction modes

| Mode | Output for a National ID | Use case |
| --- | --- | --- |
| mask | [SAUDI_NATIONAL_ID] | default; keeps text readable |
| remove | (deleted) | strip entirely |
| hash | [HASH:f999c93a6934] | keyed (HMAC), deterministic; correlate without exposing |
| partial | ******8153 | keep last N for debugging |

hash is HMAC-SHA256 keyed by --salt and requires a non-empty salt — a bare digest of a 10-digit identifier is reversible by brute force, so the key is what makes the token non-reversible. The same value maps to the same token under a given salt, so you can correlate occurrences without revealing the value; change the salt to break correlation across datasets. Treat hash output as pseudonymous, not anonymous.
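A minimal sketch of keyed hashing along these lines, using only the standard library. `hash_token` is a hypothetical helper, not Tabayyan's function, and truncating to 12 hex characters is an assumption inferred from the `[HASH:f999c93a6934]` example above.

```python
import hashlib
import hmac


def hash_token(value, salt, length=12):
    """Deterministic, keyed placeholder for `value` (hypothetical helper)."""
    if not salt:
        # An unkeyed digest of a 10-digit identifier can be brute-forced.
        raise ValueError("a non-empty salt is required")
    digest = hmac.new(salt.encode("utf-8"), value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"[HASH:{digest[:length]}]"  # truncation length is an assumption
```

The same value and salt always produce the same token, while a different salt produces an unrelated one, which is what allows correlation within a dataset but not across datasets.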

In code:

from tabayyan import scan_and_redact, RedactionMode

text = "National ID 1158813996"  # any input string
result = scan_and_redact(text, RedactionMode.MASK)
print(result.text)   # sanitised
print(result.count)  # entities redacted
print(result.items)  # per-entity mapping

Confidence model

  • HIGH — passes a published checksum (National ID, Iqama, IBAN, credit card). Very low false-positive rate.
  • MEDIUM — strong, specific format match with no checksum available (+966 mobile, email).
  • LOW — format/context only, meaningful false-positive potential (CR, MRN). Confirm before acting.
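To illustrate what "passes a published checksum" means for the HIGH tier, here is a sketch of the standard IBAN mod-97 check (ISO 13616). The function names are hypothetical and this is not Tabayyan's validator; the 24-character length for `SA` IBANs comes from the IBAN registry.

```python
def iban_mod97_ok(iban):
    """Generic IBAN check: move the first four characters to the end,
    map letters to numbers (A=10 ... Z=35), and require remainder 1 mod 97."""
    s = iban.replace(" ", "").upper()
    if len(s) < 5 or not (s.isascii() and s.isalnum()):
        return False
    rearranged = s[4:] + s[:4]
    # int(ch, 36) maps 0-9 to 0-9 and A-Z to 10-35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def looks_like_saudi_iban(iban):
    """Format plus checksum: 'SA' prefix, 24 characters, valid mod-97."""
    s = iban.replace(" ", "").upper()
    return s.startswith("SA") and len(s) == 24 and iban_mod97_ok(s)
```

A format-only regex would accept any 24-character string starting with `SA`; the mod-97 step is what rejects values with a mistyped or invented digit.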

Lookalike / homoglyph domains (opt-in)

Beyond PII, Tabayyan can flag domains that impersonate a watchlist using confusable characters (IDN homograph attacks), mixed scripts (including Arabic+Latin), or edit-distance typosquats.

tabayyan domains email.eml --watchlist my-domains.txt

from tabayyan.homoglyph import scan_text

scan_text("login at ex\u0430mple.com", ["example.com"])
# -> impersonation (Cyrillic 'a'), target example.com, HIGH

This is not in the default PII detector set — construct LookalikeDomainDetector(watchlist=...) or use the domains command.
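The idea behind confusable and mixed-script checks can be sketched in a few lines. This is an illustration only, not the logic of `LookalikeDomainDetector`: the `CONFUSABLES` table is a tiny hand-picked sample, and `scripts`, `skeleton`, and `impersonates` are hypothetical names.

```python
import unicodedata

# Tiny illustrative table of Cyrillic letters that resemble Latin ones.
# Real confusable tables are far larger.
CONFUSABLES = {"\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p", "\u0441": "c"}


def scripts(label):
    """Scripts used by the letters of a label, e.g. {'LATIN', 'CYRILLIC'}."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0] for ch in label if ch.isalpha()}


def skeleton(domain):
    """Replace known lookalike characters with the letter they imitate."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in domain.lower())


def impersonates(domain, watchlist):
    """Return the watchlist entry that `domain` visually imitates, or None."""
    skel = skeleton(domain)
    for target in watchlist:
        if domain.lower() != target and skel == target:
            return target
    return None
```

A label whose `scripts(...)` set has more than one member is a mixed-script candidate, and a non-`None` result from `impersonates` is a confusable hit; edit-distance typosquats would need a separate comparison that this sketch omits.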

Benchmarks

Reproducible on a synthetic corpus with hard negatives:

python benchmarks/run.py --write  # writes benchmarks/RESULTS.md

The headline is the false-positive contrast against a naive format-only regex — checksum validation removes the entire decoy class:

| Entity type | Naive regex FP | Tabayyan FP |
| --- | --- | --- |
| saudi_national_id | 300 | 0 |
| saudi_iqama | 300 | 0 |
| saudi_iban | 300 | 0 |
| credit_card | 300 | 0 |

(300 invalid-checksum decoys per type. Synthetic data measures detectors against their design assumptions, not real-world traffic — see the honest caveat below.)

The run also reports an evasion-robustness section: recall on identifiers hidden behind zero-width, Arabic-Indic, or fullwidth characters, with the normalization pre-pass on vs off — recall stays 1.000 normalized and collapses without it. Full tables in benchmarks/RESULTS.md.

Validators are independently cross-checked: National ID against alhazmy13/Saudi-ID-Validator, and IBAN + Luhn against python-stdnum plus official card-network test PANs. See REFERENCES.md.

Docker & pre-commit

# Docker
docker build -t tabayyan:local .
echo "National ID 1158813996" | docker run --rm -i tabayyan:local scan -

# pre-commit: block accidental PII in commits
# add this repo to .pre-commit-config.yaml (see the file in this repo)

Middleware & audit (Azure / OpenAI)

Put a guard in front of your LLM endpoint: redact personal data before it leaves, and emit an audit trail — including cross-border transfer flagging (PDPL Art. 29) for endpoints outside the Kingdom.

from tabayyan import Guard, AuditLog, RedactionMode

guard = Guard(in_kingdom_hosts=["llm.myhospital.health.sa"],
              audit=AuditLog(path="audit.jsonl"))
pr = guard.protect("الهوية 1158813996", destination="https://contoso.openai.azure.com")
pr.text                         # redacted before send
pr.audit.cross_border_transfer  # True for external endpoints with personal data

Wrap any LLM client — OpenAI/Azure or Anthropic, auto-detected — with guard.wrap(client, destination=...), then call .create(...); PII is redacted before the request leaves. See docs/middleware.md.
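The cross-border decision itself is simple to picture. The sketch below assumes an exact, case-insensitive host match against the in-Kingdom list; how `Guard` actually compares hosts is not documented here, and `is_cross_border` is a hypothetical helper.

```python
from urllib.parse import urlparse


def is_cross_border(destination, in_kingdom_hosts, has_personal_data):
    """True when personal data is headed to a host outside the in-Kingdom list.
    Exact host matching is an assumption of this sketch."""
    host = (urlparse(destination).hostname or "").lower()
    allowed = {h.lower() for h in in_kingdom_hosts}
    return bool(has_personal_data) and host not in allowed
```

With the values from the example above, `https://contoso.openai.azure.com` is flagged when the text contains personal data, while a request to `llm.myhospital.health.sa` is not.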

Use it inside Presidio

Already on Microsoft Presidio? Add Tabayyan's validated Saudi/Arabic recognizers with one import:

pip install "tabayyan[presidio]"

from presidio_analyzer import AnalyzerEngine
from tabayyan.integrations.presidio import register_saudi_recognizers

analyzer = AnalyzerEngine()
register_saudi_recognizers(analyzer)  # SA_NATIONAL_ID, SA_IQAMA, SA_IBAN, ...

It complements Presidio (adds what it lacks, no duplication) and is parity-tested against the standalone engine. See docs/presidio.md.

Performance

Single-threaded, default detector set, on synthetic text:

python benchmarks/perf.py

Overlap resolution sorts in O(n log n) and accepts each match with two bisect lookups; keeping the disjoint set ordered uses list.insert, so the worst case is O(n²) for pathologically dense input (n = matches, not bytes). In practice n is tiny: a dense 5 MB sample (one entity per ~57 bytes) still scans in under 2 seconds on a typical CPU, and real prose is far sparser. For very large files, use streaming so memory stays flat:
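One way to realise the approach described above is sketched below. The priority rule (longer spans first, then earlier start) is an assumption, spans are taken to be half-open and non-empty, and `resolve_overlaps` is a hypothetical name rather than the library's function.

```python
from bisect import bisect_left, bisect_right


def resolve_overlaps(spans):
    """Keep a disjoint subset of (start, end) spans, preferring longer ones."""
    starts, ends = [], []  # accepted spans, kept sorted and pairwise disjoint
    for s, e in sorted(spans, key=lambda p: (-(p[1] - p[0]), p[0])):  # O(n log n)
        i = bisect_right(ends, s)   # accepted spans lying entirely to the left
        j = bisect_left(starts, e)  # accepted spans that start before this one ends
        if i == j:                  # nothing accepted overlaps [s, e)
            starts.insert(i, s)     # list.insert is O(n): source of the O(n^2) worst case
            ends.insert(i, e)
    return list(zip(starts, ends))
```

Because the accepted spans are disjoint, the two counts differ exactly when some accepted span overlaps the candidate, so each acceptance test costs two binary searches.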

tabayyan scan huge.log --stream

Reversible redaction (tokenize)

from tabayyan import scan_and_redact, restore, RedactionMode

r = scan_and_redact("ID 1158813996, again 1158813996", RedactionMode.TOKENIZE)
# "ID <SAUDI_NATIONAL_ID_1>, again <SAUDI_NATIONAL_ID_1>"  (repeats share a token)
assert restore(r.text, r.vault) == "ID 1158813996, again 1158813996"

The vault (token → original) is the reversal key — store it as securely as the source data.
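The token/vault mechanics can be sketched without the library. The helpers below are hypothetical (`tokenize_spans`, `restore_tokens`) and take already-detected, non-overlapping spans as input; the token shape `<LABEL_n>` follows the example above.

```python
import re


def tokenize_spans(text, spans):
    """Replace each (start, end, label) span with a numbered token.
    Repeated values under the same label share one token."""
    vault, seen, counters = {}, {}, {}
    out, pos = [], 0
    for start, end, label in sorted(spans):
        value = text[start:end]
        key = (label, value)
        if key not in seen:
            counters[label] = counters.get(label, 0) + 1
            seen[key] = f"<{label}_{counters[label]}>"
            vault[seen[key]] = value  # token -> original
        out.append(text[pos:start])
        out.append(seen[key])
        pos = end
    out.append(text[pos:])
    return "".join(out), vault


def restore_tokens(text, vault):
    """Swap known tokens back to their original values; unknown tokens are left as-is."""
    return re.sub(r"<[A-Z0-9_]+_\d+>", lambda m: vault.get(m.group(0), m.group(0)), text)
```

Anyone holding `vault` can undo the redaction, which is why it deserves the same protection as the source text.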

Extending via config

{ "disable": ["saudi_cr"],
  "custom_detectors": [
    {"label": "employee_id", "pattern": "EMP-\\d{6}",
     "category": "organisation", "confidence": "medium"}] }

tabayyan scan note.txt --config my-config.json
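To show how a `custom_detectors` entry could translate into a matcher, here is a standalone sketch that parses the same JSON and applies the pattern with `re`. It does not use Tabayyan's config loader; `compile_custom_detectors` and `find_custom` are hypothetical helpers, and the `disable` key is ignored.

```python
import json
import re

# Same shape as the config above; the raw string keeps the JSON escaping intact.
CONFIG = r'''
{ "disable": ["saudi_cr"],
  "custom_detectors": [
    {"label": "employee_id", "pattern": "EMP-\\d{6}",
     "category": "organisation", "confidence": "medium"}] }
'''


def compile_custom_detectors(config):
    """Turn each custom_detectors entry into (label, compiled regex, confidence)."""
    return [(d["label"], re.compile(d["pattern"]), d["confidence"])
            for d in config.get("custom_detectors", [])]


def find_custom(text, detectors):
    """Return (label, start, end, confidence) for every pattern hit."""
    return [(label, m.start(), m.end(), conf)
            for label, rx, conf in detectors
            for m in rx.finditer(text)]
```

Note the double backslash: JSON decodes `"EMP-\\d{6}"` to the regex `EMP-\d{6}`, so a single backslash in the config file would be invalid JSON.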

See docs/config.md, docs/faq.md, docs/threat-model.md, and REFERENCES.md for algorithm provenance.

Scope and honest limits

Tabayyan is a detection aid, not a compliance guarantee.

  • Passing a checksum means a value is structurally valid, not that it was ever issued or belongs to a real person.
  • The National ID validator uses the de-facto community Luhn variant, cross-validated against an independent reference (100% agreement on 50k+ samples) but not an authoritative government spec. Confirm before production reliance (see docs/REFERENCES.md).
  • Arabic name detection is a heuristic, not ML NER: recall is limited by design to protect precision.
  • CR has no public checksum; detection is format + keyword context only.
  • MRN has no national format; detection is keyword-context only and is inherently lower precision. It is still tagged as health data, which carries the strictest handling obligations under PDPL/NDMO — weight it accordingly even at LOW detection confidence.
  • False negatives exist. Do not make this your sole control for personal or health data.
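The first two points can be made concrete with a sketch of the community Luhn variant for 10-digit Saudi IDs. This is an illustration, not Tabayyan's validator: the leading-digit rule (1 for National ID, 2 for Iqama) follows the community convention and is an assumption here, as are the function names.

```python
def luhn_ok(number):
    """Standard Luhn: from the right, double every second digit,
    subtract 9 from results above 9, and require a sum divisible by 10."""
    if not (number.isascii() and number.isdigit()):
        return False
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def looks_like_saudi_id(number):
    """Structural check only: 10 digits, leading 1 or 2 (assumed convention), valid Luhn."""
    return len(number) == 10 and number[0] in "12" and luhn_ok(number)
```

A `True` result says the digits are internally consistent, nothing more: roughly one in ten random 10-digit strings with a valid prefix passes, and none of them need correspond to an issued ID.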

Roadmap

  • v0.1 — detection core + Saudi/generic detectors + tests.
  • v0.2 — redaction modes (mask/remove/hash/partial) + CLI.
  • v0.3 — homoglyph/lookalike-domain detection, benchmark suite, Docker / pre-commit / PyPI / docs.
  • v0.4 — Arabic name detection, streaming large files, reversible tokenize redaction, JSON config + custom detectors, faster bisect-based overlap resolution, references + FAQ + threat-model docs.
  • v0.5 — middleware + audit (cross-border flagging) and Presidio integration (validated Saudi recognizers).
  • v0.6 — six new Saudi entities (VAT, landline, passport, border/visa, National Address, unified 700); offset-preserving anti-evasion normalization; provider-agnostic adapter layer (OpenAI + Anthropic); NDMO data classification in the audit; password-encrypted tokenize vault; expanded precision/recall + evasion-robustness benchmarks; and security hardening (HMAC-keyed hash, block-path leak fix, timezone-aware audit timestamps).
  • v0.7 — detector plugin system (register_detector + opt-in entry_points discovery); verification & governance: property-based tests, golden corpus + contract tests, frozen public-API + SemVer/deprecation policy; expanded threat model; scheduled fuzzing; and release-engineering docs (RELEASE, compatibility matrix, ADRs, detector guide).
  • v0.7.1 — fixes from a community hands-on review: README quick-start National ID checksum fix (+ regression test), keep_last alias for the CLI's --keep-last, and docs for Windows console encoding and Arabic-name detection scope.
  • Toward 1.0 — the verification, API-stability, and governance foundations are in place; 1.0 is a stabilization milestone rather than a feature one.

After 1.0

A short list of priorities (not a wishlist):

  • improved homoglyph / letter-confusable handling in free text;
  • additional regional identifiers;
  • enterprise integrations;
  • performance and streaming improvements;
  • optional static typing (mypy) in CI;
  • optional prompt-injection heuristics (isolated module).

Contributing

See CONTRIBUTING.md and the detector guide. One hard rule: synthetic data only — never commit real personal data. Releases follow RELEASE.md; supported environments are listed in docs/compatibility.md, and the design rationale lives in the ADRs.

License

Apache-2.0.

