Saudi-aware PII detection & redaction for LLM pipelines. Local-first, zero telemetry.

Maintainers

nasser_gh

Project description

تبيّن · Tabayyan

🇸🇦 Read this file in Arabic (README.ar.md)

Generic PII scanners are built around Western identifiers and miss Saudi ones — or flag them with no validation. Tabayyan detects Saudi-specific personal data (National ID, Iqama, Saudi IBAN, CR, +966 mobile, medical record numbers) with real checksum validation, then tags each finding by data category and confidence so you can redact or block before text leaves your environment for an LLM endpoint.

It runs fully offline: no network calls, no telemetry, no external dependencies in the detection core.

Tabayyan is a tool for detecting sensitive personal data in text before it is sent to language models (LLMs). It focuses on Saudi identifiers (National ID, Iqama, Saudi IBAN, Commercial Registration, mobile number, medical record number) with real checksum validation. It runs entirely locally, with no external connections and no telemetry.

Why it's different

	Generic PII tools	Tabayyan
Saudi National ID / Iqama	missed or unvalidated	checksum-validated (HIGH)
Saudi IBAN	partial	ISO 13616 mod-97 (HIGH)
Arabic-Indic digits (٠-٩)	usually missed	normalised + detected
Medical Record Number	generic	health-category, PDPL/NDMO-aware
Arabic personal names	usually missed	heuristic detector (precision-first)
Homograph / lookalike domains	rare	Arabic+Latin aware (opt-in)
Network calls	sometimes	never
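The "normalised + detected" row can be illustrated with a minimal sketch of digit normalisation. This is an assumption about one reasonable approach, not Tabayyan's actual code; the names `_DIGIT_TABLE` and `normalise_digits` are hypothetical.

```python
# Hypothetical helper: map Arabic-Indic (U+0660..U+0669) and Extended
# Arabic-Indic (U+06F0..U+06F9) digits to ASCII 0-9 before pattern matching.
_DIGIT_TABLE = {
    **{0x0660 + i: str(i) for i in range(10)},
    **{0x06F0 + i: str(i) for i in range(10)},
}


def normalise_digits(text: str) -> str:
    """Return text with Arabic-Indic digits replaced by ASCII digits.

    Each digit maps to exactly one character, so character offsets in the
    normalised text line up with offsets in the original text.
    """
    return text.translate(_DIGIT_TABLE)


sample = "الهوية ١١٥٨٨١٣٩٩٦"
normalised = normalise_digits(sample)
```

Because the mapping is one-to-one, match positions found in the normalised text can be reported against the original input without an offset table.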

Status

Initial public release. The pre-1.0 version numbers (0.1–0.5) track internal development milestones, not separate public releases — the CHANGELOG documents each. Expect the API to stabilise toward 1.0.

Install

pip install tabayyan
# or, from source:
pip install -e ".[dev]"

Quick start

from tabayyan import scan

for m in scan("call +966512345678 — National ID 1010864543 on file"):
    print(m.entity_type.value, m.confidence.value, m.category.value)

Each result is a Match with entity_type, category, confidence (HIGH / MEDIUM / LOW), character start/end, the matched value, and a .redacted() placeholder.
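To show why the character offsets matter, here is a self-contained stand-in for a match record and a mask pass over it. `MatchSketch`, `Confidence` and `apply_mask` are hypothetical names for illustration only; the real `Match` class may be shaped differently.

```python
from dataclasses import dataclass
from enum import Enum


class Confidence(Enum):
    # Assumed values; the library's enum values are not documented here.
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


@dataclass(frozen=True)
class MatchSketch:
    """Illustrative stand-in for a detection result (not the library class)."""
    entity_type: str
    category: str
    confidence: Confidence
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    value: str

    def redacted(self) -> str:
        return f"[{self.entity_type.upper()}]"


def apply_mask(text: str, matches: list[MatchSketch]) -> str:
    # Replace from right to left so earlier offsets stay valid.
    for m in sorted(matches, key=lambda m: m.start, reverse=True):
        text = text[:m.start] + m.redacted() + text[m.end:]
    return text


text = "National ID 1158813996 on file"
start = text.index("1158813996")
match = MatchSketch("saudi_national_id", "identity", Confidence.HIGH,
                    start, start + 10, "1158813996")
masked = apply_mask(text, [match])
```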

CLI

# detect (table or --json); reads stdin, files, or directories
echo "National ID 1158813996" | tabayyan scan -
tabayyan scan ./docs --json --min-confidence high

# redact: mask | remove | hash | partial
cat note.txt | tabayyan redact - --mode mask
cat note.txt | tabayyan redact - --mode partial --keep-last 4
cat note.txt | tabayyan redact - --mode hash --salt "$SALT"

# CI / pre-commit gate: non-zero exit if anything is found
tabayyan scan ./src --fail-on-find

Filters: --min-confidence {low,medium,high}, --only TYPE..., --exclude TYPE....

Redaction modes

Mode	Output for a National ID	Use case
`mask`	`[SAUDI_NATIONAL_ID]`	default; keeps text readable
`remove`	(deleted)	strip entirely
`hash`	`[HASH:f999c93a6934]`	deterministic, irreversible; correlate without exposing
`partial`	`******8153`	keep last N for debugging

hash is deterministic per --salt: the same value maps to the same token, so you can correlate occurrences without revealing the value. Change the salt to break correlation across datasets.
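A minimal sketch of how a salted, deterministic token and a keep-last-N mask could be built. The construction below (HMAC-SHA256 truncated to 12 hex characters) is an assumption chosen for illustration; the README does not state which hash construction Tabayyan uses, and `hash_token` and `partial_mask` are hypothetical names.

```python
import hashlib
import hmac


def hash_token(value: str, salt: str, length: int = 12) -> str:
    """Deterministic placeholder: same value + same salt gives the same token."""
    digest = hmac.new(salt.encode("utf-8"), value.encode("utf-8"),
                      hashlib.sha256).hexdigest()
    return f"[HASH:{digest[:length]}]"


def partial_mask(value: str, keep_last: int = 4) -> str:
    """Mask everything except the last keep_last characters."""
    if keep_last <= 0:
        return "*" * len(value)
    hidden = max(len(value) - keep_last, 0)
    return "*" * hidden + value[-keep_last:]


token_a = hash_token("1158813996", salt="dataset-A")
token_b = hash_token("1158813996", salt="dataset-B")
```

Using a keyed hash rather than a bare digest means that someone without the salt cannot rebuild the token table simply by hashing every ten-digit number.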

In code:

from tabayyan import scan_and_redact, RedactionMode

result = scan_and_redact(text, RedactionMode.MASK)
print(result.text)     # sanitised
print(result.count)    # entities redacted
print(result.items)    # per-entity mapping

Confidence model

HIGH — passes a published checksum (National ID, Iqama, IBAN, credit card). Very low false-positive rate.
MEDIUM — strong, specific format match with no checksum available (+966 mobile, email).
LOW — format/context only, meaningful false-positive potential (CR, MRN). Confirm before acting.
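As an example of what "passes a published checksum" means for the HIGH tier, this is a sketch of the ISO 13616 mod-97 check applied to a Saudi IBAN. The function name and the regular expression are assumptions for illustration, and the length check (24 characters for SA) is deliberately lenient about the bank-code field.

```python
import re

# SA + 2 check digits + 20 alphanumeric characters = 24 characters in total.
_SA_IBAN = re.compile(r"^SA\d{2}[0-9A-Z]{20}$")


def is_valid_saudi_iban(raw: str) -> bool:
    """Return True if raw is a structurally valid Saudi IBAN (mod-97 == 1)."""
    iban = re.sub(r"\s+", "", raw).upper()
    if not _SA_IBAN.match(iban):
        return False
    # Move the country code and check digits to the end.
    rearranged = iban[4:] + iban[:4]
    # Convert letters to numbers (A=10 ... Z=35); digits stay as they are.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


valid = is_valid_saudi_iban("SA03 8000 0000 6080 1016 7519")
decoy = is_valid_saudi_iban("SA04 8000 0000 6080 1016 7519")
```

The decoy differs from the valid value only in its check digits, which is the kind of input a format-only regex would accept.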

Lookalike / homoglyph domains (opt-in)

Beyond PII, Tabayyan can flag domains that impersonate a watchlist using confusable characters (IDN homograph attacks), mixed scripts (including Arabic+Latin), or edit-distance typosquats.

tabayyan domains email.eml --watchlist my-domains.txt

from tabayyan.homoglyph import scan_text

scan_text("login at ex\u0430mple.com", ["example.com"])
# -> impersonation (Cyrillic 'a'), target example.com, HIGH

This is not in the default PII detector set — construct LookalikeDomainDetector(watchlist=...) or use the domains command.
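The mixed-script signal can be approximated with the standard library alone. The sketch below is not the `tabayyan.homoglyph` implementation; it uses the first word of each character's Unicode name as a rough proxy for its script, which is an assumption that works for Latin, Cyrillic and Arabic letters but is not a full Unicode script lookup.

```python
import unicodedata


def scripts_in_label(label: str) -> set[str]:
    """Collect approximate script names for the letters in one domain label."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # Unicode names start with the script, e.g. "CYRILLIC SMALL LETTER A".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split(" ")[0])
    return scripts


def is_mixed_script(domain: str) -> bool:
    """True if any single label mixes letters from more than one script."""
    return any(len(scripts_in_label(label)) > 1 for label in domain.split("."))


suspicious = is_mixed_script("ex\u0430mple.com")  # Cyrillic 'a' inside a Latin label
clean = is_mixed_script("example.com")
```

A check like this only flags mixing within a label; matching against a watchlist with confusable mapping or edit distance is a separate step.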

Benchmarks

Reproducible on a synthetic corpus with hard negatives:

python benchmarks/run.py --write   # writes benchmarks/RESULTS.md

The headline is the false-positive contrast against a naive format-only regex — checksum validation removes the entire decoy class:

Entity type	Naive regex FP	Tabayyan FP
saudi_national_id	300	0
saudi_iqama	300	0
saudi_iban	300	0
credit_card	300	0

(300 invalid-checksum decoys per type. Synthetic data measures detectors against their design assumptions, not real-world traffic — see the honest caveat below.)

Validators are independently cross-checked: National ID against alhazmy13/Saudi-ID-Validator, and IBAN + Luhn against python-stdnum plus official card-network test PANs. See REFERENCES.md.
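The contrast in the table above can be reproduced in miniature with the standard Luhn check for card numbers. This is an illustrative sketch, not the benchmark script; `CARD_RE` and `luhn_ok` are hypothetical names, and 4111111111111111 is a widely used card-network test number.

```python
import re

# Naive detector: any standalone run of 16 digits.
CARD_RE = re.compile(r"\b\d{16}\b")


def luhn_ok(number: str) -> bool:
    """Standard Luhn check: double every second digit from the right."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


text = "valid 4111111111111111, decoy 4111111111111112"
naive_hits = CARD_RE.findall(text)                      # format only
validated_hits = [n for n in naive_hits if luhn_ok(n)]  # format + checksum
```

The naive pattern reports both numbers, while the checksum filter keeps only the one that passes Luhn.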

Docker & pre-commit

# Docker
docker build -t tabayyan:local .
echo "National ID 1158813996" | docker run --rm -i tabayyan:local scan -

# pre-commit: block accidental PII in commits
#   add this repo to .pre-commit-config.yaml (see the file in this repo)

Middleware & audit (Azure / OpenAI)

Put a guard in front of your LLM endpoint: redact personal data before it leaves, and emit an audit trail — including cross-border transfer flagging (PDPL Art. 29) for endpoints outside the Kingdom.

from tabayyan import Guard, AuditLog, RedactionMode

guard = Guard(in_kingdom_hosts=["llm.myhospital.health.sa"],
              audit=AuditLog(path="audit.jsonl"))
pr = guard.protect("الهوية 1158813996", destination="https://contoso.openai.azure.com")
pr.text                      # redacted before send
pr.audit.cross_border_transfer  # True for external endpoints with personal data

Wrap an OpenAI/Azure client directly with guard.guard_openai(client, destination=...). See docs/middleware.md.
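The cross-border flag can be thought of as a host comparison against an allow-list. The sketch below assumes that is roughly how the decision is made; `is_cross_border` is a hypothetical helper, not part of the `Guard` API.

```python
from urllib.parse import urlsplit


def is_cross_border(destination: str, in_kingdom_hosts: list[str],
                    has_personal_data: bool) -> bool:
    """Flag a transfer when personal data goes to a host outside the allow-list."""
    host = (urlsplit(destination).hostname or "").lower()
    allowed = {h.lower() for h in in_kingdom_hosts}
    return has_personal_data and host not in allowed


hosts = ["llm.myhospital.health.sa"]
external = is_cross_border("https://contoso.openai.azure.com", hosts, True)
internal = is_cross_border("https://llm.myhospital.health.sa/v1/chat", hosts, True)
```

An exact-match allow-list is the conservative choice: a host is treated as external unless it is explicitly listed.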

Use it inside Presidio

Already on Microsoft Presidio? Add Tabayyan's validated Saudi/Arabic recognizers with one import:

pip install "tabayyan[presidio]"

from presidio_analyzer import AnalyzerEngine
from tabayyan.integrations.presidio import register_saudi_recognizers

analyzer = AnalyzerEngine()
register_saudi_recognizers(analyzer)   # SA_NATIONAL_ID, SA_IQAMA, SA_IBAN, ...

It complements Presidio (adds what it lacks, no duplication) and is parity-tested against the standalone engine. See docs/presidio.md.

Performance

Single-threaded, default detector set, on synthetic text:

python benchmarks/perf.py

Overlap resolution is O(n log n); a pathologically dense 5 MB sample (one entity per ~57 bytes) scans in under 2 seconds on a typical CPU. Real prose is far sparser and proportionally faster. For very large files, use streaming so memory stays flat:

tabayyan scan huge.log --stream
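One common way to get O(n log n) overlap resolution is to sort candidate spans once and then sweep through them. The sketch below assumes an "earliest start wins, longest first on ties" policy; the engine's actual tie-breaking rule (for example by confidence) is not described here, and `resolve_overlaps` is a hypothetical name.

```python
Span = tuple[int, int, str]  # (start, end_exclusive, label)


def resolve_overlaps(spans: list[Span]) -> list[Span]:
    """Keep a non-overlapping subset: sort once, then a single linear sweep."""
    # Sorting dominates the cost: O(n log n).
    ordered = sorted(spans, key=lambda s: (s[0], -(s[1] - s[0])))
    kept: list[Span] = []
    last_end = -1
    for start, end, label in ordered:
        if start >= last_end:  # no overlap with the last kept span
            kept.append((start, end, label))
            last_end = end
    return kept


candidates = [(2, 6, "phone"), (0, 10, "iban"), (10, 14, "id"), (0, 5, "cr")]
resolved = resolve_overlaps(candidates)
```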

Reversible redaction (tokenize)

from tabayyan import scan_and_redact, restore, RedactionMode

r = scan_and_redact("ID 1158813996, again 1158813996", RedactionMode.TOKENIZE)
# "ID <SAUDI_NATIONAL_ID_1>, again <SAUDI_NATIONAL_ID_1>"  (repeats share a token)
assert restore(r.text, r.vault) == "ID 1158813996, again 1158813996"

The vault (token → original) is the reversal key — store it as securely as the source data.
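The token and vault mechanics can be sketched in a few lines. This stand-in uses a format-only regex instead of a validated detector, and `tokenize_sketch` and `restore_sketch` are hypothetical names that mirror, but do not reproduce, the library functions.

```python
import re

# Stand-in detector: 10 digits starting with 1 or 2 (format only, no checksum).
ID_RE = re.compile(r"\b[12]\d{9}\b")


def tokenize_sketch(text: str, label: str = "SAUDI_NATIONAL_ID"):
    """Replace each distinct value with a numbered token; return text and vault."""
    vault: dict[str, str] = {}  # token -> original value
    seen: dict[str, str] = {}   # original value -> token

    def _swap(m: re.Match) -> str:
        value = m.group(0)
        if value not in seen:
            token = f"<{label}_{len(seen) + 1}>"
            seen[value] = token
            vault[token] = value
        return seen[value]

    return ID_RE.sub(_swap, text), vault


def restore_sketch(text: str, vault: dict[str, str]) -> str:
    """Reverse the substitution using the vault."""
    for token, value in vault.items():
        text = text.replace(token, value)
    return text


original = "ID 1158813996, again 1158813996"
tokenized, vault = tokenize_sketch(original)
```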

Extending via config

{ "disable": ["saudi_cr"],
  "custom_detectors": [
    {"label": "employee_id", "pattern": "EMP-\\d{6}",
     "category": "organisation", "confidence": "medium"}] }

tabayyan scan note.txt --config my-config.json

See docs/config.md, docs/faq.md, docs/threat-model.md, and REFERENCES.md for algorithm provenance.
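To make the config shape concrete, here is a sketch that loads the JSON above and runs its custom pattern with the standard library. It only illustrates the idea of config-driven regex detectors; `load_custom_detectors` and `run_custom` are hypothetical helpers, and the real loader may validate or interpret fields differently.

```python
import json
import re

CONFIG_JSON = r'''
{ "disable": ["saudi_cr"],
  "custom_detectors": [
    {"label": "employee_id", "pattern": "EMP-\\d{6}",
     "category": "organisation", "confidence": "medium"}] }
'''


def load_custom_detectors(raw: str):
    """Compile each custom pattern once, up front."""
    config = json.loads(raw)
    return [(d["label"], re.compile(d["pattern"]), d["confidence"])
            for d in config.get("custom_detectors", [])]


def run_custom(text: str, detectors):
    """Return (label, value, start, end, confidence) for every pattern hit."""
    return [(label, m.group(0), m.start(), m.end(), conf)
            for label, pattern, conf in detectors
            for m in pattern.finditer(text)]


detectors = load_custom_detectors(CONFIG_JSON)
hits = run_custom("badge EMP-123456 issued", detectors)
```

Note the doubled backslash in the JSON: `\\d` in the file becomes the regex `\d` after parsing.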

Scope and honest limits

Tabayyan is a detection aid, not a compliance guarantee.

Passing a checksum means a value is structurally valid, not that it was ever issued or belongs to a real person.
The National ID validator uses the de-facto community Luhn variant. It is cross-validated against an independent reference (100% agreement on 50k+ samples), but not against an authoritative government spec. Confirm before relying on it in production (see REFERENCES.md).
Arabic name detection is a heuristic, not ML NER: recall is limited by design to protect precision.
CR has no public checksum; detection is format + keyword context only.
MRN has no national format; detection is keyword-context only and is inherently lower precision. It is still tagged as health data, which carries the strictest handling obligations under PDPL/NDMO — weight it accordingly even at LOW detection confidence.
False negatives exist. Do not make this your sole control for personal or health data.

Roadmap

v0.1: detection core + Saudi/generic detectors + tests.
v0.2 (this release): redaction modes (mask/remove/hash/partial) + CLI.
v0.3: homoglyph/lookalike-domain detection, benchmark suite, Docker / pre-commit / PyPI / docs.
v0.4 (this release): Arabic name detection, streaming large files, reversible tokenize redaction, JSON config + custom detectors, O(n log n) engine, references + FAQ + threat-model docs.
v0.5 (this release): middleware + audit (cross-border flagging) and Presidio integration (validated Saudi recognizers).
Planned: optional prompt-injection heuristics (isolated module).

Contributing

See CONTRIBUTING.md. One hard rule: synthetic data only — never commit real personal data.

License

Apache-2.0.

Release history

This version

0.5.1

Jun 28, 2026

Download files

Source Distribution

tabayyan-0.5.1.tar.gz (62.5 kB)

Uploaded Jun 28, 2026 Source

Built Distribution

tabayyan-0.5.1-py3-none-any.whl (37.5 kB)

Uploaded Jun 28, 2026 Python 3

File details

Details for the file tabayyan-0.5.1.tar.gz.

File metadata

Download URL: tabayyan-0.5.1.tar.gz
Upload date: Jun 28, 2026
Size: 62.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tabayyan-0.5.1.tar.gz
Algorithm	Hash digest
SHA256	`42f97e51fd63bc8bb608398c9e04d896c5639a457dee6f0f9ee96c47017530e0`
MD5	`bf96f85e6cfe995594cddf2c36de117e`
BLAKE2b-256	`1ada155bb15c97775d3a2c802258a8e278e26532a4a824901b29d29c738dd6d6`

Provenance

The following attestation bundles were made for tabayyan-0.5.1.tar.gz:

Publisher: release.yml on nasser-gh/Tabayyan

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tabayyan-0.5.1.tar.gz
- Subject digest: 42f97e51fd63bc8bb608398c9e04d896c5639a457dee6f0f9ee96c47017530e0
- Sigstore transparency entry: 1999470923
- Sigstore integration time: Jun 28, 2026
Source repository:
- Permalink: nasser-gh/Tabayyan@ce074a0b087898a8a42d7dc8fc81d20cec0d247e
- Branch / Tag: refs/tags/v0.5.1
- Owner: https://github.com/nasser-gh
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@ce074a0b087898a8a42d7dc8fc81d20cec0d247e
- Trigger Event: push

File details

Details for the file tabayyan-0.5.1-py3-none-any.whl.

File metadata

Download URL: tabayyan-0.5.1-py3-none-any.whl
Upload date: Jun 28, 2026
Size: 37.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tabayyan-0.5.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`710495bdfe579789bfee59047e34823a8a2628262de884247cf057a3f66f022e`
MD5	`629029c4f47e58640468dac5f2b10714`
BLAKE2b-256	`9c81467b2c946a477e0787e29958c6151a7a0ec70cd1bb3c87ae75032aeba730`

Provenance

The following attestation bundles were made for tabayyan-0.5.1-py3-none-any.whl:

Publisher: release.yml on nasser-gh/Tabayyan

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tabayyan-0.5.1-py3-none-any.whl
- Subject digest: 710495bdfe579789bfee59047e34823a8a2628262de884247cf057a3f66f022e
- Sigstore transparency entry: 1999470949
- Sigstore integration time: Jun 28, 2026
Source repository:
- Permalink: nasser-gh/Tabayyan@ce074a0b087898a8a42d7dc8fc81d20cec0d247e
- Branch / Tag: refs/tags/v0.5.1
- Owner: https://github.com/nasser-gh
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@ce074a0b087898a8a42d7dc8fc81d20cec0d247e
- Trigger Event: push
