Offline Arabic name transliteration and similarity — no LLM, no API calls, no network. Convert English names to Arabic and match Arabic name variants (hamza, tashkeel, taa-marbuta) 100% on your own machine. Built for KYC, compliance, and any workflow where names must never leave your infrastructure.

arabnamer — Arabic name transliteration & similarity

Offline Arabic name transliteration and similarity — no LLM, no API calls, no network.

from arabnamer import translit, similarity

translit("Mohammed Ali").arabic        # → 'محمد علي'
translit("Layla Al Saleh").arabic      # → 'ليلى الصالح'

similarity("أحمد حسن", "احمد حسن")      # → (True, 100)

Why offline matters

Names are personal data. Shipping them to Google Translate, OpenAI, Claude, or any cloud API means exposing PII to a third party — a hard compliance problem for finance, legal, healthcare, government, and MENA-region institutions bound by data-residency laws.

arabnamer solves both major Arabic-name problems on your own machine:

Transliteration: Mohammed Ali → محمد علي (handles the tricky cases: hamza variants, name particles such as El/Al/Abd, silent vowels, dialect spellings)
Matching: أحمد حسن ≡ احمد حسن ≡ أحمد حسن (scoring is insensitive to hamza, tashkeel, taa-marbuta, and alef-maksura variants)

No model server, no internet, no API key, no request logs. Install once via pip, run anywhere — including air-gapped environments. The 38 MB pruned XGBoost model and 22,798-pair dictionary are bundled inside the wheel.

	Cloud APIs (Google, OpenAI, Claude)	`arabnamer`
Network required	✅ yes	❌ no — 100% offline
Names leave your infrastructure	✅ yes	❌ no
Per-call cost	metered	zero
Works in air-gapped / on-prem	❌ no	✅ yes
Model audit / replacement	❌ opaque	✅ open weights + retrainable
Accuracy (25-name MENA benchmark)	varies by model	98.0 avg, 23/25 pass ≥ 90

Built for: KYC / sanctions screening, compliance-gated entity resolution, library & archive cataloguing, Arabic NLP preprocessing, on-premise search relevance — any workflow where names must never leave your infrastructure.

Install

pip install arabnamer

Works offline after install — the model is bundled (gzipped, ~38 MB). No external services, no API keys.

Quick start

1. Transliterate English → Arabic

from arabnamer import translit, translit_batch

result = translit("Mohammed Ali")
print(result.arabic)    # 'محمد علي'
print(result.score)     # 0.0 (no reference supplied)
print(result.engine)    # 'xgboost'

# batch
results = translit_batch(["Ahmad Hassan", "Fatima Mansour", "Layla Al Saleh"])
for r in results:
    print(f"{r.input:<25} -> {r.arabic}")

2. Score with a reference (lenient, normalized)

from arabnamer import translit

r = translit("Ahmad Hassan", reference="أحمد حسن")
print(r.score)       # 100.0
print(r.accepted)    # True  (>= default threshold 85)

3. Arabic ↔ Arabic similarity (tashkeel / hamza / taa-marbuta insensitive)

from arabnamer import similarity

similarity("أحمد حسن", "احمد حسن")          # (True, 100)   hamza variant
similarity("مروة فرج", "مروه فرج")           # (True, 100)   taa-marbuta vs haa
similarity("محمد", "أحمد", threshold=75)     # (True, 75): different names that still pass at a threshold this low
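
To make the idea of variant-insensitive matching concrete, here is a minimal sketch of one way such a comparison could work: strip tashkeel, fold hamza-carrying alefs, taa-marbuta, and alef-maksura onto a canonical letter, then score the normalized strings. This is not arabnamer's implementation. The names `normalize` and `sketch_similarity` are hypothetical, the folding table is an assumption, and the standard-library `difflib` stands in for whatever scorer the library actually uses.

```python
import re
from difflib import SequenceMatcher

# Tashkeel (U+064B-U+0652), superscript alef (U+0670), and tatweel (U+0640).
_DIACRITICS = re.compile("[\u064B-\u0652\u0670\u0640]")

# Assumed folding table: map common spelling variants onto one letter.
_FOLD = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0629": "\u0647",  # taa-marbuta           -> haa
    "\u0649": "\u064A",  # alef-maksura          -> yaa
})


def normalize(name: str) -> str:
    """Remove diacritics, fold letter variants, and collapse whitespace."""
    stripped = _DIACRITICS.sub("", name)
    return " ".join(stripped.translate(_FOLD).split())


def sketch_similarity(a: str, b: str, threshold: int = 85):
    """Return (passed, score) with score on a 0-100 scale.

    The default threshold of 85 is an assumption made for this sketch.
    """
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    score = round(ratio * 100)
    return score >= threshold, score


print(sketch_similarity("أحمد حسن", "احمد حسن"))   # hamza variant
print(sketch_similarity("مروة فرج", "مروه فرج"))    # taa-marbuta vs haa
```

Because normalization runs before scoring, spelling variants of the same name collapse to identical strings, while genuinely different names are separated only by the threshold. That is why the threshold deserves as much attention as the normalizer.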

4. Configurable: threshold + engine

from arabnamer import Transliterator

t = Transliterator(engine="model", threshold=90)   # stricter pass bar
t.translit("Mohammed Ali", reference="محمد علي")

# Rule-based engine (deterministic, dict-first)
t_rules = Transliterator(engine="rules")
t_rules.translit("Mohammed Ali")

# Hybrid: dict -> model -> rules fallback
t_hybrid = Transliterator(engine="hybrid")
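
The hybrid order (dict, then model, then rules) can be pictured as a simple fallback chain. The sketch below is illustrative only: `hybrid_translit` and its three arguments are hypothetical, and the assumptions that lookup happens per whitespace-separated token and that the model signals "no answer" with `None` are mine, not documented behaviour of the library.

```python
from typing import Callable, Mapping, Optional


def hybrid_translit(
    name: str,
    dictionary: Mapping[str, str],
    model: Callable[[str], Optional[str]],
    rules: Callable[[str], str],
) -> str:
    """Transliterate token by token, trying the cheapest source first."""
    out = []
    for token in name.lower().split():
        arabic = dictionary.get(token)   # 1. exact dictionary hit
        if arabic is None:
            arabic = model(token)        # 2. learned model; None means no answer
        if arabic is None:
            arabic = rules(token)        # 3. deterministic rules always answer
        out.append(arabic)
    return " ".join(out)


# Toy stand-ins that show which stage answered each token.
toy_dict = {"ali": "DICT"}
toy_model = lambda tok: "MODEL" if tok == "mohammed" else None
toy_rules = lambda tok: "RULES"

print(hybrid_translit("Mohammed Ali Xyz", toy_dict, toy_model, toy_rules))
# MODEL DICT RULES
```

Ordering the stages this way keeps known names fast and reproducible, and reserves the model for names the dictionary has never seen.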

How accurate is it?

Benchmarked on 25 generic Arab-name pairs (common first + last combinations covering compound articles, hamza variants, feminine endings):

Metric	Score
Average lenient similarity	98.0
Pass rate (≥ 70)	25 / 25
Pass rate (≥ 90)	23 / 25
Exact match (= 100)	20 / 25
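
For readers who want to compute the same four figures on their own name pairs, the arithmetic is simple. The sketch below assumes you already have one 0-100 similarity score per pair. `summarize` is a hypothetical helper, and the sample scores are made up for illustration; they are not the benchmark's data.

```python
from statistics import mean


def summarize(scores, bars=(70, 90)):
    """Average score, pass counts at each bar, and exact-match count."""
    report = {"average": round(mean(scores), 1)}
    for bar in bars:
        report[f"pass>={bar}"] = sum(s >= bar for s in scores)
    report["exact"] = sum(s == 100 for s in scores)
    return report


# Illustrative scores only, not taken from benchmarks/REPORT.md.
print(summarize([100, 100, 90, 70]))
```

Reporting several bars side by side, as the table does, shows how sensitive the pass rate is to the threshold you choose.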

Model: XGBoost, 386 boosting rounds × 335 output classes, 34 input features per character (char IDs + position + phonetic class + bigram/trigram IDs). See benchmarks/REPORT.md for the full breakdown.

What's inside this repo

arabnamer/
├── src/arabnamer/           # pip-installable library
│   ├── core/                # Transliterator orchestrator + Result
│   ├── prediction/          # XGBoost model loader + featurize
│   ├── rules/               # deterministic rule-based walker
│   ├── scoring/             # fuzzy match + Arabic normalizer
│   └── utils/               # tokenizer + dict lookup
│
├── model/                   # gzipped model + labels (ships in wheel)
├── dataset/                 # 22,798 EN -> AR name pairs (dict_FINAL.json)
├── training/                # scripts to retrain from scratch
├── tests/                   # smoke + benchmark eval
├── benchmarks/              # eval results + REPORT.md
└── docs/                    # data sources + architecture + API details

Data sources

Training data comes from three open sources plus a manual audit pass. Full provenance in docs/data_sources.md. In summary:

Source	License	Contribution
JRC-Names (European Commission, multilingual name gazetteer)	EU open-data	primary EN/AR name pairs
Google Translate (via `deep_translator`)	generated text	fill-in for names absent from JRC
Claude (Anthropic) LLM fill	generated text	supplementary fill for names specific to the Doha Institute thesis
Manual audit + rule-based cleanup	—	phonetic-compatibility filter, hamza repair, outlier removal

Model and dataset are released under CC-BY-4.0 (attribution required). Library code is MIT.

Reproducing the model

git clone https://github.com/sayedyousef/arabnamer
cd arabnamer
pip install -e ".[dev]"
cd training
python train_xgboost.py      # ~5 min on CPU, produces ~285 MB UBJ
python prune_trees.py        # ~1 min, saves pruned variants
python find_min_k.py         # ~30 sec, finds the smallest-identical model

XGBoost with multi-threaded histogram training is approximately (not bit-perfectly) deterministic. See training/README.md for the full note.

Why this library exists

Arabic name handling is a real gap in open NLP tooling. Most Arabic libraries (PyArabic, arabic-reshaper, AraBERT, CAMeL Tools) focus on text — rendering, tokenisation, stemming. Proper-noun transliteration and fuzzy-matching of name variants (أحمد vs احمد vs Ahmad) is underserved.

arabnamer is the first open library specifically for:

English → Arabic name transliteration at the word level
Arabic name similarity that's insensitive to hamza, tashkeel, taa-marbuta, and alef-maksura variants
Offline, pip-installable — no API keys, no network, no LLM dependency

Typical use cases:

KYC & sanctions screening — match Arabic names against English watchlists
Library & archive cataloguing — normalize author names across MENA languages
Search relevance — expand queries to cover Arabic-script variants
Data integration — reconcile person entities across EN/AR CRMs
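
As a rough illustration of the screening use case, the sketch below transliterates each English watchlist entry to Arabic and compares it with an Arabic name. `screen`, `to_arabic`, and `compare` are hypothetical names. The two callables are injected so the example runs without the package installed; their shapes are assumed to mirror `translit(...).arabic` and `similarity(a, b)` returning `(passed, score)` as shown in the Quick start.

```python
from typing import Callable, Iterable, List, Tuple


def screen(
    arabic_name: str,
    english_watchlist: Iterable[str],
    to_arabic: Callable[[str], str],
    compare: Callable[[str, str], Tuple[bool, float]],
) -> List[Tuple[str, float]]:
    """Return watchlist entries that match, best score first."""
    hits = []
    for entry in english_watchlist:
        matched, score = compare(arabic_name, to_arabic(entry))
        if matched:
            hits.append((entry, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Stub callables so the sketch is runnable on its own.
stub_translit = {"Ahmad Hassan": "A", "Omar Said": "B"}.get
stub_compare = lambda a, b: (a == b, 100 if a == b else 0)

print(screen("A", ["Omar Said", "Ahmad Hassan"], stub_translit, stub_compare))
# [('Ahmad Hassan', 100)]
```

With arabnamer installed, the stubs would presumably be replaced by `lambda s: translit(s).arabic` and `similarity`. In a real screening pipeline, a match should be treated as a candidate for human review, not a final decision.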

Citation

If you use arabnamer in a paper, product, or research tool, please cite:

@software{yousef_arabnamer_2026,
  author  = {Yousef, Elsayed},
  title   = {arabnamer: Arabic name transliteration and similarity},
  year    = {2026},
  version = {0.1.0},
  url     = {https://github.com/sayedyousef/arabnamer},
}

A CITATION.cff file is included so GitHub's "Cite this repository" button works.

Commercial support

For production deployments, custom models trained on your domain corpus, on-premise integration, or paid support:

elsayed.yousef@gmail.com

License

Source code (src/arabnamer/, training/): MIT License
Dataset, model weights, test data: CC-BY-4.0 (attribution required)

Acknowledgements

JRC-Names — European Commission Joint Research Centre name gazetteer (primary training data)
XGBoost — gradient-boosted tree library
RapidFuzz — fast fuzzy string matching
The thesis work this library was extracted from (Doha Institute for Graduate Studies)

Release history

0.1.3 (this version): Apr 20, 2026
0.1.2: Apr 20, 2026
0.1.1: Apr 20, 2026
0.1.0: Apr 20, 2026

Download files

Source Distribution

arabnamer-0.1.3.tar.gz (40.0 MB)

Uploaded Apr 20, 2026 Source

Built Distribution

arabnamer-0.1.3-py3-none-any.whl (40.0 MB)

Uploaded Apr 20, 2026 Python 3

File details

Details for the file arabnamer-0.1.3.tar.gz.

File metadata

Download URL: arabnamer-0.1.3.tar.gz
Upload date: Apr 20, 2026
Size: 40.0 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for arabnamer-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`726124c5328ac6549f1060244c772ca42f652b145beb616c94955a04392a9130`
MD5	`a521581aa4ec75d960ce21ca2dcdec88`
BLAKE2b-256	`9550fb7f63eb78451ea18c876b866edaf53d9a9a048a598d4b916167755af428`

File details

Details for the file arabnamer-0.1.3-py3-none-any.whl.

File metadata

Download URL: arabnamer-0.1.3-py3-none-any.whl
Upload date: Apr 20, 2026
Size: 40.0 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for arabnamer-0.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`78d4921a47707420780dcdd24a256cd4f9cdf0fb49b4b8df22b612db14c8bbdc`
MD5	`2a94dd7e019b758325797b46c1baa417`
BLAKE2b-256	`f927597d5aea4856f01cf3242ba770ee51e65e7193a4c90800b84c858f1463df`
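
When a wheel is carried into an air-gapped environment, it can be checked against the published SHA256 digest before installation. The following is a minimal sketch using only the standard library; `sha256_of` and `matches_published` are hypothetical helper names, and the file path is whatever location you copied the wheel to.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large wheels are never loaded whole."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex):
    """Compare the file's digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Pass the SHA256 value from the table above as `expected_hex`, and install only if the function returns `True`.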
