Offline Arabic name transliteration and similarity — no LLM, no API calls, no network. Convert English names to Arabic and match Arabic name variants (hamza, tashkeel, taa-marbuta) 100% on your own machine. Built for KYC, compliance, and any workflow where names must never leave your infrastructure.
arabnamer — Arabic name transliteration & similarity
from arabnamer import translit, similarity
translit("Mohammed Ali").arabic # → 'محمد علي'
translit("Ayman El Desouky").arabic # → 'أيمن الدسوقي'
similarity("أحمد حسن", "احمد حسن") # → (True, 100)
Why offline matters
Names are personal data. Shipping them to Google Translate, OpenAI, Claude, or any cloud API means exposing PII to a third party — a hard compliance problem for finance, legal, healthcare, government, and MENA-region institutions bound by data-residency laws.
arabnamer solves both major Arabic-name problems on your own machine:
- Transliteration: Mohammed Ali → محمد علي (the tricky ones: hamza variants, compound articles like El/Al/Abd, silent vowels, dialect spellings)
- Matching: أحمد حسن ≡ احمد حسن ≡ أحمد حسن (scoring insensitive to hamza, tashkeel, taa-marbuta, and alef-maksura variants)
No model server, no internet, no API key, no request logs. Install once via pip,
run anywhere — including air-gapped environments. The 38 MB pruned XGBoost model and
22,798-pair dictionary are bundled inside the wheel.
| | Cloud APIs (Google, OpenAI, Claude) | arabnamer |
|---|---|---|
| Network required | ✅ yes | ❌ no — 100% offline |
| Names leave your infrastructure | ✅ yes | ❌ no |
| Per-call cost | metered | zero |
| Works in air-gapped / on-prem | ❌ no | ✅ yes |
| Model audit / replacement | ❌ opaque | ✅ open weights + retrainable |
| Accuracy (25-name MENA benchmark) | varies by model | 98.4 avg, 24/25 pass ≥ 90 |
Built for: KYC / sanctions screening, compliance-gated entity resolution, library & archive cataloguing, Arabic NLP preprocessing, on-premise search relevance — any workflow where names must never leave your infrastructure.
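One way to sanity-check the offline claim in your own environment is to block socket creation while the library runs. The sketch below is not part of arabnamer; `no_network` is a hypothetical helper name, and it is a coarse guard only (it does not intercept DNS lookups or native code that opens sockets without going through Python's `socket` module).

```python
import socket
from contextlib import contextmanager


@contextmanager
def no_network():
    """Hypothetical helper: make every new Python socket raise while active."""
    original = socket.socket

    def blocked(*args, **kwargs):
        raise RuntimeError("network access attempted while offline guard is active")

    socket.socket = blocked  # replaces the class used by socket.create_connection too
    try:
        yield
    finally:
        socket.socket = original  # always restore the real class


if __name__ == "__main__":
    with no_network():
        try:
            socket.socket()  # stand-in for any code path that opens a connection
        except RuntimeError as exc:
            print("blocked:", exc)
```

With arabnamer installed, wrapping a call such as `translit("Mohammed Ali")` in `with no_network():` should complete without raising if no Python-level socket is opened.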
Install
pip install arabnamer
Works offline after install — the model is bundled (gzipped, ~38 MB). No external services, no API keys.
Quick start
1. Transliterate English → Arabic
from arabnamer import translit, translit_batch
result = translit("Mohammed Ali")
print(result.arabic) # 'محمد علي'
print(result.score) # 0.0 (no reference supplied)
print(result.engine) # 'xgboost'
# batch
results = translit_batch(["Ahmad Hassan", "Marwa Farag", "Ayman El Desouky"])
for r in results:
    print(f"{r.input:<25} -> {r.arabic}")
2. Score with a reference (lenient, normalized)
from arabnamer import translit
r = translit("Adham Saouli", reference="أدهم ساولي")
print(r.score) # 100.0
print(r.accepted) # True (>= default threshold 85)
3. Arabic ↔ Arabic similarity (tashkeel / hamza / taa-marbuta insensitive)
from arabnamer import similarity
similarity("أحمد حسن", "احمد حسن") # (True, 100) hamza variant
similarity("مروة فرج", "مروه فرج") # (True, 100) taa-marbuta vs haa
similarity("محمد", "أحمد", threshold=75) # (True, 75) — different names
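To make "insensitive to hamza, tashkeel, and taa-marbuta" concrete, here is a minimal sketch of one common way to get that behaviour: fold the variant letters to a canonical form, strip diacritics, then compare. This is an assumption about the general technique, not arabnamer's actual implementation; `normalize_name` and `sketch_similarity` are hypothetical names, and the scorer is Python's standard `difflib.SequenceMatcher`, chosen only for illustration.

```python
import re
import unicodedata
from difflib import SequenceMatcher

# Tashkeel U+064B..U+0652, superscript alef U+0670, and tatweel U+0640.
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")

# Letter folds (assumed, not taken from arabnamer's source).
_FOLD = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0629": "\u0647",  # taa marbuta -> haa
    "\u0649": "\u064A",  # alef maksura -> yaa
})


def normalize_name(text: str) -> str:
    """Compose the text, drop diacritics, fold letter variants, collapse spaces."""
    text = unicodedata.normalize("NFC", text)
    text = _DIACRITICS.sub("", text)
    text = text.translate(_FOLD)
    return " ".join(text.split())


def sketch_similarity(a: str, b: str, threshold: int = 85):
    """Return (accepted, score) with score on a 0-100 scale after normalization."""
    ratio = SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()
    score = round(100 * ratio)
    return score >= threshold, score


if __name__ == "__main__":
    print(sketch_similarity("أحمد حسن", "احمد حسن"))
    print(sketch_similarity("مروة فرج", "مروه فرج"))
```

Folding before scoring means the spelling variants compare as equal strings, so only genuine letter differences lower the score.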
4. Configurable: threshold + engine
from arabnamer import Transliterator
t = Transliterator(engine="model", threshold=90) # stricter pass bar
t.translit("Mohammed Ali", reference="محمد علي")
# Rule-based engine (deterministic, dict-first)
t_rules = Transliterator(engine="rules")
t_rules.translit("Mohammed Ali")
# Hybrid: dict -> model -> rules fallback
t_hybrid = Transliterator(engine="hybrid")
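The hybrid order (dict, then model, then rules) is a fallback chain: each stage is tried in turn and the first one that produces an answer wins. The sketch below illustrates that pattern only; `first_answer` and the three toy engines are hypothetical stand-ins and say nothing about how arabnamer wires its engines internally.

```python
from typing import Callable, Optional, Sequence, Tuple

# An engine takes a Latin-script name and returns Arabic text, or None if it has no answer.
Engine = Callable[[str], Optional[str]]


def first_answer(name: str, engines: Sequence[Tuple[str, Engine]]) -> Tuple[Optional[str], str]:
    """Try each (label, engine) in order and return the first non-empty result."""
    for label, engine in engines:
        result = engine(name)
        if result:
            return result, label
    return None, "none"


# Toy stand-ins (hypothetical data, for illustration only).
TOY_DICT = {"ali": "علي"}
TOY_LETTERS = {"a": "ا", "l": "ل", "i": "ي", "m": "م"}


def dict_engine(name: str) -> Optional[str]:
    return TOY_DICT.get(name.lower())


def model_engine(name: str) -> Optional[str]:
    return None  # pretend the model abstains


def rules_engine(name: str) -> Optional[str]:
    # Letter-by-letter mapping; returns None when no letter is known.
    return "".join(TOY_LETTERS.get(c, "") for c in name.lower()) or None


CHAIN = [("dict", dict_engine), ("model", model_engine), ("rules", rules_engine)]

if __name__ == "__main__":
    print(first_answer("Ali", CHAIN))  # answered by the dictionary stage
    print(first_answer("Mim", CHAIN))  # falls through to the rules stage
```

Returning the label alongside the result keeps it visible which stage answered, which is useful when auditing a match.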
How accurate is it?
Benchmarked on 25 MENA-region names (authors, journalists, public figures):
| Metric | Score |
|---|---|
| Average lenient similarity | 98.4 |
| Pass rate (≥ 70) | 25 / 25 |
| Pass rate (≥ 90) | 24 / 25 |
| Exact match (= 100) | 21 / 25 |
Model: XGBoost, 386 boosting rounds × 335 output classes, 34 input features per character
(char IDs + position + phonetic class + bigram/trigram IDs). See benchmarks/REPORT.md
for the full breakdown.
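The four metrics in the table are simple aggregates over per-name similarity scores. As a sketch of how such a summary could be computed from your own evaluation run (the `summarize` function and the toy scores are hypothetical and are not the benchmark's data):

```python
from statistics import mean


def summarize(scores):
    """Aggregate 0-100 similarity scores into the four metrics used in the table."""
    return {
        "average": round(mean(scores), 1),
        "pass_70": sum(s >= 70 for s in scores),
        "pass_90": sum(s >= 90 for s in scores),
        "exact": sum(s == 100 for s in scores),
        "total": len(scores),
    }


if __name__ == "__main__":
    toy_scores = [100, 100, 92, 85, 60]  # made-up values, for illustration only
    print(summarize(toy_scores))
```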
What's inside this repo
arabnamer/
├── src/arabnamer/ # pip-installable library
│ ├── core/ # Transliterator orchestrator + Result
│ ├── prediction/ # XGBoost model loader + featurize
│ ├── rules/ # deterministic rule-based walker
│ ├── scoring/ # fuzzy match + Arabic normalizer
│ └── utils/ # tokenizer + dict lookup
│
├── model/ # gzipped model + labels (ships in wheel)
├── dataset/ # 22,798 EN -> AR name pairs (dict_FINAL.json)
├── training/ # scripts to retrain from scratch
├── tests/ # smoke + benchmark eval
├── benchmarks/ # eval results + REPORT.md
└── docs/ # data sources + architecture + API details
Data sources
Training data comes from three open sources plus a manual audit pass. Full provenance in
docs/data_sources.md. In summary:
| Source | License | Contribution |
|---|---|---|
| JRC-Names (European Commission, multilingual name gazetteer) | EU open-data | primary EN/AR name pairs |
| Google Translate (via deep_translator) | generated text | fill-in for names absent from JRC |
| Claude (Anthropic) LLM fill | generated text | supplementary fill for DI-thesis-specific names |
| Manual audit + rule-based cleanup | — | phonetic-compatibility filter, hamza repair, outlier removal |
Model and dataset are released under CC-BY-4.0 (attribution required). Library code is MIT.
Reproducing the model
git clone https://github.com/sayedyousef/arabnamer
cd arabnamer
pip install -e ".[dev]"
cd training
python train_xgboost.py # ~5 min on CPU, produces ~285 MB UBJ
python prune_trees.py # ~1 min, saves pruned variants
python find_min_k.py # ~30 sec, finds the smallest-identical model
XGBoost with multi-threaded histogram training is approximately (not bit-perfectly) deterministic.
See training/README.md for the full note.
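Because training is only approximately deterministic, a retrained model file will not necessarily be byte-identical to the shipped one. If you want to check whether two artifacts are byte-identical, a hash comparison such as the sketch below is enough; `sha256_of` and `same_artifact` are hypothetical helper names, and the file paths you pass are up to you. When the hashes differ, comparing predictions on the benchmark names is the more meaningful test.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large model files need not fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_artifact(path_a, path_b):
    """True only when both files have identical bytes."""
    return sha256_of(path_a) == sha256_of(path_b)
```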
Why this library exists
Arabic name handling is a real gap in open NLP tooling. Most Arabic libraries
(PyArabic, arabic-reshaper, AraBERT, CAMeL Tools) focus on text — rendering,
tokenisation, stemming. Proper-noun transliteration and fuzzy-matching of name variants
(أحمد vs احمد vs Ahmad) is underserved.
arabnamer is the first open library specifically for:
- English → Arabic name transliteration at the word level
- Arabic name similarity that's insensitive to hamza, tashkeel, taa-marbuta, and alef-maksura variants
- Offline, pip-installable — no API keys, no network, no LLM dependency
Typical use cases:
- KYC & sanctions screening — match Arabic names against English watchlists
- Library & archive cataloguing — normalize author names across MENA languages
- Search relevance — expand queries to cover Arabic-script variants
- Data integration — reconcile person entities across EN/AR CRMs
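As a rough illustration of the screening use case, the sketch below checks one candidate name against a list and returns the accepted entries, best score first. `screen` is a hypothetical helper, not part of arabnamer. It takes the matcher as a parameter and assumes only the `(accepted, score)` return shape shown for `similarity` in the quick start; the `exact_only` matcher exists just to keep the example self-contained.

```python
from typing import Callable, Iterable, List, Tuple

# A matcher compares two names and returns (accepted, score).
Matcher = Callable[[str, str], Tuple[bool, int]]


def screen(candidate: str, watchlist: Iterable[str], match: Matcher) -> List[Tuple[str, int]]:
    """Return (entry, score) for every accepted watchlist entry, highest score first."""
    hits = []
    for entry in watchlist:
        accepted, score = match(candidate, entry)
        if accepted:
            hits.append((entry, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


def exact_only(a: str, b: str) -> Tuple[bool, int]:
    """Toy matcher: accepts only identical strings."""
    return (True, 100) if a == b else (False, 0)


if __name__ == "__main__":
    print(screen("محمد علي", ["أحمد حسن", "محمد علي"], exact_only))
```

With arabnamer installed, passing `similarity` (or `functools.partial(similarity, threshold=90)` for a stricter bar) as `match` should work, given the return shape documented above.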
Citation
If you use arabnamer in a paper, product, or research tool, please cite:
@software{yousef_arabnamer_2026,
author = {Yousef, Elsayed},
title = {arabnamer: Arabic name transliteration and similarity},
year = {2026},
version = {0.1.0},
url = {https://github.com/sayedyousef/arabnamer},
}
A CITATION.cff file is included so GitHub's "Cite this repository" button works.
Commercial support
For production deployments, custom models trained on your domain corpus, on-premise integration, or paid support:
License
- Source code (src/arabnamer/, training/): MIT License
- Dataset, model weights, test data: CC-BY-4.0 (attribution required)
Acknowledgements