
Arabic Diacritizer


Automatic Arabic diacritization (tashkeel) using a BiLSTM + Bahdanau attention model with a sentence cache — achieves 6.6% DER, beating GPT-5.3's 20.9%.

Features

  • 6.6% Diacritic Error Rate — outperforms GPT-5.3 (20.9% DER) on the Tashkeela benchmark
  • Sentence cache + model hybrid — cached sentences return instantly, BiLSTM handles the rest
  • 15-class diacritic prediction — fatha, damma, kasra, sukun, shadda, tanween, and shadda compounds
  • CLI + Python API — use from the terminal or import as a library
  • Lightweight — 18MB model + 29MB cache, runs on CPU (no GPU required)
  • Pipe-friendly — reads from arguments, files, or stdin

Quick Start

# Install from PyPI
pip install arabic-diacritizer

# Or install from source
git clone https://github.com/Z-Mahmood/arabic-diacritizer-public-release.git
cd arabic-diacritizer-public-release
pip install -e .

# Diacritize text
diacritize "بسم الله الرحمن الرحيم"

# Pipe from stdin
echo "محمد رسول الله" | diacritize

# Diacritize a file
diacritize --file input.txt --output output.txt

# Model-only mode (skip cache)
diacritize --no-cache "هذا نص عربي"

Python API

from diacritize import Diacritizer

d = Diacritizer.from_pretrained()
print(d.diacritize("بسم الله الرحمن الرحيم"))
# بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ

Example Output

$ diacritize "الحمد لله رب العالمين"
الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ

$ diacritize "هذا كتاب مفيد"
هَٰذَا كِتَابٌ مُفِيدٌ

Architecture

graph LR
    A[Arabic Text] --> B{Sentence Cache}
    B -->|Hit| C[Cached Result]
    B -->|Miss| D[Character Tokenizer]
    D --> E["Embedding(128)"]
    E --> F["BiLSTM(256×2, 3 layers)"]
    F --> G[Bahdanau Attention]
    G --> H["Linear(15 classes)"]
    H --> I[Apply Diacritics]
    C --> J[Output]
    I --> J

Inference flow:

  1. Input text is stripped of existing diacritics and normalized (NFKC)
  2. Sentence cache checks for an exact match — if found, returns instantly
  3. On cache miss, the BiLSTM processes the full sentence character-by-character
  4. Bahdanau attention lets each position attend to the full sequence context
  5. The classifier predicts one of 15 diacritic classes per character
  6. Predicted diacritics are re-applied to the stripped text
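The flow above can be sketched in a few lines of Python. This is a minimal illustration, not the packaged code: the `HARAKAT` set, `strip_diacritics`, plain-dict cache, and stubbed model stand in for the real `unicode_utils`, `cache`, and model modules.

```python
import unicodedata

# Arabic harakat (U+064B–U+0652): tanween, short vowels, shadda, sukun
HARAKAT = {chr(c) for c in range(0x064B, 0x0653)}

def strip_diacritics(text: str) -> str:
    """Remove existing harakat so input matches cache/model keys."""
    return "".join(ch for ch in text if ch not in HARAKAT)

def diacritize(text: str, cache: dict, model=None) -> str:
    # Steps 1–2: normalize (NFKC), strip marks, try an exact cache hit
    key = strip_diacritics(unicodedata.normalize("NFKC", text))
    if key in cache:
        return cache[key]
    # Steps 3–6: on a miss, fall back to the model (stubbed here)
    return model(key) if model is not None else key

cache = {"بسم الله": "بِسْمِ اللَّهِ"}
print(diacritize("بسم الله", cache))  # cache hit
```

Note that input is stripped before lookup, so already-diacritized text hits the same cache key as bare text.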

Project Structure

arabic-diacritizer/
├── src/diacritize/
│   ├── __init__.py         # Exports Diacritizer, BiLSTMDiacritizer
│   ├── __main__.py         # python -m diacritize support
│   ├── cli.py              # Click CLI (diacritize command)
│   ├── pipeline.py         # Cache + model hybrid pipeline
│   ├── config.py           # Unicode constants, label map, model defaults
│   ├── unicode_utils.py    # Diacritic stripping, extraction, application
│   ├── cache.py            # Sentence cache with multi-variant support
│   ├── tokenizer.py        # Character-level tokenizer (53 tokens)
│   ├── evaluate.py         # DER, WER, per-diacritic accuracy metrics
│   ├── assets/
│   │   ├── bilstm_best.pt      # Trained weights (18MB)
│   │   └── word_cache.json.gz   # Sentence cache (29MB)
│   └── baseline/
│       └── model.py        # BiLSTM + Bahdanau Attention (~15M params)
└── tests/                   # Unit tests for all modules

How It Works

BiLSTM + Attention

The model reads Arabic text as a sequence of characters. A 3-layer bidirectional LSTM processes the sequence in both directions, producing context-aware representations. Bahdanau (additive) attention then lets each position attend to the full sequence — critical for Arabic where diacritics depend on grammatical context spanning the entire sentence.
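A PyTorch sketch of this architecture, following the sizes in the diagram (128-dim embeddings, 3-layer BiLSTM with 256 hidden units per direction, additive attention, 15-way classifier). The real `baseline/model.py` may differ in details such as dropout and attention parameterization.

```python
import torch
import torch.nn as nn

class BiLSTMDiacritizer(nn.Module):
    """Char embedding -> 3-layer BiLSTM -> Bahdanau attention -> classifier."""
    def __init__(self, vocab_size=53, emb=128, hidden=256, n_classes=15):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hidden, num_layers=3,
                            bidirectional=True, batch_first=True)
        # Additive attention: score(i, j) = v^T tanh(W_q h_i + W_k h_j)
        self.W_q = nn.Linear(2 * hidden, hidden)
        self.W_k = nn.Linear(2 * hidden, hidden)
        self.v = nn.Linear(hidden, 1)
        self.out = nn.Linear(4 * hidden, n_classes)  # [h_i ; context_i]

    def forward(self, char_ids):                      # (B, T) int64
        h, _ = self.lstm(self.embed(char_ids))        # (B, T, 2H)
        # Every position attends over the whole sequence
        scores = self.v(torch.tanh(self.W_q(h).unsqueeze(2)
                                   + self.W_k(h).unsqueeze(1))).squeeze(-1)
        attn = torch.softmax(scores, dim=-1)          # (B, T, T)
        context = attn @ h                            # (B, T, 2H)
        return self.out(torch.cat([h, context], dim=-1))  # (B, T, 15)

model = BiLSTMDiacritizer()
logits = model(torch.randint(0, 53, (1, 10)))
print(logits.shape)  # torch.Size([1, 10, 15])
```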

The final linear layer predicts one of 15 diacritic classes per character:

  • 0: No diacritic
  • 1–8: Individual marks (fatha, damma, kasra, sukun, shadda, fathatan, dammatan, kasratan)
  • 9–14: Shadda compounds (shadda + vowel, predicted as a single class)
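One way to encode this label map and re-apply predictions (illustrative only; the authoritative IDs and application logic live in `config.py` and `unicode_utils.py`):

```python
# Hypothetical ID-to-diacritic map matching the class layout above
FATHA, DAMMA, KASRA = "\u064E", "\u064F", "\u0650"
SUKUN, SHADDA = "\u0652", "\u0651"
FATHATAN, DAMMATAN, KASRATAN = "\u064B", "\u064C", "\u064D"

ID2DIAC = {
    0: "",                               # no diacritic
    1: FATHA, 2: DAMMA, 3: KASRA, 4: SUKUN,
    5: SHADDA, 6: FATHATAN, 7: DAMMATAN, 8: KASRATAN,
    9: SHADDA + FATHA, 10: SHADDA + DAMMA, 11: SHADDA + KASRA,
    12: SHADDA + FATHATAN, 13: SHADDA + DAMMATAN, 14: SHADDA + KASRATAN,
}

def apply_diacritics(base: str, labels: list[int]) -> str:
    """Append one predicted diacritic (class id) after each base character."""
    return "".join(ch + ID2DIAC[i] for ch, i in zip(base, labels))

print(apply_diacritics("بسم", [3, 4, 3]))  # بِسْمِ
```

Predicting shadda + vowel as a single class keeps the output one label per character, which is what makes the character-level mapping clean.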

Sentence Cache

For common phrases (Quranic verses, frequent expressions), a sentence-level cache provides instant diacritization without model inference. The cache supports multi-variant lookups — handling different Quran editions (Hafs, Warsh) and orthographic variations under normalized keys.
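A sketch of multi-variant lookup under normalized keys. The class name, storage layout, and `variant` parameter are illustrative; the real `cache.py` may organize its variants differently.

```python
import unicodedata

HARAKAT = {chr(c) for c in range(0x064B, 0x0653)}  # Arabic combining marks

def cache_key(text: str) -> str:
    """Normalize (NFKC) and strip harakat so all variants share one key."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in text if ch not in HARAKAT)

class SentenceCache:
    def __init__(self):
        self.entries: dict[str, list[str]] = {}

    def add(self, diacritized: str):
        # Variants of the same bare sentence accumulate under one key
        self.entries.setdefault(cache_key(diacritized), []).append(diacritized)

    def lookup(self, text: str, variant: int = 0):
        variants = self.entries.get(cache_key(text))
        return variants[variant] if variants else None

cache = SentenceCache()
cache.add("بِسْمِ اللَّهِ")
print(cache.lookup("بسم الله"))  # first stored variant
```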

Why This Beats GPT

GPT-5.3 achieves 20.9% DER on Arabic diacritization — it's a general-purpose model that treats diacritization as a text transformation task. Our BiLSTM is purpose-built: character-level tokenization preserves the one-to-one mapping between base characters and diacritics that word-piece tokenizers destroy. The sentence cache handles the long tail of common phrases where even specialized models make mistakes.

Benchmarks

| System | DER | WER |
|---|---|---|
| This model (BiLSTM + cache) | 6.6% | 18.6% |
| GPT-5.3 | 20.9% | 39.5% |
| CATT (2024 SOTA) | 3.4% | — |
| Mishkal (rule-based) | 15.2% | — |

Evaluated on the Tashkeela test set. CATT uses a much larger transformer architecture with pre-training on 10x more data.
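DER is the fraction of characters whose predicted diacritic class differs from the reference. A minimal version for aligned label sequences (the packaged `evaluate.py` presumably also computes WER and per-diacritic accuracy):

```python
def der(ref_labels: list[int], hyp_labels: list[int]) -> float:
    """Diacritic Error Rate over aligned, equal-length label sequences."""
    if len(ref_labels) != len(hyp_labels):
        raise ValueError("sequences must be aligned")
    errors = sum(r != h for r, h in zip(ref_labels, hyp_labels))
    return errors / len(ref_labels)

# One wrong class out of four positions
print(der([1, 2, 3, 0], [1, 2, 4, 0]))  # 0.25
```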

Limitations

  • Modern Standard Arabic focus — trained on Tashkeela dataset (classical + MSA). Performance on dialectal Arabic (Egyptian, Gulf, Levantine) is untested.
  • Sentence-level context — the model processes one sentence at a time. Cross-sentence disambiguation (e.g., referencing a noun from a previous sentence) is not supported.
  • No case endings for ambiguous words — some Arabic words have genuinely ambiguous diacritization without full syntactic parsing. The model picks the most common form.

Running Tests

pip install -e ".[dev]"
pytest tests/ -v

Author

Zain Mahmood (LinkedIn | X/Twitter)

License

MIT
