
Arabic Diacritizer


Automatic Arabic diacritization (tashkeel) using a BiLSTM + Bahdanau attention model with a sentence cache — achieves 6.6% DER, beating GPT-5.3's 20.9%.

Features

  • 6.6% Diacritic Error Rate — outperforms GPT-5.3 (20.9% DER) on the Tashkeela benchmark
  • Sentence cache + model hybrid — cached sentences return instantly, BiLSTM handles the rest
  • 15-class diacritic prediction — fatha, damma, kasra, sukun, shadda, tanween, and shadda compounds
  • CLI + Python API — use from the terminal or import as a library
  • Lightweight — 18MB model + 29MB cache, runs on CPU (no GPU required)
  • Pipe-friendly — reads from arguments, files, or stdin

Quick Start

# Install from PyPI
pip install arabic-diacritizer

# Or install from source
git clone https://github.com/Z-Mahmood/arabic-diacritizer-public-release.git
cd arabic-diacritizer-public-release
pip install -e .

# Diacritize text
diacritize "بسم الله الرحمن الرحيم"

# Pipe from stdin
echo "محمد رسول الله" | diacritize

# Diacritize a file
diacritize --file input.txt --output output.txt

# Model-only mode (skip cache)
diacritize --no-cache "هذا نص عربي"

Python API

from diacritize import Diacritizer

d = Diacritizer.from_pretrained()
print(d.diacritize("بسم الله الرحمن الرحيم"))
# بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ

Example Output

$ diacritize "الحمد لله رب العالمين"
الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ

$ diacritize "هذا كتاب مفيد"
هَٰذَا كِتَابٌ مُفِيدٌ

Architecture

graph LR
    A[Arabic Text] --> B{Sentence Cache}
    B -->|Hit| C[Cached Result]
    B -->|Miss| D[Character Tokenizer]
    D --> E["Embedding(128)"]
    E --> F["BiLSTM(256×2, 3 layers)"]
    F --> G[Bahdanau Attention]
    G --> H["Linear(15 classes)"]
    H --> I[Apply Diacritics]
    C --> J[Output]
    I --> J

Inference flow:

  1. Input text is stripped of existing diacritics and normalized (NFKC)
  2. Sentence cache checks for an exact match — if found, returns instantly
  3. On cache miss, the BiLSTM processes the full sentence character-by-character
  4. Bahdanau attention lets each position attend to the full sequence context
  5. The classifier predicts one of 15 diacritic classes per character
  6. Predicted diacritics are re-applied to the stripped text
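The flow above can be sketched in a few lines of Python. This is a minimal illustration, not the packaged code: the `HARAKAT` set, `strip_diacritics`, plain-dict cache, and stubbed model stand in for the real `unicode_utils`, `cache`, and model modules.

```python
import unicodedata

# Arabic harakat (U+064B–U+0652): tanween, short vowels, shadda, sukun
HARAKAT = {chr(c) for c in range(0x064B, 0x0653)}

def strip_diacritics(text: str) -> str:
    """Remove existing harakat so input matches cache/model keys."""
    return "".join(ch for ch in text if ch not in HARAKAT)

def diacritize(text: str, cache: dict, model=None) -> str:
    # Steps 1–2: normalize (NFKC), strip marks, try an exact cache hit
    key = strip_diacritics(unicodedata.normalize("NFKC", text))
    if key in cache:
        return cache[key]
    # Steps 3–6: on a miss, fall back to the model (stubbed here)
    return model(key) if model is not None else key

cache = {"بسم الله": "بِسْمِ اللَّهِ"}
print(diacritize("بسم الله", cache))  # cache hit
```

Note that input is stripped before lookup, so already-diacritized text hits the same cache key as bare text.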

Project Structure

arabic-diacritizer/
├── src/diacritize/
│   ├── __init__.py         # Exports Diacritizer, BiLSTMDiacritizer
│   ├── __main__.py         # python -m diacritize support
│   ├── cli.py              # Click CLI (diacritize command)
│   ├── pipeline.py         # Cache + model hybrid pipeline
│   ├── config.py           # Unicode constants, label map, model defaults
│   ├── unicode_utils.py    # Diacritic stripping, extraction, application
│   ├── cache.py            # Sentence cache with multi-variant support
│   ├── tokenizer.py        # Character-level tokenizer (53 tokens)
│   ├── evaluate.py         # DER, WER, per-diacritic accuracy metrics
│   ├── assets/
│   │   ├── bilstm_best.pt      # Trained weights (18MB)
│   │   └── word_cache.json.gz   # Sentence cache (29MB)
│   └── baseline/
│       └── model.py        # BiLSTM + Bahdanau Attention (~15M params)
└── tests/                   # Unit tests for all modules

How It Works

BiLSTM + Attention

The model reads Arabic text as a sequence of characters. A 3-layer bidirectional LSTM processes the sequence in both directions, producing context-aware representations. Bahdanau (additive) attention then lets each position attend to the full sequence — critical for Arabic where diacritics depend on grammatical context spanning the entire sentence.
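A PyTorch sketch of this architecture, following the sizes in the diagram (128-dim embeddings, 3-layer BiLSTM with 256 hidden units per direction, additive attention, 15-way classifier). The real `baseline/model.py` may differ in details such as dropout and attention parameterization.

```python
import torch
import torch.nn as nn

class BiLSTMDiacritizer(nn.Module):
    """Char embedding -> 3-layer BiLSTM -> Bahdanau attention -> classifier."""
    def __init__(self, vocab_size=53, emb=128, hidden=256, n_classes=15):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hidden, num_layers=3,
                            bidirectional=True, batch_first=True)
        # Additive attention: score(i, j) = v^T tanh(W_q h_i + W_k h_j)
        self.W_q = nn.Linear(2 * hidden, hidden)
        self.W_k = nn.Linear(2 * hidden, hidden)
        self.v = nn.Linear(hidden, 1)
        self.out = nn.Linear(4 * hidden, n_classes)  # [h_i ; context_i]

    def forward(self, char_ids):                      # (B, T) int64
        h, _ = self.lstm(self.embed(char_ids))        # (B, T, 2H)
        # Every position attends over the whole sequence
        scores = self.v(torch.tanh(self.W_q(h).unsqueeze(2)
                                   + self.W_k(h).unsqueeze(1))).squeeze(-1)
        attn = torch.softmax(scores, dim=-1)          # (B, T, T)
        context = attn @ h                            # (B, T, 2H)
        return self.out(torch.cat([h, context], dim=-1))  # (B, T, 15)

model = BiLSTMDiacritizer()
logits = model(torch.randint(0, 53, (1, 10)))
print(logits.shape)  # torch.Size([1, 10, 15])
```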

The final linear layer predicts one of 15 diacritic classes per character:

  • 0: No diacritic
  • 1–8: Individual marks (fatha, damma, kasra, sukun, shadda, fathatan, dammatan, kasratan)
  • 9–14: Shadda compounds (shadda + vowel, predicted as a single class)
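One way to encode this label map and re-apply predictions (illustrative only; the authoritative IDs and application logic live in `config.py` and `unicode_utils.py`):

```python
# Hypothetical ID-to-diacritic map matching the class layout above
FATHA, DAMMA, KASRA = "\u064E", "\u064F", "\u0650"
SUKUN, SHADDA = "\u0652", "\u0651"
FATHATAN, DAMMATAN, KASRATAN = "\u064B", "\u064C", "\u064D"

ID2DIAC = {
    0: "",                               # no diacritic
    1: FATHA, 2: DAMMA, 3: KASRA, 4: SUKUN,
    5: SHADDA, 6: FATHATAN, 7: DAMMATAN, 8: KASRATAN,
    9: SHADDA + FATHA, 10: SHADDA + DAMMA, 11: SHADDA + KASRA,
    12: SHADDA + FATHATAN, 13: SHADDA + DAMMATAN, 14: SHADDA + KASRATAN,
}

def apply_diacritics(base: str, labels: list[int]) -> str:
    """Append one predicted diacritic (class id) after each base character."""
    return "".join(ch + ID2DIAC[i] for ch, i in zip(base, labels))

print(apply_diacritics("بسم", [3, 4, 3]))  # بِسْمِ
```

Predicting shadda + vowel as a single class keeps the output one label per character, which is what makes the character-level mapping clean.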

Sentence Cache

For common phrases (Quranic verses, frequent expressions), a sentence-level cache provides instant diacritization without model inference. The cache supports multi-variant lookups — handling different Quran editions (Hafs, Warsh) and orthographic variations under normalized keys.
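A sketch of multi-variant lookup under normalized keys. The class name, storage layout, and `variant` parameter are illustrative; the real `cache.py` may organize its variants differently.

```python
import unicodedata

HARAKAT = {chr(c) for c in range(0x064B, 0x0653)}  # Arabic combining marks

def cache_key(text: str) -> str:
    """Normalize (NFKC) and strip harakat so all variants share one key."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in text if ch not in HARAKAT)

class SentenceCache:
    def __init__(self):
        self.entries: dict[str, list[str]] = {}

    def add(self, diacritized: str):
        # Variants of the same bare sentence accumulate under one key
        self.entries.setdefault(cache_key(diacritized), []).append(diacritized)

    def lookup(self, text: str, variant: int = 0):
        variants = self.entries.get(cache_key(text))
        return variants[variant] if variants else None

cache = SentenceCache()
cache.add("بِسْمِ اللَّهِ")
print(cache.lookup("بسم الله"))  # first stored variant
```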

Why This Beats GPT

GPT-5.3 achieves 20.9% DER on Arabic diacritization — it's a general-purpose model that treats diacritization as a text transformation task. Our BiLSTM is purpose-built: character-level tokenization preserves the one-to-one mapping between base characters and diacritics that word-piece tokenizers destroy. The sentence cache handles the long tail of common phrases where even specialized models make mistakes.

Benchmarks

| System | DER | WER |
|---|---|---|
| This model (BiLSTM + cache) | 6.6% | 18.6% |
| GPT-5.3 | 20.9% | 39.5% |
| CATT (2024 SOTA) | 3.4% | — |
| Mishkal (rule-based) | 15.2% | — |

Evaluated on the Tashkeela test set. CATT uses a much larger transformer architecture with pre-training on 10x more data.
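DER is the fraction of characters whose predicted diacritic class differs from the reference. A minimal version for aligned label sequences (the packaged `evaluate.py` presumably also computes WER and per-diacritic accuracy):

```python
def der(ref_labels: list[int], hyp_labels: list[int]) -> float:
    """Diacritic Error Rate over aligned, equal-length label sequences."""
    if len(ref_labels) != len(hyp_labels):
        raise ValueError("sequences must be aligned")
    errors = sum(r != h for r, h in zip(ref_labels, hyp_labels))
    return errors / len(ref_labels)

# One wrong class out of four positions
print(der([1, 2, 3, 0], [1, 2, 4, 0]))  # 0.25
```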

Limitations

  • Modern Standard Arabic focus — trained on Tashkeela dataset (classical + MSA). Performance on dialectal Arabic (Egyptian, Gulf, Levantine) is untested.
  • Sentence-level context — the model processes one sentence at a time. Cross-sentence disambiguation (e.g., referencing a noun from a previous sentence) is not supported.
  • No case endings for ambiguous words — some Arabic words have genuinely ambiguous diacritization without full syntactic parsing. The model picks the most common form.

Running Tests

pip install -e ".[dev]"
pytest tests/ -v

Author

Zain Mahmood (LinkedIn | X/Twitter)

License

MIT
