Industrial-strength Deterministic Phonemizer for Quranic Arabic (Tajweed-aware)

📖 quran-phonemizer

quran-phonemizer is an industrial-strength, Tajweed-aware phonemization library designed specifically for Quranic Arabic.

It uses a Hybrid Architecture that combines a "Golden Source" database (82,000+ expert-verified words) with a rule-based fallback engine. Words found in the database are returned exactly as verified, which is what gives 100% accuracy on Quranic text, while the fallback engine gracefully handles Hadith, poetry, or imperfect input.

It is specifically optimized for training Neural TTS models (VITS, FastSpeech2) in the style of reciters like Mishary Al-Afasy.
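The hybrid lookup described above can be pictured with a minimal sketch. This is not the library's internal code: `phonemize_hybrid`, `golden_table`, and `toy_fallback` are hypothetical names invented for illustration, and a plain dict stands in for the real database.

```python
from typing import Callable, Dict, List


def phonemize_hybrid(
    words: List[str],
    golden: Dict[str, str],
    fallback: Callable[[str], str],
) -> List[str]:
    """Return one phoneme string per word, preferring the verified table."""
    out = []
    for word in words:
        if word in golden:
            # Verified entry: use it as-is.
            out.append(golden[word])
        else:
            # Unknown word: approximate with rules instead of failing.
            out.append(fallback(word))
    return out


# Toy stand-ins (assumptions, not real library data).
ALLAH = "\u0671\u0644\u0644\u0651\u064E\u0647\u064F"  # the word "Allah" with diacritics
golden_table = {ALLAH: "2aLLaahu"}


def toy_fallback(word: str) -> str:
    # Placeholder for a rule-based engine; it only tags the word.
    return f"<rule:{word}>"


result = phonemize_hybrid([ALLAH, "xyz"], golden_table, toy_fallback)
print(result)
```

The point of the design is that a miss in the verified table degrades to an approximation rather than an exception.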


🚀 Live Demo

Try the library instantly in your browser: 👉 Click here to open the Live App

Or run it locally:

pip install streamlit
streamlit run demo_app.py

⚡ Quick Start

from quran_phonemizer import QuranPhonemizer

# 1. Initialize Engine (Loads DB & FTS5 Index)
qp = QuranPhonemizer()

# 2. Phonemize Quranic Text (Database Mode)
# "Allah is the Eternal Refuge"
text = "ٱللَّهُ ٱلصَّمَدُ"
phonemes = qp.phonemize_text(text)

print(phonemes)
# Output: 2aLLaahu SSamadu_Q
# (Note: '2' = Glottal Stop, 'LL' = Heavy Lam, '_Q' = Qalqalah on Stop)

# 3. Get Atomic Tokens for ML Training (VITS format)
tokens = qp.tokenize_to_atomic(phonemes)
print(tokens)
# Output: ['2', 'a', 'L_H', 'aa', 'h', 'u', 'SP', 'S', 'S', 'a', 'm', 'a', 'd', 'u', 'QK']

🌟 Key Features

1. 🕌 High-Fidelity Tajweed

Captures nuances that standard Arabic G2P tools miss:

  • Heavy/Light Letters (Tafkhim/Tarqiq):
    • Distinguishes Heavy Lam (LL) in "Allah" vs Light Lam (ll).
    • Distinguishes Heavy Ra (R) vs Light Ra (r).
  • Madd (Elongation): Colon markers encode duration (: = 2, :: = 4, ::: = 6 counts).
  • Qalqalah (Echo): Automatically adds _Q when stopping on one of the five Qalqalah letters: Qaf (ق), Ta (ط), Ba (ب), Jim (ج), Dal (د).
  • Ghunnah (Nasal): Marks nasalization (ŋ) for Noon/Meem Shadda.
  • Idgham/Iqlab: Merges sounds across word boundaries (e.g. min ba'di -> mim ba'di).
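As a rough illustration of the Madd markers listed above, the sketch below reads the colon runs in a phoneme string and converts them to counts. The helper `madd_durations` is hypothetical, and the sample string is made up to show the idea; it is not actual library output, and the exact position of the markers in real output is an assumption.

```python
import re
from typing import List

# Marker-to-count table taken from the feature list: ':' = 2, '::' = 4, ':::' = 6.
MADD_COUNTS = {":": 2, "::": 4, ":::": 6}


def madd_durations(phonemes: str) -> List[int]:
    """Return the elongation count for each run of colons, in order."""
    return [MADD_COUNTS[m.group()] for m in re.finditer(r":{1,3}", phonemes)]


# Made-up sample string (assumption): one 2-count and one 6-count elongation.
sample = "qaa:la Daa:::lliin"
print(madd_durations(sample))
```

A duration list like this could feed a length feature during TTS training, if the model needs explicit timing rather than raw markers.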

2. 🧠 ML-Ready Tokenization

Includes a tokenizer that converts human-readable strings into Atomic Tokens for model training.

  • Separated Phonemes: in becomes ['i', 'n'].
  • Symbol Mapping: _Q becomes QK, LL becomes L_H.
  • Word Boundaries: Inserts SP tokens automatically.
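The three rules above can be sketched as a small greedy tokenizer. This is an illustrative reimplementation, not the code behind `tokenize_to_atomic`: the function `to_atomic` and the `MULTI` table are hypothetical, and treating `ii` and `uu` as single tokens is an assumption by analogy with `aa` in the Quick Start output.

```python
from typing import List

# Two-character symbols that map to one token. '_Q' and 'LL' follow the
# mapping listed above; 'aa' appears as a single token in the Quick Start
# output; 'ii' and 'uu' are assumed by analogy.
MULTI = {"_Q": "QK", "LL": "L_H", "aa": "aa", "ii": "ii", "uu": "uu"}


def to_atomic(phonemes: str) -> List[str]:
    """Split a phoneme string into atomic tokens, longest match first."""
    tokens = []
    i = 0
    while i < len(phonemes):
        pair = phonemes[i:i + 2]
        if pair in MULTI:
            tokens.append(MULTI[pair])
            i += 2
        elif phonemes[i] == " ":
            tokens.append("SP")  # word boundary
            i += 1
        else:
            tokens.append(phonemes[i])  # plain single-character phoneme
            i += 1
    return tokens


print(to_atomic("2aLLaahu SSamadu_Q"))
```

On the Quick Start string this sketch yields the same token list shown there, but the real tokenizer may cover symbols this table does not.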

3. 🛡️ Robust Search & Fallback

  • FTS5 Search: Instantly finds verses even with missing diacritics or spelling variations.
  • Smart Normalization: Handles Tatweel (ـ), Alif Khanjareeya (ٰ), and Hamza forms transparently.
  • Fallback Engine: If text is not in the Quran, a sophisticated rule-based engine generates phonetically accurate approximations, ensuring your pipeline never crashes.
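To make the normalization step above concrete, here is a minimal sketch of how a search key might be derived from input text. The function `normalize_for_search` is hypothetical and the exact rules the library applies are not documented here; this version removes Tatweel, the dagger Alif, and short-vowel marks, and folds common Alif/Hamza carriers into a bare Alif.

```python
import re

TATWEEL = "\u0640"       # elongation character
DAGGER_ALIF = "\u0670"   # Alif Khanjareeya
HARAKAT = re.compile("[\u064B-\u0652]")  # Fathatan through Sukun, including Shadda

# Fold Alif variants into bare Alif (assumed rule set for this sketch).
ALIF_VARIANTS = str.maketrans({
    "\u0623": "\u0627",  # Alif with Hamza above
    "\u0625": "\u0627",  # Alif with Hamza below
    "\u0622": "\u0627",  # Alif with Madda
    "\u0671": "\u0627",  # Alif Wasla
})


def normalize_for_search(text: str) -> str:
    """Reduce Arabic text to a diacritic-free key suitable for lookup."""
    text = text.replace(TATWEEL, "").replace(DAGGER_ALIF, "")
    text = HARAKAT.sub("", text)
    return text.translate(ALIF_VARIANTS)


# The fully vocalized word "Allah" and its bare spelling reduce to the same key.
vocalized = "\u0671\u0644\u0644\u0651\u064E\u0647\u064F"
bare = "\u0627\u0644\u0644\u0647"
print(normalize_for_search(vocalized) == bare)
```

A key like this is only for matching; phonemization itself still needs the original diacritics.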

📦 Installation & Setup

  1. Clone the repository:

    git clone https://github.com/RazwanSiktany/quran-phonemizer.git
    cd quran-phonemizer
    
  2. Install dependencies:

    pip install .
    
  3. Build the Database (First Run):

    python src/quran_phonemizer/db_builder.py
    

(Note: When installing via pip in the future, the database will be included automatically)


👨‍💻 Author

Razwan M. Haji

📄 License

This project is licensed under the MIT License.
