Industrial-strength Deterministic Phonemizer for Quranic Arabic (Tajweed-aware)

📖 quran-phonemizer

quran-phonemizer is an industrial-strength, Tajweed-aware phonemization library designed specifically for Quranic Arabic.

It uses a Hybrid Architecture that combines a "Golden Source" database (82,000+ expert-verified words) with a rule-based fallback engine. Quranic text is resolved against the verified database, while Hadith, poetry, or imperfect input is handled by the fallback rules.
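The lookup-then-fallback idea can be sketched in a few lines. This is an illustration only, not the library's code: `make_hybrid_phonemizer`, the toy dictionary, and the lowercase "rule" are all hypothetical stand-ins.

```python
from typing import Callable, Dict

def make_hybrid_phonemizer(
    golden: Dict[str, str],
    fallback: Callable[[str], str],
) -> Callable[[str], str]:
    """Build a word-level phonemizer: verified entries first, rules otherwise."""
    def phonemize_word(word: str) -> str:
        hit = golden.get(word)
        # Use the verified entry when present, else defer to the rule engine.
        return hit if hit is not None else fallback(word)
    return phonemize_word

# Toy data, purely illustrative (Latin keys instead of Arabic script).
golden_source = {"kitab": "kitaab"}
phonemize_word = make_hybrid_phonemizer(golden_source, fallback=str.lower)

print(phonemize_word("kitab"))  # kitaab (database hit)
print(phonemize_word("QALAM"))  # qalam  (fallback rule)
```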

It is specifically optimized for training Neural TTS models (VITS, FastSpeech2) in the style of reciters like Mishary Al-Afasy.


🚀 Live Demo

Try the library instantly in your browser: 👉 Click here to open the Live App

Or run it locally:

pip install streamlit
streamlit run demo_app.py

⚡ Quick Start

from quran_phonemizer import QuranPhonemizer

# 1. Initialize Engine (Loads bundled DB)
qp = QuranPhonemizer()

# 2. Phonemize Quranic Text (Database Mode)
text = "ٱللَّهُ ٱلصَّمَدُ"
phonemes = qp.phonemize_text(text)

print(phonemes)
# Output: 2aLLaahu SSamad_Q
# (Note: '2' = glottal stop, 'LL' = heavy Lam, '_Q' = Qalqalah at a stop)

# 3. Get Atomic Tokens for ML Training (VITS format)
tokens = qp.tokenize_to_atomic(phonemes)
print(tokens)
# Output: ['2', 'a', 'L_H', 'aa', 'h', 'u', 'SP', 'S', 'S', 'a', 'm', 'a', 'd', 'u', 'QK']

🎛️ Advanced Configuration

You can customize how verses are connected or separated.

# Multi-Verse Input
text = "بِسْمِ ٱللَّهِ ٱلرَّحْمَـٰنِ ٱلرَّحِيمِ ١ ٱلْحَمْدُ لِلَّهِ رَبِّ ٱلْعَـٰلَمِينَ ٢"

# Option A: Standard Stop (Waqf) at each Ayah
print(qp.phonemize_text(text, segment_separator=" | "))
# Output: ...RRaHiym: | 2alHamdu...

# Option B: Continuous Recitation (Wasl)
print(qp.phonemize_text(text, apply_stopping_rules=False))
# Output: ...RRaHiymi lHamdu...
# (Note: Preserves vowel 'i' and merges Hamzat Wasl)
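The difference between the two modes can be illustrated with a minimal sketch. It is not the library's implementation: `apply_waqf`, the short-vowel set, and the phoneme symbols assumed for the Qalqalah letters are hypothetical, and Madd lengthening at a stop (the `:` in the Option A output) is deliberately left out.

```python
SHORT_VOWELS = {"a", "i", "u"}        # assumed short-vowel symbols
QALQALAH = {"q", "T", "b", "j", "d"}  # assumed symbols for the five Qalqalah letters

def apply_waqf(word: str) -> str:
    """Drop a lone final short vowel and mark Qalqalah when stopping on a word."""
    # A doubled vowel such as "aa" is treated as long and left intact.
    if len(word) >= 2 and word[-1] in SHORT_VOWELS and word[-2] != word[-1]:
        word = word[:-1]
    # After the vowel is dropped, an exposed Qalqalah consonant gets the marker.
    if word and word[-1] in QALQALAH:
        word += "_Q"
    return word

print(apply_waqf("SSamadu"))   # SSamad_Q
print(apply_waqf("RRaHiymi"))  # RRaHiym
```

In continuous recitation (Wasl) this step would simply be skipped, so the final vowel survives.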

🌟 Key Features

1. 🕌 High-Fidelity Tajweed

Captures nuances that standard Arabic G2P tools miss:

  • Heavy/Light Letters (Tafkhim/Tarqiq):
    • Distinguishes Heavy Lam (LL) in "Allah" vs Light Lam (ll).
    • Distinguishes Heavy Ra (R) vs Light Ra (r).
  • Madd (Elongation): Colon markers encode duration in counts (: = 2, :: = 4, ::: = 6).
  • Qalqalah (Echo): Automatically adds _Q when stopping on Qaf, Ta (ط), Ba, Jim, or Dal.
  • Ghunnah (Nasal): Marks nasalization (ŋ) for Noon/Meem Shadda.
  • Idgham/Iqlab: Merges sounds across word boundaries (e.g. min ba'di -> mim ba'di).
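For downstream use (for example, deriving duration targets for a TTS model), the Madd markers can be read back out of a phoneme string. The helper below is a hypothetical sketch built only on the marker-to-count mapping listed above; `madd_durations` is not part of the library.

```python
import re
from typing import List

# Marker-to-count mapping as described in the feature list.
MADD_COUNTS = {":": 2, "::": 4, ":::": 6}

def madd_durations(phonemes: str) -> List[int]:
    """Return the count of each Madd marker, in order of appearance."""
    durations = []
    for run in re.findall(r":+", phonemes):
        if run not in MADD_COUNTS:
            raise ValueError(f"unexpected Madd marker: {run!r}")
        durations.append(MADD_COUNTS[run])
    return durations

print(madd_durations("RRaHiym:"))  # [2]
```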

2. 🧠 ML-Ready Tokenization

Includes a tokenizer that converts human-readable strings into Atomic Tokens for model training.

  • Separated Phonemes: in becomes ['i', 'n'].
  • Symbol Mapping: _Q becomes QK, LL becomes L_H.
  • Word Boundaries: Inserts SP tokens automatically.
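One common way to get this behavior is a greedy longest-match pass over a symbol inventory. The sketch below assumes a small inventory (`SYMBOL_MAP`) and a hypothetical `tokenize` function; the library's real inventory and its `tokenize_to_atomic` logic may differ.

```python
from typing import List

# Assumed inventory: multi-character symbols and the token each maps to.
SYMBOL_MAP = {"_Q": "QK", "LL": "L_H", "aa": "aa", "ii": "ii", "uu": "uu"}
MAX_LEN = max(len(symbol) for symbol in SYMBOL_MAP)

def tokenize(phonemes: str) -> List[str]:
    """Split a phoneme string into atomic tokens, inserting SP between words."""
    tokens: List[str] = []
    for index, word in enumerate(phonemes.split()):
        if index > 0:
            tokens.append("SP")  # word boundary
        i = 0
        while i < len(word):
            # Try the longest known symbol first, then shorter ones.
            for size in range(MAX_LEN, 0, -1):
                chunk = word[i:i + size]
                if chunk in SYMBOL_MAP:
                    tokens.append(SYMBOL_MAP[chunk])
                    i += size
                    break
            else:
                # No multi-character symbol matched: emit a single character.
                tokens.append(word[i])
                i += 1
    return tokens

print(tokenize("2aLLaahu SSamad_Q"))
```

With this toy inventory, `tokenize("in")` gives `['i', 'n']`, and a doubled consonant such as `SS` stays as two separate tokens because it is not listed as a symbol.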

3. 🛡️ Robust Search & Fallback

  • FTS5 Search: Instantly finds verses even with missing diacritics or spelling variations.
  • Smart Normalization: Handles Tatweel (ـ), Alif Khanjareeya (ٰ), and Hamza forms transparently.
  • Fallback Engine: If text is not in the Quran, a sophisticated rule-based engine generates phonetically accurate approximations.
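A search key of this kind is typically produced by stripping decorative and diacritic code points and folding Alif variants together. The sketch below shows one plausible normalization under those assumptions. `normalize` is a hypothetical name, and the exact set of characters the library folds is not documented here.

```python
import re

TATWEEL = "\u0640"
# Harakat (U+064B..U+0652) plus the dagger Alif, Alif Khanjareeya (U+0670).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
# Alif with Madda, Hamza above, Hamza below, and Wasla all fold to bare Alif.
ALIF_FORMS = str.maketrans({
    "\u0622": "\u0627",
    "\u0623": "\u0627",
    "\u0625": "\u0627",
    "\u0671": "\u0627",
})

def normalize(text: str) -> str:
    """Reduce Arabic text to a bare consonantal skeleton for matching."""
    text = text.replace(TATWEEL, "")
    text = DIACRITICS.sub("", text)
    return text.translate(ALIF_FORMS)

# "ar-Rahman" in Uthmani spelling, written with escapes to keep the marks explicit.
word = "\u0671\u0644\u0631\u0651\u064E\u062D\u0652\u0645\u064E\u0640\u0670\u0646\u0650"
print(normalize(word) == "\u0627\u0644\u0631\u062D\u0645\u0646")  # True
```

Such a skeleton is only suitable as a lookup key; it discards vowel information and should not be fed to the phonemizer itself.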

📦 Installation & Setup

pip install quran-phonemizer

(The Golden Source database is included automatically)


👨‍💻 Author

Razwan M. Haji

📄 License

This project is licensed under the MIT License.

Download files

Source Distribution

quran_phonemizer-1.0.1.tar.gz (4.1 MB)

Uploaded Source

Built Distribution

quran_phonemizer-1.0.1-py3-none-any.whl (3.5 MB)

Uploaded Python 3

File details

Details for the file quran_phonemizer-1.0.1.tar.gz.

File metadata

  • Download URL: quran_phonemizer-1.0.1.tar.gz
  • Upload date:
  • Size: 4.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for quran_phonemizer-1.0.1.tar.gz
Algorithm Hash digest
SHA256 6b1d1f1f70e53b79286c9c6893440bead54d1c56246211fb311a3038d8f46475
MD5 4bdaee6f50cd4fb1aa64bbb735d093ed
BLAKE2b-256 cdc11b5ec6d283e184db897960eefdc764bf10548e4c7b3203da86aa878e63e6

File details

Details for the file quran_phonemizer-1.0.1-py3-none-any.whl.

File hashes

Hashes for quran_phonemizer-1.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 88618a70940c24cb45b7d9df61ebad341c55e288c35e3706006160f56c27824a
MD5 2010d2e3e36d56cf2add5008fde7d886
BLAKE2b-256 48163727e6ca168b98fb175384956a76de0dd219381c11c1dab1bd32dce879ba
