Industrial-strength Deterministic Phonemizer for Quranic Arabic (Tajweed-aware)
quran-phonemizer
quran-phonemizer is an industrial-strength, Tajweed-aware phonemization library designed specifically for Quranic Arabic.
It uses a hybrid architecture that combines a "Golden Source" database (82,000+ expert-verified words) with a rule-based fallback engine. Quranic text is resolved from the verified database, while Hadith, poetry, or imperfect input falls back to the rules.
It is specifically optimized for training Neural TTS models (VITS, FastSpeech2) in the style of reciters like Mishary Al-Afasy.
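The lookup-then-fallback idea can be illustrated with a minimal sketch. This is not the library's implementation: `GOLDEN`, `FALLBACK_MAP`, `rule_based`, and `phonemize_word` are hypothetical names, the dictionary stands in for the bundled database, and the letter map is a toy placeholder for a real rule engine.

```python
from typing import Dict, Tuple

# Hypothetical stand-in for the verified word database
# (the single entry is taken from the Quick Start example).
GOLDEN: Dict[str, str] = {"ٱللَّهُ": "2aLLaahu"}

# Toy letter map for the fallback path; a real rule engine would be far richer.
FALLBACK_MAP: Dict[str, str] = {"ب": "b", "ا": "aa", "ك": "k", "ت": "t"}


def rule_based(word: str) -> str:
    """Toy fallback: map known letters and skip anything unknown."""
    return "".join(FALLBACK_MAP.get(ch, "") for ch in word)


def phonemize_word(word: str) -> Tuple[str, str]:
    """Return (phonemes, source), preferring the verified entry when one exists."""
    if word in GOLDEN:
        return GOLDEN[word], "database"
    return rule_based(word), "fallback"


if __name__ == "__main__":
    for w in list(GOLDEN) + ["باب"]:
        print(w, phonemize_word(w))
```

Returning the source alongside the phonemes makes it easy to audit how much of a corpus was served by verified entries versus approximated by rules.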
Live Demo
Try the library in your browser: Click here to open the Live App
Or run it locally:
pip install streamlit
streamlit run demo_app.py
Quick Start
from quran_phonemizer import QuranPhonemizer
# 1. Initialize Engine (Loads bundled DB)
qp = QuranPhonemizer()
# 2. Phonemize Quranic Text (Database Mode)
text = "ٱللَّهُ ٱلصَّمَدُ"
phonemes = qp.phonemize_text(text)
print(phonemes)
# Output: 2aLLaahu SSamad_Q
# (Note: '2' = glottal stop, 'LL' = heavy Lam, '_Q' = Qalqalah on the final stop)
# 3. Get Atomic Tokens for ML Training (VITS format)
tokens = qp.tokenize_to_atomic(phonemes)
print(tokens)
# Output: ['2', 'a', 'L_H', 'aa', 'h', 'u', 'SP', 'S', 'S', 'a', 'm', 'a', 'd', 'u', 'QK']
Advanced Configuration
You can customize how verses are connected or separated.
# Multi-Verse Input
text = "بِسْمِ ٱللَّهِ ٱلرَّحْمَـٰنِ ٱلرَّحِيمِ ١ ٱلْحَمْدُ لِلَّهِ رَبِّ ٱلْعَـٰلَمِينَ ٢"
# Option A: Standard Stop (Waqf) at each Ayah
print(qp.phonemize_text(text, segment_separator=" | "))
# Output: ...RRaHiym: | 2alHamdu...
# Option B: Continuous Recitation (Wasl)
print(qp.phonemize_text(text, apply_stopping_rules=False))
# Output: ...RRaHiymi lHamdu...
# (Note: Preserves vowel 'i' and merges Hamzat Wasl)
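The difference between stopping (Waqf) and connecting (Wasl) can be sketched on already-phonemized strings. This is a deliberately simplified illustration, not the library's logic: `apply_waqf` and `join_verses` are hypothetical helpers, the short-vowel symbols are assumed, and the sketch ignores Madd lengthening and Hamzat Wasl handling, both of which appear in the outputs above.

```python
from typing import List

SHORT_VOWELS = ("a", "i", "u")  # assumed short-vowel symbols


def apply_waqf(verse: str) -> str:
    """Drop a final short vowel, as when pausing at the end of a verse.

    A doubled vowel such as 'aa' is treated as long and left untouched.
    """
    if len(verse) >= 2 and verse[-1] in SHORT_VOWELS and verse[-2] != verse[-1]:
        return verse[:-1]
    return verse


def join_verses(verses: List[str], stop_at_ayah: bool = True,
                separator: str = " | ") -> str:
    """Join verses with a pause marker (Waqf) or as continuous speech (Wasl)."""
    if stop_at_ayah:
        return separator.join(apply_waqf(v) for v in verses)
    return " ".join(verses)


if __name__ == "__main__":
    fragments = ["RRaHiymi", "lHamdu"]
    print(join_verses(fragments))                      # RRaHiym | lHamd
    print(join_verses(fragments, stop_at_ayah=False))  # RRaHiymi lHamdu
```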
Key Features
1. High-Fidelity Tajweed
Captures nuances that standard Arabic G2P tools miss:
- Heavy/Light Letters (Tafkhim/Tarqiq):
  - Distinguishes Heavy Lam (`LL`) in "Allah" vs Light Lam (`ll`).
  - Distinguishes Heavy Ra (`R`) vs Light Ra (`r`).
- Madd (Elongation): Numeric markers for duration (`:` = 2, `::` = 4, `:::` = 6 counts).
- Qalqalah (Echo): Automatically adds `_Q` when stopping on Qaf, Taa, Ba, Jim, Dal.
- Ghunnah (Nasal): Marks nasalization (`ŋ`) for Noon/Meem Shadda.
- Idgham/Iqlab: Merges sounds across word boundaries (e.g. `min ba'di` -> `mim ba'di`).
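Two of the markers above lend themselves to a short sketch: decoding Madd durations from runs of `:` and appending `_Q` when stopping on a Qalqalah letter. The helper names are hypothetical, and the phoneme symbols chosen for Qaf, Taa, Ba, Jim, and Dal (`q`, `T`, `b`, `j`, `d`) are assumptions rather than the library's documented inventory.

```python
import re
from typing import List

# Number of ':' markers -> duration in counts, as listed above.
MADD_COUNTS = {1: 2, 2: 4, 3: 6}

# Assumed phoneme symbols for Qaf, Taa, Ba, Jim, Dal.
QALQALAH_STOPS = ("q", "T", "b", "j", "d")


def madd_durations(phonemes: str) -> List[int]:
    """Return the duration in counts for each run of ':' markers."""
    return [MADD_COUNTS[len(run)] for run in re.findall(r":{1,3}", phonemes)]


def add_qalqalah(word: str) -> str:
    """Append '_Q' when the word being stopped on ends in a Qalqalah letter."""
    if word.endswith(QALQALAH_STOPS):
        return word + "_Q"
    return word


if __name__ == "__main__":
    print(madd_durations("a: b:: c:::"))  # [2, 4, 6]
    print(add_qalqalah("SSamad"))         # SSamad_Q
    print(add_qalqalah("RRaHiym"))        # RRaHiym
```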
2. ML-Ready Tokenization
Includes a tokenizer that converts human-readable strings into Atomic Tokens for model training.
- Separated Phonemes: `in` becomes `['i', 'n']`.
- Symbol Mapping: `_Q` becomes `QK`, `LL` becomes `L_H`.
- Word Boundaries: Inserts `SP` tokens automatically.
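The three rules above can be approximated with a small scanner. This is a sketch under stated assumptions, not the library's `tokenize_to_atomic`: the function name `tokenize` and the `SYMBOLS` table are hypothetical, only the `_Q` and `LL` mappings come from the list above, and keeping `aa` as a single token is an assumption. Its output need not match the library's in every detail.

```python
from typing import Dict, List

# Two-character symbols and their atomic token names.
# '_Q' -> 'QK' and 'LL' -> 'L_H' come from the feature list;
# treating 'aa' as one token is an assumption.
SYMBOLS: Dict[str, str] = {"_Q": "QK", "LL": "L_H", "aa": "aa"}


def tokenize(phonemes: str) -> List[str]:
    """Split a phoneme string into atomic tokens, inserting 'SP' at spaces."""
    tokens: List[str] = []
    i = 0
    while i < len(phonemes):
        pair = phonemes[i:i + 2]
        if pair in SYMBOLS:          # known two-character symbol
            tokens.append(SYMBOLS[pair])
            i += 2
        elif phonemes[i] == " ":     # word boundary
            tokens.append("SP")
            i += 1
        else:                        # single-character phoneme
            tokens.append(phonemes[i])
            i += 1
    return tokens


if __name__ == "__main__":
    print(tokenize("in"))  # ['i', 'n']
    print(tokenize("2aLLaahu SSamad_Q"))
```

Checking two-character symbols before single characters is what keeps `_Q` and `LL` from being split into meaningless pieces.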
3. Robust Search & Fallback
- FTS5 Search: Instantly finds verses even with missing diacritics or spelling variations.
- Smart Normalization: Handles Tatweel (ـ), Alif Khanjareeya (ٰ), and Hamza forms transparently.
- Fallback Engine: If text is not in the Quran, a sophisticated rule-based engine generates phonetically accurate approximations.
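A normalization pass of the kind described above might look like the following sketch. The function name `normalize` and the specific folding choices are assumptions (for example, dropping the dagger alif rather than expanding it, and folding hamza-on-alif forms to a bare alif); the library's actual rules may differ.

```python
import re

TATWEEL = "\u0640"        # elongation stroke
DAGGER_ALIF = "\u0670"    # Alif Khanjareeya (superscript alif)

# Tanween, short vowels, shadda, sukun (U+064B..U+0652).
DIACRITICS = re.compile("[\u064B-\u0652]")

# Assumed folding: alif with hamza above/below, alif madda,
# and alif wasla all become a bare alif.
ALIF_FORMS = str.maketrans({
    "\u0623": "\u0627",
    "\u0625": "\u0627",
    "\u0622": "\u0627",
    "\u0671": "\u0627",
})


def normalize(text: str) -> str:
    """Reduce Arabic text to a bare skeleton suitable for fuzzy matching."""
    text = text.replace(TATWEEL, "").replace(DAGGER_ALIF, "")
    text = DIACRITICS.sub("", text)
    return text.translate(ALIF_FORMS)


if __name__ == "__main__":
    print(normalize("ٱلرَّحْمَـٰنِ"))
```

Applying the same function to both the indexed text and the query is what lets a search match input that lacks diacritics or uses a different alif spelling.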
Installation & Setup
pip install quran-phonemizer
(The Golden Source database is included automatically)
Author
Razwan M. Haji
- GitHub: RazwanSiktany
- PyPI: quran-phonemizer
License
This project is licensed under the MIT License.
File details
Details for the file quran_phonemizer-1.0.1.tar.gz.
File metadata
- Download URL: quran_phonemizer-1.0.1.tar.gz
- Size: 4.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6b1d1f1f70e53b79286c9c6893440bead54d1c56246211fb311a3038d8f46475 |
| MD5 | 4bdaee6f50cd4fb1aa64bbb735d093ed |
| BLAKE2b-256 | cdc11b5ec6d283e184db897960eefdc764bf10548e4c7b3203da86aa878e63e6 |
File details
Details for the file quran_phonemizer-1.0.1-py3-none-any.whl.
File metadata
- Download URL: quran_phonemizer-1.0.1-py3-none-any.whl
- Size: 3.5 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 88618a70940c24cb45b7d9df61ebad341c55e288c35e3706006160f56c27824a |
| MD5 | 2010d2e3e36d56cf2add5008fde7d886 |
| BLAKE2b-256 | 48163727e6ca168b98fb175384956a76de0dd219381c11c1dab1bd32dce879ba |