Industrial-strength Deterministic Phonemizer for Quranic Arabic (Tajweed-aware)
📖 quran-phonemizer
quran-phonemizer is an industrial-strength, Tajweed-aware phonemization library designed specifically for Quranic Arabic.
It uses a Hybrid Architecture combining a "Golden Source" database (82,000+ expert-verified words) with a robust rule-based fallback engine. This ensures 100% accuracy for Quranic text while gracefully handling Hadith, poetry, or imperfect input.
It is specifically optimized for training Neural TTS models (VITS, FastSpeech2) in the style of reciters like Mishary Al-Afasy.
🚀 Live Demo
Try the library instantly in your browser with the Live App.
Or run it locally:
pip install streamlit
streamlit run demo_app.py
⚡ Quick Start
from quran_phonemizer import QuranPhonemizer
# 1. Initialize Engine (Loads DB & FTS5 Index)
qp = QuranPhonemizer()
# 2. Phonemize Quranic Text (Database Mode)
# "Allah is the Eternal Refuge"
text = "ٱللَّهُ ٱلصَّمَدُ"
phonemes = qp.phonemize_text(text)
print(phonemes)
# Output: 2aLLaahu SSamadu_Q
# (Note: '2' = Glottal Stop, 'LL' = Heavy Lam, '_Q' = Qalqalah on Stop)
# 3. Get Atomic Tokens for ML Training (VITS format)
tokens = qp.tokenize_to_atomic(phonemes)
print(tokens)
# Output: ['2', 'a', 'L_H', 'aa', 'h', 'u', 'SP', 'S', 'S', 'a', 'm', 'a', 'd', 'u', 'QK']
🌟 Key Features
1. 🕌 High-Fidelity Tajweed
Captures nuances that standard Arabic G2P tools miss:
- Heavy/Light Letters (Tafkhim/Tarqiq):
  - Distinguishes Heavy Lam (`LL`) in "Allah" vs Light Lam (`ll`).
  - Distinguishes Heavy Ra (`R`) vs Light Ra (`r`).
- Madd (Elongation): Numeric markers for duration (`:` = 2, `::` = 4, `:::` = 6 counts).
- Qalqalah (Echo): Automatically adds `_Q` when stopping on Qaf, Taa, Ba, Jim, Dal.
- Ghunnah (Nasal): Marks nasalization (`ŋ`) for Noon/Meem Shadda.
- Idgham/Iqlab: Merges sounds across word boundaries (e.g. `min ba'di` -> `mim ba'di`).
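As a rough illustration of how rules like these can be expressed, here is a minimal sketch operating on romanized strings. The helper names (`madd_counts`, `apply_iqlab`, `mark_qalqalah_on_stop`) and the romanization of the Qalqalah letters are assumptions made for this example; they are not the library's API, and the real engine handles many more conditions.

```python
# Illustrative only: these helpers are hypothetical and not part of quran-phonemizer.

# Assumed romanization for the five Qalqalah letters (Qaf, Taa, Ba, Jim, Dal).
QALQALAH_LETTERS = {"q", "T", "b", "j", "d"}
SHORT_VOWELS = "aiu"


def madd_counts(marker):
    """Map a Madd marker (':', '::', ':::') to its duration in counts."""
    if marker not in (":", "::", ":::"):
        raise ValueError(f"unknown Madd marker: {marker!r}")
    return 2 * len(marker)


def apply_iqlab(words):
    """Turn a word-final 'n' into 'm' when the next word starts with 'b'."""
    out = list(words)
    for i in range(len(out) - 1):
        if out[i].endswith("n") and out[i + 1].startswith("b"):
            out[i] = out[i][:-1] + "m"
    return out


def mark_qalqalah_on_stop(word):
    """Append '_Q' if the last consonant of the word is a Qalqalah letter."""
    stem = word.rstrip(SHORT_VOWELS)  # ignore trailing short vowels
    if stem and stem[-1] in QALQALAH_LETTERS:
        return word + "_Q"
    return word


print(apply_iqlab(["min", "ba'di"]))     # ['mim', "ba'di"]
print(mark_qalqalah_on_stop("SSamadu"))  # SSamadu_Q
print(madd_counts("::"))                 # 4
```

The Qalqalah helper keeps the final vowel and appends the marker, which mirrors the `SSamadu_Q` output shown in the Quick Start.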
2. 🧠 ML-Ready Tokenization
Includes a tokenizer that converts human-readable strings into Atomic Tokens for model training.
- Separated Phonemes: `in` becomes `['i', 'n']`.
- Symbol Mapping: `_Q` becomes `QK`, `LL` becomes `L_H`.
- Word Boundaries: Inserts `SP` tokens automatically.
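The three behaviours above can be sketched with a greedy two-character matcher. This is not the library's `tokenize_to_atomic` implementation: the function name `tokenize_to_atomic_sketch` is hypothetical, and treating long vowels such as `aa` as single tokens is an assumption inferred from the Quick Start output.

```python
# Illustrative sketch; not the library's actual tokenizer.

# Symbol mapping described above: multi-character markers become atomic tokens.
SYMBOL_MAP = {"_Q": "QK", "LL": "L_H"}
# Assumption: long vowels stay together as one token (the Quick Start shows 'aa').
LONG_VOWELS = ("aa", "ii", "uu")


def tokenize_to_atomic_sketch(phonemes):
    """Split a phoneme string into atomic tokens, inserting 'SP' between words."""
    tokens = []
    for word_index, word in enumerate(phonemes.split()):
        if word_index > 0:
            tokens.append("SP")  # word boundary
        i = 0
        while i < len(word):
            pair = word[i:i + 2]
            if pair in SYMBOL_MAP:
                tokens.append(SYMBOL_MAP[pair])
                i += 2
            elif pair in LONG_VOWELS:
                tokens.append(pair)
                i += 2
            else:
                tokens.append(word[i])  # single phoneme
                i += 1
    return tokens


print(tokenize_to_atomic_sketch("in"))
# ['i', 'n']
print(tokenize_to_atomic_sketch("2aLLaahu SSamadu_Q"))
# ['2', 'a', 'L_H', 'aa', 'h', 'u', 'SP', 'S', 'S', 'a', 'm', 'a', 'd', 'u', 'QK']
```

Checking the two-character candidates before falling back to a single character is what keeps `LL` and `_Q` from being split into separate symbols.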
3. 🛡️ Robust Search & Fallback
- FTS5 Search: Instantly finds verses even with missing diacritics or spelling variations.
- Smart Normalization: Handles Tatweel (ـ), Alif Khanjareeya (ٰ), and Hamza forms transparently.
- Fallback Engine: If text is not in the Quran, a sophisticated rule-based engine generates phonetically accurate approximations, ensuring your pipeline never crashes.
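To make the normalization and fallback ideas concrete, here is a minimal sketch of a lookup-with-fallback path. Everything in it is an assumption for illustration: the names `normalize_for_search` and `phonemize_word`, the choice to fold Hamza carriers onto their base letters, and the use of a plain dictionary in place of the real database. The library's own normalization and FTS5 index may work differently.

```python
import unicodedata

# Illustrative sketch; not the library's actual normalization or lookup code.

TATWEEL = "\u0640"
# Assumed folding of Hamza/Alif variants onto base letters for matching.
HAMZA_FORMS = str.maketrans({
    "\u0623": "\u0627",  # alif with hamza above -> alif
    "\u0625": "\u0627",  # alif with hamza below -> alif
    "\u0622": "\u0627",  # alif with madda -> alif
    "\u0671": "\u0627",  # alif wasla -> alif
    "\u0624": "\u0648",  # waw with hamza -> waw
    "\u0626": "\u064a",  # yeh with hamza -> yeh
})


def normalize_for_search(text):
    """Build a diacritic-insensitive search key from Arabic text."""
    text = text.replace(TATWEEL, "")
    # Combining marks (category Mn) cover the harakat, shadda, and the
    # superscript alif (Alif Khanjareeya, U+0670).
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    return text.translate(HAMZA_FORMS)


def phonemize_word(word, golden, fallback):
    """Prefer the verified entry; otherwise defer to a rule-based function."""
    key = normalize_for_search(word)
    if key in golden:
        return golden[key]
    return fallback(word)


# Toy stand-ins for the database and the rule engine.
golden = {normalize_for_search("\u0671\u0644\u0644\u0651\u064e\u0647\u064f"): "2aLLaahu"}
rule_based = lambda w: "<rule-based output>"

print(phonemize_word("\u0627\u0644\u0644\u0647", golden, rule_based))  # 2aLLaahu
print(phonemize_word("\u0643\u062a\u0627\u0628", golden, rule_based))  # <rule-based output>
```

Keying the toy table by normalized form lets an undiacritized query hit the same entry as the fully vocalized one, at the cost of discarding information a real phonemizer would need to keep alongside the key.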
📦 Installation & Setup
- Clone the repository:
  git clone https://github.com/RazwanSiktany/quran-phonemizer.git
  cd quran-phonemizer
- Install dependencies:
  pip install .
- Build the Database (First Run):
  python src/quran_phonemizer/db_builder.py
  (Note: When installing via pip in the future, the database will be included automatically.)
👨‍💻 Author
Razwan M. Haji
- GitHub: RazwanSiktany
- PyPI: quran-phonemizer
📄 License
This project is licensed under the MIT License.
File details
Details for the file quran_phonemizer-1.0.0.tar.gz.
File metadata
- Download URL: quran_phonemizer-1.0.0.tar.gz
- Upload date:
- Size: 4.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e41cb9497bf1568dbf08541a77b5e86d51030fd12a3a4cb1099b3a79916aadac |
| MD5 | daa2673da60c23384b6a4b9b4c307105 |
| BLAKE2b-256 | 950a270fdf892d01a980bbd41e9fc745cba9d9c4192871743405f618e76287fe |
File details
Details for the file quran_phonemizer-1.0.0-py3-none-any.whl.
File metadata
- Download URL: quran_phonemizer-1.0.0-py3-none-any.whl
- Upload date:
- Size: 3.5 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b29d846581a2a3288e372787e9d5df78439e318f9309071b63906d0682354f71 |
| MD5 | 589322ecb6ff00213a3d14bd8703c2f3 |
| BLAKE2b-256 | a2e3fb5c956e704bb253e18849089e5192c992c72f3527520210cd9e41e4600c |