
A deterministic, rule-based Quranic phonemizer for TTS and AI training.


📖 quran-phonemizer


quran-phonemizer is an industrial-strength, Tajweed-aware phonemization library designed specifically for Quranic Arabic.

It uses a Hybrid Architecture combining a "Golden Source" database (82,000+ expert-verified words) with a rule-based fallback engine. Quranic text is served from the verified database, which is how the project aims for 100% accuracy on the Quran itself, while Hadith, poetry, or imperfect input is handled gracefully by the fallback rules.

It is specifically optimized for training Neural TTS models (VITS, FastSpeech2) in the style of reciters like Mishary Al-Afasy.


🚀 Live Demo

Try the library instantly in your browser: 👉 Click here to open the Live App

Or run it locally:

pip install streamlit
streamlit run demo_app.py

⚡ Quick Start

from src.quran_phonemizer.core import QuranPhonemizer

# 1. Initialize Engine (Loads DB & FTS5 Index)
qp = QuranPhonemizer()

# 2. Phonemize Quranic Text (Database Mode)
# "Allah is the Eternal Refuge"
text = "ٱللَّهُ ٱلصَّمَدُ"
phonemes = qp.phonemize_text(text)

print(phonemes)
# Output: 2aLLaahu SSamadu_Q
# (Note: '2' = Glottal Stop, 'LL' = Heavy Lam, '_Q' = Qalqalah on Stop)

# 3. Get Atomic Tokens for ML Training (VITS format)
tokens = qp.tokenize_to_atomic(phonemes)
print(tokens)
# Output: ['2', 'a', 'L_H', 'aa', 'h', 'u', 'SP', 'S', 'S', 'a', 'm', 'a', 'd', 'u', 'QK']

🌟 Key Features

1. 🕌 High-Fidelity Tajweed

Captures nuances that standard Arabic G2P tools miss:

  • Heavy/Light Letters (Tafkhim/Tarqiq):
    • Distinguishes Heavy Lam (LL) in "Allah" vs Light Lam (ll).
    • Distinguishes Heavy Ra (R) vs Light Ra (r).
  • Madd (Elongation): Colon markers encode duration (: = 2, :: = 4, ::: = 6 counts).
  • Qalqalah (Echo): Automatically adds _Q when stopping on Qaf, Ṭa (ط), Ba, Jim, or Dal.
  • Ghunnah (Nasal): Marks nasalization (ŋ) for Noon/Meem Shadda.
  • Idgham/Iqlab: Merges sounds across word boundaries (e.g. min ba'di -> mim ba'di).
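
To make a few of the rules above concrete, here is a minimal sketch. It is not the library's implementation: the helper names (`stops_on_qalqalah`, `madd_counts`, `apply_iqlab`) are hypothetical, and the Iqlab rule works on a toy Latin transliteration only so that it can reproduce the `min ba'di -> mim ba'di` example.

```python
import re

# Arabic short-vowel, tanwin, shadda and sukun marks, plus the dagger alif.
HARAKAT = re.compile("[\u064B-\u0652\u0670]")

# The five Qalqalah letters: Qaf, Ta (emphatic), Ba, Jim, Dal.
QALQALAH_LETTERS = set("\u0642\u0637\u0628\u062C\u062F")

# Madd markers and their durations in counts, as listed above.
MADD_COUNTS = {":": 2, "::": 4, ":::": 6}


def stops_on_qalqalah(word: str) -> bool:
    """Return True if the last letter of the word is a Qalqalah letter."""
    bare = HARAKAT.sub("", word)  # drop diacritics so the last char is a letter
    return bool(bare) and bare[-1] in QALQALAH_LETTERS


def madd_counts(marker: str) -> int:
    """Translate a colon marker into its duration in counts."""
    return MADD_COUNTS[marker]


def apply_iqlab(translit: str) -> str:
    """Toy rule: a word-final 'n' becomes 'm' before a word-initial 'b'."""
    return re.sub(r"n(\s+)b", r"m\1b", translit)


print(apply_iqlab("min ba'di"))  # mim ba'di
print(madd_counts("::"))         # 4
```

A real Tajweed engine has to consider far more context (stopping points, Tanween, neighbouring letters); the sketch only shows the shape of each rule.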

2. 🧠 ML-Ready Tokenization

Includes a tokenizer that converts human-readable strings into Atomic Tokens for model training.

  • Separated Phonemes: in becomes ['i', 'n'].
  • Symbol Mapping: _Q becomes QK, LL becomes L_H.
  • Word Boundaries: Inserts SP tokens automatically.
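
The mapping described above can be sketched as a greedy two-character lookup. This is an illustrative stand-in, not the library's `tokenize_to_atomic`: the function name `tokenize_sketch` is hypothetical, and apart from `_Q -> QK`, `LL -> L_H` and the `aa` token seen in the Quick Start output, the table entries (`ii`, `uu`) are assumptions.

```python
# Two-character symbols that map to a single atomic token.
# Only "_Q", "LL" and "aa" are taken from the README; "ii" and "uu" are assumed.
MULTI_CHAR = {
    "_Q": "QK",
    "LL": "L_H",
    "aa": "aa",
    "ii": "ii",
    "uu": "uu",
}


def tokenize_sketch(phonemes: str) -> list:
    """Split a phoneme string into atomic tokens, inserting SP at spaces."""
    tokens = []
    i = 0
    while i < len(phonemes):
        if phonemes[i] == " ":
            tokens.append("SP")  # word boundary
            i += 1
            continue
        pair = phonemes[i:i + 2]
        if pair in MULTI_CHAR:
            tokens.append(MULTI_CHAR[pair])  # known two-character symbol
            i += 2
        else:
            tokens.append(phonemes[i])  # everything else is one token per char
            i += 1
    return tokens


print(tokenize_sketch("2aLLaahu SSamadu_Q"))
# ['2', 'a', 'L_H', 'aa', 'h', 'u', 'SP', 'S', 'S', 'a', 'm', 'a', 'd', 'u', 'QK']
```

Trying the two-character match first keeps symbols such as `_Q` from being split into `_` and `Q`, while a doubled consonant like `SS` still comes out as two separate tokens.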

3. 🛡️ Robust Search & Fallback

  • FTS5 Search: Instantly finds verses even with missing diacritics or spelling variations.
  • Smart Normalization: Handles Tatweel (ـ), Alif Khanjareeya (ٰ), and Hamza forms transparently.
  • Fallback Engine: If text is not in the Quran, a rule-based engine generates a phonetic approximation instead of failing, so your pipeline keeps running on unseen input.
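
As a rough illustration of the normalization and fallback ideas above, the sketch below builds a diacritic-free search key and then tries a lookup before calling a rule function. Everything here is an assumption made for illustration: `normalize_key`, `phonemize_with_fallback`, the toy `golden` dictionary and the placeholder rule function are hypothetical and do not reflect the library's actual schema or API.

```python
import re
from typing import Callable, Dict, Tuple

TATWEEL = "\u0640"
# Harakat (U+064B..U+0652) and the dagger alif (U+0670).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
# Fold common Alif/Hamza carrier forms onto a bare Alif.
ALIF_FORMS = str.maketrans({
    "\u0623": "\u0627",  # alif with hamza above
    "\u0625": "\u0627",  # alif with hamza below
    "\u0622": "\u0627",  # alif with madda
    "\u0671": "\u0627",  # alif wasla
})


def normalize_key(text: str) -> str:
    """Build a search key that ignores Tatweel, diacritics and Alif variants."""
    text = text.replace(TATWEEL, "")
    text = DIACRITICS.sub("", text)
    return text.translate(ALIF_FORMS)


def phonemize_with_fallback(
    word: str,
    golden: Dict[str, str],
    rules: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (phonemes, source), preferring the verified table over the rules."""
    key = normalize_key(word)
    if key in golden:
        return golden[key], "database"
    return rules(word), "fallback"


# Toy data: one entry keyed by its normalized form.
golden = {"\u0627\u0644\u0644\u0647": "2aLLaahu"}
word = "\u0671\u0644\u0644\u0651\u064E\u0647\u064F"  # the first word of the Quick Start text
print(phonemize_with_fallback(word, golden, lambda w: "<rule-based output>"))
```

A key this aggressive suits search; an actual phonemizer would still need the diacritics of the matched entry to produce correct output.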

📦 Installation & Setup

  1. Clone the repository:

    git clone https://github.com/RazwanSiktany/quran-phonemizer.git
    cd quran-phonemizer
    
  2. Install dependencies:

    pip install .
    
  3. Build the Database (First Run):

    python src/quran_phonemizer/db_builder.py
    

👨‍💻 Author

Razwan M. Haji

📄 License

This project is licensed under the MIT License.

