High-performance Urdu Grapheme-to-Phoneme converter
Urdu G2P - Grapheme-to-Phoneme Converter
Author: Humair Munir Awan (humairmunirawan@gmail.com)
A high-performance, production-ready Grapheme-to-Phoneme (G2P) library for Urdu. It converts Urdu text to IPA (International Phonetic Alphabet) phonemes by dictionary lookup, falling back to espeak-ng for out-of-vocabulary words.
✨ Features
- Refined Dictionary: 323,000+ single-word entries (634k+ total data points managed)
- Streaming & Memory Efficiency: Process multi-GB files line-by-line with constant low RAM usage
- Smart Fallback: Automatic espeak-ng fallback for out-of-vocabulary (OOV) words
- Robust Input Handling: Automatically filters emojis, symbols, and nonsense characters
- Quote Normalization: Unifies all quote variants (", “, ”, ‘, ’) to a single '
- Punctuation Mapping: Maps Urdu punctuation (۔, ،, ؟) to custom symbols (default: |, ~, ?)
- Vowel Length Normalization: Collapses repeated vowels (e.g., iii -> iː, aa -> aː)
- Configurable Output: Remove stress markers, language tags, and syllable dots
- Diverse Output Formats: Support for JSON, Dot-separated, and detailed token analytics
- High Performance: 168,000+ chars/sec throughput with LRU caching
- Type-Safe API: Full Python type hints with comprehensive docstrings
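The normalization features listed above can be illustrated with a small standalone sketch. It does not use the library's code; the function names and the vowel set are assumptions made for illustration, and only the quote variants, punctuation defaults, and the two vowel examples come from the list.

```python
import re

# Quote variants named in the feature list, all mapped to a plain apostrophe.
QUOTE_TABLE = str.maketrans({
    '"': "'",
    "\u201c": "'",  # left double quote
    "\u201d": "'",  # right double quote
    "\u2018": "'",  # left single quote
    "\u2019": "'",  # right single quote
})

# Urdu punctuation mapped to the default symbols from the feature list.
PUNCT_TABLE = str.maketrans({
    "\u06d4": "|",  # Urdu full stop
    "\u060c": "~",  # Arabic comma
    "\u061f": "?",  # Arabic question mark
})

# Assumed vowel set: a run of two or more identical vowels becomes one long vowel.
REPEATED_VOWEL = re.compile(r"([aeiou])\1+")


def normalize_quotes(text: str) -> str:
    """Replace every quote variant with a single apostrophe."""
    return text.translate(QUOTE_TABLE)


def map_punctuation(text: str) -> str:
    """Replace Urdu punctuation with the default output symbols."""
    return text.translate(PUNCT_TABLE)


def collapse_vowels(ipa: str) -> str:
    """Collapse a repeated vowel into the vowel plus the IPA length mark (U+02D0)."""
    return REPEATED_VOWEL.sub(lambda m: m.group(1) + "\u02d0", ipa)
```

Translation tables keep the quote and punctuation steps to a single pass over the string, while the vowel step needs a regular expression because it depends on neighbouring characters.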
🔄 How It Works
- Input: Urdu text (with optional mixed English, numbers, emojis)
- Text Cleaning: Filters out symbols, emojis, and non-linguistic characters
- Dictionary Lookup: Searches 478K+ word dictionary with smart diacritic handling
- Fallback: Uses espeak-ng for OOV words with IPA normalization
- Output: Clean IPA phonemes ready for TTS or linguistic analysis
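The steps above can be sketched as a clean, look up, fall back loop. This is a minimal illustration and not the library's implementation: the cleaning rule (keep only Arabic-script characters), the toy dictionary, and the placeholder fallback are all assumptions, and the real package calls espeak-ng where the placeholder is used here. The IPA values are copied from the Quick Start example.

```python
import re
from typing import Callable, Dict, List

# Toy dictionary; IPA values are taken from the Quick Start example in this README.
TOY_DICT: Dict[str, str] = {
    "پاکستان": "paːkɪsˈt̪aːn",
    "زندہ": "zɪnˈd̪ə",
    "باد": "baːd̪",
}

# Assumed cleaning rule: anything outside the Arabic block or whitespace becomes a space.
NON_URDU = re.compile(r"[^\u0600-\u06FF\s]+")


def clean(text: str) -> str:
    """Drop emojis, symbols, and other characters outside the Arabic block."""
    return NON_URDU.sub(" ", text)


def placeholder_fallback(word: str) -> str:
    """Stand-in for an espeak-ng call; it only marks the word as OOV."""
    return f"<oov:{word}>"


def convert(
    text: str,
    lexicon: Dict[str, str],
    fallback: Callable[[str], str] = placeholder_fallback,
) -> List[str]:
    """Return one phoneme string per word, using the fallback on dictionary misses."""
    phonemes: List[str] = []
    for word in clean(text).split():
        ipa = lexicon.get(word)
        if ipa is None:
            ipa = fallback(word)
        phonemes.append(ipa)
    return phonemes
```

Passing the fallback as a callable keeps the lookup loop testable without espeak-ng installed.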
📦 Installation
From PyPI (Recommended)
pip install urdu-g2p
From Source
# Clone the repository
git clone https://github.com/humair-m/urdu-g2p.git
cd urdu-g2p
# Install the package
pip install .
Dependencies
- Python 3.8+
- espeak-ng (Required for OOV fallback)
# Ubuntu/Debian
sudo apt-get install espeak-ng
# macOS
brew install espeak-ng
# Windows
# Download from: https://github.com/espeak-ng/espeak-ng/releases
🎯 Quick Start
Python API
from urdu_g2p import UrduG2P
# Initialize with default settings
g2p = UrduG2P()
# Basic conversion
text = "پاکستان زندہ باد"
phonemes = g2p(text)
print(' '.join(phonemes))
# Output: paːkɪsˈt̪aːn zɪnˈd̪ə baːd̪
# With stress removal
g2p_clean = UrduG2P(ignore_stress=True)
phonemes = g2p_clean("مجھے پاکستان پسند ہے")
print(' '.join(phonemes))
# Output: mʊd͡ʒeː paːkɪst̪aːn pəsənd̪ ɦɛ
Command Line Interface (CLI)
# Basic usage
python inference.py "اسلام آباد"
# Output: ɪslaːm aːbaːd̪
# JSON output with details
python inference.py "ٹیسٹ" --format json --pretty
# Dot-separated (TTS style)
python inference.py "ہیلو" --format dot
# Output: heː.loː
# Remove stress markers
python inference.py "مجھے" --strip-stress
# Output: mʊd͡ʒeː
🔧 Advanced Usage
Configuration Options
g2p = UrduG2P(
    fallback='auto',        # 'auto', True, or False
    diacritic_mode='auto',  # 'auto', 'ignore', 'strict'
    ignore_tag=True,        # Remove (en)/(ur) language tags
    ignore_stress=False,    # Remove stress markers (ˈ)
    save_oov_path=None      # Path to save OOV words
)
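To make the effect of ignore_tag and ignore_stress concrete, here is a hedged sketch of the kind of post-processing those options imply. The helper names and the tag pattern are assumptions; the library's internals may differ, and the secondary stress mark (ˌ) is included here although the option list only names the primary mark.

```python
import re

# Primary (U+02C8) and secondary (U+02CC) IPA stress marks, deleted by the table.
STRESS_MARKS = str.maketrans("", "", "\u02c8\u02cc")

# Assumed shape of language tags such as (en) or (ur).
LANG_TAG = re.compile(r"\([a-z]{2,3}(?:-[a-z]+)?\)")


def strip_stress(ipa: str) -> str:
    """Remove stress marks from an IPA string."""
    return ipa.translate(STRESS_MARKS)


def strip_tags(ipa: str) -> str:
    """Remove language-switch tags from an IPA string."""
    return LANG_TAG.sub("", ipa)
```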
OOV Tracking & Saving
Track words not found in the dictionary to improve your dataset:
g2p = UrduG2P(save_oov_path="oov_words.json")
g2p("یہ ایک ٹیسٹ ورڈ ہے۔")
g2p.save_oov() # Saves OOV words to JSON
print(g2p.get_oov()) # View OOV words
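For readers who want to see what OOV tracking amounts to, the following is a standalone sketch with a hypothetical OOVTracker class. It is not the library's code, and the JSON layout (a sorted list of words) is an assumption.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, List, Set


class OOVTracker:
    """Collect words that are missing from a lexicon and write them to JSON."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def check(self, words: Iterable[str], lexicon: Dict[str, str]) -> None:
        """Record every word that has no entry in the lexicon."""
        for word in words:
            if word not in lexicon:
                self._seen.add(word)

    def words(self) -> List[str]:
        """Return the recorded OOV words in sorted order."""
        return sorted(self._seen)

    def save(self, path: str) -> None:
        """Write the OOV words as a JSON list, keeping Urdu text readable."""
        data = json.dumps(self.words(), ensure_ascii=False, indent=2)
        Path(path).write_text(data, encoding="utf-8")
```

Storing the words in a set means a word that appears many times in the input is saved only once.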
Diacritic Modes
Handle text with or without vowel marks (Zer/Zabar/Pesh):
# Mode: 'ignore' (Best for heavily diacritized text)
g2p = UrduG2P(diacritic_mode='ignore')
print(' '.join(g2p("اَلسَّلَامُ"))) # -> æs.səˈlaːm
# Mode: 'strict' (Exact match only)
g2p = UrduG2P(diacritic_mode='strict')
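One plausible reading of the three modes is sketched below. It assumes the dictionary keys are stored without diacritics and that 'auto' tries an exact match before a stripped one; neither point is confirmed by this README, so treat the lookup function as hypothetical.

```python
import re
from typing import Dict, Optional

# U+064B..U+0652 cover tanween, zabar, pesh, zer, shadda, and sukun; U+0670 is superscript alef.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")


def strip_diacritics(word: str) -> str:
    """Remove Arabic-script vowel marks from a word."""
    return DIACRITICS.sub("", word)


def lookup(word: str, lexicon: Dict[str, str], mode: str = "auto") -> Optional[str]:
    """Look a word up under an assumed interpretation of each diacritic mode."""
    if mode == "strict":
        # Exact match only.
        return lexicon.get(word)
    if mode == "ignore":
        # Always compare without diacritics.
        return lexicon.get(strip_diacritics(word))
    if mode == "auto":
        # Prefer the exact form, then retry without diacritics.
        hit = lexicon.get(word)
        return hit if hit is not None else lexicon.get(strip_diacritics(word))
    raise ValueError(f"unknown diacritic mode: {mode!r}")
```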
Detailed Inference (JSON)
Get rich information about each token:
from inference import UrduG2PInference
inference = UrduG2PInference()
result = inference.predict("گوگل", format='json')
print(result['tokens'][0])
# {
# 'word': 'گوگل',
# 'phoneme': 'ɡuːɡəl',
# 'source': 'dict',
# 'exact_match': True
# }
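If you want to write such token records to a file, a sketch like the following may help. TokenResult and to_json are hypothetical names; only the field names and the 'tokens' key are taken from the example above. The point of the sketch is ensure_ascii=False, which keeps Urdu and IPA characters readable instead of escaping them.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class TokenResult:
    """One token record with the fields shown in the example above."""
    word: str
    phoneme: str
    source: str
    exact_match: bool


def to_json(tokens: List[TokenResult], pretty: bool = False) -> str:
    """Serialize token records without escaping non-ASCII characters."""
    payload = {"tokens": [asdict(token) for token in tokens]}
    return json.dumps(payload, ensure_ascii=False, indent=2 if pretty else None)
```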
Custom Phonemes
Override dictionary or fallback results:
g2p = UrduG2P()
g2p.add_custom_phoneme("آرٹیفیشل", "ɑːrʈiːfɪʃəl")
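The override behaviour can be pictured as a layered lookup in which custom entries shadow the base dictionary. The sketch below uses collections.ChainMap from the standard library; build_lexicon and add_override are hypothetical helpers, and the library may store overrides differently.

```python
from collections import ChainMap
from typing import Dict


def build_lexicon(base: Dict[str, str]) -> ChainMap:
    """Put an empty override layer in front of the base dictionary."""
    return ChainMap({}, base)


def add_override(lexicon: ChainMap, word: str, ipa: str) -> None:
    """Store a custom pronunciation; ChainMap writes go to the first layer only."""
    lexicon[word] = ipa
```

Because writes land in the front layer, the base dictionary stays unchanged and an override can be removed later without reloading the data.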
📁 Project Structure
urdu-g2p/
├── urdu_g2p/ # Main package
│ ├── data/ # Phoneme dictionary (30MB+)
│ │ └── phoneme_map.json # 478K+ word mappings
│ └── g2p.py # Core G2P logic
├── tests/ # Test suite
│ ├── test_basic.py
│ ├── test_comprehensive.py
│ ├── test_robustness.py # Emoji/symbol filtering tests
│ └── benchmark.py # Performance tests
├── examples/
│ └── demo.py # Usage examples
├── assets/ # Images for documentation
├── inference.py # CLI tool
├── pyproject.toml # Build configuration
└── README.md # This file
📊 Performance
| Metric | Value |
|---|---|
| Clean Dictionary | 323,000+ single words |
| Unique IPA Characters | 92 (Optimized) |
| Throughput | 168,000+ chars/sec |
| Memory Usage | Streaming (Files) / ~150MB (Dict) |
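The streaming and caching figures above rest on two standard techniques, sketched here without the library. stream_convert and make_cached are hypothetical names, and convert_word stands in for whatever per-word conversion you use; the cache size is an arbitrary assumption.

```python
from functools import lru_cache
from typing import Callable, Iterable, Iterator


def make_cached(convert_word: Callable[[str], str], maxsize: int = 100_000) -> Callable[[str], str]:
    """Wrap a per-word converter in an LRU cache so repeated words are computed once."""
    return lru_cache(maxsize=maxsize)(convert_word)


def stream_convert(lines: Iterable[str], convert_word: Callable[[str], str]) -> Iterator[str]:
    """Yield one converted line at a time so the whole input never sits in memory."""
    for line in lines:
        yield " ".join(convert_word(word) for word in line.split())
```

An open file object is an iterable of lines, so it can be passed to stream_convert directly; memory use then depends on the longest line and the cache size rather than on the file size.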
📚 Citation
If you use this library in your research, please cite:
@software{urdu_g2p_2026,
author = {Awan, Humair Munir},
title = {Urdu G2P: A High-Performance Grapheme-to-Phoneme Converter for Urdu},
year = {2026},
publisher = {GitHub},
url = {https://github.com/humair-m/urdu-g2p},
version = {2.0.0},
note = {478,000+ word dictionary with espeak-ng fallback. Non-commercial use only.}
}
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Run tests (pytest tests/)
- Commit your changes
- Push to the branch
- Open a Pull Request
📄 License
⚠️ NON-COMMERCIAL USE ONLY
This project (both code and data) is licensed for non-commercial use only.
- ✅ Academic research
- ✅ Personal projects
- ✅ Educational purposes
- ❌ Commercial products/services
- ❌ Monetization of any kind
For commercial licensing, please contact:
📧 humairmunirawan@gmail.com
See the LICENSE file for full details.
Made with ❤️ for the Urdu language