High-performance Urdu Grapheme-to-Phoneme converter

Urdu G2P - Grapheme-to-Phoneme Converter

Author: Humair Munir Awan (humairmunirawan@gmail.com)

A high-performance, production-ready Grapheme-to-Phoneme (G2P) library for Urdu. It converts Urdu text to IPA (International Phonetic Alphabet) phonemes using a dictionary of 323,000+ single-word entries, falling back to espeak-ng for out-of-vocabulary (OOV) words.


✨ Features

  • Refined Dictionary: 323,000+ single-word entries (634k+ total data points managed)
  • Streaming & Memory Efficiency: Process multi-GB files line-by-line with constant low RAM usage
  • Smart Fallback: Automatic espeak-ng fallback for out-of-vocabulary (OOV) words
  • Robust Input Handling: Automatically filters emojis, symbols, and nonsense characters
  • Quote Normalization: Unifies all quote variants (straight and typographic, single and double) to a single '
  • Punctuation Mapping: Maps Urdu punctuation (۔, ،, ؟) to custom symbols (default: |, ~, ?)
  • Vowel Length Normalization: Collapses runs of repeated vowels (e.g., iii, aa) into a single normalized form
  • Configurable Output: Remove stress markers, language tags, and syllable dots
  • Diverse Output Formats: Support for JSON, Dot-separated, and detailed token analytics
  • High Performance: 168,000+ chars/sec throughput with LRU caching
  • Type-Safe API: Full Python type hints with comprehensive docstrings
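
The normalization features listed above can be illustrated with a short sketch. This is not the library's implementation: the set of quote variants, the vowel inventory, and the choice of the IPA length mark (ː) as the collapsed form are all assumptions made for illustration, and the function names are hypothetical. Only the default punctuation mapping (۔ to |, ، to ~, ؟ to ?) comes from the feature list.

```python
import re

# Assumption: the exact set of quote variants is illustrative
# (ASCII double quote plus the common curly quotes).
QUOTE_VARIANTS = '"\u201c\u201d\u2018\u2019'

# Default punctuation mapping from the feature list:
# Urdu full stop -> |, Urdu comma -> ~, Urdu question mark -> ?
URDU_PUNCTUATION = {"\u06d4": "|", "\u060c": "~", "\u061f": "?"}

# One translation table handles both quotes and punctuation.
_TRANSLATION = {ord(q): "'" for q in QUOTE_VARIANTS}
_TRANSLATION.update({ord(k): v for k, v in URDU_PUNCTUATION.items()})

# Assumption: a run of one repeated vowel becomes that vowel plus
# the IPA length mark (U+02D0). The vowel set is illustrative.
_VOWELS = "aeiou"
_REPEATED_VOWEL = re.compile(r"([" + _VOWELS + r"])\1+")


def normalize_text(text: str) -> str:
    """Unify quote variants and map Urdu punctuation to ASCII symbols."""
    return text.translate(_TRANSLATION)


def collapse_repeated_vowels(ipa: str) -> str:
    """Collapse runs such as 'iii' or 'aa' into a single long vowel."""
    return _REPEATED_VOWEL.sub(lambda m: m.group(1) + "\u02d0", ipa)
```

A single `str.translate` table keeps the character-level pass to one scan of the input, which matters when the same normalization runs over very large files.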

🔄 How It Works

  1. Input: Urdu text (with optional mixed English, numbers, emojis)
  2. Text Cleaning: Filters out symbols, emojis, and non-linguistic characters
  3. Dictionary Lookup: Searches the phoneme dictionary with smart diacritic handling
  4. Fallback: Uses espeak-ng for OOV words with IPA normalization
  5. Output: Clean IPA phonemes ready for TTS or linguistic analysis
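
The five steps above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the package's code: `clean_text` and `make_converter` are hypothetical names, the Unicode-category filter is one plausible way to drop emojis and symbols, and the fallback is any callable you supply (the real library delegates to espeak-ng).

```python
import unicodedata
from functools import lru_cache
from typing import Callable, Dict, List, Optional


def clean_text(text: str) -> str:
    """Keep letters, combining marks, digits and whitespace; drop the rest.

    Emojis and most symbols fall in Unicode categories S* or P*, so they
    are filtered out here. This is an assumed cleaning rule.
    """
    return "".join(
        ch for ch in text
        if ch.isspace() or unicodedata.category(ch)[0] in "LMN"
    )


def make_converter(
    lexicon: Dict[str, str],
    fallback: Optional[Callable[[str], str]] = None,
) -> Callable[[str], List[str]]:
    """Build a text -> phoneme-list function over a word lexicon."""

    @lru_cache(maxsize=4096)
    def convert_word(word: str) -> Optional[str]:
        # Step 3: dictionary lookup.
        if word in lexicon:
            return lexicon[word]
        # Step 4: fallback for out-of-vocabulary words, if configured.
        return fallback(word) if fallback is not None else None

    def convert(text: str) -> List[str]:
        # Step 2 (cleaning), then per-word conversion; unresolved words are skipped.
        phonemes = (convert_word(w) for w in clean_text(text).split())
        return [p for p in phonemes if p is not None]

    return convert
```

Caching at the word level rather than the sentence level is the design choice worth noting: natural text repeats a small set of frequent words, so a modest LRU cache absorbs most lookups.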

📦 Installation

From PyPI (Recommended)

pip install urdu-g2p

From Source

# Clone the repository
git clone https://github.com/humair-m/urdu-g2p.git
cd urdu-g2p

# Install the package
pip install .

Dependencies

  • Python 3.8+
  • espeak-ng (Required for OOV fallback)

# Ubuntu/Debian
sudo apt-get install espeak-ng

# macOS
brew install espeak-ng

# Windows
# Download from: https://github.com/espeak-ng/espeak-ng/releases
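
Because the OOV fallback depends on an external binary, it can help to check for it before processing. The sketch below uses only the standard library; the helper name `espeak_available` is hypothetical and not part of the package, and it assumes the binary is installed under the name `espeak-ng` on your PATH.

```python
import shutil


def espeak_available(binary: str = "espeak-ng") -> bool:
    """Return True if the given executable can be found on PATH."""
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if espeak_available():
        print("espeak-ng found: OOV fallback can be used")
    else:
        print("espeak-ng not found: install it to enable OOV fallback")
```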

🎯 Quick Start

Python API

from urdu_g2p import UrduG2P

# Initialize with default settings
g2p = UrduG2P()

# Basic conversion
text = "پاکستان زندہ باد"
phonemes = g2p(text)
print(' '.join(phonemes))
# Output: paːkɪsˈt̪aːn zɪnˈd̪ə baːd̪

# With stress removal
g2p_clean = UrduG2P(ignore_stress=True)
phonemes = g2p_clean("مجھے پاکستان پسند ہے")
print(' '.join(phonemes))
# Output: mʊd͡ʒeː paːkɪst̪aːn pəsənd̪ ɦɛ

Command Line Interface (CLI)

# Basic usage
python inference.py "اسلام آباد"
# Output: ɪslaːm aːbaːd̪

# JSON output with details
python inference.py "ٹیسٹ" --format json --pretty

# Dot-separated (TTS style)
python inference.py "ہیلو" --format dot
# Output: heː.loː

# Remove stress markers
python inference.py "مجھے" --strip-stress
# Output: mʊd͡ʒeː

🔧 Advanced Usage

Configuration Options

g2p = UrduG2P(
    fallback='auto',           # 'auto', True, or False
    diacritic_mode='auto',     # 'auto', 'ignore', 'strict'
    ignore_tag=True,           # Remove (en)/(ur) language tags
    ignore_stress=False,       # Remove stress markers (ˈ)
    save_oov_path=None         # Path to save OOV words
)

OOV Tracking & Saving

Track words not found in the dictionary to improve your dataset:

g2p = UrduG2P(save_oov_path="oov_words.json")
g2p("یہ ایک ٹیسٹ ورڈ ہے۔")
g2p.save_oov()  # Saves OOV words to JSON
print(g2p.get_oov())  # View OOV words
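
To make the idea concrete, here is a standalone sketch of OOV collection. It does not reproduce the library's internals or its JSON layout: the `OOVTracker` class, its method names, and the sorted-list file format are assumptions for illustration.

```python
import json
from pathlib import Path
from typing import Iterable, List, Set


class OOVTracker:
    """Collect words that are missing from a known vocabulary."""

    def __init__(self, known: Iterable[str]) -> None:
        self.known: Set[str] = set(known)
        self.oov: Set[str] = set()

    def observe(self, text: str) -> None:
        """Record every whitespace-separated word not in the vocabulary."""
        for word in text.split():
            if word not in self.known:
                self.oov.add(word)

    def words(self) -> List[str]:
        """Return the OOV words in a stable, sorted order."""
        return sorted(self.oov)

    def save(self, path: str) -> None:
        """Write the OOV words as a JSON list, keeping Urdu text readable."""
        payload = json.dumps(self.words(), ensure_ascii=False, indent=2)
        Path(path).write_text(payload, encoding="utf-8")
```

Passing `ensure_ascii=False` keeps the saved words in Urdu script instead of `\uXXXX` escapes, which makes the file easier to review by hand.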

Diacritic Modes

Handle text with or without vowel marks (Zer/Zabar/Pesh):

# Mode: 'ignore' (Best for heavily diacritized text)
g2p = UrduG2P(diacritic_mode='ignore')
print(' '.join(g2p("اَلسَّلَامُ")))  # -> æs.səˈlaːm

# Mode: 'strict' (Exact match only)
g2p = UrduG2P(diacritic_mode='strict')
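
One way to picture the modes is as different lookup strategies over the same dictionary. The sketch below is an interpretation, not the library's code: it assumes 'strict' means exact match only, 'ignore' means diacritics are always stripped before lookup, and 'auto' means exact match first with a stripped retry. The function names are hypothetical.

```python
import unicodedata
from typing import Dict, Optional


def strip_diacritics(word: str) -> str:
    """Remove combining marks (Unicode category Mn), e.g. zabar, zer, pesh, shadda."""
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")


def lookup(word: str, lexicon: Dict[str, str], mode: str = "auto") -> Optional[str]:
    """Look up a word under an assumed reading of the three modes."""
    if mode not in ("auto", "ignore", "strict"):
        raise ValueError(f"unknown diacritic mode: {mode!r}")
    if mode == "ignore":
        # Always compare on the bare consonant/vowel-letter skeleton.
        return lexicon.get(strip_diacritics(word))
    if word in lexicon:
        return lexicon[word]
    if mode == "strict":
        return None
    # 'auto': retry without diacritics after the exact match failed.
    return lexicon.get(strip_diacritics(word))
```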

Detailed Inference (JSON)

Get rich information about each token:

from inference import UrduG2PInference

inference = UrduG2PInference()
result = inference.predict("گوگل", format='json')
print(result['tokens'][0])
# {
#   'word': 'گوگل',
#   'phoneme': 'ɡuːɡəl',
#   'source': 'dict',
#   'exact_match': True
# }

Custom Phonemes

Override dictionary or fallback results:

g2p = UrduG2P()
g2p.add_custom_phoneme("آرٹیفیشل", "ɑːrʈiːfɪʃəl")
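
Conceptually, a custom entry shadows whatever the dictionary would have returned. A minimal sketch of that layering, assuming nothing about how the library stores overrides, uses `collections.ChainMap` so that the base dictionary is never modified; `with_overrides` is a hypothetical helper.

```python
from collections import ChainMap
from typing import Dict


def with_overrides(base: Dict[str, str]) -> ChainMap:
    """Return a view where new entries shadow the base lexicon.

    Writes and deletes go to the first (initially empty) mapping,
    so the large base dictionary stays untouched.
    """
    return ChainMap({}, base)
```

Removing an override later simply reveals the original entry again, which is convenient when experimenting with pronunciations.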

📁 Project Structure

urdu-g2p/
├── urdu_g2p/                   # Main package
│   ├── data/                   # Phoneme dictionary (30MB+)
│   │   └── phoneme_map.json    # 478K+ word mappings
│   └── g2p.py                  # Core G2P logic
├── tests/                      # Test suite
│   ├── test_basic.py
│   ├── test_comprehensive.py
│   ├── test_robustness.py      # Emoji/symbol filtering tests
│   └── benchmark.py            # Performance tests
├── examples/
│   └── demo.py                 # Usage examples
├── assets/                     # Images for documentation
├── inference.py                # CLI tool
├── pyproject.toml              # Build configuration
└── README.md                   # This file

📊 Performance

Metric                | Value
----------------------|------------------------------------
Clean Dictionary      | 323,000+ single words
Unique IPA Characters | 92 (Optimized)
Throughput            | 168,000+ chars/sec
Memory Usage          | Streaming (Files) / ~150MB (Dict)
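
The streaming figure refers to processing files line by line, so memory stays flat regardless of file size. A minimal sketch of that pattern follows. The helper names and file paths are hypothetical, and `g2p` is assumed to be any callable that maps a string to a list of phoneme strings, as the `UrduG2P` instance does in the Quick Start.

```python
from typing import Callable, Iterable, Iterator, List


def stream_phonemes(
    lines: Iterable[str],
    g2p: Callable[[str], List[str]],
) -> Iterator[str]:
    """Lazily convert lines one at a time, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield " ".join(g2p(line))


def convert_file(src_path: str, dst_path: str, g2p: Callable[[str], List[str]]) -> None:
    """Convert a text file line by line without loading it into memory."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for phonemes in stream_phonemes(src, g2p):
            dst.write(phonemes + "\n")
```

Iterating over the file object directly is what keeps memory constant: only the current line and its output are held at any time, on top of the loaded dictionary.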

📚 Citation

If you use this library in your research, please cite:

@software{urdu_g2p_2026,
  author       = {Awan, Humair Munir},
  title        = {Urdu G2P: A High-Performance Grapheme-to-Phoneme Converter for Urdu},
  year         = {2026},
  publisher    = {GitHub},
  url          = {https://github.com/humair-m/urdu-g2p},
  version      = {2.0.0},
  note         = {478,000+ word dictionary with espeak-ng fallback. Non-commercial use only.}
}

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Run tests (pytest tests/)
  4. Commit your changes
  5. Push to the branch
  6. Open a Pull Request

📄 License

⚠️ NON-COMMERCIAL USE ONLY

This project (both code and data) is licensed for non-commercial use only.

  • ✅ Academic research
  • ✅ Personal projects
  • ✅ Educational purposes
  • ❌ Commercial products/services
  • ❌ Monetization of any kind

For commercial licensing, please contact:
📧 humairmunirawan@gmail.com

See the LICENSE file for full details.


Made with ❤️ for the Urdu language
