High-performance Urdu Grapheme-to-Phoneme converter

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

humairmunir

These details have not been verified by PyPI

Project description

Urdu G2P - Grapheme-to-Phoneme Converter

Author: Humair Munir Awan (humairmunirawan@gmail.com)

A high-performance, production-ready Grapheme-to-Phoneme (G2P) library for Urdu. It converts Urdu text to IPA (International Phonetic Alphabet) phonemes using a dictionary of 323,000+ single-word entries, falling back to espeak-ng for out-of-vocabulary (OOV) words.

✨ Features

Refined Dictionary: 323,000+ single-word entries (634k+ total data points managed)
Streaming & Memory Efficiency: Process multi-GB files line-by-line with constant low RAM usage
Smart Fallback: Automatic espeak-ng fallback for out-of-vocabulary (OOV) words
Robust Input Handling: Automatically filters emojis, symbols, and nonsense characters
Quote Normalization: Unifies all quote variants (", “, ”, ‘, ’) to a single '
Punctuation Mapping: Maps Urdu punctuation (۔, ،, ؟) to custom symbols (default: |, ~, ?)
Vowel Length Normalization: Collapses repeated vowels (e.g., iii -> iː, aa -> aː)
Configurable Output: Remove stress markers, language tags, and syllable dots
Diverse Output Formats: Support for JSON, Dot-separated, and detailed token analytics
High Performance: 168,000+ chars/sec throughput with LRU caching
Type-Safe API: Full Python type hints with comprehensive docstrings
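
The quote, punctuation, and vowel-length normalization features listed above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the names `normalize_text` and `collapse_vowels` are hypothetical, the mappings are copied from the feature list, and the set of vowels to collapse is an assumption.

```python
import re

# Mappings copied from the feature list; the library's own tables may differ.
QUOTE_VARIANTS = '"\u201c\u201d\u2018\u2019'  # " “ ” ‘ ’
PUNCT_MAP = {"\u06d4": "|", "\u060c": "~", "\u061f": "?"}  # ۔ ، ؟

_TRANSLATION = str.maketrans({**{q: "'" for q in QUOTE_VARIANTS}, **PUNCT_MAP})


def normalize_text(text: str) -> str:
    """Unify quote variants to ' and map Urdu punctuation to ASCII symbols."""
    return text.translate(_TRANSLATION)


def collapse_vowels(ipa: str, vowels: str = "aiu") -> str:
    """Collapse a run of one repeated vowel into a single long vowel (aa -> aː).

    The default vowel set is an assumption made for this sketch.
    """
    return re.sub(rf"([{vowels}])\1+", lambda m: m.group(1) + "ː", ipa)
```

Using `str.translate` keeps the character-level mappings in one table, so a custom punctuation symbol is a one-entry change.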

🔄 How It Works

Input: Urdu text (with optional mixed English, numbers, emojis)
Text Cleaning: Filters out symbols, emojis, and non-linguistic characters
Dictionary Lookup: Searches the dictionary (323,000+ single-word entries) with smart diacritic handling
Fallback: Uses espeak-ng for OOV words with IPA normalization
Output: Clean IPA phonemes ready for TTS or linguistic analysis
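
The workflow above can be sketched as a small clean, look up, fall back pipeline. This is an illustrative sketch under stated assumptions, not the library's code: `TOY_DICT`, `clean`, and `convert` are hypothetical names, the cleaning rule (keep only characters in the U+0600 to U+06FF block) is a rough stand-in for the real filter, and the fallback is any callable rather than espeak-ng itself.

```python
from typing import Callable, Dict, List, Optional

# Toy dictionary for illustration; entries are taken from the Quick Start output.
TOY_DICT: Dict[str, str] = {"پاکستان": "paːkɪsˈt̪aːn", "باد": "baːd̪"}


def is_arabic_block(ch: str) -> bool:
    """Rough check: character lies in the basic Arabic block (U+0600-U+06FF)."""
    return "\u0600" <= ch <= "\u06ff"


def clean(text: str) -> List[str]:
    """Replace everything outside the Arabic block with spaces, then tokenize."""
    kept = "".join(ch if is_arabic_block(ch) or ch.isspace() else " " for ch in text)
    return kept.split()


def convert(
    text: str,
    lexicon: Dict[str, str],
    fallback: Optional[Callable[[str], str]] = None,
) -> List[str]:
    """Look each cleaned word up in the lexicon; use the fallback for OOV words."""
    phonemes: List[str] = []
    for word in clean(text):
        if word in lexicon:
            phonemes.append(lexicon[word])
        elif fallback is not None:
            phonemes.append(fallback(word))
        # With no fallback, OOV words are skipped in this sketch.
    return phonemes
```

Passing the fallback as a callable keeps the dictionary path testable without espeak-ng installed.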

📦 Installation

From PyPI (Recommended)

pip install urdu-g2p

From Source

# Clone the repository
git clone https://github.com/humair-m/urdu-g2p.git
cd urdu-g2p

# Install the package
pip install .

Dependencies

Python 3.8+
espeak-ng (Required for OOV fallback)

# Ubuntu/Debian
sudo apt-get install espeak-ng

# macOS
brew install espeak-ng

# Windows
# Download from: https://github.com/espeak-ng/espeak-ng/releases
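
Because the OOV fallback depends on an external program, it can help to check that espeak-ng is on the PATH before running a large job. The helper below is a hypothetical convenience, not part of the package; it only uses `shutil.which` from the standard library and assumes the executable is named `espeak-ng`.

```python
import shutil
from typing import Optional


def find_espeak_ng() -> Optional[str]:
    """Return the full path of the espeak-ng executable, or None if not on PATH."""
    return shutil.which("espeak-ng")


if __name__ == "__main__":
    path = find_espeak_ng()
    if path:
        print(f"espeak-ng found at {path}")
    else:
        print("espeak-ng not found; OOV fallback may be unavailable")
```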

🎯 Quick Start

Python API

from urdu_g2p import UrduG2P

# Initialize with default settings
g2p = UrduG2P()

# Basic conversion
text = "پاکستان زندہ باد"
phonemes = g2p(text)
print(' '.join(phonemes))
# Output: paːkɪsˈt̪aːn zɪnˈd̪ə baːd̪

# With stress removal
g2p_clean = UrduG2P(ignore_stress=True)
phonemes = g2p_clean("مجھے پاکستان پسند ہے")
print(' '.join(phonemes))
# Output: mʊd͡ʒeː paːkɪst̪aːn pəsənd̪ ɦɛ

Command Line Interface (CLI)

# Basic usage (run from the repository root, which contains inference.py)
python inference.py "اسلام آباد"
# Output: ɪslaːm aːbaːd̪

# JSON output with details
python inference.py "ٹیسٹ" --format json --pretty

# Dot-separated (TTS style)
python inference.py "ہیلو" --format dot
# Output: heː.loː

# Remove stress markers
python inference.py "مجھے" --strip-stress
# Output: mʊd͡ʒeː

🔧 Advanced Usage

Configuration Options

g2p = UrduG2P(
    fallback='auto',           # 'auto', True, or False
    diacritic_mode='auto',     # 'auto', 'ignore', 'strict'
    ignore_tag=True,           # Remove (en)/(ur) language tags
    ignore_stress=False,       # Remove stress markers (ˈ)
    save_oov_path=None         # Path to save OOV words
)
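
To make the effect of `ignore_tag` and `ignore_stress` concrete, here is a minimal sketch of the kind of post-processing those options describe. The function `postprocess` is hypothetical and is not the library's code; treating the secondary stress mark (ˌ) as a stress marker is also an assumption.

```python
import re

STRESS_MARKS = "ˈˌ"  # primary stress; secondary stress is an assumed addition
TAG_PATTERN = re.compile(r"\((?:en|ur)\)")  # language tags such as (en) or (ur)

_STRESS_TABLE = {ord(mark): None for mark in STRESS_MARKS}


def postprocess(ipa: str, ignore_tag: bool = True, ignore_stress: bool = False) -> str:
    """Optionally strip language tags and stress marks from an IPA string."""
    if ignore_tag:
        ipa = TAG_PATTERN.sub("", ipa)
    if ignore_stress:
        ipa = ipa.translate(_STRESS_TABLE)
    return ipa
```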

OOV Tracking & Saving

Track words not found in the dictionary to improve your dataset:

g2p = UrduG2P(save_oov_path="oov_words.json")
g2p("یہ ایک ٹیسٹ ورڈ ہے۔")
g2p.save_oov()  # Saves OOV words to JSON
print(g2p.get_oov())  # View OOV words
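
For readers who want to see what OOV tracking amounts to, the following sketch counts missed words and serializes them as JSON. `OOVTracker` is a hypothetical class and the word-to-count layout is an assumption; the file written by `save_oov()` may be structured differently.

```python
import json
from collections import Counter
from pathlib import Path
from typing import List, Tuple


class OOVTracker:
    """Count words that were not found in the dictionary."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def record(self, word: str) -> None:
        self.counts[word] += 1

    def most_common(self, n: int) -> List[Tuple[str, int]]:
        return self.counts.most_common(n)

    def to_json(self) -> str:
        # ensure_ascii=False keeps Urdu text readable in the output.
        return json.dumps(dict(self.counts.most_common()), ensure_ascii=False, indent=2)

    def save(self, path: str) -> None:
        Path(path).write_text(self.to_json(), encoding="utf-8")
```

Sorting by frequency puts the words most worth adding to the dictionary at the top of the file.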

Diacritic Modes

Handle text with or without vowel marks (Zer/Zabar/Pesh):

# Mode: 'ignore' (Best for heavily diacritized text)
g2p = UrduG2P(diacritic_mode='ignore')
print(' '.join(g2p("اَلسَّلَامُ")))  # -> æs.səˈlaːm

# Mode: 'strict' (Exact match only)
g2p = UrduG2P(diacritic_mode='strict')
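
One way to picture the difference between the modes is a lookup that optionally strips vowel marks before consulting the dictionary. This is a sketch under assumptions, not the library's logic: `strip_diacritics` and `lookup` are hypothetical, and the behaviour shown for `'auto'` (exact match first, then the stripped form) is a guess.

```python
import unicodedata
from typing import Dict, Optional


def strip_diacritics(text: str) -> str:
    """Drop nonspacing marks (category Mn), which include zabar, zer, and pesh."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def lookup(word: str, lexicon: Dict[str, str], mode: str = "auto") -> Optional[str]:
    """Return the phoneme string for word, or None if it is not found."""
    if mode == "strict":
        return lexicon.get(word)  # exact match only
    if mode == "ignore":
        return lexicon.get(strip_diacritics(word))
    # 'auto' (assumed behaviour): exact match first, then the stripped form.
    exact = lexicon.get(word)
    return exact if exact is not None else lexicon.get(strip_diacritics(word))
```

Note that stripping every nonspacing mark is a blunt rule; text stored in decomposed form may need Unicode normalization first.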

Detailed Inference (JSON)

Get rich information about each token:

from inference import UrduG2PInference

inference = UrduG2PInference()
result = inference.predict("گوگل", format='json')
print(result['tokens'][0])
# {
#   'word': 'گوگل',
#   'phoneme': 'ɡuːɡəl',
#   'source': 'dict',
#   'exact_match': True
# }

Custom Phonemes

Override dictionary or fallback results:

g2p = UrduG2P()
g2p.add_custom_phoneme("آرٹیفیشل", "ɑːrʈiːfɪʃəl")
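
Conceptually, an override layer sits in front of the base dictionary. The sketch below shows that idea with `collections.ChainMap`; it is an illustration only, `add_custom` is a hypothetical helper, and nothing here is claimed about how `add_custom_phoneme` is implemented.

```python
from collections import ChainMap
from typing import Dict

# Base entry copied from the JSON example above; a real dictionary is far larger.
base_lexicon: Dict[str, str] = {"گوگل": "ɡuːɡəl"}
custom_lexicon: Dict[str, str] = {}

# ChainMap searches custom_lexicon first, then base_lexicon.
lexicon = ChainMap(custom_lexicon, base_lexicon)


def add_custom(word: str, phoneme: str) -> None:
    """Register an override without modifying the base dictionary."""
    custom_lexicon[word] = phoneme
```

Keeping overrides in a separate mapping means they can be cleared or saved without touching the shipped data.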

📁 Project Structure

urdu-g2p/
├── urdu_g2p/                   # Main package
│   ├── data/                   # Phoneme dictionary (30MB+)
│   │   └── phoneme_map.json    # 478K+ word mappings
│   └── g2p.py                  # Core G2P logic
├── tests/                      # Test suite
│   ├── test_basic.py
│   ├── test_comprehensive.py
│   ├── test_robustness.py      # Emoji/symbol filtering tests
│   └── benchmark.py            # Performance tests
├── examples/
│   └── demo.py                 # Usage examples
├── assets/                     # Images for documentation
├── inference.py                # CLI tool
├── pyproject.toml              # Build configuration
└── README.md                   # This file

📊 Performance

Metric	Value
Clean Dictionary	323,000+ single words
Unique IPA Characters	92 (Optimized)
Throughput	168,000+ chars/sec
Memory Usage	Streaming (Files) / ~150MB (Dict)
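
The streaming and caching figures above rest on two standard techniques: iterating over a file line by line, and memoizing per-word results. The sketch below combines them for any word-level converter; `make_cached` and `stream_convert` are hypothetical names, the cache size is arbitrary, and this is not the library's internal code.

```python
from functools import lru_cache
from typing import Callable, Iterable, Iterator, List


def make_cached(convert_word: Callable[[str], str], maxsize: int = 100_000) -> Callable[[str], str]:
    """Wrap a word-level converter in an LRU cache so repeated words are free."""
    return lru_cache(maxsize=maxsize)(convert_word)


def stream_convert(lines: Iterable[str], convert_word: Callable[[str], str]) -> Iterator[List[str]]:
    """Yield one list of phonemes per input line, holding one line in memory at a time."""
    for line in lines:
        yield [convert_word(word) for word in line.split()]
```

A file object is itself an iterable of lines, so `stream_convert(open("corpus.txt", encoding="utf-8"), cached)` processes a large file without loading it whole (the file name is a placeholder).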

📚 Citation

If you use this library in your research, please cite:

@software{urdu_g2p_2026,
  author       = {Awan, Humair Munir},
  title        = {Urdu G2P: A High-Performance Grapheme-to-Phoneme Converter for Urdu},
  year         = {2026},
  publisher    = {GitHub},
  url          = {https://github.com/humair-m/urdu-g2p},
  version      = {2.0.0},
  note         = {478,000+ word dictionary with espeak-ng fallback. Non-commercial use only.}
}

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Run tests (pytest tests/)
Commit your changes
Push to the branch
Open a Pull Request

📄 License

⚠️ NON-COMMERCIAL USE ONLY

This project (both code and data) is licensed for non-commercial use only.

✅ Academic research
✅ Personal projects
✅ Educational purposes
❌ Commercial products/services
❌ Monetization of any kind

For commercial licensing, please contact:
📧 humairmunirawan@gmail.com

See the LICENSE file for full details.

Made with ❤️ for the Urdu language

Release history

2.0.1 (Jan 25, 2026)
2.0.0 (Jan 19, 2026, this version)

Download files

Source Distribution

urdu_g2p-2.0.0.tar.gz (7.7 MB)

Uploaded Jan 19, 2026 Source

Built Distribution

urdu_g2p-2.0.0-py3-none-any.whl (7.8 MB)

Uploaded Jan 19, 2026 Python 3

File details

Details for the file urdu_g2p-2.0.0.tar.gz.

File metadata

Download URL: urdu_g2p-2.0.0.tar.gz
Upload date: Jan 19, 2026
Size: 7.7 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for urdu_g2p-2.0.0.tar.gz
Algorithm	Hash digest
SHA256	`328767772b212978fede6c4a1344cdfbced254a1f3e26603c196c1dbbc1ce9ef`
MD5	`f2adbd3e6f3d12547d0cc4ab3e673a5b`
BLAKE2b-256	`40166dd7627ac275ec51be323cf43534e5a716ef1996353335c9b4661deac2de`
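
A downloaded archive can be checked against the SHA256 digest in the table above with the standard `hashlib` module. This is a generic sketch; `sha256_of_stream`, `sha256_of_file`, and `verify` are hypothetical helper names, and the local file path is whatever you saved the download as.

```python
import hashlib
from typing import BinaryIO

# SHA256 digest of urdu_g2p-2.0.0.tar.gz, copied from the table above.
EXPECTED_SHA256 = "328767772b212978fede6c4a1344cdfbced254a1f3e26603c196c1dbbc1ce9ef"


def sha256_of_stream(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Hash a binary stream in chunks so large files are never fully in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def sha256_of_file(path: str) -> str:
    with open(path, "rb") as fh:
        return sha256_of_stream(fh)


def verify(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Return True if the file's SHA256 digest matches the expected value."""
    return sha256_of_file(path) == expected.lower()
```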

Provenance

The following attestation bundles were made for urdu_g2p-2.0.0.tar.gz:

Publisher: python-package.yml on humair-m/urdu-g2p

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: urdu_g2p-2.0.0.tar.gz
- Subject digest: 328767772b212978fede6c4a1344cdfbced254a1f3e26603c196c1dbbc1ce9ef
- Sigstore transparency entry: 835688277
- Sigstore integration time: Jan 19, 2026
Source repository:
- Permalink: humair-m/urdu-g2p@0f82fb3c4d6cdecbcc7464b3f80de160f9ba3701
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/humair-m
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-package.yml@0f82fb3c4d6cdecbcc7464b3f80de160f9ba3701
- Trigger Event: release

File details

Details for the file urdu_g2p-2.0.0-py3-none-any.whl.

File metadata

Download URL: urdu_g2p-2.0.0-py3-none-any.whl
Upload date: Jan 19, 2026
Size: 7.8 MB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for urdu_g2p-2.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1df34f48ae873771472a49c23d95eba5397d6421b987d70cb7fc7d3e236a780d`
MD5	`946a060fb4a5fa9d2e90bf6da2f9f1b3`
BLAKE2b-256	`3ed880fc286dedfdf2cd454d3b0088fecc4366e5561362860ed3694e7eb78d67`

Provenance

The following attestation bundles were made for urdu_g2p-2.0.0-py3-none-any.whl:

Publisher: python-package.yml on humair-m/urdu-g2p

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: urdu_g2p-2.0.0-py3-none-any.whl
- Subject digest: 1df34f48ae873771472a49c23d95eba5397d6421b987d70cb7fc7d3e236a780d
- Sigstore transparency entry: 835688283
- Sigstore integration time: Jan 19, 2026
Source repository:
- Permalink: humair-m/urdu-g2p@0f82fb3c4d6cdecbcc7464b3f80de160f9ba3701
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/humair-m
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-package.yml@0f82fb3c4d6cdecbcc7464b3f80de160f9ba3701
- Trigger Event: release

urdu-g2p 2.0.0
