Central Kurdish Grapheme-to-Phoneme (G2P) converter and Syllabifier for TTS.
Central Kurdish G2P (ckb_g2p)
A linguistically accurate Grapheme-to-Phoneme (G2P) converter and Syllabifier for Central Kurdish (Sorani).
It is designed for training modern Text-to-Speech (TTS) models (VITS, FastSpeech2, Glow-TTS) and provides phonetization, stress marking, and syllable boundaries.
About the Project (translated from Kurdish)
This project is an advanced tool for converting Kurdish (Sorani) text into phonemes and syllables. It is designed specifically for speech synthesis systems and for training artificial intelligence models.
🌟 Why Use This?
Generic G2P tools often fail on Kurdish phonology. ckb_g2p solves these specific challenges:
| Feature | Problem in Generic Tools | Solution in ckb_g2p |
|---|---|---|
| Palatalization | Treats all 'k' and 'g' the same. | Distinguishes Heavy (Postalveolar t͡ʃ, d͡ʒ) vs Light (Dental t̪͡ʃ̟, d̪͡ʒ̟) based on vowel context. |
| Schwa Insertion | Fails on clusters like "grft". | Automatically inserts Bizroka (/ɪ/) to fix illegal consonant clusters (gɪ.ɾɪft). |
| Geminate Consonants | Merges double letters. | Preserves true geminates or splits them if phonologically required (e.g., dat̪͡ʃɛnn → da.t̪͡ʃ̟ɛ.ˈnɪn). |
| Stress (Prosody) | Ignores stress. | Assigns primary stress (ˈ) by rule, distinguishing negative verbs, which take stress on the first syllable (nachu → ˈna.t̪͡ʃ̟uː), from nouns (kurd → kurd). |
| Complex Onsets | Incorrectly splits clusters. | Respects valid onsets like kw and cy (wusha → wu.ʃa). |
🔗 Live Demos
- G2P Playground: ckb-g2p.streamlit.app
- Text Normalizer: ckb-textify.streamlit.app
📦 Installation
pip install ckb_g2p
Dependencies: This library automatically installs ckb-textify for normalizing numbers (1991 → hazar...), dates, and symbols.
🚀 Usage
Basic Conversion
from ckb_g2p import Converter
# Default: Normalization=ON, Pauses=ON, Stress=OFF
converter = Converter()
text = "کوردستان"
ipa = converter.syllabify(text)
print(ipa)
# Output: kuɾ.dɪs.tän
TTS-Ready Output (With Stress)
For training TTS models, you want explicit stress markers and pause tokens.
# Enable stress marking
converter = Converter(use_stress=True, use_pause_markers=True)
# Handles negative verbs correctly (Stress on first syllable)
text = "نەچوو بۆ بازاڕ, لە ساڵی 1991."
ipa = converter.syllabify(text)
print(ipa)
# Output: ˈna.t̪͡ʃ̟uː bo̞ bä.ˈzäɾ | la sä.ˈɫiː ha.ˈzäɾ w no̞.ˈsad w na.ˈwa.du ˈjak ||
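For TTS training, the phonemized text usually ends up in a metadata file that pairs each utterance ID with its phoneme string. The sketch below shows one way to build such lines. It assumes a tab-separated `id<TAB>phonemes` layout, chosen because `|` and `||` already occur in the output as pause markers. `build_manifest` and `SupportsSyllabify` are hypothetical names, not part of ckb_g2p; the only library behaviour relied on is the `syllabify(text)` method shown above.

```python
from typing import Iterable, List, Protocol, Tuple


class SupportsSyllabify(Protocol):
    """Anything exposing syllabify(text) -> str, e.g. a ckb_g2p Converter."""

    def syllabify(self, text: str) -> str: ...


def build_manifest(converter: SupportsSyllabify,
                   rows: Iterable[Tuple[str, str]]) -> List[str]:
    """Return 'utterance_id<TAB>phonemes' lines (hypothetical layout)."""
    lines = []
    for utt_id, text in rows:
        phonemes = converter.syllabify(text).strip()
        if not phonemes:
            continue  # skip rows that phonemize to an empty string
        # Tab delimiter: '|' is reserved for pause markers in the output.
        lines.append(f"{utt_id}\t{phonemes}")
    return lines
```

With the real library, the `converter` argument would be something like `Converter(use_stress=True, use_pause_markers=True)`.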
Configuration Options
| Argument | Type | Default | Description |
|---|---|---|---|
| use_stress | bool | False | Adds primary stress marker (ˈ) to the appropriate syllable. |
| use_pause_markers | bool | True | Converts punctuation to IPA boundaries (`\|` short, `\|\|` long). |
| normalize | bool | True | Uses ckb-textify to convert numbers/symbols to text before processing. |
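The stress and pause conventions above can be consumed downstream with plain string operations. The following is a minimal sketch. It assumes that words are separated by spaces, syllables by `.`, that the stress mark is U+02C8 at the start of a syllable, and that `|` and `||` appear as standalone space-separated tokens, as in the sample output. `parse_ipa` is a hypothetical helper, not a ckb_g2p function.

```python
from typing import Dict, List

STRESS = "\u02c8"  # IPA primary stress mark (assumed code point)
PAUSES = {"|": "short_pause", "||": "long_pause"}


def parse_ipa(ipa: str) -> List[Dict[str, object]]:
    """Split a syllabified IPA string into word and pause tokens."""
    tokens: List[Dict[str, object]] = []
    for chunk in ipa.split():
        if chunk in PAUSES:
            tokens.append({"type": PAUSES[chunk]})
            continue
        syllables = chunk.split(".")
        # Index of the first syllable carrying the stress mark, if any.
        stressed = next(
            (i for i, s in enumerate(syllables) if s.startswith(STRESS)),
            None,
        )
        tokens.append({
            "type": "word",
            "syllables": [s.lstrip(STRESS) for s in syllables],
            "stressed": stressed,
        })
    return tokens
```

Keeping pauses as separate tokens lets a training pipeline map them to dedicated silence symbols instead of treating them as part of a word.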
🗣️ Phoneme Inventory
We use a precise IPA set to capture allophonic variations critical for natural speech synthesis.
Consonants (Key Distinctions)
| Grapheme | IPA | Type | Description |
|---|---|---|---|
| چ | t̪͡ʃ̟ | Light (Dental) | Standard "ch". Tongue tip touches teeth. |
| ک | t͡ʃ | Heavy (Postalveolar) | Palatalized /k/ before front vowels (i, e, y). Like English "Chair". |
| ج | d̪͡ʒ̟ | Light (Dental) | Standard "j". Tongue tip touches teeth. |
| گ | d͡ʒ | Heavy (Postalveolar) | Palatalized /g/ before front vowels. Like English "Jack". |
| ڵ | ɫ | Velarized | "Dark L", distinct from clear l. |
| ڕ | r | Trill | Rolled R, distinct from tap ɾ. |
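Several symbols in this inventory span more than one Unicode code point, so splitting IPA output character by character would break them apart and blur the light/heavy distinction. Below is a minimal sketch of greedy longest-match tokenization against a symbol list. The code-point spellings are an assumption about how the symbols are encoded, and `INVENTORY` is deliberately incomplete: it holds the table's symbols plus a few plain letters for illustration. `tokenize_phonemes` is a hypothetical helper, not part of ckb_g2p.

```python
from typing import List, Sequence

# Assumed code-point spellings for the symbols in the table above.
LIGHT_CH = "t\u032a\u0361\u0283\u031f"  # dental "ch"
HEAVY_CH = "t\u0361\u0283"              # postalveolar "ch"
LIGHT_J = "d\u032a\u0361\u0292\u031f"   # dental "j"
HEAVY_J = "d\u0361\u0292"               # postalveolar "j"
DARK_L = "\u026b"                       # velarized l
TAP_R = "\u027e"                        # tap r

# Illustrative, incomplete inventory.
INVENTORY = [LIGHT_CH, HEAVY_CH, LIGHT_J, HEAVY_J, DARK_L, TAP_R,
             "r", "t", "d", "k", "l", "a", "u"]


def tokenize_phonemes(text: str,
                      inventory: Sequence[str] = INVENTORY) -> List[str]:
    """Split text into phonemes, preferring the longest matching symbol."""
    by_length = sorted(inventory, key=len, reverse=True)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        match = next((s for s in by_length if text.startswith(s, i)), None)
        if match is None:
            raise ValueError(f"unknown symbol at index {i}: {text[i]!r}")
        tokens.append(match)
        i += len(match)
    return tokens
```

Trying longer symbols first is what keeps a bare `t` from matching ahead of either affricate. Syllable dots, stress marks, and pause markers would need to be stripped or added to the inventory before calling this on real output.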
🛠️ Customizing Pronunciation
If the rule-based engine fails on a specific word (e.g., a foreign name), you can override it manually, either by editing src/ckb_g2p/resources/exceptions.csv inside the installed package or by applying your own exception mapping before processing.
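The second option, mapping exceptions locally, can be sketched as a thin wrapper that consults a dictionary before falling back to the rule-based engine. This is an illustration only: `OverrideG2P`, its whitespace-based word splitting, and the shape of the `overrides` dictionary are hypothetical, and nothing is assumed here about the column layout of exceptions.csv.

```python
from typing import Callable, Dict, List


class OverrideG2P:
    """Apply word-level pronunciation overrides before a fallback G2P."""

    def __init__(self, fallback: Callable[[str], str],
                 overrides: Dict[str, str]) -> None:
        self.fallback = fallback          # e.g. a Converter's syllabify method
        self.overrides = dict(overrides)  # word -> hand-written IPA

    def syllabify(self, text: str) -> str:
        parts: List[str] = []
        pending: List[str] = []

        def flush() -> None:
            # Send consecutive non-overridden words to the fallback together,
            # so it still sees them as one phrase.
            if pending:
                parts.append(self.fallback(" ".join(pending)))
                pending.clear()

        for word in text.split():
            if word in self.overrides:
                flush()
                parts.append(self.overrides[word])
            else:
                pending.append(word)
        flush()
        return " ".join(parts)
```

With the real library, `fallback` would be something like `Converter().syllabify`. Two limitations of this sketch: a word with punctuation attached will not match its override, and splitting the text around an overridden word hides that word from any context-dependent rules in the fallback.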
🤝 Contributing
Contributions are welcome!
- Fork the repository.
- Create a feature branch.
- Submit a Pull Request.
👨💻 Author
Developed by Razwan M. Haji.
Special thanks to the open-source community and the contributors of ckb-textify.