A comprehensive Text Normalization library for Central Kurdish (Sorani).
🦁 ckb-textify
ckb-textify is an industrial-strength Text Normalization and Transliteration library designed specifically for Central Kurdish (Sorani).
While most normalizers perform simple "Find & Replace", ckb-textify uses a context-aware pipeline to transform "messy" real-world text (mixed languages, scientific notation, Quranic Tajweed, and technical jargon) into clean, spoken Kurdish text. It is intended as a pre-processor for Text-to-Speech (TTS) and NLP models.
🚀 Live Demo
Try the library instantly in your browser: 👉 Click here to open the Live App
🔮 The Ecosystem
ckb-textify handles Normalization (Text-to-Text). For Phonemization (Text-to-Sounds/IPA), check out the companion project:
📦 Installation
pip install ckb-textify
Key Dependencies:
- eng-to-ipa: For accurate English pronunciation (e.g., "Phone" -> "Fôn").
- anyascii: For universal script transliteration (Chinese, Russian, etc.).
⚡ Quick Start
from ckb_textify import convert_all
text = """
سڵاو! تکایە پەیوەندی بکە بە 07501234567.
نرخی زێڕ ≈ $2500.
کۆدەکە A1-B2 یە.
سڵاو لە Putin و Xi Jinping.
"""
# Default normalization (All features enabled)
normalized = convert_all(text)
print(normalized)
Output:
سڵاو! تکایە پەیوەندی بکە بە سفر حەوت سەد و پەنجا سەد و بیست و سێ چل و پێنج شەست و حەوت.
نرخی زێڕ نزیکەی دوو ھەزار و پێنج سەد دۆلار.
کۆدەکە ئەی یەک داش بی دوو یە.
سڵاو لە پوتین و سی جینپینگ.
🏛️ Architecture
ckb-textify processes text through a strictly ordered pipeline to handle dependencies (e.g., Units must be processed before Technical codes).
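To illustrate why stage order matters, here is a minimal sketch of an ordered pipeline. The stage names `expand_units` and `spell_codes` and their regexes are hypothetical, invented for this example; they are not the library's internal stages.

```python
import re
from typing import Callable, List

Stage = Callable[[str], str]


def expand_units(text: str) -> str:
    # Hypothetical stage: a number glued to "m" is read as a length in metres.
    return re.sub(r"\b(\d+)m\b", r"\1 metres", text)


def spell_codes(text: str) -> str:
    # Hypothetical stage: any remaining token that mixes letters and digits
    # is split into single characters so it can be read one by one.
    def spell(match: "re.Match[str]") -> str:
        return " ".join(match.group(0))

    return re.sub(r"\b(?=\w*\d)(?=\w*[A-Za-z])\w+\b", spell, text)


def run_pipeline(text: str, stages: List[Stage]) -> str:
    # Apply each stage in list order; the output of one feeds the next.
    for stage in stages:
        text = stage(text)
    return text


print(run_pipeline("10m A1", [expand_units, spell_codes]))  # units first
print(run_pipeline("10m A1", [spell_codes, expand_units]))  # codes first
```

With units first, `10m` is claimed as a measurement before the code stage sees it. With the order reversed, the code stage splits `10m` into characters and the unit stage no longer recognises it.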
🌟 Advanced Features
1. 🕌 Deep Linguistic & Tajweed Support
Unlike basic normalizers, this library respects complex phonological rules for Arabic/Islamic text embedded in Kurdish.
- Shamsi (Sun) Letters: Automatically assimilates the 'L' in 'Al-'.
  - Input: بِسْمِ ٱللَّهِ
  - Output: بیسمی للاھی (Handles the "Light Lam" vs "Dark Lam" rule automatically).
- Context-Aware "Allah": Determines pronunciation (L vs LL) based on the preceding vowel.
- Alif Wasla (ٱ): Treated as silent in continuation, but pronounced as 'E' at the start.
- Tajweed Rules: Handles Iqlab (N->M) and Idgham.
- Heavy 'R' (ڕ): Detects heavy R based on Arabic vowel context (e.g., مِرْصَاد -> میڕساد).
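As a rough illustration of the Iqlab rule (N->M) listed above, the sketch below rewrites a noon carrying an explicit sukun as meem when the next letter is ba. It is a simplified assumption: it ignores tanween and Quranic small-meem marks, and `apply_iqlab` is a hypothetical name, not a function exported by ckb-textify.

```python
import re

NOON = "\u0646"   # ن
SUKUN = "\u0652"  # the "no vowel" mark
BA = "\u0628"     # ب
MEEM = "\u0645"   # م

# Noon + sukun, followed by ba (optionally across one whitespace character).
IQLAB_RE = re.compile(NOON + SUKUN + r"(?=\s?" + BA + ")")


def apply_iqlab(text: str) -> str:
    """Replace noon-sakinah with meem when the next letter is ba."""
    return IQLAB_RE.sub(MEEM, text)


# Noon-sakinah followed by ba becomes meem followed by ba.
print(apply_iqlab(NOON + SUKUN + BA) == MEEM + BA)
```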
2. 🌍 Universal Script Support ("The Latin Bridge")
Transliterates almost any world script into Sorani using a smart "Latin Bridge" technique.
| Language | Input | Output (Sorani) |
|---|---|---|
| Chinese | 你好 | نی هاو |
| Russian | Путин | پوتین |
| Greek | Χαίρετε | چایرێت |
| German | Straẞe | ستراسسە |
| French | République | ڕیپەبلیک |
| English | Phone | فۆن (IPA-based, not rule-based) |
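The two-hop idea behind a "Latin Bridge" can be sketched as follows: first romanize the foreign script, then map Latin letters to Sorani. The library lists anyascii as the dependency for the first hop; to stay dependency-free, this sketch substitutes two tiny hand-made tables that cover only the word "Путин". The tables and function names are assumptions made for illustration.

```python
# Toy tables invented for this sketch; real coverage would be far larger.
CYRILLIC_TO_LATIN = {"П": "P", "у": "u", "т": "t", "и": "i", "н": "n"}
LATIN_TO_SORANI = {"p": "پ", "u": "و", "t": "ت", "i": "ی", "n": "ن"}


def to_latin(text: str) -> str:
    # Hop 1: romanize; characters without a mapping pass through unchanged.
    return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in text)


def latin_to_sorani(text: str) -> str:
    # Hop 2: map lower-cased Latin letters to Sorani letters.
    return "".join(LATIN_TO_SORANI.get(ch, ch) for ch in text.lower())


def latin_bridge(text: str) -> str:
    return latin_to_sorani(to_latin(text))


print(to_latin("Путин"))
print(latin_bridge("Путин"))
```

Routing every script through Latin means only one Latin-to-Sorani table has to be maintained, rather than one table per source script.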
3. ➗ Scientific & Mathematical Logic
Handles complex math that breaks most normalizers.
- Scientific Notation: 5e-23 -> پێنج جارانی دە توانی سالب بیست و سێ
- Functions: ln 4 -> لۆگاریتمی سروشتی چوار
- Fraction Logic:
  - 1/2 -> نیوە
  - 3/4 -> سێ چارەک
  - 120km/h -> ... بۆ هەر کاتژمێرێک (Context-aware "Per" rule)
  - 7/6 -> حەوت دابەش شەش (Context-aware "Division" rule)
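The examples above imply that a token is classified before it is verbalised. The sketch below shows one way such a classification could look; the regexes, the category labels, and the function names are assumptions for illustration and do not reflect the library's actual rules.

```python
import re
from typing import Optional, Tuple

SCI_RE = re.compile(r"([+-]?\d+(?:\.\d+)?)[eE]([+-]?\d+)")

# Fractions with a dedicated spoken name in the examples above.
NAMED_FRACTIONS = {"1/2", "3/4"}


def split_scientific(token: str) -> Optional[Tuple[str, int]]:
    """Split '5e-23' into its mantissa text and integer exponent."""
    match = SCI_RE.fullmatch(token)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def classify_slash(token: str) -> str:
    """Decide how a token containing '/' should be read."""
    if token in NAMED_FRACTIONS:
        return "named_fraction"
    if re.fullmatch(r"\d+(?:\.\d+)?[A-Za-z]+/[A-Za-z]+", token):
        return "per"        # quantity with unit, slash, unit (e.g. 120km/h)
    if re.fullmatch(r"\d+/\d+", token):
        return "division"   # plain number over number (e.g. 7/6)
    return "unknown"


print(split_scientific("5e-23"))
print([classify_slash(t) for t in ("1/2", "120km/h", "7/6")])
```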
4. 📞 Smart Phone Numbers
Recognizes Iraqi and International phone formats and groups digits for natural reading (4-3-2-2 format).
- 07501234567 -> سفر حەوت سەد و پەنجا ...
- +964... -> کۆ نۆ سەد و شەست و چوار ...
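The 4-3-2-2 grouping can be sketched in a few lines. `group_digits` is a hypothetical helper written for this example; it only splits the digits and leaves the conversion of each group into words aside.

```python
from typing import List, Sequence


def group_digits(number: str, sizes: Sequence[int] = (4, 3, 2, 2)) -> List[str]:
    """Split the digits of a phone number into consecutive groups."""
    digits = "".join(ch for ch in number if ch.isdigit())
    groups: List[str] = []
    start = 0
    for size in sizes:
        groups.append(digits[start:start + size])
        start += size
    if start < len(digits):
        # Keep any leftover digits as a final group instead of dropping them.
        groups.append(digits[start:])
    return [group for group in groups if group]


print(group_digits("07501234567"))
```

In the Quick Start output, the first group is read as a zero followed by 750, so the leading zero is spoken on its own.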
5. 💻 Web & Technical Entities
- URLs: www.rudaw.net -> دابڵیو دابڵیو دابڵیو دۆت رووداو دۆت نێت
- Emails: info@gmail.com -> ... ئەت جیمەیڵ دۆت کۆم (Recognizes common domains)
- Codes: A1-B2 -> ئەی یەک داش بی دوو (Character-by-character reading)
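A first step for web entities is to split them at their separators and name each separator. The sketch below does only that; the spoken names for ".", "@" and "-" are taken from the examples above, while `spoken_tokens` itself is a hypothetical helper and leaves the remaining parts (such as `www` or `rudaw`) untranslated.

```python
import re
from typing import List

# Spoken names for separators, as they appear in the examples above.
SYMBOL_NAMES = {".": "دۆت", "@": "ئەت", "-": "داش"}


def spoken_tokens(entity: str) -> List[str]:
    """Split a URL, email, or code at separators and name each separator."""
    # The capture group makes re.split keep the separators in the result.
    parts = re.split(r"([.@-])", entity)
    return [SYMBOL_NAMES.get(part, part) for part in parts if part]


print(spoken_tokens("www.rudaw.net"))
print(spoken_tokens("info@gmail.com"))
print(spoken_tokens("A1-B2"))
```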
6. 📏 Context-Aware Units
Solves the ambiguity between units and letters.
- 10m -> دە مەتر
- I am m -> ئای ئەم ئێم (Letter M)
- 12.5kg -> دوازدە کیلۆگرام و نیو (Handles .5 as "Half")
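One simple way to separate units from letters is to accept a unit symbol only when a number comes directly before it. The sketch below assumes exactly that heuristic with a three-unit list; the pattern and `find_units` are illustrative assumptions, not the library's rule set.

```python
import re
from typing import List, Tuple

# A number (optionally decimal), an optional space, then a known unit symbol.
# The lookbehind stops the match from starting in the middle of a word or number.
UNIT_RE = re.compile(r"(?<![A-Za-z\d.])(\d+(?:\.\d+)?)\s?(kg|km|m)\b")


def find_units(text: str) -> List[Tuple[str, str]]:
    """Return (value, unit) pairs; a bare 'm' with no number is not a unit."""
    return [(m.group(1), m.group(2)) for m in UNIT_RE.finditer(text)]


print(find_units("10m"))
print(find_units("I am m"))
print(find_units("12.5kg"))
```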
🎛️ Configuration
You can fully customize the pipeline by passing a config dictionary.
from ckb_textify import convert_all
custom_config = {
"phone_numbers": False, # Keep phone numbers as digits
"foreign": False, # Disable Chinese/Russian transliteration
"shadda_mode": "remove", # "remove" or "double" (default)
"emoji_mode": "convert", # "remove" (default), "convert", "ignore"
"chat_speak": True # Enable '7ez' -> 'حەز' conversion
}
print(convert_all("Text...", config=custom_config))
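A partial dictionary like the one above presumably overrides only the keys it names. The sketch below shows that override pattern with a plain dictionary merge; the default values are copied from the Available Options table in this README, and `resolve_config` is a hypothetical helper, not part of the ckb-textify API.

```python
from typing import Any, Dict, Optional

# Defaults as documented in the Available Options table of this README.
DEFAULTS: Dict[str, Any] = {
    "diacritics_mode": "convert",
    "shadda_mode": "double",
    "emoji_mode": "remove",
    "chat_speak": False,
    "math": True,
    "web": True,
    "technical": True,
}


def resolve_config(user_config: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Return a new dict in which user-supplied keys override the defaults."""
    return {**DEFAULTS, **(user_config or {})}


config = resolve_config({"emoji_mode": "convert", "phone_numbers": False})
print(config["emoji_mode"], config["shadda_mode"], config["phone_numbers"])
```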
Available Options
| Key | Default | Description |
|---|---|---|
| diacritics_mode | "convert" | Convert Arabic Harakat to Kurdish vowels. |
| shadda_mode | "double" | Doubles the letter for Shadda (مّ -> مم). |
| emoji_mode | "remove" | Removes emojis. Set to "convert" to speak them. |
| chat_speak | False | Converts Arabizi numbers (7->ح, 3->ع). |
| math | True | Normalizes math expressions and functions. |
| web | True | Spells out URLs and Emails. |
| technical | True | Spells out codes like UUIDs. |
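The two `shadda_mode` values can be illustrated with a short sketch. `apply_shadda` is a hypothetical function written for this example. It assumes the shadda mark directly follows its base letter; text where a vowel mark sits between the letter and the shadda would need extra handling.

```python
SHADDA = "\u0651"  # Arabic shadda (gemination mark)


def apply_shadda(text: str, mode: str = "double") -> str:
    """Double the preceding letter for each shadda, or drop the mark."""
    if mode not in ("double", "remove"):
        raise ValueError(f"unknown shadda mode: {mode!r}")
    out = []
    for ch in text:
        if ch != SHADDA:
            out.append(ch)
        elif mode == "double" and out:
            # Repeat the character that carried the shadda.
            out.append(out[-1])
        # In "remove" mode the shadda is simply skipped.
    return "".join(out)


meem_with_shadda = "\u0645" + SHADDA
print(apply_shadda(meem_with_shadda, "double") == "\u0645\u0645")
print(apply_shadda(meem_with_shadda, "remove") == "\u0645")
```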
🤝 Contributing
Contributions are welcome! If you have ideas for new rules, have found a bug, or want to add support for more units:
- Fork the repository.
- Clone locally.
- Create a branch (git checkout -b feature/new-rule).
- Run Tests (python -m unittest discover tests).
- Submit a Pull Request.
👨💻 Author
Razwan M. Haji
- GitHub: RazwanSiktany
- PyPI: ckb-textify
📄 License
This project is licensed under the MIT License.