Industrial-strength Text Normalization and Transliteration for Central Kurdish (Sorani)
🦁 ckb-textify
ckb-textify is an industrial-strength Text Normalization and Transliteration library designed specifically for Central Kurdish (Sorani).
While most normalizers perform simple "Find & Replace", ckb-textify uses a context-aware pipeline to transform "messy" real-world text (including mixed languages, scientific notation, Quranic Tajweed, and technical jargon) into clean, spoken Kurdish text. It is intended as a pre-processor for Text-to-Speech (TTS) and NLP models.
🚀 Live Demo
Try the library instantly in your browser: 👉 Click here to open the Live App
🔮 The Ecosystem
ckb-textify handles Normalization (Text-to-Text). For Phonemization (Text-to-Sounds/IPA), check out the companion project:
📦 Installation
```bash
pip install ckb-textify
```
Key Dependencies:
- eng-to-ipa: For accurate English pronunciation (e.g., "Phone" -> "فۆن").
- anyascii: For universal script transliteration (Chinese, Russian, etc.).
⚡ Quick Start
```python
from ckb_textify.core.pipeline import Pipeline
from ckb_textify.core.types import NormalizationConfig

text = """
سڵاو! تکایە پەیوەندی بکە بە 07501234567.
نرخی زێڕ ≈ $2500.
کۆدەکە A1-B2 یە.
سڵاو لە Putin و Xi Jinping.
"""

# 1. Initialize Default Pipeline
pipe = Pipeline()

# 2. Normalize
normalized = pipe.normalize(text)
print(normalized)
```
Output:
سڵاو! تکایە پەیوەندی بکە بە سفر حەوت سەد و پەنجا سەد و بیست و سێ چل و پێنج شەست و حەوت.
نرخی زێڕ نزیکەی دوو ھەزار و پێنج سەد دۆلار.
کۆدەکە ئەی یەک داش بی دوو یە.
سڵاو لە پوتین و سی جینپینگ.
🏛️ Architecture
ckb-textify processes text through a strictly ordered pipeline to handle dependencies (e.g., Units must be processed before Technical codes).
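The ordering constraint can be illustrated with a minimal sketch. The stage functions `expand_units` and `spell_codes` and the `run_pipeline` helper below are hypothetical stand-ins, not the library's internals. They only show why running the unit stage before the code stage gives a different result than the reverse order.

```python
import re
from typing import Callable, List

Stage = Callable[[str], str]

def expand_units(text: str) -> str:
    # Toy rule: a number directly followed by "m" is a length in metres.
    return re.sub(r"\b(\d+)m\b", r"\1 metres", text)

def spell_codes(text: str) -> str:
    # Toy rule: split any word mixing letters and digits into single characters.
    return re.sub(
        r"\b(?=\w*\d)(?=\w*[A-Za-z])\w+\b",
        lambda m: " ".join(m.group(0)),
        text,
    )

def run_pipeline(text: str, stages: List[Stage]) -> str:
    # Apply each stage in the given order; later stages see earlier output.
    for stage in stages:
        text = stage(text)
    return text

print(run_pipeline("10m A1", [expand_units, spell_codes]))  # 10 metres A 1
print(run_pipeline("10m A1", [spell_codes, expand_units]))  # 1 0 m A 1
```

With the code stage first, "10m" is consumed as a code and the unit rule never fires, which is the kind of dependency a fixed stage order avoids.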
🌟 Advanced Features
1. 🕌 Deep Linguistic & Tajweed Support
Unlike basic normalizers, this library respects complex phonological rules for Arabic/Islamic text embedded in Kurdish.
- Shamsi (Sun) Letters: Automatically assimilates the 'L' in 'Al-'.
  - Input: بِسْمِ ٱللَّهِ
  - Output: بیسمی للاھی (Handles the "Light Lam" vs "Dark Lam" rule automatically).
- Context-Aware "Allah": Determines pronunciation (L vs LL) based on the preceding vowel.
- Alif Wasla (ٱ): Treated as silent in continuation, but pronounced as 'E' at the start.
- Tajweed Rules: Handles Iqlab (N->M) and Idgham.
- Heavy 'R' (ڕ): Detects heavy R based on Arabic vowel context (e.g., مِرْصَاد -> میڕساد).
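As a rough illustration of the sun-letter rule only, the sketch below works on undiacritized Arabic script and keeps the result in Arabic script. The helper name `assimilate_article` is hypothetical, and the library's actual handling (diacritics, Sorani output, the "Allah" special case) is not shown here.

```python
# The 14 Arabic "sun" letters, which assimilate the lam of the article "ال".
SUN_LETTERS = set("تثدذرزسشصضطظلن")

def assimilate_article(word: str) -> str:
    """Hypothetical helper: drop the lam of "ال" and double a following sun letter."""
    if len(word) > 2 and word.startswith("ال") and word[2] in SUN_LETTERS:
        # Keep the alif, skip the lam, write the sun letter twice.
        return word[0] + word[2] * 2 + word[3:]
    return word

print(assimilate_article("الشمس"))  # sun letter: lam is assimilated
print(assimilate_article("القمر"))  # moon letter: left unchanged
```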
2. 🌍 Universal Script Support ("The Latin Bridge")
Transliterates almost any world script into Sorani using a smart "Latin Bridge" technique.
| Language | Input | Output (Sorani) |
|---|---|---|
| Chinese | 你好 | نی ھەو |
| Russian | Путин | پوتین |
| Greek | Χαίρετε | چایڕێتێ |
| German | Straẞe | ستراسسە |
| French | République | ڕێپوبلیکوێ |
| English | Phone | فۆن (IPA-based, not rule-based) |
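The two-step idea behind the "Latin Bridge" can be sketched as follows. The README lists anyascii as the dependency for script transliteration. To stay self-contained, this sketch replaces that step with a tiny hand-written table. Both tables and the `latin_bridge` function are illustrative assumptions covering a single word, not the library's mapping.

```python
# Stage 1: source script -> Latin (stand-in for a general transliterator).
CYRILLIC_TO_LATIN = {"П": "P", "у": "u", "т": "t", "и": "i", "н": "n"}

# Stage 2: Latin -> Sorani letters (illustrative subset only).
LATIN_TO_SORANI = {"p": "پ", "u": "و", "t": "ت", "i": "ی", "n": "ن"}

def latin_bridge(text: str) -> str:
    """Hypothetical helper: go through Latin as an intermediate form."""
    latin = "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in text)
    return "".join(LATIN_TO_SORANI.get(ch, ch) for ch in latin.lower())

print(latin_bridge("Путин"))
```

The benefit of the bridge is that only one script-specific mapping (Latin to Sorani) has to be maintained, however many source scripts are supported.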
3. ➗ Scientific & Mathematical Logic
Handles mathematical expressions that simple find-and-replace normalizers tend to misread.
- Scientific Notation: 5e-23 $\rightarrow$ پێنج جارانی دە توانی سالب بیست و سێ
- Functions: ln 4 $\rightarrow$ لۆگاریتمی سروشتی چوار
- Fraction Logic:
  - 1/2 $\rightarrow$ نیوە
  - 3/4 $\rightarrow$ سێ دابەش چوار
  - 120km/h $\rightarrow$ ... بۆ هەر کاتژمێرێک (Context-aware "Per" rule)
  - 7/6 $\rightarrow$ حەوت دابەش شەش (Context-aware "Division" rule)
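One way to picture the slash disambiguation in the fraction examples is a small classifier. This is a sketch under stated assumptions: `classify_slash`, its labels, and its regular expressions are hypothetical and only cover the four examples listed above.

```python
import re

UNIT = r"[A-Za-z]+"

def classify_slash(expr: str) -> str:
    """Hypothetical classifier for the '/' in a short expression."""
    if expr == "1/2":
        return "half"      # read as a single word in the examples
    if re.fullmatch(rf"\d+(?:\.\d+)?\s*{UNIT}/{UNIT}", expr):
        return "per"       # unit/unit, such as km/h
    if re.fullmatch(r"\d+/\d+", expr):
        return "division"  # plain number/number
    return "unknown"

for sample in ["1/2", "3/4", "120km/h", "7/6"]:
    print(sample, "->", classify_slash(sample))
```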
4. 📞 Smart Phone Numbers
Recognizes Iraqi and International phone formats and groups digits for natural reading (4-3-2-2 format).
- 07501234567 $\rightarrow$ سفر حەوت سەد و پەنجا ...
- +964... $\rightarrow$ کۆ نۆ سەد و شەست و چوار ...
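The 4-3-2-2 grouping can be sketched in a few lines. The helper name `group_local_number` and the 11-digit check are assumptions for illustration, and the subsequent reading of each group as a Kurdish number is not shown.

```python
from typing import List

def group_local_number(number: str) -> List[str]:
    """Hypothetical helper: split an 11-digit local number into 4-3-2-2 groups."""
    digits = "".join(ch for ch in number if ch.isdigit())
    if len(digits) != 11:
        raise ValueError("expected an 11-digit local number")
    groups, start = [], 0
    for size in (4, 3, 2, 2):
        groups.append(digits[start:start + size])
        start += size
    return groups

print(group_local_number("07501234567"))  # ['0750', '123', '45', '67']
```

Each group is then short enough to be read as an ordinary number, which matches the Quick Start output (zero, seven hundred and fifty, one hundred and twenty-three, forty-five, sixty-seven).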
5. 💻 Web & Technical Entities
- URLs: www.google.com $\rightarrow$ دەبڵیو دەبڵیو دەبڵیو دۆت گووگڵ دۆت کۆم
- Emails: info@gmail.com $\rightarrow$ ... ئەت جیمەیڵ دۆت کۆم (Recognizes common domains)
- Codes: A1-B2 $\rightarrow$ ئەی یەک داش بی دوو (Character-by-character reading)
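Character-by-character reading of a code can be sketched with a lookup table. The spoken forms below are copied from the A1-B2 example. The table is deliberately tiny, and `spell_code` is a hypothetical helper rather than the library's function.

```python
# Spoken forms taken from the A1-B2 example; real coverage would be far larger.
SPOKEN = {"A": "ئەی", "B": "بی", "1": "یەک", "2": "دوو", "-": "داش"}

def spell_code(code: str) -> str:
    """Hypothetical helper: read a code one character at a time."""
    # Unknown characters are passed through unchanged.
    return " ".join(SPOKEN.get(ch, ch) for ch in code)

print(spell_code("A1-B2"))
```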
6. 📏 Context-Aware Units
Solves the ambiguity between units and letters.
- 10m $\rightarrow$ دە مەتر
- I am m $\rightarrow$ ئای ئەم ئێم (Letter M)
- 12.5kg $\rightarrow$ دوازدە کیلۆگرام و نیو (Handles .5 as "Half")
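The unit-versus-letter distinction can be approximated by requiring a number directly before the unit symbol. The pattern and the `find_units` helper below are assumptions for illustration and cover only three unit symbols.

```python
import re
from typing import List, Tuple

# A unit symbol only counts when it directly follows a number.
UNIT_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s?(km|kg|m)\b")

def find_units(text: str) -> List[Tuple[str, str]]:
    """Hypothetical helper: return (value, unit) pairs found in the text."""
    return UNIT_PATTERN.findall(text)

print(find_units("10m"))     # [('10', 'm')]
print(find_units("I am m"))  # []
print(find_units("12.5kg"))  # [('12.5', 'kg')]
```

A lone "m" with no preceding number never matches, so it can fall through to letter-by-letter reading.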
🎛️ Configuration
You can fully customize the pipeline by passing a NormalizationConfig object.
```python
from ckb_textify.core.pipeline import Pipeline
from ckb_textify.core.types import NormalizationConfig

config = NormalizationConfig(
    enable_phone=False,            # Keep phone numbers as digits
    enable_transliteration=False,  # Disable foreign script transliteration
    shadda_mode="remove",          # "remove" or "double" (default)
    emoji_mode="convert",          # "remove" (default), "convert", "ignore"
    enable_math=True,              # Normalizes math expressions
)

pipe = Pipeline(config)
print(pipe.normalize("Text..."))
```
Available Options
| Key | Default | Description |
|---|---|---|
| enable_numbers | True | Convert 123 to text. |
| enable_web | True | Spells out URLs/Emails. |
| enable_phone | True | Groups and reads phone numbers. |
| enable_units | True | Expands km, kg, etc. |
| enable_math | True | Handles scientific notation and math symbols. |
| diacritics_mode | "convert" | Convert Arabic Harakat to Kurdish vowels. |
| shadda_mode | "double" | Doubles the letter for Shadda (مّ -> مم). |
| emoji_mode | "remove" | Removes emojis. Set to "convert" to speak them. |
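To make the two documented `shadda_mode` values concrete, here is a minimal sketch of what "double" and "remove" mean at the character level. `apply_shadda` is a hypothetical helper, not part of the library, and it assumes the shadda mark immediately follows its base letter (no vowel mark in between).

```python
import re

SHADDA = "\u0651"  # ARABIC SHADDA combining mark

def apply_shadda(text: str, mode: str = "double") -> str:
    """Hypothetical helper mirroring the two documented shadda_mode values."""
    if mode == "double":
        # Replace "letter + shadda" with the letter written twice.
        return re.sub(rf"(.){SHADDA}", r"\1\1", text)
    if mode == "remove":
        # Drop the mark and keep a single letter.
        return text.replace(SHADDA, "")
    raise ValueError(f"unknown mode: {mode}")

print(apply_shadda("م\u0651", "double"))  # مم
print(apply_shadda("م\u0651", "remove"))  # م
```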
🤝 Contributing
Contributions are welcome! If you have ideas for new rules, have found a bug, or want to add support for more units:
- Fork the repository.
- Clone locally.
- Create a branch (git checkout -b feature/new-rule).
- Run Tests (python -m unittest discover tests).
- Submit a Pull Request.
👨‍💻 Author
Razwan M. Haji
- GitHub: RazwanSiktany
- PyPI: ckb-textify
📄 License
This project is licensed under the MIT License.