A comprehensive Text Normalization library for Central Kurdish (Sorani).
ckb-textify
ckb-textify is a comprehensive, industrial-strength Text Normalization and Transliteration library designed specifically for Central Kurdish (Sorani).
It is built to prepare "messy" real-world text for Natural Language Processing (NLP) tasks, with a specific focus on Text-to-Speech (TTS) engines. It handles complex ambiguity, mixed-script (English/Kurdish) text, technical codes, and locale-specific formats like Iraqi phone numbers.
📦 Installation
You can install the package directly from PyPI:
pip install ckb-textify
Dependencies:
- eng-to-ipa: Used for accurate English-to-Sorani transliteration.
🚀 Quick Start
Python Example
from ckb_textify import convert_all

# A string literal that spans several lines needs triple quotes.
text = """سڵاو، پەیوەندی بکە بە 07501234567.
نرخی کاڵاکە $12.50 یە.
کۆدەکە A1-B2 یە."""

normalized = convert_all(text)
print(normalized)
Output:
سڵاو, پەیوەندی بکە بە سفر حەوت سەد و پەنجا سەد و بیست و سێ چل و پێنج شەست و حەوت.
نرخی کاڵاکە دوازدە دۆلار و پەنجا سەنت یە.
کۆدەکە ئەی یەک داش بی دوو یە.
🎛️ Configuration
You can enable or disable specific normalization modules using a config dictionary. By default, all features are enabled.
from ckb_textify import convert_all
my_config = {
    "phone_numbers": False,
    "technical": False,
}
text = "Call 07501234567 regarding ID: 550e8400."
normalized = convert_all(text, config=my_config)
print(normalized)
Output will keep the phone number and ID unchanged.
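The "everything on unless switched off" behaviour can be pictured as a dictionary merge. The sketch below is an illustration only: `DEFAULT_CONFIG` and `resolve_config` are hypothetical names, and only the two keys shown in this README are included; it does not reflect how ckb-textify is implemented internally.

```python
# Hypothetical defaults: only the two keys that appear in this README.
DEFAULT_CONFIG = {"phone_numbers": True, "technical": True}


def resolve_config(overrides=None):
    """Return a new dict where user overrides replace the defaults."""
    merged = dict(DEFAULT_CONFIG)   # copy, so the defaults are never mutated
    merged.update(overrides or {})  # user-supplied values win
    return merged


print(resolve_config({"phone_numbers": False}))
```

Keys that are left out of the override dictionary keep their default value, which is why a partial config is enough to switch off a single module.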
🌟 Features & Examples
1. Smart Phone Numbers
Handles Iraqi/Kurdish mobile formats by reading the digits in 4-3-2-2 groups, with a leading zero spoken on its own.
07501234567 → سفر حەوت سەد و پەنجا سەد و بیست و سێ چل و پێنج شەست و حەوت
+964... → کۆ نۆ سەد و شەست و چوار...
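The 4-3-2-2 grouping can be sketched in a few lines. This is a minimal illustration of the chunking step only, assuming an 11-digit local number; `group_mobile_number` is a hypothetical helper, not part of the ckb-textify API, and the conversion of each group to Kurdish words is left out.

```python
def group_mobile_number(number, pattern=(4, 3, 2, 2)):
    """Split a digit string into consecutive chunks of the given sizes."""
    if not number.isdigit() or len(number) != sum(pattern):
        raise ValueError("expected a digit string matching the pattern length")
    groups, start = [], 0
    for size in pattern:
        groups.append(number[start:start + size])
        start += size
    return groups


print(group_mobile_number("07501234567"))  # ['0750', '123', '45', '67']
```

Each chunk is then read as a number, which matches the example above: 0750 becomes "zero, seven hundred and fifty", followed by 123, 45 and 67.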
2. Math & Scientific Notation
Context-aware processing:
7/6 → حەوت دابەش شەش
km/h → کیلۆمەتر بۆ هەر کاتژمێرێک
34000000000000000000 → سێ پۆینت چوار جارانی دە توانی نۆزدە
00025 → سێ جار سفر بیست و پێنج
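Two of the readings above depend on simple digit-string analysis: a very long integer is read as a mantissa times a power of ten, and leading zeros are counted rather than dropped. The sketch below shows one way to extract those parts. Both helper names are hypothetical, and the rule the library uses to decide when a number is "long enough" for scientific reading is not stated here.

```python
def to_scientific(digits):
    """Return (mantissa, exponent) for a positive integer given as a string."""
    if not digits.isdigit() or digits[0] == "0":
        raise ValueError("expected a positive integer without leading zeros")
    significant = digits.rstrip("0")   # drop trailing zeros
    exponent = len(digits) - 1         # power of ten of the first digit
    mantissa = significant[0]
    if len(significant) > 1:
        mantissa += "." + significant[1:]
    return mantissa, exponent


def split_leading_zeros(token):
    """Return (number of leading zeros, remaining digits)."""
    rest = token.lstrip("0")
    return len(token) - len(rest), rest


print(to_scientific("34" + "0" * 18))  # ('3.4', 19)
print(split_leading_zeros("00025"))    # (3, '25')
```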
3. English Transliteration (IPA-Based)
Phone → فۆن
Action → ئاکشن
Google → گووگڵ
GPT → جی پی تی
USA → یو ئێس ئەی
4. Web & Technical Entities
Reads symbols character-by-character:
- Emails: info@gmail.com → ئای ئێن ئێف ئۆ ئەت جیمەیڵ دۆت کۆم
- URLs: www.razwan.net → دابڵیو دابڵیو دابڵیو دۆت ئاڕ ئەی زێت دابڵیو ئەی ئێن دۆت نێت
- Codes: A1-B2 → ئەی یەک داش بی دوو
- MAC/UUID: Handles patterns like 00:1A:2B...
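Character-by-character reading amounts to a table lookup per symbol. The following sketch covers only the five symbols needed for the A1-B2 example, with spoken forms copied from that example; `SPOKEN` and `spell_out` are hypothetical names, and the real library presumably uses a much larger table plus word-level rules (for instance for "gmail" or "com").

```python
# Spoken forms taken from the A1-B2 example; everything else passes through.
SPOKEN = {
    "A": "ئەی",
    "B": "بی",
    "1": "یەک",
    "2": "دوو",
    "-": "داش",
}


def spell_out(code):
    """Replace each known character by its spoken form, separated by spaces."""
    return " ".join(SPOKEN.get(ch, ch) for ch in code.upper())


print(spell_out("A1-B2"))
```

Unknown characters are returned unchanged here, which keeps the sketch safe to run on arbitrary input.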
5. Units & Measurements
10m → دە مەتر
m (alone) → m
10kg → دە کیلۆگرام
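The difference between "10m" and a bare "m" suggests that a unit is only expanded when it directly follows a number. One way to express that rule is a regular expression, sketched below for the two units shown here. `UNIT_NAMES` and `expand_units` are hypothetical, the digits are left as digits for brevity, and the pattern is an assumption rather than the library's own.

```python
import re

# Unit names copied from the examples above.
UNIT_NAMES = {"kg": "کیلۆگرام", "m": "مەتر"}

# A number, an optional space, then a known unit that ends at a word boundary.
UNIT_PATTERN = re.compile(r"(?<![\w.])(\d+(?:\.\d+)?)\s?(kg|m)\b")


def expand_units(text):
    """Expand a unit symbol only when it is attached to a number."""
    return UNIT_PATTERN.sub(
        lambda match: f"{match.group(1)} {UNIT_NAMES[match.group(2)]}", text
    )


print(expand_units("10m"))  # unit expanded
print(expand_units("m"))    # left unchanged: no number in front
```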
6. Currency & Finance
$12.5 → دوازدە دۆلار و نیو
IQD 5000 → پێنج هەزار دیناری عێڕاقی
£50, ¥1000, €20 → Supported
7. Standardization
- Fixes Unicode inconsistencies (ی, ە, ه, ھ, ة)
- Expands abbreviations (د. ← دکتۆر)
- Normalizes Arabic names (محمد ← موحەمەد)
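Unifying look-alike letters is typically a code-point substitution. The sketch below uses `str.translate` with two mappings that are common in Kurdish text cleanup. Which code points ckb-textify actually rewrites, and under what conditions, is an assumption here; `CHAR_MAP` and `standardize` are hypothetical names.

```python
# Illustrative table only; not taken from the library's source.
CHAR_MAP = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0629": "\u06D5",  # ARABIC LETTER TEH MARBUTA -> ARABIC LETTER AE
})


def standardize(text):
    """Map variant code points onto a single canonical letter."""
    return text.translate(CHAR_MAP)


print(standardize("\u0639\u0644\u064A") == "\u0639\u0644\u06CC")  # True
```

A real normalizer would also need context-sensitive rules (for example, treating ه differently depending on its position in the word), which a plain translation table cannot express.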
📄 License
This project is licensed under the MIT License.
See the LICENSE file for details.
🤝 Contributing
Contributions are welcome!
Open an issue or submit a pull request with improvements or new rules.
File details
Details for the file ckb_textify-2.0.0.tar.gz.
File metadata
- Download URL: ckb_textify-2.0.0.tar.gz
- Upload date:
- Size: 25.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e0e1e9d5787299f4d7805506673746f66aa1887a8c3168665e803deb44557d0a |
| MD5 | 1c551cd77c329671c3d593524cf3a063 |
| BLAKE2b-256 | 1eff9e1f41540f40c3be2b12d4e00ff2a82f667076c21b6aa664c35f0a2f6af9 |
File details
Details for the file ckb_textify-2.0.0-py3-none-any.whl.
File metadata
- Download URL: ckb_textify-2.0.0-py3-none-any.whl
- Upload date:
- Size: 30.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4c47cae54055599e1bd0fe9fa9eaa08dd5d65ef23c0bf98239644fec2526a9d1 |
| MD5 | 672466c276c6c9f638eae7533dda5c6d |
| BLAKE2b-256 | 259f7fb8ed741e92bfc31a343ab75cc637a5307267257ec10f5a7f2e0c5f84cb |