A comprehensive Text Normalization library for Central Kurdish (Sorani).
ckb-textify
ckb-textify is an industrial-strength Text Normalization and Transliteration library designed specifically for Central Kurdish (Sorani).
It transforms "messy" real-world text (mixed languages, symbols, math, codes) into clean, spoken-form Kurdish text, which makes it well suited as a pre-processor for Text-to-Speech (TTS) and NLP models.
🚀 Live Demo
Try the library instantly without installing anything: 👉 Click here to open the Live App
🔮 Next Step: Phonemization (G2P)
This library handles normalization (Text-to-Text). If you need Phonemization (Text-to-Sounds/IPA) for building a TTS model, check out the companion project:
🦁 ckb-g2p (Central Kurdish Grapheme-to-Phoneme)
- GitHub: RazwanSiktany/ckb_g2p
- Live Demo: ckb-g2p.streamlit.app
📦 Installation
pip install ckb-textify
Dependencies:
- eng-to-ipa: For accurate English pronunciation.
- anyascii: For universal transliteration (Chinese, Russian, etc.).
⚡ Usage
from ckb_textify import convert_all
text = """
سڵاو، پەیوەندی بکە بە 07501234567.
نرخی زێڕ ≈ $2500.
کۆدەکە A1-B2 یە.
سڵاو لە Peter و Xi Jinping.
"""
normalized = convert_all(text)
print(normalized)
# Output:
# سڵاو, پەیوەندی بکە بە سفر حەوت سەد و پەنجا سەد و بیست و سێ چل و پێنج شەست و حەوت.
# نرخی زێڕ نزیکەی دوو هەزار و پێنج سەد دۆلار یە.
# کۆدەکە ئەی یەک داش بی دوو یە.
# سڵاو لە پیتەر و سی جینپینگ.
🌟 Features (v3.0.0)
1. 🌍 Universal Script Support
Transliterates almost any language into Sorani script using a "Latin Bridge" technique.
- Chinese: 你好 → نی هاو
- Russian: Путин → پوتین
- Greek: Χαίρετε → چایرێت
- German/French: Handles accents (Straße → ستراسسە, République → ڕیپەبلیک).
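The "Latin Bridge" idea can be pictured as two stages: reduce the foreign script to plain Latin letters, then map those letters to Sorani. The sketch below illustrates only the first stage, and only for accented Latin text, using Python's standard `unicodedata` module as a stand-in for the `anyascii` dependency. `to_plain_latin` is a hypothetical helper, not part of ckb-textify.

```python
import unicodedata


def to_plain_latin(text: str) -> str:
    """Strip diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(to_plain_latin("République"))  # Republique
```

Non-Latin scripts such as Chinese or Russian need a real transliteration table, which this stand-in does not provide.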
2. ➗ Advanced Math & Science
- Scientific Notation: 5e-23 → پێنج جارانی دە توانی سالب بیست و سێ
- Functions: ln 4 → لۆگاریتمی سروشتی چوار
- Context-Aware: Distinguishes Division (7/6) from Rates (km/h).
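One way to picture the division-versus-rate distinction is a pair of patterns applied to each token containing a slash. This is a minimal sketch under the assumption that number/number means division and letters/letters means a rate. The patterns and the `classify_slash` name are hypothetical and do not reflect how ckb-textify decides internally.

```python
import re

# Assumed heuristics: digits on both sides -> division, letters on both sides -> rate.
DIVISION = re.compile(r"^\d+(?:\.\d+)?/\d+(?:\.\d+)?$")
RATE = re.compile(r"^[A-Za-z]+/[A-Za-z]+$")


def classify_slash(token: str) -> str:
    """Label a slash-containing token as 'division', 'rate', or 'unknown'."""
    if DIVISION.match(token):
        return "division"
    if RATE.match(token):
        return "rate"
    return "unknown"


print(classify_slash("7/6"))   # division
print(classify_slash("km/h"))  # rate
```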
3. 📞 Smart Phone Numbers
Handles Iraqi/Kurdish formats with intelligent grouping (4-3-2-2).
- 07501234567 → سفر حەوت سەد و پەنجا...
- +964... → کۆ نۆ سەد و شەست و چوار...
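The 4-3-2-2 grouping can be sketched as a simple slicing step that runs before each group is read out as a number. `group_digits` is a hypothetical helper written for illustration. It is not taken from the library, and it handles only 11-digit local numbers.

```python
def group_digits(number: str, sizes=(4, 3, 2, 2)) -> list[str]:
    """Split a phone number into consecutive digit groups of the given sizes."""
    digits = "".join(ch for ch in number if ch.isdigit())
    if len(digits) != sum(sizes):
        raise ValueError(f"expected {sum(sizes)} digits, got {len(digits)}")
    groups = []
    start = 0
    for size in sizes:
        groups.append(digits[start:start + size])
        start += size
    return groups


print(group_digits("07501234567"))  # ['0750', '123', '45', '67']
```

Each group would then be verbalized on its own, which matches the reading shown in the Usage output (0750, then 123, then 45, then 67).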
4. 🔡 English Transliteration (IPA)
Uses the International Phonetic Alphabet to pronounce English words correctly.
- Phone → فۆن (Not "پھۆنە")
- Google → گووگڵ
- Acronyms: GPT → جی پی تی
5. 💻 Web & Technical
Reads technical strings character-by-character.
- Emails: info@gmail.com → ئای ئێن ئێف ئۆ ئەت جیمەیڵ دۆت کۆم
- URLs: www.razwan.net → دابڵیو دابڵیو دابڵیو دۆت ئاڕ ئەی زێت یو ئەی ئێن دۆت نێت
- Codes: A1-B2 → ئەی یەک داش بی دوو
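Reading a code one character at a time amounts to a lookup table from characters to spoken names. The sketch below uses only the names that appear in the A1-B2 example above. A real table would cover the full alphabet, all digits, and more symbols, and `spell_out` is a hypothetical name rather than a ckb-textify function.

```python
# Spoken names copied from the README's own A1-B2 example.
SPOKEN = {
    "A": "ئەی",
    "B": "بی",
    "1": "یەک",
    "2": "دوو",
    "-": "داش",
}


def spell_out(code: str) -> str:
    """Read a code character by character; unknown characters pass through unchanged."""
    return " ".join(SPOKEN.get(ch, ch) for ch in code.upper())


print(spell_out("A1-B2"))
```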
6. 📏 Units & Measurements
Solves ambiguity between units and nouns.
- 10m → دە مەتر (Unit) vs m → m (Noun/Letter)
- 120km/h → سەد و بیست کیلۆمەتر بۆ هەر کاتژمێرێک
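A plausible way to separate a unit from a bare letter is to require that digits come immediately before the unit symbol. The following sketch assumes exactly that rule. The `UNITS` table holds only the metre entry from the example above, and `split_measurement` is a hypothetical helper, not the library's implementation.

```python
import re

# Assumed unit table; only the metre entry comes from the README example.
UNITS = {"m": "مەتر"}
NUMBER_UNIT = re.compile(r"^(\d+)([A-Za-z]+)$")


def split_measurement(token: str):
    """Return (number, spoken unit) when a known unit follows digits, else None."""
    match = NUMBER_UNIT.match(token)
    if match and match.group(2) in UNITS:
        return match.group(1), UNITS[match.group(2)]
    return None


print(split_measurement("10m") is not None)  # True: treated as a measurement
print(split_measurement("m") is None)        # True: left alone as a letter
```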
🎛️ Configuration
You can disable specific modules if needed:
config = {
    "phone_numbers": False,
    "foreign": False,  # Disable Chinese/Russian transliteration
}
convert_all(text, config=config)
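For readers wondering how such a partial dictionary might combine with defaults, here is a minimal sketch. It assumes every module is enabled unless switched off, and it knows only the two keys shown above. The library's real key set and defaults are not documented here, so `DEFAULTS` and `resolve_config` are hypothetical.

```python
# Assumed defaults: only "phone_numbers" and "foreign" appear in the README.
DEFAULTS = {"phone_numbers": True, "foreign": True}


def resolve_config(overrides=None) -> dict:
    """Merge user overrides onto the defaults, rejecting unknown keys."""
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


print(resolve_config({"foreign": False}))
# {'phone_numbers': True, 'foreign': False}
```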
🤝 Contributing
Contributions are very welcome! If you have ideas for new rules, have found a bug, or want to add support for more units, please feel free to open a Pull Request.
- Fork the repository on GitHub.
- Clone your fork locally.
- Create a new branch for your feature (git checkout -b feature/amazing-feature).
- Run the tests to ensure everything is working (python -m unittest discover tests).
- Commit your changes.
- Push to the branch and open a Pull Request.
🤝 Author
Razwan. M Haji
- GitHub: RazwanSiktany
📄 License
This project is licensed under the MIT License.
File details
Details for the file ckb_textify-3.0.1.tar.gz.
File metadata
- Download URL: ckb_textify-3.0.1.tar.gz
- Upload date:
- Size: 28.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7e10c0566e20239cf6420249c776e59c56432ae808e117b8497cb6140fa14f76 |
| MD5 | 9859de2d4fd47e8be4c81a8df8ad932d |
| BLAKE2b-256 | 6e6ccbcf41d280f8f576bb23623e0d587c181c00d5a506b80c74df767246b346 |
File details
Details for the file ckb_textify-3.0.1-py3-none-any.whl.
File metadata
- Download URL: ckb_textify-3.0.1-py3-none-any.whl
- Upload date:
- Size: 33.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4d69d69c2c8893f006fe91ba65b4cfe35f2b5ea90c81bc4bfbb27da3993a1374 |
| MD5 | 26a7474a85ccdffd02804d86757f3d26 |
| BLAKE2b-256 | 7e48f6c4dc024efe5c18c009dea61069b38a6f461703ea5ba4840711dd46e9ea |