
A comprehensive Text Normalization library for Central Kurdish (Sorani).

🦁 ckb-textify

ckb-textify is a Text Normalization and Transliteration library designed specifically for Central Kurdish (Sorani).

While most normalizers perform simple "Find & Replace", ckb-textify uses a context-aware pipeline to transform "messy" real-world text (mixed languages, scientific notation, Quranic Tajweed, and technical jargon) into clean, spoken-form Kurdish text. It is intended as a pre-processor for Text-to-Speech (TTS) and NLP models.


🚀 Live Demo

Try the library instantly in your browser: 👉 Click here to open the Live App

🔮 The Ecosystem

ckb-textify handles Normalization (Text-to-Text). For Phonemization (Text-to-Sounds/IPA), check out the companion project:

  • 🦁 ckb-g2p (Grapheme-to-Phoneme): GitHub | Demo

📦 Installation

pip install ckb-textify

Key Dependencies:

  • eng-to-ipa: For accurate English pronunciation (e.g., "Phone" -> "Fôn").
  • anyascii: For universal script transliteration (Chinese, Russian, etc.).

⚡ Quick Start

from ckb_textify import convert_all

text = """
سڵاو! تکایە پەیوەندی بکە بە 07501234567.
نرخی زێڕ ≈ $2500.
کۆدەکە A1-B2 یە.
سڵاو لە Putin و Xi Jinping.
"""

# Default normalization (All features enabled)
normalized = convert_all(text)

print(normalized)

Output:

سڵاو! تکایە پەیوەندی بکە بە سفر حەوت سەد و پەنجا سەد و بیست و سێ چل و پێنج شەست و حەوت.
نرخی زێڕ نزیکەی دوو ھەزار و پێنج سەد دۆلار.
کۆدەکە ئەی یەک داش بی دوو یە.
سڵاو لە پوتین و سی جینپینگ.

🏛️ Architecture

ckb-textify processes text through a strictly ordered pipeline to handle dependencies (e.g., Units must be processed before Technical codes).
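
The ordering constraint can be illustrated with a minimal sketch. The stage functions below (`expand_units`, `spell_codes`), the `run_pipeline` helper, and the English placeholder output are hypothetical and are not ckb-textify's internals; the sketch only shows why a stage that spells tokens character by character has to run after the stage that recognizes units.

```python
import re
from typing import Callable, List

Stage = Callable[[str], str]


def expand_units(text: str) -> str:
    # Hypothetical stage: a number directly followed by "m" is read as a length.
    return re.sub(r"\b(\d+)m\b", r"\1 meters", text)


def spell_codes(text: str) -> str:
    # Hypothetical stage: any token mixing letters and digits is spelled
    # character by character.
    def spell(match: re.Match) -> str:
        return " ".join(match.group(0))

    return re.sub(r"\b(?=\w*\d)(?=\w*[A-Za-z])\w+\b", spell, text)


def run_pipeline(text: str, stages: List[Stage]) -> str:
    # Apply each stage in order; the output of one stage feeds the next.
    for stage in stages:
        text = stage(text)
    return text


right_order = run_pipeline("10m A1", [expand_units, spell_codes])
wrong_order = run_pipeline("10m A1", [spell_codes, expand_units])
```

With units first, `right_order` is `"10 meters A 1"`. With the order reversed, the code stage breaks `10m` apart before the unit stage can see it, and `wrong_order` is `"1 0 m A 1"`.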


🌟 Advanced Features

1. 🕌 Deep Linguistic & Tajweed Support

Unlike basic normalizers, this library respects complex phonological rules for Arabic/Islamic text embedded in Kurdish.

  • Shamsi (Sun) Letters: Automatically assimilates the 'L' in 'Al-'.
    • Input: بِسْمِ ٱللَّهِ
    • Output: بیسمی للاھی (Handles the "Light Lam" vs "Dark Lam" rule automatically).
  • Context-Aware "Allah": Determines pronunciation (L vs LL) based on the preceding vowel.
  • Alif Wasla (ٱ): Treated as silent in continuation, but pronounced as 'E' at the start.
  • Tajweed Rules: Handles Iqlab (N->M) and Idgham.
  • Heavy 'R' (ڕ): Detects heavy R based on Arabic vowel context (e.g., مِرْصَاد -> میڕساد).

2. 🌍 Universal Script Support ("The Latin Bridge")

Transliterates almost any world script into Sorani using a smart "Latin Bridge" technique.

| Language | Input | Output (Sorani) |
|----------|-------|-----------------|
| Chinese | 你好 | نی هاو |
| Russian | Путин | پوتین |
| Greek | Χαίρετε | چایرێت |
| German | Straẞe | ستراسسە |
| French | République | ڕیپەبلیک |
| English | Phone | فۆن (IPA-based, not rule-based) |
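
The two-step idea behind the bridge (source script to Latin, then Latin to Sorani) can be sketched as follows. This is an illustration only: the dependency list names anyascii for the first step, but the sketch swaps in two tiny hand-written tables that cover just the letters of one example. The names `CYRILLIC_TO_LATIN`, `LATIN_TO_SORANI`, and `latin_bridge` are hypothetical and not part of the library.

```python
# Step 1 table: source script -> Latin (illustrative subset, Cyrillic only).
CYRILLIC_TO_LATIN = {"п": "p", "у": "u", "т": "t", "и": "i", "н": "n"}

# Step 2 table: Latin -> Sorani letters (illustrative subset only).
LATIN_TO_SORANI = {"p": "پ", "u": "و", "t": "ت", "i": "ی", "n": "ن"}


def latin_bridge(word: str) -> str:
    # First hop: map every character to its Latin equivalent, if known.
    latin = "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in word.lower())
    # Second hop: map the Latin letters to Sorani letters, if known.
    return "".join(LATIN_TO_SORANI.get(ch, ch) for ch in latin)


result = latin_bridge("Путин")
```

For the Russian row of the table, the intermediate Latin form is `putin`, and the second hop produces the five Sorani letters shown in the Output column. Characters missing from either table pass through unchanged, so a real bridge needs full-coverage tables or a library for each hop.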

3. ➗ Scientific & Mathematical Logic

Handles complex math that breaks most normalizers.

  • Scientific Notation: 5e-23 -> پێنج جارانی دە توانی سالب بیست و سێ
  • Functions: ln 4 -> لۆگاریتمی سروشتی چوار
  • Fraction Logic:
    • 1/2 -> نیوە
    • 3/4 -> سێ چارەک
    • 120km/h -> ... بۆ هەر کاتژمێرێک (Context-aware "Per" rule)
    • 7/6 -> حەوت دابەش شەش (Context-aware "Division" rule)
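
The fraction rules above amount to deciding what a `/` means from its neighbours. A minimal sketch of that decision is shown below; `classify_slash`, `NAMED_FRACTIONS`, and the English labels it returns are hypothetical stand-ins for illustration, not the library's actual rules or output.

```python
import re

# Fractions with their own spoken name (illustrative subset only).
NAMED_FRACTIONS = {("1", "2"): "half", ("3", "4"): "three quarters"}


def classify_slash(token: str) -> str:
    # A number with a unit on the left and a unit on the right is a rate.
    if re.fullmatch(r"\d+(?:\.\d+)?[A-Za-z]+/[A-Za-z]+", token):
        return "per"
    match = re.fullmatch(r"(\d+)/(\d+)", token)
    if match is None:
        return "unknown"
    # Common fractions get a dedicated name instead of "x divided by y".
    if match.groups() in NAMED_FRACTIONS:
        return "named:" + NAMED_FRACTIONS[match.groups()]
    return "division"


labels = [classify_slash(t) for t in ["1/2", "3/4", "120km/h", "7/6"]]
```

Here `labels` is `["named:half", "named:three quarters", "per", "division"]`, mirroring the four cases in the list.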

4. 📞 Smart Phone Numbers

Recognizes Iraqi and International phone formats and groups digits for natural reading (4-3-2-2 format).

  • 07501234567 -> سفر حەوت سەد و پەنجا ...
  • +964... -> کۆ نۆ سەد و شەست و چوار ...
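
The 4-3-2-2 grouping can be sketched in a few lines. `group_digits` is a hypothetical helper written for this example, not a ckb-textify function; it only splits the digits, and reading each group aloud as a number is a separate step.

```python
from typing import List, Tuple


def group_digits(number: str, pattern: Tuple[int, ...] = (4, 3, 2, 2)) -> List[str]:
    # Keep digits only, so "0750 123 45 67" and "0750-123-4567" behave the same.
    digits = "".join(ch for ch in number if ch.isdigit())
    if len(digits) != sum(pattern):
        raise ValueError(f"expected {sum(pattern)} digits, got {len(digits)}")
    groups, start = [], 0
    for size in pattern:
        groups.append(digits[start:start + size])
        start += size
    return groups


groups = group_digits("07501234567")
```

For the Quick Start number, `groups` is `["0750", "123", "45", "67"]`, which lines up with the four spoken chunks in the Quick Start output.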

5. 💻 Web & Technical Entities

  • URLs: www.rudaw.net -> دابڵیو دابڵیو دابڵیو دۆت رووداو دۆت نێت
  • Emails: info@gmail.com -> ... ئەت جیمەیڵ دۆت کۆم (Recognizes common domains)
  • Codes: A1-B2 -> ئەی یەک داش بی دوو (Character-by-character reading)
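
As a rough illustration of the URL and email readings above, an address can be split into speakable tokens before each token is rendered in Kurdish. The helper `web_tokens`, the `SEPARATORS` table, and the English token names are hypothetical and only sketch the tokenization step; they are not the library's implementation.

```python
import re
from typing import List

# Spoken names for separators (English placeholders for this sketch).
SEPARATORS = {".": "dot", "@": "at"}


def web_tokens(address: str) -> List[str]:
    tokens: List[str] = []
    # The capture group keeps the separators in the split result.
    for part in re.split(r"([.@])", address):
        if not part:
            continue
        if part in SEPARATORS:
            tokens.append(SEPARATORS[part])
        elif part.lower() == "www":
            tokens.extend(part.lower())  # spelled letter by letter
        else:
            tokens.append(part)  # read as a word
    return tokens


url_tokens = web_tokens("www.rudaw.net")
mail_tokens = web_tokens("info@gmail.com")
```

`url_tokens` is `["w", "w", "w", "dot", "rudaw", "dot", "net"]` and `mail_tokens` is `["info", "at", "gmail", "dot", "com"]`, matching the token order of the readings in the list.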

6. 📏 Context-Aware Units

Solves the ambiguity between units and letters.

  • 10m -> دە مەتر
  • I am m -> ئای ئەم ئێم (Letter M)
  • 12.5kg -> دوازدە کیلۆگرام و نیو (Handles .5 as "Half")
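
One simple way to express the unit-versus-letter distinction is to accept a unit symbol only when it is attached to a number. The sketch below assumes that rule; `tag_units`, `UNIT_NAMES`, and the angle-bracket tags are hypothetical and chosen for illustration, and the library's real context rules may be richer.

```python
import re

# Unit symbols and placeholder names (illustrative subset only).
UNIT_NAMES = {"m": "meter", "kg": "kilogram"}

# A unit counts only when glued to a number: "10m" yes, a lone "m" no.
_UNIT_RE = re.compile(r"(?<![\w.])(\d+(?:\.\d+)?)(kg|m)\b")


def tag_units(text: str) -> str:
    return _UNIT_RE.sub(
        lambda m: f"{m.group(1)} <{UNIT_NAMES[m.group(2)]}>", text
    )


tagged = [tag_units(s) for s in ["10m", "I am m", "12.5kg"]]
```

`tagged` is `["10 <meter>", "I am m", "12.5 <kilogram>"]`: the standalone `m` is left alone, so a later stage can read it as a letter.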

🎛️ Configuration

You can fully customize the pipeline by passing a config dictionary.

from ckb_textify import convert_all

custom_config = {
    "phone_numbers": False,    # Keep phone numbers as digits
    "foreign": False,          # Disable Chinese/Russian transliteration
    "shadda_mode": "remove",   # "remove" or "double" (default)
    "emoji_mode": "convert",   # "remove" (default), "convert", "ignore"
    "chat_speak": True         # Enable '7ez' -> 'حەز' conversion
}

print(convert_all("Text...", config=custom_config))

Available Options

| Key | Default | Description |
|-----|---------|-------------|
| diacritics_mode | "convert" | Convert Arabic Harakat to Kurdish vowels. |
| shadda_mode | "double" | Doubles the letter for Shadda (مّ -> مم). |
| emoji_mode | "remove" | Removes emojis. Set to "convert" to speak them. |
| chat_speak | False | Converts Arabizi numbers (7->ح, 3->ع). |
| math | True | Normalizes math expressions and functions. |
| web | True | Spells out URLs and Emails. |
| technical | True | Spells out codes like UUIDs. |
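
The defaults above suggest that a partial config only overrides the keys it names. How ckb-textify merges configs internally is not stated here, so the following is a sketch under that assumption; `DEFAULTS` copies the table, and `resolve_config` is a hypothetical helper.

```python
from typing import Any, Dict, Optional

# Defaults copied from the options table above.
DEFAULTS: Dict[str, Any] = {
    "diacritics_mode": "convert",
    "shadda_mode": "double",
    "emoji_mode": "remove",
    "chat_speak": False,
    "math": True,
    "web": True,
    "technical": True,
}


def resolve_config(overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    # Start from a copy so the shared defaults are never mutated.
    merged = dict(DEFAULTS)
    merged.update(overrides or {})
    return merged


cfg = resolve_config({"emoji_mode": "convert", "chat_speak": True})
```

In this sketch `cfg["emoji_mode"]` becomes `"convert"` while `cfg["shadda_mode"]` stays `"double"`, and `DEFAULTS` itself is left untouched.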

🤝 Contributing

Contributions are very welcome! If you have ideas for new rules, have found a bug, or want to add support for more units:

  1. Fork the repository.
  2. Clone locally.
  3. Create a branch (git checkout -b feature/new-rule).
  4. Run Tests (python -m unittest discover tests).
  5. Submit a Pull Request.

👨‍💻 Author

Razwan M. Haji

📄 License

This project is licensed under the MIT License.
