Skip to main content

A comprehensive Text Normalization library for Central Kurdish (Sorani).

Project description

🦁 ckb-textify

PyPI version License: MIT Streamlit App

ckb-textify is an industrial-strength Text Normalization and Transliteration library designed specifically for Central Kurdish (Sorani).

While most normalizers perform simple "Find & Replace", ckb-textify uses a context-aware pipeline to transform "messy" real-world text—including mixed languages, scientific notation, Quranic Tajweed, and technical jargon—into clean, spoken Kurdish text. It is the perfect pre-processor for Text-to-Speech (TTS) and NLP models.


🚀 Live Demo

Try the library instantly in your browser: 👉 Click here to open the Live App

🔮 The Ecosystem

ckb-textify handles Normalization (Text-to-Text). For Phonemization (Text-to-Sounds/IPA), check out the companion project:

  • 🦁 ckb-g2p (Grapheme-to-Phoneme): GitHub | Demo

📦 Installation

pip install ckb-textify

Key Dependencies:

  • eng-to-ipa: For accurate English pronunciation (e.g., "Phone" -> "Fôn").
  • anyascii: For universal script transliteration (Chinese, Russian, etc.).

⚡ Quick Start

from ckb_textify import convert_all

text = """
سڵاو! تکایە پەیوەندی بکە بە 07501234567.
نرخی زێڕ ≈ $2500.
کۆدەکە A1-B2 یە.
سڵاو لە Putin و Xi Jinping.
"""

# Default normalization (All features enabled)
normalized = convert_all(text)

print(normalized)

Output:

سڵاو! تکایە پەیوەندی بکە بە سفر حەوت سەد و پەنجا سەد و بیست و سێ چل و پێنج شەست و حەوت.
نرخی زێڕ نزیکەی دوو ھەزار و پێنج سەد دۆلار.
کۆدەکە ئەی یەک داش بی دوو یە.
سڵاو لە پوتین و سی جینپینگ.

🏛️ Architecture

ckb-textify processes text through a strictly ordered pipeline to handle dependencies (e.g., Units must be processed before Technical codes).


🌟 Advanced Features

1. 🕌 Deep Linguistic & Tajweed Support

Unlike basic normalizers, this library respects complex phonological rules for Arabic/Islamic text embedded in Kurdish.

  • Shamsi (Sun) Letters: Automatically assimilates the 'L' in 'Al-'.
    • Input: بِسْمِ ٱللَّهِ
    • Output: بیسمی للاھی (Handles the "Light Lam" vs "Dark Lam" rule automatically).
  • Context-Aware "Allah": Determines pronunciation (L vs LL) based on the preceding vowel.
  • Alif Wasla (ٱ): Treated as silent in continuation, but pronounced as 'E' at the start.
  • Tajweed Rules: Handles Iqlab (N->M) and Idgham.
  • Heavy 'R' (ڕ): Detects heavy R based on Arabic vowel context (e.g., مِرْصَاد -> میڕساد).

2. 🌍 Universal Script Support ("The Latin Bridge")

Transliterates almost any world script into Sorani using a smart "Latin Bridge" technique.

Language Input Output (Sorani)
Chinese 你好 نی هاو
Russian Путин پوتین
Greek Χαίρετε چایرێت
German Straẞe ستراسسە
French République ڕیپەبلیک
English Phone فۆن (IPA-based, not rule-based)

3. ➗ Scientific & Mathematical Logic

Handles complex math that breaks most normalizers.

  • Scientific Notation: 5e-23 $\rightarrow$ پێنج جارانی دە توانی سالب بیست و سێ
  • Functions: ln 4 $\rightarrow$ لۆگاریتمی سروشتی چوار
  • Fraction Logic:
    • 1/2 $\rightarrow$ نیوە
    • 3/4 $\rightarrow$ سێ چارەک
    • 120km/h $\rightarrow$ ... بۆ هەر کاتژمێرێک (Context-aware "Per" rule)
    • 7/6 $\rightarrow$ حەوت دابەش شەش (Context-aware "Division" rule)

4. 📞 Smart Phone Numbers

Recognizes Iraqi and International phone formats and groups digits for natural reading (4-3-2-2 format).

  • 07501234567 $\rightarrow$ سفر حەوت سەد و پەنجا ...
  • +964... $\rightarrow$ کۆ نۆ سەد و شەست و چوار ...

5. 💻 Web & Technical Entities

  • URLs: www.rudaw.net $\rightarrow$ دابڵیو دابڵیو دابڵیو دۆت رووداو دۆت نێت
  • Emails: info@gmail.com $\rightarrow$ ... ئەت جیمەیڵ دۆت کۆم (Recognizes common domains)
  • Codes: A1-B2 $\rightarrow$ ئەی یەک داش بی دوو (Character-by-character reading)

6. 📏 Context-Aware Units

Solves the ambiguity between units and letters.

  • 10m $\rightarrow$ دە مەتر
  • I am m $\rightarrow$ ئای ئەم ئێم (Letter M)
  • 12.5kg $\rightarrow$ دوازدە کیلۆگرام و نیو (Handles .5 as "Half")

🎛️ Configuration

You can fully customize the pipeline by passing a config dictionary.

from ckb_textify import convert_all

custom_config = {
    "phone_numbers": False,    # Keep phone numbers as digits
    "foreign": False,          # Disable Chinese/Russian transliteration
    "shadda_mode": "remove",   # "remove" or "double" (default)
    "emoji_mode": "convert",   # "remove" (default), "convert", "ignore"
    "chat_speak": True         # Enable '7ez' -> 'حەز' conversion
}

print(convert_all("Text...", config=custom_config))

Available Options

Key Default Description
diacritics_mode "convert" Convert Arabic Harakat to Kurdish vowels.
shadda_mode "double" Doubles the letter for Shadda (مّ -> مم).
emoji_mode "remove" Removes emojis. Set to "convert" to speak them.
chat_speak False Converts Arabizi numbers (7->ح, 3->ع).
math True Normalizes math expressions and functions.
web True Spells out URLs and Emails.
technical True Spells out codes like UUIDs.

🤝 Contributing

Contributions are widely welcomed! If you have ideas for new rules, found a bug, or want to add support for more units:

  1. Fork the repository.
  2. Clone locally.
  3. Create a branch (git checkout -b feature/new-rule).
  4. Run Tests (python -m unittest discover tests).
  5. Submit a Pull Request.

👨‍💻 Author

Razwan M. Haji

📄 License

This project is licensed under the MIT License.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ckb_textify-4.1.2.tar.gz (39.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ckb_textify-4.1.2-py3-none-any.whl (43.4 kB view details)

Uploaded Python 3

File details

Details for the file ckb_textify-4.1.2.tar.gz.

File metadata

  • Download URL: ckb_textify-4.1.2.tar.gz
  • Upload date:
  • Size: 39.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for ckb_textify-4.1.2.tar.gz
Algorithm Hash digest
SHA256 679178d8234a59197d08261098dc753a0570227d8051caf8268184b1bbda1770
MD5 f0f0d3dd73dd58903d0688f95db7e0a6
BLAKE2b-256 149fc4c38a1237ccd0b088108b0af638d7cbdb1992e698e3650d5eb8b2729445

See more details on using hashes here.

File details

Details for the file ckb_textify-4.1.2-py3-none-any.whl.

File metadata

  • Download URL: ckb_textify-4.1.2-py3-none-any.whl
  • Upload date:
  • Size: 43.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for ckb_textify-4.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 8aa4da78662af0359913d66ff6341f6b34572e0a8eb8f8a05fd258544abe8331
MD5 5a9b20affa5111c7a2657f64c4d3ab9b
BLAKE2b-256 83178b070c55001f5ca6614fa1fc795ada4877ebf3b61eaba48e448afc87dbec

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page