
Industrial-strength Text Normalization and Transliteration for Central Kurdish (Sorani)


🦁 ckb-textify


ckb-textify is an industrial-strength Text Normalization and Transliteration library designed specifically for Central Kurdish (Sorani).

While most normalizers perform simple "Find & Replace", ckb-textify uses a context-aware pipeline to transform "messy" real-world text (mixed languages, scientific notation, Quranic Tajweed, and technical jargon) into clean, spoken-form Kurdish text. It is designed as a pre-processor for Text-to-Speech (TTS) and NLP models.


🚀 Live Demo

Try the library instantly in your browser: 👉 Click here to open the Live App

🔮 The Ecosystem

ckb-textify handles Normalization (Text-to-Text). For Phonemization (Text-to-Sounds/IPA), check out the companion project:

  • 🦁 ckb-g2p (Grapheme-to-Phoneme): GitHub | Demo

📦 Installation

pip install ckb-textify

Key Dependencies:

  • eng-to-ipa: For accurate English pronunciation (e.g., "Phone" -> "فۆن").
  • anyascii: For universal script transliteration (Chinese, Russian, etc.).

⚡ Quick Start

from ckb_textify.core.pipeline import Pipeline

text = """
سڵاو! تکایە پەیوەندی بکە بە 07501234567.
نرخی زێڕ ≈ $2500.
کۆدەکە A1-B2 یە.
سڵاو لە Putin و Xi Jinping.
"""

# 1. Initialize Default Pipeline
pipe = Pipeline()

# 2. Normalize
normalized = pipe.normalize(text)

print(normalized)

Output:

سڵاو! تکایە پەیوەندی بکە بە سفر حەوت سەد و پەنجا سەد و بیست و سێ چل و پێنج شەست و حەوت.
نرخی زێڕ نزیکەی دوو ھەزار و پێنج سەد دۆلار.
کۆدەکە ئەی یەک داش بی دوو یە.
سڵاو لە پوتین و سی جینپینگ.

🏛️ Architecture

ckb-textify processes text through a strictly ordered pipeline to handle dependencies (e.g., Units must be processed before Technical codes).
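
To illustrate why stage order matters, here is a minimal, self-contained sketch of an ordered pipeline. The stage functions (`expand_units`, `spell_codes`) and their toy regex rules are hypothetical and are not ckb-textify's actual modules; they only show that running a code-spelling stage before a unit stage would break unit detection.

```python
import re
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[str], str]]

def expand_units(text: str) -> str:
    # Toy rule (assumption): a number glued to "m" is a length in meters.
    return re.sub(r"\b(\d+)m\b", r"\1 meters", text)

def spell_codes(text: str) -> str:
    # Toy rule (assumption): insert a space at every letter/digit boundary
    # so that codes such as "A1" are read character by character.
    return re.sub(r"(?<=\d)(?=[A-Za-z])|(?<=[A-Za-z])(?=\d)", " ", text)

# Units run first; otherwise "10m" would already be split into "10 m".
STAGES: List[Stage] = [("units", expand_units), ("codes", spell_codes)]

def run_pipeline(text: str, stages: List[Stage]) -> str:
    # Apply each stage in list order, feeding its output to the next stage.
    for _name, stage in stages:
        text = stage(text)
    return text

print(run_pipeline("10m A1", STAGES))
print(run_pipeline("10m A1", list(reversed(STAGES))))
```

With the stages reversed, the unit rule no longer matches because the code stage has already separated the digits from the letter.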


🌟 Advanced Features

1. 🕌 Deep Linguistic & Tajweed Support

Unlike basic normalizers, this library respects complex phonological rules for Arabic/Islamic text embedded in Kurdish.

  • Shamsi (Sun) Letters: Automatically assimilates the 'L' in 'Al-'.
    • Input: بِسْمِ ٱللَّهِ
    • Output: بیسمی للاھی (Handles the "Light Lam" vs "Dark Lam" rule automatically).
  • Context-Aware "Allah": Determines pronunciation (L vs LL) based on the preceding vowel.
  • Alif Wasla (ٱ): Treated as silent in continuation, but pronounced as 'E' at the start.
  • Tajweed Rules: Handles Iqlab (N->M) and Idgham.
  • Heavy 'R' (ڕ): Detects heavy R based on Arabic vowel context (e.g., مِرْصَاد -> میڕساد).

2. 🌍 Universal Script Support ("The Latin Bridge")

Transliterates almost any world script into Sorani using a smart "Latin Bridge" technique.

Language | Input | Output (Sorani)
Chinese | 你好 | نی ھەو
Russian | Путин | پوتین
Greek | Χαίρετε | چایڕێتێ
German | Straẞe | ستراسسە
French | République | ڕێپوبلیکوێ
English | Phone | فۆن (IPA-based, not rule-based)
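
The two-step idea behind a "Latin Bridge" can be sketched as follows: romanize the foreign script first, then map the Latin letters to Sorani. The two lookup tables below are tiny, hand-written assumptions that cover only the letters of one example; the library itself lists anyascii as its dependency for the romanization step, and its real Latin-to-Sorani rules are not shown here.

```python
# Hypothetical toy tables covering only the letters needed for "Путин".
CYRILLIC_TO_LATIN = {"П": "P", "у": "u", "т": "t", "и": "i", "н": "n"}
LATIN_TO_SORANI = {"p": "پ", "u": "و", "t": "ت", "i": "ی", "n": "ن"}

def to_latin(text: str) -> str:
    # Step 1: romanize; characters missing from the table pass through unchanged.
    return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in text)

def latin_to_sorani(text: str) -> str:
    # Step 2: map lowercase Latin letters to Sorani letters.
    return "".join(LATIN_TO_SORANI.get(ch, ch) for ch in text.lower())

def bridge(text: str) -> str:
    # Foreign script -> Latin -> Sorani.
    return latin_to_sorani(to_latin(text))

print(to_latin("Путин"))
print(bridge("Путин"))
```

The benefit of this design is that only one Latin-to-Sorani mapping has to be maintained, no matter how many source scripts are supported.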

3. ➗ Scientific & Mathematical Logic

Handles math expressions that many normalizers mishandle.

  • Scientific Notation: 5e-23 -> پێنج جارانی دە توانی سالب بیست و سێ
  • Functions: ln 4 -> لۆگاریتمی سروشتی چوار
  • Fraction Logic:
    • 1/2 -> نیوە
    • 3/4 -> سێ دابەش چوار
    • 120km/h -> ... بۆ هەر کاتژمێرێک (Context-aware "Per" rule)
    • 7/6 -> حەوت دابەش شەش (Context-aware "Division" rule)
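
The slash disambiguation listed above could be approximated with a small classifier like the one below. This is a sketch under assumptions: the function name `classify_slash`, the English labels, and the regular expressions are illustrative and are not taken from the library's source.

```python
import re

def classify_slash(token: str) -> str:
    """Decide how the '/' in a token should be read (illustrative labels)."""
    # Special case from the examples: 1/2 is read as "half".
    if token == "1/2":
        return "half"
    # A number followed by a unit, a slash, and another unit: "per" reading.
    if re.fullmatch(r"\d+(?:\.\d+)?\s*[A-Za-z]+/[A-Za-z]+", token):
        return "per"
    # Two plain integers around the slash: "divided by" reading.
    if re.fullmatch(r"\d+/\d+", token):
        return "division"
    return "unknown"

for token in ["1/2", "3/4", "120km/h", "7/6"]:
    print(token, classify_slash(token))
```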

4. 📞 Smart Phone Numbers

Recognizes Iraqi and International phone formats and groups digits for natural reading (4-3-2-2 format).

  • 07501234567 -> سفر حەوت سەد و پەنجا ...
  • +964... -> کۆ نۆ سەد و شەست و چوار ...
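
The 4-3-2-2 grouping can be sketched in a few lines. The helper name `group_digits` and its validation are assumptions made for illustration; the library's own detection of phone formats is not reproduced here.

```python
from typing import List, Sequence

def group_digits(number: str, sizes: Sequence[int] = (4, 3, 2, 2)) -> List[str]:
    """Split a digit string into consecutive groups of the given sizes."""
    if not number.isdigit() or len(number) != sum(sizes):
        raise ValueError("expected a digit string matching the group sizes")
    groups = []
    pos = 0
    for size in sizes:
        groups.append(number[pos:pos + size])
        pos += size
    return groups

print(group_digits("07501234567"))
```

Each group would then be read as its own number, which matches the Quick Start output where the number is spoken as zero, 750, 123, 45, 67.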

5. 💻 Web & Technical Entities

  • URLs: www.google.com -> دەبڵیو دەبڵیو دەبڵیو دۆت گووگڵ دۆت کۆم
  • Emails: info@gmail.com -> ... ئەت جیمەیڵ دۆت کۆم (Recognizes common domains)
  • Codes: A1-B2 -> ئەی یەک داش بی دوو (Character-by-character reading)
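
As a rough sketch of the first step in reading a URL or email aloud, the address can be split on its separators and each separator replaced by a spoken name. The function `tokenize_address` and the English placeholder names "dot" and "at" are assumptions; the library emits Kurdish words and also handles known domains, which this example does not attempt.

```python
import re
from typing import List

# Placeholder spoken names for separators (the library uses Kurdish words).
SYMBOL_NAMES = {".": "dot", "@": "at"}

def tokenize_address(address: str) -> List[str]:
    # The capturing group keeps the separators in the split result.
    parts = re.split(r"([.@])", address)
    return [SYMBOL_NAMES.get(part, part) for part in parts if part]

print(tokenize_address("www.google.com"))
print(tokenize_address("info@gmail.com"))
```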

6. 📏 Context-Aware Units

Solves the ambiguity between units and letters.

  • 10m -> دە مەتر
  • I am m -> ئای ئەم ئێم (Letter M)
  • 12.5kg -> دوازدە کیلۆگرام و نیو (Handles .5 as "Half")
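
One simple way to separate units from letters is to treat a symbol as a unit only when it directly follows a number. The pattern below is a sketch under that assumption, with a hypothetical three-unit list; it is not the library's actual rule set.

```python
import re
from typing import List, Tuple

# Longer symbols come first so "km" and "kg" are tried before "m".
UNIT_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s?(km|kg|m)\b")

def find_units(text: str) -> List[Tuple[str, str]]:
    """Return (number, unit) pairs; a bare letter with no number is ignored."""
    return UNIT_PATTERN.findall(text)

print(find_units("10m"))
print(find_units("12.5kg"))
print(find_units("I am m"))
```

In the last call the result is empty, so the trailing "m" would be left for a later stage to read as a letter.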

🎛️ Configuration

You can fully customize the pipeline by passing a NormalizationConfig object.

from ckb_textify.core.pipeline import Pipeline
from ckb_textify.core.types import NormalizationConfig

config = NormalizationConfig(
    enable_phone=False,        # Keep phone numbers as digits
    enable_transliteration=False, # Disable foreign script transliteration
    shadda_mode="remove",      # "remove" or "double" (default)
    emoji_mode="convert",      # "remove" (default), "convert", "ignore"
    enable_math=True           # Normalizes math expressions
)

pipe = Pipeline(config)
print(pipe.normalize("Text..."))

Available Options

Key | Default | Description
enable_numbers | True | Convert 123 to text.
enable_web | True | Spells out URLs/Emails.
enable_phone | True | Groups and reads phone numbers.
enable_units | True | Expands km, kg, etc.
enable_math | True | Handles scientific notation and math symbols.
diacritics_mode | "convert" | Convert Arabic Harakat to Kurdish vowels.
shadda_mode | "double" | Doubles the letter for Shadda (مّ -> مم).
emoji_mode | "remove" | Removes emojis. Set to "convert" to speak them.
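
The difference between the two shadda_mode values can be illustrated with a standalone sketch. The function `apply_shadda` is hypothetical and only handles a shadda (U+0651) placed directly after its letter; text where a vowel mark sits between the letter and the shadda would need extra handling that is not shown.

```python
import re

SHADDA = "\u0651"  # ARABIC SHADDA

def apply_shadda(text: str, mode: str = "double") -> str:
    if mode == "double":
        # Replace "letter + shadda" with the letter written twice.
        return re.sub("(.)" + SHADDA, r"\1\1", text)
    if mode == "remove":
        # Drop the shadda mark and keep a single letter.
        return text.replace(SHADDA, "")
    raise ValueError("mode must be 'double' or 'remove'")

sample = "م" + SHADDA
print(apply_shadda(sample, "double"))
print(apply_shadda(sample, "remove"))
```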

🤝 Contributing

Contributions are welcome! If you have ideas for new rules, have found a bug, or want to add support for more units:

  1. Fork the repository.
  2. Clone locally.
  3. Create a branch (git checkout -b feature/new-rule).
  4. Run Tests (python -m unittest discover tests).
  5. Submit a Pull Request.

👨‍💻 Author

Razwan M. Haji

📄 License

This project is licensed under the MIT License.
