Industrial-strength Text Normalization and Transliteration for Central Kurdish (Sorani)

🦁 ckb-textify

ckb-textify is an industrial-strength Text Normalization and Transliteration library designed specifically for Central Kurdish (Sorani).

While most normalizers perform simple "Find & Replace", ckb-textify uses a context-aware pipeline to transform messy real-world text (mixed languages, scientific notation, Quranic Tajweed, and technical jargon) into clean, spoken-form Kurdish text. It is designed as a pre-processor for Text-to-Speech (TTS) and NLP models.


🚀 Live Demo

Try the library instantly in your browser: 👉 Click here to open the Live App

🔮 The Ecosystem

ckb-textify handles Normalization (Text-to-Text). For Phonemization (Text-to-Sounds/IPA), check out the companion project:

  • 🦁 ckb-g2p (Grapheme-to-Phoneme): GitHub | Demo

📦 Installation

pip install ckb-textify

Key Dependencies:

  • eng-to-ipa: For accurate English pronunciation (e.g., "Phone" -> "فۆن").
  • anyascii: For universal script transliteration (Chinese, Russian, etc.).

⚡ Quick Start

from ckb_textify.core.pipeline import Pipeline
from ckb_textify.core.types import NormalizationConfig

text = """
سڵاو! تکایە پەیوەندی بکە بە 07501234567.
نرخی زێڕ ≈ $2500.
کۆدەکە A1-B2 یە.
سڵاو لە Putin و Xi Jinping.
"""

# 1. Initialize Default Pipeline
pipe = Pipeline()

# 2. Normalize
normalized = pipe.normalize(text)

print(normalized)

Output:

سڵاو! تکایە پەیوەندی بکە بە سفر حەوت سەد و پەنجا سەد و بیست و سێ چل و پێنج شەست و حەوت.
نرخی زێڕ نزیکەی دوو ھەزار و پێنج سەد دۆلار.
کۆدەکە ئەی یەک داش بی دوو یە.
سڵاو لە پوتین و سی جینپینگ.

🏛️ Architecture

ckb-textify processes text through a strictly ordered pipeline to handle dependencies (e.g., Units must be processed before Technical codes).
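
The ordering constraint can be illustrated with a minimal sketch. The stage functions below (`expand_units`, `spell_codes`, `run`) are hypothetical stand-ins written for this example, not ckb-textify's internals, and they emit English placeholders instead of Kurdish. They only show why a unit stage has to run before a character-by-character code stage.

```python
import re
from typing import Callable, List

Stage = Callable[[str], str]

def expand_units(text: str) -> str:
    # Hypothetical stage: digits directly followed by "m" become "<digits> meter".
    return re.sub(r"\b(\d+)m\b", r"\1 meter", text)

def spell_codes(text: str) -> str:
    # Hypothetical stage: any token mixing letters and digits is spelled
    # out one character at a time.
    def spell(match: re.Match) -> str:
        return " ".join(match.group(0))
    return re.sub(r"\b(?=\w*[A-Za-z])(?=\w*\d)\w+\b", spell, text)

def run(text: str, stages: List[Stage]) -> str:
    # Apply the stages strictly in the order given.
    for stage in stages:
        text = stage(text)
    return text

right_order = run("10m", [expand_units, spell_codes])  # "10 meter"
wrong_order = run("10m", [spell_codes, expand_units])  # "1 0 m"
```

With the code stage first, "10m" is treated as a technical code and spelled out, so the unit stage never sees it.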


🌟 Advanced Features

1. 🕌 Deep Linguistic & Tajweed Support

Unlike basic normalizers, this library respects complex phonological rules for Arabic/Islamic text embedded in Kurdish.

  • Shamsi (Sun) Letters: Automatically assimilates the 'L' in 'Al-'.
    • Input: بِسْمِ ٱللَّهِ
    • Output: بیسمی للاھی (Handles the "Light Lam" vs "Dark Lam" rule automatically).
  • Context-Aware "Allah": Determines pronunciation (L vs LL) based on the preceding vowel.
  • Alif Wasla (ٱ): Treated as silent in continuation, but pronounced as 'E' at the start.
  • Tajweed Rules: Handles Iqlab (N->M) and Idgham.
  • Heavy 'R' (ڕ): Detects heavy R based on Arabic vowel context (e.g., مِرْصَاد -> میڕساد).

2. 🌍 Universal Script Support ("The Latin Bridge")

Transliterates almost any world script into Sorani using a smart "Latin Bridge" technique.

| Language | Input | Output (Sorani) |
| --- | --- | --- |
| Chinese | 你好 | نی ھەو |
| Russian | Путин | پوتین |
| Greek | Χαίρετε | چایڕێتێ |
| German | Straẞe | ستراسسە |
| French | République | ڕێپوبلیکوێ |
| English | Phone | فۆن (IPA-based, not rule-based) |
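
A two-stage bridge of this kind can be sketched as follows: the source script is first romanized, and the Latin form is then mapped to Sorani letters. Both lookup tables and all function names here are hypothetical and cover only the letters of one example word. The library lists anyascii as its dependency for the romanization step; the tiny dictionary below merely stands in for it so that the sketch runs without extra packages.

```python
# Stage 1 stand-in: source script -> Latin (hypothetical, five letters only).
CYRILLIC_TO_LATIN = {"П": "p", "у": "u", "т": "t", "и": "i", "н": "n"}

# Stage 2 stand-in: Latin -> Sorani letters (hypothetical, far from complete).
LATIN_TO_SORANI = {"p": "پ", "u": "و", "t": "ت", "i": "ی", "n": "ن"}

def to_latin(text: str) -> str:
    # Romanize character by character; unknown characters pass through.
    return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in text)

def latin_to_sorani(text: str) -> str:
    # Map each lowercase Latin letter to a Sorani letter.
    return "".join(LATIN_TO_SORANI.get(ch, ch) for ch in text.lower())

def latin_bridge(text: str) -> str:
    # Any script -> Latin -> Sorani.
    return latin_to_sorani(to_latin(text))

result = latin_bridge("Путин")
```

Here `to_latin("Путин")` gives "putin", and the second stage turns those five Latin letters into Sorani script, as in the Russian row of the table. A real mapping needs context rules (vowels, digraphs) that a per-letter table cannot express.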

3. ➗ Scientific & Mathematical Logic

Handles complex math that breaks most normalizers.

  • Scientific Notation: 5e-23 -> پێنج جارانی دە توانی سالب بیست و سێ
  • Functions: ln 4 -> لۆگاریتمی سروشتی چوار
  • Fraction Logic:
    • 1/2 -> نیوە
    • 3/4 -> سێ دابەش چوار
    • 120km/h -> ... بۆ هەر کاتژمێرێک (Context-aware "Per" rule)
    • 7/6 -> حەوت دابەش شەش (Context-aware "Division" rule)
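
The distinctions in the list above can be sketched with a few regular expressions. This is an illustrative classifier, not the library's code: `split_scientific` and `classify_slash` are hypothetical names, and the sketch returns labels rather than Kurdish text.

```python
import re
from typing import Optional, Tuple

def split_scientific(token: str) -> Optional[Tuple[str, int]]:
    # "5e-23" -> ("5", -23): mantissa and signed power of ten.
    m = re.fullmatch(r"([+-]?\d+(?:\.\d+)?)[eE]([+-]?\d+)", token)
    return (m.group(1), int(m.group(2))) if m else None

def classify_slash(token: str) -> str:
    # Decide how the "/" in a token should be read aloud.
    if token == "1/2":
        return "half"      # special-cased, as in the list above
    if re.fullmatch(r"\d+(?:\.\d+)?[A-Za-z]+/[A-Za-z]+", token):
        return "per"       # number + unit over unit, e.g. 120km/h
    if re.fullmatch(r"\d+/\d+", token):
        return "division"  # plain number over number, e.g. 3/4
    return "unknown"
```

For example, `classify_slash("120km/h")` returns "per" while `classify_slash("7/6")` returns "division", and `split_scientific("5e-23")` returns `("5", -23)`.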

4. 📞 Smart Phone Numbers

Recognizes Iraqi and International phone formats and groups digits for natural reading (4-3-2-2 format).

  • 07501234567 -> سفر حەوت سەد و پەنجا ...
  • +964... -> کۆ نۆ سەد و شەست و چوار ...
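
The 4-3-2-2 grouping can be sketched in a few lines. `group_phone` is a hypothetical helper written for this example. It only splits the digits; reading each group aloud (including the leading zero, which the Quick Start output reads separately) would be a later step.

```python
from typing import List, Sequence

def group_phone(digits: str, pattern: Sequence[int] = (4, 3, 2, 2)) -> List[str]:
    # Split an 11-digit local number into groups of 4, 3, 2 and 2 digits.
    if not digits.isdigit() or len(digits) != sum(pattern):
        raise ValueError("expected exactly %d digits" % sum(pattern))
    groups, pos = [], 0
    for size in pattern:
        groups.append(digits[pos:pos + size])
        pos += size
    return groups

groups = group_phone("07501234567")  # ["0750", "123", "45", "67"]
```

These four groups match the Quick Start output, which reads the number as zero, seven hundred and fifty, one hundred and twenty-three, forty-five, sixty-seven.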

5. 💻 Web & Technical Entities

  • URLs: www.google.com -> دەبڵیو دەبڵیو دەبڵیو دۆت گووگڵ دۆت کۆم
  • Emails: info@gmail.com -> ... ئەت جیمەیڵ دۆت کۆم (Recognizes common domains)
  • Codes: A1-B2 -> ئەی یەک داش بی دوو (Character-by-character reading)
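
Character-by-character reading can be sketched as a lookup per character. The table below is a hypothetical five-entry excerpt whose spoken names are taken from the A1-B2 example above. `spell_code` is an illustrative name, not a function of the library.

```python
# Spoken names for individual characters (excerpt based on the A1-B2 example).
CHAR_NAMES = {"A": "ئەی", "B": "بی", "1": "یەک", "2": "دوو", "-": "داش"}

def spell_code(code: str) -> str:
    # Read a technical code one character at a time; unknown characters
    # are passed through unchanged.
    return " ".join(CHAR_NAMES.get(ch, ch) for ch in code)

spoken = spell_code("A1-B2")
```

The result is the five names joined by spaces, in the same order as the characters of the code.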

6. 📏 Context-Aware Units

Solves the ambiguity between units and letters.

  • 10m -> دە مەتر
  • I am m -> ئای ئەم ئێم (Letter M)
  • 12.5kg -> دوازدە کیلۆگرام و نیو (Handles .5 as "Half")
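
One simple way to separate units from letters is to accept a unit symbol only when it directly follows a number. The sketch below assumes that rule; the library's actual heuristics are not documented here and may differ. `UNIT_RE` and `find_units` are hypothetical names, and only two unit symbols are covered.

```python
import re
from typing import List, Tuple

# A unit counts only when attached to a number (an optional single space is
# allowed); a bare "m" with no number in front of it is left for letter reading.
UNIT_RE = re.compile(r"(?<![A-Za-z])(\d+(?:\.\d+)?)\s?(kg|m)\b")

def find_units(text: str) -> List[Tuple[str, str]]:
    # Return (number, unit) pairs found in the text.
    return [(m.group(1), m.group(2)) for m in UNIT_RE.finditer(text)]

with_unit = find_units("10m")        # [("10", "m")]
letter_only = find_units("I am m")   # []
decimal = find_units("12.5kg")       # [("12.5", "kg")]
```

The ".5 as half" reading from the last bullet would then be a decision made on the captured number string, after the unit has been identified.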

🎛️ Configuration

You can fully customize the pipeline by passing a NormalizationConfig object.

from ckb_textify.core.pipeline import Pipeline
from ckb_textify.core.types import NormalizationConfig

config = NormalizationConfig(
    enable_phone=False,        # Keep phone numbers as digits
    enable_transliteration=False, # Disable foreign script transliteration
    shadda_mode="remove",      # "remove" or "double" (default)
    emoji_mode="convert",      # "remove" (default), "convert", "ignore"
    enable_math=True           # Normalizes math expressions
)

pipe = Pipeline(config)
print(pipe.normalize("Text..."))

Available Options

| Key | Default | Description |
| --- | --- | --- |
| enable_numbers | True | Convert 123 to text. |
| enable_web | True | Spells out URLs/Emails. |
| enable_phone | True | Groups and reads phone numbers. |
| enable_units | True | Expands km, kg, etc. |
| enable_math | True | Handles scientific notation and math symbols. |
| diacritics_mode | "convert" | Convert Arabic Harakat to Kurdish vowels. |
| shadda_mode | "double" | Doubles the letter for Shadda (مّ -> مم). |
| emoji_mode | "remove" | Removes emojis. Set to "convert" to speak them. |
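
As an illustration of what the two shadda_mode values mean, the following sketch re-implements the behavior described in the table on a plain string. It is not the library's code: `apply_shadda_mode` is a hypothetical function, and it assumes the shadda mark (U+0651) immediately follows its base letter, which is not guaranteed for every input (for example, after Unicode normalization a vowel mark may sit between them).

```python
SHADDA = "\u0651"  # ARABIC SHADDA combining mark

def apply_shadda_mode(text: str, mode: str = "double") -> str:
    if mode == "remove":
        # Drop the mark and keep a single letter.
        return text.replace(SHADDA, "")
    if mode == "double":
        out = []
        for ch in text:
            if ch == SHADDA:
                if out:
                    # Assumption: the previous character is the base letter.
                    out.append(out[-1])
            else:
                out.append(ch)
        return "".join(out)
    raise ValueError("mode must be 'double' or 'remove'")

doubled = apply_shadda_mode("م" + SHADDA, "double")  # letter written twice
removed = apply_shadda_mode("م" + SHADDA, "remove")  # single letter, no mark
```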

🤝 Contributing

Contributions are very welcome! If you have ideas for new rules, have found a bug, or want to add support for more units:

  1. Fork the repository.
  2. Clone locally.
  3. Create a branch (git checkout -b feature/new-rule).
  4. Run Tests (python -m unittest discover tests).
  5. Submit a Pull Request.

👨‍💻 Author

Razwan M. Haji

📄 License

This project is licensed under the MIT License.
