
A comprehensive Text Normalization library for Central Kurdish (Sorani).


ckb-textify


ckb-textify is an industrial-strength Text Normalization and Transliteration library designed specifically for Central Kurdish (Sorani).

It transforms "messy" real-world text (mixed languages, symbols, math, codes) into clean, spoken Kurdish text, making it the perfect pre-processor for Text-to-Speech (TTS) and NLP models.

🚀 Live Demo

Try the library instantly without installing anything: 👉 Click here to open the Live App

🔮 Next Step: Phonemization (G2P)

This library handles normalization (Text-to-Text). If you need Phonemization (Text-to-Sounds/IPA) for building a TTS model, check out the companion project:

🦁 ckb-g2p (Central Kurdish Grapheme-to-Phoneme)

📦 Installation

pip install ckb-textify

Dependencies:

  • eng-to-ipa: For accurate English pronunciation.

  • anyascii: For universal transliteration (Chinese, Russian, etc.).

⚡ Usage

from ckb_textify import convert_all

text = """
سڵاو، پەیوەندی بکە بە 07501234567.
نرخی زێڕ ≈ $2500.
کۆدەکە A1-B2 یە.
سڵاو لە Peter و Xi Jinping.
"""

normalized = convert_all(text)

print(normalized)

# Output:
# سڵاو, پەیوەندی بکە بە سفر حەوت سەد و پەنجا سەد و بیست و سێ چل و پێنج شەست و حەوت.
# نرخی زێڕ نزیکەی دوو هەزار و پێنج سەد دۆلار یە.
# کۆدەکە ئەی یەک داش بی دوو یە.
# سڵاو لە پیتەر و سی جینپینگ.

🌟 Features (v3.0.0)

1. 🌍 Universal Script Support

Transliterates almost any language into Sorani script using a "Latin Bridge" technique.

  • Chinese: 你好 → نی هاو

  • Russian: Путин → پوتین

  • Greek: Χαίρετε → چایرێت

  • German/French: Handles accents (Straße → ستراسسە, République → ڕیپەبلیک).

2. ➗ Advanced Math & Science

  • Scientific Notation: 5e-23 → پێنج جارانی دە توانی سالب بیست و سێ

  • Functions: ln 4 → لۆگاریتمی سروشتی چوار

  • Context-Aware: Distinguishes Division (7/6) from Rates (km/h).
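One way to picture the scientific-notation and slash handling above is a pair of small regex checks. This is only a sketch: the pattern names and the helpers `parse_scientific` and `classify_slash` are hypothetical and are not taken from ckb-textify's source.

```python
import re

# Hypothetical patterns for illustration; not ckb-textify's actual rules.
SCI_RE = re.compile(r"([+-]?\d+(?:\.\d+)?)[eE]([+-]?\d+)")
DIVISION_RE = re.compile(r"\d+(?:\.\d+)?/\d+(?:\.\d+)?")
RATE_RE = re.compile(r"[A-Za-z]+/[A-Za-z]+")


def parse_scientific(token):
    """Split a token like '5e-23' into (mantissa string, integer exponent)."""
    match = SCI_RE.fullmatch(token)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def classify_slash(token):
    """Label a slash token as a division (numbers) or a rate (unit symbols)."""
    if DIVISION_RE.fullmatch(token):
        return "division"
    if RATE_RE.fullmatch(token):
        return "rate"
    return "unknown"


print(parse_scientific("5e-23"))
print(classify_slash("7/6"), classify_slash("km/h"))
```

A real normalizer would also need the surrounding words to decide borderline cases; the sketch only looks at the token itself.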

3. 📞 Smart Phone Numbers

Handles Iraqi/Kurdish formats with intelligent grouping (4-3-2-2).

  • 07501234567 → سفر حەوت سەد و پەنجا...

  • +964... → کۆ نۆ سەد و شەست و چوار...
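The 4-3-2-2 grouping can be sketched as a simple slicing step that runs before each group is read out as a number. The helper name `group_digits` is hypothetical, and the sketch assumes an 11-digit local number such as the one in the example above; it does not reproduce the library's own implementation.

```python
def group_digits(digits, sizes=(4, 3, 2, 2)):
    """Split a digit string into consecutive groups of the given sizes."""
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected ASCII digits only")
    if len(digits) != sum(sizes):
        raise ValueError(f"expected {sum(sizes)} digits, got {len(digits)}")
    groups = []
    start = 0
    for size in sizes:
        groups.append(digits[start:start + size])
        start += size
    return groups


print(group_digits("07501234567"))  # ['0750', '123', '45', '67']
```

In the example output above, the leading zero of the first group is read on its own (سفر) before the remaining 750.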

4. 🔡 English Transliteration (IPA)

Uses the International Phonetic Alphabet to pronounce English words correctly.

  • Phone → فۆن (Not "پھۆنە")

  • Google → گووگڵ

  • Acronyms: GPT → جی پی تی

5. 💻 Web & Technical

Spells out technical strings character by character; as the examples show, familiar parts such as gmail, com, and net are read as words.

  • Emails: info@gmail.com → ئای ئێن ئێف ئۆ ئەت جیمەیڵ دۆت کۆم

  • URLs: www.razwan.net → دابڵیو دابڵیو دابڵیو دۆت ئاڕ ئەی زێت یو ئەی ئێن دۆت نێت

  • Codes: A1-B2 → ئەی یەک داش بی دوو
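Character-by-character reading amounts to a lookup table from each character to its spoken name. The sketch below is illustrative only: the table holds just the five names that appear in the A1-B2 example, and the names `NAMES` and `spell_out` are hypothetical rather than part of the library.

```python
# Spoken names copied from the A1-B2 example; a real table would cover
# every letter, digit, and symbol.
NAMES = {
    "A": "ئەی",
    "B": "بی",
    "1": "یەک",
    "2": "دوو",
    "-": "داش",
}


def spell_out(code):
    """Join the spoken name of each character with spaces."""
    words = []
    for ch in code:
        if ch not in NAMES:
            raise ValueError(f"no spoken name for {ch!r}")
        words.append(NAMES[ch])
    return " ".join(words)


print(spell_out("A1-B2"))
```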

6. 📏 Units & Measurements

Solves ambiguity between units and nouns.

  • 10m → دە مەتر (Unit) vs mm (Noun/Letter)

  • 120km/h → سەد و بیست کیلۆمەتر بۆ هەر کاتژمێرێک
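A common heuristic for this kind of ambiguity is to treat a symbol as a unit only when a number comes directly before it. The sketch below assumes that heuristic and a tiny hypothetical unit list; `UNIT_RE` and `find_units` are illustrative names, not the library's rules.

```python
import re

# Longer symbols come first so that "km/h" is not cut short to "km" or "m".
UNIT_RE = re.compile(
    r"(?<![A-Za-z\d])(\d+(?:\.\d+)?)\s?(km/h|kg|cm|km|m)(?![A-Za-z])"
)


def find_units(text):
    """Return (number, unit) pairs for unit symbols that follow a number."""
    return UNIT_RE.findall(text)


print(find_units("10m"))      # [('10', 'm')]
print(find_units("120km/h"))  # [('120', 'km/h')]
print(find_units("mm"))       # []
```

Letters with no number in front of them are left alone, so they can be handled later as ordinary text.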

🎛️ Configuration

ckb-textify is fully configurable. You can enable or disable specific normalization modules by passing a dictionary to convert_all.

Default Configuration

By default, all features are enabled (True), and the diacritics mode is set to "convert".

Here is the full list of available options:

config = {
    # --- Foundational ---
    "normalize_characters": True,   # Unify letters (ی, ک, ه)
    "normalize_digits": True,       # Convert ١٢٣ -> 123

    # --- Diacritics & Harakat ---
    "diacritics_mode": "convert",   # Options: "convert" (Fatha->ە), "remove", "keep"
    "shadda_mode": "double",        # Options: "double" (مّ -> مم), "remove"
    "remove_tatweel": True,         # Remove elongation character (ـ)

    # --- Expansion Modules ---
    "date_time": True,              # Dates and Times
    "phone_numbers": True,          # Phone numbers grouping
    "units": True,                  # Unit expansion (kg, m, cm...)
    "per_rule": True,               # Rates (km/h -> ... bo her ...)
    "math": True,                   # Math operations (+, -, scientific)
    "currency": True,               # Currency symbols ($, IQD)
    "percentage": True,             # Percentages (%, ٪)

    # --- Textual Features ---
    "web": True,                    # URLs and Emails
    "technical": True,              # Technical codes (UUID, MAC)
    "abbreviations": True,          # Expand abbr (د. -> دکتۆر)
    "arabic_names": True,           # Normalize names (محمد -> موحەمەد)
    "latin": True,                  # Transliterate English/Latin text
    "foreign": True,                # Transliterate other scripts (Chinese, Russian)
    "symbols": True,                # Common symbols (@, #, &)

    # --- Number Types ---
    "decimals": True,               # Decimal numbers
    "integers": True                # Integer numbers
}

# Example: Disable phone numbers and change diacritics mode
convert_all(text, config={"phone_numbers": False, "diacritics_mode": "remove"})
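Because options are plain dictionary keys, a misspelled key is easy to miss. The sketch below shows one way to guard against that before calling convert_all. `DEFAULT_CONFIG` mirrors the option list above, while `build_config` is a hypothetical helper, not part of ckb-textify; how the library itself treats unknown keys is not stated here.

```python
# Defaults copied from the option list above.
DEFAULT_CONFIG = {
    "normalize_characters": True,
    "normalize_digits": True,
    "diacritics_mode": "convert",
    "shadda_mode": "double",
    "remove_tatweel": True,
    "date_time": True,
    "phone_numbers": True,
    "units": True,
    "per_rule": True,
    "math": True,
    "currency": True,
    "percentage": True,
    "web": True,
    "technical": True,
    "abbreviations": True,
    "arabic_names": True,
    "latin": True,
    "foreign": True,
    "symbols": True,
    "decimals": True,
    "integers": True,
}


def build_config(**overrides):
    """Return the defaults with overrides applied, rejecting unknown keys."""
    unknown = sorted(set(overrides) - set(DEFAULT_CONFIG))
    if unknown:
        raise KeyError("unknown option(s): " + ", ".join(unknown))
    return {**DEFAULT_CONFIG, **overrides}


config = build_config(phone_numbers=False, diacritics_mode="remove")
print(config["phone_numbers"], config["diacritics_mode"], config["math"])
```

The resulting dictionary can then be passed as `convert_all(text, config=config)`.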

🤝 Contributing

Contributions are welcome! If you have ideas for new rules, have found a bug, or want to add support for more units, please feel free to open a Pull Request.

  1. Fork the repository on GitHub.

  2. Clone your fork locally.

  3. Create a new branch for your feature (git checkout -b feature/amazing-feature).

  4. Run the tests to make sure everything works (python -m unittest discover tests).

  5. Commit your changes.

  6. Push to the branch and open a Pull Request.

🤝 Author

Razwan M. Haji

📄 License

This project is licensed under the MIT License.
