
A comprehensive Text Normalization library for Central Kurdish (Sorani).

Project description

ckb-textify


ckb-textify is a Text Normalization and Transliteration library designed specifically for Central Kurdish (Sorani).

It transforms "messy" real-world text (mixed languages, symbols, math, codes) into clean, spoken-form Kurdish text, which makes it well suited as a pre-processor for Text-to-Speech (TTS) and NLP models.

🚀 Live Demo

Try the library instantly without installing anything: 👉 Click here to open the Live App

🔮 Next Step: Phonemization (G2P)

This library handles normalization (Text-to-Text). If you need Phonemization (Text-to-Sounds/IPA) for building a TTS model, check out the companion project:

🦁 ckb-g2p (Central Kurdish Grapheme-to-Phoneme)

📦 Installation

pip install ckb-textify

Dependencies:

  • eng-to-ipa: For accurate English pronunciation.

  • anyascii: For universal transliteration (Chinese, Russian, etc.).

⚡ Usage

from ckb_textify import convert_all

text = """
سڵاو، پەیوەندی بکە بە 07501234567.
نرخی زێڕ ≈ $2500.
کۆدەکە A1-B2 یە.
سڵاو لە Peter و Xi Jinping.
"""

normalized = convert_all(text)

print(normalized)

# Output:
# سڵاو, پەیوەندی بکە بە سفر حەوت سەد و پەنجا سەد و بیست و سێ چل و پێنج شەست و حەوت.
# نرخی زێڕ نزیکەی دوو هەزار و پێنج سەد دۆلار یە.
# کۆدەکە ئەی یەک داش بی دوو یە.
# سڵاو لە پیتەر و سی جینپینگ.

🌟 Features (v4.0.1)

1. 🌍 Universal Script Support

Transliterates almost any language into Sorani script using a "Latin Bridge" technique.

  • Chinese: 你好 → نی هاو

  • Russian: Путин → پوتین

  • Greek: Χαίρετε → چایرێت

  • German/French: Handles accents (Straße → ستراسسە, République → ڕیپەبلیک).
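
The "Latin Bridge" idea can be pictured as two stages: first romanize the foreign script, then map the Latin letters to Sorani letters. The sketch below is only an illustration of that idea, not the library's implementation. Both lookup tables and all function names are hypothetical and cover just the letters needed for the Russian example above; the library itself lists anyascii as its dependency for the romanization step.

```python
# Stage 1 (hypothetical mini table): source script -> Latin letters.
CYRILLIC_TO_LATIN = {"П": "P", "у": "u", "т": "t", "и": "i", "н": "n"}

# Stage 2 (hypothetical mini table): Latin letters -> Sorani letters.
LATIN_TO_SORANI = {"p": "پ", "u": "و", "t": "ت", "i": "ی", "n": "ن"}


def to_latin(text: str) -> str:
    """Romanize each character; unknown characters pass through unchanged."""
    return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in text)


def to_sorani(latin: str) -> str:
    """Map lower-cased Latin letters to Sorani letters, one by one."""
    return "".join(LATIN_TO_SORANI.get(ch, ch) for ch in latin.lower())


def latin_bridge(text: str) -> str:
    """Chain the two stages: foreign script -> Latin -> Sorani."""
    return to_sorani(to_latin(text))


print(to_latin("Путин"))      # Putin
print(latin_bridge("Путин"))  # پوتین
```

A real letter-by-letter mapping would not be enough for English or French spelling, which is presumably why the library treats English separately through IPA.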

2. ➗ Advanced Math & Science

  • Scientific Notation: 5e-23 → پێنج جارانی دە توانی سالب بیست و سێ

  • Functions: ln 4 → لۆگاریتمی سروشتی چوار

  • Context-Aware: Distinguishes Division (7/6) from Rates (km/h).
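
One way to picture the division-versus-rate distinction is to look at what sits on each side of the slash. The following is a minimal sketch under that assumption; the function name `classify_slash` and the regular expressions are illustrative and are not taken from the library.

```python
import re

NUMBER = r"\d+(?:\.\d+)?"  # integer or decimal
UNIT = r"[A-Za-z]+"        # assumption: unit symbols are plain Latin letters


def classify_slash(token: str) -> str:
    """Classify a slash-separated token as 'division', 'rate' or 'unknown'."""
    # Numbers on both sides of the slash: read it as a division.
    if re.fullmatch(rf"{NUMBER}/{NUMBER}", token):
        return "division"
    # Unit symbols on both sides, optionally preceded by a quantity: a rate.
    if re.fullmatch(rf"(?:{NUMBER})?{UNIT}/{UNIT}", token):
        return "rate"
    return "unknown"


print(classify_slash("7/6"))      # division
print(classify_slash("km/h"))     # rate
print(classify_slash("120km/h"))  # rate
```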

3. 📞 Smart Phone Numbers

Handles Iraqi/Kurdish formats with intelligent grouping (4-3-2-2).

  • 07501234567 → سفر حەوت سەد و پەنجا...

  • +964... → کۆ نۆ سەد و شەست و چوار...
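
The 4-3-2-2 grouping can be sketched with plain string slicing. This is an illustration of the grouping step only, assuming a local number is a leading 0 followed by ten digits; the helper name `group_local_mobile` is hypothetical and the library's own logic may differ.

```python
import re
from typing import List, Optional


def group_local_mobile(number: str) -> Optional[List[str]]:
    """Split an 11-digit local number such as 07501234567 into 4-3-2-2 groups."""
    # Drop spaces and hyphens that are often typed inside phone numbers.
    digits = re.sub(r"[\s-]", "", number)
    # Assumption: a local mobile number is a leading 0 followed by 10 digits.
    if not re.fullmatch(r"0\d{10}", digits):
        return None
    return [digits[0:4], digits[4:7], digits[7:9], digits[9:11]]


print(group_local_mobile("07501234567"))  # ['0750', '123', '45', '67']
```

Each group would then be read as its own number, which matches the spoken output in the usage example (zero, seven hundred and fifty, one hundred and twenty-three, forty-five, sixty-seven).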

4. 🔡 English Transliteration (IPA)

Uses the International Phonetic Alphabet to pronounce English words correctly.

  • Phone → فۆن (Not "پھۆنە")

  • Google → گووگڵ

  • Acronyms: GPT → جی پی تی

5. 💻 Web & Technical

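Spells out technical strings character by character, while reading familiar parts (such as "gmail", "com", or "net" in the examples below) as words.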

  • Emails: info@gmail.com → ئای ئێن ئێف ئۆ ئەت جیمەیڵ دۆت کۆم

  • URLs: www.razwan.net → دابڵیو دابڵیو دابڵیو دۆت ئاڕ ئەی زێت یو ئەی ئێن دۆت نێت

  • Codes: A1-B2 → ئەی یەک داش بی دوو
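
Character-by-character reading amounts to a lookup of spoken names. The sketch below reproduces the A1-B2 example with a deliberately tiny table whose entries are copied from the outputs shown above; the table `SPOKEN_NAMES` and the function `spell_out` are hypothetical names, not part of the library's API.

```python
# Spoken names taken from the examples in this README (only what A1-B2 needs).
SPOKEN_NAMES = {
    "A": "ئەی",
    "B": "بی",
    "1": "یەک",
    "2": "دوو",
    "-": "داش",
}


def spell_out(code: str) -> str:
    """Replace every character with its spoken name, separated by spaces."""
    # Characters missing from the table are passed through unchanged.
    return " ".join(SPOKEN_NAMES.get(ch.upper(), ch) for ch in code)


print(spell_out("A1-B2"))  # ئەی یەک داش بی دوو
```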

6. 📏 Units & Measurements

Resolves the ambiguity between unit symbols and ordinary letters or nouns.

  • 10m → دە مەتر (Unit) vs mm (Noun/Letter)

  • 120km/h → سەد و بیست کیلۆمەتر بۆ هەر کاتژمێرێک
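
A simple way to approximate this disambiguation is to treat a symbol as a unit only when a number comes directly before it. The sketch below assumes exactly that rule and a small hypothetical unit list; it is not the library's actual rule set.

```python
import re
from typing import List, Tuple

# Assumption: a unit symbol counts as a unit only right after a quantity.
# The unit list is a small hypothetical sample.
MEASUREMENT = re.compile(r"(\d+(?:\.\d+)?)\s*(km|kg|cm|m)\b")


def find_measurements(text: str) -> List[Tuple[str, str]]:
    """Return (quantity, unit) pairs; bare letters such as 'mm' are ignored."""
    return MEASUREMENT.findall(text)


print(find_measurements("10m of cable, labelled mm"))  # [('10', 'm')]
print(find_measurements("3.5 kg"))                     # [('3.5', 'kg')]
```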

🎛️ Configuration

ckb-textify is fully configurable. You can enable or disable specific normalization modules by passing a dictionary to convert_all.

Default Configuration

By default, all features are enabled (True), and the diacritics mode is set to "convert".

Here is the full list of available options:

config = {
    # --- Foundational ---
    "normalize_characters": True,   # Unify letters (ی, ک, ه)
    "normalize_digits": True,       # Convert ١٢٣ -> 123

    # --- Diacritics & Harakat ---
    "diacritics_mode": "convert",   # Options: "convert" (Fatha->ە), "remove", "keep"
    "shadda_mode": "double",        # Options: "double" (مّ -> مم), "remove"
    "remove_tatweel": True,         # Remove elongation character (ـ)

    # --- Expansion Modules ---
    "date_time": True,              # Dates and Times
    "phone_numbers": True,          # Phone numbers grouping
    "units": True,                  # Unit expansion (kg, m, cm...)
    "per_rule": True,               # Rates (km/h -> ... bo her ...)
    "math": True,                   # Math operations (+, -, scientific)
    "currency": True,               # Currency symbols ($, IQD)
    "percentage": True,             # Percentages (%, ٪)

    # --- Textual Features ---
    "web": True,                    # URLs and Emails
    "technical": True,              # Technical codes (UUID, MAC)
    "abbreviations": True,          # Expand abbr (د. -> دکتۆر)
    "arabic_names": True,           # Normalize names (محمد -> موحەمەد)
    "latin": True,                  # Transliterate English/Latin text
    "foreign": True,                # Transliterate other scripts (Chinese, Russian)
    "symbols": True,                # Common symbols (@, #, &)

    # --- Number Types ---
    "decimals": True,               # Decimal numbers
    "integers": True                # Integer numbers
}

# Example: Disable phone numbers and change diacritics mode
convert_all(text, config={"phone_numbers": False, "diacritics_mode": "remove"})
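
When overrides are assembled in application code, a misspelled key is easy to miss. The following is a small sketch of a helper that validates overrides before they are passed on. The name `build_config` is hypothetical, `DEFAULTS` copies only a few of the options listed above, and whether the library itself rejects unknown keys is not stated here.

```python
from typing import Any, Dict

# A subset of the documented defaults, copied from the listing above.
DEFAULTS: Dict[str, Any] = {
    "phone_numbers": True,
    "units": True,
    "math": True,
    "diacritics_mode": "convert",
    "shadda_mode": "double",
}
DIACRITICS_MODES = {"convert", "remove", "keep"}


def build_config(**overrides: Any) -> Dict[str, Any]:
    """Merge overrides onto the defaults, rejecting unknown or invalid options."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"Unknown option(s): {sorted(unknown)}")
    config = {**DEFAULTS, **overrides}
    if config["diacritics_mode"] not in DIACRITICS_MODES:
        raise ValueError(f"Invalid diacritics_mode: {config['diacritics_mode']!r}")
    return config


config = build_config(phone_numbers=False, diacritics_mode="remove")
print(config["phone_numbers"], config["diacritics_mode"])  # False remove
```

The resulting dictionary could then be passed along as convert_all(text, config=config).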

🤝 Contributing

Contributions are very welcome! If you have ideas for new rules, have found a bug, or want to add support for more units, please feel free to open a Pull Request.

  1. Fork the repository on GitHub.

  2. Clone your fork locally.

  3. Create a new branch for your feature (git checkout -b feature/amazing-feature).

  4. Run the tests to ensure everything is working (python -m unittest discover tests).

  5. Commit your changes.

  6. Push to the branch and open a Pull Request.

🤝 Author

Razwan M. Haji

📄 License

This project is licensed under the MIT License.
