A comprehensive Text Normalization library for Central Kurdish (Sorani).

ckb-textify

ckb-textify is an industrial-strength Text Normalization and Transliteration library designed specifically for Central Kurdish (Sorani).

It transforms "messy" real-world text (mixed languages, symbols, math, codes) into clean, spoken Kurdish text, which makes it well suited as a pre-processor for Text-to-Speech (TTS) and NLP models.

🚀 Live Demo

Try the library instantly without installing anything: 👉 Click here to open the Live App

🔮 Next Step: Phonemization (G2P)

This library handles normalization (Text-to-Text). If you need Phonemization (Text-to-Sounds/IPA) for building a TTS model, check out the companion project:

🦁 ckb-g2p (Central Kurdish Grapheme-to-Phoneme)

📦 Installation

pip install ckb-textify

Dependencies:

  • eng-to-ipa: For accurate English pronunciation.
  • anyascii: For universal transliteration (Chinese, Russian, etc.).

⚡ Usage

from ckb_textify import convert_all

text = """
سڵاو، پەیوەندی بکە بە 07501234567.
نرخی زێڕ ≈ $2500.
کۆدەکە A1-B2 یە.
سڵاو لە Peter و Xi Jinping.
"""

normalized = convert_all(text)

print(normalized)

# Output:
# سڵاو, پەیوەندی بکە بە سفر حەوت سەد و پەنجا سەد و بیست و سێ چل و پێنج شەست و حەوت.
# نرخی زێڕ نزیکەی دوو هەزار و پێنج سەد دۆلار یە.
# کۆدەکە ئەی یەک داش بی دوو یە.
# سڵاو لە پیتەر و سی جینپینگ.

🌟 Features (v3.0.0)

1. 🌍 Universal Script Support

Transliterates text in most scripts into Sorani script using a "Latin Bridge" technique: the input is first romanized into Latin letters, which are then rendered in Sorani script.

  • Chinese: 你好 → نی هاو
  • Russian: Путин → پوتین
  • Greek: Χαίρετε → چایرێت
  • German/French: Handles accents and special letters (Straße → ستراسسە, République → ڕیپەبلیک).
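
The two-step idea behind the "Latin Bridge" can be sketched in a few lines of standard-library Python. This is only an illustration, not the library's implementation: the README lists anyascii as the dependency for universal transliteration, while the tables `CYRILLIC_TO_LATIN` and `LATIN_TO_SORANI` and the functions `to_latin`, `latin_to_sorani`, and `bridge` below are hypothetical and cover just enough letters for one example.

```python
import unicodedata

# Hypothetical, deliberately tiny tables for illustration only.
# A real romanizer (such as anyascii) covers far more scripts and letters.
CYRILLIC_TO_LATIN = {"П": "P", "у": "u", "т": "t", "и": "i", "н": "n"}
LATIN_TO_SORANI = {"p": "پ", "u": "و", "t": "ت", "i": "ی", "n": "ن"}


def to_latin(text: str) -> str:
    """Step 1: romanize. Map known letters, then strip accents from the rest."""
    mapped = "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in text)
    decomposed = unicodedata.normalize("NFKD", mapped)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def latin_to_sorani(latin: str) -> str:
    """Step 2: map each Latin letter to a Sorani letter; unknown letters pass through."""
    return "".join(LATIN_TO_SORANI.get(ch, ch) for ch in latin.lower())


def bridge(text: str) -> str:
    """Run both steps: source script -> Latin -> Sorani."""
    return latin_to_sorani(to_latin(text))


print(to_latin("République"))  # Republique
print(to_latin("Путин"))       # Putin
print(bridge("Путин"))
```

A letter-by-letter mapping like this ignores pronunciation, which is why a real system needs per-language rules on top of the bridge.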

2. ➗ Advanced Math & Science

  • Scientific Notation: 5e-23 → پێنج جارانی دە توانی سالب بیست و سێ
  • Functions: ln 4 → لۆگاریتمی سروشتی چوار
  • Context-Aware: Distinguishes Division (7/6) from Rates (km/h).
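
As a rough sketch of what "context-aware" can mean here, the snippet below splits a scientific-notation token into mantissa and exponent and classifies a slash by what surrounds it. The patterns and the function names `split_scientific` and `classify_slash` are assumptions made for this example; the library's actual rules are not shown in this README and are likely more involved.

```python
import re
from typing import Optional, Tuple

# Illustrative patterns only, not taken from ckb-textify.
SCI = re.compile(r"^([+-]?\d+(?:\.\d+)?)[eE]([+-]?\d+)$")
DIVISION = re.compile(r"^\d+(?:\.\d+)?/\d+(?:\.\d+)?$")
RATE = re.compile(r"^[A-Za-z]+/[A-Za-z]+$")


def split_scientific(token: str) -> Optional[Tuple[str, int]]:
    """Return (mantissa, exponent) for tokens like '5e-23', else None."""
    match = SCI.match(token)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def classify_slash(token: str) -> str:
    """A slash between numbers is a division; between unit-like words it is a rate."""
    if DIVISION.match(token):
        return "division"
    if RATE.match(token):
        return "rate"
    return "unknown"


print(split_scientific("5e-23"))  # ('5', -23)
print(classify_slash("7/6"))      # division
print(classify_slash("km/h"))     # rate
```

Keeping the mantissa and exponent separate mirrors the spoken form above, where each part is read out as its own number.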

3. 📞 Smart Phone Numbers

Handles Iraqi/Kurdish formats with intelligent grouping (4-3-2-2).

  • 07501234567 → سفر حەوت سەد و پەنجا...
  • +964... → کۆ نۆ سەد و شەست و چوار...
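
The 4-3-2-2 grouping for an 11-digit local number can be sketched as follows. The function `group_phone` is hypothetical and only shows the splitting step; reading each group aloud in Kurdish, and handling the +964 international form, are left out.

```python
from typing import List, Tuple


def group_phone(number: str, pattern: Tuple[int, ...] = (4, 3, 2, 2)) -> List[str]:
    """Split the digits of a phone number into groups of the given sizes."""
    digits = "".join(ch for ch in number if ch.isdigit())
    if len(digits) != sum(pattern):
        raise ValueError(f"expected {sum(pattern)} digits, got {len(digits)}")
    groups = []
    start = 0
    for size in pattern:
        groups.append(digits[start:start + size])
        start += size
    return groups


print(group_phone("07501234567"))  # ['0750', '123', '45', '67']
```

Each group is then spoken as one number, which matches the sample output in the Usage section: zero, seven hundred and fifty, one hundred and twenty-three, forty-five, sixty-seven.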

4. 🔡 English Transliteration (IPA)

Uses the International Phonetic Alphabet to pronounce English words correctly.

  • Phone → فۆن (Not "پھۆنە")
  • Google → گووگڵ
  • Acronyms: GPT → جی پی تی

5. 💻 Web & Technical

Reads technical strings character-by-character.

  • Emails: info@gmail.com → ئای ئێن ئێف ئۆ ئەت جیمەیڵ دۆت کۆم
  • URLs: www.razwan.net → دابڵیو دابڵیو دابڵیو دۆت ئاڕ ئەی زێت یو ئەی ئێن دۆت نێت
  • Codes: A1-B2 → ئەی یەک داش بی دوو
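
Character-by-character reading amounts to a lookup table from symbols to their spoken names. The sketch below uses only the names that appear in the A1-B2 example above; the table `SPOKEN` and the function `spell_out` are illustrative and are not the library's internals.

```python
# Spoken names copied from the A1-B2 example above; everything else is left as-is.
SPOKEN = {
    "a": "ئەی",
    "b": "بی",
    "1": "یەک",
    "2": "دوو",
    "-": "داش",
}


def spell_out(code: str) -> str:
    """Read a code one character at a time, keeping unknown characters unchanged."""
    return " ".join(SPOKEN.get(ch.lower(), ch) for ch in code)


print(spell_out("A1-B2"))  # ئەی یەک داش بی دوو
```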

6. 📏 Units & Measurements

Resolves the ambiguity between unit symbols and ordinary letters or nouns.

  • 10m → دە مەتر (Unit) vs m → m (Noun/Letter)
  • 120km/h → سەد و بیست کیلۆمەتر بۆ هەر کاتژمێرێک
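
One simple way to tell a unit from a bare letter is to require a number directly in front of the symbol. The sketch below assumes that rule; the pattern `UNIT_RE`, the table `UNIT_NAMES`, and the function `find_units` are hypothetical, and the library may use richer context than this.

```python
import re
from typing import List, Tuple

# Unit names copied from the examples above; the table is illustrative only.
UNIT_NAMES = {
    "km/h": "کیلۆمەتر بۆ هەر کاتژمێرێک",
    "m": "مەتر",
}

# Longer symbols come first so "km/h" is tried before "m".
# The lookahead rejects symbols that continue into a longer word, e.g. "10mb".
UNIT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(km/h|m)(?![A-Za-z/])")


def find_units(text: str) -> List[Tuple[str, str]]:
    """Return (number, spoken unit name) pairs; a symbol without a number is ignored."""
    return [(m.group(1), UNIT_NAMES[m.group(2)]) for m in UNIT_RE.finditer(text)]


print(find_units("10m"))      # one match: the number 10 with the unit name for "m"
print(find_units("m"))        # []
print(find_units("120km/h"))  # one match: the number 120 with the unit name for "km/h"
```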

🎛️ Configuration

You can disable specific modules if needed:

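from ckb_textify import convert_all

text = "سڵاو لە Peter و Xi Jinping."

config = {
    "phone_numbers": False,  # Disable phone number handling
    "foreign": False  # Disable Chinese/Russian transliteration
}

normalized = convert_all(text, config=config)
print(normalized)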
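
To make the switch semantics concrete, here is a minimal sketch of how per-module flags could gate a normalization pipeline. It is not ckb-textify's code: `run_pipeline` and the stand-in steps are hypothetical, and treating a missing key as "enabled" is an assumption based on the wording "disable specific modules" above.

```python
from typing import Callable, Dict, Optional

Step = Callable[[str], str]

# Stand-in steps that just tag what they would handle; the real modules do far more.
STEPS: Dict[str, Step] = {
    "phone_numbers": lambda text: text.replace("0750", "[phone]"),
    "foreign": lambda text: text.replace("Peter", "[foreign]"),
}


def run_pipeline(text: str, steps: Dict[str, Step],
                 config: Optional[Dict[str, bool]] = None) -> str:
    """Apply each step in order unless the config sets its name to False."""
    config = config or {}
    for name, step in steps.items():
        if config.get(name, True):  # assumed default: enabled
            text = step(text)
    return text


print(run_pipeline("0750 Peter", STEPS))                      # [phone] [foreign]
print(run_pipeline("0750 Peter", STEPS, {"foreign": False}))  # [phone] Peter
```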

🤝 Contributing

Contributions are very welcome! If you have ideas for new rules, have found a bug, or want to add support for more units, please feel free to open a Pull Request.

  1. Fork the repository on GitHub.
  2. Clone your fork locally.
  3. Create a new branch for your feature (git checkout -b feature/amazing-feature).
  4. Run the tests to make sure everything still works (python -m unittest discover tests).
  5. Commit your changes.
  6. Push to the branch and open a Pull Request.

🤝 Author

Razwan. M Haji

📄 License

This project is licensed under the MIT License.
