A Python library for normalizing Kashmiri text (Perso-Arabic script)

KashmiriNormalizer

A Python library designed for normalizing Kashmiri text (Perso-Arabic script). It standardizes text by canonicalizing character variants, enforcing consistent punctuation spacing, and converting digits. It is intended for Natural Language Processing (NLP) pipelines and Machine Learning data preprocessing.

Features

  • Character Canonicalization: Maps multiple Unicode variants of Kashmiri characters to a single standard form using extensive character maps.
  • Code standardization: Handles common inconsistencies in Kashmiri typing.
  • Punctuation & Spacing: Automatically removes spaces before punctuation marks and ensures a single space follows them.
  • Digit Normalization: Converts Kashmiri (Perso-Arabic) digits to standard English (Latin) digits for consistency.
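
To make the canonicalization and punctuation rules above concrete, here is a minimal sketch of how such steps could be written. It is not the library's implementation: the two-entry `VARIANT_MAP`, the `PUNCT` set, and the function names `canonicalize` and `fix_punctuation_spacing` are all assumptions chosen for illustration.

```python
import re

# Illustrative variant map (an assumption, not the library's actual table):
# letters typed on an Arabic keyboard mapped to the forms common in Kashmiri text.
VARIANT_MAP = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})

# Punctuation marks handled by this sketch (assumed set):
# full stop, comma, semicolon, question mark, exclamation mark.
PUNCT = "\u06D4\u060C\u061B\u061F!"


def canonicalize(text: str) -> str:
    """Replace each known variant character with its canonical form."""
    return text.translate(VARIANT_MAP)


def fix_punctuation_spacing(text: str) -> str:
    """Remove spaces before punctuation and keep one space after it."""
    # Drop any whitespace that precedes a punctuation mark.
    text = re.sub(rf"\s+([{PUNCT}])", r"\1", text)
    # Put exactly one space after a mark that is followed by more text.
    text = re.sub(rf"([{PUNCT}])\s*(?=[^\s{PUNCT}])", r"\1 ", text)
    return text
```

A translation table keeps canonicalization to a single pass over the string, and the spacing rule leaves a mark at the very end of the text without a trailing space.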

Installation

Ensure you have Python 3.8 or higher installed.

You can install the package directly from GitHub:

pip install git+https://github.com/abdulmuizz0903/KashmiriNormalizer.git

Usage

Normalization

The normalize method is intended for cleaning text data. It performs canonicalization, digit conversion, and punctuation fixing.

from KashmiriNormalizer import KashmiriNormalizer

# Initialize the normalizer
kn = KashmiriNormalizer()

text = "مےٚ چُھ لۄکچارٕ پٮ۪ٹھٕ یہٕ عادت" # Example text

# Normalize the text
cleaned_text = kn.normalize(text)
print(cleaned_text)
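
For preprocessing a whole corpus, the same call can be applied line by line. The helper below is a hypothetical sketch, not part of the library: `normalize_corpus` is an invented name, and it accepts any callable that maps a string to a string, so `kn.normalize` from the example above can be passed in directly.

```python
from typing import Callable, Iterable, Iterator


def normalize_corpus(
    lines: Iterable[str], normalize: Callable[[str], str]
) -> Iterator[str]:
    """Yield normalized, non-empty, de-duplicated lines in input order."""
    seen = set()
    for line in lines:
        cleaned = normalize(line).strip()
        # Skip blank lines and lines already emitted after normalization.
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            yield cleaned
```

De-duplicating after normalization, rather than before, catches lines that differ only in character variants or spacing.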

Digit Handling

The library automatically converts Kashmiri digits to English digits during normalization.

digit_text = "١٢٣٤٥"
print(kn.normalize(digit_text)) 
# Output will have standardized English digits
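
Digit conversion of this kind can be sketched with a plain translation table. The version below is an illustration, not the library's code: `DIGIT_TABLE` and `to_latin_digits` are hypothetical names, and covering both the Arabic-Indic block (U+0660 to U+0669) and the Extended Arabic-Indic block (U+06F0 to U+06F9) is an assumption about which digits appear in Kashmiri text.

```python
# Map every digit in both Unicode blocks to its ASCII counterpart.
DIGIT_TABLE = {}
for offset in range(10):
    DIGIT_TABLE[0x0660 + offset] = str(offset)  # Arabic-Indic digits
    DIGIT_TABLE[0x06F0 + offset] = str(offset)  # Extended Arabic-Indic digits


def to_latin_digits(text: str) -> str:
    """Replace Arabic-Indic and Extended Arabic-Indic digits with 0-9."""
    return text.translate(DIGIT_TABLE)
```

Characters outside the two digit blocks pass through unchanged, so the function is safe to run on mixed text.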

Text-to-Speech (TTS) Normalization

The library includes a specialized TTSNormalizer class tailored for Text-to-Speech tasks. This class extends the base normalization with the following behavior:

  • Diacritic Preservation: Does not remove diacritics, which are crucial for correct pronunciation in Kashmiri.
  • Digit Expansion: Converts digits (both Kashmiri and English) into their Kashmiri word forms (e.g., "1" -> "اکھ").
    • Note: Requires populating the WORD_TO_DIGIT_MAP in constants.py.
  • Plat Ye Handling: Converts ؠ to ۍ at the end of words to align with standard writing rules.
  • Character Filtering: Removes any characters not present in the allowed Kashmiri character set (ALL_CHARACTERS), ensuring clean input for TTS models.

from KashmiriNormalizer import TTSNormalizer

# Initialize the TTS normalizer
tts_norm = TTSNormalizer()

text = "مےٚ چُھ 1 لۄکچارٕ پٮ۪ٹھٕ یہٕ عادت۔" 

# Normalize for TTS
tts_text = tts_norm.normalize(text)
print(tts_text)
# Output will have diacritics preserved, digits expanded to words, and non-Kashmiri chars removed.
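
The TTS-specific steps listed above can be sketched as three small functions. This is a hedged illustration, not the TTSNormalizer source: `DIGIT_WORDS`, `expand_digits`, `fix_final_yeh`, and `filter_allowed` are hypothetical names, the word map holds only the single entry given in this README, and treating whitespace, a few punctuation marks, or the end of the string as the end of a word is an assumption.

```python
import re

# Hypothetical digit-to-word map; only the "1" entry comes from this README.
DIGIT_WORDS = {"1": "اکھ"}

KASHMIRI_YEH = "\u0620"   # ARABIC LETTER KASHMIRI YEH
YEH_WITH_TAIL = "\u06CD"  # ARABIC LETTER YEH WITH TAIL

# A word is assumed to end at whitespace, selected punctuation, or end of text.
FINAL_YEH = re.compile(rf"{KASHMIRI_YEH}(?=[\s\u06D4\u060C\u061F]|$)")


def expand_digits(text: str) -> str:
    """Replace each run of digits with its word form when one is known."""
    def lookup(match: re.Match) -> str:
        # int() accepts any Unicode decimal digits, so Kashmiri and
        # English digits resolve to the same map key.
        key = str(int(match.group()))
        return DIGIT_WORDS.get(key, match.group())
    return re.sub(r"\d+", lookup, text)


def fix_final_yeh(text: str) -> str:
    """Convert a word-final Kashmiri yeh to yeh with tail."""
    return FINAL_YEH.sub(YEH_WITH_TAIL, text)


def filter_allowed(text: str, allowed: set) -> str:
    """Keep only the characters present in the allowed set."""
    return "".join(ch for ch in text if ch in allowed)
```

Numbers missing from the map are left as digits here, which makes gaps in the map easy to spot before the text reaches a TTS model.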

Dependencies

  • regex: Used for advanced Unicode string handling

Development

The project structure is as follows:

KashmiriNormalizer/
├── src/
│   └── KashmiriNormalizer/
│       ├── __init__.py
│       ├── constants.py     # Character maps and regex constants
│       └── normalizer.py    # Main Normalizer class
└── pyproject.toml           # Build configuration and dependencies
