A Python library for normalizing Kashmiri text (Perso-Arabic script)
KashmiriNormalizer
A Python library for normalizing Kashmiri text written in the Perso-Arabic script. It standardizes text by canonicalizing character variants, making punctuation spacing consistent, and converting digits. It is intended for Natural Language Processing (NLP) pipelines and Machine Learning data preprocessing.
Features
- Character Canonicalization: Maps multiple Unicode variants of Kashmiri characters to a single standard form using extensive character maps.
- Code standardization: Handles common inconsistencies in Kashmiri typing.
- Punctuation & Spacing: Automatically removes spaces before punctuation marks and ensures a single space follows them.
- Digit Normalization: Converts Kashmiri (Perso-Arabic) digits to standard English (Latin) digits for consistency.
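To make the idea of character canonicalization concrete, here is a minimal sketch using only the standard library. The two mappings below are illustrative assumptions (variant substitutions commonly applied to Perso-Arabic text) and are not taken from the library's own character maps in constants.py; the `canonicalize` helper is likewise hypothetical.

```python
# Hypothetical variant map for illustration only; the library's real
# character maps live in constants.py and may differ.
VARIANT_MAP = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
}

# str.maketrans turns the single-character mapping into a translation table.
_TABLE = str.maketrans(VARIANT_MAP)


def canonicalize(text: str) -> str:
    """Replace each known variant with its canonical form; leave the rest."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    sample = "\u0643\u064A"
    print([hex(ord(ch)) for ch in canonicalize(sample)])
```

A translation table keeps the lookup to a single pass over the string, which suits large preprocessing jobs.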
Installation
Ensure you have Python 3.8 or higher installed.
You can install the package directly from GitHub:
pip install git+https://github.com/abdulmuizz0903/KashmiriNormalizer.git
Usage
Normalization
The normalize method is intended for cleaning text data. It performs canonicalization, digit conversion, and punctuation fixing.
from KashmiriNormalizer import KashmiriNormalizer
# Initialize the normalizer
kn = KashmiriNormalizer()
text = "مےٚ چُھ لۄکچارٕ پٮ۪ٹھٕ یہٕ عادت" # Example text
# Normalize the text
cleaned_text = kn.normalize(text)
print(cleaned_text)
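The punctuation step described under Features (no space before a mark, exactly one space after it) can be sketched with the standard re module. This is an illustration under stated assumptions, not the library's implementation: the punctuation set and the `fix_punctuation_spacing` name are hypothetical.

```python
import re

# Assumed punctuation set: Arabic comma, semicolon, question mark,
# Arabic full stop, plus a few ASCII marks. The library's set may differ.
PUNCT = "\u060C\u061B\u061F\u06D4" + "!:.,;?"
_P = re.escape(PUNCT)

# One or more whitespace characters directly before a punctuation mark.
_SPACE_BEFORE = re.compile(rf"\s+([{_P}])")
# A punctuation mark, any whitespace, then a character that is neither
# whitespace nor punctuation (so runs like "?!" stay together).
_SPACE_AFTER = re.compile(rf"([{_P}])\s*(?=[^\s{_P}])")


def fix_punctuation_spacing(text: str) -> str:
    text = _SPACE_BEFORE.sub(r"\1", text)   # drop spaces before marks
    text = _SPACE_AFTER.sub(r"\1 ", text)   # force one space after marks
    return text


if __name__ == "__main__":
    print(fix_punctuation_spacing("a ,b"))  # a, b
```

Note that this simple rule also inserts a space inside strings such as decimal numbers ("3.14"), so a production version would need extra guards.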
Digit Handling
The library automatically converts Kashmiri digits to English digits during normalization.
digit_text = "١٢٣٤٥"
print(kn.normalize(digit_text))
# Output will have standardized English digits
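As a rough picture of what digit conversion involves, the sketch below maps both the Arabic-Indic digits (U+0660 to U+0669) and the Extended Arabic-Indic digits (U+06F0 to U+06F9) to ASCII digits. Which ranges the library actually covers is an assumption here, and `to_latin_digits` is a hypothetical helper, not part of the package.

```python
# Build a translation table from code point to ASCII digit string.
# Covering both digit ranges is an assumption made for this sketch.
_DIGIT_TABLE = {}
for base in (0x0660, 0x06F0):
    for value in range(10):
        _DIGIT_TABLE[base + value] = str(value)


def to_latin_digits(text: str) -> str:
    """Convert Arabic-Indic and Extended Arabic-Indic digits to 0-9."""
    return text.translate(_DIGIT_TABLE)


if __name__ == "__main__":
    print(to_latin_digits("\u0661\u0662\u0663\u0664\u0665"))  # 12345
```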
Text-to-Speech (TTS) Normalization
The library includes a specialized TTSNormalizer class tailored for Text-to-Speech tasks. This class extends the base normalization with:
- Diacritic Preservation: Keeps diacritics, which are crucial for correct pronunciation in Kashmiri.
- Digit Expansion: Converts digits (both Kashmiri and English) into their Kashmiri word forms (e.g., "1" -> "اکھ").
  - Note: Requires populating the WORD_TO_DIGIT_MAP in constants.py.
- Plat Ye Handling: Converts ؠ to ۍ at the end of words to align with standard writing rules.
- Character Filtering: Removes any characters not present in the allowed Kashmiri character set (ALL_CHARACTERS), ensuring clean input for TTS models.
from KashmiriNormalizer import TTSNormalizer
# Initialize the TTS normalizer
tts_norm = TTSNormalizer()
text = "مےٚ چُھ 1 لۄکچارٕ پٮ۪ٹھٕ یہٕ عادت۔"
# Normalize for TTS
tts_text = tts_norm.normalize(text)
print(tts_text)
# Output will have diacritics preserved, digits expanded to words, and non-Kashmiri chars removed.
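The TTS-specific steps listed above can be approximated as three small functions. Everything here is a sketch under assumptions: the function names are hypothetical, `DIGIT_WORDS` contains only the single example given in this README, `ALLOWED` is a crude stand-in for the library's ALL_CHARACTERS, and "end of word" is taken to mean that the next character is not a letter or combining mark.

```python
import re
import unicodedata

# Only the "1" entry comes from this README; other digits are left
# unmapped rather than guessed.
DIGIT_WORDS = {"1": "اکھ"}

KASHMIRI_YEH = "\u0620"   # ؠ
YEH_WITH_TAIL = "\u06CD"  # ۍ

# Stand-in for ALL_CHARACTERS: the Arabic Unicode block plus space.
ALLOWED = {chr(cp) for cp in range(0x0600, 0x0700)} | {" "}


def expand_digits(text: str) -> str:
    """Replace each mapped digit with its word; keep unmapped digits."""
    def repl(match):
        key = str(unicodedata.digit(match.group()))
        return DIGIT_WORDS.get(key, match.group())
    return re.sub(r"\d", repl, text)


def fix_final_yeh(text: str) -> str:
    """Turn ؠ into ۍ when no letter or combining mark follows it."""
    out = []
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        in_word = bool(nxt) and unicodedata.category(nxt)[0] in ("L", "M")
        out.append(YEH_WITH_TAIL if ch == KASHMIRI_YEH and not in_word else ch)
    return "".join(out)


def filter_chars(text: str, allowed=ALLOWED) -> str:
    """Drop every character that is not in the allowed set."""
    return "".join(ch for ch in text if ch in allowed)
```

This sketch expands digits one at a time, so multi-digit numbers would need additional handling to be read as whole numbers.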
Dependencies
- regex: Used for advanced Unicode string handling
Development
The project structure is as follows:
KashmiriNormalizer/
├── src/
│   └── KashmiriNormalizer/
│       ├── __init__.py
│       ├── constants.py   # Character maps and regex constants
│       └── normalizer.py  # Main Normalizer class
└── pyproject.toml         # Build configuration and dependencies
File details
Details for the file kashmirinormalizer-0.1.0.tar.gz.
File metadata
- Download URL: kashmirinormalizer-0.1.0.tar.gz
- Upload date:
- Size: 8.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 07c5e75874aa5f6da07b75a822112f442ab20eeecd9c9ab439f38cce0b43ece8 |
| MD5 | bf27d20ffa9d192eeeba315e9c96d0db |
| BLAKE2b-256 | e19ad4eb2eceefb463919abee1ff62e4bf2b679ea2c1b415179ea334e70d3c71 |
File details
Details for the file kashmirinormalizer-0.1.0-py3-none-any.whl.
File metadata
- Download URL: kashmirinormalizer-0.1.0-py3-none-any.whl
- Upload date:
- Size: 6.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c44d7ba252bde74512773f44d291a8d89e55413e70e94005fa4678b95d93b2cc |
| MD5 | 863bb3255add936240d75f85b2ead8c8 |
| BLAKE2b-256 | 55b7997091fe20b1ac10ecc7c4b4c6f3fd71bbdfbb05f873c3ef9ed1ffff648a |