A text normalizer that converts standard Uzbek Latin into the New Turkic Alphabet (YTA) for more efficient NLP and LLM tokenization, with constant-time (O(1)) dictionary lookups for resolving ng ambiguity.
Project description
uzbek2turkic: Uzbek Normalizer for NLP Models
A rule-based text normalization library designed to prepare and standardize Uzbek text for modern AI and NLP models (BERT, mDeBERTa, Llama) and subword tokenizers (WordPiece, BPE).
The library maps Uzbek Latin sequences to the Yangi Turkiy Alifbo (YTA, New Turkic Alphabet). Because each digraph (o', g', sh, ch, and ng) becomes a single letter, subword tokenizers are less likely to split these sounds across tokens, which is intended to improve training efficiency and downstream NLP performance.
Key Features
- Algorithmic Apostrophe Unification: Maps the common apostrophe variants (ʻ, ‘, ’, ´, `) to the Unicode modifier letter apostrophe ʼ (U+02BC).
- Digraph Compression: Converts digraphs into single-character YTA equivalents (o' → ö, g' → ğ, sh → ş, ch → ç).
- Advanced NG Disambiguation: Uses a two-layer algorithm (Grammatical Suffix Matrix + O(1) Dictionary Hash) to distinguish the velar nasal (ñ) from n+g sequences that span a morpheme boundary, such as dative or participle suffixes.
- Case-Sensitivity: Preserves capitalization (Sh → Ş or SH → Ş).
- Reversible: Supports denormalization (trc2uz) to restore model outputs to standard Uzbek Latin.
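The first, second, and fourth features can be illustrated with a minimal sketch. This is not the library's implementation; the names `compress_digraphs`, `unify_apostrophes`, and `sketch_normalize` are hypothetical, and ng handling is deliberately left out because it needs the disambiguation logic described later.

```python
import re

# Apostrophe variants named in the feature list, plus the ASCII apostrophe.
APOSTROPHES = "'\u02bb\u2018\u2019\u00b4`"
MODIFIER_APOSTROPHE = "\u02bc"  # U+02BC MODIFIER LETTER APOSTROPHE

# Character class matching any apostrophe form after o/g.
_AP = "[" + re.escape(APOSTROPHES + MODIFIER_APOSTROPHE) + "]"
_DIGRAPH_RE = re.compile(rf"o{_AP}|g{_AP}|sh|ch", re.IGNORECASE)
_SINGLE = {"o": "ö", "g": "ğ", "s": "ş", "c": "ç"}


def compress_digraphs(text: str) -> str:
    """Replace o', g', sh, ch with one letter, keeping the first letter's case."""
    def repl(match: re.Match) -> str:
        first = match.group(0)[0]
        letter = _SINGLE[first.lower()]
        return letter.upper() if first.isupper() else letter
    return _DIGRAPH_RE.sub(repl, text)


def unify_apostrophes(text: str) -> str:
    """Map every remaining apostrophe variant to U+02BC."""
    return text.translate({ord(c): MODIFIER_APOSTROPHE for c in APOSTROPHES})


def sketch_normalize(text: str) -> str:
    # Digraphs first, so the apostrophe in o'/g' is consumed before unification.
    return unify_apostrophes(compress_digraphs(text))


print(sketch_normalize("O'g'il bola shahar tomon choy ichgani yo'l oldi."))
print(sketch_normalize("ma'no"))
```

Running digraph compression before apostrophe unification is a design choice of this sketch: it keeps the tutuq in words like ma'no distinct from the apostrophe that is part of o' and g'.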
Installation
Available on PyPI. Install via pip:
pip install uzbek2turkic
Quick Start
from uzbek2turkic.normalizer import UzbekNormalizer
# Initialize the normalizer
normalizer = UzbekNormalizer()
original_text = "O'g'il bola shahar tomon choy ichgani yo'l oldi."
# Forward Conversion (Original -> YTA)
normalized = normalizer.uz2trc(original_text)
print(normalized)
# Output: Öğil bola şahar tomon çoy içgani yöl oldi.
# Backward Conversion (YTA -> Original)
restored = normalizer.trc2uz(normalized)
print(restored)
# Output: O'g'il bola shahar tomon choy ichgani yo'l oldi.
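The reverse direction can be sketched as a per-character lookup. The function below is a hypothetical illustration, not the library's `trc2uz`; in particular, how the library restores the case of an uppercase digraph (Sh versus SH) is not stated, so the rule used here (look at the next character) is an assumption.

```python
# Single YTA letter -> Uzbek Latin sequence (lowercase forms).
_REVERSE = {
    "ö": "o'", "ğ": "g'", "ş": "sh", "ç": "ch", "ñ": "ng",
    "\u02bc": "'",  # modifier letter apostrophe back to ASCII apostrophe
}


def sketch_denormalize(text: str) -> str:
    out = []
    for i, ch in enumerate(text):
        repl = _REVERSE.get(ch.lower())
        if repl is None:
            out.append(ch)  # not a YTA-specific letter, copy through
        elif ch.isupper():
            # Assumption: all-caps context gives "SH", otherwise "Sh".
            nxt = text[i + 1:i + 2]
            out.append(repl.upper() if nxt.isupper() else repl.capitalize())
        else:
            out.append(repl)
    return "".join(out)


print(sketch_denormalize("Öğil bola şahar tomon çoy içgani yöl oldi."))
```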
Deep Dive: Semantic Disambiguation of NG (ñ vs n + g)
Converting ng to the nasal ñ is a hard problem in computational Uzbek text processing, because n and g frequently appear next to each other as parts of separate grammatical units (for example, a root ending in n followed by the dative suffix -ga, or a passive verb stem ending in n followed by -gan).
A naive search-and-replace would destroy this morphology by mapping jonga to joña or ishlangan to işlañan.
uzbek2turkic evaluates each occurrence of ng independently, using two mechanisms:
1. Grammatical Suffix Concurrency Matrix
The algorithm detects N-suffixes (like -lan, -gan, -qan) that are immediately followed by G-suffixes (like -gach, -gan, -ga). It treats these positions as morphological boundaries and leaves the n+g sequence intact, without needing a dictionary lookup.
# Example of the grammatical suffix matrix:
text = "quvonganning ishlanganlariga"
print(normalizer.uz2trc(text))
# Output: quvonganniñ işlanganlariga
(Note that the n+g morpheme boundaries (quvon+gan, ishlan+gan) are left unconverted, while the genitive suffix -ning at the end of quvonganning becomes -niñ within the same word.)
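The suffix-adjacency idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's algorithm: it uses only the suffixes listed above, the function name `convert_ng` is hypothetical, and it only recognizes a boundary when the left side ends in one of the listed N-suffixes. Boundaries after a bare root (such as quvon+gan) are outside what this sketch covers.

```python
# Suffix lists taken from the description above; the real tables may be larger.
N_SUFFIXES = ("lan", "gan", "qan")
G_SUFFIXES = ("gach", "gan", "ga")


def convert_ng(word: str) -> str:
    """Replace ng with ñ unless it sits on an N-suffix + G-suffix boundary."""
    out = []
    i = 0
    while i < len(word):
        if word[i:i + 2].lower() == "ng":
            left = word[:i + 1].lower()    # everything up to and including n
            right = word[i + 1:].lower()   # everything from g onward
            if left.endswith(N_SUFFIXES) and right.startswith(G_SUFFIXES):
                out.append(word[i:i + 2])  # morpheme boundary: keep n + g
            else:
                out.append("Ñ" if word[i].isupper() else "ñ")
            i += 2
        else:
            out.append(word[i])
            i += 1
    return "".join(out)


print(convert_ng("ishlanganlariga"))  # boundary ishlan + gan is kept
print(convert_ng("ishlanganning"))    # final -ning becomes -niñ
```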
2. O(1) Root Dictionary Filtering
For root nouns that end in n (such as tun, kun, jon, vatan), the word is checked against the root list in exceptions.txt, so that a following g is not merged into ñ.
# Example of Dictionary Exclusion:
text = "Tungi sovuq vatanga ham ta'sir qildi."
print(normalizer.uz2trc(text))
# Output: Tungi sovuq vatanga ham taʼsir qildi.
(You can also load your own root words by passing a custom dictionary during initialization: UzbekNormalizer(exceptions_file='/path/to/custom_roots.txt')).
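A minimal sketch of the root lookup, assuming the exceptions file holds one root per line (the actual file format is not documented here). The names `load_roots` and `convert_ng_with_roots` are hypothetical. A Python set gives average-case O(1) membership tests, which is the property the section title refers to.

```python
def load_roots(lines):
    """Build a lookup set from an iterable of lines (assumed: one root per line)."""
    return frozenset(line.strip().lower() for line in lines if line.strip())


def convert_ng_with_roots(word: str, roots: frozenset) -> str:
    """Replace ng with ñ unless the text up to and including n is a known root."""
    out = []
    i = 0
    while i < len(word):
        if word[i:i + 2].lower() == "ng":
            if word[:i + 1].lower() in roots:  # average O(1) hash lookup
                out.append(word[i:i + 2])      # root + suffix boundary: keep n + g
            else:
                out.append("Ñ" if word[i].isupper() else "ñ")
            i += 2
        else:
            out.append(word[i])
            i += 1
    return "".join(out)


# In practice the lines would come from a file such as exceptions.txt.
roots = load_roots(["tun\n", "kun\n", "jon\n", "vatan\n"])
for w in ("Tungi", "vatanga", "jonga", "tong"):
    print(w, "->", convert_ng_with_roots(w, roots))
```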
W3C Typography and BPE Tokenizer Compliance
Following Unicode and W3C typography guidance, a mark that functions as a letter within a word (such as the Uzbek tutuq) is better encoded as U+02BC (ʼ, Modifier Letter Apostrophe) than as a punctuation apostrophe (', U+0027 or ’, U+2019).
When written with a punctuation apostrophe, the tutuq (used in words like ma'no, sur'at) is typically treated as a punctuation boundary by standard pre-tokenizers (such as those used with HuggingFace BPE tokenizers), which splits the word into fragments (e.g., ["ma", "'", "no"]).
Converting standalone apostrophes to U+02BC (ʼ), which has the Unicode general category Lm (Modifier Letter), lets tokenizers keep maʼno together as a single word unit. Fewer fragments per word means fewer tokens consumed from the context window during LLM training.
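The category difference can be checked with Python's standard library. The regex below is only a stand-in for a word/punctuation pre-tokenizer, not the one any specific tokenizer uses; `simple_pretokenize` is a hypothetical helper.

```python
import re
import unicodedata


def describe(ch: str):
    """Return code point, Unicode general category, and character name."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))


def simple_pretokenize(text: str):
    # Stand-in pre-tokenizer: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


for ch in ("'", "\u2019", "\u02bc"):
    print(describe(ch))
# U+0027 is category Po, U+2019 is Pf (both punctuation); U+02BC is Lm (a letter).

print(simple_pretokenize("ma'no"))       # ['ma', "'", 'no']
print(simple_pretokenize("ma\u02bcno"))  # one unbroken item
```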
License
MIT License. Open-source and ready for large-scale NLP pipelining.
File details
Details for the file uzbek2turkic-0.2.1.tar.gz.
File metadata
- Download URL: uzbek2turkic-0.2.1.tar.gz
- Upload date:
- Size: 6.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4a1404d0826dfad7fcb7114ec71f3c0837d22b37fe458b31022180dea99c5130 |
| MD5 | de4dea26ead3cc89883436b6a836b713 |
| BLAKE2b-256 | 86e2e8b9a0f262028b8572f1d0ff358ed6dee7ea9b53a6f1df949a40ed2ecf77 |
File details
Details for the file uzbek2turkic-0.2.1-py3-none-any.whl.
File metadata
- Download URL: uzbek2turkic-0.2.1-py3-none-any.whl
- Upload date:
- Size: 6.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d16a59c3df0742f22eda348bb097eb97926317b07e29a43f3290ea2d4c440f9c |
| MD5 | 1eb0a07fe857e3f3a9843e4f64e35fee |
| BLAKE2b-256 | 1cfb51404b620d5a012a4daeccbc47f1e5d7d3755f6d5970ae5ebd966cc856d8 |