
A rule-based text normalizer that converts standard Uzbek Latin into the Common Turkic Alphabet (CTA) for more efficient NLP and LLM tokenization, with built-in O(1) disambiguation of ambiguous ng sequences.


uzbek2turkic: Uzbek Normalizer for NLP Models

A robust, rule-based text normalization library designed to prepare and standardize Uzbek text for modern AI and NLP models (BERT, mDeBERTa, Llama, WordPiece, BPE).

The library maps Uzbek Latin sequences to the Common Turkic Alphabet (CTA). By unifying the Turkic subword ecosystem, the package prevents tokenizers from fragmenting multi-character digraphs (such as o', g', sh, ch, and ng), improving training efficiency and downstream NLP performance.


Key Features

  • Apostrophe Unification: Folds the many apostrophe variants found in real-world text (ʻ U+02BB, ' U+0027, ’ U+2019, ´ U+00B4, ` U+0060) into the single Modifier Letter Apostrophe ʼ (U+02BC), the form recommended by Unicode and W3C guidance.
  • Digraph Compression: Converts multi-character digraphs into their single-character CTA equivalents (o' → ö, g' → ğ, sh → ş, ch → ç).
  • NG Disambiguation: Uses a dual-layered algorithm (grammatical suffix matrix + O(1) dictionary hash) to distinguish the velar nasal (ñ) from n + g sequences at dative/participle morpheme boundaries.
  • Case Preservation: Handles both title case and all caps (Sh → Ş, SH → Ş).
  • Reversible: Supports lossless denormalization (trc2uz) to restore model outputs to standard Uzbek Latin.
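The apostrophe folding and digraph compression described above can be pictured as an ordered replacement table. This is an illustrative sketch only, not the package's internals; it omits the NG disambiguation and all-caps handling the real library provides:

```python
import re

# Illustrative sketch only -- not the library's internals. Apostrophe
# variants are folded to U+02BC first, so the o'/g' digraphs match a
# single canonical mark before compression.
APOSTROPHES = "\u02BB\u2018\u2019\u00B4\u0060'"  # variants folded to U+02BC
DIGRAPHS = {"o\u02BC": "ö", "g\u02BC": "ğ", "sh": "ş", "ch": "ç"}

def sketch_uz2trc(text: str) -> str:
    text = re.sub(f"[{APOSTROPHES}]", "\u02BC", text)
    for src, dst in DIGRAPHS.items():
        text = text.replace(src, dst)                       # lowercase: sh -> ş
        text = text.replace(src.capitalize(), dst.upper())  # titlecase: Sh -> Ş
    return text

print(sketch_uz2trc("O'g'il bola shahar"))  # Öğil bola şahar
```

Running the apostrophe pass first matters: until the variants are unified, a single rule cannot match both o' (U+0027) and oʻ (U+02BB).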

Installation

Available on PyPI. Install via pip:

pip install uzbek2turkic

Quick Start

from uzbek2turkic.normalizer import UzbekNormalizer

# Initialize the normalizer
normalizer = UzbekNormalizer()

original_text = "O'g'il bola shahar tomon choy ichgani yo'l oldi."

# Forward Conversion (Original -> CTA)
normalized = normalizer.uz2trc(original_text)
print(normalized)
# Output: Öğil bola şahar tomon çoy içgani yöl oldi.

# Backward Conversion (CTA -> Original)
restored = normalizer.trc2uz(normalized)
print(restored)
# Output: O'g'il bola shahar tomon choy ichgani yo'l oldi.

Deep Dive: Semantic Disambiguation of NG (ñ vs n + g)

Converting ng to the nasal ñ is one of the hardest problems in computational processing of Uzbek text. n and g frequently appear in sequence as separate grammatical units: for example, a root ending in n followed by the dative suffix -ga, or a passive verb stem ending in -n followed by the participle -gan.

A simplistic replacement script destroys this morphology, mapping jonga to joña or ishlangan to işlañan.
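The failure mode is easy to reproduce with a one-line substitution, shown here only to illustrate the problem:

```python
# For contrast only: a naive global substitution cannot see morpheme
# boundaries, so it corrupts dative and participle forms.
def naive_ng(text: str) -> str:
    return text.replace("ng", "ñ")

print(naive_ng("jonga"))      # joña      (wrong: jon + -ga, dative)
print(naive_ng("ishlangan"))  # ishlañan  (wrong: ishlan + -gan, participle)
```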

uzbek2turkic instead analyzes every occurrence of ng independently, applying two mechanisms:

1. Grammatical Suffix Concurrency Matrix

The algorithm detects localized combinations of N-suffixes (like -lan, -gan, -qan) colliding with G-suffixes (like -gach, -gan, -ga). It automatically isolates these morphological boundaries without relying on absolute dictionary lookups.
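A hypothetical simplification of this idea follows. The suffix tuples contain only the examples named above, not the package's full matrix, and the helper name is illustrative:

```python
# Hypothetical sketch: "ng" stays n + g when the left context ends in a
# known N-suffix and the right context starts with a known G-suffix.
N_SUFFIXES = ("lan", "gan", "qan")   # example N-suffixes from the text
G_SUFFIXES = ("gach", "gan", "ga")   # example G-suffixes from the text

def is_morpheme_boundary(word: str, n_index: int) -> bool:
    """n_index points at the 'n' of an 'ng' pair inside word."""
    left = word[: n_index + 1]    # up to and including the 'n'
    right = word[n_index + 1 :]   # from the 'g' onward
    return left.endswith(N_SUFFIXES) and right.startswith(G_SUFFIXES)

print(is_morpheme_boundary("ishlanganlariga", 5))   # True : ishlan + gan
print(is_morpheme_boundary("quvonganning", 10))     # False: genitive -ning -> ñ
```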

# Example: the grammatical suffix matrix in action
text = "quvonganning ishlanganlariga"

print(normalizer.uz2trc(text))
# Output: quvonganniñ işlanganlariga

(Notice how the legitimate n + g breaks (quvon+gan, ishlan+gan) bypass the ñ conversion, while the genitive suffix -ning at the end of the very same word is converted to -niñ.)

2. O(1) Root Dictionary Filtering

For native nouns that end in n (such as tun, kun, jon, vatan), each candidate is checked against exceptions.txt via a constant-time hash lookup.

# Example of Dictionary Exclusion:
text = "Tungi sovuq kunga va buguning bir qismiga ta'sir qildi."

print(normalizer.uz2trc(text))
# Output: Tungi sovuq kunga va buguniñ bir qismiga taʼsir qildi.

(You can also load your own root words by passing a custom dictionary during initialization: UzbekNormalizer(exceptions_file='/path/to/custom_roots.txt')).
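The exception check itself can be pictured as a hash-set membership test, which is what makes it O(1) per candidate. This is a hypothetical sketch; the root list and helper name are illustrative:

```python
# Hypothetical sketch: roots ending in "n" live in a set, so checking
# whether the letters before a "g" form a protected root is O(1).
N_FINAL_ROOTS = {"tun", "kun", "jon", "vatan"}  # as if loaded from exceptions.txt

def root_blocks_fusion(word: str, n_index: int) -> bool:
    """True if the prefix ending at the 'n' is a protected root (keep n + g)."""
    return word[: n_index + 1] in N_FINAL_ROOTS

print(root_blocks_fusion("kunga", 2))     # True : kun + -ga stays n + g
print(root_blocks_fusion("buguning", 6))  # False: genitive -ning -> ñ
```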


W3C Typography and BPE Tokenizer Compliance

Per W3C internationalization guidance and the Unicode Standard, letters that form an integral part of a word (such as the Uzbek tutuq mark) should not be written with punctuation apostrophes (' U+0027 or ’ U+2019); the recommended character is the Modifier Letter Apostrophe ʼ (U+02BC).

Standard tokenizers (such as HuggingFace BPE pre-tokenizers) treat the tutuq mark (used in words like ma'no and sur'at) as a punctuation boundary and split the word into fragments, e.g. ["ma", "'", "no"].

Because ʼ (U+02BC) carries the Unicode general category Lm (Letter, Modifier), converting word-internal apostrophes to it lets models treat maʼno as a single unbroken token. This one fix substantially reduces token counts and context-window waste during LLM training.
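You can verify the category difference with Python's standard library:

```python
import unicodedata

# U+0027 and U+2019 are punctuation, so BPE pre-tokenizers split on them;
# U+02BC is a letter (category Lm) and stays inside the word.
for ch in ("'", "\u2019", "\u02BC"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch))
# U+0027 Po
# U+2019 Pf
# U+02BC Lm
```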


License

MIT License. Open source and ready for large-scale NLP pipelines.
