
A rule-based text normalizer that converts standard Uzbek Latin into the Common Turkic Alphabet (CTA) for more efficient NLP and LLM tokenization, with built-in O(1) disambiguation of ambiguous ng sequences.


uzbek2turkic: Uzbek Normalizer for NLP Models

A robust, rule-based text normalization library designed to prepare and standardize Uzbek text for modern AI and NLP models (BERT, mDeBERTa, Llama, WordPiece, BPE).

The library maps Uzbek Latin sequences to the Common Turkic Alphabet (CTA). By unifying the Turkic subword ecosystem, the package prevents tokenizers from fragmenting multi-character digraphs (such as o', g', sh, ch, and ng), improving training efficiency and downstream NLP performance.


Key Features

  • Apostrophe Unification: Folds the many apostrophe variants found in real-world text (ʻ U+02BB, ' U+0027, ’ U+2019, ´ U+00B4, ` U+0060) into the single Modifier Letter Apostrophe ʼ (U+02BC), the form recommended by Unicode and W3C guidance.
  • Digraph Compression: Converts multi-character digraphs into their single-character CTA equivalents (o' → ö, g' → ğ, sh → ş, ch → ç).
  • NG Disambiguation: Uses a dual-layered algorithm (grammatical suffix matrix + O(1) dictionary hash) to distinguish the velar nasal (ñ) from n + g sequences at dative/participle morpheme boundaries.
  • Case Preservation: Handles both title case and all caps (Sh → Ş, SH → Ş).
  • Reversible: Supports lossless denormalization (trc2uz) to restore model outputs to standard Uzbek Latin.
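The apostrophe folding and digraph compression described above can be pictured as an ordered replacement table. This is an illustrative sketch only, not the package's internals; it omits the NG disambiguation and all-caps handling the real library provides:

```python
import re

# Illustrative sketch only -- not the library's internals. Apostrophe
# variants are folded to U+02BC first, so the o'/g' digraphs match a
# single canonical mark before compression.
APOSTROPHES = "\u02BB\u2018\u2019\u00B4\u0060'"  # variants folded to U+02BC
DIGRAPHS = {"o\u02BC": "ö", "g\u02BC": "ğ", "sh": "ş", "ch": "ç"}

def sketch_uz2trc(text: str) -> str:
    text = re.sub(f"[{APOSTROPHES}]", "\u02BC", text)
    for src, dst in DIGRAPHS.items():
        text = text.replace(src, dst)                       # lowercase: sh -> ş
        text = text.replace(src.capitalize(), dst.upper())  # titlecase: Sh -> Ş
    return text

print(sketch_uz2trc("O'g'il bola shahar"))  # Öğil bola şahar
```

Running the apostrophe pass first matters: until the variants are unified, a single rule cannot match both o' (U+0027) and oʻ (U+02BB).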

Installation

Available on PyPI. Install via pip:

pip install uzbek2turkic

Quick Start

from uzbek2turkic.normalizer import UzbekNormalizer

# Initialize the normalizer
normalizer = UzbekNormalizer()

original_text = "O'g'il bola shahar tomon choy ichgani yo'l oldi."

# Forward Conversion (Original -> CTA)
normalized = normalizer.uz2trc(original_text)
print(normalized)
# Output: Öğil bola şahar tomon çoy içgani yöl oldi.

# Backward Conversion (CTA -> Original)
restored = normalizer.trc2uz(normalized)
print(restored)
# Output: O'g'il bola shahar tomon choy ichgani yo'l oldi.

Deep Dive: Semantic Disambiguation of NG (ñ vs n + g)

Converting ng to the nasal ñ is one of the hardest problems in computational processing of Uzbek text. n and g frequently appear in sequence as separate grammatical units: for example, a root ending in n followed by the dative suffix -ga, or a passive verb stem ending in -n followed by the participle -gan.

A simplistic replacement script destroys this morphology, mapping jonga to joña or ishlangan to işlañan.
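The failure mode is easy to reproduce with a one-line substitution, shown here only to illustrate the problem:

```python
# For contrast only: a naive global substitution cannot see morpheme
# boundaries, so it corrupts dative and participle forms.
def naive_ng(text: str) -> str:
    return text.replace("ng", "ñ")

print(naive_ng("jonga"))      # joña      (wrong: jon + -ga, dative)
print(naive_ng("ishlangan"))  # ishlañan  (wrong: ishlan + -gan, participle)
```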

uzbek2turkic instead analyzes every occurrence of ng independently, applying two mechanisms:

1. Grammatical Suffix Concurrency Matrix

The algorithm detects localized combinations of N-suffixes (like -lan, -gan, -qan) colliding with G-suffixes (like -gach, -gan, -ga). It automatically isolates these morphological boundaries without relying on absolute dictionary lookups.
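A hypothetical simplification of this idea follows. The suffix tuples contain only the examples named above, not the package's full matrix, and the helper name is illustrative:

```python
# Hypothetical sketch: "ng" stays n + g when the left context ends in a
# known N-suffix and the right context starts with a known G-suffix.
N_SUFFIXES = ("lan", "gan", "qan")   # example N-suffixes from the text
G_SUFFIXES = ("gach", "gan", "ga")   # example G-suffixes from the text

def is_morpheme_boundary(word: str, n_index: int) -> bool:
    """n_index points at the 'n' of an 'ng' pair inside word."""
    left = word[: n_index + 1]    # up to and including the 'n'
    right = word[n_index + 1 :]   # from the 'g' onward
    return left.endswith(N_SUFFIXES) and right.startswith(G_SUFFIXES)

print(is_morpheme_boundary("ishlanganlariga", 5))   # True : ishlan + gan
print(is_morpheme_boundary("quvonganning", 10))     # False: genitive -ning -> ñ
```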

# Example: the grammatical suffix matrix in action
text = "quvonganning ishlanganlariga"

print(normalizer.uz2trc(text))
# Output: quvonganniñ işlanganlariga

(Notice how the legitimate n + g breaks (quvon+gan, ishlan+gan) bypass the ñ conversion, while the genitive suffix -ning at the end of the very same word is converted to -niñ.)

2. O(1) Root Dictionary Filtering

For native nouns that end in n (such as tun, kun, jon, vatan), each candidate is checked against exceptions.txt via a constant-time hash lookup.

# Example of Dictionary Exclusion:
text = "Tungi sovuq kunga va buguning bir qismiga ta'sir qildi."

print(normalizer.uz2trc(text))
# Output: Tungi sovuq kunga va buguniñ bir qismiga taʼsir qildi.

(You can also load your own root words by passing a custom dictionary during initialization: UzbekNormalizer(exceptions_file='/path/to/custom_roots.txt')).
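The exception check itself can be pictured as a hash-set membership test, which is what makes it O(1) per candidate. This is a hypothetical sketch; the root list and helper name are illustrative:

```python
# Hypothetical sketch: roots ending in "n" live in a set, so checking
# whether the letters before a "g" form a protected root is O(1).
N_FINAL_ROOTS = {"tun", "kun", "jon", "vatan"}  # as if loaded from exceptions.txt

def root_blocks_fusion(word: str, n_index: int) -> bool:
    """True if the prefix ending at the 'n' is a protected root (keep n + g)."""
    return word[: n_index + 1] in N_FINAL_ROOTS

print(root_blocks_fusion("kunga", 2))     # True : kun + -ga stays n + g
print(root_blocks_fusion("buguning", 6))  # False: genitive -ning -> ñ
```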


W3C Typography and BPE Tokenizer Compliance

Per W3C internationalization guidance and the Unicode Standard, letters that form an integral part of a word (such as the Uzbek tutuq mark) should not be written with punctuation apostrophes (' U+0027 or ’ U+2019); the recommended character is the Modifier Letter Apostrophe ʼ (U+02BC).

Standard tokenizers (such as HuggingFace BPE pre-tokenizers) treat the tutuq mark (used in words like ma'no and sur'at) as a punctuation boundary and split the word into fragments, e.g. ["ma", "'", "no"].

Because ʼ (U+02BC) carries the Unicode general category Lm (Letter, Modifier), converting word-internal apostrophes to it lets models treat maʼno as a single unbroken token. This one fix substantially reduces token counts and context-window waste during LLM training.
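You can verify the category difference with Python's standard library:

```python
import unicodedata

# U+0027 and U+2019 are punctuation, so BPE pre-tokenizers split on them;
# U+02BC is a letter (category Lm) and stays inside the word.
for ch in ("'", "\u2019", "\u02BC"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch))
# U+0027 Po
# U+2019 Pf
# U+02BC Lm
```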


License

MIT License. Open source and ready for large-scale NLP pipelines.
