
A rule-based text normalizer that converts standard Uzbek Latin into the New Turkic Alphabet (YTA) for more efficient NLP/LLM tokenization, including an O(1) dictionary-backed algorithm for resolving ng ambiguity.


uzbek2turkic: Uzbek Normalizer for NLP Models

A robust, rule-based text normalization library designed to prepare and standardize Uzbek text for modern AI and NLP models (BERT, mDeBERTa, Llama, WordPiece, BPE).

The library intelligently maps Uzbek Latin sequences to the Yangi Turkiy Alifbo (YTA - New Turkic Alphabet). By unifying the Turkic language subword ecosystem, this package prevents tokenizers from hopelessly fragmenting multi-character digraphs (like o', g', sh, ch, and ng), drastically improving training efficiency and downstream NLP performance.


Key Features

  • Algorithmic Apostrophe Unification: Flattens erratic apostrophe variants (ʻ U+02BB, ' U+0027, ’ U+2019, ´ U+00B4, ` U+0060) into the W3C-recommended modifier letter apostrophe ʼ (U+02BC).
  • Digraph Compression: Converts multi-character digraphs into single-character YTA equivalents (o' → ö, g' → ğ, sh → ş, ch → ç).
  • Advanced NG Disambiguation: Uses a dual-layered algorithm (Grammatical Suffix Matrix + O(1) Dictionary Hash) to distinguish between the velar nasal (ñ) and dative/participle morphology boundaries (n + g).
  • Case-Sensitivity: Preserves capitalization across conversion (Sh → Ş and SH → Ş).
  • Reversible: Supports lossless denormalization (trc2uz) to restore model outputs to standard Uzbek Latin.
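The apostrophe unification step can be illustrated with a small standalone sketch (a simplification for illustration, not the library's actual implementation; the variant list here is an assumption based on the characters named above):

```python
import re

# Apostrophe-like variants seen in the wild for the Uzbek tutuq mark
# (assumed subset; the library may normalize more characters).
APOSTROPHE_VARIANTS = "\u02bb\u0027\u2019\u00b4\u0060"
CANONICAL = "\u02bc"  # MODIFIER LETTER APOSTROPHE

def unify_apostrophes(text: str) -> str:
    """Map every apostrophe-like character to U+02BC."""
    return re.sub(f"[{re.escape(APOSTROPHE_VARIANTS)}]", CANONICAL, text)

print(unify_apostrophes("ma'no, maʻno, ma’no"))
# -> maʼno, maʼno, maʼno
```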

Installation

Available on PyPI. Install via pip:

pip install uzbek2turkic

Quick Start

from uzbek2turkic.normalizer import UzbekNormalizer

# Initialize the normalizer
normalizer = UzbekNormalizer()

original_text = "O'g'il bola shahar tomon choy ichgani yo'l oldi."

# Forward Conversion (Original -> YTA)
normalized = normalizer.uz2trc(original_text)
print(normalized)
# Output: Öğil bola şahar tomon çoy içgani yöl oldi.

# Backward Conversion (YTA -> Original)
restored = normalizer.trc2uz(normalized)
print(restored)
# Output: O'g'il bola shahar tomon choy ichgani yo'l oldi.

Deep Dive: Semantic Disambiguation of NG (ñ vs n + g)

Converting ng to the nasal ñ is one of the hardest problems in computational Uzbek text processing: n and g frequently appear sequentially as separate grammatical units (for example, a root ending in n followed by the dative suffix -ga, or a passive verb stem ending in -n followed by -gan).

A naive find-and-replace would destroy this morphology, mapping jonga to joña or ishlangan to işlañan.
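The failure mode is easy to reproduce with a plain string replacement (illustrative only; not part of the library, and handling only ng, not the other digraphs):

```python
def naive_ng(text: str) -> str:
    """Context-free replacement: corrupts n + g morpheme boundaries."""
    return text.replace("ng", "ñ")

print(naive_ng("tong"))       # toñ      -- correct: true velar nasal
print(naive_ng("jonga"))      # joña     -- wrong: jon + -ga (dative)
print(naive_ng("ishlangan"))  # ishlañan -- wrong: ishlan + -gan (participle)
```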

uzbek2turkic instead analyzes each occurrence of ng independently, applying two mechanisms:

1. Grammatical Suffix Concurrency Matrix

The algorithm detects localized combinations of N-suffixes (like -lan, -gan, -qan) colliding with G-suffixes (like -gach, -gan, -ga). It automatically isolates these morphological boundaries without relying on absolute dictionary lookups.

# Example of the grammatical matrix at work:
text = "quvonganning ishlanganlariga"

print(normalizer.uz2trc(text))
# Output: quvonganniñ işlanganlariga

(Note how the genuine n + g boundaries (quvon + gan, ishlan + gan) bypass ñ conversion, while the genitive suffix -ning at the end of the same word becomes niñ.)
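The suffix-boundary idea can be sketched in a few lines of standalone Python. This is a hypothetical simplification: the package's actual rule set and suffix inventory are richer, and cases like Tungi additionally require the root dictionary covered in the next section.

```python
# Hypothetical simplification: keep "ng" split as n + g when the g
# opens a known G-initial suffix (so the pair straddles a morpheme
# boundary); otherwise emit the velar nasal ñ.
G_SUFFIXES = ("gan", "gach", "ga")

def convert_ng(word: str) -> str:
    out, i = [], 0
    while i < len(word):
        if word[i:i + 2] == "ng" and not any(
            word[i + 1:].startswith(s) for s in G_SUFFIXES
        ):
            out.append("ñ")   # velar nasal: consume both letters
            i += 2
        else:
            out.append(word[i])  # morpheme boundary: keep n and g apart
            i += 1
    return "".join(out)

print(convert_ng("quvonganning"))  # quvonganniñ
print(convert_ng("jonga"))         # jonga (dative boundary preserved)
```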

2. O(1) Root Dictionary Filtering

For native nouns that structurally end in n (like tun, kun, jon, vatan), the text is instantly validated against exceptions.txt.

# Example of Dictionary Exclusion:
text = "Tungi sovuq vatanga ham ta'sir qildi."

print(normalizer.uz2trc(text))
# Output: Tungi sovuq vatanga ham taʼsir qildi.

(You can also load your own root words by passing a custom dictionary during initialization: UzbekNormalizer(exceptions_file='/path/to/custom_roots.txt')).
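The O(1) claim maps naturally onto a Python set: average-case constant-time membership tests over the n-final roots. A minimal sketch with a hard-coded root list (the real package loads roots from exceptions.txt; the matching logic here is a guess):

```python
# n-final roots whose n + g must never fuse into ñ
# (hard-coded for illustration; the package reads exceptions.txt).
N_FINAL_ROOTS = {"tun", "kun", "jon", "vatan"}

def ng_is_root_boundary(word: str, i: int) -> bool:
    """True if the 'n' of an 'ng' pair at index i ends a known root.

    Set membership is O(1) on average, so the check adds no
    meaningful cost per occurrence.
    """
    return word[:i + 1].lower() in N_FINAL_ROOTS

print(ng_is_root_boundary("tungi", 2))    # True  -> keep n + g
print(ng_is_root_boundary("vatanga", 4))  # True  -> keep n + g
print(ng_is_root_boundary("singil", 2))   # False -> eligible for ñ
```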


W3C Typography and BPE Tokenizer Compliance

According to W3C internationalization guidance and the Unicode Standard, letters that form an integral part of a word (like the Uzbek tutuq) should not be written with punctuation apostrophes (' U+0027 or ’ U+2019); the recommended character is U+02BC (Modifier Letter Apostrophe — ʼ).

The Uzbek tutuq mark (used in words like ma'no, sur'at) is treated as a punctuation boundary by standard tokenizers (such as HuggingFace BPE pre-tokenizers), shattering the word into fragments (e.g., ["ma", "'", "no"]).

By converting standalone apostrophes into U+02BC (ʼ), which Unicode classifies as a modifier letter, models treat maʼno as a single unbroken token. This one fix substantially reduces token waste during LLM training.
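The effect is visible even with a toy regex tokenizer: Python's \w matches U+02BC (Unicode category Lm, a letter) but not the punctuation apostrophe U+0027, mirroring how real pre-tokenizers split on punctuation:

```python
import re

def toy_tokenize(text: str) -> list[str]:
    """Toy pre-tokenizer: runs of Unicode word characters."""
    return re.findall(r"\w+", text)

print(toy_tokenize("ma'no"))       # ['ma', 'no'] -- U+0027 splits the word
print(toy_tokenize("ma\u02bcno"))  # ['maʼno']    -- U+02BC keeps it whole
```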


License

MIT License. Open-source and ready for large-scale NLP pipelining.

