UzbekTokenization
A package designed for segmenting Uzbek texts into (a) words (with compound words and phrases), (b) syllables, (c) affixes, and (d) characters.
Features
- Word Tokenization: Divide the text into words (with compound words and phrases).
- Syllable Tokenization: Divide the word into syllables.
- Affix Tokenization: Divide the word into affixes.
- Char Tokenization: Divide the text into characters.
GitHub
To work with the source, clone the project from GitHub:
git clone https://github.com/ddasturbek/UzbekTokenization.git
Install
To use the project, install the library from PyPI:
pip install UzbekTokenization
Usage
The library is easy to use. The following code examples show each tokenization process.
Word Tokenization
from UzbekTokenization import WordTokenizer as wt
text = "Uzoq davom etgan janjaldan keyin mashinaning abjag‘i chiqdi."
print(wt.tokenize(text))
print(wt.tokenize(text, multi_word=True))
print(wt.tokenize(text, multi_word=True, pos=True))
""" Results
['Uzoq', 'davom etgan', 'janjaldan', 'keyin', 'mashinaning', 'abjag‘i chiqdi', '.']
['Uzoq', 'davom+etgan', 'janjaldan', 'keyin', 'mashinaning', 'abjag‘i+chiqdi', '.']
['Uzoq', 'davom+etgan(VERB)', 'janjaldan', 'keyin', 'mashinaning', 'abjag‘i+chiqdi', '.']
"""
This Word Tokenization program tokenizes Uzbek texts into words. It keeps compound words (verbs, adverbs, pronouns, and interjections) and KFSQ (Ko‘makchi Fe'lli So‘z Qo‘shilmasi, compound verb phrases) together as single units, for example 'idrok etmoq', 'mana bu', 'hech narsa', 'sevib boshlamoq'. It also tokenizes Uzbek phrases (idioms) as single units.
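The phrase-merging idea above can be sketched independently of the package: split the text into words and punctuation, then greedily merge adjacent tokens that appear in a phrase lexicon. This is a minimal illustration, not the package's implementation; the two-entry PHRASES set is an assumption (the real lexicon of compound verbs, pronouns, and idioms is much larger).

```python
import re

# Illustrative phrase lexicon (an assumption for this sketch);
# the real package ships a far larger list of compounds and idioms.
PHRASES = {("davom", "etgan"), ("hech", "narsa")}

def tokenize_words(text, phrases=PHRASES):
    """Split text into words and punctuation, then greedily merge
    any adjacent token pair found in the phrase lexicon."""
    # \w covers Unicode letters; the turned comma ‘ is kept inside
    # words so o‘/g‘ spellings survive as one token.
    tokens = re.findall(r"[\w‘']+|[^\w\s]", text)
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i].lower(), tokens[i + 1].lower()) in phrases:
            merged.append(tokens[i] + " " + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged
```

A longer lexicon could hold triples as well; the same greedy scan then tries the longest n-gram first.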
Syllable Tokenization
from UzbekTokenization import SyllableTokenizer as st
print(st.tokenize("Gul")) # Gul
print(st.tokenize("Yulduz")) # Yul-duz
print(st.tokenize("shashlik")) # shash-lik
print(st.tokenize("BOG‘BON")) # BOG‘-BON
print(st.tokenize("kelinglar")) # ke-ling-lar
print(st.tokenize("yangilik")) # yan-gi-lik
print(st.tokenize("Agglyutinativ")) # ag-glyu-ti-na-tiv
print(st.tokenize("Salom barchaga")) # Salom barchaga
This Syllable Tokenization program tokenizes Uzbek words into syllables. It correctly handles the turned-comma letters O‘/o‘ and G‘/g‘ as well as the digraphs Sh/sh, Ch/ch, and Ng/ng. Note that ng may occur within a word either as a digraph or as two separate letters (n and g): it is kept together when it forms a digraph and split when the letters are separate. Some complex words do not follow the regular rules, so a list of such exceptions is maintained inside the program. Tokenization is case-sensitive.
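The basic rule behind Uzbek syllabification is one vowel per syllable, with a single consonant before a vowel joining that vowel's syllable. The sketch below illustrates that rule only; it is not the package's implementation, and it deliberately omits the exception list and the n+g disambiguation the package performs (so words like 'Agglyutinativ' or 'yangilik' would come out differently).

```python
# Vowels of the Uzbek Latin alphabet; o‘ is its own vowel letter.
VOWELS = set("aeiouAEIOU") | {"o‘", "O‘"}
# Two-character units: turned-comma letters and digraphs, all case variants.
DIGRAPHS = {"o‘", "g‘", "sh", "ch", "ng",
            "O‘", "G‘", "Sh", "Ch", "Ng", "SH", "CH", "NG"}

def to_units(word):
    """Greedily group two-character units into single letters."""
    units, i = [], 0
    while i < len(word):
        if word[i:i + 2] in DIGRAPHS:
            units.append(word[i:i + 2])
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units

def syllabify(word):
    """Break before the single consonant that precedes each vowel
    (the basic rule; exception words need a lookup table)."""
    units = to_units(word)
    vowel_idx = [i for i, u in enumerate(units) if u in VOWELS]
    if len(vowel_idx) < 2:
        return word           # zero or one vowel: a single syllable
    breaks = []
    for v in vowel_idx[1:]:
        # one consonant before the vowel joins the new syllable
        breaks.append(v - 1 if units[v - 1] not in VOWELS else v)
    parts, prev = [], 0
    for b in breaks:
        parts.append("".join(units[prev:b]))
        prev = b
    parts.append("".join(units[prev:]))
    return "-".join(parts)
```

Treating digraphs as single units first is what makes 'ke-ling-lar' come out right: the ng in 'kelinglar' is one letter, so the break falls before the l that follows it.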
Affix Tokenization
from UzbekTokenization import AffixTokenizer as at
print(at.tokenize("Serquyosh")) # Ser-quyosh
print(at.tokenize("KITOBLAR")) # KITOB-LAR
print(at.tokenize("o‘qiganman")) # o‘qi-gan-man
print(at.tokenize("Salom odamlar")) # Salom odamlar
This Affix Tokenization program tokenizes Uzbek words into affixes. Affixes in Uzbek are of two types: derivational (word-forming) and inflectional (form-forming). Inflectional affixes are in turn divided into lexical inflectional and syntactic inflectional affixes.
The program specifically separates syntactic inflectional affixes of two or more characters. Derivational affixes, lexical inflectional affixes, and single-character syntactic inflectional affixes resemble ordinary letters within the word stem, which makes it hard to distinguish the affix from the stem in those cases.
The program only tokenizes words into affixes; if a text (a phrase or sentence) is provided, it returns it as is. It is case-sensitive.
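The mechanics of suffix separation can be sketched as longest-first suffix peeling with a minimum stem length. This is an illustration only: the SUFFIXES and PREFIXES lists below are small assumed samples, not the package's inventory, and the real program's rules for avoiding stem/affix confusion are more involved.

```python
# Small illustrative samples (assumptions for this sketch); the
# package's real inventory of affixes is much larger.
SUFFIXES = ("ning", "dan", "lar", "gan", "man", "miz", "siz", "ni", "ga", "da")
PREFIXES = ("ser", "be", "no")

def split_affixes(word):
    """Peel known suffixes (longest first) off the end of a word;
    a minimum stem length keeps stem letters that merely look like
    affixes from being split off."""
    if " " in word:            # phrases/sentences are returned as-is
        return word
    low = word.lower()
    parts, end = [], len(word)
    stripped = True
    while stripped:
        stripped = False
        for suf in SUFFIXES:
            if low[:end].endswith(suf) and end - len(suf) >= 3:
                parts.insert(0, word[end - len(suf):end])
                end -= len(suf)
                stripped = True
                break
    pieces = [word[:end]] + parts
    # at most one derivational prefix, checked against the remaining stem
    for pre in PREFIXES:
        if low.startswith(pre) and end - len(pre) >= 3:
            pieces = [word[:len(pre)], word[len(pre):end]] + parts
            break
    return "-".join(pieces)
```

Slicing the original word (rather than its lowercased copy) preserves case, so 'KITOBLAR' yields 'KITOB-LAR'.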
Char Tokenization
from UzbekTokenization import CharTokenizer as ct
print(ct.tokenize("o‘g‘ri")) # ['o‘', 'g‘', 'r', 'i']
print(ct.tokenize("choshgoh")) # ['ch', 'o', 'sh', 'g', 'o', 'h']
print(ct.tokenize("bodiring")) # ['b', 'o', 'd', 'i', 'r', 'i', 'ng']
print(ct.tokenize("Salom, dunyo!")) # ['S', 'a', 'l', 'o', 'm', ',', 'd', 'u', 'n', 'y', 'o', '!']
print(ct.tokenize("Salom, dunyo!", True)) # ['S', 'a', 'l', 'o', 'm', ',', ' ', 'd', 'u', 'n', 'y', 'o', '!']
This Char Tokenization program works correctly for Uzbek letters: in Uzbek, the turned-comma letters O‘/o‘ and G‘/g‘ and the digraphs Sh/sh, Ch/ch, and Ng/ng each count as a single character. If True is passed as the second parameter of the tokenize function, spaces are also kept in the output.
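The behaviour above amounts to a greedy longest-match scan over the alphabet's two-character units. The sketch below illustrates that idea; it is not the package's code, and it always treats n+g as the digraph, whereas the package applies extra rules for words where they are separate letters.

```python
# Two-character units of the Uzbek Latin alphabet: turned-comma
# letters o‘/g‘ and digraphs sh/ch/ng, in all case variants.
UNITS = {"o‘", "g‘", "sh", "ch", "ng",
         "O‘", "G‘", "Sh", "Ch", "Ng", "SH", "CH", "NG"}

def char_tokenize(text, keep_spaces=False):
    """Greedy scan: try a two-character unit first, otherwise emit
    one character. Spaces are dropped unless keep_spaces is True."""
    out, i = [], 0
    while i < len(text):
        if text[i:i + 2] in UNITS:
            out.append(text[i:i + 2])
            i += 2
        else:
            if text[i] != " " or keep_spaces:
                out.append(text[i])
            i += 1
    return out
```

Punctuation falls through the single-character branch, which matches the 'Salom, dunyo!' example: the comma and exclamation mark stay in the output while the space is dropped by default.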
License
This project is licensed under the MIT License.
File details
Details for the file uzbektokenization-2.0.tar.gz.
File metadata
- Download URL: uzbektokenization-2.0.tar.gz
- Upload date:
- Size: 52.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1e0f128c1edb9dc748be608fca1840fe4264ccdbc33e0dc52252e1d28ea2c3f8 |
| MD5 | a943ac352a5b0609e29f5d28de04301c |
| BLAKE2b-256 | ac7d8faab5914a17a0688f0a4871d575c20783ff9fe2776b2bfa36e594fa8e58 |
File details
Details for the file uzbektokenization-2.0-py3-none-any.whl.
File metadata
- Download URL: uzbektokenization-2.0-py3-none-any.whl
- Upload date:
- Size: 52.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bea6ca8211a500de8a06ee0c8205ca8fa9bbcbcaeb3c6342a7aea7c59737e3b8 |
| MD5 | c4eb949496e489294448625548ffc2aa |
| BLAKE2b-256 | ad54c5c8227ecd4f29089089237d3e0b2d7c02ce7c7a13bb543a273e5f44b750 |