UzbekTokenization

A package designed for segmenting Uzbek texts into (a) words (with compound words and phrases), (b) syllables, (c) affixes, and (d) characters.

Features

  • Word Tokenization: Divide the text into words (with compound words and phrases).
  • Syllable Tokenization: Divide the word into syllables.
  • Affix Tokenization: Divide the word into affixes.
  • Char Tokenization: Divide the text into characters.

GitHub

To work with the source code, clone the project from GitHub:

git clone https://github.com/ddasturbek/UzbekTokenization.git

Install

To use the library, install it from PyPI:

pip install UzbekTokenization

Usage

The library is straightforward to use. The following code examples show each tokenization mode.

Word Tokenization

from UzbekTokenization import WordTokenizer as wt

text = "Uzoq davom etgan janjaldan keyin mashinaning abjag‘i chiqdi."

print(wt.tokenize(text))
print(wt.tokenize(text, multi_word=True))
print(wt.tokenize(text, multi_word=True, pos=True))


""" Results
['Uzoq', 'davom etgan', 'janjaldan', 'keyin', 'mashinaning', 'abjag‘i chiqdi', '.']
['Uzoq', 'davom+etgan', 'janjaldan', 'keyin', 'mashinaning', 'abjag‘i+chiqdi', '.']
['Uzoq', 'davom+etgan(VERB)', 'janjaldan', 'keyin', 'mashinaning', 'abjag‘i+chiqdi', '.']
"""

This Word Tokenization program tokenizes Uzbek texts into words. It keeps compound words (verbs, adverbs, pronouns, and interjections) and KFSQ (Ko‘makchi Fe'lli So‘z Qo‘shilmasi, Compound Verb Phrases) together as single units, for example 'idrok etmoq', 'mana bu', 'hech narsa', 'sevib boshlamoq'. It likewise keeps Uzbek phrases (idioms) together as single tokens.
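As a rough sketch of the idea (not the library's internal algorithm), multi-word units can be kept together by merging consecutive tokens against a phrase lexicon. The `PHRASES` set below is a made-up example; the real library ships its own word lists:

```python
import re

# Hypothetical phrase lexicon; the real library ships its own lists.
PHRASES = {("davom", "etgan"), ("hech", "narsa"), ("idrok", "etmoq")}

def word_tokenize(text, multi_word=False):
    # Split into words (apostrophe-aware) and punctuation marks.
    tokens = re.findall(r"\w+(?:[‘'’]\w+)*|[^\w\s]", text)
    out, i = [], 0
    while i < len(tokens):
        pair = (tokens[i].lower(), tokens[i + 1].lower()) if i + 1 < len(tokens) else None
        if pair in PHRASES:
            # Join a known phrase into one token, "+"-linked if requested.
            out.append(tokens[i] + ("+" if multi_word else " ") + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```

With the lexicon above, `word_tokenize("Uzoq davom etgan janjaldan keyin.")` yields `['Uzoq', 'davom etgan', 'janjaldan', 'keyin', '.']`; passing `multi_word=True` joins the phrase with `+` instead of a space.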

Syllable Tokenization

from UzbekTokenization import SyllableTokenizer as st

print(st.tokenize("Gul"))  # Gul
print(st.tokenize("Yulduz"))  # Yul-duz
print(st.tokenize("shashlik"))  # shash-lik
print(st.tokenize("BOG‘BON"))  # BOG‘-BON
print(st.tokenize("kelinglar"))  # ke-ling-lar
print(st.tokenize("yangilik"))  # yan-gi-lik
print(st.tokenize("Agglyutinativ"))  # ag-glyu-ti-na-tiv
print(st.tokenize("Salom barchaga"))  # Salom barchaga

This Syllable Tokenization program tokenizes Uzbek words into syllables. It correctly handles the letter-and-apostrophe combinations O‘o‘ and G‘g‘ as well as the digraphs Shsh, Chch, and Ngng. Within a word, ng may be either a single digraph or two separate letters (n and g); the tokenizer keeps it together when it is a digraph and splits it when the letters are independent. Some irregular words do not follow the rules, so a list of such exceptions is compiled within the program. The tokenizer is case-sensitive.
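The grapheme-then-syllable idea can be illustrated with a simplified sketch, assuming a basic vowel set and a one-consonant onset rule (the program's actual rules and exception list are more involved; for instance, consonant clusters as in 'Agglyutinativ', and words where ng is really n + g, would need extra handling):

```python
DIGRAPHS = ("o‘", "g‘", "sh", "ch", "ng")  # treated as single units
VOWELS = {"a", "e", "i", "o", "u", "o‘"}   # assumed Uzbek (Latin) vowel set

def graphemes(word):
    # Greedy longest-match scan so sh/ch/ng and o‘/g‘ stay together.
    units, i = [], 0
    while i < len(word):
        if word[i:i + 2].lower() in DIGRAPHS:
            units.append(word[i:i + 2]); i += 2
        else:
            units.append(word[i]); i += 1
    return units

def syllabify(word):
    units = graphemes(word)
    vowels = [i for i, u in enumerate(units) if u.lower() in VOWELS]
    if not vowels:
        return word  # no vowel, nothing to split
    starts = [0]
    for prev, v in zip(vowels, vowels[1:]):
        # When consonants sit between two vowels, the last one
        # opens the next syllable (one-consonant onset).
        starts.append(v - 1 if v - prev > 1 else v)
    starts.append(len(units))
    return "-".join("".join(units[a:b]) for a, b in zip(starts, starts[1:]))
```

On regular words this reproduces the splits shown above, e.g. `syllabify("kelinglar")` gives `"ke-ling-lar"` and `syllabify("shashlik")` gives `"shash-lik"`.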

Affix Tokenization

from UzbekTokenization import AffixTokenizer as at

print(at.tokenize("Serquyosh"))  # Ser-quyosh
print(at.tokenize("KITOBLAR"))  # KITOB-LAR
print(at.tokenize("o‘qiganman"))  # o‘qi-gan-man
print(at.tokenize("Salom odamlar"))  # Salom odamlar

This Affix Tokenization program tokenizes Uzbek words into affixes. Uzbek affixes are of two types: derivational (word-forming) and inflectional (form-forming). Inflectional affixes, in turn, divide into lexical inflectional and syntactic inflectional affixes.

The program specifically separates syntactic inflectional affixes that consist of two or more characters. This is because derivational affixes, lexical inflectional affixes, and single-character syntactic inflectional affixes resemble individual letters within the word stem, making it complicated to distinguish between the affix and the letters of the word stem in those cases.

The program only tokenizes words into affixes; if a text (a phrase or sentence) is provided, it returns it as is. It is case-sensitive.
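A minimal sketch of the suffix-splitting idea, assuming a hand-picked list of multi-character suffixes (the program's real affix inventory and disambiguation logic are far more complete):

```python
# Assumed toy suffix list; not the library's actual inventory.
SUFFIXES = ("ning", "dan", "lar", "gan", "man")

def affix_tokenize(word):
    parts = []
    stripped = True
    while stripped:
        stripped = False
        for suf in SUFFIXES:
            # Peel a known suffix off the end, preserving the original casing.
            if word.lower().endswith(suf) and len(word) > len(suf):
                parts.insert(0, word[-len(suf):])
                word = word[:-len(suf)]
                stripped = True
                break
    return "-".join([word] + parts)
```

Repeatedly stripping the longest matching suffix reproduces splits like `affix_tokenize("o‘qiganman")` → `"o‘qi-gan-man"`; the `len(word) > len(suf)` guard keeps a bare suffix from consuming the whole stem.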

Char Tokenization

from UzbekTokenization import CharTokenizer as ct

print(ct.tokenize("o‘g‘ri"))  # ['o‘', 'g‘', 'r', 'i']
print(ct.tokenize("choshgoh"))  # ['ch', 'o', 'sh', 'g', 'o', 'h']
print(ct.tokenize("bodiring"))  # ['b', 'o', 'd', 'i', 'r', 'i', 'ng']
print(ct.tokenize("Salom, dunyo!"))  # ['S', 'a', 'l', 'o', 'm', ',', 'd', 'u', 'n', 'y', 'o', '!']
print(ct.tokenize("Salom, dunyo!", True))  # ['S', 'a', 'l', 'o', 'm', ',', ' ', 'd', 'u', 'n', 'y', 'o', '!']

This Char Tokenization program works correctly for Uzbek letters: in Uzbek, the combinations O‘o‘ and G‘g‘ and the digraphs Shsh, Chch, and Ngng each count as a single character. If True is passed as the second parameter to the tokenize function, spaces are also included in the output.
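The digraph-aware behaviour shown above can be sketched as a greedy longest-match scan (a simplification; handling words where n and g are separate letters would need the same kind of exception list mentioned under Syllable Tokenization):

```python
UNITS = ("o‘", "g‘", "sh", "ch", "ng")  # multi-letter characters of the alphabet

def char_tokenize(text, keep_spaces=False):
    out, i = [], 0
    while i < len(text):
        if text[i:i + 2].lower() in UNITS:
            out.append(text[i:i + 2]); i += 2  # greedy: digraph first
            continue
        ch = text[i]; i += 1
        if ch == " " and not keep_spaces:
            continue  # spaces are dropped unless requested
        out.append(ch)
    return out
```

For example, `char_tokenize("bodiring")` keeps the final ng together, and `char_tokenize("Salom, dunyo!", True)` retains the space between the words.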

License

This project is licensed under the MIT License.

