UzbekTokenization
A package designed for segmenting Uzbek texts into (a) words (with compound words and phrases), (b) syllables, (c) affixes, and (d) characters.
Features
- Word Tokenization: Divide the text into words (with compound words and phrases).
- Syllable Tokenization: Divide the word into syllables.
- Affix Tokenization: Divide the word into affixes.
- Char Tokenization: Divide the text into characters.
GitHub
To work with the source, clone the project from GitHub:
git clone https://github.com/ddasturbek/UzbekTokenization.git
Install
To use the project, install the library from PyPI:
pip install UzbekTokenization
Usage
The library is easy to use. The following code examples show each tokenization process.
Word Tokenization
from UzbekTokenization import WordTokenizer as wt
text = "Uzoq davom etgan janjaldan keyin mashinaning abjag‘i chiqdi."
print(wt.tokenize(text))
print(wt.tokenize(text, multi_word=True))
print(wt.tokenize(text, multi_word=True, pos=True))
""" Results
['Uzoq', 'davom etgan', 'janjaldan', 'keyin', 'mashinaning', 'abjag‘i chiqdi', '.']
['Uzoq', 'davom+etgan', 'janjaldan', 'keyin', 'mashinaning', 'abjag‘i+chiqdi', '.']
['Uzoq', 'davom+etgan(VERB)', 'janjaldan', 'keyin', 'mashinaning', 'abjag‘i+chiqdi', '.']
"""
This Word Tokenization program tokenizes Uzbek texts into words. It keeps compound words (verbs, adverbs, pronouns, and interjections) and KFSQ (Ko‘makchi Fe'lli So‘z Qo‘shilmasi, compound verb phrases) together as single units, for example 'idrok etmoq', 'mana bu', 'hech narsa', 'sevib boshlamoq'. It also tokenizes Uzbek phrases (idioms) as single units.
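The phrase-merging idea above can be sketched independently of the package: split the text into words and punctuation, then greedily merge adjacent tokens that appear in a phrase lexicon. This is a minimal illustration, not the package's implementation; the two-entry PHRASES set is an assumption (the real lexicon of compound verbs, pronouns, and idioms is much larger).

```python
import re

# Illustrative phrase lexicon (an assumption for this sketch);
# the real package ships a far larger list of compounds and idioms.
PHRASES = {("davom", "etgan"), ("hech", "narsa")}

def tokenize_words(text, phrases=PHRASES):
    """Split text into words and punctuation, then greedily merge
    any adjacent token pair found in the phrase lexicon."""
    # \w covers Unicode letters; the turned comma ‘ is kept inside
    # words so o‘/g‘ spellings survive as one token.
    tokens = re.findall(r"[\w‘']+|[^\w\s]", text)
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i].lower(), tokens[i + 1].lower()) in phrases:
            merged.append(tokens[i] + " " + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged
```

A longer lexicon could hold triples as well; the same greedy scan then tries the longest n-gram first.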
Syllable Tokenization
from UzbekTokenization import SyllableTokenizer as st
print(st.tokenize("Gul")) # Gul
print(st.tokenize("Yulduz")) # Yul-duz
print(st.tokenize("shashlik")) # shash-lik
print(st.tokenize("BOG‘BON")) # BOG‘-BON
print(st.tokenize("kelinglar")) # ke-ling-lar
print(st.tokenize("yangilik")) # yan-gi-lik
print(st.tokenize("Agglyutinativ")) # ag-glyu-ti-na-tiv
print(st.tokenize("Salom barchaga")) # Salom barchaga
This Syllable Tokenization program tokenizes Uzbek words into syllables. It correctly handles the turned-comma letters O‘/o‘ and G‘/g‘ as well as the digraphs Sh/sh, Ch/ch, and Ng/ng. Note that ng may occur within a word either as a digraph or as two separate letters (n and g): it is kept together when it forms a digraph and split when the letters are separate. Some complex words do not follow the regular rules, so a list of such exceptions is maintained inside the program. Tokenization is case-sensitive.
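The basic rule behind Uzbek syllabification is one vowel per syllable, with a single consonant before a vowel joining that vowel's syllable. The sketch below illustrates that rule only; it is not the package's implementation, and it deliberately omits the exception list and the n+g disambiguation the package performs (so words like 'Agglyutinativ' or 'yangilik' would come out differently).

```python
# Vowels of the Uzbek Latin alphabet; o‘ is its own vowel letter.
VOWELS = set("aeiouAEIOU") | {"o‘", "O‘"}
# Two-character units: turned-comma letters and digraphs, all case variants.
DIGRAPHS = {"o‘", "g‘", "sh", "ch", "ng",
            "O‘", "G‘", "Sh", "Ch", "Ng", "SH", "CH", "NG"}

def to_units(word):
    """Greedily group two-character units into single letters."""
    units, i = [], 0
    while i < len(word):
        if word[i:i + 2] in DIGRAPHS:
            units.append(word[i:i + 2])
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units

def syllabify(word):
    """Break before the single consonant that precedes each vowel
    (the basic rule; exception words need a lookup table)."""
    units = to_units(word)
    vowel_idx = [i for i, u in enumerate(units) if u in VOWELS]
    if len(vowel_idx) < 2:
        return word           # zero or one vowel: a single syllable
    breaks = []
    for v in vowel_idx[1:]:
        # one consonant before the vowel joins the new syllable
        breaks.append(v - 1 if units[v - 1] not in VOWELS else v)
    parts, prev = [], 0
    for b in breaks:
        parts.append("".join(units[prev:b]))
        prev = b
    parts.append("".join(units[prev:]))
    return "-".join(parts)
```

Treating digraphs as single units first is what makes 'ke-ling-lar' come out right: the ng in 'kelinglar' is one letter, so the break falls before the l that follows it.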
Affix Tokenization
from UzbekTokenization import AffixTokenizer as at
print(at.tokenize("Serquyosh")) # Ser-quyosh
print(at.tokenize("KITOBLAR")) # KITOB-LAR
print(at.tokenize("o‘qiganman")) # o‘qi-gan-man
print(at.tokenize("Salom odamlar")) # Salom odamlar
This Affix Tokenization program tokenizes Uzbek words into affixes. Affixes in Uzbek are of two types: derivational (word-forming) and inflectional (form-forming). Inflectional affixes are in turn divided into lexical inflectional and syntactic inflectional affixes.
The program specifically separates syntactic inflectional affixes of two or more characters. Derivational affixes, lexical inflectional affixes, and single-character syntactic inflectional affixes resemble ordinary letters within the word stem, which makes it hard to distinguish the affix from the stem in those cases.
The program only tokenizes words into affixes; if a text (a phrase or sentence) is provided, it returns it as is. It is case-sensitive.
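The mechanics of suffix separation can be sketched as longest-first suffix peeling with a minimum stem length. This is an illustration only: the SUFFIXES and PREFIXES lists below are small assumed samples, not the package's inventory, and the real program's rules for avoiding stem/affix confusion are more involved.

```python
# Small illustrative samples (assumptions for this sketch); the
# package's real inventory of affixes is much larger.
SUFFIXES = ("ning", "dan", "lar", "gan", "man", "miz", "siz", "ni", "ga", "da")
PREFIXES = ("ser", "be", "no")

def split_affixes(word):
    """Peel known suffixes (longest first) off the end of a word;
    a minimum stem length keeps stem letters that merely look like
    affixes from being split off."""
    if " " in word:            # phrases/sentences are returned as-is
        return word
    low = word.lower()
    parts, end = [], len(word)
    stripped = True
    while stripped:
        stripped = False
        for suf in SUFFIXES:
            if low[:end].endswith(suf) and end - len(suf) >= 3:
                parts.insert(0, word[end - len(suf):end])
                end -= len(suf)
                stripped = True
                break
    pieces = [word[:end]] + parts
    # at most one derivational prefix, checked against the remaining stem
    for pre in PREFIXES:
        if low.startswith(pre) and end - len(pre) >= 3:
            pieces = [word[:len(pre)], word[len(pre):end]] + parts
            break
    return "-".join(pieces)
```

Slicing the original word (rather than its lowercased copy) preserves case, so 'KITOBLAR' yields 'KITOB-LAR'.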
Char Tokenization
from UzbekTokenization import CharTokenizer as ct
print(ct.tokenize("o‘g‘ri")) # ['o‘', 'g‘', 'r', 'i']
print(ct.tokenize("choshgoh")) # ['ch', 'o', 'sh', 'g', 'o', 'h']
print(ct.tokenize("bodiring")) # ['b', 'o', 'd', 'i', 'r', 'i', 'ng']
print(ct.tokenize("Salom, dunyo!")) # ['S', 'a', 'l', 'o', 'm', ',', 'd', 'u', 'n', 'y', 'o', '!']
print(ct.tokenize("Salom, dunyo!", True)) # ['S', 'a', 'l', 'o', 'm', ',', ' ', 'd', 'u', 'n', 'y', 'o', '!']
This Char Tokenization program works correctly for Uzbek letters: in Uzbek, the turned-comma letters O‘/o‘ and G‘/g‘ and the digraphs Sh/sh, Ch/ch, and Ng/ng each count as a single character. If True is passed as the second parameter of the tokenize function, spaces are also kept in the output.
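The behaviour above amounts to a greedy longest-match scan over the alphabet's two-character units. The sketch below illustrates that idea; it is not the package's code, and it always treats n+g as the digraph, whereas the package applies extra rules for words where they are separate letters.

```python
# Two-character units of the Uzbek Latin alphabet: turned-comma
# letters o‘/g‘ and digraphs sh/ch/ng, in all case variants.
UNITS = {"o‘", "g‘", "sh", "ch", "ng",
         "O‘", "G‘", "Sh", "Ch", "Ng", "SH", "CH", "NG"}

def char_tokenize(text, keep_spaces=False):
    """Greedy scan: try a two-character unit first, otherwise emit
    one character. Spaces are dropped unless keep_spaces is True."""
    out, i = [], 0
    while i < len(text):
        if text[i:i + 2] in UNITS:
            out.append(text[i:i + 2])
            i += 2
        else:
            if text[i] != " " or keep_spaces:
                out.append(text[i])
            i += 1
    return out
```

Punctuation falls through the single-character branch, which matches the 'Salom, dunyo!' example: the comma and exclamation mark stay in the output while the space is dropped by default.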
License
This project is licensed under the MIT License.
File details
Details for the file uzbektokenization-2.0.tar.gz.
File metadata
- Download URL: uzbektokenization-2.0.tar.gz
- Upload date:
- Size: 52.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1e0f128c1edb9dc748be608fca1840fe4264ccdbc33e0dc52252e1d28ea2c3f8 |
| MD5 | a943ac352a5b0609e29f5d28de04301c |
| BLAKE2b-256 | ac7d8faab5914a17a0688f0a4871d575c20783ff9fe2776b2bfa36e594fa8e58 |
File details
Details for the file uzbektokenization-2.0-py3-none-any.whl.
File metadata
- Download URL: uzbektokenization-2.0-py3-none-any.whl
- Upload date:
- Size: 52.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bea6ca8211a500de8a06ee0c8205ca8fa9bbcbcaeb3c6342a7aea7c59737e3b8 |
| MD5 | c4eb949496e489294448625548ffc2aa |
| BLAKE2b-256 | ad54c5c8227ecd4f29089089237d3e0b2d7c02ce7c7a13bb543a273e5f44b750 |