Uzbek Lemmatizer for Python
UzLemma
An Uzbek language lemmatizer for Python
All studies on Uzbek word stems share a common statement: stemming Uzbek is hard. Uzbek is an agglutinative language with a very rich morphological structure. Uzbek words are composed of a stem and one or more affixes. There are two forms of affixes in Uzbek: prefixes and suffixes. Affixes are attached to the stem according to definite grammatical rules. In addition, both the stem and the affixes may be transformed according to the harmony rules. These rules and their exceptions make stemming harder for Uzbek texts. For more about stemming Uzbek, see the article titled "UZBEK AFFIX FINITE STATE MACHINE FOR STEMMING."
Most text analysis studies require a stemmer at some point. This Python code attempts to stem Uzbek words with a simple approach. It first extracts the syllables of the given word and then tries to identify the stem by comparing the syllables with a list of affixes and their allomorphs. If an affix is identified, it is removed and the remaining word is looked up in a list of Uzbek words. If there is a match in the word list, that word is returned as the stem. Otherwise, the function repeats the process with the shortened word. If the word cannot be stemmed, the function returns it unchanged.
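The strip-and-look-up loop described above can be illustrated with a minimal sketch. This is not the package's actual implementation: the name `stem_sketch`, the tiny `SUFFIXES` and `WORDS` lists, and the longest-suffix-first ordering are all assumptions made for illustration, and the syllable extraction step is replaced here by plain string suffix matching.

```python
# Toy data for illustration only (hypothetical, not the package's real lists).
SUFFIXES = ("ning", "imiz", "lar", "ni", "da", "ga")
WORDS = {"maktab", "kitob", "uy"}


def stem_sketch(word, suffixes=SUFFIXES, words=WORDS):
    """Strip one known suffix at a time until the remainder is a known word.

    Simplification: suffixes are matched on the raw string ending instead of
    on extracted syllables, and allomorphs are not handled.
    """
    current = word.lower()
    # Assumption: try longer suffixes first so "ning" is preferred over "ni".
    ordered = sorted(suffixes, key=len, reverse=True)
    while True:
        for suffix in ordered:
            # Never strip the whole word away.
            if len(current) > len(suffix) and current.endswith(suffix):
                current = current[: -len(suffix)]
                break
        else:
            # No affix identified: return the given word unchanged.
            return word
        if current in words:
            # The remainder is in the word list: treat it as the stem.
            return current
        # Otherwise repeat with the shortened word.


print(stem_sketch("maktablarimizning"))  # maktab
```

With the toy lists above, `maktablarimizning` is reduced in three passes (`-ning`, `-imiz`, `-lar`) before `maktab` is found in the word list. Each pass shortens the word, so the loop always terminates.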
Once the functions are loaded into your Python environment, you can begin stemming with the stem function:
stem("maktablarimizning")
File details
Details for the file UzLemma-1.0-py3-none-any.whl.
File metadata
- Download URL: UzLemma-1.0-py3-none-any.whl
- Upload date:
- Size: 2.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.26.0 requests-toolbelt/0.9.1 urllib3/1.26.7 tqdm/4.56.0 importlib-metadata/4.11.2 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.9.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | ba67c9ba3c47643cee20c07c590b18d1c2dad187e0d31f843c95369699cdec4b
MD5 | bfe9ae6147983b46c3bff624d033a346
BLAKE2b-256 | 9f7d24f645fefe3e3505590cb1056e491d6aaa55040403600bdf3187723902bb