Part-of-Speech Tagger for the Uzbek Language based on Tahrirchi-BERT
UzbekTaggerBERT: High-Accuracy POS Tagger for the Uzbek Language
UzbekTaggerBERT is an open-source Part-of-Speech (POS) tagging library designed specifically for the highly agglutinative Uzbek language. It is built by fine-tuning the state-of-the-art Tahrirchi-BERT (RoBERTa architecture) on a comprehensive Uzbek morphological dataset.
The project addresses the subword-tokenization problems that agglutinative languages pose, giving researchers and developers an accurate POS tagging tool for Natural Language Processing (NLP) in Uzbek.
MODEL PERFORMANCE & HUGGING FACE
The core model used in this tagger is publicly available on the Hugging Face Hub: https://huggingface.co/MaksudSharipov/Uzbek-POS-Tagger-TahrirchiBERT
Evaluation metrics on the UzbekPOS test dataset demonstrate state-of-the-art results:
- Accuracy: 0.9810
- Weighted F1: 0.9811
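For readers unfamiliar with the two metrics, the sketch below shows how accuracy and support-weighted F1 are commonly computed from gold and predicted tag sequences. The function name and the four-token toy data are purely illustrative assumptions; they are not the UzbekPOS test set and not the evaluation script behind the scores above.

```python
from collections import Counter

def accuracy_and_weighted_f1(gold, pred):
    """Return (accuracy, weighted F1) for two equal-length tag lists."""
    total = len(gold)
    accuracy = sum(g == p for g, p in zip(gold, pred)) / total

    support = Counter(gold)  # number of gold tokens per tag
    weighted_f1 = 0.0
    for label, n in support.items():
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = n - tp
        f1 = 2 * tp / (2 * tp + fp + fn)
        # Each tag's F1 is weighted by its share of the gold tokens.
        weighted_f1 += f1 * n / total
    return accuracy, weighted_f1

# Toy data for illustration only (hypothetical, not real model output).
gold = ["NOUN", "VERB", "NOUN", "NUM"]
pred = ["NOUN", "VERB", "NUM", "NUM"]
acc, f1 = accuracy_and_weighted_f1(gold, pred)
print(f"{acc:.4f} {f1:.4f}")
```

Weighting by support means frequent tags such as NOUN and VERB dominate the score, which is why weighted F1 and accuracy tend to be close on POS tagging tasks.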
KEY SCIENTIFIC FEATURES
- Contextual Homonym Disambiguation: Accurately distinguishes between morphologically identical but semantically different words based on context (e.g., "yuz" as a NOUN vs. "yuz" as a NUM).
- Smart Text Normalization: Automatically standardizes various Cyrillic/Latin apostrophes to prevent the tokenizer from incorrectly fragmenting words like "O'zbekiston".
- Overlap-based Offset Mapping: Aligns model-generated subword tokens back to the original words by comparing character-offset overlap, which avoids the "UNKNOWN" tags that appear in Transformer-based pipelines when subword and word boundaries do not match.
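The normalization feature can be pictured with a minimal sketch like the one below. The set of apostrophe variants, the choice of the ASCII apostrophe as the target character, and the function name `normalize_apostrophes` are assumptions made for illustration; the library's actual character list and target are not documented in this README.

```python
import re

# Hypothetical list of apostrophe look-alikes: left/right single quotes,
# modifier letters U+02BB and U+02BC, backtick, and acute accent.
APOSTROPHE_VARIANTS = "\u2018\u2019\u02BB\u02BC\u0060\u00B4"
_APOSTROPHE_RE = re.compile("[" + APOSTROPHE_VARIANTS + "]")

def normalize_apostrophes(text: str) -> str:
    """Replace apostrophe look-alikes with the ASCII apostrophe (U+0027)."""
    return _APOSTROPHE_RE.sub("'", text)

# A word typed with a left single quotation mark becomes the ASCII form.
print(normalize_apostrophes("O\u2018zbekiston"))  # O'zbekiston
```

Mapping every variant to one character before tokenization means the tokenizer always sees the same spelling, whichever keyboard layout produced the text.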
INSTALLATION
You can install the package directly from PyPI using pip:
pip install uzbek-tagger-bert
Dependencies: This package requires "transformers" and "torch". They will be installed automatically if not already present.
QUICK START & EXAMPLE
Using the tagger is straightforward. It abstracts away the complexities of subword tokenization and tensor operations.
from uzbek_tagger_bert import UzbekTaggerBERT
# 1. Initialize the tagger
tagger = UzbekTaggerBERT()
# 2. Input a complex sentence in Uzbek
text = "U yuz burib ketdi va yuz soat ishladi.\nAlisher sariq olma olib menga qizil olma olmading dedi."
# 3. Get the tagged output
result = tagger(text)
print(result)
[Expected Output]
U/PRON yuz/NOUN burib/VERB ketdi/VERB va/CCONJ yuz/NUM soat/NOUN ishladi/VERB ./PUNCT
Alisher/PROPN sariq/ADJ olma/NOUN olib/VERB menga/PRON qizil/ADJ olma/NOUN olmading/VERB dedi/VERB ./PUNCT
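If the tags are needed as data rather than as text, the output can be split into pairs. The sketch below assumes the result is a string in the `word/TAG` format shown above, with one sentence per line; the README does not state the return type, so `parse_tagged` is a hypothetical helper and not part of the package.

```python
def parse_tagged(output: str) -> list[list[tuple[str, str]]]:
    """Split 'word/TAG' text into one list of (word, tag) pairs per line."""
    sentences = []
    for line in output.splitlines():
        pairs = []
        for item in line.split():
            # rpartition splits on the last '/', so a '/' inside a word survives.
            word, _, tag = item.rpartition("/")
            pairs.append((word, tag))
        if pairs:
            sentences.append(pairs)
    return sentences

sample = "U/PRON yuz/NOUN burib/VERB ketdi/VERB va/CCONJ yuz/NUM soat/NOUN ishladi/VERB ./PUNCT"
parsed = parse_tagged(sample)
print(parsed[0][1], parsed[0][5])  # ('yuz', 'NOUN') ('yuz', 'NUM')
```

The two printed pairs show the same surface form "yuz" carrying different tags depending on its context.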
METHODOLOGY & TOKENIZATION HANDLING
Because Uzbek relies heavily on suffixes, subword tokenizers split many words into several sub-tokens. The tagger maps these sub-tokens back to the original words by checking how their character offsets overlap, so that each whole word receives a single final POS tag.
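One way to picture overlap-based alignment is the sketch below: each word receives the tag of the subword token whose character span overlaps it most. This is an illustrative reconstruction, not the package's implementation; the function names, the subword split, the token spans, and the fallback tag are all hypothetical.

```python
import re

def overlap(a, b):
    """Number of characters shared by two half-open (start, end) spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def align_tags(words, token_spans, token_tags, default="UNKNOWN"):
    """Give each word span the tag of the token span that overlaps it most."""
    word_tags = []
    for word in words:
        best_tag, best_overlap = default, 0
        for span, tag in zip(token_spans, token_tags):
            shared = overlap(word, span)
            if shared > best_overlap:
                best_tag, best_overlap = tag, shared
        word_tags.append(best_tag)
    return word_tags

text = "yuz soat ishladi"
words = [m.span() for m in re.finditer(r"\S+", text)]  # whitespace-delimited words

# Hypothetical subword split "yuz", " soat", " ish", "ladi"; some spans include
# the leading space, so they do not match the word boundaries exactly.
token_spans = [(0, 3), (3, 8), (8, 12), (12, 16)]
token_tags = ["NUM", "NOUN", "VERB", "VERB"]

print(align_tags(words, token_spans, token_tags))  # ['NUM', 'NOUN', 'VERB']
```

An exact-match lookup on span boundaries would fail for "soat" and "ishladi" here, whereas the overlap test still finds a tag; the fallback is used only when no token overlaps a word at all.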
ACADEMIC CITATION
This tagger model was fine-tuned on an expanded version of the multi-domain dataset introduced in our published research. If you use this library, model, or dataset in academic or research projects, please cite the following paper:
[APA Style] Sharipov, M., Kuriyozov, E., & Vičič, J. (2026). UzbekPOS: A multi-domain dataset for Uzbek part-of-speech tagging. Data in Brief, 112640. https://doi.org/10.1016/j.dib.2026.112640
[BibTeX]
@article{sharipov2026uzbekpos,
title={UzbekPOS: A multi-domain dataset for Uzbek part-of-speech tagging},
author={Sharipov, Maksud and Kuriyozov, Elmurod and Vičič, Jernej},
journal={Data in Brief},
pages={112640},
year={2026},
publisher={Elsevier},
doi={10.1016/j.dib.2026.112640}
}
LICENSE
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.