Part-of-Speech Tagger for the Uzbek Language based on Tahrirchi-BERT

======================================================================
UzbekTaggerBERT: High-Accuracy POS Tagger for the Uzbek Language
======================================================================

UzbekTaggerBERT is an open-source Part-of-Speech (POS) tagging library designed specifically for the highly agglutinative Uzbek language. It is built by fine-tuning the state-of-the-art Tahrirchi-BERT (RoBERTa architecture) on a comprehensive Uzbek morphological dataset.

This project addresses the main tokenization challenges of agglutinative languages, giving researchers and developers an accurate tool for Natural Language Processing (NLP) in Uzbek.

MODEL PERFORMANCE & HUGGING FACE

The core model used in this tagger is publicly available on the Hugging Face Hub: https://huggingface.co/MaksudSharipov/Uzbek-POS-Tagger-TahrirchiBERT

On the UzbekPOS test set, the model achieves the following scores:

  • Accuracy: 0.9810
  • Weighted F1: 0.9811
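
For readers unfamiliar with the two metrics, the sketch below shows how they are conventionally defined at the token level. It assumes the standard definitions (weighted F1 as the per-tag F1 averaged by each tag's frequency in the gold data); the README does not state which evaluation script was used, and the function names and toy tags here are illustrative only.

```python
from collections import Counter

def accuracy(gold, pred):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def weighted_f1(gold, pred):
    """Per-tag F1, averaged with weights equal to each tag's gold frequency."""
    support = Counter(gold)
    total = 0.0
    for tag, n in support.items():
        tp = sum(g == tag and p == tag for g, p in zip(gold, pred))
        fp = sum(g != tag and p == tag for g, p in zip(gold, pred))
        fn = n - tp
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is > 0 because n > 0.
        f1 = 2 * tp / (2 * tp + fp + fn)
        total += n * f1
    return total / len(gold)

# Toy data (not from the UzbekPOS dataset): one NUM token is mistagged as NOUN.
gold = ["PRON", "NOUN", "VERB", "NUM", "NOUN"]
pred = ["PRON", "NOUN", "VERB", "NOUN", "NOUN"]

print(accuracy(gold, pred))
print(weighted_f1(gold, pred))
```

Because the weights follow gold-tag frequency, a frequent tag such as NOUN influences the weighted score more than a rare one.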

KEY SCIENTIFIC FEATURES

  • Contextual Homonym Disambiguation: Distinguishes identically spelled words with different meanings based on context (e.g., "yuz" as a NOUN vs. "yuz" as a NUM).
  • Smart Text Normalization: Automatically standardizes various Cyrillic/Latin apostrophes to prevent the tokenizer from incorrectly fragmenting words like "O'zbekiston".
  • Overlap-based Offset Mapping: Aligns model-generated subword tokens back to their original words by comparing offset overlap, which avoids the "UNKNOWN" tags that misaligned tokens commonly produce in Transformer-based pipelines.
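
The apostrophe normalization described above could look roughly like the following. This is a minimal sketch, not the package's actual code: the set of variant characters, the choice of the ASCII apostrophe as the target, and the name normalize_apostrophes are all assumptions made for illustration.

```python
# Hypothetical variant set: curly quotes, modifier letters, backtick, acute accent.
# The package's real mapping and target character may differ.
APOSTROPHE_VARIANTS = "\u2018\u2019\u02bb\u02bc\u0060\u00b4"
_TABLE = str.maketrans({ch: "'" for ch in APOSTROPHE_VARIANTS})

def normalize_apostrophes(text: str) -> str:
    """Replace every known apostrophe variant with the ASCII apostrophe."""
    return text.translate(_TABLE)

# "O‘zbekiston" written with U+2018 becomes "O'zbekiston" with U+0027.
print(normalize_apostrophes("O\u2018zbekiston"))
```

Mapping all variants to one character before tokenization means the tokenizer always sees the same spelling of a word, whichever keyboard layout produced it.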

INSTALLATION

You can install the package directly from PyPI using pip:

pip install uzbek-tagger-bert

Dependencies: This package requires "transformers" and "torch". They will be installed automatically if not already present.

QUICK START & EXAMPLE

Using the tagger is straightforward. It abstracts away the complexities of subword tokenization and tensor operations.

from uzbek_tagger_bert import UzbekTaggerBERT

# 1. Initialize the tagger
tagger = UzbekTaggerBERT()

# 2. Input a complex sentence in Uzbek
text = "U yuz burib ketdi va yuz soat ishladi.\nAlisher sariq olma olib menga qizil olma olmading dedi."

# 3. Get the tagged output
result = tagger(text)

print(result)

[Expected Output]

U/PRON yuz/NOUN burib/VERB ketdi/VERB va/CCONJ yuz/NUM soat/NOUN ishladi/VERB ./PUNCT 
Alisher/PROPN sariq/ADJ olma/NOUN olib/VERB menga/PRON qizil/ADJ olma/NOUN olmading/VERB dedi/VERB ./PUNCT 
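
For downstream processing, the printed format can be split into (word, tag) pairs. The sketch below assumes the result is (or can be converted to) a string in exactly the "word/TAG" layout shown above, with one sentence per line; parse_tagged is a hypothetical helper, not part of the package.

```python
def parse_tagged(output: str):
    """Turn 'word/TAG word/TAG ...' lines into lists of (word, tag) tuples."""
    sentences = []
    for line in output.splitlines():
        # rsplit on the last "/" so a token such as "./PUNCT" parses correctly.
        pairs = [tuple(token.rsplit("/", 1)) for token in line.split()]
        if pairs:
            sentences.append(pairs)
    return sentences

sample = (
    "U/PRON yuz/NOUN burib/VERB ketdi/VERB va/CCONJ yuz/NUM soat/NOUN ishladi/VERB ./PUNCT \n"
    "Alisher/PROPN sariq/ADJ olma/NOUN olib/VERB menga/PRON qizil/ADJ olma/NOUN olmading/VERB dedi/VERB ./PUNCT \n"
)

parsed = parse_tagged(sample)
for word, tag in parsed[0]:
    print(word, tag)
```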

METHODOLOGY & TOKENIZATION HANDLING

Because Uzbek relies heavily on suffixes, standard subword tokenizers split a single word into multiple sub-tokens. Rather than relying on the default pipeline's token grouping, this tool checks how each sub-token's offsets overlap the original word boundaries and reassembles the whole words before assigning each one its final POS tag.
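
One way such an overlap-based alignment can work is sketched below. It is an illustration under stated assumptions, not the library's implementation: the word splitter, the rule "take the tag of the sub-token with the largest character overlap", and the hand-written sub-token spans are all hypothetical. One span deliberately includes a leading space to show that exact boundary equality is not required.

```python
import re

def word_spans(text):
    """Character spans of words and punctuation marks (a simplistic splitter)."""
    return [(m.start(), m.end()) for m in re.finditer(r"[\w']+|[^\w\s]", text)]

def align(text, subword_preds):
    """Give each word the tag of the sub-token that overlaps it the most.

    subword_preds: list of (start, end, tag) character spans.
    """
    tagged = []
    for w_start, w_end in word_spans(text):
        best_tag, best_overlap = "UNKNOWN", 0
        for s_start, s_end, tag in subword_preds:
            # Length of the intersection of the two spans (<= 0 if disjoint).
            overlap = min(w_end, s_end) - max(w_start, s_start)
            if overlap > best_overlap:
                best_tag, best_overlap = tag, overlap
        tagged.append((text[w_start:w_end], best_tag))
    return tagged

text = "yuz soat ishladi."
# Hypothetical sub-token predictions: "yuz", " soat", " ishl", "adi", "."
subword_preds = [
    (0, 3, "NUM"),
    (3, 8, "NOUN"),
    (8, 13, "VERB"),
    (13, 16, "VERB"),
    (16, 17, "PUNCT"),
]

print(align(text, subword_preds))
```

A word falls back to "UNKNOWN" only when no sub-token overlaps it at all, which is the failure case the overlap check is meant to avoid.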

ACADEMIC CITATION

This tagger was fine-tuned on an expanded version of the multi-domain dataset introduced in our published research. If you use this library, model, or dataset in academic or research work, please cite the following paper:

[APA Style] Sharipov, M., Kuriyozov, E., & Vičič, J. (2026). UzbekPOS: A multi-domain dataset for Uzbek part-of-speech tagging. Data in Brief, 112640. https://doi.org/10.1016/j.dib.2026.112640

[BibTeX]

@article{sharipov2026uzbekpos,
  title={UzbekPOS: A multi-domain dataset for Uzbek part-of-speech tagging},
  author={Sharipov, Maksud and Kuriyozov, Elmurod and Vičič, Jernej},
  journal={Data in Brief},
  pages={112640},
  year={2026},
  publisher={Elsevier},
  doi={10.1016/j.dib.2026.112640}
}

LICENSE

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
