Part-of-Speech Tagger for the Uzbek Language based on Tahrirchi-BERT

UzbekTaggerBERT: High-Accuracy POS Tagger for the Uzbek Language

UzbekTaggerBERT is an open-source Part-of-Speech (POS) tagging library designed specifically for the highly agglutinative Uzbek language. It is built by fine-tuning the state-of-the-art Tahrirchi-BERT (RoBERTa architecture) on a comprehensive Uzbek morphological dataset.

The project addresses tokenization problems typical of agglutinative languages, notably the fragmentation of suffixed words into subword tokens and inconsistent apostrophe characters, and gives researchers and developers an accurate tool for Natural Language Processing (NLP) in Uzbek.

MODEL PERFORMANCE & HUGGING FACE

The core model used in this tagger is publicly available on the Hugging Face Hub: https://huggingface.co/MaksudSharipov/Uzbek-POS-Tagger-TahrirchiBERT

Evaluation on the UzbekPOS test set gives the following results, which the authors report as state-of-the-art:

  • Accuracy: 0.9810
  • Weighted F1: 0.9811
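For readers unfamiliar with the two metrics, the sketch below shows how accuracy and a support-weighted F1 score are conventionally computed from gold and predicted tag sequences. It assumes the reported figure uses the usual support-weighted average of per-tag F1 scores; the function names are illustrative and are not part of this package, and the project's own evaluation script is not shown here.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of positions where the predicted tag equals the gold tag."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def weighted_f1(gold, pred):
    """Per-tag F1 averaged with weights equal to each tag's gold frequency."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    support = Counter(gold)          # how often each tag occurs in the gold data
    predicted = Counter(pred)        # how often each tag was predicted
    correct = Counter(g for g, p in zip(gold, pred) if g == p)

    total = 0.0
    for tag, n in support.items():
        precision = correct[tag] / predicted[tag] if predicted[tag] else 0.0
        recall = correct[tag] / n
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n              # weight the tag's F1 by its support
    return total / len(gold)


# Toy data for illustration only (not from the UzbekPOS test set).
gold_tags = ["NOUN", "VERB", "NOUN", "NUM"]
pred_tags = ["NOUN", "VERB", "NUM", "NUM"]
print(accuracy(gold_tags, pred_tags))
print(weighted_f1(gold_tags, pred_tags))
```

Weighting by support keeps frequent tags such as NOUN and VERB from being outvoted by rare ones, which is why accuracy and weighted F1 tend to land close together on POS tagging tasks.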

KEY SCIENTIFIC FEATURES

  • Contextual Homonym Disambiguation: Distinguishes identically spelled words that differ in meaning and part of speech, based on context (e.g., "yuz" as a NOUN vs. "yuz" as a NUM).
  • Smart Text Normalization: Automatically standardizes various Cyrillic/Latin apostrophes to prevent the tokenizer from incorrectly fragmenting words like "O'zbekiston".
  • Overlap-based Offset Mapping: Uses character-span overlap to align model-generated subword tokens back to the original words, avoiding the "UNKNOWN" tag issue common in Transformer-based pipelines.
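The apostrophe normalization described above can be sketched as a simple character translation. This is a minimal illustration, not the package's actual code: the list of variants and the choice of the ASCII apostrophe as the target are assumptions, and `normalize_apostrophes` is a hypothetical name.

```python
# Apostrophe-like characters that commonly appear in Uzbek Latin text.
# This list is an assumption; the package may handle a different set.
APOSTROPHE_VARIANTS = (
    "\u2018"  # left single quotation mark
    "\u2019"  # right single quotation mark
    "\u02bb"  # modifier letter turned comma
    "\u02bc"  # modifier letter apostrophe
    "\u0060"  # grave accent
    "\u00b4"  # acute accent
)

# Map every variant to the plain ASCII apostrophe.
_APOSTROPHE_TABLE = str.maketrans({ch: "'" for ch in APOSTROPHE_VARIANTS})


def normalize_apostrophes(text: str) -> str:
    """Return text with all apostrophe variants replaced by "'"."""
    return text.translate(_APOSTROPHE_TABLE)


print(normalize_apostrophes("O\u2018zbekiston 53-o\u02bbrinda"))
```

Normalizing before tokenization means a word such as "O'zbekiston" reaches the tokenizer in one consistent spelling, regardless of which keyboard layout or source produced it.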

INSTALLATION

You can install the package directly from PyPI using pip:

pip install uzbek-tagger-bert

Dependencies: This package requires "transformers" and "torch". They will be installed automatically if not already present.

QUICK START & EXAMPLE

Using the tagger is straightforward. It abstracts away the complexities of subword tokenization and tensor operations.

from uzbek_tagger_bert import UzbekTaggerBERT

# 1. Initialize the tagger
tagger = UzbekTaggerBERT()

# 2. Input a complex sentence in Uzbek
text = "U yuz burib ketdi va yuz soat ishladi.\nAlisher sariq olma olib menga qizil olma olmading dedi. \nO'zbekiston eng baxtli davlatlar reytingida 53-o'rinda qoldi."

# 3. Get the tagged output
result = tagger(text)

print(result)

[Expected Output]

U/PRON yuz/NOUN burib/VERB ketdi/VERB va/CCONJ yuz/NUM soat/NOUN ishladi/VERB ./PUNCT 
Alisher/PROPN sariq/ADJ olma/NOUN olib/VERB menga/PRON qizil/ADJ olma/NOUN olmading/VERB dedi/VERB ./PUNCT 
O'zbekiston/PROPN eng/ADV baxtli/ADJ davlatlar/NOUN reytingida/NOUN 53-o'rinda/NUM qoldi/VERB ./PUNCT
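For downstream processing it is often convenient to turn output in this word/TAG layout into (word, tag) pairs. The helper below is a sketch that assumes the output is available as a plain string formatted as shown above, with one sentence per line; `parse_tagged` is a hypothetical name and is not part of the package.

```python
def parse_tagged(output: str) -> list[list[tuple[str, str]]]:
    """Split word/TAG text into one list of (word, tag) pairs per line."""
    sentences = []
    for line in output.splitlines():
        pairs = []
        for item in line.split():
            # Split on the LAST slash so a word containing "/" stays intact.
            word, sep, tag = item.rpartition("/")
            if not sep:
                # No slash at all: keep the item as a word with an empty tag.
                word, tag = item, ""
            pairs.append((word, tag))
        if pairs:
            sentences.append(pairs)
    return sentences


sample = "U/PRON yuz/NOUN burib/VERB ketdi/VERB ./PUNCT \nyuz/NUM soat/NOUN"
for sentence in parse_tagged(sample):
    print(sentence)
```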

METHODOLOGY & TOKENIZATION HANDLING

Because Uzbek relies heavily on suffixes, subword tokenizers split most words into several sub-tokens, so the model produces one prediction per sub-token rather than per word. Rather than relying on the standard pipeline's grouping, this tool checks each sub-token's character boundaries against the boundaries of the original words, regroups the sub-tokens into whole words, and only then assigns the final POS tag.
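One way to picture the boundary check is as a character-overlap match between word spans and sub-token spans. The sketch below is an illustration under stated assumptions, not the package's implementation: words are found with a simple regular expression, sub-token offsets and tags are passed in as plain lists (as a fast tokenizer's offset mapping would provide them), and each word takes the tag of the sub-token that overlaps it most. The names `word_spans` and `align_tags` are hypothetical.

```python
import re

# Assumed word pattern: a run of letters/digits with inner apostrophes or
# hyphens (e.g. "53-o'rinda"), or a single punctuation character.
WORD_PATTERN = re.compile(r"\w[\w'-]*|[^\w\s]")


def word_spans(text):
    """Return (start, end) character spans of the words in text."""
    return [(m.start(), m.end()) for m in WORD_PATTERN.finditer(text)]


def align_tags(text, token_offsets, token_tags):
    """Give each word the tag of the sub-token with the largest overlap."""
    results = []
    for w_start, w_end in word_spans(text):
        best_tag, best_overlap = "UNKNOWN", 0
        for (t_start, t_end), tag in zip(token_offsets, token_tags):
            # Length of the intersection of the two character intervals.
            overlap = min(w_end, t_end) - max(w_start, t_start)
            if overlap > best_overlap:
                best_tag, best_overlap = tag, overlap
        results.append((text[w_start:w_end], best_tag))
    return results


text = "yuz soat ishladi."
# Made-up sub-token spans; some include the preceding space, as BPE
# tokenizers often do. "ishladi" is split into two sub-tokens whose
# predictions disagree, to show how the overlap rule decides.
offsets = [(0, 3), (3, 8), (8, 13), (13, 16), (16, 17)]
tags = ["NUM", "NOUN", "VERB", "NOUN", "PUNCT"]
print(align_tags(text, offsets, tags))
```

Because a word falls back to "UNKNOWN" only when no sub-token overlaps it at all, sub-token spans that start one character early (for example, by including a leading space) still map to the correct word.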

ACADEMIC CITATION

This tagger model was fine-tuned on an expanded version of the multi-domain dataset introduced in our published research. If you use this library, model, or dataset in academic or research projects, please cite the following paper:

[APA Style] Sharipov, M., Kuriyozov, E., & Vičič, J. (2026). UzbekPOS: A multi-domain dataset for Uzbek part-of-speech tagging. Data in Brief, 112640. https://doi.org/10.1016/j.dib.2026.112640

[BibTeX]

@article{sharipov2026uzbekpos,
  title={UzbekPOS: A multi-domain dataset for Uzbek part-of-speech tagging},
  author={Sharipov, Maksud and Kuriyozov, Elmurod and Vičič, Jernej},
  journal={Data in Brief},
  pages={112640},
  year={2026},
  publisher={Elsevier},
  doi={10.1016/j.dib.2026.112640}
}

LICENSE

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
