Part-of-Speech Tagger for the Uzbek Language based on Tahrirchi-BERT

======================================================================
UzbekTaggerBERT: High-Accuracy POS Tagger for the Uzbek Language
======================================================================

UzbekTaggerBERT is an open-source Part-of-Speech (POS) tagging library designed specifically for the highly agglutinative Uzbek language. It is built by fine-tuning the state-of-the-art Tahrirchi-BERT (RoBERTa architecture) on a comprehensive Uzbek morphological dataset.

This project solves major tokenization challenges inherent in agglutinative languages, providing researchers and developers with a robust, highly accurate tool for Natural Language Processing (NLP) in Uzbek.

MODEL PERFORMANCE & HUGGING FACE

The core model used in this tagger is publicly available on the Hugging Face Hub:
https://huggingface.co/MaksudSharipov/Uzbek-POS-Tagger-TahrirchiBERT

Evaluation metrics on the UzbekPOS test dataset demonstrate state-of-the-art results:

  • Accuracy: 0.9810
  • Weighted F1: 0.9811
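
For readers unfamiliar with the two metrics, the sketch below shows how accuracy and support-weighted F1 are conventionally computed for token-level tags. It is an illustration only: this README does not state which evaluation script produced the figures above, the function names are hypothetical, and the toy labels are made up.

```python
from collections import Counter

def accuracy(gold, pred):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def weighted_f1(gold, pred):
    """Per-tag F1, averaged with each tag weighted by its gold frequency."""
    support = Counter(gold)
    total = 0.0
    for label, n in support.items():
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(gold)

# Toy data (invented for illustration, not taken from the UzbekPOS test set).
gold = ["NOUN", "NUM", "VERB", "NOUN"]
pred = ["NOUN", "NOUN", "VERB", "NOUN"]
print(accuracy(gold, pred))               # 0.75
print(round(weighted_f1(gold, pred), 2))  # 0.65
```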

KEY SCIENTIFIC FEATURES

  • Contextual Homonym Disambiguation: Accurately distinguishes between morphologically identical but semantically different words based on context (e.g., "yuz" as a NOUN vs. "yuz" as a NUM).
  • Smart Text Normalization: Automatically standardizes various Cyrillic/Latin apostrophes to prevent the tokenizer from incorrectly fragmenting words like "O'zbekiston".
  • Overlap-based Offset Mapping: Aligns model-generated subword tokens back to the original words by measuring how much each token's character offsets overlap each word's span. This avoids the "UNKNOWN" tags that Transformer-based pipelines often produce when subwords and words fail to line up.
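
As a rough illustration of the apostrophe normalization described above, the sketch below maps several apostrophe look-alikes to the ASCII apostrophe before tokenization. The character set and the function name `normalize_apostrophes` are assumptions made for this example; the package's actual list of handled characters is not documented here.

```python
import re

# Assumed set of apostrophe-like characters (not the package's documented list):
# U+2018, U+2019 (curly quotes), U+02BB, U+02BC (modifier letters),
# U+0060 (backtick), U+00B4 (acute accent).
APOSTROPHE_VARIANTS = "\u2018\u2019\u02BB\u02BC\u0060\u00B4"
_APOSTROPHE_RE = re.compile("[" + APOSTROPHE_VARIANTS + "]")

def normalize_apostrophes(text: str) -> str:
    """Replace apostrophe look-alikes with the ASCII apostrophe (U+0027)."""
    return _APOSTROPHE_RE.sub("'", text)

print(normalize_apostrophes("O\u2018zbekiston"))  # O'zbekiston
```

With a single canonical apostrophe, a word such as "O'zbekiston" reaches the tokenizer in one consistent spelling regardless of how the source text was typed.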

INSTALLATION

You can install the package directly from PyPI using pip:

pip install uzbek-tagger-bert

Dependencies: This package requires "transformers" and "torch". They will be installed automatically if not already present.

QUICK START & EXAMPLE

Using the tagger is straightforward. It abstracts away the complexities of subword tokenization and tensor operations.

from uzbek_tagger_bert import UzbekTaggerBERT

# 1. Initialize the tagger
tagger = UzbekTaggerBERT()

# 2. Input a complex sentence in Uzbek
text = "U yuz burib ketdi va yuz soat ishladi. Alisher sariq olma olib menga qizil olma olmading dedi. O'zbekiston eng baxtli davlatlar reytingida 53-o'rinda qoldi."

# 3. Get the tagged output
result = tagger(text)
print(result)

[Expected Output]
U/PRON yuz/NOUN burib/VERB ketdi/VERB va/CCONJ yuz/NUM soat/NOUN ishladi/VERB ./PUNCT Alisher/PROPN sariq/ADJ olma/NOUN olib/VERB menga/PRON qizil/ADJ olma/NOUN olmading/VERB dedi/VERB ./PUNCT O'zbekiston/PROPN eng/ADV baxtli/ADJ davlatlar/NOUN reytingida/NOUN 53-o'rinda/NUM qoldi/VERB ./PUNCT
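
If the printed output is a whitespace-separated string of word/TAG items, as the expected output suggests, it can be split into (word, tag) pairs for downstream use. The return type of the tagger is not documented here, so treat the helper below (`parse_tagged`, a hypothetical name) as a sketch under that assumption.

```python
def parse_tagged(output: str) -> list[tuple[str, str]]:
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs."""
    pairs = []
    for item in output.split():
        # Split on the LAST slash so that words containing '/' stay intact.
        word, sep, tag = item.rpartition("/")
        if not sep:  # no slash found: keep the word, leave the tag empty
            word, tag = item, ""
        pairs.append((word, tag))
    return pairs

sample = "U/PRON yuz/NOUN burib/VERB ketdi/VERB va/CCONJ yuz/NUM soat/NOUN ishladi/VERB ./PUNCT"
pairs = parse_tagged(sample)
print(pairs[1], pairs[5])  # ('yuz', 'NOUN') ('yuz', 'NUM')
```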

METHODOLOGY & TOKENIZATION HANDLING

Because Uzbek relies heavily on suffixes, standard tokenizers split a single word into multiple sub-tokens. Rather than depending on the default pipeline's token grouping, this tool checks sub-token boundaries against the original word boundaries and reassembles each word before assigning its final POS tag.
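
The general idea of overlap-based alignment can be sketched as follows. This is not the package's internal code: the helper names (`word_spans`, `span_overlap`, `align_tags`) are hypothetical, words are taken to be whitespace-separated for simplicity, and the token offsets in the example are written by hand rather than produced by the real tokenizer.

```python
import re

def word_spans(text):
    """Character spans (start, end) of whitespace-separated words."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def span_overlap(a, b):
    """Number of characters shared by two half-open spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def align_tags(text, token_preds):
    """Give each word the tag of the sub-token that overlaps it most.

    token_preds is a list of ((start, end), tag) pairs, one per sub-token.
    A word that no sub-token overlaps falls back to "UNKNOWN".
    On ties, the earliest sub-token wins.
    """
    result = []
    for span in word_spans(text):
        best_tag, best = "UNKNOWN", 0
        for tok_span, tag in token_preds:
            ov = span_overlap(span, tok_span)
            if ov > best:
                best_tag, best = tag, ov
        result.append((text[span[0]:span[1]], best_tag))
    return result

text = "yuz soat ishladi"
# Hand-written sub-token predictions: "ishladi" is split into "ishl" + "adi".
preds = [((0, 3), "NUM"), ((4, 8), "NOUN"), ((9, 13), "VERB"), ((13, 16), "NOUN")]
print(align_tags(text, preds))
# [('yuz', 'NUM'), ('soat', 'NOUN'), ('ishladi', 'VERB')]
```

Here "ishladi" receives VERB because the "ishl" piece covers four of its characters while the "adi" piece covers only three.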

ACADEMIC CITATION

This tagger model was fine-tuned on an expanded version of the multi-domain dataset introduced in our published research. If you use this library, model, or dataset in your academic or research projects, please cite the following paper:

[APA Style] Sharipov, M., Kuriyozov, E., & Vičič, J. (2026). UzbekPOS: A multi-domain dataset for Uzbek part-of-speech tagging. Data in Brief, 112640. https://doi.org/10.1016/j.dib.2026.112640

[BibTeX]
@article{sharipov2026uzbekpos,
  title={UzbekPOS: A multi-domain dataset for Uzbek part-of-speech tagging},
  author={Sharipov, Maksud and Kuriyozov, Elmurod and Vičič, Jernej},
  journal={Data in Brief},
  pages={112640},
  year={2026},
  publisher={Elsevier},
  doi={10.1016/j.dib.2026.112640}
}

LICENSE

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
