Part-of-Speech Tagger for the Uzbek Language based on Tahrirchi-BERT

UzbekTaggerBERT: High-Accuracy POS Tagger for the Uzbek Language

UzbekTaggerBERT is an open-source Part-of-Speech (POS) tagging library designed specifically for the highly agglutinative Uzbek language. It is built by fine-tuning the state-of-the-art Tahrirchi-BERT (RoBERTa architecture) on a comprehensive Uzbek morphological dataset.

This project addresses the tokenization challenges that agglutinative languages pose for subword-based models, giving researchers and developers an accurate POS tagging tool for Natural Language Processing (NLP) in Uzbek.

MODEL PERFORMANCE & HUGGING FACE

The core model used in this tagger is publicly available on the Hugging Face Hub: https://huggingface.co/MaksudSharipov/Uzbek-POS-Tagger-TahrirchiBERT

Evaluation metrics on the UzbekPOS test dataset demonstrate state-of-the-art results:

Accuracy: 0.9810
Weighted F1: 0.9811
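
For readers comparing the two figures, the sketch below shows how token-level accuracy and support-weighted F1 are commonly computed from gold and predicted tags. The tag lists are invented for illustration and are not taken from the UzbekPOS test set; the source does not state which tool produced the reported numbers, so the weighting shown here (per-tag F1 weighted by gold frequency) is an assumption.

```python
from collections import Counter

def accuracy(gold, pred):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)

def weighted_f1(gold, pred):
    """Per-tag F1, averaged with weights proportional to each tag's gold count."""
    support = Counter(gold)
    total = 0.0
    for tag, n_gold in support.items():
        tp = sum(g == tag and p == tag for g, p in zip(gold, pred))
        n_pred = sum(p == tag for p in pred)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_gold
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n_gold
    return total / len(gold)

# Toy example (invented tags, not from the UzbekPOS test set).
gold = ["PRON", "NOUN", "VERB", "NUM", "NOUN"]
pred = ["PRON", "NOUN", "VERB", "NOUN", "NOUN"]
print(round(accuracy(gold, pred), 4))     # 0.8
print(round(weighted_f1(gold, pred), 4))  # 0.72
```

In the toy data the single NUM token is mistagged as NOUN, which lowers NOUN precision and gives NUM an F1 of zero, so the weighted F1 falls below the accuracy.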

KEY SCIENTIFIC FEATURES

Contextual Homonym Disambiguation: Accurately distinguishes between morphologically identical but semantically different words based on context (e.g., "yuz" as a NOUN vs. "yuz" as a NUM).
Smart Text Normalization: Automatically standardizes various Cyrillic/Latin apostrophes to prevent the tokenizer from incorrectly fragmenting words like "O'zbekiston".
Overlap-based Offset Mapping: Uses character-span overlap to align model-generated subword tokens back to the original words they came from, avoiding the "UNKNOWN" tags that misaligned subwords can otherwise produce in Transformer-based pipelines.
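
The normalization step can be pictured with a minimal sketch like the one below. The helper name `normalize_apostrophes`, the set of look-alike characters, and the choice of the ASCII apostrophe as the target are all assumptions made for illustration; the package's actual rules may differ.

```python
# Hypothetical helper: the variant set and target character are assumptions,
# not the package's actual normalization rules.
APOSTROPHE_VARIANTS = "\u2018\u2019\u02bb\u02bc\u0060\u00b4"
_TABLE = str.maketrans({ch: "'" for ch in APOSTROPHE_VARIANTS})

def normalize_apostrophes(text: str) -> str:
    """Map common apostrophe look-alikes to the ASCII apostrophe."""
    return text.translate(_TABLE)

# U+2018 and U+02BB are both replaced by a plain "'".
print(normalize_apostrophes("O\u2018zbekiston va o\u02bbrin"))
```

Mapping every variant to a single character before tokenization means a word such as "O'zbekiston" always reaches the tokenizer in the same form, whichever keyboard layout produced it.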

INSTALLATION

You can install the package directly from PyPI using pip:

pip install uzbek-tagger-bert

Dependencies: This package requires "transformers" and "torch". They will be installed automatically if not already present.

QUICK START & EXAMPLE

Using the tagger is straightforward. It abstracts away the complexities of subword tokenization and tensor operations.

from uzbek_tagger_bert import UzbekTaggerBERT

# 1. Initialize the tagger
tagger = UzbekTaggerBERT()

# 2. Input a complex sentence in Uzbek
text = "U yuz burib ketdi va yuz soat ishladi.\nAlisher sariq olma olib menga qizil olma olmading dedi. \nO'zbekiston eng baxtli davlatlar reytingida 53-o'rinda qoldi."

# 3. Get the tagged output
result = tagger(text)

print(result)

[Expected Output]

U/PRON yuz/NOUN burib/VERB ketdi/VERB va/CCONJ yuz/NUM soat/NOUN ishladi/VERB ./PUNCT 
Alisher/PROPN sariq/ADJ olma/NOUN olib/VERB menga/PRON qizil/ADJ olma/NOUN olmading/VERB dedi/VERB ./PUNCT 
O'zbekiston/PROPN eng/ADV baxtli/ADJ davlatlar/NOUN reytingida/NOUN 53-o'rinda/NUM qoldi/VERB ./PUNCT
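
For downstream processing it is often convenient to turn this string into (word, tag) pairs. The sketch below assumes the result is a plain string formatted exactly as in the expected output above, with one sentence per line; `parse_tagged` is a hypothetical helper and not part of the package.

```python
def parse_tagged(output: str):
    """Turn 'word/TAG word/TAG ...' lines into lists of (word, tag) pairs."""
    sentences = []
    for line in output.splitlines():
        pairs = []
        for item in line.split():
            # Split on the last slash so a word that itself contains '/' stays intact.
            word, tag = item.rsplit("/", 1)
            pairs.append((word, tag))
        if pairs:
            sentences.append(pairs)
    return sentences

# First line of the expected output shown above.
sample = "U/PRON yuz/NOUN burib/VERB ketdi/VERB va/CCONJ yuz/NUM soat/NOUN ishladi/VERB ./PUNCT"
for word, tag in parse_tagged(sample)[0]:
    print(f"{word}\t{tag}")
```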

METHODOLOGY & TOKENIZATION HANDLING

Because Uzbek relies heavily on suffixes, subword tokenizers split many words into multiple sub-tokens. This tool bypasses the standard pipeline's constraints and uses dynamic boundary checking to reassemble each original word from its sub-tokens before assigning the final POS tag.
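
The general idea of overlap-based alignment can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: the subword offsets and tags are hard-coded in place of real tokenizer and model output, the function names are hypothetical, and taking the tag of the first overlapping sub-token is one possible policy that the source does not confirm.

```python
import re

def word_spans(text):
    """Character spans (start, end) of whitespace-delimited words."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def align(text, sub_offsets, sub_tags):
    """Give each word the tag of the first sub-token whose span overlaps it."""
    tagged = []
    for w_start, w_end in word_spans(text):
        tag = "UNKNOWN"  # fallback when no sub-token overlaps the word
        for (s_start, s_end), s_tag in zip(sub_offsets, sub_tags):
            # Two half-open intervals overlap if each starts before the other ends.
            if s_start < w_end and w_start < s_end:
                tag = s_tag
                break
        tagged.append((text[w_start:w_end], tag))
    return tagged

text = "reytingida qoldi"
# Invented segmentation for illustration: "reyting" + "ida", "qol" + "di".
sub_offsets = [(0, 7), (7, 10), (11, 14), (14, 16)]
sub_tags = ["NOUN", "NOUN", "VERB", "VERB"]
print(align(text, sub_offsets, sub_tags))
```

Testing for overlap rather than an exact start-position match keeps the alignment working even when a sub-token's offsets include adjacent whitespace, which is one way a word could otherwise end up without a tag.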

ACADEMIC CITATION

This tagger model was fine-tuned on an expanded version of the multi-domain dataset introduced in our published research. If you use this library, model, or dataset in academic or research projects, please cite the following paper:

[APA Style] Sharipov, M., Kuriyozov, E., & Vičič, J. (2026). UzbekPOS: A multi-domain dataset for Uzbek part-of-speech tagging. Data in Brief, 112640. https://doi.org/10.1016/j.dib.2026.112640

[BibTeX]

@article{sharipov2026uzbekpos,
  title={UzbekPOS: A multi-domain dataset for Uzbek part-of-speech tagging},
  author={Sharipov, Maksud and Kuriyozov, Elmurod and Vičič, Jernej},
  journal={Data in Brief},
  pages={112640},
  year={2026},
  publisher={Elsevier},
  doi={10.1016/j.dib.2026.112640}
}

LICENSE

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
