Text processing for Uyghur script

Maintainer: uyplayer
License: MIT License (OSI Approved)
Operating System: OS Independent

Text Processing for Uyghur Script

ugtext_processor is a Python library for processing Uyghur text. It provides tools for normalization, phonemization, and tokenization.

Features

- Normalizer: Cleans and normalizes Uyghur text by handling punctuation, abbreviations, currency, dates, and numbers.
- Phonemizer: Converts Uyghur text into IPA or ULY Latin script representations.
- Tokenizer: Supports several tokenization strategies: word, character, BPE, WordPiece, and SentencePiece.

Installation

```shell
pip install ugtext-processor
```

Usage

Normalizer

The normalizer module provides a simple interface to clean and normalize Uyghur text.

```python
from ugtext_processor.normalizer import normalize

text = "بۈگۈن 2024/07/26 سائەت 14:30، باھاسى ¥120.5، ئېغىرلىقى 2kg"
normalized_text = normalize(text)
print(normalized_text)
```

Phonemizer

The phonemizer module can convert Uyghur text to IPA or ULY Latin script.

```python
from ugtext_processor.phonemizer import UgPhonemizer

# To ULY Latin script
phonemizer_uly = UgPhonemizer(mod=UgPhonemizer.Mod.ULY)
text = "ياخشىمۇسىز؟"
uly_phonemes = phonemizer_uly.phonemizer(text)
print(f"ULY: {''.join(uly_phonemes)}")

# To IPA
phonemizer_ipa = UgPhonemizer(mod=UgPhonemizer.Mod.IPA)
ipa_phonemes = phonemizer_ipa.phonemizer(text)
print(f"IPA: {''.join(ipa_phonemes)}")
```

Tokenizer

The tokenizer module provides a factory to create different types of tokenizers.

```python
from ugtext_processor.tokenizer import TokenizerFactory, TokenizerType

# Word Tokenizer
word_tokenizer = TokenizerFactory.create_tokenizer(TokenizerType.WORD)
text = "بۇ بىر ئاددىي جۈملە."
tokens = word_tokenizer.tokenize(text)
print(f"Word Tokens: {tokens}")

# Character Tokenizer
char_tokenizer = TokenizerFactory.create_tokenizer(TokenizerType.CHARACTER)
tokens = char_tokenizer.tokenize(text)
print(f"Character Tokens: {tokens}")
```
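
The subword tokenizer types (BPE, WordPiece, SentencePiece) are not shown above, and this README does not document the arguments they take. As background only, here is a minimal sketch of one training step of byte-pair encoding in general: count adjacent symbol pairs, then merge the most frequent one. The function names and the toy Latin-script corpus are hypothetical and are not part of ugtext_processor.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, or None if there is none.

    `words` maps a tuple of symbols to how often that word occurs.
    """
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Toy corpus (hypothetical): word split into characters -> frequency.
corpus = {tuple("la"): 4, tuple("bala"): 2, tuple("lar"): 1}

best = most_frequent_pair(corpus)   # ("l", "a") occurs 4 + 2 + 1 = 7 times
corpus_after = merge_pair(corpus, best)
print(best, corpus_after)
```

Repeating these two steps a fixed number of times yields the merge table that a BPE tokenizer later applies to new text.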

Modules

`ugtext_processor.normalizer`

This module contains functions to normalize Uyghur text. The main function is `normalize`, which applies the following steps in order:

- `UyghurPunctuationNormalizer`: Normalizes and cleans punctuation.
- `UyghurAbbreviation`: Expands common abbreviations.
- `UyghurCurrency`: Converts currency symbols to text.
- `UyghurDateNormalizer`: Normalizes dates and times into spoken form.
- `UyghurNumberNormalizer`: Converts numbers into spoken form.
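
The order of the steps matters: a currency rule that looks for a symbol followed by digits has to run before the digits are spelled out. The sketch below illustrates this with two toy stand-in functions on English text. `expand_currency`, `spell_numbers`, and `run_pipeline` are hypothetical names for illustration and do not reproduce the library's classes or rules.

```python
import re
from functools import reduce


def expand_currency(text):
    """Toy currency step: "$5" -> "5 dollars"."""
    return re.sub(r"\$(\d+)", r"\1 dollars", text)


def spell_numbers(text):
    """Toy number step: spell out the few numbers it knows."""
    words = {"2": "two", "5": "five"}
    return re.sub(r"\d+", lambda m: words.get(m.group(), m.group()), text)


def run_pipeline(text, steps):
    """Apply each step to the output of the previous one."""
    return reduce(lambda acc, step: step(acc), steps, text)


in_order = run_pipeline("cost: $5", [expand_currency, spell_numbers])
swapped = run_pipeline("cost: $5", [spell_numbers, expand_currency])

print(in_order)  # cost: five dollars
print(swapped)   # cost: $five  (the currency rule no longer finds digits)
```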

`ugtext_processor.phonemizer`

This module provides the UgPhonemizer class for converting Uyghur text into phonetic representations.

- `UgPhonemizer(mod: Mod)`: The constructor takes a `mod` argument, which can be `UgPhonemizer.Mod.IPA` or `UgPhonemizer.Mod.ULY`.
- `phonemizer(text: str)`: The main method that performs the conversion.
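
To make the role of the mode argument concrete, here is a self-contained sketch of a mode-selected character mapping. It is not the library's implementation: `Mode`, `TOY_TABLE`, and `toy_phonemize` are hypothetical, the table covers only three letters, and a real phonemizer needs far more than a per-character lookup.

```python
from enum import Enum


class Mode(Enum):
    ULY = "uly"
    IPA = "ipa"


# Toy subset of Uyghur Arabic-script letters (assumption: illustrative only).
TOY_TABLE = {
    "\u0628": {Mode.ULY: "b", Mode.IPA: "b"},    # ب
    "\u0627": {Mode.ULY: "a", Mode.IPA: "ɑ"},    # ا
    "\u0634": {Mode.ULY: "sh", Mode.IPA: "ʃ"},   # ش
}


def toy_phonemize(text, mode):
    """Map each known character to its symbol for `mode`; pass others through."""
    return [TOY_TABLE[ch][mode] if ch in TOY_TABLE else ch for ch in text]


word = "\u0628\u0627\u0634"  # باش ("head")
print("".join(toy_phonemize(word, Mode.ULY)))  # bash
print("".join(toy_phonemize(word, Mode.IPA)))  # bɑʃ
```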

`ugtext_processor.tokenizer`

This module provides a TokenizerFactory for creating various tokenizers.

- `TokenizerFactory.create_tokenizer(tokenizer_type: TokenizerType, **kwargs)`: Creates a tokenizer instance.
- `TokenizerType`: An enum with the following values:
  - `WORD`
  - `CHARACTER`
  - `BPE`
  - `WORDPIECE`
  - `SENTENCEPIECE`
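
The enum-plus-factory design can be sketched in a few lines. The version below is an independent illustration, not the library's code: the `Toy*` names, the registry dictionary, and the regex-based word splitting are all assumptions, and the real tokenizers may split text differently.

```python
import re
from enum import Enum, auto


class ToyTokenizerType(Enum):
    WORD = auto()
    CHARACTER = auto()


class ToyWordTokenizer:
    def tokenize(self, text):
        # Runs of word characters, or single punctuation marks.
        return re.findall(r"\w+|[^\w\s]", text)


class ToyCharacterTokenizer:
    def tokenize(self, text):
        # One token per non-whitespace character.
        return [ch for ch in text if not ch.isspace()]


class ToyTokenizerFactory:
    # Maps each enum member to the class that implements it.
    _registry = {
        ToyTokenizerType.WORD: ToyWordTokenizer,
        ToyTokenizerType.CHARACTER: ToyCharacterTokenizer,
    }

    @staticmethod
    def create_tokenizer(tokenizer_type, **kwargs):
        try:
            cls = ToyTokenizerFactory._registry[tokenizer_type]
        except KeyError:
            raise ValueError(f"Unsupported tokenizer type: {tokenizer_type}") from None
        return cls(**kwargs)


word_tok = ToyTokenizerFactory.create_tokenizer(ToyTokenizerType.WORD)
char_tok = ToyTokenizerFactory.create_tokenizer(ToyTokenizerType.CHARACTER)
print(word_tok.tokenize("bu bir jümle."))
print(char_tok.tokenize("ab c"))
```

Callers depend only on the enum value and the shared `tokenize` method, so adding a new tokenizer type means adding one registry entry.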

Release history

0.1.8, released Aug 10, 2025

Download files

Source Distribution

ugtext_processor-0.1.8.tar.gz (15.7 kB), uploaded Aug 10, 2025, Source

Built Distribution

ugtext_processor-0.1.8-py3-none-any.whl (20.6 kB), uploaded Aug 10, 2025, Python 3

File details

Details for the file ugtext_processor-0.1.8.tar.gz.

File metadata

Download URL: ugtext_processor-0.1.8.tar.gz
Upload date: Aug 10, 2025
Size: 15.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for ugtext_processor-0.1.8.tar.gz
Algorithm	Hash digest
SHA256	`b05569a3819d933078aaedd20c409fd06658f873824c878b730664bc6913b1bf`
MD5	`2ed4c7375dad8f651c77e0da6d167227`
BLAKE2b-256	`616f0176b7d6f70068b964a160fc3d6724a34004529a0c246aa186665dd7266f`
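
A downloaded archive can be checked against the SHA256 digest listed above. The following is a minimal sketch using only the standard library; the local file location is an assumption (it presumes the archive was saved to the current directory), and `sha256_of_file` is a helper name introduced here for illustration.

```python
import hashlib
from pathlib import Path

# SHA256 digest published for ugtext_processor-0.1.8.tar.gz.
EXPECTED_SHA256 = "b05569a3819d933078aaedd20c409fd06658f873824c878b730664bc6913b1bf"


def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Assumed location of the downloaded file; adjust as needed.
archive = Path("ugtext_processor-0.1.8.tar.gz")
if archive.exists():
    ok = sha256_of_file(archive) == EXPECTED_SHA256
    print("digest matches" if ok else "digest MISMATCH")
```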

Provenance

The following attestation bundles were made for ugtext_processor-0.1.8.tar.gz:

Publisher: workflow.yml on uyplayer/ugtext_processor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ugtext_processor-0.1.8.tar.gz
- Subject digest: b05569a3819d933078aaedd20c409fd06658f873824c878b730664bc6913b1bf
- Sigstore transparency entry: 375335763
- Sigstore integration time: Aug 10, 2025
Source repository:
- Permalink: uyplayer/ugtext_processor@eba96e4202a07b8d20854a85c24e3f0568346d33
- Branch / Tag: refs/tags/v0.1.8
- Owner: https://github.com/uyplayer
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: workflow.yml@eba96e4202a07b8d20854a85c24e3f0568346d33
- Trigger Event: push

File details

Details for the file ugtext_processor-0.1.8-py3-none-any.whl.

File metadata

Download URL: ugtext_processor-0.1.8-py3-none-any.whl
Upload date: Aug 10, 2025
Size: 20.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for ugtext_processor-0.1.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ebb56904d1d7906d8ed2449e7fd64fd0ff32f75a4f09fa4369dcba5fffab5d7f`
MD5	`9534785a106b9299aded23e75c511cc5`
BLAKE2b-256	`3f7715fa269aa7b1800e23da7881bcfab0d14c3e0b6058f6151505638078ebc5`

Provenance

The following attestation bundles were made for ugtext_processor-0.1.8-py3-none-any.whl:

Publisher: workflow.yml on uyplayer/ugtext_processor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ugtext_processor-0.1.8-py3-none-any.whl
- Subject digest: ebb56904d1d7906d8ed2449e7fd64fd0ff32f75a4f09fa4369dcba5fffab5d7f
- Sigstore transparency entry: 375335768
- Sigstore integration time: Aug 10, 2025
Source repository:
- Permalink: uyplayer/ugtext_processor@eba96e4202a07b8d20854a85c24e3f0568346d33
- Branch / Tag: refs/tags/v0.1.8
- Owner: https://github.com/uyplayer
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: workflow.yml@eba96e4202a07b8d20854a85c24e3f0568346d33
- Trigger Event: push
