Text Processing for Uyghur Script

ugtext_processor is a Python library for processing Uyghur text. It provides tools for normalization, phonemization, and tokenization.

Features

  • Normalizer: Cleans and normalizes Uyghur text by handling punctuation, abbreviations, currency, dates, and numbers.
  • Phonemizer: Converts Uyghur text into IPA or ULY Latin script representations.
  • Tokenizer: Supports word, character, BPE, WordPiece, and SentencePiece tokenization.

Installation

pip install ugtext-processor
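
To confirm that the installation succeeded, the installed version can be read back with the standard library. This is a minimal sketch: installed_version is a hypothetical helper name, and the distribution name is taken from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str = "ugtext-processor"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the version if the package is installed, otherwise a short notice.
print(installed_version() or "ugtext-processor is not installed")
```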

Usage

Normalizer

The normalizer module provides a simple interface to clean and normalize Uyghur text.

from ugtext_processor.normalizer import normalize

# Sample input: "Today 2024/07/26 at 14:30, price ¥120.5, weight 2kg"
text = "بۈگۈن 2024/07/26 سائەت 14:30، باھاسى ¥120.5، ئېغىرلىقى 2kg"
normalized_text = normalize(text)
print(normalized_text)

Phonemizer

The phonemizer module can convert Uyghur text to IPA or ULY Latin script.

from ugtext_processor.phonemizer import UgPhonemizer

# Sample input: "How are you?"
text = "ياخشىمۇسىز؟"

# To ULY Latin script
phonemizer_uly = UgPhonemizer(mod=UgPhonemizer.Mod.ULY)
uly_phonemes = phonemizer_uly.phonemizer(text)
# The result is joined into a single string for display
print(f"ULY: {''.join(uly_phonemes)}")

# To IPA
phonemizer_ipa = UgPhonemizer(mod=UgPhonemizer.Mod.IPA)
ipa_phonemes = phonemizer_ipa.phonemizer(text)
print(f"IPA: {''.join(ipa_phonemes)}")

Tokenizer

The tokenizer module provides a factory to create different types of tokenizers.

from ugtext_processor.tokenizer import TokenizerFactory, TokenizerType

# Sample input: "This is a simple sentence."
text = "بۇ بىر ئاددىي جۈملە."

# Word Tokenizer
word_tokenizer = TokenizerFactory.create_tokenizer(TokenizerType.WORD)
tokens = word_tokenizer.tokenize(text)
print(f"Word Tokens: {tokens}")

# Character Tokenizer
char_tokenizer = TokenizerFactory.create_tokenizer(TokenizerType.CHARACTER)
tokens = char_tokenizer.tokenize(text)
print(f"Character Tokens: {tokens}")

Modules

ugtext_processor.normalizer

This module contains functions to normalize Uyghur text. The main function is normalize, which applies the following steps in order:

  1. UyghurPunctuationNormalizer: Normalizes and cleans punctuation.
  2. UyghurAbbreviation: Expands common abbreviations.
  3. UyghurCurrency: Converts currency symbols to text.
  4. UyghurDateNormalizer: Normalizes dates and times into spoken form.
  5. UyghurNumberNormalizer: Converts numbers into spoken form.
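
One plausible reason for running the date step before the number step is that a number step applied first would consume the digits of a date before the date step could see them whole. The sketch below illustrates an ordered pipeline with two toy stand-in steps. It does not use or reproduce the library's classes: tag_dates, tag_numbers, and run_pipeline are hypothetical names, and the steps emit placeholder tags rather than spoken Uyghur.

```python
import re
from typing import Callable, Iterable


def tag_dates(text: str) -> str:
    # Toy stand-in for a date step: replace YYYY/MM/DD with a placeholder.
    return re.sub(r"\d{4}/\d{2}/\d{2}", "<date>", text)


def tag_numbers(text: str) -> str:
    # Toy stand-in for a number step: replace integers and decimals.
    return re.sub(r"\d+(?:\.\d+)?", "<num>", text)


def run_pipeline(text: str, steps: Iterable[Callable[[str], str]]) -> str:
    # Apply each step to the output of the previous one, in the given order.
    for step in steps:
        text = step(text)
    return text


sample = "on 2024/07/26 pay 120.5"
in_order = run_pipeline(sample, [tag_dates, tag_numbers])
reversed_order = run_pipeline(sample, [tag_numbers, tag_dates])

print(in_order)        # on <date> pay <num>
print(reversed_order)  # on <num>/<num>/<num> pay <num>
```

With the steps reversed, the date is no longer recognizable by the time the date step runs, which is why a fixed order is part of the normalizer's contract.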

ugtext_processor.phonemizer

This module provides the UgPhonemizer class for converting Uyghur text into phonetic representations.

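  • UgPhonemizer(mod: Mod): The constructor takes a mod argument, which is either UgPhonemizer.Mod.IPA or UgPhonemizer.Mod.ULY.
  • phonemizer(text: str): The main method, which performs the conversion. In the Usage example its result is joined with ''.join, so it can be treated as a sequence of strings.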
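
To give a feel for what a script conversion involves, here is a toy character-by-character mapping from Uyghur Arabic script to ULY. It is not the library's implementation: the table TOY_ULY_MAP and the function toy_to_uly are hypothetical, cover only the letters of the greeting used in the Usage section, and ignore the context-dependent rules that a real phonemizer has to handle.

```python
# Hypothetical, deliberately incomplete table: Arabic-script letter -> ULY.
TOY_ULY_MAP = {
    "\u064A": "y",   # ي
    "\u0627": "a",   # ا
    "\u062E": "x",   # خ
    "\u0634": "sh",  # ش
    "\u0649": "i",   # ى
    "\u0645": "m",   # م
    "\u06C7": "u",   # ۇ
    "\u0633": "s",   # س
    "\u0632": "z",   # ز
    "\u061F": "?",   # ؟
}


def toy_to_uly(text: str) -> list[str]:
    """Map each character through the toy table; unknown characters pass through."""
    return [TOY_ULY_MAP.get(ch, ch) for ch in text]


# The greeting spelled with explicit code points to avoid look-alike letters.
greeting = "\u064A\u0627\u062E\u0634\u0649\u0645\u06C7\u0633\u0649\u0632\u061F"
print("".join(toy_to_uly(greeting)))  # yaxshimusiz?
```

Returning a list of units rather than a single string mirrors the Usage example, where the caller decides how to join the output.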

ugtext_processor.tokenizer

This module provides a TokenizerFactory for creating various tokenizers.

  • TokenizerFactory.create_tokenizer(tokenizer_type: TokenizerType, **kwargs): Creates a tokenizer of the requested type. As the Usage examples show, the returned object exposes a tokenize(text) method.
  • TokenizerType: An enum with the following values:
    • WORD
    • CHARACTER
    • BPE
    • WORDPIECE
    • SENTENCEPIECE
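
The factory-plus-enum design can be illustrated with a small standalone sketch. Everything here is hypothetical and independent of the library: SketchType, make_tokenizer, and the two toy tokenizers only mirror the shape of the API described above, and the regular expression is one simple way to split words from punctuation, not necessarily what the library does.

```python
import re
from enum import Enum, auto
from typing import Callable, List


class SketchType(Enum):
    WORD = auto()
    CHARACTER = auto()


def _word_tokens(text: str) -> List[str]:
    # Runs of word characters, or single non-space punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def _char_tokens(text: str) -> List[str]:
    # Every character except whitespace becomes its own token.
    return [ch for ch in text if not ch.isspace()]


# Registry mapping each enum member to the function that implements it.
_REGISTRY = {
    SketchType.WORD: _word_tokens,
    SketchType.CHARACTER: _char_tokens,
}


def make_tokenizer(kind: SketchType) -> Callable[[str], List[str]]:
    """Look up the tokenizer function registered for the requested type."""
    return _REGISTRY[kind]


# Approximate ULY rendering of the sentence from the Usage section.
sample = "bu bir addiy jümle."
print(make_tokenizer(SketchType.WORD)(sample))
print(make_tokenizer(SketchType.CHARACTER)("bu."))
```

Keeping the dispatch in a single registry means a new strategy only needs a new enum member and one new entry, which is the usual motivation for this kind of factory.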

Download files

Source Distribution: ugtext_processor-0.1.8.tar.gz (15.7 kB)

Built Distribution: ugtext_processor-0.1.8-py3-none-any.whl (20.6 kB)

Publisher: workflow.yml on uyplayer/ugtext_processor