Text Processing for Uyghur Script
ugtext_processor is a Python library for processing Uyghur text. It provides tools for normalization, phonemization, and tokenization.
Features
- Normalizer: Cleans and normalizes Uyghur text by handling punctuation, abbreviations, currency, dates, and numbers.
- Phonemizer: Converts Uyghur text into IPA or ULY Latin script representations.
- Tokenizer: Supports various tokenization strategies, including word, character, BPE, WordPiece, and SentencePiece.
Installation
pip install ugtext-processor
Usage
Normalizer
The normalizer module provides a simple interface to clean and normalize Uyghur text.
from ugtext_processor.normalizer import normalize
text = "بۈگۈن 2024/07/26 سائەت 14:30، باھاسى ¥120.5، ئېغىرلىقى 2kg"
normalized_text = normalize(text)
print(normalized_text)
Phonemizer
The phonemizer module can convert Uyghur text to IPA or ULY Latin script.
from ugtext_processor.phonemizer import UgPhonemizer
# To ULY Latin script
phonemizer_uly = UgPhonemizer(mod=UgPhonemizer.Mod.ULY)
text = "ياخشىمۇسىز؟"
uly_phonemes = phonemizer_uly.phonemizer(text)
print(f"ULY: {''.join(uly_phonemes)}")
# To IPA
phonemizer_ipa = UgPhonemizer(mod=UgPhonemizer.Mod.IPA)
ipa_phonemes = phonemizer_ipa.phonemizer(text)
print(f"IPA: {''.join(ipa_phonemes)}")
Tokenizer
The tokenizer module provides a factory to create different types of tokenizers.
from ugtext_processor.tokenizer import TokenizerFactory, TokenizerType
# Word Tokenizer
word_tokenizer = TokenizerFactory.create_tokenizer(TokenizerType.WORD)
text = "بۇ بىر ئاددىي جۈملە."
tokens = word_tokenizer.tokenize(text)
print(f"Word Tokens: {tokens}")
# Character Tokenizer
char_tokenizer = TokenizerFactory.create_tokenizer(TokenizerType.CHARACTER)
tokens = char_tokenizer.tokenize(text)
print(f"Character Tokens: {tokens}")
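To make the difference between the two strategies concrete, here is a minimal standard-library sketch of word-level and character-level splitting. It is an illustration only and does not reproduce the library's tokenizers; the function names `split_words` and `split_chars` are hypothetical, and the library's handling of punctuation and whitespace may differ.

```python
import re
from typing import List


def split_words(text: str) -> List[str]:
    """Split into runs of word characters, keeping each punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)


def split_chars(text: str) -> List[str]:
    """Split into single characters, dropping whitespace."""
    return [ch for ch in text if not ch.isspace()]


text = "بۇ بىر ئاددىي جۈملە."
print(f"Word tokens: {split_words(text)}")
print(f"Character tokens: {split_chars(text)}")
```

Python's `\w` matches Unicode letters by default, so the same pattern works for Arabic-script and Latin-script input.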
Modules
ugtext_processor.normalizer
This module contains functions to normalize Uyghur text. The main function is normalize, which applies the following steps in order:
- UyghurPunctuationNormalizer: Normalizes and cleans punctuation.
- UyghurAbbreviation: Expands common abbreviations.
- UyghurCurrency: Converts currency symbols to text.
- UyghurDateNormalizer: Normalizes dates and times into spoken form.
- UyghurNumberNormalizer: Converts numbers into spoken form.
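The order of these steps matters: the number step runs last, so the currency and date steps still see the raw digits they need to match. The sketch below illustrates that idea with a small ordered pipeline. It is not the library's implementation: every function name and the `<CUR:>`, `<DATE:>`, and `<NUM:>` placeholder tags are hypothetical, and an English sample is used so the effect is easy to read.

```python
import re
from typing import Callable, List

Step = Callable[[str], str]


def tidy_whitespace(text: str) -> str:
    """Collapse whitespace runs and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def tag_currency(text: str) -> str:
    """Wrap a yuan amount such as ¥120.5 in a placeholder tag."""
    return re.sub(r"¥(\d+(?:\.\d+)?)", r"<CUR:\1>", text)


def tag_dates(text: str) -> str:
    """Wrap YYYY/MM/DD dates in a placeholder tag."""
    return re.sub(r"(\d{4})/(\d{2})/(\d{2})", r"<DATE:\1-\2-\3>", text)


def tag_numbers(text: str) -> str:
    """Tag the remaining numbers, skipping text already inside a <...> tag."""
    parts = re.split(r"(<[^>]*>)", text)
    for i in range(0, len(parts), 2):  # even indices lie outside existing tags
        parts[i] = re.sub(r"\d+(?:\.\d+)?", lambda m: f"<NUM:{m.group()}>", parts[i])
    return "".join(parts)


def run_pipeline(text: str, steps: List[Step]) -> str:
    """Apply each step to the output of the previous one."""
    for step in steps:
        text = step(text)
    return text


PIPELINE: List[Step] = [tidy_whitespace, tag_currency, tag_dates, tag_numbers]

sample = "price  ¥120.5 on 2024/07/26, weight 2 kg"
in_order = run_pipeline(sample, PIPELINE)
numbers_first = run_pipeline(sample, [tag_numbers, tag_currency, tag_dates])
print(in_order)
print(numbers_first)
```

When the number step runs first, it consumes the digits inside the price and the date, so the later currency and date patterns no longer match anything.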
ugtext_processor.phonemizer
This module provides the UgPhonemizer class for converting Uyghur text into phonetic representations.
- UgPhonemizer(mod: Mod): The constructor takes a mod argument, which can be UgPhonemizer.Mod.IPA or UgPhonemizer.Mod.ULY.
- phonemizer(text: str): The main method that performs the conversion.
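One way to picture a mode-selected converter is a class with a nested enum that picks a lookup table at construction time. The sketch below assumes that shape for illustration; `SketchPhonemizer` is a hypothetical stand-in, its four-letter tables are far from complete, and the real UgPhonemizer may work quite differently (for example, by applying context-dependent rules).

```python
from enum import Enum
from typing import List


class SketchPhonemizer:
    """Hypothetical stand-in that maps letters through a per-mode table."""

    class Mod(Enum):
        IPA = "ipa"
        ULY = "uly"

    # Tiny illustrative tables covering four letters only.
    _TABLES = {
        Mod.ULY: {"ب": "b", "ا": "a", "ش": "sh", "ي": "y"},
        Mod.IPA: {"ب": "b", "ا": "ɑ", "ش": "ʃ", "ي": "j"},
    }

    def __init__(self, mod: "SketchPhonemizer.Mod") -> None:
        self._table = self._TABLES[mod]

    def phonemizer(self, text: str) -> List[str]:
        """Return one output symbol per input character; unknown characters pass through."""
        return [self._table.get(ch, ch) for ch in text]


word = "باش"
uly = "".join(SketchPhonemizer(SketchPhonemizer.Mod.ULY).phonemizer(word))
ipa = "".join(SketchPhonemizer(SketchPhonemizer.Mod.IPA).phonemizer(word))
print(f"ULY: {uly}")
print(f"IPA: {ipa}")
```

Keeping the mode as an enum rather than a free-form string means an unsupported mode fails at construction time instead of producing silently wrong output.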
ugtext_processor.tokenizer
This module provides a TokenizerFactory for creating various tokenizers.
- TokenizerFactory.create_tokenizer(tokenizer_type: TokenizerType, **kwargs): Creates a tokenizer instance.
- TokenizerType: An enum with the following values: WORD, CHARACTER, BPE, WORDPIECE, SENTENCEPIECE.
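A factory of this kind is often built as a registry that maps enum members to classes and forwards the keyword arguments to the chosen constructor. The following is a minimal sketch under that assumption, not the library's code: all names prefixed with `Sketch`, the `lowercase` option, and the error raised for unregistered types are hypothetical.

```python
from enum import Enum, auto
from typing import Any, Dict, List


class SketchTokenizerType(Enum):
    WORD = auto()
    CHARACTER = auto()
    BPE = auto()  # declared but deliberately left unregistered in this sketch


class SketchWordTokenizer:
    def __init__(self, lowercase: bool = False) -> None:
        self.lowercase = lowercase

    def tokenize(self, text: str) -> List[str]:
        """Split on whitespace, optionally lowercasing first."""
        if self.lowercase:
            text = text.lower()
        return text.split()


class SketchCharTokenizer:
    def tokenize(self, text: str) -> List[str]:
        """Return every non-whitespace character as a token."""
        return [ch for ch in text if not ch.isspace()]


class SketchTokenizerFactory:
    # Registry mapping each supported type to the class that implements it.
    _registry: Dict[SketchTokenizerType, type] = {
        SketchTokenizerType.WORD: SketchWordTokenizer,
        SketchTokenizerType.CHARACTER: SketchCharTokenizer,
    }

    @classmethod
    def create_tokenizer(cls, tokenizer_type: SketchTokenizerType, **kwargs: Any):
        """Look up the class for the requested type and pass kwargs to its constructor."""
        try:
            tokenizer_cls = cls._registry[tokenizer_type]
        except KeyError:
            raise ValueError(f"No tokenizer registered for {tokenizer_type.name}") from None
        return tokenizer_cls(**kwargs)


word_tok = SketchTokenizerFactory.create_tokenizer(SketchTokenizerType.WORD, lowercase=True)
print(word_tok.tokenize("Bu Bir Jumle"))
```

With a registry, supporting a new tokenizer type means adding one dictionary entry; the calling code stays the same.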