
TozaText is a cleaning library for preprocessing raw Uzbek and multilingual text data.


🧹 TozaText

TozaText is a lightweight, extensible text-preprocessing pipeline for cleaning noisy, transcribed, or user-generated text data.
It is designed around a modifier-based architecture: each cleaning rule is a DocumentModifier, and modifiers can be combined into a customizable Pipeline.


Features

  • Modular design – add or remove modifiers easily (e.g., repetition removal, transliteration)
  • Smart repetition cleaner – removes consecutive repeated words, even with punctuation or ellipses
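
To make the repetition-cleaning idea concrete, here is a minimal sketch of collapsing consecutive repeated words with a regular expression. It is not taken from TozaText's source: the pattern and the function name `collapse_repeated_words` are assumptions made for illustration, and the library's own rule may differ.

```python
import re

# Hypothetical pattern (not TozaText's own): a word, its optional trailing
# punctuation, then one or more repeats of the same word, each optionally
# followed by punctuation such as "." or "...".
_REPEATED_WORD = re.compile(
    r"\b(\w+)\b([^\w\s]*)(?:\s+\1\b[^\w\s]*)+",
    re.IGNORECASE,
)


def collapse_repeated_words(text: str) -> str:
    """Keep only the first word (and its punctuation) of each repeated run."""
    return _REPEATED_WORD.sub(r"\1\2", text)


print(collapse_repeated_words("bu. bu. bu. shu shu qila qila"))  # bu. shu qila
```

Because `\w` does not match apostrophes, this sketch does not collapse repeated words that contain one (for example `o'zbek o'zbek`); a real implementation would need a wider definition of a word.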

Available Modifiers

TozaText currently includes the following modifiers out of the box:

| Modifier | Description | Example Input | Example Output |
|---|---|---|---|
| WordRepetitionFilter | Removes consecutive repeated words, even when separated by punctuation or ellipses. | "bu. bu. bu. shu shu qila qila" | "bu. shu qila" |
| ParagraphRepetitionFilter | Removes the entire text if too many repeated paragraphs or characters are detected (useful for STT data with repeated intros). | "Salom!\n\nSalom!\n\nSalom!" | "" |
| TransliteratorModifier | Converts Uzbek text between the Cyrillic and Latin alphabets using UzTransliterator. | "Салом дунё" | "Salom dunyo" |
| UrlEmojiRemover | Removes or normalizes URLs, links, and emojis in text. | "Bu sayt: https://example.com 😎" | "Bu sayt" |
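
As an illustration of the paragraph-level filter, the sketch below drops a document when too large a share of its paragraphs are duplicates. The function name `drop_if_repetitive`, the `max_duplicate_fraction` parameter, and its default of 0.3 are assumptions chosen for this example, not values from TozaText, and the character-level check mentioned in the description is left out.

```python
def drop_if_repetitive(text: str, max_duplicate_fraction: float = 0.3) -> str:
    """Return "" when duplicated paragraphs exceed the allowed fraction.

    Hypothetical stand-in for a paragraph repetition filter; the threshold
    is an illustrative assumption.
    """
    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    if not paragraphs:
        return ""

    # Every paragraph beyond the first copy of itself counts as a duplicate.
    duplicates = len(paragraphs) - len(set(paragraphs))
    if duplicates / len(paragraphs) > max_duplicate_fraction:
        return ""
    return text


print(repr(drop_if_repetitive("Salom!\n\nSalom!\n\nSalom!")))  # ''
```

With three identical paragraphs, two of the three are duplicates, so the fraction (about 0.67) exceeds the threshold and the whole text is discarded.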

All modifiers inherit from:

class DocumentModifier:
    def modify_document(self, text: str, *args, **kwargs) -> str:
        ...
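
A custom rule can follow the same interface. The sketch below redefines a stand-in `DocumentModifier` so that it runs without the package installed; `WhitespaceNormalizer`, `LowercaseModifier`, and `apply_modifiers` are hypothetical names introduced for illustration and are not part of TozaText.

```python
import re
from typing import Iterable


class DocumentModifier:
    """Local stand-in mirroring the interface shown above."""

    def modify_document(self, text: str, *args, **kwargs) -> str:
        raise NotImplementedError


class WhitespaceNormalizer(DocumentModifier):
    """Hypothetical modifier: collapse whitespace runs into single spaces."""

    def modify_document(self, text: str, *args, **kwargs) -> str:
        return re.sub(r"\s+", " ", text).strip()


class LowercaseModifier(DocumentModifier):
    """Hypothetical modifier: lowercase the whole document."""

    def modify_document(self, text: str, *args, **kwargs) -> str:
        return text.lower()


def apply_modifiers(text: str, modifiers: Iterable[DocumentModifier]) -> str:
    """Run each modifier in order, feeding the output of one into the next."""
    for modifier in modifiers:
        text = modifier.modify_document(text)
    return text


result = apply_modifiers(
    "  Salom   DUNYO \n",
    [WhitespaceNormalizer(), LowercaseModifier()],
)
print(result)  # salom dunyo
```

Order matters in a chain like this: each modifier sees the text as the previous one left it, which is presumably also how a Pipeline applies its modifiers.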

Installation

git clone https://gitlab.adliya.uz/shohrux1sakov/tozatext.git
cd tozatext
pip install -e .

Code Example

from datasets import load_dataset
from TozaText import Pipeline, WordRepetitionFilter, ParagraphRepetitionFilter

# Load the raw transcripts from the Hugging Face Hub.
data = load_dataset("aktrmai/youtube_transcribe_data", split="train")

# Modifiers run in the order they are listed.
pipeline = Pipeline([
    WordRepetitionFilter(),
    ParagraphRepetitionFilter(),
])

# Clean the "text" column of the dataset.
cleaned = pipeline.process_hf_dataset(data, column="text")



