TozaText is a cleaning library for preprocessing raw Uzbek and multilingual text data.

🧹 TozaText

TozaText is a lightweight, extensible text-preprocessing pipeline for cleaning noisy, transcribed, or user-generated text.
It uses a modifier-based architecture: each cleaning rule is a DocumentModifier, and modifiers can be combined into a customizable Pipeline.
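
To make the modifier idea concrete, here is a minimal sketch of how rules could be chained. It is not TozaText's actual implementation: `SketchPipeline`, `StripModifier`, and `LowercaseModifier` are hypothetical names invented for illustration, and the base class is a local stand-in that only mirrors the `modify_document` interface described in this README.

```python
from typing import Iterable, List


class DocumentModifier:
    """Local stand-in for the base class; not imported from TozaText."""

    def modify_document(self, text: str, *args, **kwargs) -> str:
        raise NotImplementedError


class StripModifier(DocumentModifier):
    """Hypothetical rule: trim leading and trailing whitespace."""

    def modify_document(self, text: str, *args, **kwargs) -> str:
        return text.strip()


class LowercaseModifier(DocumentModifier):
    """Hypothetical rule: lowercase the whole document."""

    def modify_document(self, text: str, *args, **kwargs) -> str:
        return text.lower()


class SketchPipeline:
    """Illustrative pipeline: each modifier receives the previous one's output."""

    def __init__(self, modifiers: Iterable[DocumentModifier]) -> None:
        self.modifiers: List[DocumentModifier] = list(modifiers)

    def run(self, text: str) -> str:
        for modifier in self.modifiers:
            text = modifier.modify_document(text)
        return text


pipeline = SketchPipeline([StripModifier(), LowercaseModifier()])
result = pipeline.run("  Salom DUNYO  ")
print(result)
```

Because every rule exposes the same single method, adding or removing a step only changes the list passed to the pipeline.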


Features

  • Modular design – add or remove modifiers easily (e.g., repetition removal, transliteration)
  • Smart repetition cleaner – removes consecutive repeated words, even with punctuation or ellipses
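
One way to approximate the repetition cleaner is a single regular expression with a backreference. The sketch below is an assumption about how such a rule could work, not the code TozaText ships; `collapse_repeated_words` is a hypothetical helper, and the set of separator characters is chosen only for illustration.

```python
import re

# A word, followed one or more times by separators (whitespace, common
# punctuation, or an ellipsis character) and the same word again.
_REPEATED_WORD = re.compile(r"\b(\w+)(?:[\s.,!?…]+\1\b)+", flags=re.IGNORECASE)


def collapse_repeated_words(text: str) -> str:
    """Keep only the first occurrence of each run of repeated words."""
    return _REPEATED_WORD.sub(r"\1", text)


cleaned = collapse_repeated_words("bu. bu. bu. shu shu qila qila")
print(cleaned)
```

The trailing `\b` keeps the pattern from treating a prefix as a repeat, so a phrase such as "the then" is left unchanged.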

Available Modifiers

TozaText currently includes the following modifiers out of the box:

| Modifier | Description | Example Input | Example Output |
| --- | --- | --- | --- |
| WordRepetitionFilter | Removes consecutive repeated words, even when separated by punctuation or ellipses. | bu. bu. bu. shu shu qila qila | bu. shu qila |
| ParagraphRepetitionFilter | Removes entire paragraphs if too many repeated paragraphs or characters are detected (useful for STT data with repeated intros). | "Salom!\n\nSalom!\n\nSalom!" | "" |
| TransliteratorModifier | Converts Uzbek text between Cyrillic and Latin alphabets using UzTransliterator. | "Салом дунё" | "Salom dunyo" |
| UrlEmojiRemover | Removes or normalizes URLs and links, and strips emojis. | "Bu sayt: https://example.com 😎" | "Bu sayt" |
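
The paragraph-level filter can be pictured as a duplicate-ratio check. The sketch below is only an illustration under stated assumptions: `drop_if_repetitive` and its `max_duplicate_ratio` threshold are hypothetical, and the real ParagraphRepetitionFilter may use different criteria (the description above also mentions repeated characters, which this sketch ignores).

```python
import re


def drop_if_repetitive(text: str, max_duplicate_ratio: float = 0.3) -> str:
    """Return "" when too many paragraphs are duplicates, else the text unchanged."""
    # Paragraphs are assumed to be separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    if not paragraphs:
        return ""
    duplicates = len(paragraphs) - len(set(paragraphs))
    if duplicates / len(paragraphs) > max_duplicate_ratio:
        return ""
    return text


repetitive = drop_if_repetitive("Salom!\n\nSalom!\n\nSalom!")
varied = drop_if_repetitive("Salom!\n\nBugun havo yaxshi.")
print(repr(repetitive), repr(varied))
```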

All modifiers inherit from:

class DocumentModifier:
    def modify_document(self, text: str, *args, **kwargs) -> str:
        ...
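
A custom rule would then be a subclass that overrides `modify_document`. The example below is a sketch: `WhitespaceNormalizer` is a hypothetical modifier that is not part of TozaText, and the base class is redefined locally so the snippet runs without the library installed.

```python
import re


class DocumentModifier:
    """Local stand-in for the base class shown above."""

    def modify_document(self, text: str, *args, **kwargs) -> str:
        raise NotImplementedError


class WhitespaceNormalizer(DocumentModifier):
    """Hypothetical custom rule: collapse whitespace runs into single spaces."""

    def modify_document(self, text: str, *args, **kwargs) -> str:
        return re.sub(r"\s+", " ", text).strip()


normalized = WhitespaceNormalizer().modify_document("Salom   dunyo \n\n yaxshimisiz")
print(normalized)
```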

Installation

git clone https://gitlab.adliya.uz/shohrux1sakov/tozatext.git
cd tozatext
pip install -e .

Code Example

from datasets import load_dataset
from TozaText import Pipeline, WordRepetitionFilter, ParagraphRepetitionFilter

# Load the raw transcripts from the Hugging Face Hub.
data = load_dataset("aktrmai/youtube_transcribe_data", split="train")

# Combine the modifiers into a single pipeline.
pipeline = Pipeline([
    WordRepetitionFilter(),
    ParagraphRepetitionFilter(),
])

# Clean the "text" column of the dataset.
cleaned = pipeline.process_hf_dataset(data, column="text")

Download files

Source Distribution

tozatext-0.1.7.tar.gz (6.8 kB, Source)

Built Distribution

tozatext-0.1.7-py3-none-any.whl (10.0 kB, Python 3)

File details

Details for the file tozatext-0.1.7.tar.gz.

File metadata

  • Download URL: tozatext-0.1.7.tar.gz
  • Size: 6.8 kB
  • Tags: Source

File hashes

Hashes for tozatext-0.1.7.tar.gz
Algorithm Hash digest
SHA256 e8e4a24b6df7ba0bf641f1f5b5750b16f99fd8175c30150010588546adb39d10
MD5 bb713120c63dfc6ed70ef073f7ca4b98
BLAKE2b-256 5477a1ff79a958bf3ea8574911e897744fbd386795232da54d25908987a4c3ff

File details

Details for the file tozatext-0.1.7-py3-none-any.whl.

File metadata

  • Download URL: tozatext-0.1.7-py3-none-any.whl
  • Size: 10.0 kB
  • Tags: Python 3

File hashes

Hashes for tozatext-0.1.7-py3-none-any.whl
Algorithm Hash digest
SHA256 2b50759ad84d0a3fe865b76bbe1dcba1565def8d61ea57c03443fe01ae9bdb4b
MD5 758360cf341538c76bf7a4083a4fee26
BLAKE2b-256 f84c3847fb6183817e1c43cc75e72a287a905cf1225f7a90370000e826c4c25d
