TozaText is a cleaning library for preprocessing raw Uzbek and multilingual text data.

Project description

🧹 TozaText

TozaText is a lightweight and extensible text-preprocessing pipeline built for cleaning noisy, transcribed, or user-generated text data.
It is designed around a modifier-based architecture: each cleaning rule is a `DocumentModifier`, and modifiers are combined into a customizable `Pipeline`.
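To make the modifier idea concrete, here is a minimal sketch of how a chain of modifiers could be applied to a string. Only the `modify_document` method name comes from this README; `SketchPipeline`, `StripModifier`, `LowercaseModifier`, and the stand-in base class are hypothetical names for illustration and are not TozaText's own classes.

```python
from typing import Iterable, List


class DocumentModifier:
    """Stand-in base class so the sketch runs without TozaText installed."""

    def modify_document(self, text: str, *args, **kwargs) -> str:
        raise NotImplementedError


class StripModifier(DocumentModifier):
    """Hypothetical rule: trim leading and trailing whitespace."""

    def modify_document(self, text: str, *args, **kwargs) -> str:
        return text.strip()


class LowercaseModifier(DocumentModifier):
    """Hypothetical rule: lowercase the text."""

    def modify_document(self, text: str, *args, **kwargs) -> str:
        return text.lower()


class SketchPipeline:
    """Hypothetical pipeline: feeds the output of each modifier into the next."""

    def __init__(self, modifiers: Iterable[DocumentModifier]) -> None:
        self.modifiers: List[DocumentModifier] = list(modifiers)

    def run(self, text: str) -> str:
        for modifier in self.modifiers:
            text = modifier.modify_document(text)
        return text


pipeline = SketchPipeline([StripModifier(), LowercaseModifier()])
result = pipeline.run("  Salom DUNYO  ")
print(result)
```

Because every rule exposes the same method, adding or removing a step only changes the list passed to the pipeline.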

Features

Modular design – add or remove modifiers easily (e.g., repetition removal, transliteration)
Smart repetition cleaner – removes consecutive repeated words, even with punctuation or ellipses

Available Modifiers

TozaText currently includes the following modifiers out of the box:

Modifier	Description	Example Input	Example Output
`WordRepetitionFilter`	Removes consecutive repeated words, even when separated by punctuation or ellipses.	`bu. bu. bu. shu shu qila qila`	`bu. shu qila`
`ParagraphRepetitionFilter`	Removes entire paragraphs if too many repeated paragraphs or characters are detected (useful for STT data with repeated intros).	`"Salom!\n\nSalom!\n\nSalom!"`	`""`
`TransliteratorModifier`	Converts Uzbek text between Cyrillic and Latin alphabets using `UzTransliterator`.	`"Салом дунё"`	`"Salom dunyo"`
`UrlCleaner`	Removes or normalizes URLs and links in text.	`"Bu sayt: https://example.com"`	`"Bu sayt"`
`EmojiCleaner`	Strips emojis or maps them to descriptive tokens.	`"Zo‘r 😎"`	`"Zo‘r"`
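The behaviour in the `WordRepetitionFilter` row can be approximated with a single regular expression. The following is only a sketch of the idea, not TozaText's implementation; the function name `collapse_repeats` and the pattern are assumptions.

```python
import re

# A word, optional trailing punctuation, then one or more repeats of the
# same word (each with its own optional punctuation).
_REPEAT = re.compile(r"\b(\w+)([.,!?…]*)(?:\s+\1\b[.,!?…]*)+")


def collapse_repeats(text: str) -> str:
    """Keep the first occurrence of each consecutively repeated word."""
    return _REPEAT.sub(r"\1\2", text)


cleaned = collapse_repeats("bu. bu. bu. shu shu qila qila")
print(cleaned)  # bu. shu qila
```

This sketch compares words case-sensitively and treats only `\w` characters as part of a word, so real transcripts may need a broader definition of a word.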

All modifiers inherit from:

class DocumentModifier:
    def modify_document(self, text: str, *args, **kwargs) -> str:
        ...
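As an illustration of that interface, a custom modifier might look like the sketch below. `WhitespaceNormalizer` is a hypothetical example, not a class shipped with TozaText, and the base class is redefined here as a stand-in so the snippet runs on its own.

```python
import re


class DocumentModifier:
    # Stand-in mirroring the interface above; replace with the real import
    # when TozaText is installed.
    def modify_document(self, text: str, *args, **kwargs) -> str:
        raise NotImplementedError


class WhitespaceNormalizer(DocumentModifier):
    # Hypothetical modifier: collapse runs of spaces and tabs to one space.
    _runs = re.compile(r"[ \t]+")

    def modify_document(self, text: str, *args, **kwargs) -> str:
        return self._runs.sub(" ", text).strip()


normalized = WhitespaceNormalizer().modify_document("  Salom \t  dunyo  ")
print(normalized)
```

Accepting `*args` and `**kwargs` keeps the signature compatible with the base method even when the modifier ignores extra arguments.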

Installation

git clone https://gitlab.adliya.uz/shohrux1sakov/tozatext.git
cd tozatext
pip install -e .

Code Example

from datasets import load_dataset
from TozaText import Pipeline, WordRepetitionFilter, ParagraphRepetitionFilter

# Load the training split of a Hugging Face dataset.
data = load_dataset("aktrmai/youtube_transcribe_data", split="train")

# Build a pipeline from two of the built-in modifiers.
pipeline = Pipeline([
    WordRepetitionFilter(),
    ParagraphRepetitionFilter(),
])

# Clean the "text" column of the dataset.
cleaned = pipeline.process_hf_dataset(data, column="text")

Project details

Release history

Version	Release date
0.1.7	Nov 14, 2025
0.1.6	Nov 13, 2025
0.1.5	Nov 12, 2025
0.1.4	Nov 12, 2025
0.1.3	Nov 12, 2025
0.1.1	Nov 11, 2025
0.1.0 (this version)	Nov 11, 2025

Download files

Source Distribution

tozatext-0.1.0.tar.gz (123.1 kB)

Uploaded Nov 11, 2025 Source

Built Distribution

tozatext-0.1.0-py3-none-any.whl (8.4 kB)

Uploaded Nov 11, 2025 Python 3

File details

Details for the file tozatext-0.1.0.tar.gz.

File metadata

Download URL: tozatext-0.1.0.tar.gz
Upload date: Nov 11, 2025
Size: 123.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.8

File hashes

Hashes for tozatext-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`54bbc3dff53c6f68e5d229c174dc26432caf614aa35748d68f8349d034336ba1`
MD5	`0447c4a446169b185ddad8a02f5629cf`
BLAKE2b-256	`9400a23ab382578dc00c6bb3e92b550cb73390a1ab057a9e1a0c13415973e1f4`
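A downloaded archive can be checked against the SHA256 digest listed above using Python's standard `hashlib` module. This is a minimal sketch; the helper name `sha256_of_file` and the local file path are assumptions about where the file was saved.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large archives are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# SHA256 digest from the table above.
EXPECTED = "54bbc3dff53c6f68e5d229c174dc26432caf614aa35748d68f8349d034336ba1"
archive = Path("tozatext-0.1.0.tar.gz")  # assumed download location

if archive.exists():
    actual = sha256_of_file(archive)
    print("match" if actual == EXPECTED else "MISMATCH")
else:
    print("archive not found")
```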

File details

Details for the file tozatext-0.1.0-py3-none-any.whl.

File metadata

Download URL: tozatext-0.1.0-py3-none-any.whl
Upload date: Nov 11, 2025
Size: 8.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.8

File hashes

Hashes for tozatext-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`68a16bc44a3e8c8e590d2bb33f52aad66035c0aa1fc4159d808bd6b6677d6a06`
MD5	`c7ab380728236f929a7076412808b7f7`
BLAKE2b-256	`8be70144e3c864e821355466ba3fd142482ae36197b9351f5534c12966f8fd6d`
