TozaText is a cleaning library for preprocessing raw Uzbek and multilingual text data.
🧹 TozaText
TozaText is a lightweight and extensible text-preprocessing pipeline built for cleaning noisy, transcribed, or user-generated text data.
It is designed around a modifier-based architecture: each cleaning rule is a DocumentModifier, and modifiers can be combined into a customizable Pipeline.
Features
- Modular design – add or remove modifiers easily (e.g., repetition removal, transliteration)
- Smart repetition cleaner – removes consecutive repeated words, even with punctuation or ellipses
Available Modifiers
TozaText currently includes the following modifiers out of the box:
| Modifier | Description | Example Input | Example Output |
|---|---|---|---|
| WordRepetitionFilter | Removes consecutive repeated words, even when separated by punctuation or ellipses. | bu. bu. bu. shu shu qila qila | bu. shu qila |
| ParagraphRepetitionFilter | Removes entire paragraphs if too many repeated paragraphs or characters are detected (useful for STT data with repeated intros). | "Salom!\n\nSalom!\n\nSalom!" | "" |
| TransliteratorModifier | Converts Uzbek text between Cyrillic and Latin alphabets using UzTransliterator. | "Салом дунё" | "Salom dunyo" |
| UrlEmojiRemover | Removes or normalizes URLs and links in text, and removes emojis. | "Bu sayt: https://example.com 😎" | "Bu sayt" |
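To give a feel for the kind of rule the WordRepetitionFilter row describes, the sketch below collapses consecutive repeated words with a regular expression. It is not TozaText's implementation: the function name `collapse_repeated_words`, the pattern, and the punctuation set are assumptions made for this illustration only.

```python
import re

# Hypothetical pattern (not taken from TozaText): a word, optional trailing
# punctuation, then one or more whitespace-separated repeats of the same word.
_REPEAT = re.compile(r"\b(\w+)([.,!?…]*)(?:\s+\1\b[.,!?…]*)+", re.IGNORECASE)


def collapse_repeated_words(text: str) -> str:
    """Keep only the first occurrence in each run of repeated words."""
    return _REPEAT.sub(r"\1\2", text)


print(collapse_repeated_words("bu. bu. bu. shu shu qila qila"))
# prints: bu. shu qila
```

The `\b` after the backreference keeps words that merely share a prefix (for example "shu shun") from being treated as repeats.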
All modifiers inherit from:
class DocumentModifier:
    def modify_document(self, text: str, *args, **kwargs) -> str:
        ...
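Given that interface, a custom cleaning rule would presumably be written as a subclass that overrides `modify_document`. The sketch below defines a local stand-in for the base class so it runs without TozaText installed; `WhitespaceNormalizer` is a hypothetical modifier invented for this example and is not part of the library.

```python
import re


class DocumentModifier:
    """Local stand-in for TozaText's base class, so the sketch is self-contained."""

    def modify_document(self, text: str, *args, **kwargs) -> str:
        raise NotImplementedError


class WhitespaceNormalizer(DocumentModifier):
    """Hypothetical modifier: collapse runs of whitespace into single spaces."""

    def modify_document(self, text: str, *args, **kwargs) -> str:
        # Replace any whitespace run (spaces, tabs, newlines) with one space,
        # then trim leading and trailing whitespace.
        return re.sub(r"\s+", " ", text).strip()


print(WhitespaceNormalizer().modify_document("  Salom   dunyo \n"))
# prints: Salom dunyo
```

With the real library, the stand-in class would be replaced by importing the base class from the package, assuming it is exported there.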
Installation
git clone https://gitlab.adliya.uz/shohrux1sakov/tozatext.git
cd tozatext
pip install -e .
Code Example
from datasets import load_dataset
from TozaText import Pipeline, WordRepetitionFilter, ParagraphRepetitionFilter

data = load_dataset("aktrmai/youtube_transcribe_data", split="train")

pipeline = Pipeline([
    WordRepetitionFilter(),
    ParagraphRepetitionFilter(),
])

cleaned = pipeline.process_hf_dataset(data, column="text")
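Conceptually, a pipeline of this kind applies each modifier in order to every document. The sketch below shows that idea on a plain list of strings. It does not use or reproduce TozaText's Pipeline class: `run_modifiers`, `StripEdges`, and `Lowercase` are hypothetical names, and only the `modify_document` method signature comes from the description above.

```python
from typing import Iterable, List, Protocol


class Modifier(Protocol):
    """Anything exposing modify_document(text) -> str."""

    def modify_document(self, text: str, *args, **kwargs) -> str: ...


def run_modifiers(modifiers: Iterable[Modifier], texts: Iterable[str]) -> List[str]:
    """Apply each modifier, in order, to every text (hypothetical helper)."""
    ordered = list(modifiers)
    cleaned = []
    for text in texts:
        for modifier in ordered:
            text = modifier.modify_document(text)
        cleaned.append(text)
    return cleaned


class StripEdges:
    """Hypothetical modifier: trim leading and trailing whitespace."""

    def modify_document(self, text: str, *args, **kwargs) -> str:
        return text.strip()


class Lowercase:
    """Hypothetical modifier: lowercase the text."""

    def modify_document(self, text: str, *args, **kwargs) -> str:
        return text.lower()


print(run_modifiers([StripEdges(), Lowercase()], ["  Salom DUNYO  "]))
# prints: ['salom dunyo']
```

Because each modifier receives the previous modifier's output, the order of the list can affect the result.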