# omg-data

Automatic HuggingFace dataset pipeline for OMGFormer — download, clean, tokenize.

One call handles everything: finding the right datasets for your language and task, downloading them, cleaning the text, and producing a ready-to-train `OMGDataset`.
## Installation

```bash
pip install omg-data

# with language detection support
pip install omg-data[langdetect]
```
## Quick Start

```python
from omg_data import DataPipeline

# Turkish, 5 GB, GPT-2 tokenizer
pipe = DataPipeline(
    language="tr",
    size_gb=5,
    tokenizer="gpt2",
)

dataset = pipe.build()
trainer.fit(dataset)  # directly compatible with the OMGFormer Trainer
```
## Examples

### Task-specific pipeline

```python
pipe = DataPipeline(
    language="en",
    task="chat",  # "text" | "chat" | "instruction" | "qa" | "code" | ...
    size_gb=10,
    tokenizer="meta-llama/Llama-2-7b-hf",
    seq_len=2048,
)
dataset = pipe.build()
```
### Custom datasets

```python
pipe = DataPipeline(
    language="en",
    tokenizer="gpt2",
    custom_datasets=["wikitext", "openwebtext"],
)
dataset = pipe.build()
```
### Cleaning disabled

```python
pipe = DataPipeline(language="en", tokenizer="gpt2", clean=False)
dataset = pipe.build()
```
### Fine-grained cleaning control

```python
pipe = DataPipeline(
    language="tr",
    tokenizer="gpt2",
    clean=True,
    clean_options={
        "dedup": True,
        "min_chars": 50,
        "remove_urls": True,
        "remove_html": True,
        "lang_filter": True,
    },
)
```
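The options above correspond to straightforward text filters. As a rough illustration of what such a cleaning pass does (the function below is a hypothetical sketch, not omg-data's internal API; `lang_filter` is omitted since it would need a language-detection dependency such as the `langdetect` extra):

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")
HTML_RE = re.compile(r"<[^>]+>")


def clean_texts(texts, dedup=True, min_chars=50, remove_urls=True, remove_html=True):
    """Illustrative cleaning pass: strip markup and URLs, normalize Unicode,
    drop too-short documents, and deduplicate exact matches."""
    seen = set()
    out = []
    for text in texts:
        text = unicodedata.normalize("NFC", text)  # fix Unicode form
        if remove_html:
            text = HTML_RE.sub(" ", text)
        if remove_urls:
            text = URL_RE.sub("", text)
        text = " ".join(text.split())  # collapse whitespace
        if len(text) < min_chars:
            continue  # filter by length
        if dedup:
            if text in seen:
                continue
            seen.add(text)
        out.append(text)
    return out
```

Real pipelines often use fuzzy (MinHash-style) deduplication rather than exact matching, but the order of operations, normalize, strip, filter, dedup, is the typical shape.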
### Raw text (no tokenizer)

```python
pipe = DataPipeline(language="de", size_gb=2)
hf_dataset = pipe.build()  # returns a HuggingFace Dataset
```
## Supported Languages

`tr` `en` `de` `fr` `es` `ar` `ru` `ja` `zh` `ko` `pt` `it` `nl` `pl` `sv`
## Supported Tasks

`text` · `lm` · `chat` · `conversation` · `instruction` · `instruct` · `qa` · `summarization` · `classification` · `code`
## Pipeline Steps

- **Search** — finds suitable HuggingFace datasets for your language and task
- **Download** — streams and caches datasets via HuggingFace `datasets`
- **Clean** — removes HTML, URLs, and duplicates, fixes Unicode, filters by length
- **Tokenize** — chunks text into fixed-length token windows
- **Return** — an `OMGDataset` (a PyTorch `Dataset`) ready for `trainer.fit()`
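The Tokenize step amounts to concatenating token ids from all documents and slicing the stream into fixed windows. A minimal sketch of that chunking (the helper name here is illustrative, not part of omg-data's API):

```python
def chunk_tokens(token_ids, seq_len):
    """Slice a flat stream of token ids into fixed-length windows,
    dropping the incomplete remainder at the end."""
    n_full = len(token_ids) // seq_len
    return [token_ids[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]
```

Dropping the remainder keeps every training sequence exactly `seq_len` tokens long, which is what fixed-window language-model training expects.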
## OMGDataset API

```python
import torch
from omg_data import OMGDataset

# compatible with any PyTorch DataLoader
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

# info
print(dataset.info())
# {'num_sequences': 125000, 'seq_len': 512, 'total_tokens': 64000000, 'approx_size_gb': 0.128}
```
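omg-data's internals aren't shown here, but a PyTorch map-style dataset only needs `__len__` and `__getitem__`, so a hypothetical minimal stand-in that reports stats shaped like the `info()` dict above could look like this (the 2-bytes-per-token size estimate is an assumption, though it matches the example numbers: 64 M tokens × 2 B ≈ 0.128 GB):

```python
class TokenWindowDataset:
    """Hypothetical minimal OMGDataset-style wrapper over fixed-length token windows.
    Works with torch.utils.data.DataLoader, which only requires __len__/__getitem__."""

    def __init__(self, sequences, seq_len):
        self.sequences = sequences  # list of equal-length lists of token ids
        self.seq_len = seq_len

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        return self.sequences[idx]

    def info(self):
        total = len(self.sequences) * self.seq_len
        return {
            "num_sequences": len(self.sequences),
            "seq_len": self.seq_len,
            "total_tokens": total,
            # assumes 2 bytes per token id (fits vocabularies up to 65,536)
            "approx_size_gb": round(total * 2 / 1e9, 3),
        }
```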
## License

Apache-2.0
## File details

`omg_data-1.0.1-py3-none-any.whl`: 16.5 kB, Python 3, uploaded via twine/6.2.0 (CPython/3.12.7).

| Algorithm | Hash digest |
|---|---|
| SHA256 | `f0fcf9b249d5c4c5549aee767010797864b2692b8342e79dd09c36ea1be8866d` |
| MD5 | `d74283f59c3ec40ba8db0f920bf33412` |
| BLAKE2b-256 | `c1cfd12b206316cdddb773ab0f6d61d54c51859005454cdbf1178f7e714aaba2` |