
omg-data

Automatic HuggingFace dataset pipeline for OMGFormer — download, clean, tokenize.

One call handles everything: finding the right datasets for your language & task, downloading them, cleaning the text, and producing a ready-to-train OMGDataset.


Installation

pip install omg-data
# with language detection support
pip install omg-data[langdetect]

Quick Start

from omg_data import DataPipeline

# Turkish, 5 GB, GPT-2 tokenizer
pipe = DataPipeline(
    language="tr",
    size_gb=5,
    tokenizer="gpt2",
)
dataset = pipe.build()

trainer.fit(dataset)  # directly compatible with the OMGFormer Trainer

Examples

Task-specific pipeline

pipe = DataPipeline(
    language="en",
    task="chat",               # "text" | "chat" | "instruction" | "qa" | "code" | ...
    size_gb=10,
    tokenizer="meta-llama/Llama-2-7b-hf",
    seq_len=2048,
)
dataset = pipe.build()

Custom datasets

pipe = DataPipeline(
    language="en",
    tokenizer="gpt2",
    custom_datasets=["wikitext", "openwebtext"],
)
dataset = pipe.build()

Cleaning disabled

pipe = DataPipeline(language="en", tokenizer="gpt2", clean=False)
dataset = pipe.build()

Fine-grained cleaning control

pipe = DataPipeline(
    language="tr",
    tokenizer="gpt2",
    clean=True,
    clean_options={
        "dedup": True,
        "min_chars": 50,
        "remove_urls": True,
        "remove_html": True,
        "lang_filter": True,
    },
)
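The filters behind these options can be sketched with the standard library alone. The helper below is a hypothetical illustration of what `clean_options` controls — it is not omg-data's actual implementation, and it omits `lang_filter` (which would need the optional langdetect extra). The option names mirror the keys above:

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")

def clean_corpus(texts, dedup=True, min_chars=50, remove_urls=True, remove_html=True):
    """Hypothetical sketch of the clean_options filters: strip markup,
    drop short documents, and de-duplicate exact matches."""
    seen = set()
    out = []
    for text in texts:
        if remove_html:
            text = TAG_RE.sub(" ", text)   # drop HTML tags
        if remove_urls:
            text = URL_RE.sub(" ", text)   # drop URLs
        text = " ".join(text.split())      # normalize whitespace
        if len(text) < min_chars:
            continue                       # min_chars length filter
        if dedup:
            if text in seen:
                continue                   # exact-duplicate filter
            seen.add(text)
        out.append(text)
    return out
```

Real pipelines typically use fuzzy (e.g. hash-based) deduplication rather than exact string matching; the exact-match version above just keeps the sketch self-contained.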

Raw text (no tokenizer)

pipe = DataPipeline(language="de", size_gb=2)
hf_dataset = pipe.build()   # returns HuggingFace Dataset

Supported Languages

tr en de fr es ar ru ja zh ko pt it nl pl sv

Supported Tasks

text · lm · chat · conversation · instruction · instruct · qa · summarization · classification · code


Pipeline Steps

  1. Search — Finds suitable HuggingFace datasets for your language & task
  2. Download — Streams & caches datasets via HuggingFace datasets
  3. Clean — Removes HTML, URLs, duplicates, fixes Unicode, filters by length
  4. Tokenize — Chunks text into fixed-length token windows
  5. Return — OMGDataset (PyTorch Dataset) ready for trainer.fit()
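Step 4's windowing can be sketched in a few lines — a hypothetical illustration of chunking a flat tokenized stream into `seq_len`-sized training windows, not the package's actual code:

```python
def chunk_token_ids(token_ids, seq_len):
    """Split a flat token-id stream into non-overlapping fixed-length
    windows; a trailing remainder shorter than seq_len is dropped."""
    n_windows = len(token_ids) // seq_len
    return [token_ids[i * seq_len:(i + 1) * seq_len] for i in range(n_windows)]
```

With `seq_len=4`, ten tokens yield two full windows and the last two tokens are discarded.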

OMGDataset API

import torch

from omg_data import OMGDataset

# Compatible with any PyTorch DataLoader
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

# Info
print(dataset.info())
# {'num_sequences': 125000, 'seq_len': 512, 'total_tokens': 64000000, 'approx_size_gb': 0.128}
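The `info()` figures above are internally consistent under one assumption — that tokens are stored as 2-byte integers (hypothetical; the storage dtype isn't documented here): `total_tokens = num_sequences × seq_len`, and `approx_size_gb ≈ total_tokens × 2 bytes / 1e9`. A quick check:

```python
num_sequences, seq_len = 125_000, 512
total_tokens = num_sequences * seq_len       # 125,000 × 512 = 64,000,000
approx_size_gb = total_tokens * 2 / 1e9      # 2 bytes/token (assumed dtype)
```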

License

Apache-2.0


Source Distributions

No source distribution files are available for this release.

Built Distribution


omg_data-1.0.1-py3-none-any.whl (16.5 kB, Python 3)

File details

Details for the file omg_data-1.0.1-py3-none-any.whl.

File metadata

  • File: omg_data-1.0.1-py3-none-any.whl
  • Size: 16.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.7

File hashes

Hashes for omg_data-1.0.1-py3-none-any.whl:

  • SHA256: f0fcf9b249d5c4c5549aee767010797864b2692b8342e79dd09c36ea1be8866d
  • MD5: d74283f59c3ec40ba8db0f920bf33412
  • BLAKE2b-256: c1cfd12b206316cdddb773ab0f6d61d54c51859005454cdbf1178f7e714aaba2

