
Automatic HuggingFace dataset download, cleaning and tokenization pipeline for OMGFormer

Project description

omg-data

Automatic HuggingFace dataset pipeline for OMGFormer — download, clean, tokenize.

One call handles everything: finding the right datasets for your language & task, downloading them, cleaning the text, and producing a ready-to-train OMGDataset.


Installation

pip install omg-data
# with language detection support
pip install omg-data[langdetect]

Quick Start

from omg_data import DataPipeline

# Turkish, 5 GB, GPT-2 tokenizer
pipe = DataPipeline(
    language="tr",
    size_gb=5,
    tokenizer="gpt2",
)
dataset = pipe.build()

trainer.fit(dataset)  # directly compatible with the OMGFormer Trainer

Examples

Task-specific pipeline

pipe = DataPipeline(
    language="en",
    task="chat",               # "text" | "chat" | "instruction" | "qa" | "code" | ...
    size_gb=10,
    tokenizer="meta-llama/Llama-2-7b-hf",
    seq_len=2048,
)
dataset = pipe.build()

Custom datasets

pipe = DataPipeline(
    language="en",
    tokenizer="gpt2",
    custom_datasets=["wikitext", "openwebtext"],
)
dataset = pipe.build()

Cleaning disabled

pipe = DataPipeline(language="en", tokenizer="gpt2", clean=False)
dataset = pipe.build()

Fine-grained cleaning control

pipe = DataPipeline(
    language="tr",
    tokenizer="gpt2",
    clean=True,
    clean_options={
        "dedup": True,
        "min_chars": 50,
        "remove_urls": True,
        "remove_html": True,
        "lang_filter": True,
    },
)
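The `clean_options` above map onto a fairly standard text-cleaning pass. As a rough illustration only (this is a hypothetical sketch, not the library's actual implementation; function name and regexes are assumptions), the dedup / min-length / URL / HTML filters could look like:

```python
import re

def clean_texts(texts, min_chars=50, dedup=True, remove_urls=True, remove_html=True):
    """Hypothetical sketch of a cleaning pass like the one configured above."""
    url_re = re.compile(r"https?://\S+")
    html_re = re.compile(r"<[^>]+>")
    seen = set()
    out = []
    for t in texts:
        if remove_html:
            t = html_re.sub(" ", t)   # strip HTML tags
        if remove_urls:
            t = url_re.sub(" ", t)    # strip bare URLs
        t = " ".join(t.split())       # normalize whitespace
        if len(t) < min_chars:        # drop short fragments
            continue
        if dedup:                     # exact-match dedup (case-insensitive)
            key = t.lower()
            if key in seen:
                continue
            seen.add(key)
        out.append(t)
    return out
```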

Raw text (no tokenizer)

pipe = DataPipeline(language="de", size_gb=2)
hf_dataset = pipe.build()   # returns HuggingFace Dataset

Supported Languages

tr en de fr es ar ru ja zh ko pt it nl pl sv

Supported Tasks

text · lm · chat · conversation · instruction · instruct · qa · summarization · classification · code


Pipeline Steps

  1. Search — Finds suitable HuggingFace datasets for your language & task
  2. Download — Streams & caches datasets via HuggingFace datasets
  3. Clean — Removes HTML, URLs, duplicates, fixes Unicode, filters by length
  4. Tokenize — Chunks text into fixed-length token windows
  5. Return — an OMGDataset (PyTorch Dataset) ready for trainer.fit()
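Step 4 (tokenize) amounts to packing the tokenized corpus into fixed-length windows. A minimal sketch of that chunking, assuming the simple drop-the-remainder policy (the library's exact policy may differ):

```python
def chunk_tokens(token_ids, seq_len):
    """Split a flat token-id stream into fixed-length windows,
    dropping the trailing remainder shorter than seq_len."""
    n = (len(token_ids) // seq_len) * seq_len
    return [token_ids[i:i + seq_len] for i in range(0, n, seq_len)]
```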

OMGDataset API

import torch
from omg_data import OMGDataset

# Compatible with any PyTorch DataLoader (dataset comes from pipe.build())
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

# Info
print(dataset.info())
# {'num_sequences': 125000, 'seq_len': 512, 'total_tokens': 64000000, 'approx_size_gb': 0.128}
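The sample `info()` output is internally consistent if tokens are stored in 2 bytes each (125,000 × 512 = 64,000,000 tokens; × 2 bytes ≈ 0.128 GB). A sketch of that arithmetic, with the 2-bytes-per-token storage being an assumption inferred from the numbers above:

```python
def dataset_info(num_sequences, seq_len, bytes_per_token=2):
    """Reproduce the shape of the info() dict shown above,
    assuming 2-byte (e.g. uint16) token storage."""
    total_tokens = num_sequences * seq_len
    return {
        "num_sequences": num_sequences,
        "seq_len": seq_len,
        "total_tokens": total_tokens,
        "approx_size_gb": total_tokens * bytes_per_token / 1e9,
    }
```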

License

Apache-2.0

Project details


Download files

Download the file for your platform.

Source Distribution

omg_data-1.0.0.tar.gz (14.5 kB)

Uploaded Source

Built Distribution


omg_data-1.0.0-py3-none-any.whl (15.6 kB)

Uploaded Python 3

File details

Details for the file omg_data-1.0.0.tar.gz.

File metadata

  • Download URL: omg_data-1.0.0.tar.gz
  • Size: 14.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.7

File hashes

Hashes for omg_data-1.0.0.tar.gz

  • SHA256: efef96358b1e49b1e7c13fab850acbfb560289061ef64c72a01e1fcb6a4b5253
  • MD5: f48c15cd24f7399588a734c7986bd012
  • BLAKE2b-256: f54e38d0a11fe18ef019108c1f00717fb2a8c6795e20f14028c4a17971f87690


File details

Details for the file omg_data-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: omg_data-1.0.0-py3-none-any.whl
  • Size: 15.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.7

File hashes

Hashes for omg_data-1.0.0-py3-none-any.whl

  • SHA256: f6fc917c89b0009a86ed3d1f91b7b2658428d1165b4fdbb58190987b5bd8832f
  • MD5: e99b08ceaa0a60264f1aeff9bca1ecc0
  • BLAKE2b-256: 33778fca41744022beb5b8576cd23a7ffd405a4614075326def0075242ead428

