
DataBridge

A comprehensive dataset conversion toolkit for transforming between different dataset formats commonly used in machine learning and NLP tasks.

🚀 Features

Supported Formats

  • JSONL - JSON Lines format for text data
  • Megatron bin/idx - Binary format used by Megatron-LM
  • WebDataset - Tar-based dataset format for large-scale training
  • Energon Dataset - Megatron-Energon compatible format (fully compatible with the VeOmni training framework)
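
For concreteness, a JSONL file stores one self-contained JSON object per line. A minimal illustration using only the Python standard library (the field names here are hypothetical, not a schema DataBridge requires):

```python
import json

# A JSONL file is a sequence of lines, each a complete JSON object.
records = [
    {"id": 0, "text": "first document"},
    {"id": 1, "text": "second document"},
]

# Serialize: one json.dumps() call per line, lines joined by newlines.
jsonl_text = "\n".join(json.dumps(r) for r in records)

# Parse: split on newlines and json.loads() each line independently.
parsed = [json.loads(line) for line in jsonl_text.splitlines()]
assert parsed == records
```

Because each line parses independently, JSONL files can be streamed and sharded without loading the whole dataset into memory.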

Key Capabilities

  • Universal Conversion: Convert between any supported format pair
  • VeOmni Ready: Native support for the Energon format used by the VeOmni training framework
  • Progress Tracking: Real-time progress bars and detailed logging
  • Runtime Loading: Support for both offline conversion and online data loading

📦 Installation

From PyPI

pip install dataloader-bridge

From Source

cd DataBridge
pip install -e .

Dependencies

pip install -r requirements.txt

🚀 Quick Start

Command Line Interface

DataBridge provides a unified command-line interface for all format conversions:

# List all supported formats
databridge list-formats

# Supported Formats:
#   • jsonl
#   • webdataset
#   • binidx
#   • energon

# File Extensions:
#   • .jsonl → jsonl
#   • .json → jsonl
#   • .tar → energon
#   • .bin → binidx
#   • .idx → binidx

# Convert between any supported formats
databridge convert \
    --input-path /path/to/input \
    --output-path /path/to/output \
    --input-format <input_format> \
    --output-format <output_format> \
    --shard-size 1000
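
The --shard-size flag sets how many records go into each output shard. Assuming it counts documents per shard (as the examples in this README suggest), the number of output shards is simply the ceiling of the document count divided by the shard size:

```python
import math

def num_shards(num_documents: int, shard_size: int = 1000) -> int:
    """Number of output shards when packing num_documents into shards of shard_size."""
    return math.ceil(num_documents / shard_size)

# e.g. 4500 documents at the default shard size of 1000 yield 5 shards
```

Larger shards mean fewer files but coarser-grained parallelism when the shards are consumed by multiple data-loader workers.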

Common Conversion Examples

1. Convert Megatron bin/idx to Energon format (for VeOmni training):

# Convert bin/idx dataset to Energon format
databridge convert \
    --input-path /path/to/dataset \
    --output-path /path/to/output/energon_dataset \
    --input-format binidx \
    --output-format energon \
    --shard-size 1000

# Example with real paths:
databridge convert \
    --input-path /prodcpfs/user/weishi/data/text_data/pile_test \
    --output-path /prodcpfs/user/weishi/data/text_data_converted/pile_test/energon \
    --input-format binidx \
    --output-format energon \
    --shard-size 1000

2. Convert JSONL to WebDataset:

databridge convert \
    --input-path data.jsonl \
    --output-path webdataset/ \
    --input-format jsonl \
    --output-format webdataset \
    --shard-size 1000

# Example:
databridge convert -i data/sample.jsonl -o data/sample_webdataset --output-format webdataset

3. Convert bin/idx to JSONL:

databridge convert \
    --input-path dataset \
    --output-path data.jsonl \
    --input-format binidx \
    --output-format jsonl

Python API

Using the Registry

from data_bridge.formats.registry import registry

# Convert bin/idx to Energon format
registry.convert(
    input_path="/path/to/binidx/dataset",
    output_path="/path/to/output/energon",
    input_format="binidx",
    output_format="energon",
    shard_size=1000
)

# Convert JSONL to WebDataset
registry.convert(
    input_path="data.jsonl",
    output_path="webdataset/",
    input_format="jsonl",
    output_format="webdataset",
    shard_size=1000
)

# List available formats
formats = registry.list_formats()
print(f"Supported formats: {formats}")

Using Individual Format Handlers

from data_bridge import Document, JsonlFormatHandler, WebDatasetFormatHandler, BinIdxFormatHandler, EnergonFormatHandler

# Load data using specific handlers
jsonl_handler = JsonlFormatHandler()
binidx_handler = BinIdxFormatHandler()
webdataset_handler = WebDatasetFormatHandler()
energon_handler = EnergonFormatHandler()

# Load returns Document objects
documents = jsonl_handler.load("data.jsonl")
# or
documents = binidx_handler.load("/path/to/binidx/dataset")

# Save accepts Document objects
webdataset_handler.save(documents, "webdataset/", shard_size=1000)
# or
energon_handler.save(documents, "energon/", shard_size=1000)

# Work with Document objects directly
for doc in documents:
    print(f"Document {doc.doc_id}: {doc['text']}")
    # Access data like a dictionary
    if 'metadata' in doc:
        print(f"Metadata: {doc['metadata']}")

🎯 Common Use Cases

Converting Megatron bin/idx to Energon for VeOmni Training

This is the most common use case for DataBridge: converting existing Megatron bin/idx datasets to Energon format for use with the VeOmni training framework.

Step 1: Prepare Your Data

Ensure your bin/idx dataset has the following structure:

/path/to/your/dataset/
├── dataset.bin          # Binary data file
└── dataset.idx          # Index file
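
Before converting, it can help to verify that both companion files are present. A small sanity check using only the standard library (the helper name is ours, not part of the DataBridge API):

```python
import os

def check_binidx(dataset_prefix: str) -> bool:
    """Return True if both companion files (<prefix>.bin and <prefix>.idx) exist."""
    return all(os.path.isfile(dataset_prefix + ext) for ext in (".bin", ".idx"))

# Usage: check_binidx("/path/to/your/dataset/dataset")
```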

Step 2: Convert to Energon Format

# Convert bin/idx to Energon format
databridge convert \
    --input-path /path/to/your/dataset \
    --output-path /path/to/output/energon_dataset \
    --input-format binidx \
    --output-format energon \
    --shard-size 1000

Step 3: Verify the Output

The output Energon dataset will have the following structure:

/path/to/output/energon_dataset/
├── .nv-meta/
│   ├── .info.json       # Dataset metadata
│   ├── dataset.yaml     # Dataset configuration
│   ├── split.yaml       # Split configuration
│   ├── index.sqlite     # Index database
│   └── index.uuid       # Unique identifier
├── shard_000000.tar     # Data shards
├── shard_000000.tar.idx # Shard indices
├── shard_000001.tar
├── shard_000001.tar.idx
└── ...

Step 4: Use with VeOmni

Update your VeOmni training script to use the Energon dataset:

# In your VeOmni debug.sh or training script
DATA_PATH=/path/to/output/energon_dataset
DATA_SET_TYPE=energon

# Run training
bash train.sh tasks/train_torch.py $CONFIG \
    --data.train_path $DATA_PATH \
    --data.datasets_type $DATA_SET_TYPE \
    --train.global_batch_size 128 \
    --train.lr 5e-7

Runtime Data Loading (WIP)

DataBridge also supports runtime dataset loading for training frameworks:

PyTorch Integration

from data_bridge import create_pytorch_loader

# Create PyTorch data loader
loader = create_pytorch_loader(
    dataset_path="data.jsonl",
    batch_size=32,
    shuffle=True,
    num_workers=4
)

# Use in training loop
for batch in loader:
    texts = batch['text']  # List of texts
    ids = batch['id']      # Tensor of IDs
    # Process batch...

HuggingFace Integration

from data_bridge import create_huggingface_loader

# Create HuggingFace loader
loader = create_huggingface_loader(dataset_path="data.jsonl")

# Convert to HuggingFace Dataset
hf_dataset = loader.to_huggingface_dataset()

# Use in training
for doc in loader:
    text = doc['text']
    doc_id = doc['id']
    # Process document...

Megatron Integration

from data_bridge import create_megatron_loader

# Create Megatron loader with tokenizer
loader = create_megatron_loader(
    dataset_path="data.jsonl",
    tokenizer_path="/path/to/tokenizer"
)

# Get tokenized data
for doc in loader:
    tokens = doc['tokens']
    # Process tokenized document...

Testing

# Run all tests
pytest

# Run with coverage
pytest --cov=src/databridge

# Run specific test file
pytest tests/test_converters.py
