
DataBridge

A comprehensive dataset conversion toolkit for transforming between different dataset formats commonly used in machine learning and NLP tasks.

🚀 Features

Supported Formats

  • JSONL - JSON Lines format for text data
  • Megatron bin/idx - Binary format used by Megatron-LM
  • WebDataset - Tar-based dataset format for large-scale training
  • Energon Dataset - Megatron-Energon compatible format (fully compatible with the VeOmni training framework)
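
For concreteness, a JSONL file stores one self-contained JSON object per line. A minimal illustration using only the Python standard library (the field names here are hypothetical, not a schema DataBridge requires):

```python
import json

# A JSONL file is a sequence of lines, each a complete JSON object.
records = [
    {"id": 0, "text": "first document"},
    {"id": 1, "text": "second document"},
]

# Serialize: one json.dumps() call per line, lines joined by newlines.
jsonl_text = "\n".join(json.dumps(r) for r in records)

# Parse: split on newlines and json.loads() each line independently.
parsed = [json.loads(line) for line in jsonl_text.splitlines()]
assert parsed == records
```

Because each line parses independently, JSONL files can be streamed and sharded without loading the whole dataset into memory.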

Key Capabilities

  • Universal Conversion: Convert between any supported format pair
  • VeOmni Ready: Native support for the Energon format used by the VeOmni training framework
  • Progress Tracking: Real-time progress bars and detailed logging
  • Runtime Loading: Support for both offline conversion and online data loading

📦 Installation

From PyPI

pip install dataloader-bridge

From Source

cd DataBridge
pip install -e .

Dependencies

pip install -r requirements.txt

🚀 Quick Start

Command Line Interface

DataBridge provides a unified command-line interface for all format conversions:

# List all supported formats
databridge list-formats

# Supported Formats:
#   • jsonl
#   • webdataset
#   • binidx
#   • energon

# File Extensions:
#   • .jsonl → jsonl
#   • .json → jsonl
#   • .tar → energon
#   • .bin → binidx
#   • .idx → binidx

# Convert between any supported formats
databridge convert \
    --input-path /path/to/input \
    --output-path /path/to/output \
    --input-format <input_format> \
    --output-format <output_format> \
    --shard-size 1000
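
The --shard-size flag sets how many records go into each output shard. Assuming it counts documents per shard (as the examples in this README suggest), the number of output shards is simply the ceiling of the document count divided by the shard size:

```python
import math

def num_shards(num_documents: int, shard_size: int = 1000) -> int:
    """Number of output shards when packing num_documents into shards of shard_size."""
    return math.ceil(num_documents / shard_size)

# e.g. 4500 documents at the default shard size of 1000 yield 5 shards
```

Larger shards mean fewer files but coarser-grained parallelism when the shards are consumed by multiple data-loader workers.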

Common Conversion Examples

1. Convert Megatron bin/idx to Energon format (for VeOmni training):

# Convert bin/idx dataset to Energon format
databridge convert \
    --input-path /path/to/dataset \
    --output-path /path/to/output/energon_dataset \
    --input-format binidx \
    --output-format energon \
    --shard-size 1000

# Example with real paths:
databridge convert \
    --input-path /prodcpfs/user/weishi/data/text_data/pile_test \
    --output-path /prodcpfs/user/weishi/data/text_data_converted/pile_test/energon \
    --input-format binidx \
    --output-format energon \
    --shard-size 1000

2. Convert JSONL to WebDataset:

databridge convert \
    --input-path data.jsonl \
    --output-path webdataset/ \
    --input-format jsonl \
    --output-format webdataset \
    --shard-size 1000

# Example:
databridge convert -i data/sample.jsonl -o data/sample_webdataset --output-format webdataset

3. Convert bin/idx to JSONL:

databridge convert \
    --input-path dataset \
    --output-path data.jsonl \
    --input-format binidx \
    --output-format jsonl

Python API

Using the Registry

from data_bridge.formats.registry import registry

# Convert bin/idx to Energon format
registry.convert(
    input_path="/path/to/binidx/dataset",
    output_path="/path/to/output/energon",
    input_format="binidx",
    output_format="energon",
    shard_size=1000
)

# Convert JSONL to WebDataset
registry.convert(
    input_path="data.jsonl",
    output_path="webdataset/",
    input_format="jsonl",
    output_format="webdataset",
    shard_size=1000
)

# List available formats
formats = registry.list_formats()
print(f"Supported formats: {formats}")

Using Individual Format Handlers

from data_bridge import Document, JsonlFormatHandler, WebDatasetFormatHandler, BinIdxFormatHandler, EnergonFormatHandler

# Load data using specific handlers
jsonl_handler = JsonlFormatHandler()
binidx_handler = BinIdxFormatHandler()
webdataset_handler = WebDatasetFormatHandler()
energon_handler = EnergonFormatHandler()

# Load returns Document objects
documents = jsonl_handler.load("data.jsonl")
# or
documents = binidx_handler.load("/path/to/binidx/dataset")

# Save accepts Document objects
webdataset_handler.save(documents, "webdataset/", shard_size=1000)
# or
energon_handler.save(documents, "energon/", shard_size=1000)

# Work with Document objects directly
for doc in documents:
    print(f"Document {doc.doc_id}: {doc['text']}")
    # Access data like a dictionary
    if 'metadata' in doc:
        print(f"Metadata: {doc['metadata']}")

🎯 Common Use Cases

Converting Megatron bin/idx to Energon for VeOmni Training

This is the most common use case for DataBridge: converting existing Megatron bin/idx datasets to Energon format for use with the VeOmni training framework.

Step 1: Prepare Your Data

Ensure your bin/idx dataset has the following structure:

/path/to/your/dataset/
├── dataset.bin          # Binary data file
└── dataset.idx          # Index file
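
Before converting, it can help to verify that both companion files are present. A small sanity check using only the standard library (the helper name is ours, not part of the DataBridge API):

```python
import os

def check_binidx(dataset_prefix: str) -> bool:
    """Return True if both companion files (<prefix>.bin and <prefix>.idx) exist."""
    return all(os.path.isfile(dataset_prefix + ext) for ext in (".bin", ".idx"))

# Usage: check_binidx("/path/to/your/dataset/dataset")
```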

Step 2: Convert to Energon Format

# Convert bin/idx to Energon format
databridge convert \
    --input-path /path/to/your/dataset \
    --output-path /path/to/output/energon_dataset \
    --input-format binidx \
    --output-format energon \
    --shard-size 1000

Step 3: Verify the Output

The output Energon dataset will have the following structure:

/path/to/output/energon_dataset/
├── .nv-meta/
│   ├── .info.json       # Dataset metadata
│   ├── dataset.yaml     # Dataset configuration
│   ├── split.yaml       # Split configuration
│   ├── index.sqlite     # Index database
│   └── index.uuid       # Unique identifier
├── shard_000000.tar     # Data shards
├── shard_000000.tar.idx # Shard indices
├── shard_000001.tar
├── shard_000001.tar.idx
└── ...

Step 4: Use with VeOmni

Update your VeOmni training script to use the Energon dataset:

# In your VeOmni debug.sh or training script
DATA_PATH=/path/to/output/energon_dataset
DATA_SET_TYPE=energon

# Run training
bash train.sh tasks/train_torch.py $CONFIG \
    --data.train_path $DATA_PATH \
    --data.datasets_type $DATA_SET_TYPE \
    --train.global_batch_size 128 \
    --train.lr 5e-7

Runtime Data Loading (WIP)

DataBridge also supports runtime dataset loading for training frameworks:

PyTorch Integration

from data_bridge import create_pytorch_loader

# Create PyTorch data loader
loader = create_pytorch_loader(
    dataset_path="data.jsonl",
    batch_size=32,
    shuffle=True,
    num_workers=4
)

# Use in training loop
for batch in loader:
    texts = batch['text']  # List of texts
    ids = batch['id']      # Tensor of IDs
    # Process batch...

HuggingFace Integration

from data_bridge import create_huggingface_loader

# Create HuggingFace loader
loader = create_huggingface_loader(dataset_path="data.jsonl")

# Convert to HuggingFace Dataset
hf_dataset = loader.to_huggingface_dataset()

# Use in training
for doc in loader:
    text = doc['text']
    doc_id = doc['id']
    # Process document...

Megatron Integration

from data_bridge import create_megatron_loader

# Create Megatron loader with tokenizer
loader = create_megatron_loader(
    dataset_path="data.jsonl",
    tokenizer_path="/path/to/tokenizer"
)

# Get tokenized data
for doc in loader:
    tokens = doc['tokens']
    # Process tokenized document...

Testing

# Run all tests
pytest

# Run with coverage
pytest --cov=src/databridge

# Run specific test file
pytest tests/test_converters.py
