
DataBridge

A comprehensive dataset conversion toolkit for transforming between different dataset formats commonly used in machine learning and NLP tasks.

🚀 Features

DataBridge provides seamless conversion between the dataset formats most commonly used in machine learning and NLP pipelines.

Supported Formats

  • JSONL - JSON Lines format for text data
  • Megatron bin/idx - Binary format used by Megatron-LM
  • WebDataset - Tar-based dataset format for large-scale training
  • Energon Dataset - Megatron-Energon compatible format (fully compatible with the VeOmni training framework)
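For reference, JSONL is the simplest of these formats: one JSON object per line. A minimal sketch of writing and reading such records with the standard library (the `id` and `text` field names are illustrative, not a DataBridge requirement):

```python
import json
import tempfile
from pathlib import Path

# Two illustrative records; one JSON object per line is the whole format
records = [
    {"id": 0, "text": "hello world"},
    {"id": 1, "text": "second document"},
]
path = Path(tempfile.mkdtemp()) / "sample.jsonl"
with path.open("w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Reading it back is just json.loads per line
loaded = [json.loads(line) for line in path.open(encoding="utf-8")]
print(loaded)
```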

Key Capabilities

  • Universal Conversion: Convert between any supported format pair
  • VeOmni Ready: Native support for the Energon format used by the VeOmni training framework
  • Progress Tracking: Real-time progress bars and detailed logging
  • Runtime Loading: Support for both offline conversion and online data loading

📦 Installation

From PyPI

pip install dataloader-bridge

From Source

cd DataBridge
pip install -e .

Dependencies

pip install -r requirements.txt

🚀 Quick Start

Command Line Interface

DataBridge provides a unified command-line interface for all format conversions:

# List all supported formats
databridge list-formats

# Supported Formats:
#   • jsonl
#   • webdataset
#   • binidx
#   • energon

# File Extensions:
#   • .jsonl → jsonl
#   • .json → jsonl
#   • .tar → energon
#   • .bin → binidx
#   • .idx → binidx

# Convert between any supported formats
databridge convert \
    --input-path /path/to/input \
    --output-path /path/to/output \
    --input-format <input_format> \
    --output-format <output_format> \
    --shard-size 1000

Common Conversion Examples

1. Convert Megatron bin/idx to Energon format (for VeOmni training):

# Convert bin/idx dataset to Energon format
databridge convert \
    --input-path /path/to/dataset \
    --output-path /path/to/output/energon_dataset \
    --input-format binidx \
    --output-format energon \
    --shard-size 1000

# Example with real paths:
databridge convert \
    --input-path /prodcpfs/user/weishi/data/text_data/pile_test \
    --output-path /prodcpfs/user/weishi/data/text_data_converted/pile_test/energon \
    --input-format binidx \
    --output-format energon \
    --shard-size 1000

2. Convert JSONL to WebDataset:

databridge convert \
    --input-path data.jsonl \
    --output-path webdataset/ \
    --input-format jsonl \
    --output-format webdataset \
    --shard-size 1000

# Example:
databridge convert -i data/sample.jsonl -o data/sample_webdataset --output-format webdataset

3. Convert bin/idx to JSONL:

databridge convert \
    --input-path dataset \
    --output-path data.jsonl \
    --input-format binidx \
    --output-format jsonl

Python API

Using the Registry

from data_bridge.formats.registry import registry

# Convert bin/idx to Energon format
registry.convert(
    input_path="/path/to/binidx/dataset",
    output_path="/path/to/output/energon",
    input_format="binidx",
    output_format="energon",
    shard_size=1000
)

# Convert JSONL to WebDataset
registry.convert(
    input_path="data.jsonl",
    output_path="webdataset/",
    input_format="jsonl",
    output_format="webdataset",
    shard_size=1000
)

# List available formats
formats = registry.list_formats()
print(f"Supported formats: {formats}")

Using Individual Format Handlers

from data_bridge import Document, JsonlFormatHandler, WebDatasetFormatHandler, BinIdxFormatHandler, EnergonFormatHandler

# Load data using specific handlers
jsonl_handler = JsonlFormatHandler()
binidx_handler = BinIdxFormatHandler()
webdataset_handler = WebDatasetFormatHandler()
energon_handler = EnergonFormatHandler()

# Load returns Document objects
documents = jsonl_handler.load("data.jsonl")
# or
documents = binidx_handler.load("/path/to/binidx/dataset")

# Save accepts Document objects
webdataset_handler.save(documents, "webdataset/", shard_size=1000)
# or
energon_handler.save(documents, "energon/", shard_size=1000)

# Work with Document objects directly
for doc in documents:
    print(f"Document {doc.doc_id}: {doc['text']}")
    # Access data like a dictionary
    if 'metadata' in doc:
        print(f"Metadata: {doc['metadata']}")

🎯 Common Use Cases

Converting Megatron bin/idx to Energon for VeOmni Training

This is the most common DataBridge use case: converting existing Megatron bin/idx datasets to Energon format for use with the VeOmni training framework.

Step 1: Prepare Your Data

Ensure your bin/idx dataset has the following structure:

/path/to/your/dataset/
├── dataset.bin          # Binary data file
└── dataset.idx          # Index file
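Before converting, it can be worth confirming that both halves of the pair are present. A small, hypothetical pre-flight check (the `dataset` prefix is an example; DataBridge itself does not require this step):

```python
import tempfile
from pathlib import Path

def check_binidx_pair(prefix: Path) -> bool:
    """Return True only when both <prefix>.bin and <prefix>.idx exist."""
    return prefix.with_suffix(".bin").is_file() and prefix.with_suffix(".idx").is_file()

# Illustrative fixture: a dataset prefix with only the .bin half at first
root = Path(tempfile.mkdtemp())
prefix = root / "dataset"
prefix.with_suffix(".bin").touch()
incomplete = check_binidx_pair(prefix)  # .idx still missing
prefix.with_suffix(".idx").touch()
complete = check_binidx_pair(prefix)    # both halves present
print(incomplete, complete)
```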

Step 2: Convert to Energon Format

# Convert bin/idx to Energon format
databridge convert \
    --input-path /path/to/your/dataset \
    --output-path /path/to/output/energon_dataset \
    --input-format binidx \
    --output-format energon \
    --shard-size 1000

Step 3: Verify the Output

The output Energon dataset will have the following structure:

/path/to/output/energon_dataset/
├── .nv-meta/
│   ├── .info.json       # Dataset metadata
│   ├── dataset.yaml     # Dataset configuration
│   ├── split.yaml       # Split configuration
│   ├── index.sqlite     # Index database
│   └── index.uuid       # Unique identifier
├── shard_000000.tar     # Data shards
├── shard_000000.tar.idx # Shard indices
├── shard_000001.tar
├── shard_000001.tar.idx
└── ...
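A quick sanity check on the result is to confirm that `.nv-meta` exists and that every shard tarball has a matching `.tar.idx` sibling. A sketch of such a check, assuming the directory layout shown above (this helper is not part of DataBridge):

```python
import tempfile
from pathlib import Path

def verify_energon_layout(root: Path) -> bool:
    """Check .nv-meta exists and every shard_*.tar has a matching .tar.idx."""
    if not (root / ".nv-meta").is_dir():
        return False
    shards = sorted(root.glob("shard_*.tar"))
    if not shards:
        return False
    return all((shard.parent / (shard.name + ".idx")).is_file() for shard in shards)

# Illustrative fixture mimicking the tree above
root = Path(tempfile.mkdtemp())
(root / ".nv-meta").mkdir()
for i in range(2):
    (root / f"shard_{i:06d}.tar").touch()
    (root / f"shard_{i:06d}.tar.idx").touch()
ok = verify_energon_layout(root)

# Delete one shard index to see the check fail
(root / "shard_000001.tar.idx").unlink()
broken = verify_energon_layout(root)
print(ok, broken)
```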

Step 4: Use with VeOmni

Update your VeOmni training script to use the Energon dataset:

# In your VeOmni debug.sh or training script
DATA_PATH=/path/to/output/energon_dataset
DATA_SET_TYPE=energon

# Run training
bash train.sh tasks/train_torch.py $CONFIG \
    --data.train_path $DATA_PATH \
    --data.datasets_type $DATA_SET_TYPE \
    --train.global_batch_size 128 \
    --train.lr 5e-7

Runtime Data Loading (WIP)

DataBridge also supports runtime dataset loading for training frameworks:

PyTorch Integration

from data_bridge import create_pytorch_loader

# Create PyTorch data loader
loader = create_pytorch_loader(
    dataset_path="data.jsonl",
    batch_size=32,
    shuffle=True,
    num_workers=4
)

# Use in training loop
for batch in loader:
    texts = batch['text']  # List of texts
    ids = batch['id']      # Tensor of IDs
    # Process batch...

HuggingFace Integration

from data_bridge import create_huggingface_loader

# Create HuggingFace loader
loader = create_huggingface_loader(dataset_path="data.jsonl")

# Convert to HuggingFace Dataset
hf_dataset = loader.to_huggingface_dataset()

# Use in training
for doc in loader:
    text = doc['text']
    doc_id = doc['id']
    # Process document...

Megatron Integration

from data_bridge import create_megatron_loader

# Create Megatron loader with tokenizer
loader = create_megatron_loader(
    dataset_path="data.jsonl",
    tokenizer_path="/path/to/tokenizer"
)

# Get tokenized data
for doc in loader:
    tokens = doc['tokens']
    # Process tokenized document...

Testing

# Run all tests
pytest

# Run with coverage
pytest --cov=src/databridge

# Run specific test file
pytest tests/test_converters.py

