Megatron's multi-modal data loader

Megatron Energon


DISCLAIMER: This package contains research code. APIs may change.

What is this?

Megatron Energon is the multi-modal data loader of Megatron (you can also use it independently).

It's best at

  • loading large training data to train large multi-modal models
  • blending many different datasets together
  • distributing the work across many nodes and processes of a cluster
  • ensuring reproducibility and resumability
  • adapting easily to various types of data samples and processing

Try using it together with Megatron Core.

Quickstart

Megatron Energon is a pip-installable Python package that offers

  • dataset-related classes that you can import in your project
  • a command line utility for data preprocessing and conversion

This document is just a quick start. Please also check out the documentation.

Installation

To install the latest stable version:

pip install megatron-energon

Or to install the current development version:

pip install git+https://github.com/NVIDIA/Megatron-Energon.git

NOTE: We encourage you to install the package (and not just import a local copy). This will ensure you have all the needed dependencies and that you can use the command line tool.

For more details on installing this package, see here.
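
As a quick sanity check after installing, the sketch below reports whether the distribution is installed and whether the command line tool is on the PATH. It relies only on the Python standard library. The helper name check_install is hypothetical and not part of Energon; the distribution name megatron-energon and the command name energon are taken from this document.

```python
import shutil
from importlib import metadata


def check_install(dist_name="megatron-energon", cli_name="energon"):
    """Return (installed version or None, path to the CLI or None).

    check_install is an illustrative helper, not part of Energon.
    """
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # The distribution is not installed in this environment.
        version = None
    # shutil.which returns None when the command is not on the PATH.
    return version, shutil.which(cli_name)
```

If the first element is None, the package is probably only being imported from a local copy; if the second is None, the command line tool is not reachable from the current environment.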

Usage of command line tool

After installation, the command energon will be available.

Here are some examples of things you can do:

Command                        Description
energon prepare DATASET_ROOT   Take an existing WebDataset and add the required yaml files to turn it into an energon-compatible dataset
energon lint DATASET_ROOT      Verify that the dataset complies with the energon dataset format and that all samples are loadable
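
If you want to call these commands from a script (for example in a data preparation pipeline), one possible approach is to wrap them with subprocess. This is a minimal sketch: the helper names energon_command and run_energon are hypothetical, and only the two commands listed above are used. Note that energon prepare runs an interactive assistant, so it still needs a terminal.

```python
import shutil
import subprocess

# Only the two subcommands named in this document.
ENERGON_SUBCOMMANDS = ("prepare", "lint")


def energon_command(subcommand, dataset_root):
    """Build the argument list for an energon call (hypothetical helper)."""
    if subcommand not in ENERGON_SUBCOMMANDS:
        raise ValueError(f"unsupported subcommand: {subcommand!r}")
    return ["energon", subcommand, str(dataset_root)]


def run_energon(subcommand, dataset_root):
    """Run the command and return its exit code (hypothetical helper)."""
    cmd = energon_command(subcommand, dataset_root)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("energon not found; install megatron-energon first")
    # stdin/stdout are inherited, so the interactive assistant can prompt.
    return subprocess.run(cmd, check=False).returncode
```

For example, run_energon("lint", "/my/dataset/path") would execute energon lint /my/dataset/path.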

Usage of the library

To get started, pick a WebDataset-compliant dataset and run energon prepare DATASET_ROOT on it. This starts the interactive assistant and creates the .nv-meta folder. As an alternative to WebDataset, Energon also supports the JSONL format (see the documentation).

Once done, try to load it from your Python program:

from megatron.energon import get_train_dataset, get_loader, WorkerConfig


simple_worker_config = WorkerConfig(rank=0, world_size=1, num_workers=2)


train_ds = get_train_dataset(
    '/my/dataset/path',
    batch_size=2,
    shuffle_buffer_size=None,
    max_samples_per_sequence=None,
    worker_config=simple_worker_config,
)

train_loader = get_loader(train_ds)

for batch in train_loader:
    # Do something with batch
    # Infer, gradient step, ...
    pass
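
The example above hard-codes rank=0 and world_size=1, which fits a single process. For multi-process training, the following sketch derives these values from the RANK and WORLD_SIZE environment variables. It assumes a launcher such as torchrun that sets them; the helper names worker_config_kwargs and build_train_loader are hypothetical. The Energon calls are the same ones used above.

```python
import os


def worker_config_kwargs(env=None, num_workers=2):
    """Derive WorkerConfig arguments from the environment (hypothetical helper).

    Assumes the launcher sets RANK and WORLD_SIZE; falls back to a
    single-process setup when they are missing.
    """
    env = os.environ if env is None else env
    rank = int(env.get("RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is out of range for world_size {world_size}")
    return {"rank": rank, "world_size": world_size, "num_workers": num_workers}


def build_train_loader(dataset_path, batch_size=2):
    """Create a training loader for this process (hypothetical helper)."""
    # Imported lazily so worker_config_kwargs works without Energon installed.
    from megatron.energon import WorkerConfig, get_loader, get_train_dataset

    worker_config = WorkerConfig(**worker_config_kwargs())
    train_ds = get_train_dataset(
        dataset_path,
        batch_size=batch_size,
        shuffle_buffer_size=None,
        max_samples_per_sequence=None,
        worker_config=worker_config,
    )
    return get_loader(train_ds)
```

Each process then calls build_train_loader('/my/dataset/path') and iterates over the result as shown above.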

For more details, read the documentation.
