Megatron's multi-modal data loader
DISCLAIMER: This package contains research code. APIs may change.
What is this?
Megatron Energon is the multi-modal data loader of Megatron (you can also use it independently).
It's best at:
- loading large training data to train large multi-modal models
- blending many different datasets together
- distributing the work across many nodes and processes of a cluster
- ensuring reproducibility and resumability
- adapting easily to various types of data samples and processing
Try using it together with Megatron Core.
Quickstart
Megatron Energon is a pip-installable Python package that offers:
- dataset-related classes that you can import in your project
- a command line utility for data preprocessing and conversion
This document is just a quick start. Please also check out the documentation.
Installation
To install the latest stable version:
pip install megatron-energon
Or to install the current development version:
pip install git+https://github.com/NVIDIA/Megatron-Energon.git
NOTE: We encourage you to install the package (and not just import a local copy). This will ensure you have all the needed dependencies and that you can use the command line tool.
For more details on installing this package, see here.
Usage of command line tool
After installation, the command energon will be available.
Here are some examples of things you can do:
| Command | Description |
|---|---|
| energon prepare DATASET_ROOT | Take an existing WebDataset and add the required yaml files to turn it into an energon-compatible dataset |
| energon lint DATASET_ROOT | Verify that the dataset complies with the energon dataset format and that all samples are loadable |
Usage of the library
To get started, pick a WebDataset-compliant dataset and run energon prepare DATASET_ROOT on it. This starts the interactive assistant, which creates the .nv-meta folder. As an alternative to WebDataset, Energon also supports the JSONL format; see here.
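If no WebDataset is at hand, a tiny one can be built for experimentation. The following is a minimal sketch using only the Python standard library, assuming the usual WebDataset convention that a shard is a plain tar file in which files sharing a base name form one sample. The helper `write_shard`, the shard file name, and the toy samples are hypothetical and not part of Energon.

```python
import io
import json
import tarfile
import tempfile
from pathlib import Path


def write_shard(shard_path, samples):
    """Write samples to a tar shard; files sharing a key form one sample."""
    with tarfile.open(shard_path, "w") as tar:
        for key, parts in samples:
            for ext, payload in parts.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


# Hypothetical toy data: two samples, each with a caption and JSON metadata.
samples = [
    ("000000", {"txt": b"a red square", "json": json.dumps({"label": 0}).encode()}),
    ("000001", {"txt": b"a blue circle", "json": json.dumps({"label": 1}).encode()}),
]

# Stand-in for DATASET_ROOT; a real dataset would live in a permanent directory.
dataset_root = Path(tempfile.mkdtemp())
shard_path = dataset_root / "shard-000000.tar"
write_shard(shard_path, samples)

# Read the shard back to confirm the member layout.
with tarfile.open(shard_path) as tar:
    member_names = tar.getnames()
print(member_names)
```

Running energon prepare on the directory that holds the shard would then be the next step.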
Once done, try to load it from your Python program:
from megatron.energon import get_train_dataset, get_loader, WorkerConfig

simple_worker_config = WorkerConfig(rank=0, world_size=1, num_workers=2)

train_ds = get_train_dataset(
    '/my/dataset/path',
    batch_size=2,
    shuffle_buffer_size=None,
    max_samples_per_sequence=None,
    worker_config=simple_worker_config,
)

train_loader = get_loader(train_ds)

for batch in train_loader:
    # Do something with batch
    # Infer, gradient step, ...
    pass
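The example above hard-codes rank=0 and world_size=1, which fits a single process. For a multi-process job, one possible approach is to read these values from the environment. The sketch below assumes the job is started by a launcher such as torchrun, which sets the RANK and WORLD_SIZE environment variables; the helper `worker_config_kwargs` is hypothetical and only builds the keyword arguments shown above, so it runs without Energon installed.

```python
import os


def worker_config_kwargs(num_workers=2, env=None):
    """Build keyword arguments for WorkerConfig from launcher environment variables."""
    env = os.environ if env is None else env
    # Assumption: the launcher exports RANK and WORLD_SIZE (torchrun does).
    rank = int(env.get("RANK", "0"))
    world_size = int(env.get("WORLD_SIZE", "1"))
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside world size {world_size}")
    return {"rank": rank, "world_size": world_size, "num_workers": num_workers}


# Single-process fallback: no launcher variables are set.
single = worker_config_kwargs(env={})
# Simulated third process of an eight-process job.
distributed = worker_config_kwargs(num_workers=4, env={"RANK": "2", "WORLD_SIZE": "8"})
print(single)
print(distributed)

# With megatron-energon installed, the result would be passed on as:
# worker_config = WorkerConfig(**worker_config_kwargs())
```

Passing an explicit `env` mapping keeps the helper easy to test; in a real job it would be called without arguments so that it reads `os.environ`.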
For more details, read the documentation.
Download files
File details
Details for the file megatron_energon-7.3.0.tar.gz.
File metadata
- Download URL: megatron_energon-7.3.0.tar.gz
- Upload date:
- Size: 206.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c720e933612f9c2f96973f40798089d2fbfec1c466601bf87092f1daf6e98ae1 |
| MD5 | 84b9198742b84301d6279f08896b73bd |
| BLAKE2b-256 | b19772ae4555bd4ad6e1023c987cbc03a7f57f7fcd467af32ea3a69b68eb2d73 |
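A downloaded archive can be checked against the SHA256 digest listed above. The following is a minimal sketch using the standard-library hashlib module; the helper names `sha256_of_file` and `matches_expected` are hypothetical, and the demo hashes a small stand-in file because the real archive is not assumed to be present locally.

```python
import hashlib
import tempfile
from pathlib import Path

# SHA256 digest of megatron_energon-7.3.0.tar.gz as listed in the table above.
EXPECTED_SDIST_SHA256 = "c720e933612f9c2f96973f40798089d2fbfec1c466601bf87092f1daf6e98ae1"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SDIST_SHA256):
    """Compare a file's digest with the expected one, ignoring letter case."""
    return sha256_of_file(path) == expected.lower()


# Stand-in file for the demo; replace with the path of the downloaded archive.
demo_path = Path(tempfile.mkdtemp()) / "demo.bin"
demo_path.write_bytes(b"abc")
demo_digest = sha256_of_file(demo_path)
demo_matches = matches_expected(demo_path)
print(demo_digest, demo_matches)
```

For the stand-in file the comparison is expected to fail; it should succeed only for an unmodified copy of the published archive.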
Provenance
The following attestation bundles were made for megatron_energon-7.3.0.tar.gz:
Publisher: release.yml on NVIDIA/Megatron-Energon
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: megatron_energon-7.3.0.tar.gz
- Subject digest: c720e933612f9c2f96973f40798089d2fbfec1c466601bf87092f1daf6e98ae1
- Sigstore transparency entry: 831414041
- Sigstore integration time:
- Permalink: NVIDIA/Megatron-Energon@bb68392383d6423805b50cc2bc256fcc3245f40b
- Branch / Tag: refs/tags/7.3.0
- Owner: https://github.com/NVIDIA
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@bb68392383d6423805b50cc2bc256fcc3245f40b
- Trigger Event: release
File details
Details for the file megatron_energon-7.3.0-py3-none-any.whl.
File metadata
- Download URL: megatron_energon-7.3.0-py3-none-any.whl
- Upload date:
- Size: 283.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a8f12137885d4563f99bcb76c9a3984c8d62eb08574c1e5b85bc09f688601e93 |
| MD5 | 4aa709e733bb54f051f183a3dee3f849 |
| BLAKE2b-256 | adfe51f0b6ce5aac6a43ba5b76911da2b079848a45f8a729d128683ba6666868 |
Provenance
The following attestation bundles were made for megatron_energon-7.3.0-py3-none-any.whl:
Publisher: release.yml on NVIDIA/Megatron-Energon
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: megatron_energon-7.3.0-py3-none-any.whl
- Subject digest: a8f12137885d4563f99bcb76c9a3984c8d62eb08574c1e5b85bc09f688601e93
- Sigstore transparency entry: 831414044
- Sigstore integration time:
- Permalink: NVIDIA/Megatron-Energon@bb68392383d6423805b50cc2bc256fcc3245f40b
- Branch / Tag: refs/tags/7.3.0
- Owner: https://github.com/NVIDIA
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@bb68392383d6423805b50cc2bc256fcc3245f40b
- Trigger Event: release