
Megatron Core - a library for efficient and scalable training of transformer-based models


Megatron-LM & Megatron Core

GPU-optimized library for training transformer models at scale


⚡ Quick Start

# 1. Install Megatron Core with required dependencies
pip install --no-build-isolation megatron-core[mlm,dev]

# 2. Clone repository for examples
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
pip install --no-build-isolation .[mlm,dev]

→ Complete Installation Guide - Docker, pip variants (dev, lts, etc.), source installation, and system requirements

Latest News

  • 📣 NEW! Megatron Dev Branch - early-access branch with experimental features.
  • 🔄 Megatron Bridge - Bidirectional converter for interoperability between Hugging Face and Megatron checkpoints, featuring production-ready recipes for popular models.
  • [2025/08] MoE Q3-Q4 2025 Roadmap - Comprehensive roadmap for MoE features including DeepSeek-V3, Qwen3, advanced parallelism strategies, FP8 optimizations, and Blackwell performance enhancements.
  • [2025/08] GPT-OSS Model - Advanced features including YaRN RoPE scaling, attention sinks, and custom activation functions are being integrated into Megatron Core.
  • [2025/06] Megatron MoE Model Zoo - Best practices and optimized configurations for training DeepSeek-V3, Mixtral, and Qwen3 MoE models with performance benchmarking and checkpoint conversion tools.
  • [2025/05] Megatron Core v0.11.0 brings new capabilities for multi-data center LLM training (blog).
Previous News
  • [2024/07] Megatron Core v0.7 improves scalability and training resiliency and adds support for multimodal training (blog).
  • [2024/06] Megatron Core added support for Mamba-based models. Check out our paper An Empirical Study of Mamba-based Language Models and code example.
  • [2024/01] NVIDIA has released the core capabilities of Megatron-LM into Megatron Core in this repository. Megatron Core expands upon Megatron-LM's GPU-optimized techniques with more cutting-edge innovations on system-level optimizations, featuring composable and modular APIs. See the Megatron Core: Composable Library section below for more details.
Table of Contents

Getting Started

Core Features

Training

Resources

Megatron Overview

Project Structure

Megatron-LM/
├── megatron/
│   ├── core/                    # Megatron Core (kernels, parallelism, building blocks)
│   │   ├── models/              # Transformer models
│   │   ├── transformer/         # Transformer building blocks
│   │   ├── tensor_parallel/     # Tensor parallelism
│   │   ├── pipeline_parallel/   # Pipeline parallelism
│   │   ├── distributed/         # Distributed training (FSDP, DDP)
│   │   ├── optimizer/           # Optimizers
│   │   ├── datasets/            # Dataset loaders
│   │   ├── inference/           # Inference engines
│   │   └── export/              # Model export (e.g. TensorRT-LLM)
│   ├── training/                # Training scripts
│   ├── inference/               # Inference server
│   ├── legacy/                  # Legacy components
│   └── post_training/           # Post-training (RLHF, etc.)
├── examples/                    # Ready-to-use training examples
├── tools/                       # Utility tools
├── tests/                       # Comprehensive test suite
└── docs/                        # Documentation

Megatron-LM: Reference Implementation

Reference implementation that includes Megatron Core plus everything needed to train models.

Best for:

  • Training state-of-the-art foundation models at scale with cutting-edge performance on latest NVIDIA hardware
  • Research teams exploring new architectures and training techniques
  • Learning distributed training concepts and best practices
  • Quick experimentation with proven model configurations

What you get:

  • Pre-configured training scripts for GPT, Llama, DeepSeek, Qwen, and more.
  • End-to-end examples from data prep to evaluation
  • Research-focused tools and utilities

Megatron Core: Composable Library

Composable library with GPU-optimized building blocks for custom training frameworks; a minimal usage sketch follows the feature list below.

Best for:

  • Framework developers building on top of modular and optimized components
  • Research teams needing custom training loops, optimizers, or data pipelines
  • ML engineers requiring fault-tolerant training pipelines

What you get:

  • Composable transformer building blocks (attention, MLP, etc.)
  • Advanced parallelism strategies (TP, PP, DP, EP, CP)
  • Pipeline schedules and distributed optimizers
  • Mixed precision support (FP16, BF16, FP8)
  • GPU-optimized kernels and memory management
  • High-performance dataloaders and dataset utilities
  • Model architectures (LLaMA, Qwen, GPT, Mixtral, Mamba, etc.)
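
A minimal sketch of how these building blocks compose, loosely following the repository's examples/run_simple_mcore_train_loop.py (argument values here are illustrative; verify class and argument names against your installed version):

# Minimal Megatron Core model construction (single GPU; a sketch, not a full training loop)
import os
import torch
from megatron.core import parallel_state
from megatron.core.transformer.transformer_config import TransformerConfig
from megatron.core.models.gpt.gpt_model import GPTModel
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_local_spec

# torchrun normally sets these; defaults are provided for a single-process run
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
torch.distributed.init_process_group(backend="nccl", world_size=1, rank=0)
parallel_state.initialize_model_parallel(tensor_model_parallel_size=1,
                                         pipeline_model_parallel_size=1)

config = TransformerConfig(
    num_layers=2,
    hidden_size=128,
    num_attention_heads=4,
    use_cpu_initialization=True,
    pipeline_dtype=torch.float32,
)
# The "local" layer spec uses Megatron Core's own modules rather than Transformer Engine
model = GPTModel(
    config=config,
    transformer_layer_spec=get_gpt_layer_local_spec(),
    vocab_size=1024,
    max_sequence_length=64,
).cuda()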

Ecosystem Libraries

Libraries used by Megatron Core:

  • Transformer Engine - provides FP8 support and cuDNN-fused attention kernels (see Software Requirements and FlashAttention sections below)

Libraries using Megatron Core:

  • Megatron Bridge - Training library with bidirectional Hugging Face ↔ Megatron checkpoint conversion, flexible training loops, and production-ready recipes
  • NeMo RL - Scalable toolkit for efficient reinforcement learning with RLHF, DPO, and other post-training methods
  • NeMo Framework - Enterprise framework with cloud-native support and end-to-end examples
  • TensorRT Model Optimizer (ModelOpt) - Model optimization toolkit for quantization, pruning, and distillation

Compatible with: Hugging Face Accelerate, Colossal-AI, DeepSpeed

Installation

๐Ÿณ Docker (Recommended)

For optimal compatibility with Megatron Core releases and testing, we strongly recommend using the previous release of the PyTorch NGC Container rather than the latest one. Our releases are always based on the previous month's NGC container, which ensures compatibility and stability.

Note: The NGC PyTorch container constrains the Python environment globally via PIP_CONSTRAINT. In the following examples we unset this variable.

This container comes with all dependencies pre-installed with compatible versions and optimized configurations for NVIDIA GPUs:

  • PyTorch (latest stable version)
  • CUDA, cuDNN, NCCL (latest stable versions)
  • Support for FP8 on NVIDIA Hopper, Ada, and Blackwell GPUs
  • For best performance, use GPUs from the NVIDIA Turing architecture generation or later
# Run container with mounted directories
docker run --runtime=nvidia --gpus all -it --rm \
  -v /path/to/megatron:/workspace/megatron \
  -v /path/to/dataset:/workspace/dataset \
  -v /path/to/checkpoints:/workspace/checkpoints \
  -e PIP_CONSTRAINT= \
  nvcr.io/nvidia/pytorch:25.04-py3

Pip Installation

Megatron Core provides pip extras targeting two NGC PyTorch container lines:

  • dev: Moving head that tracks the most recent upstream dependencies
  • lts: Long-term support for NGC PyTorch 24.01

Either extra can be combined with mlm, which adds the package dependencies Megatron-LM requires on top of Megatron Core.

# Install the latest release with dev dependencies
pip install "setuptools<80.0.0,>=77.0.0" "packaging>=24.2"
pip install --no-build-isolation megatron-core[dev]

# For running a Megatron-LM (M-LM) application:
pip install "setuptools<80.0.0,>=77.0.0" "packaging>=24.2"
pip install --no-build-isolation megatron-core[mlm,dev]

# Install packages for LTS support (NGC PyTorch 24.01)
pip install "setuptools<80.0.0,>=77.0.0" "packaging>=24.2"
pip install --no-build-isolation megatron-core[lts]

# For running a Megatron-LM application on LTS:
pip install "setuptools<80.0.0,>=77.0.0" "packaging>=24.2"
pip install --no-build-isolation megatron-core[mlm,lts]

For a version of Megatron Core with only torch, run:

pip install megatron-core
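
To sanity-check the installation, a plain import is sufficient (the __version__ attribute shown here is assumed to be present in recent releases):

python -c "import megatron.core; print(megatron.core.__version__)"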

System Requirements

Hardware Requirements

  • FP8 Support: NVIDIA Hopper, Ada, Blackwell GPUs
  • Recommended: NVIDIA Turing architecture or later

Software Requirements

  • CUDA/cuDNN/NCCL: Latest stable versions
  • PyTorch: Latest stable version
  • Transformer Engine: Latest stable version
  • Python: 3.12 recommended

Performance Benchmarking

For our latest performance benchmarking results, please refer to NVIDIA NeMo Framework Performance Summary.

Our codebase efficiently trains models from 2B to 462B parameters across thousands of GPUs, achieving up to 47% Model FLOP Utilization (MFU) on H100 clusters.
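
For context, MFU is the delivered model FLOPs per second divided by the hardware's peak FLOPs per second. A back-of-the-envelope sketch using the common 6 x parameters FLOPs-per-token approximation and an assumed ~989 TFLOP/s BF16 dense peak per H100 (the published figures use the benchmark's own accounting):

# Rough MFU estimate (illustrative only; throughput and GPU count are hypothetical)
params = 175e9             # model parameters
tokens_per_sec = 4.3e5     # measured end-to-end cluster throughput (hypothetical)
num_gpus = 1024            # GPUs used for the run (hypothetical)
peak_flops = 989e12        # approximate H100 BF16 dense peak per GPU, FLOP/s

model_flops_per_token = 6 * params                     # forward + backward approximation
achieved_flops = tokens_per_sec * model_flops_per_token
mfu = achieved_flops / (num_gpus * peak_flops)
print(f"MFU ~ {mfu:.1%}")                              # ~44.6% with these numbers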

[Table: benchmarked model configurations]

Benchmark Configuration:

  • Vocabulary size: 131,072 tokens
  • Sequence length: 4096 tokens
  • Model scaling: Varied hidden size, attention heads, and layers to achieve target parameter counts
  • Communication optimizations: Fine-grained overlapping with DP (--overlap-grad-reduce, --overlap-param-gather), TP (--tp-comm-overlap), and PP (enabled by default)

Key Results:

  • 6144 H100 GPUs: Successfully benchmarked 462B parameter model training
  • Superlinear scaling: MFU increases from 41% to 47-48% with model size
  • End-to-end measurement: Throughputs include all operations (data loading, optimizer steps, communication, logging)
  • Production ready: Full training pipeline with checkpointing and fault tolerance
  • Note: Performance results measured without training to convergence

Weak Scaling Results

Our weak-scaling results show superlinear scaling (MFU increases from 41% for the smallest model considered to 47-48% for the largest models); this is because larger GEMMs have higher arithmetic intensity and are consequently more efficient to execute.

[Figure: weak scaling results]

Strong Scaling Results

We also strong scaled the standard GPT-3 model (our version has slightly more than 175 billion parameters due to larger vocabulary size) from 96 H100 GPUs to 4608 GPUs, using the same batch size of 1152 sequences throughout. Communication becomes more exposed at larger scale, leading to a reduction in MFU from 47% to 42%.

[Figure: strong scaling results]

Training

Getting Started

Simple Training Example

# Distributed training example (2 GPUs, mock data)
torchrun --nproc_per_node=2 examples/run_simple_mcore_train_loop.py

Llama-3 Training Example

# 8 GPUs, FP8 precision, mock data
./examples/llama/train_llama3_8b_fp8.sh

Data Preparation

JSONL Data Format

{"text": "Your training text here..."}
{"text": "Another training sample..."}

Basic Preprocessing

python tools/preprocess_data.py \
    --input data.jsonl \
    --output-prefix processed_data \
    --tokenizer-type HuggingFaceTokenizer \
    --tokenizer-model /path/to/tokenizer.model \
    --workers 8 \
    --append-eod

Key Arguments

  • --input: Path to input JSON/JSONL file
  • --output-prefix: Prefix for output binary files (.bin and .idx); see the usage sketch after this list
  • --tokenizer-type: Tokenizer type (HuggingFaceTokenizer, GPT2BPETokenizer, etc.)
  • --tokenizer-model: Path to tokenizer model file
  • --workers: Number of parallel workers for processing
  • --append-eod: Add end-of-document token
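
With the default json key ("text"), the command above typically produces processed_data_text_document.bin and .idx; the prefix (without extension) is then passed to training. A sketch, with the exact output suffix depending on your preprocessing options:

# Point training at the preprocessed prefix (path without the .bin/.idx extension)
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --data-path processed_data_text_document \
    --tokenizer-type HuggingFaceTokenizer \
    --tokenizer-model /path/to/tokenizer.model
    # ...plus the model, parallelism, and optimizer flags shown elsewhere in this guide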

Parallelism Strategies

Data Parallelism (DP)

Standard Data Parallel

# Standard DDP - replicate model on each GPU
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --data-parallel-sharding-strategy no_shard

Fully Sharded Data Parallel (FSDP)

# Megatron's optimized FSDP (~15% faster than PyTorch FSDP2)
--use-custom-fsdp

# PyTorch FSDP2
--use-torch-fsdp2

# Sharding strategies
--data-parallel-sharding-strategy optim              # Shard optimizer states (ZeRO-1)
--data-parallel-sharding-strategy optim_grads        # Shard gradients + optimizer (ZeRO-2)
--data-parallel-sharding-strategy optim_grads_params # Shard parameters + gradients + optimizer (ZeRO-3)

Tensor Parallelism (TP)

Split individual model layers across GPUs:

--tensor-model-parallel-size 4  # 4-way tensor parallelism
--sequence-parallel             # Enable sequence parallelism (recommended with TP)

Pipeline Parallelism (PP)

Split model depth across GPUs:

--pipeline-model-parallel-size 8     # 8 pipeline stages
--virtual-pipeline-model-parallel-size 4  # Virtual pipeline for better load balancing

Context Parallelism (CP)

Split long sequences across GPUs for handling long contexts:

--context-parallel-size 2                    # 2-way context parallelism
--cp-comm-type p2p                          # Communication: p2p, a2a, allgather, a2a+p2p
--hierarchical-context-parallel-sizes 2 4   # Hierarchical context parallelism

Expert Parallelism (EP)

For Mixture of Experts (MoE) models:

--expert-model-parallel-size 4  # 4-way expert parallelism
--num-experts 8                 # 8 experts per MoE layer
--moe-grouped-gemm              # Optimize expert computation

Combining Parallelism Strategies

Parallelism Selection Guide

Based on NVIDIA NeMo production configurations (a launch-flag sketch follows the table):

Model              GPUs      TP   PP   CP   EP   Notes
Llama-3 8B         8         1    1    2    1    CP for long seqlen (8K)
Llama-3 70B        64        4    4    2    1    TP+PP
Llama-3.1 405B     1024      8    8    2    1    3D parallelism for scale
GPT-3 175B         128-512   4    8    1    1    Large model config
Mixtral 8x7B       64        1    4    1    8    EP for MoE
Mixtral 8x22B      256       4    4    8    8    Combined TP+EP for large MoE
DeepSeek-V3 671B   1024      2    16   1    64   Large MoE config
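
As a concrete illustration, the Llama-3 70B row above maps to launch flags roughly like the following (a sketch; global batch size, model architecture, data, and cluster launcher settings are omitted):

# 64 GPUs = 8 nodes x 8 GPUs; TP=4 x PP=4 x CP=2 leaves DP=2
torchrun --nnodes=8 --nproc_per_node=8 pretrain_gpt.py \
    --tensor-model-parallel-size 4 \
    --pipeline-model-parallel-size 4 \
    --context-parallel-size 2 \
    --sequence-parallel \
    --use-distributed-optimizer \
    --bf16
    # ...plus model, data, and optimizer flags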

MoE-Specific Requirements

Important: When combining Expert Parallelism (EP) with Tensor Parallelism (TP), Sequence Parallelism (SP) must be enabled.
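
For example, an EP+TP MoE configuration would include flags along these lines (sizes are illustrative, not a tuned recipe):

# EP combined with TP requires sequence parallelism
--tensor-model-parallel-size 2 \
--expert-model-parallel-size 4 \
--sequence-parallel \
--num-experts 8 \
--moe-grouped-gemm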

Performance Optimizations

Feature                                   Flag                           Benefit
FlashAttention                            --attention-backend            Faster attention and lower memory usage
FP8 Training                              --fp8-hybrid                   Faster training
Activation Checkpointing                  --recompute-activations        Reduced memory usage
Data Parallelism Communication Overlap    --overlap-grad-reduce          Faster distributed training
Distributed Optimizer                     --use-distributed-optimizer    Reduced checkpointing time

→ NVIDIA NeMo Framework Performance Tuning Guide - Comprehensive performance optimization guide covering advanced tuning techniques, communication overlaps, memory optimizations, and profiling options.

FlashAttention

FlashAttention is a fast and memory-efficient attention algorithm. We recommend the default usage, which uses cuDNN for attention via Transformer Engine and provides up to 50% speedups on forward and 84% on backward propagation with FP8 kernels. The flash-attn package is also supported via --use-flash-attn.
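
In flag form this looks roughly like the following; the backend name accepted by --attention-backend is an assumption against the current argument parser, and the default requires no explicit flag:

# Default: backend chosen automatically (cuDNN fused attention via Transformer Engine)
--attention-backend auto

# Use the flash-attn package instead
--use-flash-attn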

Mixed Precision Training

--fp16                    # Standard FP16
--bf16                    # BFloat16 (recommended for large models)
--fp8-hybrid              # FP8 training (Hopper, Ada, and Blackwell GPUs)

Activation Checkpointing and Recomputation

# For limited memory
--recompute-activations

# For extreme memory constraints
--recompute-granularity full \
--recompute-method uniform

Data Parallelism Communication Overlap

--overlap-grad-reduce
--overlap-param-gather

Distributed Optimizer

--use-distributed-optimizer

Roadmaps

Stay up-to-date with our development roadmaps and planned features:

  • MoE Q3-Q4 2025 Roadmap - Comprehensive MoE feature development including DeepSeek-V3, Qwen3, advanced parallelism, FP8 optimizations, and Blackwell enhancements
  • GPT-OSS Implementation Tracker - Advanced features including YaRN RoPE scaling, attention sinks, and custom activation functions

More roadmap trackers will be added soon.

Community & Support

Getting Help

  • 📖 Documentation - Official documentation
  • 🐛 Issues - Bug reports and feature requests

Contributing

We โค๏ธ contributions! Ways to contribute:

  • ๐Ÿ› Report bugs - Help us improve reliability
  • ๐Ÿ’ก Suggest features - Shape the future of Megatron Core
  • ๐Ÿ“ Improve docs - Make Megatron Core more accessible
  • ๐Ÿ”ง Submit PRs - Contribute code improvements

→ Contributing Guide

Citation

@article{megatron-lm,
  title={Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism},
  author={Shoeybi, Mohammad and Patwary, Mostofa and Puri, Raul and LeGresley, Patrick and Casper, Jared and Catanzaro, Bryan},
  journal={arXiv preprint arXiv:1909.08053},
  year={2019}
}
