
Megatron Core - a library for efficient and scalable training of transformer-based models

Project description

Megatron-LM and Megatron Core

GPU-optimized library for training transformer models at scale


About

This repository contains two components: Megatron-LM and Megatron Core.

Megatron-LM is a reference example that includes Megatron Core plus pre-configured training scripts. Best for research teams, learning distributed training, and quick experimentation.

Megatron Core is a composable library with GPU-optimized building blocks for custom training frameworks. It provides transformer building blocks, advanced parallelism strategies (tensor, pipeline, data, expert, and context parallelism), mixed precision support (FP16, BF16, FP8, FP4), and model architectures. Best for framework developers and ML engineers building custom training pipelines.
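
For a sense of how these pieces compose, here is a minimal sketch that assembles a tiny GPT model from Megatron Core building blocks in a single process. It follows the pattern of the Megatron Core quickstart; exact module paths and constructor arguments can shift between releases, and the sizes, port, and vocabulary are illustrative values chosen for this sketch rather than canonical settings.

import os
import torch
from megatron.core import parallel_state
from megatron.core.transformer.transformer_config import TransformerConfig
from megatron.core.models.gpt.gpt_model import GPTModel
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_local_spec

# Single-process setup for illustration; real runs launch one rank per GPU
# (e.g. via torchrun) and use the same initialization calls.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "6000")
torch.distributed.init_process_group(world_size=1, rank=0)

# Model-parallel state: raise these sizes to shard the model across GPUs.
parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=1,
    pipeline_model_parallel_size=1,
)

# A deliberately tiny configuration; production models are far larger.
config = TransformerConfig(
    num_layers=2,
    hidden_size=128,
    num_attention_heads=4,
    use_cpu_initialization=True,
    pipeline_dtype=torch.float32,
)

# Assemble a GPT model from the transformer building blocks.
model = GPTModel(
    config=config,
    transformer_layer_spec=get_gpt_layer_local_spec(),
    vocab_size=1024,
    max_sequence_length=256,
)

The same building blocks are what framework developers wrap with their own data loaders, optimizers, and training loops.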

Megatron Bridge provides bidirectional Hugging Face ↔ Megatron checkpoint conversion with production-ready recipes.

Getting Started

Install from PyPI:

uv pip install megatron-core

Or clone and install from source:

git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
uv pip install -e .

Note: Building from source can use a lot of memory. If the build runs out of memory, limit parallel compilation jobs by setting MAX_JOBS (e.g. MAX_JOBS=4 uv pip install -e .).

For NGC container setup and all installation options, see the Installation Guide.

Latest News

  • [2026/03] Deprecating Python 3.10 support: We're officially dropping Python 3.10 support with the upcoming 0.17.0 release. Downstream applications must raise their minimum supported Python version to 3.12 to stay compatible with MCore.
  • [2026/01] Dynamic Context Parallelism - Up to 1.48x speedup for variable-length sequence training with adaptive CP sizing.
  • [2025/12] Megatron Core development has moved to GitHub! All development and CI now happens in the open. We welcome community contributions.
  • [2025/10] Megatron Dev Branch - early access branch with experimental features.
  • [2025/10] Megatron Bridge - Bidirectional converter for interoperability between Hugging Face and Megatron checkpoints, featuring production-ready recipes for popular models.
  • [2025/08] MoE Q3-Q4 2025 Roadmap - Comprehensive roadmap for MoE features including DeepSeek-V3, Qwen3, advanced parallelism strategies, FP8 optimizations, and Blackwell performance enhancements.
  • [2025/08] GPT-OSS Model - Advanced features including YaRN RoPE scaling, attention sinks, and custom activation functions are being integrated into Megatron Core.
  • [2025/06] Megatron MoE Model Zoo - Best practices and optimized configurations for training DeepSeek-V3, Mixtral, and Qwen3 MoE models with performance benchmarking and checkpoint conversion tools.
  • [2025/05] Megatron Core v0.11.0 brings new capabilities for multi-data center LLM training (blog).
Previous News
  • [2024/07] Megatron Core v0.7 improves scalability and training resiliency and adds support for multimodal training (blog).
  • [2024/06] Megatron Core added support for Mamba-based models. Check out our paper An Empirical Study of Mamba-based Language Models and code example.
  • [2024/01] NVIDIA has released the core capabilities of Megatron-LM into Megatron Core in this repository. Megatron Core expands upon Megatron-LM's GPU-optimized techniques with additional cutting-edge, system-level optimizations, featuring composable and modular APIs.

Project Structure

Megatron-LM/
├── megatron/
│   ├── core/                    # Megatron Core (kernels, parallelism, building blocks)
│   │   ├── models/              # Transformer models
│   │   ├── transformer/         # Transformer building blocks
│   │   ├── tensor_parallel/     # Tensor parallelism
│   │   ├── pipeline_parallel/   # Pipeline parallelism
│   │   ├── distributed/         # Distributed training (FSDP, DDP)
│   │   ├── optimizer/           # Optimizers
│   │   ├── datasets/            # Dataset loaders
│   │   ├── inference/           # Inference engines and server
│   │   └── export/              # Model export (e.g. TensorRT-LLM)
│   ├── training/                # Training scripts
│   ├── legacy/                  # Legacy components
│   ├── post_training/           # Post-training (quantization, distillation, pruning, etc.)
│   └── rl/                      # Reinforcement learning (RLHF, etc.)
├── examples/                    # Ready-to-use training examples
├── tools/                       # Utility tools
├── tests/                       # Comprehensive test suite
└── docs/                        # Documentation

Performance Benchmarking

For our latest performance benchmarking results, please refer to NVIDIA Megatron Bridge Performance Summary.

Our codebase efficiently trains models from 2B to 462B parameters across thousands of GPUs, achieving up to 47% Model FLOP Utilization (MFU) on H100 clusters.
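
MFU here is the ratio of the model FLOPs actually sustained per second to the hardware's peak FLOP rate. The sketch below shows the back-of-the-envelope form of that calculation, assuming the common 6 x parameter-count FLOPs-per-token approximation for dense transformers and a nominal ~989 TFLOP/s peak BF16 throughput per H100; the throughput figure in the example is illustrative and not taken from the benchmark table.

def model_flop_utilization(num_params, tokens_per_sec_per_gpu, peak_flops_per_gpu=989e12):
    # Rough estimate: ~6 * N FLOPs per token for a dense transformer
    # (forward plus backward), ignoring attention-score FLOPs.
    achieved_flops_per_sec = 6 * num_params * tokens_per_sec_per_gpu
    return achieved_flops_per_sec / peak_flops_per_gpu

# Illustrative numbers only: a 175B-parameter model sustaining
# 450 tokens/s per GPU corresponds to roughly 48% MFU.
print(f"{model_flop_utilization(175e9, 450):.1%}")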

Model table

Benchmark Configuration:

  • Vocabulary size: 131,072 tokens
  • Sequence length: 4096 tokens
  • Model scaling: Varied hidden size, attention heads, and layers to achieve target parameter counts
  • Communication optimizations: Fine-grained overlapping with DP (--overlap-grad-reduce, --overlap-param-gather), TP (--tp-comm-overlap), and PP (enabled by default)

Key Results:

  • 6144 H100 GPUs: Successfully benchmarked 462B parameter model training
  • Superlinear scaling: MFU increases from 41% to 47-48% with model size
  • End-to-end measurement: Throughputs include all operations (data loading, optimizer steps, communication, logging)
  • Production ready: Full training pipeline with checkpointing and fault tolerance
  • Note: Performance results measured without training to convergence

Weak Scaling Results

Our weak-scaling results show superlinear scaling (MFU increases from 41% for the smallest model considered to 47-48% for the largest models); this is because larger GEMMs have higher arithmetic intensity and are therefore more efficient to execute.
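
To make the arithmetic-intensity argument concrete, the short sketch below compares FLOPs per byte moved for two GEMM shapes in BF16, counting only the reads of the two input matrices and one write of the output with no cache reuse; the shapes are illustrative and not taken from the benchmarked models. Larger matrices do proportionally more compute per byte of memory traffic, which is what pushes MFU up as the model grows.

def gemm_arithmetic_intensity(m, k, n, bytes_per_element=2):
    # (m x k) @ (k x n) GEMM: 2*m*k*n FLOPs; bytes to read A and B and
    # write C once in BF16, assuming no cache reuse.
    flops = 2 * m * k * n
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved

# Doubling the weight-matrix dimensions raises FLOPs per byte moved:
print(gemm_arithmetic_intensity(4096, 8192, 8192))    # ~2048 FLOPs per byte
print(gemm_arithmetic_intensity(4096, 16384, 16384))  # ~2731 FLOPs per byte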

Weak scaling

Strong Scaling Results

We also strong-scaled the standard GPT-3 model (our version has slightly more than 175 billion parameters due to the larger vocabulary size) from 96 H100 GPUs to 4608 GPUs, using the same batch size of 1152 sequences throughout. Communication becomes more exposed at larger scale, reducing MFU from 47% to 42%.

Strong scaling

Roadmaps

  • MoE Roadmap - DeepSeek-V3, Qwen3, advanced parallelism, FP8 optimizations, and Blackwell enhancements

Resources

Getting Help

Contributing

We ❤️ contributions! Ways to contribute:

  • 🐛 Report bugs - Help us improve reliability
  • 💡 Suggest features - Shape the future of Megatron Core
  • 📝 Improve docs - Make Megatron Core more accessible
  • 🔧 Submit PRs - Contribute code improvements

Contributing Guide

Citation

If you use Megatron in your research or project, we would appreciate it if you use the following citation:

@article{megatron-lm,
  title={Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism},
  author={Shoeybi, Mohammad and Patwary, Mostofa and Puri, Raul and LeGresley, Patrick and Casper, Jared and Catanzaro, Bryan},
  journal={arXiv preprint arXiv:1909.08053},
  year={2019}
}

Project details


Release history

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

megatron_core-0.17.0rc0.tar.gz (1.4 MB)

Uploaded: Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

megatron_core-0.17.0rc0-cp313-cp313-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.7 MB)

Uploaded: CPython 3.13, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64

megatron_core-0.17.0rc0-cp313-cp313-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.7 MB)

Uploaded: CPython 3.13, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64

megatron_core-0.17.0rc0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.7 MB)

Uploaded: CPython 3.12, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64

megatron_core-0.17.0rc0-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.7 MB)

Uploaded: CPython 3.12, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64

megatron_core-0.17.0rc0-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.7 MB)

Uploaded: CPython 3.11, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64

megatron_core-0.17.0rc0-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (1.7 MB)

Uploaded: CPython 3.11, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64

File details

Details for the file megatron_core-0.17.0rc0.tar.gz.

File metadata

  • Download URL: megatron_core-0.17.0rc0.tar.gz
  • Upload date:
  • Size: 1.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for megatron_core-0.17.0rc0.tar.gz
  • SHA256: 637c470c16b8a56e33fbd4453388614463463b5eeb942376da3521a1a9ff1c07
  • MD5: e479d0e4d68062992d2223bcf1e1e806
  • BLAKE2b-256: 791acd9a16a5f781c8ae1c70bb57cd0c36ea9753e5ae0fd25da04ba659093a56


File details

Details for the file megatron_core-0.17.0rc0-cp313-cp313-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for megatron_core-0.17.0rc0-cp313-cp313-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
  • SHA256: dcd3c946975bccc8e8892ee863b9da8ff82bbdc1c4b693e4814622448e8f6b26
  • MD5: 624d3ebae2d3a6a06a2fd99d35f98aea
  • BLAKE2b-256: bba263f449a7f6df4512357b1b70ff6b27ca0846dfb7ec8b83d5bb65c56433d5


File details

Details for the file megatron_core-0.17.0rc0-cp313-cp313-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl.

File metadata

File hashes

Hashes for megatron_core-0.17.0rc0-cp313-cp313-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl
  • SHA256: 822487f1135edfd08d44e1f26aa47416f605db28066a3d4b6dcbd6c209caed1f
  • MD5: 9df404e9bbc08891febee9d00cb180d9
  • BLAKE2b-256: c5460dd7851968da055328ad8177eabc5e50019fa568bfa12971a3f078fd5125


File details

Details for the file megatron_core-0.17.0rc0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for megatron_core-0.17.0rc0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
  • SHA256: 715735bddf6c59ab00bb74e743c3d299147d7216adfbca51dbff322bb94b6c4e
  • MD5: 8e147efde101c258c80b4781372c1c1f
  • BLAKE2b-256: 51e5ec0628b7e9bc9ace79992e047be7eed7ddc0ccdf1a646e9f862bd988e7d8


File details

Details for the file megatron_core-0.17.0rc0-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl.

File metadata

File hashes

Hashes for megatron_core-0.17.0rc0-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl
  • SHA256: 60303abf66a4177c410abf1b4f808eafd2c21dafa07f50edec51679cb7ee90f0
  • MD5: 752c437f3e0ad07625bc20f263176f95
  • BLAKE2b-256: a8c7220394783c8f53eaf0c724e4b5ea7ed79d0e6045d9ddf1fb0f4c1156b360


File details

Details for the file megatron_core-0.17.0rc0-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for megatron_core-0.17.0rc0-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
  • SHA256: 22dbad69235aae2c6a08e3e2aacd4a7fb007ab4ab7d90488621aa2b206d37cdd
  • MD5: 15adbef467d411058d8bf2e813fad792
  • BLAKE2b-256: 422d188804129fc39b43590347664307c604937e3b9db2e78a28bf617adef387


File details

Details for the file megatron_core-0.17.0rc0-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl.

File metadata

File hashes

Hashes for megatron_core-0.17.0rc0-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl
  • SHA256: c4a83f8229cf7cd72101c1105d548753366a053acc68c271eb4063562950f944
  • MD5: 2738bb4b8be6558f349b7eba9b53b974
  • BLAKE2b-256: 79c3d9fe3bf7a5ed32515d4dd5da4a8ce355797badc415906367bb370730caae

