Megatron Core - a library for efficient and scalable training of transformer-based models

Megatron-LM and Megatron Core

GPU-optimized library for training transformer models at scale

About

This repository contains two components: Megatron-LM and Megatron Core.

Megatron-LM is a reference example that includes Megatron Core plus pre-configured training scripts. Best for research teams, learning distributed training, and quick experimentation.

Megatron Core is a composable library with GPU-optimized building blocks for custom training frameworks. It provides transformer building blocks, advanced parallelism strategies (TP, PP, DP, EP, CP), mixed precision support (FP16, BF16, FP8, FP4), and model architectures. Best for framework developers and ML engineers building custom training pipelines.
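The parallelism dimensions listed above have to divide the available GPUs between them. The following is a minimal sketch of that arithmetic, assuming the common convention that the data-parallel size is whatever remains after the tensor, pipeline, and context parallel sizes are fixed. The function name `data_parallel_size` is hypothetical and not part of the Megatron Core API, and expert parallelism (EP) is deliberately left out because it applies only to expert layers and is not modeled here.

```python
def data_parallel_size(world_size: int, tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    """Return the data-parallel size left after TP, PP, and CP are assigned.

    Hypothetical helper for illustration only; not a Megatron Core function.
    """
    model_parallel = tp * pp * cp
    if model_parallel <= 0:
        raise ValueError("parallel sizes must be positive")
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by tp*pp*cp = {model_parallel}"
        )
    return world_size // model_parallel


if __name__ == "__main__":
    # 64 GPUs with TP=8 and PP=4 leave 2 data-parallel replicas.
    print(data_parallel_size(64, tp=8, pp=4))
```

A configuration whose sizes do not divide the GPU count evenly raises an error rather than silently leaving GPUs idle.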

Megatron Bridge provides bidirectional Hugging Face ↔ Megatron checkpoint conversion with production-ready recipes.

Getting Started

Install from PyPI:

uv pip install megatron-core

Or clone and install from source:

git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
uv pip install -e .

Note: Building from source can use a lot of memory. If the build runs out of memory, limit parallel compilation jobs by setting MAX_JOBS (e.g. MAX_JOBS=4 uv pip install -e .).

For NGC container setup and all installation options, see the Installation Guide.
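After installing, it can be useful to confirm which version the environment actually picked up. The sketch below uses only the standard library; the helper name `installed_version` is hypothetical, and it assumes the distribution is registered under the name `megatron-core`, as in the install command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "megatron-core") -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(found if found is not None else "megatron-core is not installed")
```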

Latest News

  • [2026/03] Deprecating Python 3.10 support: We're officially dropping Python 3.10 support with the upcoming 0.17.0 release. Downstream applications must raise their lower boundary to 3.12 to stay compatible with MCore.
  • [2026/01] Dynamic Context Parallelism - Up to 1.48x speedup for variable-length sequence training with adaptive CP sizing.
  • [2025/12] Megatron Core development has moved to GitHub! All development and CI now happen in the open. We welcome community contributions.
  • [2025/10] Megatron Dev Branch - early access branch with experimental features.
  • [2025/10] Megatron Bridge - Bidirectional converter for interoperability between Hugging Face and Megatron checkpoints, featuring production-ready recipes for popular models.
  • [2025/08] MoE Q3-Q4 2025 Roadmap - Comprehensive roadmap for MoE features including DeepSeek-V3, Qwen3, advanced parallelism strategies, FP8 optimizations, and Blackwell performance enhancements.
  • [2025/08] GPT-OSS Model - Advanced features including YaRN RoPE scaling, attention sinks, and custom activation functions are being integrated into Megatron Core.
  • [2025/06] Megatron MoE Model Zoo - Best practices and optimized configurations for training DeepSeek-V3, Mixtral, and Qwen3 MoE models with performance benchmarking and checkpoint conversion tools.
  • [2025/05] Megatron Core v0.11.0 brings new capabilities for multi-data center LLM training (blog).
Previous News
  • [2024/07] Megatron Core v0.7 improves scalability and training resiliency and adds support for multimodal training (blog).
  • [2024/06] Megatron Core added support for Mamba-based models. Check out our paper An Empirical Study of Mamba-based Language Models and code example.
  • [2024/01 Announcement] NVIDIA has released the core capabilities in Megatron-LM into Megatron Core in this repository. Megatron Core expands upon Megatron-LM's GPU-optimized techniques with more cutting-edge innovations on system-level optimizations, featuring composable and modular APIs.

Project Structure

Megatron-LM/
├── megatron/
│   ├── core/                    # Megatron Core (kernels, parallelism, building blocks)
│   │   ├── models/              # Transformer models
│   │   ├── transformer/         # Transformer building blocks
│   │   ├── tensor_parallel/     # Tensor parallelism
│   │   ├── pipeline_parallel/   # Pipeline parallelism
│   │   ├── distributed/         # Distributed training (FSDP, DDP)
│   │   ├── optimizer/           # Optimizers
│   │   ├── datasets/            # Dataset loaders
│   │   ├── inference/           # Inference engines and server
│   │   └── export/              # Model export (e.g. TensorRT-LLM)
│   ├── training/                # Training scripts
│   ├── legacy/                  # Legacy components
│   ├── post_training/           # Post-training (quantization, distillation, pruning, etc.)
│   └── rl/                      # Reinforcement learning (RLHF, etc.)
├── examples/                    # Ready-to-use training examples
├── tools/                       # Utility tools
├── tests/                       # Comprehensive test suite
└── docs/                        # Documentation

Performance Benchmarking

For our latest performance benchmarking results, please refer to NVIDIA Megatron Bridge Performance Summary.

Our codebase efficiently trains models from 2B to 462B parameters across thousands of GPUs, achieving up to 47% Model FLOP Utilization (MFU) on H100 clusters.
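For readers unfamiliar with the metric, MFU is the ratio of the model FLOP/s a run achieves to the theoretical peak of the hardware. The sketch below is a rough estimate under two assumptions that do not come from this document: training cost is approximated as 6 FLOPs per parameter per token (which ignores attention FLOPs), and the H100 dense BF16 peak is taken as roughly 989 TFLOP/s. The function name and the example inputs are hypothetical and are not the benchmark's measurements.

```python
# Assumption: approximate dense BF16 peak of one H100 GPU; check your hardware's datasheet.
H100_BF16_PEAK_FLOPS = 989e12


def model_flops_utilization(
    tokens_per_sec: float,
    num_params: float,
    num_gpus: int,
    peak_flops_per_gpu: float = H100_BF16_PEAK_FLOPS,
) -> float:
    """Estimate MFU with the 6 * params * tokens approximation for training FLOPs."""
    if num_gpus <= 0 or peak_flops_per_gpu <= 0:
        raise ValueError("num_gpus and peak_flops_per_gpu must be positive")
    achieved_flops_per_sec = 6.0 * num_params * tokens_per_sec
    return achieved_flops_per_sec / (num_gpus * peak_flops_per_gpu)


if __name__ == "__main__":
    # Hypothetical inputs chosen for illustration only.
    estimate = model_flops_utilization(tokens_per_sec=4.0e5, num_params=7e9, num_gpus=64)
    print(f"Estimated MFU: {estimate:.1%}")
```

Because the 6-FLOPs-per-parameter rule omits attention, this estimate tends to understate the FLOP count for long sequences; a published MFU figure may use a more detailed formula.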

[Figure: Model table]

Benchmark Configuration:

  • Vocabulary size: 131,072 tokens
  • Sequence length: 4096 tokens
  • Model scaling: Varied hidden size, attention heads, and layers to achieve target parameter counts
  • Communication optimizations: Fine-grained overlapping with DP (--overlap-grad-reduce, --overlap-param-gather), TP (--tp-comm-overlap), and PP (enabled by default)
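The communication-overlap flags named in the configuration above can be collected programmatically when assembling a launch command. This is a minimal sketch: the helper `comm_overlap_flags` is hypothetical, it emits only the three flags quoted in the list, and it does not check any prerequisites those flags may have in a given release.

```python
from typing import List


def comm_overlap_flags(dp_overlap: bool = True, tp_overlap: bool = True) -> List[str]:
    """Build the list of overlap flags quoted in the benchmark configuration."""
    flags: List[str] = []
    if dp_overlap:
        # Data-parallel overlap: gradient reduction and parameter gathering.
        flags += ["--overlap-grad-reduce", "--overlap-param-gather"]
    if tp_overlap:
        # Tensor-parallel communication overlap.
        flags.append("--tp-comm-overlap")
    return flags


if __name__ == "__main__":
    print(" ".join(comm_overlap_flags()))
```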

Key Results:

  • 6144 H100 GPUs: Successfully benchmarked 462B parameter model training
  • Superlinear scaling: MFU increases from 41% to 47-48% with model size
  • End-to-end measurement: Throughputs include all operations (data loading, optimizer steps, communication, logging)
  • Production ready: Full training pipeline with checkpointing and fault tolerance
  • Note: Performance results measured without training to convergence

Weak Scaling Results

Our weak scaled results show superlinear scaling (MFU increases from 41% for the smallest model considered to 47-48% for the largest models); this is because larger GEMMs have higher arithmetic intensity and are consequently more efficient to execute.
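The arithmetic-intensity argument can be made concrete with a small calculation. The sketch below assumes an idealized GEMM of shape (m x k) times (k x n) in which each matrix is read or written exactly once, with 2 bytes per element as in BF16. Real kernels tile and re-read data, so these numbers are illustrative upper bounds rather than measurements.

```python
def gemm_arithmetic_intensity(m: int, n: int, k: int, bytes_per_element: int = 2) -> float:
    """Return FLOPs per byte for an idealized (m x k) @ (k x n) matrix multiply."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)  # read A, read B, write C
    return flops / bytes_moved


if __name__ == "__main__":
    for size in (1024, 4096, 8192):
        intensity = gemm_arithmetic_intensity(size, size, size)
        print(f"{size}: {intensity:.1f} FLOPs/byte")
```

For square matrices the intensity in this model grows linearly with the dimension (it equals n/3 at 2 bytes per element), which is why larger GEMMs sit further from the memory-bandwidth limit.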

[Figure: Weak scaling]

Strong Scaling Results

We also strong scaled the standard GPT-3 model (our version has slightly more than 175 billion parameters due to larger vocabulary size) from 96 H100 GPUs to 4608 GPUs, using the same batch size of 1152 sequences throughout. Communication becomes more exposed at larger scale, leading to a reduction in MFU from 47% to 42%.
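A quick calculation shows why communication becomes more exposed: with the batch size fixed, each GPU has far less work per step at the larger scale. The sketch below assumes the 4096-token sequence length from the benchmark configuration also applies to this experiment, and it simply averages tokens over all GPUs without modeling how they are split across parallelism dimensions. The function name is hypothetical.

```python
def tokens_per_gpu(global_batch_sequences: int, seq_length: int, num_gpus: int) -> float:
    """Average number of tokens each GPU is responsible for in one training step."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return global_batch_sequences * seq_length / num_gpus


if __name__ == "__main__":
    # Batch size is from the text; the sequence length is an assumption (see lead-in).
    for gpus in (96, 4608):
        print(gpus, tokens_per_gpu(1152, 4096, gpus))
```

Under these assumptions the per-GPU share drops from 49,152 tokens on 96 GPUs to 1,024 tokens on 4608 GPUs, a 48x reduction in compute per step against which communication has to be hidden.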

[Figure: Strong scaling]

Roadmaps

  • MoE Roadmap - DeepSeek-V3, Qwen3, advanced parallelism, FP8 optimizations, and Blackwell enhancements

Resources

Getting Help

Contributing

We ❤️ contributions! Ways to contribute:

  • 🐛 Report bugs - Help us improve reliability
  • 💡 Suggest features - Shape the future of Megatron Core
  • 📝 Improve docs - Make Megatron Core more accessible
  • 🔧 Submit PRs - Contribute code improvements

Contributing Guide

Citation

If you use Megatron in your research or project, please cite it as follows:

@article{megatron-lm,
  title={Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism},
  author={Shoeybi, Mohammad and Patwary, Mostofa and Puri, Raul and LeGresley, Patrick and Casper, Jared and Catanzaro, Bryan},
  journal={arXiv preprint arXiv:1909.08053},
  year={2019}
}
