Megatron-LM and Megatron Core

GPU-optimized library for training transformer models at scale


About

Megatron-Core (MCore): Composable library with GPU-optimized building blocks for custom training frameworks. You can install this library using pip or use it within the Megatron-LM GitHub repository.

Megatron-LM: Reference implementation that includes end-to-end examples utilizing Megatron Core.

Megatron-Bridge: Training library with bidirectional Hugging Face ↔ Megatron checkpoint conversion, flexible training loops, and example model training recipes.

For more information, refer to Megatron Bridge.

Quick Start

Install Megatron Core with pip:

  1. Install the latest release with required dependencies:

    pip install --no-build-isolation megatron-core[mlm,dev]
  2. Clone the repository for the examples, or to install from source:

    git clone https://github.com/NVIDIA/Megatron-LM.git
    cd Megatron-LM
    pip install --no-build-isolation .[mlm,dev]
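
Once installed, Megatron Core's building blocks can be composed directly from Python. The snippet below is a minimal sketch loosely following the Megatron Core quickstart; it assumes a GPU machine and a single process launched with torchrun, and constructor arguments or module paths may differ slightly between releases:

    import os
    import torch
    from megatron.core import parallel_state
    from megatron.core.transformer.transformer_config import TransformerConfig
    from megatron.core.models.gpt.gpt_model import GPTModel
    from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_local_spec

    # Megatron Core expects torch.distributed and its model-parallel groups to be
    # initialized before a model is built (run with: torchrun --nproc_per_node=1 demo.py).
    torch.distributed.init_process_group(world_size=int(os.environ["WORLD_SIZE"]),
                                         rank=int(os.environ["RANK"]))
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    parallel_state.initialize_model_parallel(tensor_model_parallel_size=1,
                                             pipeline_model_parallel_size=1)

    # Deliberately tiny configuration; real runs use model-scale values.
    config = TransformerConfig(
        num_layers=2,
        hidden_size=64,
        num_attention_heads=4,
        use_cpu_initialization=True,
        pipeline_dtype=torch.float32,
    )

    # A small GPT model assembled from Megatron Core building blocks.
    model = GPTModel(
        config=config,
        transformer_layer_spec=get_gpt_layer_local_spec(),
        vocab_size=128,
        max_sequence_length=64,
    )
    print(sum(p.numel() for p in model.parameters()), "parameters")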

Latest News

  • [2025/12] 🎉 Megatron Core development has moved to GitHub! All development and CI now happens in the open. We welcome community contributions.
  • [2025/10] Megatron Dev Branch - early access branch with experimental features.
  • [2025/10] Megatron Bridge - Bidirectional converter for interoperability between Hugging Face and Megatron checkpoints, featuring production-ready recipes for popular models.
  • [2025/08] MoE Q3-Q4 2025 Roadmap - Comprehensive roadmap for MoE features including DeepSeek-V3, Qwen3, advanced parallelism strategies, FP8 optimizations, and Blackwell performance enhancements.
  • [2025/08] GPT-OSS Model - Advanced features including YaRN RoPE scaling, attention sinks, and custom activation functions are being integrated into Megatron Core.
  • [2025/06] Megatron MoE Model Zoo - Best practices and optimized configurations for training DeepSeek-V3, Mixtral, and Qwen3 MoE models with performance benchmarking and checkpoint conversion tools.
  • [2025/05] Megatron Core v0.11.0 brings new capabilities for multi-data center LLM training (blog).
Previous News
  • [2024/07] Megatron Core v0.7 improves scalability and training resiliency and adds support for multimodal training (blog).
  • [2024/06] Megatron Core added support for Mamba-based models. Check out our paper An Empirical Study of Mamba-based Language Models and the code example.
  • [2024/01] NVIDIA released the core capabilities of Megatron-LM as Megatron Core in this repository. Megatron Core expands upon Megatron-LM's GPU-optimized techniques with further system-level optimizations and composable, modular APIs. Explore the Megatron Core intro for more details.

Project Structure

Megatron-LM/
├── megatron/
│   ├── core/                    # Megatron Core (kernels, parallelism, building blocks)
│   │   ├── models/              # Transformer models
│   │   ├── transformer/         # Transformer building blocks
│   │   ├── tensor_parallel/     # Tensor parallelism
│   │   ├── pipeline_parallel/   # Pipeline parallelism
│   │   ├── distributed/         # Distributed training (FSDP, DDP)
│   │   ├── optimizer/           # Optimizers
│   │   ├── datasets/            # Dataset loaders
│   │   ├── inference/           # Inference engines
│   │   └── export/              # Model export (e.g. TensorRT-LLM)
│   ├── training/                # Training scripts
│   ├── inference/               # Inference server
│   ├── legacy/                  # Legacy components
│   └── post_training/           # Post-training (RLHF, etc.)
├── examples/                    # Ready-to-use training examples
├── tools/                       # Utility tools
├── tests/                       # Comprehensive test suite
└── docs/                        # Documentation
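
The layout under megatron/core/ mirrors the Python package structure, so each subpackage can be imported on its own. A few representative imports are shown below as a sketch; symbol locations reflect recent releases and can move over time:

    # Parallel state and layer-level building blocks
    from megatron.core import parallel_state
    from megatron.core.tensor_parallel import ColumnParallelLinear, RowParallelLinear
    from megatron.core.pipeline_parallel import get_forward_backward_func

    # Distributed wrapper, optimizer setup, and model/config classes
    from megatron.core.distributed import DistributedDataParallel
    from megatron.core.optimizer import OptimizerConfig, get_megatron_optimizer
    from megatron.core.transformer.transformer_config import TransformerConfig
    from megatron.core.models.gpt import GPTModel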

Performance Benchmarking

For our latest performance benchmarking results, please refer to NVIDIA NeMo Framework Performance Summary.

Our codebase efficiently trains models from 2B to 462B parameters across thousands of GPUs, achieving up to 47% Model FLOP Utilization (MFU) on H100 clusters.
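
Here MFU follows the standard accounting, sketched below rather than taken from the benchmark harness itself: the model FLOPs required per iteration divided by what the hardware could theoretically deliver in the same wall-clock time.

$$
\mathrm{MFU} = \frac{F_{\text{iter}} / t_{\text{iter}}}{n_{\text{GPU}} \cdot P_{\text{peak}}},
\qquad F_{\text{iter}} \approx 6\,N\,B\,s
$$

Here N is the parameter count, B the global batch size in sequences, s the sequence length, t_iter the measured iteration time, and P_peak the peak throughput of one GPU; the factor 6 counts the matrix-multiply FLOPs of one forward plus backward pass per token and ignores attention-score terms.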

(Table: benchmarked model configurations)

Benchmark Configuration:

  • Vocabulary size: 131,072 tokens
  • Sequence length: 4096 tokens
  • Model scaling: Varied hidden size, attention heads, and layers to achieve target parameter counts
  • Communication optimizations: Fine-grained overlapping with DP (--overlap-grad-reduce, --overlap-param-gather), TP (--tp-comm-overlap), and PP (enabled by default)

Key Results:

  • 6144 H100 GPUs: Successfully benchmarked 462B parameter model training
  • Superlinear scaling: MFU increases from 41% to 47-48% with model size
  • End-to-end measurement: Throughputs include all operations (data loading, optimizer steps, communication, logging)
  • Production ready: Full training pipeline with checkpointing and fault tolerance
  • Note: Performance results measured without training to convergence

Weak Scaling Results

Our weak-scaling results show superlinear scaling (MFU increases from 41% for the smallest model considered to 47-48% for the largest models); this is because larger GEMMs have higher arithmetic intensity and are consequently more efficient to execute.
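
To make the arithmetic-intensity argument concrete (a standard back-of-the-envelope count, not a measurement from these runs): a half-precision GEMM multiplying an m×k matrix by a k×n matrix performs 2mkn FLOPs while reading and writing roughly 2(mk + kn + mn) bytes, so

$$
\text{arithmetic intensity} \approx \frac{2\,mkn}{2\,(mk + kn + mn)} = \frac{mkn}{mk + kn + mn}
$$

which grows with m, k, and n; the larger hidden sizes of bigger models therefore push their GEMMs further into the compute-bound regime where peak GPU throughput can be approached.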

(Figure: weak scaling results)

Strong Scaling Results

We also performed strong scaling of the standard GPT-3 model (our version has slightly more than 175 billion parameters due to the larger vocabulary size) from 96 H100 GPUs to 4608 GPUs, using the same batch size of 1152 sequences throughout. Communication becomes more exposed at larger scale, leading to a reduction in MFU from 47% to 42%.

(Figure: strong scaling results)

Resources

Getting Help

  • 📖 Documentation - Official documentation
  • 🐛 Issues - Bug reports and feature requests

Contributing

We ❤️ contributions! Ways to contribute:

  • 🐛 Report bugs - Help us improve reliability
  • 💡 Suggest features - Shape the future of Megatron Core
  • 📝 Improve docs - Make Megatron Core more accessible
  • 🔧 Submit PRs - Contribute code improvements

→ Contributing Guide

Citation

If you use Megatron in your research or project, please cite the following paper:

@article{megatron-lm,
  title={Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism},
  author={Shoeybi, Mohammad and Patwary, Mostofa and Puri, Raul and LeGresley, Patrick and Casper, Jared and Catanzaro, Bryan},
  journal={arXiv preprint arXiv:1909.08053},
  year={2019}
}