
Share
Exploring the technical intricacies behind IBM and Meta's achievement in boosting training throughput, this article delves into how FSDP enabled near-linear scaling for a massive 7B parameter Llama model across hundreds of GPUs.
In this article, we dive into the technical details of how IBM and Meta's PyTorch teams maximized training throughput using Fully Sharded Data Parallel (FSDP) to train a 7 billion parameter Llama model. The key highlights include achieving a remarkable training speed of 3,700 tokens per second per GPU, or 40 billion tokens per day on 128 A100 GPUs, with near-linear scaling up to 512 GPUs. This setup translates to a model FLOPS utilization (MFU) and hardware FLOPS utilization (HFU) of 57%.
Fully Sharded Data Parallel (FSDP) is a PyTorch technique that optimizes memory usage and parallelism across multiple GPUs. It allows for efficient training of large models by sharding both the model parameters and gradients, ensuring that each GPU only holds a fraction of the total model state.
To achieve the high throughput of 3,700 tokens per second per GPU, several techniques were employed:
The Llama 2 7B model was pre-trained on 2 trillion tokens. The training code and methodology are available on GitHub, providing a comprehensive resource for practitioners looking to replicate or adapt these techniques.

The researchers shared specific configuration settings that work well for Llama models ranging from 7B to 70B parameters, both on A100 and H100 GPUs. These configurations include:
The techniques and optimizations described in this article are not only applicable to the Llama 2 7B model but can be extended to other large language models. The PyTorch FSDP framework, combined with these advanced techniques, provides a robust foundation for training large-scale models efficiently.
One key benefit of using PyTorch's native pathways is the ability to seamlessly train on multiple hardware backends. This flexibility is demonstrated by the recent end-to-end stack released by AllenAI through OLMo, which leverages PyTorch FSDP for training on both AMD and NVIDIA GPUs.
The collaboration between IBM and Meta's PyTorch teams has resulted in significant advancements in training large language models efficiently. By leveraging FSDP, SDPA Flash Attention, computation-communication overlap, and selective activation checkpointing, they achieved a remarkable throughput of 3,700 tokens per second per GPU. This setup not only scales well but also provides a solid foundation for future research and development in the field of large
Tags
Original Sources
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
More from The Engineer →This Week's Edition
14 March 2024
88 articles
Related Articles
Related Articles
More Stories