Boosting Training Throughput with PyTorch FSDP: A 7B Llama Model Case Study

The Engineer

14 Mar 2024 · 4 min read

How IBM and Meta's PyTorch teams used FSDP to reach near-linear scaling when training a 7B-parameter Llama model across hundreds of GPUs.

This article covers how IBM and Meta's PyTorch teams maximized training throughput using Fully Sharded Data Parallel (FSDP) to train a 7 billion parameter Llama model. The headline result is a training speed of 3,700 tokens per second per GPU, or about 40 billion tokens per day on 128 A100 GPUs, with near-linear scaling up to 512 GPUs. That throughput corresponds to a model FLOPS utilization (MFU) and hardware FLOPS utilization (HFU) of 57%.
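
The headline figures can be sanity-checked with a few lines of arithmetic. The sketch below is a back-of-the-envelope check, not the teams' own accounting. It assumes the common 6 x parameters FLOPs-per-token approximation for one forward and backward pass, a round 7e9 parameter count, and the A100's 312 TFLOPS dense BF16 peak. The approximation leaves out attention FLOPs, so it should be read as a rough estimate and is not expected to match the reported 57% exactly.

```python
SECONDS_PER_DAY = 24 * 60 * 60
A100_PEAK_BF16_FLOPS = 312e12  # dense BF16 peak of one A100 (assumed precision)


def tokens_per_day(tokens_per_sec_per_gpu, num_gpus):
    """Aggregate tokens processed per day across all GPUs."""
    return tokens_per_sec_per_gpu * num_gpus * SECONDS_PER_DAY


def rough_mfu(tokens_per_sec_per_gpu, num_params,
              peak_flops=A100_PEAK_BF16_FLOPS):
    """Rough MFU using the 6 * N FLOPs-per-token approximation (assumption)."""
    achieved_flops = 6 * num_params * tokens_per_sec_per_gpu
    return achieved_flops / peak_flops


if __name__ == "__main__":
    print(f"{tokens_per_day(3700, 128) / 1e9:.1f}B tokens/day")  # 40.9B tokens/day
    print(f"rough MFU: {rough_mfu(3700, 7e9):.2f}")              # rough MFU: 0.50
```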

Technical Breakdown

1. FSDP and Its Benefits

Fully Sharded Data Parallel (FSDP) is PyTorch's approach to data-parallel training that shards model state across GPUs. It shards the model parameters, gradients, and optimizer states, so each GPU holds only a fraction of the total model state and gathers the full parameters for a layer only when that layer is being computed.

Scalability: FSDP demonstrates near-linear scaling from 128 to 512 GPUs, making it highly effective for large-scale training.
Memory Efficiency: By sharding the model, FSDP reduces the memory footprint on each GPU, allowing for larger models to be trained with fewer resources.
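
To make the memory argument concrete, here is a small estimate of per-GPU model state with and without sharding. It is a sketch under stated assumptions, not a measurement from the study: it assumes mixed-precision Adam at roughly 16 bytes per parameter (2 for BF16 weights, 2 for BF16 gradients, 12 for FP32 master weights and the two Adam moments), ignores activations and buffers, and assumes the state is sharded evenly across all GPUs.

```python
# Assumed bytes per parameter for mixed-precision Adam training.
BYTES_PER_PARAM = {
    "weights_bf16": 2,
    "grads_bf16": 2,
    "optimizer_fp32": 12,  # master weights + Adam momentum + Adam variance
}


def model_state_gib(num_params, num_shards=1):
    """Per-GPU model state in GiB when sharded evenly over num_shards GPUs."""
    total_bytes = num_params * sum(BYTES_PER_PARAM.values())
    return total_bytes / num_shards / 2**30


if __name__ == "__main__":
    print(f"unsharded: {model_state_gib(7e9):.1f} GiB")               # unsharded: 104.3 GiB
    print(f"sharded over 128: {model_state_gib(7e9, 128):.2f} GiB")   # sharded over 128: 0.81 GiB
```

Under these assumptions the unsharded state alone is larger than the memory of a single A100 (40 GB or 80 GB), while the sharded state leaves most of each GPU free for activations.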

2. Key Techniques for High Throughput

To achieve the high throughput of 3,700 tokens per second per GPU, several techniques were employed:

SDPA Flash Attention: PyTorch's scaled dot-product attention (SDPA) with the FlashAttention kernel fuses the attention computation into a single kernel, which avoids writing the full attention matrix to GPU memory and cuts memory traffic.
Computation and Communication Overlap: FSDP overlaps the communication for sharded parameters and gradients with computation, so GPUs spend less time idle waiting on the network.
Selective Activation Checkpointing: A novel mechanism introduced by IBM researchers, this technique recomputes only a selected subset of activations during the backward pass instead of storing them, trading GPU memory for extra computation. It yields a 10% performance boost over out-of-the-box FSDP.
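
One way to picture the memory-versus-compute trade-off behind selective activation checkpointing is to checkpoint only a fraction of the transformer blocks. The sketch below is illustrative only: the every-n-th-block selection rule, the helper names, and the 32-block model are assumptions, not details taken from the IBM implementation. The overhead estimate uses the usual rule of thumb that a forward pass costs about 2N FLOPs per token and a backward pass about 4N, so recomputing the forward pass of every block adds roughly one third to a step.

```python
def blocks_to_checkpoint(num_blocks, every_n):
    """Indices of blocks whose activations are recomputed (hypothetical rule:
    every n-th block, starting with block 0)."""
    if every_n < 1:
        raise ValueError("every_n must be >= 1")
    return [i for i in range(num_blocks) if i % every_n == 0]


def recompute_overhead(num_blocks, every_n):
    """Extra compute as a fraction of a normal (no-checkpointing) step."""
    fraction = len(blocks_to_checkpoint(num_blocks, every_n)) / num_blocks
    forward_share = 2 / 6  # forward ~2N FLOPs out of ~6N per token
    return fraction * forward_share


if __name__ == "__main__":
    for every_n in (1, 2, 4):
        chosen = blocks_to_checkpoint(32, every_n)
        print(every_n, len(chosen), f"{recompute_overhead(32, every_n):.3f}")
    # 1 32 0.333
    # 2 16 0.167
    # 4 8 0.083
```

Checkpointing fewer blocks keeps more activations in memory but recomputes less, which is the knob a selective scheme tunes against the available GPU memory.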

3. Model and Training Details

A model with the Llama 2 7B architecture was pre-trained on 2 trillion tokens. The training code and methodology are available on GitHub for practitioners who want to replicate or adapt these techniques.

Model Quality: The trained model demonstrates comparable performance to the original Llama 2 on various academic benchmarks.
Hardware Configuration: The setup used 128 A100 GPUs and reached a throughput of 3,700 tokens per second per GPU. With near-linear scaling, 512 GPUs would allow training a 7B model on 2 trillion tokens in just under two weeks.
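
The two-week figure follows from the per-GPU throughput if scaling stays linear. A quick check, assuming the 3,700 tokens per second per GPU holds unchanged at 512 GPUs and that the target is the 2 trillion tokens mentioned above:

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_train(total_tokens, tokens_per_sec_per_gpu, num_gpus):
    """Wall-clock days to process total_tokens, assuming perfectly linear scaling."""
    tokens_per_sec = tokens_per_sec_per_gpu * num_gpus
    return total_tokens / tokens_per_sec / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"{days_to_train(2e12, 3700, 512):.1f} days on 512 GPUs")  # 12.2 days on 512 GPUs
    print(f"{days_to_train(2e12, 3700, 128):.1f} days on 128 GPUs")  # 48.9 days on 128 GPUs
```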

4. Configuration Knobs and Best Practices

The researchers shared specific configuration settings that work well for Llama models ranging from 7B to 70B parameters, both on A100 and H100 GPUs. These configurations include:

Optimal Batch Sizes: Tuning batch sizes to balance memory usage and computational efficiency.
Gradient Accumulation: Using gradient accumulation to handle larger effective batch sizes without exceeding GPU memory limits.
Mixed Precision Training: Leveraging mixed precision training to speed up computations while maintaining model accuracy.
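
Of these knobs, gradient accumulation is the easiest to show in isolation. The following is a minimal single-process sketch, not the configuration used in the study: the tiny linear model, the random data, and `accum_steps = 4` are placeholders chosen so the example runs on a CPU. It checks that accumulating gradients over equal-sized micro-batches, with each loss divided by the number of steps, reproduces the gradient of one large batch.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(16, 1)
loss_fn = nn.MSELoss()
inputs, targets = torch.randn(32, 16), torch.randn(32, 1)

accum_steps = 4  # hypothetical value: 4 micro-batches of 8 samples each
model.zero_grad()
for x, y in zip(inputs.chunk(accum_steps), targets.chunk(accum_steps)):
    # Divide by accum_steps so the summed gradients equal the full-batch mean.
    loss = loss_fn(model(x), y) / accum_steps
    loss.backward()  # gradients accumulate in .grad across micro-batches
accumulated = model.weight.grad.clone()

# Reference: a single backward pass over the full batch of 32 samples.
model.zero_grad()
loss_fn(model(inputs), targets).backward()
full_batch = model.weight.grad.clone()

print(torch.allclose(accumulated, full_batch, atol=1e-6))  # True
```

In a real training loop the optimizer step and `zero_grad()` would run once per `accum_steps` micro-batches, so only one micro-batch of activations is held in memory at a time.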

Practical Implications

The techniques and optimizations described in this article are not only applicable to the Llama 2 7B model but can be extended to other large language models. The PyTorch FSDP framework, combined with these advanced techniques, provides a robust foundation for training large-scale models efficiently.

One key benefit of using PyTorch's native pathways is the ability to seamlessly train on multiple hardware backends. This flexibility is demonstrated by the recent end-to-end stack released by AllenAI through OLMo, which leverages PyTorch FSDP for training on both AMD and NVIDIA GPUs.

Conclusion

The collaboration between IBM and Meta's PyTorch teams has resulted in significant advancements in training large language models efficiently. By leveraging FSDP, SDPA Flash Attention, computation-communication overlap, and selective activation checkpointing, they achieved a remarkable throughput of 3,700 tokens per second per GPU. This setup not only scales well but also provides a solid foundation for future research and development in the field of large