
Dive into how distributed training optimizes large-scale deep learning models by breaking down complex computations across multiple nodes, speeding up training without sacrificing accuracy.
Deep learning computations are often represented as dataflow graphs in popular machine learning frameworks. In these graphs, edges represent multi-dimensional tensors, and nodes correspond to computational operators, such as matrix multiplication, that transform input tensors into outputs. A single iteration of deep learning (DL) model training involves a forward pass over a batch of data, computing a loss, and then a backward pass that computes the gradients used to update the model weights. This process is repeated until the loss converges or stops improving; for deep networks, reaching a true global minimum is generally not guaranteed.
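To make the iteration concrete, here is a minimal sketch in plain Python. It assumes a hypothetical one-parameter model `y_hat = w * x` with a mean squared error loss; the function name `train_step`, the learning rate, and the toy data are illustrative choices, not something a real framework prescribes.

```python
# Toy model: y_hat = w * x, trained with mean squared error (MSE).
def train_step(w, batch, lr=0.1):
    """Run one iteration: forward pass, loss, backward pass, weight update."""
    n = len(batch)
    # Forward pass: compute predictions for every (x, y) pair in the batch.
    preds = [w * x for x, _ in batch]
    # Loss: mean squared error between predictions and targets.
    loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, batch)) / n
    # Backward pass: analytic gradient d(loss)/dw for the MSE loss above.
    grad = sum(2 * (p - y) * x for p, (x, y) in zip(preds, batch)) / n
    # Update: one plain gradient descent step.
    return w - lr * grad, loss


batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # targets follow y = 2x
w = 0.0
losses = []
for _ in range(50):
    w, loss = train_step(w, batch)
    losses.append(loss)

print(f"learned w = {w:.4f}, final loss = {losses[-1]:.6f}")
```

In a real framework the backward pass is derived automatically from the dataflow graph rather than written by hand, but the loop has the same shape.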
Typically, developers define the structure of these dataflow graphs, and an execution engine optimizes and runs them on GPUs. However, with the growing size of datasets and the complexity of models, distributed training has become essential to handle these demands efficiently.
The AI landscape is evolving towards larger models, which generally offer better performance but at a higher computational cost. The table below illustrates the size of some popular models and their training times on a single Nvidia A100 GPU:
| Model Name | Size (Parameters) | Training Time (on Nvidia A100) |
| --- | --- | --- |
| ResNet-101 | 45M | 44 hours |
| BERT-Base | 108M | 84 hours |
| GPT-3 175B | 175B | 3,100,000 hours |
Training the GPT-3 175B model on a single GPU would take roughly 355 years, which is plainly impractical. This makes distributed training a necessity, for reasons that include productivity, time to market, and cost efficiency.
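The arithmetic behind that figure can be checked in a few lines. The sketch below converts the quoted 3,100,000 GPU-hours into years, and then into days on a hypothetical 1,024-GPU cluster. The cluster size and the assumption of perfect linear scaling are illustrative only; real clusters lose some efficiency to communication.

```python
HOURS_PER_YEAR = 24 * 365  # ignores leap years


def gpu_hours_to_years(gpu_hours, num_gpus=1):
    """Wall-clock years, ASSUMING perfect linear scaling across num_gpus."""
    return gpu_hours / (num_gpus * HOURS_PER_YEAR)


single_gpu_years = gpu_hours_to_years(3_100_000)
ideal_1024_gpu_days = gpu_hours_to_years(3_100_000, num_gpus=1024) * 365

print(f"single GPU: about {single_gpu_years:.0f} years")
print(f"1,024 GPUs (ideal scaling): about {ideal_1024_gpu_days:.0f} days")
```

The single-GPU result lands at roughly 354 years, which matches the article's 355 once you allow for the rounded hour count in the table.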
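There are two primary types of parallelism that can be leveraged to speed up the training process: data parallelism, where each device holds a full copy of the model and processes a different slice of the data, and model parallelism, where the model itself is split across devices.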
In this article, we will focus on pipeline parallelism, an efficient method for training large models.
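As a rough illustration of the idea, pipeline parallelism places consecutive groups of layers (stages) on different devices and streams micro-batches through them, so that several stages are busy at once. The sketch below only models a simplified forward-only schedule; the function name and the rule "stage `s` handles micro-batch `m` at step `s + m`" are assumptions made for illustration, not the schedule of any particular system.

```python
def pipeline_schedule(num_stages, num_microbatches):
    """Return, for each time step, the (stage, microbatch) pairs that run.

    Simplified forward-only schedule: stage s handles micro-batch m at
    time step s + m, so the pipeline fills, runs full, then drains.
    """
    total_steps = num_stages + num_microbatches - 1
    schedule = []
    for t in range(total_steps):
        active = [(s, t - s) for s in range(num_stages)
                  if 0 <= t - s < num_microbatches]
        schedule.append(active)
    return schedule


schedule = pipeline_schedule(num_stages=3, num_microbatches=4)
for t, active in enumerate(schedule):
    print(f"step {t}: {active}")
```

With 3 stages and 4 micro-batches the forward work finishes in 6 steps instead of the 12 a strictly sequential execution would need; the steps at the start and end where some stages sit idle are the pipeline "bubble".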

To achieve scalable communication in distributed systems, several schemes are used, including the centralized parameter server and collective operations such as AllReduce.
The goal is to find a scalable alternative to the bottlenecked parameter server approach. AllReduce allows for aggregation without a central server, making it highly efficient for distributed training.
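A back-of-the-envelope cost model shows why the central server becomes the bottleneck. The sketch below assumes every worker pushes a full gradient to the server and pulls a full update back each step, and compares that with the per-worker traffic of a ring-based AllReduce. The ring variant is one common way to implement AllReduce and is an assumption here; the function names are hypothetical.

```python
def parameter_server_traffic(num_workers, grad_bytes):
    """Bytes crossing the server's link per step: N pushes in, N pulls out."""
    return 2 * num_workers * grad_bytes


def ring_allreduce_traffic(num_workers, grad_bytes):
    """Bytes each worker sends per step in a ring AllReduce (assumed variant)."""
    return 2 * (num_workers - 1) * grad_bytes / num_workers


grad_bytes = 100
for n in (2, 8, 64):
    ps = parameter_server_traffic(n, grad_bytes)
    ring = ring_allreduce_traffic(n, grad_bytes)
    print(f"{n:>2} workers: server link {ps} B, ring per-worker {ring:.1f} B")
```

The server's traffic grows linearly with the number of workers, while the per-worker traffic in the ring stays below twice the gradient size no matter how many workers join.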
Here is simple pseudocode for AllReduce:

```
# Pseudocode: each of the N workers contributes its local result
for i in range(N):
    AllReduce(work[i])
```
AllReduce is designed to minimize both time and bandwidth complexity. The key idea is to distribute the computation and communication load evenly across all devices, avoiding any single point of failure or bottleneck.
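One way to see how the load can be spread evenly is to simulate a ring AllReduce in plain Python. This is a sketch under stated assumptions: the ring algorithm is one well-known AllReduce implementation (the article does not say which variant it has in mind), workers are simulated as lists in a single process, and `ring_allreduce` is a hypothetical helper, not a library API.

```python
def ring_allreduce(worker_data):
    """Sum equal-length vectors across workers using a simulated ring.

    worker_data[i] is worker i's local vector. Returns the per-worker
    results and the number of values each worker sent.
    """
    n = len(worker_data)
    length = len(worker_data[0])
    if length % n != 0:
        raise ValueError("vector length must be divisible by worker count")
    size = length // n
    # chunks[i][c] is worker i's copy of chunk c.
    chunks = [[list(vec[c * size:(c + 1) * size]) for c in range(n)]
              for vec in worker_data]
    sent_per_worker = 0

    # Phase 1, reduce-scatter: after n - 1 steps, worker i holds the
    # complete sum for chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing chunks so all sends in a step happen together.
        outgoing = [(i, (i - step) % n, list(chunks[i][(i - step) % n]))
                    for i in range(n)]
        for src, c, payload in outgoing:
            dst = (src + 1) % n
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], payload)]
        sent_per_worker += size

    # Phase 2, all-gather: pass the finished chunks around the ring.
    for step in range(n - 1):
        outgoing = [(i, (i + 1 - step) % n,
                     list(chunks[i][(i + 1 - step) % n]))
                    for i in range(n)]
        for src, c, payload in outgoing:
            chunks[(src + 1) % n][c] = payload
        sent_per_worker += size

    results = [[v for chunk in worker for v in chunk] for worker in chunks]
    return results, sent_per_worker


grads = [
    [1, 2, 3, 4, 5, 6],
    [10, 20, 30, 40, 50, 60],
    [100, 200, 300, 400, 500, 600],
]
results, sent = ring_allreduce(grads)
print(results[0], "values sent per worker:", sent)
```

Every worker ends up with the same summed vector, and each one sends the same amount of data, 2(N-1)/N times the vector length, so no single device carries a disproportionate share of the traffic.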
By efficiently managing these resources, AllReduce ensures that the training process remains scalable and efficient, even as the size of models and datasets continues to grow.
Distributed training is no longer a luxury but a necessity in the world of deep learning. With the increasing complexity and size of models, techniques like AllReduce are crucial for maintaining productivity, reducing time to market, and ensuring cost efficiency. By understanding and leveraging these methods, researchers and developers can continue to push the boundaries of what's possible with AI.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
4 October 2024