Understanding Distributed Training for Large-Scale Deep Learning Models

The Engineer

4 Oct 2024

Dive into how distributed training optimizes large-scale deep learning models by breaking down complex computations across multiple nodes, speeding up training without sacrificing accuracy.

Introduction

Deep learning computations are often represented as dataflow graphs in popular machine learning frameworks. In these graphs, edges represent multi-dimensional tensors, and nodes correspond to computational operators, such as matrix multiplication, that transform input tensors into outputs. A single iteration of deep learning (DL) model training consists of a forward pass over a batch of data, a loss computation, and a backward pass that computes the gradients used to update the model weights. This process is repeated until the loss converges or the training budget runs out; for non-convex models, reaching a global minimum is not guaranteed.
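The iteration described above can be sketched in a few lines of plain Python. This is a minimal illustration, not how a framework implements it: the one-weight model, the data, the learning rate, and the step count are all assumptions chosen to keep the example small.

```python
# Toy model: y_hat = w * x, trained with mean squared error.
# The data, learning rate, and step count are illustrative assumptions.

def train_step(w, xs, ys, lr):
    """Run one iteration: forward pass, loss, backward pass, weight update."""
    n = len(xs)
    # Forward pass: compute predictions for the whole batch.
    preds = [w * x for x in xs]
    # Loss: mean squared error between predictions and targets.
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n
    # Backward pass: gradient of the loss with respect to w.
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
    # Update: one step of gradient descent.
    return w - lr * grad, loss

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]  # generated by the "true" weight 3.0

w = 0.0
losses = []
for _ in range(50):
    w, loss = train_step(w, xs, ys, lr=0.05)
    losses.append(loss)
```

After the loop, `w` is close to 3.0 and the recorded losses shrink from step to step. Real frameworks derive the backward pass automatically from the dataflow graph instead of hand-writing the gradient.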

Typically, developers define the structure of these dataflow graphs, and an execution engine optimizes and runs them on GPUs. However, with the growing size of datasets and the complexity of models, distributed training has become essential to handle these demands efficiently.

Motivation

The AI landscape is evolving towards larger models, which generally offer better performance but at a higher computational cost. The table below illustrates the size of some popular models and their training times on a single Nvidia A100 GPU:

| Model Name | Size (Parameters) | Training Time (on Nvidia A100) |
| --- | --- | --- |
| ResNet-101 | 45M | 44 hours |
| BERT-Base | 108M | 84 hours |
| GPT-3 175B | 175B | 3,100,000 hours |

At roughly 3.1 million hours, training the GPT-3 175B model on a single GPU would take an impractical 355 years or so. Distributed training is therefore a necessity, and it brings several benefits:

Developer/Researcher Productivity: Faster iterations and experiments.
Shorter Time to Market: Quicker deployment of models in production.
Cost Efficiency: Well-utilized hardware finishes a job in fewer wall-clock hours, which can lower the cost per trained model.
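As a quick sanity check on the single-GPU figure, the snippet below converts the GPT-3 hours into years and then estimates the best-case wall-clock time on a cluster. The 1024-GPU cluster size and the perfect linear scaling are assumptions for illustration only; real clusters lose some efficiency to communication.

```python
HOURS_PER_YEAR = 24 * 365  # ignoring leap years

def hours_to_years(hours):
    """Convert a duration in hours to years."""
    return hours / HOURS_PER_YEAR

def ideal_hours(single_gpu_hours, num_gpus):
    """Best-case wall-clock time, assuming perfect linear scaling."""
    return single_gpu_hours / num_gpus

gpt3_hours = 3_100_000  # single-GPU training time quoted for GPT-3 175B
years = hours_to_years(gpt3_hours)

# Hypothetical cluster size, chosen only to illustrate the scale of the speedup.
days_on_1024_gpus = ideal_hours(gpt3_hours, 1024) / 24

print(f"single GPU: about {years:.0f} years")
print(f"1024 GPUs (ideal scaling): about {days_on_1024_gpus:.0f} days")
```

Under these assumptions the job shrinks from centuries to a few months, which is the gap distributed training is meant to close.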

There are two primary types of parallelism that can be leveraged to speed up the training process:

Data Parallelism: Splitting data across multiple devices while keeping the model architecture unchanged.
Model Parallelism: Splitting the model itself across different devices.

In this article, we focus on the collective communication that makes such parallel training scale, AllReduce in particular.
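To make data parallelism concrete, here is a minimal sketch in plain Python. It assumes a toy one-weight model, a hypothetical batch of eight samples, and four simulated workers with equally sized shards; it covers data parallelism only, not model parallelism. Each worker computes a gradient on its own shard, and averaging those gradients reproduces the full-batch gradient.

```python
def gradient(w, xs, ys):
    """Gradient of mean squared error for the toy model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def shard(items, num_workers):
    """Split a list into num_workers contiguous, equally sized shards."""
    size = len(items) // num_workers
    return [items[i * size:(i + 1) * size] for i in range(num_workers)]

# Illustrative batch; the values carry no special meaning.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5
num_workers = 4

# Each worker holds a full copy of w but sees only its own shard of the batch.
local_grads = [
    gradient(w, x_shard, y_shard)
    for x_shard, y_shard in zip(shard(xs, num_workers), shard(ys, num_workers))
]

# Averaging the local gradients recovers the full-batch gradient
# as long as every shard has the same number of samples.
data_parallel_grad = sum(local_grads) / num_workers
full_batch_grad = gradient(w, xs, ys)
```

The averaging step is where communication enters: in a real system the workers sit on different devices, so the sum has to be computed over the network.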

Distributed Communication

AllReduce

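Distributed training relies on a small set of collective communication primitives:

Scatter: Splits a tensor on one device into chunks and sends a different chunk to each device (sending the same full tensor to every device is a broadcast).
Gather: Collects tensors from all devices onto one device.
Reduce: Like gather, but combines the tensors with an operation such as sum or average, leaving the result on one device.
AllReduce: Performs the same reduction but leaves the result on every device.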
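The semantics of these primitives can be illustrated with a single-process simulation, following the usual MPI-style definitions. The function names below are illustrative and do not belong to any library; each "device" is simply an entry in a Python list.

```python
def scatter(tensor, num_devices):
    """Split the root's tensor into equal chunks; device i receives chunk i."""
    size = len(tensor) // num_devices
    return [tensor[i * size:(i + 1) * size] for i in range(num_devices)]

def gather(per_device):
    """Concatenate every device's chunk on one root device."""
    return [x for chunk in per_device for x in chunk]

def reduce_sum(per_device):
    """Element-wise sum of every device's vector; the result lives on the root."""
    return [sum(values) for values in zip(*per_device)]

def all_reduce_sum(per_device):
    """Same reduction, but every device ends up with its own copy of the result."""
    total = reduce_sum(per_device)
    return [list(total) for _ in per_device]

chunks = scatter([1, 2, 3, 4, 5, 6], num_devices=3)   # [[1, 2], [3, 4], [5, 6]]
restored = gather(chunks)                             # [1, 2, 3, 4, 5, 6]
summed_on_root = reduce_sum(chunks)                   # [9, 12]
summed_everywhere = all_reduce_sum(chunks)            # [[9, 12], [9, 12], [9, 12]]
```

The simulation ignores how the data actually moves between devices, which is exactly what distinguishes a naive AllReduce from a scalable one.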

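The goal is to find a scalable alternative to the parameter server approach, in which every worker sends its gradients to a central server that becomes a bandwidth bottleneck as workers are added. AllReduce aggregates gradients without a central server, which is why it is widely used for distributed training.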

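Here is simple pseudocode for how each of the N workers uses AllReduce during a training step:

# local_gradients: the gradient tensors this worker computed in its backward pass
for grad in local_gradients:
    AllReduce(grad)  # grad now holds the sum across all N workers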

Time and Bandwidth Complexity

Efficient AllReduce implementations aim to keep both latency and per-device bandwidth low. The key idea is to spread the computation and communication load evenly across all devices so that no single device becomes a bottleneck.
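One common way to spread the load evenly is the ring algorithm, sketched below as a single-process simulation. This is an illustrative sketch, not a library implementation: devices are list entries, a "send" is a list write, and the vector length is assumed to be divisible by the number of devices.

```python
def ring_all_reduce(buffers):
    """Sum-AllReduce over a simulated ring; buffers[i] is device i's vector.

    Assumes the vector length is divisible by the number of devices.
    Returns (per-device results, number of communication steps).
    """
    n = len(buffers)
    chunk = len(buffers[0]) // n
    data = [list(b) for b in buffers]  # work on copies, leave inputs untouched
    steps = 0

    def span(c):
        """Index range covered by chunk c."""
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: in step s, device i sends chunk (i - s) mod n
    # to its right neighbour, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        # Snapshot outgoing chunks first so all sends happen "simultaneously".
        outgoing = [[data[i][k] for k in span((i - s) % n)] for i in range(n)]
        for i in range(n):
            dst = (i + 1) % n
            for k, v in zip(span((i - s) % n), outgoing[i]):
                data[dst][k] += v
        steps += 1

    # Phase 2, all-gather: device i now owns the fully reduced chunk (i + 1) mod n
    # and the completed chunks are passed around the ring, overwriting stale data.
    for s in range(n - 1):
        outgoing = [[data[i][k] for k in span((i + 1 - s) % n)] for i in range(n)]
        for i in range(n):
            dst = (i + 1) % n
            for k, v in zip(span((i + 1 - s) % n), outgoing[i]):
                data[dst][k] = v
        steps += 1

    return data, steps

buffers = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400], [1000, 2000, 3000, 4000]]
results, steps = ring_all_reduce(buffers)
# Every device ends with [1111, 2222, 3333, 4444] after 6 steps.
```

In every step each device sends exactly one chunk and receives exactly one chunk, so no link or device carries more traffic than any other.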

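Time Complexity: Ring AllReduce takes $2(N-1)$ communication steps, so its latency grows as $O(N)$, where $N$ is the number of devices; tree-based variants reduce the number of steps to $O(\log N)$.
Bandwidth Complexity: In ring AllReduce, each device sends and receives about $2\frac{N-1}{N}S$ bytes for a buffer of $S$ bytes, which is nearly independent of $N$; the total traffic summed over all devices is $O(N \cdot S)$.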
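A small cost model makes the scaling behaviour easier to see. The sketch below assumes the standard chunked ring schedule, in which each device sends one chunk of size S/N in each of 2(N-1) steps, and compares it with a deliberately simple parameter server model in which every worker pushes its full gradient to one server. The 400 MB buffer is a hypothetical example.

```python
def ring_all_reduce_cost(num_devices, buffer_bytes):
    """Steps and bytes sent per device for ring AllReduce (chunked ring schedule)."""
    steps = 2 * (num_devices - 1)
    sent_per_device = steps * buffer_bytes / num_devices
    return steps, sent_per_device

def parameter_server_ingress(num_workers, buffer_bytes):
    """Bytes one central server receives when every worker pushes a full gradient."""
    return num_workers * buffer_bytes

buffer_bytes = 400e6  # hypothetical: 100M float32 gradients

for n in (2, 8, 64):
    steps, sent = ring_all_reduce_cost(n, buffer_bytes)
    server = parameter_server_ingress(n, buffer_bytes)
    print(f"N={n}: ring steps={steps}, ring MB/device={sent / 1e6:.1f}, "
          f"server ingress MB={server / 1e6:.1f}")
```

Under these assumptions, the per-device traffic of the ring stays below twice the buffer size no matter how many devices join, while the load on a single central server grows linearly with the number of workers. The price is the step count, which also grows linearly and starts to matter for small buffers on large clusters.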

Because the per-device traffic of ring AllReduce stays nearly constant as devices are added, training remains scalable even as models and datasets continue to grow.

Conclusion

Distributed training is no longer a luxury but a necessity in the world of deep learning. With the increasing complexity and size of models, techniques like AllReduce are crucial for maintaining productivity, reducing time to market, and ensuring cost efficiency. By understanding and leveraging these methods, researchers and developers can continue to push the boundaries of what's possible with AI.