
Researchers unveil DeMo, a method that slashes communication overhead in distributed training by decoupling momentum updates from weight synchronization, preserving performance without the bandwidth burden.
Communication overhead has become a significant bottleneck in large-scale neural network training. Synchronous data parallelism is essential for scaling up, but all-reducing full-precision gradients on every step demands so much bandwidth that it can severely limit throughput. A new paper introduces Decoupled Momentum Optimization (DeMo), a technique that sharply reduces this communication while maintaining the convergence behavior of momentum-based optimizers such as AdamW.
The key innovation in DeMo is its decoupling of local momentum updates, which allows for more efficient gradient aggregation. Here’s how it works:
Decoupled Local Momentum Updates: Instead of performing full-precision all-reduce operations, DeMo keeps local momentum buffers on each worker node. This means that the momentum update step can be done independently on each node without immediate communication.
Fast Orthonormal Transform and Top-k Sparsification: Once the local gradient has been accumulated into the momentum buffer, DeMo applies a fast orthonormal transform (such as the Discrete Cosine Transform, DCT) to that buffer. The transformed coefficients are then sparsified by keeping only the k with the largest magnitude. Only these k coefficients are communicated, which significantly reduces the amount of data on the wire.
Momentum Buffer for Error Feedback: So that the information dropped by sparsification is not simply discarded, DeMo reuses the local momentum buffer as an error feedback mechanism. Specifically, it subtracts the transmitted sparse components from the momentum buffer, so whatever was not sent stays in the buffer and can be sent in later steps. This helps maintain the convergence properties of the optimizer.
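In symbols, the three pieces above can be read as the update below. The notation is assumed for illustration and is not taken from the article: worker index i out of W workers, momentum decay β, learning rate η, local gradient g, orthonormal transform T, and TopK_k as the operator that zeroes all but the k largest-magnitude coefficients. Averaging the sparse components across workers is also an assumption of this sketch; the article only says that they are aggregated.

```latex
\begin{aligned}
m_t^{(i)} &= \beta\, m_{t-1}^{(i)} + g_t^{(i)}
  && \text{local momentum update on worker } i \\
q_t^{(i)} &= T^{-1}\!\big(\operatorname{TopK}_k\big(T\, m_t^{(i)}\big)\big)
  && \text{transform, keep the } k \text{ largest coefficients} \\
m_t^{(i)} &\leftarrow m_t^{(i)} - q_t^{(i)}
  && \text{error feedback by momentum subtraction} \\
\theta_{t+1} &= \theta_t - \eta \cdot \frac{1}{W} \sum_{i=1}^{W} q_t^{(i)}
  && \text{apply the aggregated sparse update}
\end{aligned}
```

Only the k retained coefficients and their positions need to cross the network; the residual left in each worker's momentum buffer never does.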
For practitioners, this means:
Reduced Communication Overhead: DeMo can reduce per-step communication by up to two orders of magnitude compared to traditional methods like AdamW with Distributed Data Parallel (DDP). For example, experiments on 300M and 1B-parameter language models show that DeMo transmits up to 85 times less data per GPU than AdamW-DDP while achieving comparable loss and accuracy.
Minimal Computational Overhead: The additional steps of applying the orthonormal transform and sparsification are computationally lightweight, making DeMo a practical drop-in replacement for existing optimizers.
Topology-Agnostic: DeMo is designed to work across various network topologies, including multi-datacenter and Ethernet-based setups. This flexibility makes it suitable for a wide range of distributed training environments.
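To see where a reduction of this size can come from, a back-of-the-envelope calculation helps. The sketch below compares the bytes in a dense chunk of values with the bytes needed to send only k (index, value) pairs from that chunk. The function name, the chunk size, k, and the byte widths are all illustrative assumptions, not settings from the paper, and the calculation ignores how a particular collective operation moves data between workers.

```python
def payload_ratio(chunk_size, k, dense_bytes=4, value_bytes=4, index_bytes=4):
    """Dense bytes per chunk divided by sparse bytes per chunk.

    All arguments are hypothetical knobs for illustration only.
    """
    if not 0 < k <= chunk_size:
        raise ValueError("k must be between 1 and chunk_size")
    dense = chunk_size * dense_bytes            # every element is sent
    sparse = k * (value_bytes + index_bytes)    # only k (index, value) pairs
    return dense / sparse


# Hypothetical setting: 4096-element chunks, 32 coefficients kept per chunk.
print(payload_ratio(4096, 32))  # 64.0
```

Halving k doubles the ratio, which is the bandwidth-versus-fidelity trade-off that the choice of k controls.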

To give you a better idea of how DeMo works under the hood:
Gradient Computation: Each worker node computes gradients locally and accumulates them into its own momentum buffer.
Orthonormal Transform: The momentum buffer is transformed using an orthonormal transform like DCT. This step concentrates most of the signal into a few coefficients, making sparsification more effective.
Sparsification: Only the k largest-magnitude coefficients of the transformed buffer are selected for communication. The value of k can be adjusted based on the available bandwidth and desired performance trade-offs.
Error Feedback: The components selected for transmission are subtracted from the local momentum buffer, and the aggregated sparse update is applied to the model parameters. Whatever was not transmitted remains in the buffer, so it is accounted for in subsequent steps.
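The four steps above can be sketched for a single worker and a single small chunk. This is a minimal pure-Python illustration, not the authors' implementation: the function names, the decay value beta=0.9, and the choice to rank coefficients by absolute magnitude are assumptions, and the sketch assumes the transform is applied to the momentum buffer once the local gradient has been added to it. A real implementation would use a tensor library and a fast DCT rather than an explicit matrix.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis as a list of n rows of length n."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (i + 0.5) * k / n)
                     for i in range(n)])
    return rows


def dct(x, basis):
    """Forward transform: project x onto each basis row."""
    return [sum(b * v for b, v in zip(row, x)) for row in basis]


def idct(coeffs, basis):
    """Inverse transform; for an orthonormal basis this is the transpose."""
    n = len(coeffs)
    return [sum(basis[k][i] * coeffs[k] for k in range(n)) for i in range(n)]


def top_k(coeffs, k):
    """Return {index: value} for the k largest-magnitude coefficients."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]),
                   reverse=True)
    return {i: coeffs[i] for i in order[:k]}


def local_step(momentum, grad, basis, k, beta=0.9):
    """One worker-local step: returns (sparse payload, updated momentum)."""
    # Steps 1-2: accumulate the local gradient, then transform the buffer.
    m = [beta * mi + gi for mi, gi in zip(momentum, grad)]
    sparse = top_k(dct(m, basis), k)
    # Step 3: the sparse dict is all that would be communicated.
    dense = [sparse.get(i, 0.0) for i in range(len(m))]
    sent = idct(dense, basis)
    # Step 4: error feedback, subtract what was sent from the buffer.
    residual = [mi - si for mi, si in zip(m, sent)]
    return sparse, residual


basis = dct_matrix(8)
grad = [1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0]
sparse, momentum = local_step([0.0] * 8, grad, basis, k=2)
print(sorted(sparse))  # indices of the two coefficients that would be sent
```

The sparse payload from each worker would then be aggregated and decoded with the inverse transform before the parameter update; that cross-worker step is omitted here because the article does not spell it out.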
The researchers conducted experiments on language models with 300M and 1B parameters. Here are some key findings:
Communication Efficiency: DeMo reduced the amount of data transmitted per GPU by up to 85 times compared to AdamW-DDP.
Convergence and Accuracy: Despite the significant reduction in communication, DeMo achieved comparable loss and accuracy metrics to AdamW-DDP.
Decoupled Momentum Optimization (DeMo) offers a promising solution to the communication bottleneck in distributed training. By decoupling local momentum updates and applying efficient sparsification techniques, DeMo maintains the convergence properties of traditional optimizers while significantly reducing communication overhead. This makes it an attractive option for large-scale neural network training across various network topologies.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
3 December 2024