
Researchers introduce a method allowing diffusion models to train across independent GPU clusters without the need for high-bandwidth networks, democratizing AI model training.
Training state-of-the-art diffusion models typically requires thousands of GPUs, with computation distributed and gradients synchronized at each optimization step. This process incurs a massive networking load, necessitating centralized facilities equipped with specialized hardware and extensive power delivery systems. However, this setup is cost-prohibitive for academic labs and even challenging for large companies due to fundamental limits on power delivery and networking bandwidth.
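To get a feel for the networking load described above, here is a rough back-of-envelope sketch. It assumes gradients are synchronized with a ring all-reduce, in which each GPU sends about 2(N-1)/N times the gradient size per step. The function name and the workload figures (1 billion parameters, 2-byte gradients, 1024 GPUs) are illustrative assumptions, not numbers from the paper.

```python
def ring_allreduce_bytes_per_gpu(n_params, bytes_per_value, n_gpus):
    """Bytes each GPU sends in one ring all-reduce of the full gradient."""
    gradient_bytes = n_params * bytes_per_value
    # A ring all-reduce moves 2 * (N - 1) / N times the buffer size per GPU.
    return 2 * (n_gpus - 1) / n_gpus * gradient_bytes

# Hypothetical workload: 1B parameters, 2-byte gradients, 1024 GPUs.
per_step = ring_allreduce_bytes_per_gpu(1_000_000_000, 2, 1024)
print(f"{per_step / 1e9:.2f} GB sent per GPU per optimization step")
```

Under these assumptions every GPU moves roughly 4 GB on every optimization step, which is why conventional training leans on high-bandwidth interconnects.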
A new approach, Decentralized Diffusion Models (DDMs), addresses these issues by training a set of expert diffusion models, each in communication isolation from the others. Because the experts never exchange gradients, they can be trained in different locations and on different hardware configurations. At inference time, a lightweight learned router ensembles the experts, and the ensemble collectively optimizes the same objective as a single monolithic model trained over the entire dataset. According to the authors, DDMs match and often outperform monolithic models FLOP-for-FLOP by using sparse computation at both training and test time. They also scale to billions of parameters and produce strong results with reduced pretraining budgets.
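The router-based ensembling described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the experts and the router here are toy linear functions, and the names `ensemble_prediction`, `experts`, `router`, and `top_k` are hypothetical. It assumes the router produces one logit per expert and that the combined prediction is the router-weighted sum of expert predictions.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilize the exponent
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_prediction(x_t, t, experts, router, top_k=None):
    """Combine expert predictions with router weights.

    x_t:     (batch, dim) noisy inputs at time t
    experts: list of callables (x_t, t) -> (batch, dim)
    router:  callable (x_t, t) -> (batch, n_experts) logits
    top_k:   if set, keep only the top_k experts per sample
    """
    weights = softmax(router(x_t, t))  # (batch, n_experts)
    if top_k is not None:
        # Zero out all but the top_k weights per sample, then renormalize.
        drop = np.argsort(weights, axis=-1)[:, :-top_k]
        np.put_along_axis(weights, drop, 0.0, axis=-1)
        weights = weights / weights.sum(axis=-1, keepdims=True)
    # For simplicity every expert is evaluated here; a real sparse system
    # would skip experts whose weight is zero.
    preds = np.stack([f(x_t, t) for f in experts], axis=1)  # (batch, n_experts, dim)
    return (weights[..., None] * preds).sum(axis=1), weights

# Toy stand-ins for trained networks (assumptions for illustration only).
rng = np.random.default_rng(0)
dim, n_experts = 4, 3
mats = [rng.normal(size=(dim, dim)) for _ in range(n_experts)]
experts = [lambda x, t, W=W: (x @ W) * (1.0 - t) for W in mats]
R = rng.normal(size=(dim, n_experts))
router = lambda x, t: x @ R

x_t = rng.normal(size=(2, dim))
dense, w_dense = ensemble_prediction(x_t, 0.5, experts, router)
sparse, w_sparse = ensemble_prediction(x_t, 0.5, experts, router, top_k=1)
```

With `top_k=1` each sample is handled by a single expert, which is one way to read the "sparse computation" claim: only a fraction of the total parameters is active for any given input.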
The authors demonstrate the approach by training a large-scale model on just eight independent GPU nodes in less than a week. Such a setup is far cheaper and more accessible than a traditional centralized cluster. The authors report that DDMs reach state-of-the-art quality without specialized networking hardware.

One of the key advantages of DDMs is their ability to simplify training systems. By removing the dependency on high-bandwidth networking, researchers and practitioners can utilize compute resources where they are available, whether in different data centers or across the internet. This flexibility makes it easier to scale up models without hitting the fundamental limits imposed by centralized clusters.
To better understand diffusion models and rectified flows, consider them as special cases of a more general framework for transporting one probability distribution to another, often formalized as flow matching. Under this view, samples drawn from a simple distribution (like Gaussian noise) are gradually transformed into samples from a complex target distribution. The transformation is guided by a learned function that is applied over many small steps, each one refining the sample a little further.
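The noise-to-data transformation above can be made concrete with a toy example. This sketch assumes the rectified-flow style straight-line interpolation between a noise sample and a data sample, and it swaps the learned network for a closed-form velocity field that is only valid when the target distribution is a single point. The names `interpolate`, `velocity_target`, and `euler_sample` are illustrative, not from the paper.

```python
import numpy as np

def interpolate(noise, data, t):
    """Point on the straight path from noise (t=0) to data (t=1)."""
    return (1.0 - t) * noise + t * data

def velocity_target(noise, data):
    """Regression target for the model along that path: its constant velocity."""
    return data - noise

def euler_sample(velocity_fn, noise, n_steps=50):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = noise.copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x

# Toy target: all data sits at one point, so the ideal velocity field is
# known in closed form and stands in for a trained network.
target = np.array([2.0, -1.0])
ideal_velocity = lambda x, t: (target - x) / (1.0 - t)

rng = np.random.default_rng(0)
noise = rng.normal(size=(5, 2))
samples = euler_sample(ideal_velocity, noise)
```

In a real model the velocity function is a neural network trained to regress `velocity_target` at randomly sampled times, and sampling follows the same step-by-step loop.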
Decentralized Diffusion Models represent a significant advancement in the field of machine learning. By enabling the training of large-scale models across independent GPU clusters without networking bottlenecks, DDMs offer a practical solution to the challenges faced by both academic labs and large companies. The method's ability to produce state-of-the-art results with reduced pretraining budgets makes it an exciting development for the future of diffusion model research.
Original Sources
↗ https://decentralizeddiffusion.github.io/?utm_source=tldrai
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
14 January 2025