LLaDA: An 8B-Scale Diffusion Model Rivals LLaMA3 in Performance

The Engineer

18 Feb 2025

Researchers unveil LLaDA, an 8-billion-parameter diffusion language model that performs competitively with LLaMA3 8B, challenging the assumption that strong language models must be autoregressive.

Introduction

Researchers from Renmin University of China and Ant Group have introduced LLaDA (Large Language Diffusion with Masking), an 8-billion-parameter diffusion model trained from scratch. The new model rivals the performance of LLaMA3 8B, challenging the dominance of autoregressive models among large language models (LLMs).

Motivation

The team behind LLaDA posits that the intelligence of LLMs (manifested through scalability, instruction-following, in-context learning, conversational ability, and compression) stems from the core principle of generative modeling rather than from the autoregressive mechanism itself. Generative models approximate the true language distribution by maximum likelihood estimation (MLE), and LLaDA applies that same principle without relying on left-to-right generation.
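
In symbols, and as a sketch of the principle rather than a formula quoted from the announcement, maximum likelihood training picks model parameters \( \theta \) (notation assumed here) so that the model distribution \( p_\theta \) assigns high probability to text drawn from the true distribution \( p_{\text{data}} \):

\[
\max_\theta \; \mathbb{E}_{x \sim p_{\text{data}}} \left[ \log p_\theta(x) \right]
\]

An autoregressive model is one way to define \( p_\theta \). For a sequence of \( L \) tokens it factorizes the distribution token by token, \( p_\theta(x) = \prod_{i=1}^{L} p_\theta(x^i \mid x^1, \dots, x^{i-1}) \). The argument above is that this factorization is a design choice, not a requirement of the objective.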

Method

Model Architecture

Masked Diffusion: LLaDA is a masked diffusion model. During training, tokens in a sequence are replaced with a mask token, and the model learns to predict the original tokens at the masked positions.
- Pretraining: Tokens are randomly masked at a ratio \( t \sim U[0,1] \), sampled anew for each training sequence, so the model sees everything from lightly masked to almost fully masked text.
- Supervised Fine-Tuning (SFT): Only response tokens are masked; the prompt stays visible. The model therefore learns to generate a response conditioned on the prompt.
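
The two masking regimes above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `MASK` string, the `mask_tokens` helper, and the `prompt_len` parameter are hypothetical names, and the sketch assumes each eligible token is masked independently with probability \( t \).

```python
import random
from typing import List, Optional

MASK = "<mask>"  # hypothetical mask token; the article does not name one


def mask_tokens(tokens: List[str], rng: random.Random,
                prompt_len: int = 0, t: Optional[float] = None) -> List[str]:
    """Return a copy of `tokens` with eligible positions masked at ratio t.

    prompt_len = 0 mimics pretraining (any token may be masked);
    prompt_len > 0 mimics SFT (the first prompt_len tokens stay visible).
    """
    if t is None:
        t = rng.random()  # one draw of t from U[0, 1) per sequence
    noisy = list(tokens)
    for i in range(prompt_len, len(tokens)):
        # Assumption: each eligible token is masked independently with probability t.
        if rng.random() < t:
            noisy[i] = MASK
    return noisy


rng = random.Random(0)
tokens = ["What", "is", "2+2", "?", "It", "is", "4", "."]
print(mask_tokens(tokens, rng))                # pretraining-style masking
print(mask_tokens(tokens, rng, prompt_len=4))  # SFT-style: only the response is masked
```

Passing `t` explicitly makes the behavior easy to inspect: `t=1.0` masks every eligible token and `t=0.0` masks none.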

Sampling via Diffusion

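Generation simulates the reverse diffusion process, moving from a fully masked sequence (\( t = 1 \)) to a fully unmasked one (\( t = 0 \)).
At each step, the model predicts all masked tokens simultaneously, and a portion of those predictions is then remasked so that it can be revised in later steps. The remasking strategy is flexible, which lets generation proceed in any order rather than strictly left to right.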
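
The sampling loop described above might look roughly like the following sketch. Everything here is illustrative: `sample`, `toy_predict`, and the `MASK` string are hypothetical, the predictor is a stand-in for the real model, and remasking the lowest-confidence predictions is only one possible strategy, assumed here for concreteness.

```python
from typing import Callable, List, Tuple

MASK = "<mask>"  # hypothetical mask token

# A predictor maps a partially masked sequence to a (token, confidence)
# guess for every position. This stands in for the real model.
Predictor = Callable[[List[str]], List[Tuple[str, float]]]


def sample(predict: Predictor, length: int, steps: int) -> List[str]:
    seq = [MASK] * length                        # t = 1: everything is masked
    for step in range(steps):
        s = 1.0 - (step + 1) / steps             # mask ratio targeted after this step
        preds = predict(seq)
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        for i in masked:                         # predict all masks simultaneously
            seq[i] = preds[i][0]
        # Remask so that about s * length positions stay masked.
        # Assumed strategy: remask the least confident new predictions.
        n_remask = int(round(s * length))
        for i in sorted(masked, key=lambda j: preds[j][1])[:n_remask]:
            seq[i] = MASK
    return seq                                   # t = 0: nothing is masked


def toy_predict(seq: List[str]) -> List[Tuple[str, float]]:
    target = ["the", "cat", "sat", "down"]
    # Confidence falls with position purely for illustration.
    return [(target[i], 1.0 - 0.1 * i) for i in range(len(seq))]


print(sample(toy_predict, length=4, steps=2))
```

With the toy predictor, the first step fills all four positions and remasks the two least confident ones, and the second step fills those in. Swapping the `sorted(...)` selection for a random choice would give a random remasking strategy instead.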

Scalability

LLaDA scales well: as training scale increases, its performance follows trends that are highly competitive with autoregressive baselines trained on the same data. This matters because a diffusion approach is only a viable alternative to existing models if it keeps improving with larger models and datasets.

Case Studies

Math Problem

User: Emma decides to run 4 sprints 5 times a week. She runs 75 meters each sprint. How many total meters does she run a week?

LLaDA-Instruct: Emma runs 4 sprints each time, so she runs a total of 4 x 5 = 20 sprints a week. Each sprint is 75 meters long, so Emma runs a total of 20 x 75 = 1500 meters a week. The final result is 1500.

Daily Life

User: Please recommend me three famous movies

LLaDA-Instruct: Certainly! Here are three famous movies that have achieved widespread popularity and critical acclaim:

1. The Shawshank Redemption (1994) - Directed by Frank Darabont, this film tells the story of Andy Dufresne, a man wrongfully convicted of murder who forms a deep friendship with fellow inmate Red.
2. The Godfather (1972) - Directed by Francis Ford Coppola, this iconic film follows the Corleone family as they navigate the intricacies of the New York mafia.
3. Forrest Gump (1994) - Directed by Robert Zemeckis, this heartwarming tale follows the life of Forrest Gump, a simple man with a kind heart who finds himself entangled in key historical events.

Conclusion

LLaDA shows that a masked diffusion model trained from scratch at the 8B scale can perform on par with a leading autoregressive model, LLaMA3 8B. This supports the view that the core capabilities of LLMs are not tied to autoregressive generation, and it establishes diffusion as a scalable alternative for language modeling.