
A new model called Emu3 challenges the dominance of specialized diffusion models in multimodal tasks, suggesting that a single transformer trained only with next-token prediction can perform strongly across images, text, and video.
Next-token prediction has long been seen as a promising path towards AGI, but it has struggled to match the performance of specialized models in multimodal tasks. Diffusion models such as Stable Diffusion and compositional approaches (e.g., CLIP combined with LLMs) have dominated these domains. Emu3, a new suite of models developed by BAAI Vision, challenges that picture.
Emu3 takes a different approach to multimodal tasks: it trains a single transformer using only next-token prediction. Images, text, and videos are all tokenized into a discrete space, so the model can handle a diverse mix of modalities without diffusion or compositional architectures. According to BAAI, Emu3 outperforms well-established task-specific models in both generation and perception tasks, including SDXL in image generation, LLaVA-1.6 in vision-language understanding, and OpenSora-1.2 in video generation.
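To make the idea of a shared discrete space concrete, here is a minimal sketch of how text ids and vision codebook indices could be placed in one vocabulary and turned into next-token training pairs. Everything in it is an assumption for illustration: the vocabulary sizes, the `BOI`/`EOI` marker tokens, and the helper names are hypothetical and are not taken from Emu3's actual tokenizer or training code.

```python
import math

# Hypothetical sizes, chosen only for illustration (not Emu3's real vocabulary).
TEXT_VOCAB_SIZE = 1000       # ids 0..999 are text tokens
VISION_CODEBOOK_SIZE = 512   # ids 1000..1511 are vision tokens
BOI = TEXT_VOCAB_SIZE + VISION_CODEBOOK_SIZE  # hypothetical "begin of image" marker
EOI = BOI + 1                                 # hypothetical "end of image" marker
VOCAB_SIZE = EOI + 1


def vision_to_shared(codes):
    """Shift vision codebook indices into the shared id space."""
    return [TEXT_VOCAB_SIZE + c for c in codes]


def build_sequence(text_ids, vision_codes):
    """Concatenate text tokens and image tokens into one flat sequence."""
    return list(text_ids) + [BOI] + vision_to_shared(vision_codes) + [EOI]


def next_token_pairs(sequence):
    """Return (inputs, targets): each target is the token following its input."""
    return sequence[:-1], sequence[1:]


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


seq = build_sequence(text_ids=[5, 17, 42], vision_codes=[0, 3, 511])
inputs, targets = next_token_pairs(seq)

# A predictor that assigns equal scores to every token pays log(VOCAB_SIZE)
# at each position; training pushes this loss down for text and vision alike.
uniform_loss = cross_entropy([0.0] * VOCAB_SIZE, targets[0])
```

The point of the sketch is that, once every modality is an integer id in the same range, the training objective does not need to know which positions are text and which are image: the same shifted-by-one loss applies everywhere.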
Emu3 generates images by predicting the next vision token, one token at a time. Because the output is simply a token sequence, the approach allows for flexible resolutions and styles, making it versatile for various applications. Traditional diffusion models instead start from noise and refine it over many steps; Emu3's generation is causal, with each token conditioned on the tokens that came before it.
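The causal generation loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not Emu3's implementation: `toy_next_token_logits` is a hypothetical stand-in for a trained transformer, the codebook size is made up, and the step that decodes vision tokens back into pixels is omitted entirely.

```python
import math
import random

CODEBOOK_SIZE = 16  # hypothetical vision codebook size


def toy_next_token_logits(prefix):
    """Stand-in for a trained transformer: scores each candidate next token.

    This toy rule simply favours the id after the last token; a real model
    would compute its scores from the whole prefix.
    """
    favoured = (prefix[-1] + 1) % CODEBOOK_SIZE if prefix else 0
    return [2.0 if t == favoured else 0.0 for t in range(CODEBOOK_SIZE)]


def sample(logits, rng, temperature=1.0):
    """Sample a token id from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def generate_image_tokens(height, width, rng):
    """Generate height * width vision tokens one at a time, in raster order."""
    tokens = []
    for _ in range(height * width):
        tokens.append(sample(toy_next_token_logits(tokens), rng))
    # Reshape the flat sequence into rows of a token grid.
    return [tokens[r * width:(r + 1) * width] for r in range(height)]


grid = generate_image_tokens(height=2, width=3, rng=random.Random(0))
larger = generate_image_tokens(height=4, width=4, rng=random.Random(0))
```

Changing `height` and `width` only changes how many tokens the loop emits, which is one way to read the claim about flexible resolutions: the model and the loop stay the same, and only the sequence length differs.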

Emu3's capabilities extend to video generation as well. Unlike Sora, which uses a video diffusion model to generate videos from noise, Emu3 generates videos causally by predicting the next token in a video sequence. Each new token is conditioned on the tokens already generated, which is what ties later content to the context that precedes it.
The significance of Emu3 lies in its ability to unify multimodal tasks under a single, scalable architecture. By focusing on tokens, Emu3 simplifies model design and training: there is one network and one objective rather than a separate pipeline per modality. This could have far-reaching implications for applications ranging from content creation to AI research.
Emu3 represents a significant step forward in the field of multimodal AI. By relying on next-token prediction alone, BAAI Vision has created a model that, by its own reported results, outperforms several specialized models while also simplifying the architecture needed to handle diverse tasks. This approach could pave the way for more general and versatile AI systems in the future.
Original Sources
↗ https://emu.baai.ac.cn/about?utm_source=tldrai
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
1 October 2024