Emu3: Next-Token Prediction Takes On Multimodal Tasks with a Single Transformer

Models & Research

The Engineer

1 Oct 2024 · 3 min read

A new model called Emu3 challenges the dominance of specialized diffusion models in multimodal tasks, suggesting that a single transformer trained with next-token prediction can perform strongly across images, text, and video.

Next-token prediction is widely seen as a promising path toward AGI, but it has struggled to match specialized models on multimodal tasks. Diffusion models such as Stable Diffusion and compositional approaches (e.g., CLIP combined with LLMs) have dominated these domains. A new suite of models called Emu3, developed by BAAI Vision, aims to change that narrative.

Overview

Emu3 trains a single transformer using only next-token prediction. Images, text, and videos are tokenized into a shared discrete space, so the model can handle a mix of modalities without diffusion or compositional architectures. According to its authors, Emu3 outperforms well-established models such as SDXL, LLaVA-1.6, and OpenSora-1.2 in both generation and perception tasks.
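To make the training objective concrete, here is a minimal sketch of a next-token prediction loss in Python. It is not Emu3's training code: the function names, the tiny vocabulary, and the uniform toy model are all assumptions for illustration. The idea is that the model sees a prefix of the token sequence, outputs a probability for every token in the vocabulary, and is penalized by the negative log-probability it assigned to the token that actually comes next.

```python
import math

VOCAB_SIZE = 8  # hypothetical tiny vocabulary, for illustration only


def next_token_loss(token_ids, predict_probs):
    """Average negative log-likelihood of each token given its prefix."""
    total = 0.0
    for t in range(1, len(token_ids)):
        # The model only sees tokens before position t (causal conditioning).
        probs = predict_probs(token_ids[:t])
        total += -math.log(probs[token_ids[t]])
    return total / (len(token_ids) - 1)


def uniform_model(prefix):
    """Toy stand-in for a transformer: ignores the prefix entirely."""
    return [1.0 / VOCAB_SIZE] * VOCAB_SIZE


sequence = [3, 1, 4, 1, 5]  # could be text tokens, vision tokens, or a mix
loss = next_token_loss(sequence, uniform_model)
print(f"loss = {loss:.4f}")
```

A model that guesses uniformly scores a loss of log(VOCAB_SIZE); training pushes the loss below that by assigning more probability to the tokens that actually follow. The same loss applies whether a position holds a text token or a vision token, which is what lets one model cover several modalities.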

Key Technical Details

Tokenization: Images, text, and videos are converted into discrete tokens.
Transformer Architecture: A single transformer model is trained from scratch on a mixture of multimodal sequences.
Scalability: The token-based approach simplifies complex multimodal designs, making it easier to scale both during training and inference.
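The tokenization step can be sketched with a toy example. Everything below is hypothetical: the three-word text vocabulary, the four-entry codebook, the begin/end-of-image markers, and the vocabulary layout are assumptions chosen to keep the example small, not details of Emu3's tokenizer. The sketch maps each image patch to the index of its nearest codebook entry, then places text and vision tokens in one sequence drawn from a shared id space.

```python
# Hypothetical vocabulary layout: text ids first, then vision codes, then markers.
TEXT_VOCAB = {"a": 0, "red": 1, "square": 2}
CODEBOOK = [0.0, 0.33, 0.66, 1.0]  # toy codebook of grayscale patch values
VISION_OFFSET = len(TEXT_VOCAB)    # vision ids start after the text ids
BOI = VISION_OFFSET + len(CODEBOOK)  # begin-of-image marker (assumed)
EOI = BOI + 1                        # end-of-image marker (assumed)


def quantize(value):
    """Return the index of the codebook entry nearest to a patch value."""
    return min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - value))


def tokenize_text(text):
    """Map whitespace-separated words to text token ids."""
    return [TEXT_VOCAB[word] for word in text.split()]


def tokenize_image(patches):
    """Map patch values to vision token ids, wrapped in image markers."""
    return [BOI] + [VISION_OFFSET + quantize(p) for p in patches] + [EOI]


def build_sequence(caption, patches):
    """Concatenate text and vision tokens into one multimodal sequence."""
    return tokenize_text(caption) + tokenize_image(patches)


seq = build_sequence("a red square", [0.1, 0.9, 0.6, 0.3])
print(seq)  # → [0, 1, 2, 7, 3, 6, 5, 4, 8]
```

Once every modality is a list of integers from the same vocabulary, the transformer does not need modality-specific branches: it predicts the next id regardless of what that id represents.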

Image Generation

Emu3 generates images by predicting the next vision token. This approach supports flexible resolutions and styles. Unlike diffusion models, which produce an image by iteratively denoising random noise, Emu3 builds the image causally, one token at a time, with each token conditioned on the tokens before it.

Performance: Reported to outperform SDXL and other flagship image generation models.
Flexibility: Supports a wide range of resolutions and artistic styles.
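Causal image generation can be illustrated with a short sampling loop. This is a sketch under stated assumptions, not Emu3's inference code: the codebook size is made up, and `toy_next_token_probs` is a hypothetical stand-in for the transformer that merely favors repeating the previous token. The loop samples one vision token at a time, feeding the tokens generated so far back in as context, and the requested height and width determine how many tokens are produced.

```python
import random

NUM_VISION_CODES = 16  # hypothetical codebook size


def toy_next_token_probs(prefix):
    """Stand-in for the transformer: favors repeating the previous token."""
    weights = [1.0] * NUM_VISION_CODES
    if prefix:
        weights[prefix[-1]] += 4.0
    total = sum(weights)
    return [w / total for w in weights]


def generate_image_tokens(height, width, rng):
    """Sample height * width vision tokens causally, then reshape into rows."""
    tokens = []
    for _ in range(height * width):
        probs = toy_next_token_probs(tokens)  # conditioned on all prior tokens
        next_token = rng.choices(range(NUM_VISION_CODES), weights=probs, k=1)[0]
        tokens.append(next_token)
    return [tokens[r * width:(r + 1) * width] for r in range(height)]


grid = generate_image_tokens(height=2, width=3, rng=random.Random(0))
for row in grid:
    print(row)
```

Changing `height` and `width` changes only the number of sampling steps, which is one way to read the flexible-resolution point above. In a real system a separate decoder would map the token grid back to pixels; that step is omitted here.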

Video Generation

Emu3 also generates video. Unlike Sora, which uses a video diffusion model to generate videos from noise, Emu3 generates videos causally by predicting the next token in a video sequence. Because each token is conditioned on everything generated before it, the model has the context it needs to keep content consistent over time.

Causal Generation: Predicts the next token in a sequence, conditioning each step on prior context.
Performance: Reported to match or exceed the quality of specialized video generation models.
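For a video to be generated causally, its tokens need a fixed order in the sequence. The sketch below assumes a frame-major raster order (all tokens of the first frame, then the second, and so on); the article does not say how Emu3 orders its video tokens, so this layout and the helper names are assumptions. With this ordering, any prefix of the sequence corresponds to the earliest part of the video, so predicting the next token always moves forward in time.

```python
def flatten_video(frames):
    """Flatten [frame][row][col] token grids into one sequence, earliest frame first."""
    return [tok for frame in frames for row in frame for tok in row]


def unflatten_video(tokens, num_frames, height, width):
    """Inverse of flatten_video: rebuild per-frame token grids."""
    per_frame = height * width
    if len(tokens) != num_frames * per_frame:
        raise ValueError("token count does not match the requested shape")
    frames = []
    for f in range(num_frames):
        chunk = tokens[f * per_frame:(f + 1) * per_frame]
        frames.append([chunk[r * width:(r + 1) * width] for r in range(height)])
    return frames


# Two hypothetical 2x2 frames of vision token ids.
frames = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
sequence = flatten_video(frames)
print(sequence)  # → [1, 2, 3, 4, 5, 6, 7, 8]

# The first height * width tokens are exactly the first frame.
first_frame = unflatten_video(sequence[:4], num_frames=1, height=2, width=2)[0]
print(first_frame)  # → [[1, 2], [3, 4]]
```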

Why It Matters

The significance of Emu3 lies in its ability to unify multimodal tasks under a single, scalable architecture. By focusing on tokens, Emu3 simplifies the design and training process, making it easier to build and deploy general multimodal intelligence. This could have far-reaching implications for applications ranging from content creation to AI research.

Conclusion

Emu3 is a notable step forward for multimodal AI. Using next-token prediction alone, BAAI Vision has built a model that is reported to outperform specialized models while simplifying the architecture needed to handle diverse tasks. This approach could pave the way for more general and versatile AI systems.