
Researchers unveil Mixture-of-Transformers, a new sparse architecture for foundation models that slashes the cost of training multi-modal systems without sacrificing accuracy or versatility.
The landscape of large language models (LLMs) has been rapidly evolving, with a growing focus on multi-modal systems that can process text, images, and speech within a unified framework. However, training these multi-modal models requires significantly larger datasets and computational resources compared to their text-only counterparts. To tackle this challenge, researchers from various institutions have introduced Mixture-of-Transformers (MoT), a sparse multi-modal transformer architecture designed to reduce pretraining computational costs while maintaining high performance.
The key innovation in MoT is its sparse architecture, which decouples the model's non-embedding parameters by modality. Feed-forward networks (FFNs), attention projection matrices, and layer normalization are all modality-specific, so each token is processed by the weights for its own modality. Self-attention itself remains global: every token can still attend to the entire input sequence, whatever mix of text, image, and speech tokens it contains.
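The routing described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `make_params` and `mot_layer`, the single attention head, the ReLU FFN, the tiny dimensions, and the absence of a causal mask are all choices made here for brevity.

```python
import math
import random

def rand_matrix(rows, cols, rng):
    """Small random weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def layer_norm(x, gain):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + 1e-5) for g, v in zip(gain, x)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def make_params(d, d_ff, rng):
    """One full set of non-embedding weights (hypothetical layout).
    The sketch keeps one such set per modality."""
    return {
        "ln1": [1.0] * d, "ln2": [1.0] * d,
        "Wq": rand_matrix(d, d, rng), "Wk": rand_matrix(d, d, rng),
        "Wv": rand_matrix(d, d, rng), "Wo": rand_matrix(d, d, rng),
        "W1": rand_matrix(d_ff, d, rng), "W2": rand_matrix(d, d_ff, rng),
    }

def mot_layer(xs, modalities, params):
    """xs: list of token vectors; modalities: parallel list of modality names."""
    d = len(xs[0])
    # Modality-specific layer norm and Q/K/V projections.
    normed = [layer_norm(x, params[m]["ln1"]) for x, m in zip(xs, modalities)]
    q = [matvec(params[m]["Wq"], h) for h, m in zip(normed, modalities)]
    k = [matvec(params[m]["Wk"], h) for h, m in zip(normed, modalities)]
    v = [matvec(params[m]["Wv"], h) for h, m in zip(normed, modalities)]
    out = []
    for i, m in enumerate(modalities):
        p = params[m]
        # Global self-attention: token i attends to every token,
        # regardless of modality (no causal mask in this sketch).
        scores = [sum(a * b for a, b in zip(q[i], kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        ctx = [sum(w * v[j][c] for j, w in enumerate(weights)) for c in range(d)]
        # Modality-specific output projection, with a residual connection.
        h = [x + a for x, a in zip(xs[i], matvec(p["Wo"], ctx))]
        # Modality-specific layer norm and FFN, with a residual connection.
        hidden = [max(0.0, z) for z in matvec(p["W1"], layer_norm(h, p["ln2"]))]
        out.append([a + b for a, b in zip(h, matvec(p["W2"], hidden))])
    return out

rng = random.Random(0)
d, d_ff = 4, 8
params = {m: make_params(d, d_ff, rng) for m in ("text", "image", "speech")}
modalities = ["text", "text", "image", "image", "speech"]
xs = [[rng.uniform(-1, 1) for _ in range(d)] for _ in modalities]
ys = mot_layer(xs, modalities, params)
print(len(ys), len(ys[0]))
```

Only the attention scores and the weighted sum mix information across modalities. Every weight matrix a token touches is looked up by that token's modality tag, which is the decoupling the paragraph above describes.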
The sparse architecture of MoT significantly reduces the computational requirements for pretraining multi-modal models. This is crucial because training large, dense multi-modal models can be prohibitively expensive in terms of both time and resources.
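To make "sparse" concrete, here is a rough parameter-accounting sketch. It assumes a simplified layer (four attention projections, a two-matrix FFN, two layer-norm gains, no biases) and hypothetical sizes chosen only for illustration. It shows that duplicating the non-embedding weights per modality multiplies the total parameter count while the weights any single token touches stay the same as in a dense model.

```python
def layer_params(d_model, d_ff):
    """Non-embedding weights in one simplified transformer layer (biases ignored)."""
    attention = 4 * d_model * d_model   # Wq, Wk, Wv, Wo
    ffn = 2 * d_model * d_ff            # up and down projections
    norms = 2 * d_model                 # two layer-norm gain vectors
    return attention + ffn + norms

def compare(d_model, d_ff, n_layers, n_modalities):
    dense_total = n_layers * layer_params(d_model, d_ff)
    mot_total = n_modalities * dense_total  # one copy of the weights per modality
    mot_active = dense_total                # each token uses only its own modality's copy
    return dense_total, mot_total, mot_active

# Hypothetical sizes, not taken from the paper.
dense, mot_total, mot_active = compare(d_model=512, d_ff=2048, n_layers=8, n_modalities=3)
print(f"dense total:        {dense:,}")
print(f"MoT total:          {mot_total:,}")
print(f"MoT active / token: {mot_active:,}")
```

Under these assumptions, any saving in pretraining cost would have to come from reaching a target quality in fewer training steps, since the per-token weight count matches the dense model.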
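MoT is evaluated across multiple settings and model scales. According to the paper, in the Chameleon 7B setting (autoregressive text-and-image generation) it matches the dense baseline's performance using only 55.8% of the FLOPs, and when speech is added as a third modality it reaches comparable speech performance with only 37.2% of the FLOPs.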

Mixture-of-Transformers (MoT) is a significant step forward for multi-modal foundation models. By decoupling non-embedding parameters by modality while keeping self-attention global, MoT achieves performance comparable to dense baselines at a substantially lower pretraining cost. That makes it an attractive option for researchers and practitioners who want to build efficient, scalable multi-modal systems.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
12 November 2024