Mixture-of-Transformers: A Sparse and Scalable Multi-Modal Architecture for Foundation Models

Models & Research

The Engineer

12 Nov 2024

Researchers unveil Mixture-of-Transformers, a sparse architecture for foundation models that cuts the pretraining compute of multi-modal systems while matching the quality of dense baselines.

Large language models (LLMs) are increasingly multi-modal: a single model processes text, images, and speech within a unified framework. Training such models requires significantly larger datasets and more compute than training their text-only counterparts. To address this, researchers have introduced Mixture-of-Transformers (MoT), a sparse multi-modal transformer architecture designed to reduce pretraining compute while maintaining performance.

What Changed Technically?

Sparse Architecture for Multi-Modality

The key innovation in MoT is its sparse architecture, which decouples the model's non-embedding parameters by modality. The feed-forward networks (FFNs), attention projection matrices, and layer normalization each exist in one copy per modality, and every token is processed by the copy that matches its modality. Self-attention itself is still computed globally over the entire input sequence.

Modality-Specific Processing: Each modality (text, image, speech) has its own FFNs, attention projections, and normalization layers, so the weights applied to a token are specialized to its data type.
Global Self-Attention: The attention operation itself is not split by modality. Every token attends over the full mixed sequence, so interactions between modalities are still captured.
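The two ideas above can be sketched in a few lines of NumPy. This is an illustrative toy, not the authors' implementation: the single attention head, the ReLU FFN, the causal mask, the tiny dimensions, and every name (`init_params`, `route`, `mot_layer`) are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 8, 16  # hypothetical model width and FFN hidden width
MODALITIES = ("text", "image", "speech")


def init_params():
    """Create one full set of non-embedding weights per modality."""
    def w(*shape):
        return rng.normal(0.0, 0.1, size=shape)
    return {m: {"wq": w(D, D), "wk": w(D, D), "wv": w(D, D), "wo": w(D, D),
                "w1": w(D, H), "w2": w(H, D),
                "g1": np.ones(D), "g2": np.ones(D)}
            for m in MODALITIES}


def layer_norm(x, gain):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return gain * (x - mu) / np.sqrt(var + 1e-5)


def route(x, ids, params, fn):
    """Apply fn(tokens, weights) with each token's own modality weights,
    writing results back at the tokens' original sequence positions."""
    out = np.zeros_like(x)
    for m, weights in params.items():
        mask = ids == m
        out[mask] = fn(x[mask], weights)
    return out


def mot_layer(x, ids, params):
    n = x.shape[0]
    # Modality-specific pre-norm and Q/K/V projections.
    h = route(x, ids, params, lambda t, p: layer_norm(t, p["g1"]))
    q = route(h, ids, params, lambda t, p: t @ p["wq"])
    k = route(h, ids, params, lambda t, p: t @ p["wk"])
    v = route(h, ids, params, lambda t, p: t @ p["wv"])

    # Global self-attention: one softmax over the whole mixed sequence.
    scores = q @ k.T / np.sqrt(D)
    causal = np.tril(np.ones((n, n), dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)
    mixed = attn @ v

    # Modality-specific output projection, then modality-specific norm + FFN.
    x = x + route(mixed, ids, params, lambda t, p: t @ p["wo"])
    x = x + route(x, ids, params,
                  lambda t, p: np.maximum(layer_norm(t, p["g2"]) @ p["w1"], 0.0) @ p["w2"])
    return x


# An interleaved sequence: two text tokens, two image tokens, one speech token.
ids = np.array(["text", "text", "image", "image", "speech"])
params = init_params()
x = rng.normal(size=(len(ids), D))
y = mot_layer(x, ids, params)
print(y.shape)  # (5, 8)
```

Note that only the weight lookups are routed by modality. The attention scores are computed once over all five tokens, which is what lets an image token attend to the text that precedes it.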

Why It Matters

Reduced Computational Costs

The sparse architecture of MoT significantly reduces the computational requirements for pretraining multi-modal models. This is crucial because training large, dense multi-modal models can be prohibitively expensive in terms of both time and resources.

Chameleon 7B Setting: For autoregressive text-and-image generation, MoT matches the performance of the dense baseline using only 55.8% of the FLOPs.
Speech Extension: When extended to include speech, MoT achieves speech performance comparable to the dense baseline with just 37.2% of the FLOPs.

Improved Efficiency in Various Settings

The savings hold across training objectives and model scales, and they also show up when measured in wall-clock time rather than FLOPs.

Transfusion Setting: When text and image are trained with different objectives, a 7B MoT model matches the image modality performance of the dense baseline with one-third of the FLOPs. Additionally, a 760M MoT model outperforms a 1.4B dense baseline across key image generation metrics.
System Profiling: On AWS p4de.24xlarge instances with NVIDIA A100 GPUs, MoT reaches the dense baseline's image quality in 47.2% of the wall-clock time and its text quality in 75.6% of the wall-clock time.
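To make the reported percentages easier to compare, the short script below converts each "fraction of dense cost" into a speedup factor and a saving. The fractions are the ones quoted in this article; the helper names `speedup` and `savings` are made up for the example.

```python
# Fraction of the dense baseline's cost that MoT needs, as reported above.
reported = {
    "Chameleon 7B, FLOPs": 0.558,
    "Speech extension, FLOPs": 0.372,
    "Transfusion 7B (image), FLOPs": 1 / 3,
    "Image quality, wall-clock time": 0.472,
    "Text quality, wall-clock time": 0.756,
}


def speedup(fraction):
    """Turn 'needs this fraction of the dense cost' into 'N times cheaper'."""
    return 1.0 / fraction


def savings(fraction):
    """Share of the dense cost that is avoided."""
    return 1.0 - fraction


for name, frac in reported.items():
    print(f"{name}: {speedup(frac):.2f}x cheaper, {savings(frac):.1%} saved")
```

Read this way, 55.8% of the FLOPs is roughly a 1.8x reduction and 37.2% is roughly 2.7x, while the wall-clock figures correspond to about 2.1x for image quality and 1.3x for text quality.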

Implementation Details

Decoupling Non-Embedding Parameters: Each modality gets its own copy of the FFNs, attention projection matrices, and normalization layers, and a token is processed only by the copy for its modality.
Global Self-Attention Mechanism: Attention is computed over the full sequence regardless of modality, which preserves cross-modal interactions and shared context.
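A back-of-the-envelope parameter count shows why this design is called sparse. The sketch below assumes a plain transformer layer without biases and uses a made-up model shape. Neither is taken from the paper, and `layer_params` and `model_params` are hypothetical helpers.

```python
def layer_params(d_model, d_ff):
    """Non-embedding weights in one transformer layer (biases omitted)."""
    attention = 4 * d_model * d_model  # Q, K, V and output projections
    ffn = 2 * d_model * d_ff           # up- and down-projection
    norms = 2 * d_model                # two layer-norm gain vectors
    return attention + ffn + norms


def model_params(n_layers, d_model, d_ff, n_modalities=1):
    """Return (total stored weights, weights any single token passes through)."""
    active_per_token = n_layers * layer_params(d_model, d_ff)
    total = n_modalities * active_per_token
    return total, active_per_token


# Hypothetical shape, chosen only for illustration.
dense_total, dense_active = model_params(n_layers=4, d_model=512, d_ff=2048)
mot_total, mot_active = model_params(n_layers=4, d_model=512, d_ff=2048, n_modalities=3)

print(f"dense: total={dense_total:,} active per token={dense_active:,}")
print(f"MoT:   total={mot_total:,} active per token={mot_active:,}")
```

Under these assumptions, a three-modality MoT stores three times the non-embedding weights of the dense model, but each token still touches only one copy, so the weights used per token are unchanged.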

Benchmarks

Chameleon 7B Setting:
- Performance: Matches dense baseline
- FLOPs: 55.8% of dense baseline
Speech Extension:
- Performance: Comparable to dense baseline
- FLOPs: 37.2% of dense baseline
Transfusion Setting:
- 7B MoT Model:
  - Image Modality Performance: Matches dense baseline
  - FLOPs: One-third of dense baseline
- 760M MoT Model:
  - Outperforms 1.4B dense baseline in key image generation metrics

Conclusion

Mixture-of-Transformers (MoT) shows that multi-modal foundation models do not have to pay the full cost of dense training. By decoupling non-embedding parameters by modality while keeping self-attention global, MoT matches dense baselines using between roughly one-third and 55.8% of the FLOPs in the settings reported here. This makes it an attractive option for researchers and practitioners looking to build efficient and scalable multi-modal systems.