Distilling and Accelerating Hybrid Models with Linear RNNs: The Mamba in the Llama

The Engineer

30 Aug 2024

This study shows how large Transformer models can be distilled into efficient linear RNNs based on the Mamba architecture, producing hybrid models that perform comparably to the original Llama model on chat benchmarks while reducing inference cost.

Recent research from Junxiong Wang, Daniele Paliotta, Avner May, Alexander M. Rush, and Tri Dao explores how to distill large Transformer models into linear recurrent neural networks (RNNs), specifically the Mamba architecture. The paper, titled "The Mamba in the Llama: Distilling and Accelerating Hybrid Models," is available on arXiv and was accepted to NeurIPS 2024.

Technical Changes and Why They Matter

1. Distillation of Transformers into Linear RNNs

Key Insight: The authors demonstrate that it is feasible to distill large Transformer models, such as Llama3-8B-Instruct, into linear RNNs (Mamba) using academic GPU resources.
Why It Matters: Distillation yields a model that is cheaper to serve while keeping quality close to the original, and it avoids pretraining a linear RNN from scratch. Linear RNNs have practical advantages over Transformers:
- Inference Efficiency: A linear RNN carries a fixed-size state from token to token, so per-token compute and memory stay constant as the sequence grows, whereas a Transformer's key-value cache grows with every generated token.
- Training Cost: Because the distilled model starts from a pretrained Transformer, it can be obtained with academic-scale GPU resources rather than a full pretraining run.
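
One way to see why a linear RNN is cheap at inference time is to note that causal attention without the softmax can be computed as a recurrence over a fixed-size state. The sketch below is a minimal pure-Python illustration of that equivalence, not the paper's implementation: the function names are hypothetical, and the learned, input-dependent gating that Mamba applies to its state update is deliberately left out.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def causal_linear_attention(Q, K, V):
    """Softmax-free causal attention: y_t = sum over s <= t of (q_t . k_s) * v_s."""
    outputs = []
    for t, q_t in enumerate(Q):
        y_t = [0.0] * len(V[0])
        for s in range(t + 1):            # re-reads every earlier key and value
            weight = dot(q_t, K[s])
            y_t = [y + weight * v for y, v in zip(y_t, V[s])]
        outputs.append(y_t)
    return outputs

def linear_rnn(Q, K, V):
    """Same outputs from a recurrence whose state S has fixed size d_k x d_v."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q_t, k_t, v_t in zip(Q, K, V):
        for i in range(d_k):              # state update: S += outer(k_t, v_t)
            for j in range(d_v):
                S[i][j] += k_t[i] * v_t[j]
        # readout: y_t = q_t^T S, independent of how many tokens came before
        outputs.append([sum(q_t[i] * S[i][j] for i in range(d_k)) for j in range(d_v)])
    return outputs

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
Q, K, V = random_matrix(rng, 6, 4), random_matrix(rng, 6, 4), random_matrix(rng, 6, 3)
attn_out = causal_linear_attention(Q, K, V)
rnn_out = linear_rnn(Q, K, V)
max_gap = max(abs(a - b) for ra, rb in zip(attn_out, rnn_out) for a, b in zip(ra, rb))
print(f"max difference between the two forms: {max_gap:.2e}")
```

The attention form touches all earlier tokens at every step, while the recurrent form only updates and reads `S`, which is why its per-token cost does not grow with sequence length.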

2. Hybrid Model Architecture

Architecture Details:
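- The hybrid model retains a quarter of the attention layers from the original Transformer; the remaining attention layers are replaced with linear RNN (Mamba) layers.
- It reuses the linear projection weights from the replaced attention layers to initialize the linear RNN layers, so distillation starts from the Transformer's learned weights rather than from a random initialization.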
Performance Benchmarks:
- In chat benchmarks, the hybrid model performs comparably to the original Transformer.
- It outperforms open-source hybrid Mamba models trained from scratch with trillions of tokens in both chat and general benchmarks.
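
The layer retention and weight reuse described under Architecture Details can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' code: the even spacing of retained attention layers, the dictionary keys, and both function names are hypothetical. The mapping itself (query projection to the readout `C`, key projection to the input map `B`, value projection to the input `x`) follows the correspondence between softmax-free attention and a linear RNN.

```python
import copy

def plan_hybrid_layers(num_layers, attention_fraction=0.25):
    """Label each layer 'attention' (kept) or 'linear_rnn' (replaced).

    Assumption: kept attention layers are evenly spaced. The source only
    states that a quarter of the attention layers are retained.
    """
    stride = round(1 / attention_fraction)
    return ["attention" if i % stride == 0 else "linear_rnn"
            for i in range(num_layers)]

def init_linear_rnn_from_attention(attn):
    """Build a linear RNN layer's parameters from an attention layer's projections.

    `attn` is a dict with hypothetical keys 'W_q', 'W_k', 'W_v', 'W_o'.
    """
    return {
        "C_proj": copy.deepcopy(attn["W_q"]),    # query projection -> state readout
        "B_proj": copy.deepcopy(attn["W_k"]),    # key projection   -> state input map
        "x_proj": copy.deepcopy(attn["W_v"]),    # value projection -> RNN input
        "out_proj": copy.deepcopy(attn["W_o"]),  # output projection is reused as-is
        # Parameters with no attention counterpart are new and must be learned.
        "new_params": ["A", "dt"],
    }

# Toy example: a hypothetical 32-layer model with tiny 2x2 projections.
plan = plan_hybrid_layers(32)
attn_layer = {name: [[1.0, 0.0], [0.0, 1.0]] for name in ("W_q", "W_k", "W_v", "W_o")}
rnn_layer = init_linear_rnn_from_attention(attn_layer)
print(plan.count("attention"), "attention layers kept,",
      plan.count("linear_rnn"), "replaced")
```

Copying the projections gives the new layers a starting point that already approximates the attention layer's behavior, which is a plausible reason distillation is feasible on a modest compute budget.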

3. Hardware-Aware Speculative Decoding Algorithm

Key Feature: The authors introduce a speculative decoding algorithm that accelerates inference speed for Mamba and hybrid models.
Why It Matters:
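- The algorithm is hardware-aware: it is designed around how recurrent models execute on GPUs. Speculative decoding is harder for a recurrent model than for a Transformer, because rejecting drafted tokens means restoring an earlier recurrent state, whereas a Transformer only has to truncate its key-value cache.
- Faster inference makes these models more practical for real-world applications.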
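
To make the state-handling problem concrete, here is a toy sketch of greedy speculative decoding with recurrent models. It is not the authors' hardware-aware algorithm: the two "models" are hypothetical integer functions, verification runs step by step rather than in a fused GPU kernel, and the bonus token that some variants emit after a fully accepted draft is omitted. What it does show is the bookkeeping: intermediate states are cached during verification so that decoding can resume from the last accepted position.

```python
def target_step(state, token):
    """Hypothetical large model: consume a token, return (new_state, next_token)."""
    new_state = (31 * state + token + 7) % 101
    return new_state, new_state % 5

def draft_step(state, token):
    """Hypothetical small model: agrees with the target most of the time."""
    new_state = (31 * state + token + 7) % 101
    guess = new_state % 5
    if new_state % 3 == 0:                 # deliberate disagreements
        guess = (guess + 1) % 5
    return new_state, guess

def greedy_decode(step, first_token, n):
    state, tok, out = 0, first_token, []
    for _ in range(n):
        state, tok = step(state, tok)
        out.append(tok)
    return out

def speculative_decode(first_token, n, k=4):
    # Invariant: both states have consumed every token before `tok`.
    t_state = d_state = 0
    tok, out = first_token, []
    while len(out) < n:
        # 1. Draft k tokens, caching the draft state after each consumed input.
        inputs, drafts, d_states = [], [], []
        s, t = d_state, tok
        for _ in range(k):
            inputs.append(t)
            s, t = draft_step(s, t)
            d_states.append(s)
            drafts.append(t)
        # 2. Verify all k positions with the target (one parallel pass in a
        #    real system), caching every intermediate target state.
        t_states, preds = [], []
        s = t_state
        for inp in inputs:
            s, p = target_step(s, inp)
            t_states.append(s)
            preds.append(p)
        # 3. Accept the matching prefix, plus the target's correction if any.
        m = 0
        while m < k and preds[m] == drafts[m]:
            m += 1
        keep = min(m + 1, k)
        out.extend(preds[:keep])
        # 4. Resume from the cached states at the last accepted position.
        t_state, d_state, tok = t_states[keep - 1], d_states[keep - 1], preds[keep - 1]
    return out[:n]

reference = greedy_decode(target_step, 1, 50)
speculated = speculative_decode(1, 50)
print("outputs identical:", speculated == reference)
```

Because every emitted token is one the target model would have produced itself, the output matches plain greedy decoding; the gain in a real system comes from verifying several drafted tokens in a single pass of the large model.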

Results and Benchmarks

Top-Performing Model:

Distilled from Llama3-8B-Instruct:
- Achieves a 29.61 length-controlled win rate on AlpacaEval 2 against GPT-4.
- Scores 7.35 on MT-Bench.
- Both results surpass the best 8B-scale instruction-tuned linear RNN model.

Length Extrapolation:

The distilled model shows almost perfect accuracy on the needle-in-a-haystack test at 20 times the sequence length used during distillation, which indicates that it extrapolates to inputs much longer than those it was distilled on.
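
For readers unfamiliar with the needle-in-a-haystack test, the sketch below shows how such a prompt is typically built: a single fact is buried at a chosen depth inside long filler text, and the model is asked to retrieve it. The helper names, filler text, and scoring rule are hypothetical and are not taken from the paper's evaluation setup.

```python
def build_needle_prompt(filler, needle, num_filler_sentences, depth, question):
    """Place `needle` at a relative `depth` (0.0 = start, 1.0 = end) in filler text."""
    position = round(depth * num_filler_sentences)
    sentences = [filler] * num_filler_sentences
    sentences.insert(position, needle)
    return " ".join(sentences) + "\n\n" + question

def score_answer(answer, expected):
    """Count the retrieval as correct if the expected string appears in the answer."""
    return expected in answer

filler = "The sky was clear and the market was busy."
needle = "The secret number is 7421."
prompt = build_needle_prompt(
    filler=filler,
    needle=needle,
    num_filler_sentences=200,   # raise this to test longer contexts
    depth=0.5,                  # needle placed halfway through the haystack
    question="What is the secret number mentioned in the text?",
)
print(len(prompt.split()), "words in the prompt")
```

Sweeping `num_filler_sentences` and `depth` and recording `score_answer` for each combination gives the usual accuracy grid over context length and needle position.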

Open-Sourcing

The authors have open-sourced both the code and pre-trained checkpoints for their models:

Code Repository: GitHub - MambaInLlama
Speculative Decoding Algorithm: GitHub - speculative_mamba

Conclusion

This research highlights the potential of distilling large Transformer models into linear RNNs based on the Mamba architecture. By retaining a quarter of the attention layers and reusing projection weights from the original model, the authors achieve chat-benchmark performance comparable to the original Transformer with a cheaper inference path. The hardware-aware speculative decoding algorithm further speeds up generation, making these hybrid models more practical to deploy, including in resource-constrained environments.