Yuan 2.0-M32: Mixture of Experts with Attention Router Boosts Efficiency and Performance

The Engineer

31 May 2024

Yuan 2.0-M32 slashes computational costs and boosts performance with a novel attention router in its mixture-of-experts design, outpacing larger models without breaking the bank.

Yuan 2.0-M32, a new model in the Yuan family, improves compute efficiency and performance by pairing a mixture-of-experts (MoE) architecture with a new attention router. It is notable for surpassing the much larger Llama3-70B on the MATH and ARC-Challenge benchmarks while activating only 3.7B of its 40B parameters per token.

What Changed?

Yuan 2.0-M32 builds on the base architecture of Yuan-2.0 2B and introduces two key changes:

Mixture-of-Experts (MoE) with 32 Experts: The model has 32 experts, of which only 2 are activated per token. Because each token passes through a small subset of the experts, compute per token stays far below what the 40B total parameter count would suggest.
Attention Router: A new router network, the Attention Router, selects the active experts. The paper reports a 3.8% accuracy improvement over a model with a classical router network.

Why It Matters

For practitioners, Yuan 2.0-M32 offers a compelling trade-off between efficiency and performance:

Compute Efficiency: Training requires only 9.25% of the compute needed for a dense model at the same parameter scale.
Competitive Performance: Despite activating only 3.7B of its 40B parameters, Yuan 2.0-M32 is competitive with larger models like Llama3-70B and surpasses it on the MATH and ARC-Challenge benchmarks.
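
The two figures above line up in a simple way: 3.7B active parameters out of 40B total is exactly 9.25%. Whether the 9.25% training-compute figure was derived from this ratio is an assumption here, not something the article states; the short sketch below only reproduces the arithmetic from the quoted numbers.

```python
# Figures quoted in the article; nothing here is measured.
TOTAL_PARAMS_B = 40.0    # total parameters, in billions
ACTIVE_PARAMS_B = 3.7    # parameters active per token, in billions
NUM_EXPERTS = 32
ACTIVE_EXPERTS = 2


def share(part: float, whole: float) -> float:
    """Return part/whole as a fraction between 0 and 1."""
    return part / whole


param_share = share(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
expert_share = share(ACTIVE_EXPERTS, NUM_EXPERTS)

print(f"active parameter share: {param_share:.2%}")
print(f"active expert share:    {expert_share:.2%}")
```

The expert share (2 of 32) comes out lower than the parameter share. A likely reason is that attention layers and embeddings are not split across experts, so they count as active for every token.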

Technical Details

Architecture

Base Model: A transformer architecture similar to that of Yuan-2.0 2B.
MoE Setup:
- Total Experts: 32
- Active Experts per Token: 2
- Parameter Distribution: Of the 40B total parameters, only 3.7B are active for any given token.
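
To make the "32 experts, 2 active" setup concrete, here is a minimal sketch of generic top-2 sparse routing for a single token. It is not Yuan 2.0-M32's implementation: the hidden size `D_MODEL`, the random weights, the linear dot-product router, and the single-matrix "experts" are all toy assumptions chosen for illustration (real experts are feed-forward networks).

```python
import math
import random

NUM_EXPERTS = 32   # from the article
TOP_K = 2          # from the article
D_MODEL = 8        # toy hidden size, assumed for illustration only


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def make_matrix(rows, cols, rng):
    """Random toy weights; stands in for trained parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def moe_forward(token, router_w, experts, top_k=TOP_K):
    """Route one token to its top_k experts and mix their outputs."""
    scores = matvec(router_w, token)  # one score per expert
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    gates = softmax([scores[i] for i in chosen])  # renormalise over chosen experts
    out = [0.0] * len(token)
    for gate, idx in zip(gates, chosen):
        expert_out = matvec(experts[idx], token)  # only chosen experts are evaluated
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out, chosen


rng = random.Random(0)
router_w = make_matrix(NUM_EXPERTS, D_MODEL, rng)
experts = [make_matrix(D_MODEL, D_MODEL, rng) for _ in range(NUM_EXPERTS)]
token = [rng.gauss(0.0, 1.0) for _ in range(D_MODEL)]

output, chosen = moe_forward(token, router_w, experts)
print("experts used:", chosen, "of", NUM_EXPERTS)
```

The other 30 expert matrices are never multiplied for this token, which is where the compute saving comes from; all 32 still have to be stored.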

Router Network

Attention Router: This router uses an attention mechanism to select the most relevant experts for each token. According to the paper, a classical router scores each expert independently, whereas the Attention Router takes the correlation between experts into account when choosing them.
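
The article gives no formula for the router, so the following is only one plausible reading of attention-style expert scoring, not a confirmed reproduction of the Attention Router. In this sketch the token is projected into three per-expert vectors (query, key, value), each expert's score is an attention-weighted sum over all experts' values, and the top 2 are kept. The projection names `w_q`, `w_k`, `w_v`, the toy size `D_MODEL`, and the random weights are all hypothetical.

```python
import math
import random

NUM_EXPERTS = 32   # from the article
TOP_K = 2          # from the article
D_MODEL = 8        # toy hidden size, assumed for illustration only


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def make_matrix(rows, cols, rng):
    """Random toy weights; stands in for trained parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def attention_router(token, w_q, w_k, w_v, top_k=TOP_K):
    """Score experts with attention across experts, then keep the top_k."""
    q = matvec(w_q, token)  # one query entry per expert
    k = matvec(w_k, token)  # one key entry per expert
    v = matvec(w_v, token)  # one value entry per expert
    scores = []
    for q_i in q:
        # Expert i attends over every expert j, so its score depends on the others.
        attn = softmax([q_i * k_j for k_j in k])
        scores.append(sum(a * v_j for a, v_j in zip(attn, v)))
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    gates = softmax([scores[i] for i in chosen])  # mixing weights for chosen experts
    return chosen, gates


rng = random.Random(0)
w_q = make_matrix(NUM_EXPERTS, D_MODEL, rng)
w_k = make_matrix(NUM_EXPERTS, D_MODEL, rng)
w_v = make_matrix(NUM_EXPERTS, D_MODEL, rng)
token = [rng.gauss(0.0, 1.0) for _ in range(D_MODEL)]

chosen, gates = attention_router(token, w_q, w_k, w_v)
print("selected experts:", chosen)
print("gate weights:", [round(g, 3) for g in gates])
```

The contrast with a plain dot-product router is that each expert's score here depends on every other expert's key and value, rather than being computed in isolation.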

Training

Data: Trained from scratch using 2000B tokens.
Compute Consumption: Only 9.25% of the compute required for a dense model with the same parameter scale.
Forward Computation: 7.4 GFlops per token, about 1/19 of the 140.6 GFlops per token needed by Llama3-70B.
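
As a quick check on the forward-computation figures, the sketch below computes the ratio between the two per-token costs and scales both to a prompt. The 1,000-token prompt length is an arbitrary example, and treating cost as linear in token count is a simplification that ignores any sequence-length-dependent terms.

```python
# Per-token forward cost quoted in the article, in GFlops.
YUAN_M32_GFLOPS_PER_TOKEN = 7.4
LLAMA3_70B_GFLOPS_PER_TOKEN = 140.6


def forward_gflops(num_tokens: int, gflops_per_token: float) -> float:
    """Approximate forward cost, assuming cost scales linearly with tokens."""
    return num_tokens * gflops_per_token


ratio = LLAMA3_70B_GFLOPS_PER_TOKEN / YUAN_M32_GFLOPS_PER_TOKEN
prompt_tokens = 1000  # hypothetical prompt length

yuan_cost = forward_gflops(prompt_tokens, YUAN_M32_GFLOPS_PER_TOKEN)
llama_cost = forward_gflops(prompt_tokens, LLAMA3_70B_GFLOPS_PER_TOKEN)

print(f"Llama3-70B / Yuan 2.0-M32 per-token cost: {ratio:.1f}x")
print(f"{prompt_tokens} tokens: {yuan_cost:.0f} vs {llama_cost:.0f} GFlops")
```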

Performance Benchmarks

MATH Benchmark:
- Yuan 2.0-M32: 55.89% accuracy
- Llama3-70B: Lower accuracy (specific numbers not provided)
ARC-Challenge Benchmark:
- Yuan 2.0-M32: 95.8% accuracy
- Llama3-70B: Lower accuracy (specific numbers not provided)

Key Takeaways

Efficiency: Yuan 2.0-M32 is a highly compute-efficient model, making it suitable for resource-constrained environments.
Performance: It is competitive with much larger models and surpasses Llama3-70B on the MATH and ARC-Challenge benchmarks.
Flexibility: The router picks 2 of the 32 experts for each token, so different inputs draw on different experts while the compute spent per token stays fixed.