Qwen2.5-Max: A Large-scale MoE Model Trained on 20 Trillion Tokens

Models & Research

The Engineer

30 Jan 2025 · 3 min read

Training massive models like Qwen2.5-Max on 20 trillion tokens pushes the boundaries of AI scalability, offering new insights into efficient MoE architectures and fine-tuning techniques that could revolutionize natural language processing.

January 28, 2025 · 3 min read · 561 words · By Qwen Team

The quest for more intelligent models often involves scaling both data and model sizes. However, the practical challenges of training extremely large models-whether dense or Mixture-of-Expert (MoE)-remain significant. Recent advancements in this area, such as DeepSeek V3's detailed disclosures, have provided valuable insights. In parallel, we've been developing Qwen2.5-Max, a large-scale MoE model trained on over 20 trillion tokens and further refined with Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF). Today, we're excited to share the performance results of Qwen2.5-Max and announce its availability via Alibaba Cloud's API.

Performance Highlights

Qwen2.5-Max has been evaluated against leading models, both proprietary and open-weight, across a range of benchmarks that are crucial for assessing model intelligence. These include:

MMLU-Pro: College-level knowledge tests
LiveCodeBench: Coding capabilities assessment
LiveBench: General capability evaluation
Arena-Hard: Human preference approximation

We focus on the performance of instruct models, which are suitable for downstream applications like chat and coding. Here’s how Qwen2.5-Max stacks up against state-of-the-art models:

Qwen2.5-Max outperforms DeepSeek V3 in several key benchmarks:
- Arena-Hard: Human preference approximation
- LiveBench: General capability evaluation
- LiveCodeBench: Coding capabilities assessment
- GPQA-Diamond: Knowledge test
Competitive Results:
- Qwen2.5-Max also shows competitive performance in other assessments, including MMLU-Pro.

When comparing base models, we face limitations due to proprietary restrictions on models like GPT-4o and Claude-3.5-Sonnet. Therefore, our evaluations include:

Qwen2.5-Max vs. DeepSeek V3
Llama-3.1-405B: The largest open-weight dense model
Qwen2.5-72B: Another top-tier open-weight model

Technical Details and Training Process

Qwen2.5-Max leverages the MoE architecture to efficiently scale to a large number of parameters without incurring excessive computational costs. Key technical details include:

Training Data:
- Pretrained on over 20 trillion tokens
- Post-trained with curated SFT and RLHF methodologies
Model Architecture:
- Utilizes an MoE approach, which dynamically routes inputs to expert sub-networks based on their relevance
- This architecture allows for efficient scaling by distributing the computational load across multiple experts
Quantization:
- Applied quantization techniques to reduce model size and improve inference efficiency without significant loss in performance
Benchmarks:
- MMLU-Pro: College-level knowledge test, evaluating a wide range of subjects
- LiveCodeBench: Assessing coding capabilities through practical tasks
- LiveBench: Comprehensive evaluation of general capabilities, including reasoning and problem-solving
- Arena-Hard: Approximating human preferences in various scenarios

Availability and Future Directions

Qwen2.5-Max is now available via Alibaba Cloud's API, making it accessible for developers and researchers to integrate into their applications. We also invite you to explore Qwen2.5-Max on Qwen Chat for a hands-on experience.

Looking ahead, we will continue to refine and expand Qwen2.5-Max's capabilities, focusing on further improvements in efficiency, performance, and usability. We believe that large-scale MoE models like Qwen2.5-Max represent a significant step forward in the quest for more intelligent AI systems.