Qwen1.5-32B: A Balanced 30B-Parameter Model for Performance and Efficiency

The Engineer

8 Apr 2024

Qwen1.5-32B addresses the open-source community's need for a model that excels in performance while maintaining efficiency and affordability, striking a balance between resource-intensive giants like Qwen1.5-72B and lighter alternatives.

The open-source community has been looking for a model that balances performance, efficiency, and memory footprint. Models such as Qwen1.5-72B and DBRX have pushed the boundaries of what open models can do, but they come with significant drawbacks: high memory consumption, slow inference, and expensive fine-tuning.

In response to this challenge, the Qwen team is excited to introduce the latest additions to the Qwen1.5 language model series: Qwen1.5-32B and Qwen1.5-32B-Chat. These models aim to hit the "sweet spot" of around 30 billion parameters, offering strong performance while keeping resource requirements manageable.

Technical Highlights

Model Architecture

Qwen1.5-32B is built on the same architecture as its predecessors but with a few key enhancements:

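Grouped Query Attention (GQA): Query heads are divided into groups, and each group shares a single key/value head. This shrinks the key-value (KV) cache and the memory traffic per generated token, which improves inference efficiency when serving the model.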
Memory Efficiency: The model is optimized to reduce memory usage, making it more feasible for deployment in resource-constrained environments.
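To make the GQA idea concrete, here is a minimal pure-Python sketch of its bookkeeping under illustrative assumptions: it shows how query heads map onto shared key/value heads and how that sharing shrinks the KV cache compared with standard multi-head attention (MHA). It does not compute attention itself. The layer count, head counts, head dimension, and context length are hypothetical example values, not figures from this post; the real ones are in the model's published configuration.

```python
# Minimal sketch of grouped-query attention (GQA) bookkeeping.
# All sizes below are hypothetical illustration values, NOT Qwen1.5-32B's actual config.


def kv_head_for_query_head(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    """Return the index of the shared key/value head used by a query head.

    Query heads are split into num_kv_heads contiguous groups of equal size;
    every query head in a group attends using the same K/V head.
    """
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be divisible by num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1, bytes_per_value: int = 2) -> int:
    """Size of the KV cache in bytes.

    Keys and values (factor 2) are stored for every layer, KV head, position,
    and sequence in the batch. bytes_per_value=2 assumes fp16/bf16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


if __name__ == "__main__":
    # Hypothetical example configuration.
    layers, q_heads, kv_heads, head_dim, seq_len = 48, 32, 8, 128, 4096

    mapping = [kv_head_for_query_head(h, q_heads, kv_heads) for h in range(q_heads)]
    print("query head -> KV head:", mapping)

    mha = kv_cache_bytes(layers, q_heads, head_dim, seq_len)   # one K/V head per query head
    gqa = kv_cache_bytes(layers, kv_heads, head_dim, seq_len)  # K/V heads shared by groups
    print(f"MHA KV cache: {mha / 2**30:.2f} GiB")
    print(f"GQA KV cache: {gqa / 2**30:.2f} GiB ({mha // gqa}x smaller)")
```

With these example numbers, the cache shrinks by the ratio of query heads to KV heads: 3.00 GiB for MHA versus 0.75 GiB for GQA at a 4,096-token context, a 4x reduction.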

Performance Benchmarks

Qwen1.5-32B has been evaluated against other state-of-the-art (SOTA) models of comparable size, as well as the larger Qwen1.5-72B. Here’s how it stacks up:

| Model | MMLU | C-Eval | GSM8K | MATH | HumanEval | MBPP | BBH | CMMLU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama2-34B | 62.6 | - | 42.2 | 6.2 | 22.6 | 33.0 | 44.1 | - |
| Yi-34B | 76.3 | 81.4 | 67.2 | 14.4 | 23.2 | 41.0 | 54.3 | 83.7 |
| Mixtral-8x7B | 70.6 | - | 74.4 | 28.4 | 40.2 | 60.7 | - | - |
| Qwen1.5-72B | 77.5 | 84.1 | 79.5 | 34.1 | 41.5 | 53.4 | 65.5 | 83.5 |
| Qwen1.5-32B | 73.4 | 83.5 | 77.4 | 36.1 | 37.2 | 49.4 | 66.8 | 82.3 |
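One way to read the table is to express each Qwen1.5-32B score as a percentage of the Qwen1.5-72B score on the same benchmark. The short Python sketch below does that arithmetic using only the numbers in the table; the ratio is an illustrative summary, not a metric the Qwen team reports.

```python
# Scores copied from the benchmark table (order matches the table's columns).
BENCHMARKS = ["MMLU", "C-Eval", "GSM8K", "MATH", "HumanEval", "MBPP", "BBH", "CMMLU"]
QWEN_72B = [77.5, 84.1, 79.5, 34.1, 41.5, 53.4, 65.5, 83.5]
QWEN_32B = [73.4, 83.5, 77.4, 36.1, 37.2, 49.4, 66.8, 82.3]


def relative_scores(small, large):
    """Return small/large for each benchmark as a percentage rounded to 1 decimal."""
    return [round(100 * s / l, 1) for s, l in zip(small, large)]


if __name__ == "__main__":
    ratios = relative_scores(QWEN_32B, QWEN_72B)
    for name, pct in zip(BENCHMARKS, ratios):
        print(f"{name:>9}: {pct:5.1f}% of Qwen1.5-72B")
    print(f"mean: {sum(ratios) / len(ratios):.1f}%")
```

With the table's numbers, Qwen1.5-32B lands between roughly 89.6% (HumanEval) and 105.9% (MATH) of Qwen1.5-72B's score, and it exceeds the larger model on MATH and BBH.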

Key Takeaways

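Competitive Performance: Qwen1.5-32B is competitive across a range of benchmarks, including MMLU, GSM8K, and HumanEval. It trails Qwen1.5-72B by 0.6 to 4.3 points on six of the eight benchmarks and slightly outscores it on MATH (36.1 vs. 34.1) and BBH (66.8 vs. 65.5).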
Efficiency: GQA and the model's memory optimizations make it a more efficient choice for deployment than larger models such as Qwen1.5-72B.
Scalability: The model is designed to be scalable, making it suitable for both research and production environments.

Post-Training Techniques

To enhance the conversational capabilities of Qwen1.5-32B, we have focused on post-training techniques:

Reinforcement Learning from Human Feedback (RLHF): This approach has been instrumental in improving the model’s ability to generate more human-like responses and handle complex conversations.
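The post does not describe the details of Qwen's RLHF recipe. For orientation only, the sketch below shows one standard building block of RLHF pipelines in general: training a reward model on human preference pairs with a pairwise (Bradley-Terry style) loss, which pushes the reward of the preferred response above that of the rejected one. The function name and reward values are hypothetical; this is a generic illustration, not the Qwen team's implementation.

```python
import math


def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry style preference loss for training an RLHF reward model.

    loss = -log(sigmoid(r_chosen - r_rejected)). It is small when the reward
    model scores the human-preferred response well above the rejected one.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)). Branch on the sign of m so that
    # exp() only ever sees a non-positive argument and cannot overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Hypothetical reward-model scores for one prompt with two candidate replies.
    print(pairwise_reward_loss(2.0, -1.0))  # preferred reply scored higher: low loss
    print(pairwise_reward_loss(-1.0, 2.0))  # ranking inverted: high loss
```

In a full RLHF pipeline, a reward model trained this way then provides the signal for a policy-optimization step on the language model itself.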

Community and Availability

Qwen1.5-32B is fully open-source, allowing researchers and developers to experiment with and contribute to its development. The model is available on multiple platforms:

GitHub: https://github.com/QwenLM/Qwen1.5
Hugging Face: https://huggingface.co/Qwen
ModelScope: https://modelscope.cn/organization/qwen
Demo: [