Qwen2: The Multilingual, High-Performance Evolution of Qwen1.5

The Engineer

7 Jun 2024 · 3 min read

Qwen2 debuts with five model sizes, led by a 72-billion-parameter variant, and brings broader multilingual coverage along with stronger benchmark results in coding and math.

After months of intensive development, the Qwen team is excited to announce the release of Qwen2, a significant leap forward from its predecessor, Qwen1.5. This new iteration brings enhanced multilingual support, state-of-the-art benchmark performance, and improved capabilities in coding and mathematics. Let's dive into the technical details and what they mean for practitioners.

Key Technical Changes

Model Sizes and Architecture

Qwen2 introduces five different model sizes: Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and the largest, Qwen2-72B. Each model is designed to cater to a range of use cases, from resource-constrained environments to high-performance applications. Here's a breakdown:

Qwen2-0.5B: 0.49B parameters, 0.35B non-embedding parameters
Qwen2-1.5B: 1.54B parameters, 1.31B non-embedding parameters
Qwen2-7B: 7.07B parameters, 5.98B non-embedding parameters
Qwen2-57B-A14B: 57.41B parameters, 56.32B non-embedding parameters
Qwen2-72B: 72.71B parameters, 70.21B non-embedding parameters
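
The gap between the two figures on each line is the number of embedding parameters. As a rough illustration (the dictionary and helper names are made up for this sketch, and the numbers are the rounded values listed above), the arithmetic looks like this:

```python
# Parameter counts in billions, copied from the list above:
# (total parameters, non-embedding parameters)
PARAMS = {
    "Qwen2-0.5B": (0.49, 0.35),
    "Qwen2-1.5B": (1.54, 1.31),
    "Qwen2-7B": (7.07, 5.98),
    "Qwen2-57B-A14B": (57.41, 56.32),
    "Qwen2-72B": (72.71, 70.21),
}


def embedding_share(total, non_embedding):
    """Return (embedding params in billions, embedding fraction of total)."""
    embedding = total - non_embedding
    return embedding, embedding / total


for name, (total, non_emb) in PARAMS.items():
    emb, share = embedding_share(total, non_emb)
    print(f"{name}: {emb:.2f}B embedding parameters ({share:.1%} of total)")
```

With these rounded inputs, embeddings account for roughly 29% of Qwen2-0.5B but only about 3% of Qwen2-72B, so the embedding matrix matters far more to the footprint of the small models.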

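All models use Grouped Query Attention (GQA), in which several query heads share a single key/value head. This shrinks the key/value cache, which reduces memory usage and speeds up inference. The two smallest models also tie their input and output embedding weights to cut the parameter count further.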
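
To get a feel for the memory effect of GQA, here is a minimal back-of-the-envelope sketch of key/value cache size. The layer count, head counts, head dimension, and 16-bit precision below are illustrative assumptions, not figures from this article, and `kv_cache_bytes` is a hypothetical helper rather than part of any library:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Approximate KV cache size for one sequence.

    The leading factor of 2 accounts for storing both a key and a value
    vector per token, per KV head, per layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed configuration, for illustration only.
LAYERS = 28
QUERY_HEADS = 28   # standard multi-head attention keeps one KV head per query head
KV_HEADS = 4       # GQA shares each KV head across a group of query heads
HEAD_DIM = 128
SEQ_LEN = 32 * 1024
GIB = 1024 ** 3

mha = kv_cache_bytes(LAYERS, QUERY_HEADS, HEAD_DIM, SEQ_LEN)
gqa = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN)

print(f"Multi-head attention cache: {mha / GIB:.2f} GiB")
print(f"Grouped query attention cache: {gqa / GIB:.2f} GiB")
print(f"Reduction factor: {mha // gqa}x")
```

Under these assumptions the cache shrinks by the ratio of query heads to KV heads, which is where the memory saving comes from.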

Multilingual Support

One of the standout features of Qwen2 is its expanded multilingual capabilities. In addition to English and Chinese, Qwen2 has been trained on data in 27 additional languages. This makes it a powerful tool for applications requiring broad language coverage, such as translation services, content generation, and cross-lingual information retrieval.

Benchmark Performance

Qwen2 has achieved state-of-the-art performance across various benchmarks. Notably, the Qwen2-72B model excels in coding and mathematics tasks, demonstrating significant improvements over its predecessor. This enhanced performance is crucial for applications like code completion, bug detection, and mathematical problem-solving.

Extended Context Length

Another key improvement is the extended context length support. The Qwen2-7B-Instruct and Qwen2-72B-Instruct models can handle up to 128K tokens, a substantial increase from the 32K token limit of their base counterparts. This capability is essential for tasks that require understanding long contexts, such as summarizing lengthy documents or generating coherent narratives.

Model Information

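Here's a detailed overview of the Qwen2 models. The 128K context lengths listed for Qwen2-7B and Qwen2-72B correspond to the Instruct variants described above: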

| Model | Qwen2-0.5B | Qwen2-1.5B | Qwen2-7B | Qwen2-57B-A14B | Qwen2-72B |
| --- | --- | --- | --- | --- | --- |
| # Params | 0.49B | 1.54B | 7.07B | 57.41B | 72.71B |
| # Non-Emb Params | 0.35B | 1.31B | 5.98B | 56.32B | 70.21B |
| GQA | True | True | True | True | True |
| Tie Embedding | True | True | False | False | False |
| Context Length | 32K | 32K | 128K | 64K | 128K |
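
As a small practical sketch, the context-length row of the table can be used to shortlist models for a given workload. The dictionary and the `models_that_fit` helper are hypothetical names for this example, and the code assumes "K" means 1,024 tokens:

```python
# Context lengths from the table above, assuming 1K = 1,024 tokens.
CONTEXT_LENGTH_TOKENS = {
    "Qwen2-0.5B": 32 * 1024,
    "Qwen2-1.5B": 32 * 1024,
    "Qwen2-7B": 128 * 1024,
    "Qwen2-57B-A14B": 64 * 1024,
    "Qwen2-72B": 128 * 1024,
}


def models_that_fit(prompt_tokens, max_new_tokens):
    """Return models whose context window holds the prompt plus the reply."""
    needed = prompt_tokens + max_new_tokens
    return [
        name
        for name, limit in CONTEXT_LENGTH_TOKENS.items()
        if needed <= limit
    ]


# Example: a 100,000-token document with room for a 2,000-token summary.
print(models_that_fit(prompt_tokens=100_000, max_new_tokens=2_000))
```

Counting the reply budget alongside the prompt matters because generated tokens occupy the same context window as the input.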

Implementation Notes

GQA: Grouped Query Attention is applied to all model sizes, improving inference speed and memory efficiency.
Tie Embedding: Smaller models (Qwen2-0.5B and Qwen2-1.5B) tie embedding weights to optimize resource usage, while larger models do not due to the high