Qwen2-Math: Enhancing Mathematical Reasoning with Specialized Large Language Models

Models & Research

The Engineer

9 Aug 2024 · 3 min read

Qwen2-Math targets complex mathematical reasoning and is reported to surpass both open-source and proprietary competitors on math benchmarks, thanks to specialized training on a mathematics-specific corpus.

Introducing Qwen2-Math

The Qwen team has been hard at work over the past year, focusing on enhancing the reasoning capabilities of large language models (LLMs) for solving arithmetic and mathematical problems. Today, we are excited to introduce Qwen2-Math, a series of specialized math LLMs that significantly outperform existing open-source and even closed-source models like GPT-4o. The series includes the base models Qwen2-Math-1.5B/7B/72B and their instruction-tuned variants, Qwen2-Math-1.5B/7B/72B-Instruct.

Technical Overview

Base Models: Qwen2-Math

The Qwen2-Math base models are initialized from the Qwen2 LLMs in three sizes: 1.5B, 7B, and 72B parameters. They are then further pretrained on a carefully curated mathematics-specific corpus. The corpus includes:

High-quality mathematical web texts: Covering a wide range of topics from various sources.
Mathematical books: Textbooks and other educational materials.
Code: Algorithms and programming examples related to mathematics.
Exam questions: Problems from standardized tests like the SAT, GRE, and others.
Synthetic data: Generated by Qwen2 to augment the training set.

The Qwen2-Math base models are evaluated on several widely used benchmarks:

English math benchmarks:
- GSM8K: A dataset of 8.5K grade school math word problems.
- MATH: A dataset of competition-level mathematics problems with step-by-step solutions.
- MMLU-STEM: Multiple-choice questions covering STEM topics.
Chinese math benchmarks:
- CMATH: Comprehensive math problems in Chinese.
- GaoKao Math Cloze: Fill-in-the-blank questions from the Chinese college entrance exam.
- GaoKao Math QA: Question-and-answer format from the same exam.

All evaluations are conducted using few-shot chain-of-thought prompting to assess the models' reasoning capabilities.
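
To make the evaluation setup concrete, here is a minimal sketch of few-shot chain-of-thought prompting. The exemplars, the prompt template, and the helper names `build_few_shot_prompt` and `extract_final_answer` are illustrative assumptions, not the prompts or answer parser used in the Qwen2-Math evaluation.

```python
import re

# Hypothetical worked exemplars; real evaluations use benchmark-specific ones.
EXEMPLARS = [
    {
        "question": "Tom has 3 boxes with 4 pens each. How many pens does he have?",
        "reasoning": "Each box holds 4 pens, and there are 3 boxes, so 3 * 4 = 12.",
        "answer": "12",
    },
    {
        "question": "A book costs 15 dollars. Ana pays with a 20 dollar bill. What is her change?",
        "reasoning": "The change is 20 - 15 = 5.",
        "answer": "5",
    },
]


def build_few_shot_prompt(question, exemplars=EXEMPLARS):
    """Concatenate worked examples, then the new question with an open answer slot."""
    parts = []
    for ex in exemplars:
        parts.append(
            f"Question: {ex['question']}\n"
            f"Answer: {ex['reasoning']} The answer is {ex['answer']}."
        )
    # The final block ends at "Answer:" so the model continues with its own reasoning.
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


def extract_final_answer(completion):
    """Return the number following the last 'The answer is', or None if absent."""
    matches = re.findall(r"The answer is\s*(-?\d+(?:\.\d+)?)", completion)
    return matches[-1] if matches else None
```

The model's completion for the final `Answer:` slot is parsed with `extract_final_answer` and compared against the reference answer. Because the exemplars show their reasoning before the answer, the model is nudged to do the same.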

Instruction-Tuned Models: Qwen2-Math-Instruct

To further enhance the performance of Qwen2-Math, we developed instruction-tuned variants. The process involves:

Reward model training: We trained a math-specific reward model based on Qwen2-Math-72B. Its dense reward signal is combined with a binary signal indicating whether the final answer is correct, and the combined signal supervises both of the following steps.
Supervised Fine-Tuning (SFT) data construction: Using Rejection Sampling, we generated high-quality SFT data by filtering out low-reward responses.
Reinforcement Learning with Group Relative Policy Optimization (GRPO): After SFT, we applied GRPO to further refine the models. For each prompt, GRPO samples a group of responses and scores each one relative to the others in its group, so no separate value model is needed.
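
The three steps above can be sketched in a few lines. This is an illustrative toy, not the Qwen team's implementation: the additive formula in `combined_reward`, the `correctness_weight` parameter, the rule of discarding incorrect responses before ranking, and all function names are assumptions. The advantage computation follows the commonly described GRPO form, in which each reward is standardized against the mean and standard deviation of its own group.

```python
import statistics


def combined_reward(dense_score, is_correct, correctness_weight=1.0):
    """Assumed combination rule: dense reward-model score plus a weighted
    binary correctness bonus. The exact formula is not given in the article."""
    return dense_score + correctness_weight * (1.0 if is_correct else 0.0)


def rejection_sample(candidates, keep_top=1):
    """Select SFT responses for one prompt.

    Each candidate is a dict with 'text', 'dense_score', and 'is_correct'.
    Assumption: incorrect responses are rejected outright, and the remaining
    ones are ranked by combined reward, highest first.
    """
    correct = [c for c in candidates if c["is_correct"]]
    ranked = sorted(
        correct,
        key=lambda c: combined_reward(c["dense_score"], c["is_correct"]),
        reverse=True,
    )
    return [c["text"] for c in ranked[:keep_top]]


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize the rewards of responses sampled for the same prompt.

    Uses the population standard deviation; eps avoids division by zero
    when every response in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Toy group of three sampled responses to a single prompt.
candidates = [
    {"text": "A", "dense_score": 0.2, "is_correct": True},
    {"text": "B", "dense_score": 0.9, "is_correct": False},
    {"text": "C", "dense_score": 0.7, "is_correct": True},
]
sft_picks = rejection_sample(candidates)
advantages = group_relative_advantages(
    [combined_reward(c["dense_score"], c["is_correct"]) for c in candidates]
)
```

In this toy group, response B has the highest dense score but is incorrect, so it is never selected for SFT and receives a negative advantage during the reinforcement learning step.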

Benchmark Performance

The largest model, Qwen2-Math-72B-Instruct, demonstrates superior performance across all benchmarks compared to state-of-the-art models:

GSM8K: Outperforms GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro, and Llama-3.1-405B.
MATH: Consistently delivers higher accuracy in solving complex mathematical problems.
MMLU-STEM: Shows significant improvement in STEM-related multiple-choice questions.

For the Chinese benchmarks, Qwen2-Math also excels:

CMATH: High accuracy in solving a wide range of math problems.
GaoKao Math Cloze and QA: Superior performance on college entrance exam questions.

Conclusion

Qwen2-Math represents a significant step forward in enhancing the mathematical reasoning capabilities of LLMs. By focusing on specialized training and instruction tuning, we have created models that can effectively solve complex mathematical problems, outperforming even leading commercial models. We hope these advancements will contribute to the broader AI community and help solve real-world challenges.