DeepScaleR-1.5B-Preview: Scaling Reinforcement Learning to Outperform O1-Preview on Math Benchmarks

The Engineer

13 Feb 2025 · 3 min read

DeepScaleR-1.5B-Preview, an open-source 1.5B-parameter model trained with reinforcement learning, outperforms OpenAI’s O1-Preview on the AIME 2024 and MATH 500 math benchmarks.

DeepScaleR, a project led by Michael Luo and Sijun Tan, introduces DeepScaleR-1.5B-Preview, a language model fine-tuned with reinforcement learning (RL) from the base model DeepSeek-R1-Distill-Qwen-1.5B. This 1.5-billion-parameter model reaches 43.1% Pass@1 accuracy on AIME 2024, 14.3 percentage points above the base model (28.8%) and ahead of OpenAI’s O1-Preview (40.0%). The team has open-sourced their dataset, code, and training logs to foster further advancements in scaling intelligence with RL.

Key Performance Metrics

| Model | AIME 2024 (Pass@1) | MATH 500 (Pass@1) | AMC 2023 (Pass@1) | Minerva Math (Pass@1) | Olympiad Bench (Pass@1) | Avg. Pass@1 |
| --- | --- | --- | --- | --- | --- | --- |
| DeepScaleR-1.5B-Preview | 43.1% | 87.8% | 73.6% | 30.2% | 50.0% | 57.0% |
| DeepSeek-R1-Distill-Qwen-1.5B | 28.8% | 82.8% | 62.9% | 26.5% | 43.3% | 48.9% |
| O1-Preview | 40.0% | 81.4% | - | - | - | - |
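The averages and per-benchmark gains in the table can be rechecked with a few lines of arithmetic. The sketch below uses only the rounded scores printed above, so a recomputed average may differ from the reported one by about 0.1 point (the reported averages were presumably taken over unrounded scores). The names `SCORES`, `average`, and `gains` are illustrative.

```python
# Pass@1 scores (percent), copied from the table above.
BENCHMARKS = ["AIME 2024", "MATH 500", "AMC 2023", "Minerva Math", "Olympiad Bench"]
SCORES = {
    "DeepScaleR-1.5B-Preview": [43.1, 87.8, 73.6, 30.2, 50.0],
    "DeepSeek-R1-Distill-Qwen-1.5B": [28.8, 82.8, 62.9, 26.5, 43.3],
}


def average(scores):
    """Unweighted mean over the five benchmarks."""
    return sum(scores) / len(scores)


def gains(new, base):
    """Per-benchmark difference in percentage points, rounded to one decimal."""
    return [round(n - b, 1) for n, b in zip(new, base)]


averages = {name: round(average(vals), 1) for name, vals in SCORES.items()}
per_benchmark_gain = dict(
    zip(
        BENCHMARKS,
        gains(SCORES["DeepScaleR-1.5B-Preview"], SCORES["DeepSeek-R1-Distill-Qwen-1.5B"]),
    )
)

for name, avg in averages.items():
    print(f"{name}: average Pass@1 = {avg}%")
for bench, gain in per_benchmark_gain.items():
    print(f"{bench}: +{gain} points over the base model")
```

The differences are in percentage points, not relative percent: the AIME 2024 gain is the gap between 43.1% and 28.8%.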

Technical Details and Methodology

Base Model and Training Setup

Base Model: DeepSeek-R1-Distill-Qwen-1.5B, a 1.5B-parameter Qwen model distilled from DeepSeek-R1.
Training Data: 40K high-quality math problems.
Compute Resources: Trained using 3,800 A100 GPU hours (approximately $4,500).

Reinforcement Learning Approach

The team leveraged RL to fine-tune the base model, focusing on improving its performance on competition-level math benchmarks. Key aspects of their approach include:

Iterative Lengthening Scheme: To manage computational cost while still improving performance, the context length was increased in stages during training rather than fixed at its maximum from the start:
- At step 1040, the context length was extended to 16K tokens.
- At step 1520, it was further extended to 24K tokens.
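A staged schedule like this can be expressed as a step-to-context-length lookup. The following is a minimal sketch, not the project's training code: the two extension points come from the list above, while the 8K starting window, the reading of "K" as 1024 tokens, and the names `SCHEDULE` and `max_context_for_step` are assumptions made for illustration.

```python
from bisect import bisect_right

# (first_step, max_context_tokens) pairs, sorted by step.
# The 16K and 24K stages come from the text; the 8K starting window and
# K = 1024 are assumptions made for this sketch.
SCHEDULE = [
    (0, 8 * 1024),
    (1040, 16 * 1024),
    (1520, 24 * 1024),
]


def max_context_for_step(step: int) -> int:
    """Return the context-length cap in effect at a given training step."""
    if step < 0:
        raise ValueError("step must be non-negative")
    starts = [start for start, _ in SCHEDULE]
    # Index of the last stage whose first step is <= the current step.
    stage = bisect_right(starts, step) - 1
    return SCHEDULE[stage][1]


for step in (0, 1039, 1040, 1519, 1520, 2000):
    print(step, max_context_for_step(step))
```

A training loop would query this function each step and truncate or cap generation at the returned length, so that early steps run with the cheaper short window.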

Reward Function: The reward function was designed to prioritize accurate solutions while penalizing incorrect or incomplete answers. This helped guide the model towards more precise and reliable reasoning.
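The article does not spell out the reward values, so the following is only one plausible shape for such a function: a binary outcome reward that returns 1.0 when the final answer matches the reference and 0.0 when it is wrong or missing. The `\boxed{...}` answer convention, the string-based comparison, and the names `extract_final_answer`, `normalize`, and `outcome_reward` are all assumptions for illustration.

```python
import re
from typing import Optional

# Matches \boxed{...} with no nested braces. Answers such as \frac{1}{2}
# would need a brace-aware parser; this sketch does not handle them.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_final_answer(response: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in the response, if any."""
    matches = BOXED.findall(response)
    return matches[-1] if matches else None


def normalize(answer: str) -> str:
    """Crude normalization: drop all whitespace and a trailing period."""
    return "".join(answer.split()).rstrip(".")


def outcome_reward(response: str, reference: str) -> float:
    """Hypothetical binary reward: 1.0 for a correct final answer, else 0.0."""
    answer = extract_final_answer(response)
    if answer is None:
        # Incomplete response: no final answer was produced.
        return 0.0
    return 1.0 if normalize(answer) == normalize(reference) else 0.0


print(outcome_reward("So the total is \\boxed{42}.", "42"))
print(outcome_reward("I think it is \\boxed{41}.", "42"))
print(outcome_reward("The reasoning ran out of context", "42"))
```

A production reward would likely compare answers symbolically rather than as strings, since equivalent expressions can be written in many forms.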

Challenges and Solutions

One of the primary challenges in scaling RL is the high computational cost. Directly replicating DeepSeek-R1’s experiments, which involve context lengths of 32K tokens and around 8,000 training steps, would require at least 70,000 A100 GPU hours even for a 1.5B model. To mitigate this, the team employed several strategies:

Distilled Model: Starting from a small 1.5B-parameter model distilled from DeepSeek-R1, rather than a larger model, reduced the computational overhead.
Efficient Training Pipeline: Optimized training pipelines and efficient data loading techniques were implemented to maximize resource utilization.
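To put the compute figures quoted in this article side by side, the short calculation below compares the estimated 70,000 GPU hours for a direct replication with the 3,800 GPU hours actually used. The implied hourly rate is derived from the approximate $4,500 total, and the extrapolated replication cost assumes that same rate would apply, which is an assumption rather than a figure from the team.

```python
# Figures quoted in the article.
REPLICATION_GPU_HOURS = 70_000  # lower bound for a direct replication
ACTUAL_GPU_HOURS = 3_800        # used to train DeepScaleR-1.5B-Preview
ACTUAL_COST_USD = 4_500         # approximate total cost

# Implied price per A100 GPU hour.
implied_rate = ACTUAL_COST_USD / ACTUAL_GPU_HOURS

# How many times less compute was used than a direct replication.
reduction_factor = REPLICATION_GPU_HOURS / ACTUAL_GPU_HOURS

# Rough replication cost, assuming the same hourly rate (an assumption).
replication_cost = REPLICATION_GPU_HOURS * implied_rate

print(f"Implied rate: ${implied_rate:.2f} per GPU hour")
print(f"Compute reduction: {reduction_factor:.1f}x")
print(f"Estimated replication cost: ${replication_cost:,.0f}")
```

Under these assumptions the run used roughly 18 times less compute than a direct replication would have needed.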

Open Sourcing and Community Impact

The DeepScaleR project is committed to transparency and community involvement. The team has open-sourced their dataset, code, and training logs so that others can reproduce the results and build on the work. Project resources:

Website: agentica-project.com
GitHub Repository: github.com/agentica-project/deepscaler
Hugging Face Model: huggingface.co/agentica-org/DeepScaleR-1.5B-Preview
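For readers who want to try the released checkpoint, a minimal loading sketch with the Hugging Face `transformers` library might look like the following. It assumes `transformers` and PyTorch are installed, takes the model ID from the Hugging Face URL above, and passes the problem as a plain prompt without a chat template, which is a simplification. The function name `generate_solution` and the default token budget are illustrative.

```python
MODEL_ID = "agentica-org/DeepScaleR-1.5B-Preview"  # from the Hugging Face URL above


def generate_solution(problem: str, max_new_tokens: int = 1024) -> str:
    """Load the model and generate a response for one math problem."""
    # Imported lazily so this module can be imported without the
    # dependency installed; the weights are downloaded on first call.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    inputs = tokenizer(problem, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_solution("What is 12 * 13?")` downloads the weights on first use. Reasoning models of this kind often produce long outputs, so `max_new_tokens` may need to be raised for competition-level problems.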