Quality Over Quantity: Why AI Labs Will Spend More on High-Quality RL Tasks

The Engineer

21 Aug 2025

As compute costs soar, AI labs are expected to opt for fewer but higher-quality reinforcement learning tasks, accepting a larger upfront investment in task design in exchange for more efficient training runs.

When it comes to creating reinforcement learning (RL) tasks, practitioners face a fundamental tradeoff between quality and quantity. You can either invest significant engineering effort to create a small number of high-quality, hand-crafted tasks that provide rich reward signals, or you can use procedural generation to churn out a large number of lower-quality tasks with less effort per task. This decision is crucial because it directly impacts the efficiency and effectiveness of your training runs, especially as compute costs continue to rise.

The Quality vs. Quantity Tradeoff

High-Quality Tasks: These are meticulously designed and provide dense, informative reward signals. They require substantial engineering effort but offer better learning outcomes.
Low-Quality Tasks: These are easier to generate in large numbers but often result in sparse or weak reward signals, leading to less effective training.

Why Quality Will Prevail

Ege Erdil, Matthew Barnett, and Tamay Besiroglu predict that within a year, AI labs will favor quality over quantity when procuring RL environments. They argue that the increasing compute costs per RL run will make it inefficient to use low-quality tasks. Here’s why:

Compute Costs: Compute for RL runs is expensive and getting more so. For example, the API price for Grok 4 is $15 per million output tokens. That price serves as a proxy for the opportunity cost of using compute for training instead of inference.
Task Complexity: Frontier RL tasks are becoming more complex, resulting in longer transcripts. SWE-bench Verified, a benchmark from over a year ago, had average transcript lengths of about 20,000 tokens. Current frontier tasks already result in transcripts around 100,000 tokens long. Epoch AI has observed that transcript lengths are growing at a rate of 5x per year. Extrapolating this trend suggests that within a year, transcript lengths will be around half a million tokens.
Group Size: In training runs like DeepSeek-R1, models are trained with a group size of 64, meaning 64 rollouts are sampled for each task. If each rollout averages 500,000 tokens at $15 per million tokens, the cost per RL task is:
- (500,000 tokens × $15 × 64) ÷ 1M = $480 per RL task
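
The arithmetic above can be written as a small helper. This is a minimal sketch, not anyone's actual accounting method: the function name `cost_per_rl_task` and its parameter names are hypothetical, and it assumes every rollout in the group has the same average length and that the API output price is a fair proxy for training compute.

```python
def cost_per_rl_task(tokens_per_rollout: int,
                     price_per_million_tokens: float,
                     group_size: int) -> float:
    """Approximate dollar cost of using one RL task once.

    Assumes (hypothetically) that all rollouts in the group have the same
    average length and are priced at the API output-token rate.
    """
    total_tokens = tokens_per_rollout * group_size
    return total_tokens * price_per_million_tokens / 1_000_000


if __name__ == "__main__":
    # Figures from the article: 500,000 tokens, $15 per million, group size 64.
    cost = cost_per_rl_task(500_000, 15.0, 64)
    print(f"${cost:,.0f} per RL task")  # prints "$480 per RL task"
```

Changing any one input scales the result linearly, so halving the group size or the transcript length halves the estimate.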

The Financial Incentive

Given these factors, AI labs are likely to spend a few thousand dollars per RL task to ensure high-quality training. With roughly $480 (about $500) of compute consumed each time a task is used, spending much less than that on building the task becomes inefficient. The reasoning is straightforward: a poorly designed task wastes expensive compute, so paying more for quality up front costs less than running low-quality training.

Implementation Details

Transcript Length: Current trends suggest that transcript lengths are growing exponentially, making each RL task more computationally intensive.
Group Size and Parallelism: The rollouts in a group can run in parallel, but each one still consumes tokens, so a group size of 64 multiplies the compute cost of a task by 64.
Cost Calculation: The cost per RL task is the product of the transcript length in tokens, the price per token, and the group size.
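
To see how transcript growth feeds into per-task cost, the sketch below projects both from the figures quoted in the article: about 20,000 tokens a year ago, 5x growth per year, $15 per million tokens, and a group size of 64. The constant and function names are hypothetical. Holding the price and group size fixed across years is a simplifying assumption made here for illustration, not a claim from the source.

```python
BASELINE_TOKENS = 20_000   # approximate transcript length a year ago (from the article)
GROWTH_PER_YEAR = 5        # growth rate quoted in the article
PRICE_PER_MILLION = 15.0   # assumed constant across years (simplification)
GROUP_SIZE = 64            # assumed constant across years (simplification)


def transcript_tokens(years_after_baseline: int) -> int:
    """Transcript length after compounding 5x growth from the baseline."""
    return BASELINE_TOKENS * GROWTH_PER_YEAR ** years_after_baseline


def task_cost(tokens: int) -> float:
    """Dollar cost of one task: tokens x group size x price per token."""
    return tokens * GROUP_SIZE * PRICE_PER_MILLION / 1_000_000


def projection(years=(0, 1, 2)):
    """Return (year, tokens, cost) tuples for each year after the baseline."""
    return [(y, transcript_tokens(y), task_cost(transcript_tokens(y)))
            for y in years]


if __name__ == "__main__":
    labels = {0: "a year ago", 1: "today", 2: "in a year"}
    for year, tokens, cost in projection():
        print(f"{labels[year]:>10}: {tokens:>9,} tokens -> ${cost:,.2f} per task")
```

Under these assumptions the per-task cost moves from roughly $19 to $96 to $480, which is why a fixed spend on task design looks smaller each year relative to the compute it consumes.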

Conclusion

The shift towards high-quality RL tasks is inevitable as the compute cost of each task continues to rise. AI labs will need to prioritize quality over quantity to ensure that their training runs are both efficient and effective. By investing more in high-quality tasks, they can avoid wasting valuable computational resources on low-quality ones.