Achieving 90%+ GPT-2 Performance with Just 1 Billion Tokens: The Optimal Dataset Mix

The Engineer

17 Nov 2025 · 3 min read

Scientists at Hugging Face show that a GPT-2-sized model can reach near-peak performance with just 1 billion tokens, less than one-tenth the usual data volume, by refining dataset selection and training techniques.

Pre-training a language model, even one the size of GPT-2, usually demands a large token budget and substantial computational resources. However, a team at Hugging Face has demonstrated that you can achieve over 90% of GPT-2's performance with roughly one-tenth of the data by carefully crafting the pre-training dataset. This article walks through their methodology, findings, and the resulting recipe for efficient pre-training.

The Problem: Smarter, Not Harder

Training large language models has turned into an arms race of scale. Modern models consume trillions of tokens during pre-training, often taking months to train and costing millions of dollars. The prevailing assumption is that more data leads to better models. But does all that data hold equal value?

Recent research suggests that dataset quality matters as much as quantity. This raises the question: Can we create a smaller, more effective dataset that achieves comparable performance? The team at Hugging Face aimed to do just that by crafting a 1 billion token dataset for training a GPT-2-sized model.

The Approach: Systematic Dataset Mixing Experiments

To find the optimal dataset composition, the team conducted over 50 controlled experiments using a GPT-2 architecture with 70 million parameters. They tested different combinations of three data sources from their pre-training dataset collection:

finePDFs: High-quality PDF documents, curated for content quality and relevance.
DCLM-baseline: A diverse set of web texts, including news articles, blogs, and other online content.
FineWeb-Edu: Educational content from the web, focusing on structured and well-written material.
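
To give a feel for what a mixing search space over three sources looks like, here is a minimal sketch that enumerates candidate mixtures on a coarse grid. The article does not say how the team chose its 50+ configurations, so the grid step and the function name `candidate_mixtures` are purely illustrative assumptions.

```python
# Illustrative only: the grid step is an assumption, not the team's procedure.
SOURCES = ("finePDFs", "DCLM-baseline", "FineWeb-Edu")


def candidate_mixtures(steps=10):
    """Enumerate every weight triple on a grid of 1/steps that sums to 1."""
    mixtures = []
    for a in range(steps + 1):
        for b in range(steps + 1 - a):
            c = steps - a - b  # the third weight is whatever remains
            weights = (a / steps, b / steps, c / steps)
            mixtures.append(dict(zip(SOURCES, weights)))
    return mixtures


grid = candidate_mixtures()
```

With a step of 10%, the grid contains 66 candidate mixtures, which is in the same range as the number of experiments reported. Each candidate would then be used to train and evaluate one model under otherwise identical settings.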

Key Findings

1. Static Dataset Mixing Outperforms Complex Curriculum Strategies

The team found that a static mixture of:

50% finePDFs
30% DCLM-baseline
20% FineWeb-Edu

consistently outperformed more complex curriculum learning strategies. This mix not only avoided catastrophic failures but also maintained excellent generalization across various tasks.
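
In practice, a static mixture means the sampling weights stay the same for the whole run. The sketch below, using only the Python standard library, splits the 1 billion token budget according to the reported weights and draws source names with fixed probabilities. The helper names `token_budget` and `sample_sources` are hypothetical, and the article does not describe the team's actual data loader.

```python
import random

# Static mixture weights and token budget as reported in the article.
MIXTURE = {"finePDFs": 0.50, "DCLM-baseline": 0.30, "FineWeb-Edu": 0.20}
TOTAL_TOKENS = 1_000_000_000


def token_budget(mixture, total_tokens):
    """Split a total token budget across sources according to fixed weights."""
    return {name: round(weight * total_tokens) for name, weight in mixture.items()}


def sample_sources(mixture, n, seed=0):
    """Draw n source names independently, with the same weights at every draw."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


budget = token_budget(MIXTURE, TOTAL_TOKENS)
draws = sample_sources(MIXTURE, 10_000)
```

Here `budget` assigns 500 million tokens to finePDFs, 300 million to DCLM-baseline, and 200 million to FineWeb-Edu. With streaming datasets, the same idea is commonly expressed by interleaving sources with fixed sampling probabilities.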

2. Catastrophic Failure Modes in Curriculum Learning

Curriculum learning, which involves gradually introducing more difficult data over time, can lead to catastrophic failure modes. The team observed that models trained with certain curriculum strategies would suddenly degrade in performance, often irreversibly. This highlights the importance of static mixing for stability and reliability.
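
The difference between the two approaches comes down to whether the mixture weights depend on the training step. The following sketch contrasts a static schedule with one hypothetical curriculum that linearly shifts weight from one mixture to another. The start and end mixtures and the linear shape are assumptions for illustration; the article does not specify which curriculum schedules the team tested.

```python
STATIC = {"finePDFs": 0.5, "DCLM-baseline": 0.3, "FineWeb-Edu": 0.2}

# Hypothetical curriculum endpoints, chosen only to illustrate the idea.
CURRICULUM_START = {"finePDFs": 0.0, "DCLM-baseline": 0.2, "FineWeb-Edu": 0.8}
CURRICULUM_END = {"finePDFs": 0.8, "DCLM-baseline": 0.2, "FineWeb-Edu": 0.0}


def static_weights(step, total_steps):
    """Static mixing: the same weights regardless of the training step."""
    return dict(STATIC)


def linear_curriculum_weights(step, total_steps, start, end):
    """Curriculum mixing: interpolate linearly from start to end over training."""
    t = step / max(total_steps - 1, 1)  # 0.0 at the first step, 1.0 at the last
    return {name: (1 - t) * start[name] + t * end[name] for name in start}


total_steps = 1000
early = linear_curriculum_weights(0, total_steps, CURRICULUM_START, CURRICULUM_END)
late = linear_curriculum_weights(total_steps - 1, total_steps, CURRICULUM_START, CURRICULUM_END)
```

Under a schedule like this, the data distribution the model sees late in training can differ sharply from what it saw early on, whereas `static_weights` returns an identical mixture at every step.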

3. The "Goldilocks Zone" for Synthetic Content

Synthetic content, generated by other language models or through data augmentation techniques, can be valuable but must be used judiciously. The team identified a "Goldilocks zone": enough synthetic content to improve performance, but not so much that low-quality data overwhelms the model. Striking this balance is crucial for maintaining strong performance and generalization.

Implementation Details

Model Architecture: GPT-2 with 70 million parameters.
Training Data: 1 billion tokens, composed as described above.
Training Time: Significantly reduced compared to models trained on 10x more data.
Performance Metrics: The model achieved over 90% of the original GPT-2's benchmark performance.
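
The numbers above imply some simple arithmetic worth making explicit. The sketch below computes tokens per parameter from the reported figures and estimates the number of optimizer steps. The batch size is a hypothetical value, the sequence length assumes GPT-2's 1,024-token context, and the scores passed to `relative_performance` are placeholders rather than reported results.

```python
PARAMS = 70_000_000        # model size reported in the article
TOKENS = 1_000_000_000     # token budget reported in the article

tokens_per_param = TOKENS / PARAMS  # roughly 14.3 tokens per parameter


def optimizer_steps(total_tokens, batch_size, seq_len):
    """Number of full optimizer steps that fit in a token budget."""
    return total_tokens // (batch_size * seq_len)


def relative_performance(candidate_score, baseline_score):
    """Candidate score expressed as a fraction of the baseline score."""
    return candidate_score / baseline_score


# Hypothetical batch size; sequence length assumes GPT-2's 1,024-token context.
steps = optimizer_steps(TOKENS, batch_size=256, seq_len=1024)

# Placeholder scores, only to show how a "90% of baseline" check would read.
meets_target = relative_performance(0.45, 0.50) >= 0.90
```

A ratio like `relative_performance` only makes sense when both models are scored on the same benchmarks with the same evaluation setup.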

Conclusion

By carefully curating and mixing different types of training data, the team at Hugging Face demonstrated that it's possible to achieve high-performance language models with significantly less data. This approach not only reduces computational costs but also makes large-scale pre-training more accessible to a broader range of researchers and practitioners.