
Scientists at Hugging Face show it's possible to reach near-peak performance with GPT-2 using just 1 billion tokens, less than one-tenth the usual data volume, by refining dataset selection and training techniques.
Pre-training large language models typically requires enormous token counts and substantial computational resources. However, a team at Hugging Face has demonstrated that a GPT-2-sized model can reach over 90% of the usual performance with roughly one-tenth of the data, provided the pre-training dataset is carefully composed. This article dives into their methodology, findings, and the optimal recipe for efficient pre-training.
Training large language models has turned into an arms race of scale. Modern models consume trillions of tokens during pre-training, often taking months to train and costing millions of dollars. The prevailing assumption is that more data leads to better models. But does all that data hold equal value?
Recent research suggests that dataset quality matters as much as quantity. This raises the question: Can we create a smaller, more effective dataset that achieves comparable performance? The team at Hugging Face aimed to do just that by crafting a 1 billion token dataset for training a GPT-2-sized model.
To find the optimal dataset composition, the team conducted over 50 controlled experiments using a GPT-2 architecture with 70 million parameters. They tested different combinations of three data sources from their pre-training dataset collection.

The team found that a static mixture of these sources, with proportions held fixed throughout training, consistently outperformed more complex curriculum learning strategies. This mix not only avoided catastrophic failures but also maintained excellent generalization across various tasks.
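To make the idea of a static mixture concrete, here is a minimal sketch of sampling training documents from several sources with probabilities that never change during training. The source names and weights below are hypothetical placeholders chosen for illustration; they are not the proportions the team reported.

```python
import random
from collections import Counter

# Hypothetical source names and weights (assumptions for illustration only).
MIXTURE = {"source_a": 0.5, "source_b": 0.3, "source_c": 0.2}

def sample_static_mixture(weights, num_documents, seed=0):
    """Pick a source for each document using the same fixed probabilities."""
    rng = random.Random(seed)
    names = list(weights)
    probs = [weights[name] for name in names]
    return rng.choices(names, weights=probs, k=num_documents)

draws = sample_static_mixture(MIXTURE, 10_000)
counts = Counter(draws)

# The empirical shares should land close to the configured weights.
for name in MIXTURE:
    print(name, round(counts[name] / len(draws), 3))
```

Because the probabilities are constant, the first and last batches of training see the same blend of sources, which is the property the static approach relies on.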
Curriculum learning, which involves gradually introducing more difficult data over time, can lead to catastrophic failure modes. The team observed that models trained with certain curriculum strategies would suddenly degrade in performance, often irreversibly. This highlights the importance of static mixing for stability and reliability.
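The difference between the two approaches can be sketched as a schedule of mixture weights over training progress. The example below uses a simple linear curriculum that shifts from "easy" to "hard" data; this is only one possible curriculum shape, picked for illustration, and the labels and numbers are assumptions rather than the strategies the team actually tested.

```python
import math

# Hypothetical two-source setup; "easy" and "hard" are illustrative labels.
START = {"easy": 0.9, "hard": 0.1}
END = {"easy": 0.1, "hard": 0.9}
STATIC = {"easy": 0.5, "hard": 0.5}

def static_weights(progress, base):
    """Return the same mixture regardless of training progress."""
    return dict(base)

def linear_curriculum_weights(progress, start, end):
    """Linearly interpolate from `start` to `end` as progress goes 0 -> 1."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    return {k: (1.0 - progress) * start[k] + progress * end[k] for k in start}

for progress in (0.0, 0.5, 1.0):
    s = static_weights(progress, STATIC)
    c = linear_curriculum_weights(progress, START, END)
    # Both schedules must always describe a valid probability distribution.
    assert math.isclose(sum(s.values()), 1.0)
    assert math.isclose(sum(c.values()), 1.0)
    print(progress, s, c)
```

Under the curriculum schedule the data distribution at the end of training differs sharply from the distribution at the start, whereas the static schedule presents the same distribution throughout.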
Synthetic content, generated by other language models or through data augmentation techniques, can be valuable but must be used judiciously. The team identified a "Goldilocks zone": a share of synthetic content large enough to enhance performance, but not so large that it overwhelms the model with low-quality data. Striking this balance is crucial for maintaining high performance and generalization.
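One way to search for such a zone is to sweep the synthetic share while keeping the relative proportions of the other sources unchanged, then train and evaluate a model on each candidate. The sketch below only builds the candidate mixtures; the source names, weights, and fractions are hypothetical, and the training and evaluation steps are left out.

```python
import math

def with_synthetic_fraction(natural_weights, synthetic_fraction):
    """Rescale the natural sources to make room for a given synthetic share."""
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be in [0, 1]")
    if "synthetic" in natural_weights:
        raise ValueError("natural_weights must not contain a 'synthetic' key")
    total = sum(natural_weights.values())
    if total <= 0:
        raise ValueError("natural weights must sum to a positive value")
    mixture = {
        name: (1.0 - synthetic_fraction) * weight / total
        for name, weight in natural_weights.items()
    }
    mixture["synthetic"] = synthetic_fraction
    return mixture

def sweep(natural_weights, fractions):
    """Build one candidate mixture per synthetic fraction."""
    return {f: with_synthetic_fraction(natural_weights, f) for f in fractions}

# Hypothetical natural sources and sweep values (assumptions for illustration).
candidates = sweep({"web": 0.7, "curated": 0.3}, [0.0, 0.1, 0.2, 0.3, 0.4, 0.5])

for fraction, mixture in candidates.items():
    assert math.isclose(sum(mixture.values()), 1.0)
    print(fraction, {k: round(v, 3) for k, v in mixture.items()})
```

Each candidate would then need its own training run and evaluation; the fraction with the best held-out results marks the usable range for that particular setup.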
By carefully curating and mixing different types of training data, the team at Hugging Face demonstrated that it's possible to achieve high-performance language models with significantly less data. This approach not only reduces computational costs but also makes large-scale pre-training more accessible to a broader range of researchers and practitioners.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
17 November 2025