
HuggingFace's FineWeb aims to bridge the gap in open-source language model training by offering a massive, high-quality text dataset derived from web crawls, designed to yield better-performing models than other open pretraining datasets.
The performance of large language models (LLMs) is heavily influenced by the quality and size of their pretraining datasets. However, state-of-the-art open LLMs like Llama 3 and Mixtral have pretraining datasets that are not publicly available, leaving a gap in our understanding of how these models achieve their high performance.
To address this, HuggingFace has released FineWeb (15 trillion tokens, 44TB disk space), a large-scale dataset derived from 96 CommonCrawl snapshots. FineWeb is designed to produce better-performing LLMs than other open pretraining datasets. This article delves into the technical details and design choices behind FineWeb, as well as its educational subset, FineWeb-Edu.
The raw data for FineWeb comes from 96 snapshots of CommonCrawl, a massive repository of web data. These snapshots provide a broad and diverse source of text, but they also come with challenges such as noise, duplicates, and low-quality content. The team at HuggingFace tackled these issues through a series of preprocessing steps.
Deduplication: To remove duplicate content, the team used both exact and near-duplicate detection techniques. Exact deduplication ensures that identical documents are not included multiple times, while near-duplicate detection identifies and removes documents that are very similar but not identical.
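To make the two kinds of deduplication concrete, here is a minimal sketch of the idea, not the team's actual pipeline. It assumes exact duplicates are caught by hashing normalized text and near-duplicates by Jaccard similarity over word 3-gram shingles. The function names, the shingle size, and the 0.8 threshold are illustrative assumptions. The pairwise comparison is quadratic, so a web-scale system would need an approximation such as MinHash with locality-sensitive hashing.

```python
import hashlib
import re


def exact_key(text: str) -> str:
    """Hash of lowercased, whitespace-normalized text (exact-duplicate key)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def shingles(text: str, n: int = 3) -> set:
    """Set of word n-grams; n=3 is an illustrative choice."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs: list, threshold: float = 0.8) -> list:
    """Keep the first occurrence of each document; drop exact and near copies.

    The threshold is a hypothetical value, not one taken from FineWeb.
    """
    kept, kept_shingles, seen_keys = [], [], set()
    for doc in docs:
        key = exact_key(doc)
        if key in seen_keys:
            continue  # exact duplicate after normalization
        doc_shingles = shingles(doc)
        if any(jaccard(doc_shingles, s) >= threshold for s in kept_shingles):
            continue  # near duplicate of a document already kept
        seen_keys.add(key)
        kept.append(doc)
        kept_shingles.append(doc_shingles)
    return kept
```

In this sketch the first copy of a document always survives, so the order of the input determines which variant is kept.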
Filtering: Filtering is crucial to ensure high-quality data. The team applied a multi-step filtering process:
Sampling: To ensure a balanced dataset, the team used stratified sampling techniques. This involved dividing the data into different strata based on content type (e.g., news, blogs, scientific articles) and then sampling from each stratum to maintain diversity.
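The stratified sampling described above can be sketched as follows. This is an illustration under stated assumptions rather than the team's implementation: each document is assumed to be a dict that already carries a `content_type` label (a hypothetical field, since the text does not say how content types are assigned), and the same fixed number of documents is drawn from every stratum.

```python
import random
from collections import defaultdict


def stratified_sample(docs, per_stratum, seed=0):
    """Draw up to `per_stratum` documents from each content-type stratum.

    `docs` is a list of dicts with a hypothetical "content_type" key.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    strata = defaultdict(list)
    for doc in docs:
        strata[doc["content_type"]].append(doc)

    sample = []
    for content_type in sorted(strata):  # sorted for deterministic order
        group = strata[content_type]
        k = min(per_stratum, len(group))  # small strata contribute everything
        sample.extend(rng.sample(group, k))
    return sample


docs = (
    [{"content_type": "news", "text": f"news {i}"} for i in range(5)]
    + [{"content_type": "blog", "text": f"blog {i}"} for i in range(2)]
    + [{"content_type": "science", "text": "science 0"}]
)
sample = stratified_sample(docs, per_stratum=2)
```

Capping every stratum at the same size keeps a dominant content type, here news, from crowding out rarer ones. Sampling in proportion to stratum size is the other common choice.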

FineWeb has been shown to produce LLMs that outperform those trained on other open pretraining datasets. The team conducted extensive benchmarking to validate this claim:
FineWeb-Edu is a subset of FineWeb that focuses on high-quality educational content. It is available in two sizes:
The creation of FineWeb-Edu involved additional preprocessing steps to ensure the highest quality educational material:
Both FineWeb and FineWeb-Edu are released under the permissive ODC-By 1.0 license, making them freely available for research and development.
FineWeb represents a significant step forward in creating high-quality pretraining datasets for LLMs. By carefully documenting their design choices and testing each one with ablations, the HuggingFace team has provided valuable insights into the process of creating large-scale web datasets. The availability of FineWeb and FineWeb-Edu should contribute to the advancement of machine learning research and the development of better-performing language models.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
4 June 2024