FineWeb: Decanting the Web for High-Quality Text Data at Scale

The Engineer

4 Jun 2024

Hugging Face's FineWeb aims to close a gap in open language model training by offering a massive, high-quality text dataset derived from web crawls. Models trained on it outperform those trained on other open pretraining datasets.

Introduction

The performance of large language models (LLMs) is heavily influenced by the quality and size of their pretraining datasets. However, state-of-the-art open LLMs like Llama 3 and Mixtral have pretraining datasets that are not publicly available, leaving a gap in our understanding of how these models achieve their high performance.

To address this, Hugging Face has released FineWeb (15 trillion tokens, 44TB of disk space), a large-scale dataset derived from 96 CommonCrawl snapshots. FineWeb is designed to produce better-performing LLMs than other open pretraining datasets. This article covers the technical details and design choices behind FineWeb, as well as its educational subset, FineWeb-Edu.
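
For a sense of scale, the headline figures can be turned into rough averages. The sketch below assumes decimal units (1 TB = 10^12 bytes) and an even split of tokens across snapshots, neither of which is stated here, so treat the results as order-of-magnitude estimates only.

```python
# Back-of-the-envelope figures from the numbers quoted above.
TOTAL_TOKENS = 15e12   # 15 trillion tokens
DISK_BYTES = 44e12     # 44 TB, ASSUMING 1 TB = 10**12 bytes
SNAPSHOTS = 96         # CommonCrawl snapshots

# Average storage cost of one token and average snapshot size,
# ASSUMING tokens are spread evenly across snapshots.
bytes_per_token = DISK_BYTES / TOTAL_TOKENS
tokens_per_snapshot = TOTAL_TOKENS / SNAPSHOTS

print(f"{bytes_per_token:.2f} bytes of storage per token")
print(f"{tokens_per_snapshot / 1e9:.0f} billion tokens per snapshot on average")
```

The on-disk figure depends on file format and compression, so the bytes-per-token value says more about storage planning than about raw text length.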

Web Data

Finding the Raw Data

The raw data for FineWeb comes from 96 snapshots of CommonCrawl, a massive repository of web data. These snapshots provide a broad and diverse source of text, but they also bring noise, duplicates, and low-quality content. The team at Hugging Face tackled these issues through a series of preprocessing steps.

Preprocessing Steps

Deduplication: To remove duplicate content, the team used both exact and near-duplicate detection techniques. Exact deduplication ensures that identical documents are not included multiple times, while near-duplicate detection identifies and removes documents that are very similar but not identical.
Filtering: Filtering is crucial to ensure high-quality data. The team applied a multi-step filtering process:
- Language Identification: Only content in the target languages (primarily English) was retained.
- Quality Scores: Each document was assigned a quality score based on various metrics such as readability, sentence length, and grammatical correctness.
- Spam and Noise Removal: Documents that were identified as spam or contained excessive noise (e.g., boilerplate text, ads) were removed.
Sampling: To ensure a balanced dataset, the team used stratified sampling techniques. This involved dividing the data into different strata based on content type (e.g., news, blogs, scientific articles) and then sampling from each stratum to maintain diversity.
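
The three steps above can be sketched on toy data. This is a minimal illustration, not FineWeb's actual pipeline: the function names, the `kind` field, the word-shingle Jaccard comparison, and every threshold are assumptions chosen for readability. The pairwise near-duplicate check in particular would not scale to a web-sized corpus, where hashing schemes such as MinHash are the usual choice.

```python
import hashlib
import random
import re
from collections import defaultdict


def exact_dedup(docs):
    """Keep the first occurrence of each byte-identical text."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def shingles(text, n=3):
    """Set of lowercase word n-grams used to compare two documents."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}


def jaccard(a, b):
    """Overlap between two shingle sets, from 0.0 to 1.0."""
    return len(a & b) / len(a | b)


def near_dedup(docs, threshold=0.8):
    """Drop a document that is too similar to one already kept (O(n^2))."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc["text"])
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


def passes_quality(doc, min_words=5, max_symbol_ratio=0.3):
    """Toy heuristic: enough words, and not dominated by symbols."""
    text = doc["text"]
    if len(text.split()) < min_words:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio


def stratified_sample(docs, per_stratum, seed=0):
    """Sample up to `per_stratum` documents from each content type."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for doc in docs:
        strata[doc["kind"]].append(doc)
    sample = []
    for kind in sorted(strata):
        group = strata[kind]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample


# Hypothetical documents: one exact duplicate, one near duplicate, one noisy page.
base = "The quick brown fox jumps over the lazy dog near the river bank."
docs = [
    {"text": base, "kind": "blog"},
    {"text": base, "kind": "blog"},
    {"text": base[:-1] + " today.", "kind": "news"},
    {"text": "$$$ !!! ### >>> click >>> !!! $$$ ###", "kind": "blog"},
    {"text": "Researchers measured how deduplication changes language "
             "model accuracy on held-out benchmarks.", "kind": "science"},
]

cleaned = [d for d in near_dedup(exact_dedup(docs)) if passes_quality(d)]
sample = stratified_sample(cleaned, per_stratum=1)
print([d["kind"] for d in sample])
```

The steps run here in the order they are listed above. In a real pipeline, cheap filters are often applied before deduplication so that the expensive similarity step sees fewer documents.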

Dataset Performance

FineWeb has been shown to produce LLMs that outperform those trained on other open pretraining datasets. The team conducted extensive benchmarking to validate this claim:

Benchmarks: Models pretrained on FineWeb were evaluated on a suite of benchmarks including CommonsenseQA, HellaSwag, OpenBookQA, PIQA, SIQA, WinoGrande, ARC, and MMLU. On aggregate, these models outperformed counterparts trained on other open datasets under the same setup.
Educational Value: The team also built a subset called FineWeb-Edu, filtered for educational content. Models trained on it perform particularly well on knowledge- and reasoning-intensive benchmarks such as MMLU, ARC, and OpenBookQA.

FineWeb-Edu

FineWeb-Edu is a subset of FineWeb that focuses on high-quality educational content. It is available in two sizes:

1.3 trillion tokens: Very high educational content
5.4 trillion tokens: High educational content

The creation of FineWeb-Edu involved additional preprocessing steps to ensure the highest quality educational material:

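Automated Annotations: The team used an LLM (Llama-3-70B-Instruct) to generate synthetic annotations of educational value, then trained a classifier on those annotations so that documents could be scored at scale.
Content Selection: Documents were kept if their educational score met a threshold. The stricter threshold yields the 1.3 trillion token set, and the looser one yields the 5.4 trillion token set.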
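
Score-based selection of this kind can be sketched as follows, assuming each document already carries a numeric educational score. The `ScoredDoc` type, the 0 to 5 scale, the example scores, and the thresholds of 3 and 2 are illustrative assumptions for this sketch, and the scoring model itself is out of scope.

```python
from dataclasses import dataclass


@dataclass
class ScoredDoc:
    text: str
    edu_score: float  # HYPOTHETICAL classifier output on a 0-5 scale


def select_by_score(docs, threshold):
    """Keep documents whose educational score meets the threshold."""
    return [d for d in docs if d.edu_score >= threshold]


# Made-up documents and scores, for illustration only.
docs = [
    ScoredDoc("Photosynthesis converts light into chemical energy.", 4.2),
    ScoredDoc("Top 10 celebrity outfits this week", 0.4),
    ScoredDoc("A forum thread comparing laptop batteries", 2.3),
    ScoredDoc("Worked examples of solving linear equations", 3.6),
]

strict = select_by_score(docs, threshold=3)   # smaller, higher-scoring set
lenient = select_by_score(docs, threshold=2)  # larger, more permissive set

print(len(strict), len(lenient))
```

A stricter threshold always produces a subset of what the looser one keeps, which is why the two released sizes trade volume against educational density.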

Licensing

Both FineWeb and FineWeb-Edu are released under the permissive ODC-By 1.0 license, making them freely available for research and development.

Conclusion

FineWeb represents a significant step forward in creating high-quality pretraining datasets for LLMs. By carefully documenting and ablating their design choices, the Hugging Face team has provided valuable insight into how large-scale web datasets are built. The availability of FineWeb and FineWeb-Edu should help advance machine learning research and the development of better-performing language models.