
This project classifies the 8.4 million PDFs in the SafeDocs corpus, which was built from a Common Crawl snapshot, using a blend of LLMs and XGBoost.
Classifying every PDF in a snapshot of the internet is no small feat, but the challenge combines interesting technical problems with practical applications. In this article, I describe how I approached classifying the SafeDocs dataset using a mix of Large Language Models (LLMs), embeddings, and machine learning techniques like XGBoost.
The Common Crawl (CC) is a web archive that has been running since 2007, storing petabytes of data from the internet. However, when it comes to PDFs, CC truncates files to the first megabyte, which isn’t ideal for detailed analysis. Enter SafeDocs, also known as CC-MAIN-2021-31-PDF-UNTRUNCATED, a dataset created by the DARPA SafeDocuments program. Its creators refetched all the PDFs from a Common Crawl snapshot to provide untruncated versions, resulting in a corpus of 8.4 million PDFs totaling 8TB when uncompressed.
The goal was to classify these PDFs into meaningful categories like "Math" for linear algebra documents or "Medicine" for anatomy textbooks. This classification can help researchers and practitioners quickly find relevant content without sifting through millions of files manually.
Given the sheer size of 8TB, I needed a more manageable starting point. The metadata for SafeDocs is available as an 8GB text file, which includes a crucial column: url. URLs provide valuable context about the origin and nature of the PDFs. For example, a PDF hosted by an educational institution is likely to contain academic content.
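As a rough illustration of how much signal a URL carries before any PDF is opened, the sketch below pulls the host, top-level domain, file name, and path tokens out of a URL using only the Python standard library. The `url_features` helper and the example URL are hypothetical, and the choice of features is an assumption rather than the feature set used in this project.

```python
import re
from urllib.parse import unquote, urlparse


def url_features(url):
    """Extract simple, cheap signals from a PDF's source URL."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    # Last label of the host, e.g. "edu" for "math.example.edu".
    tld = host.rsplit(".", 1)[-1] if "." in host else ""
    # The file name is the final path segment, percent-decoded.
    filename = unquote(parsed.path.rsplit("/", 1)[-1])
    # Split the decoded path into lowercase alphanumeric tokens.
    path = unquote(parsed.path).lower()
    tokens = [t for t in re.split(r"[^a-z0-9]+", path) if t]
    return {"host": host, "tld": tld, "filename": filename, "tokens": tokens}


# Hypothetical URL, for illustration only.
features = url_features(
    "https://math.example.edu/courses/linear-algebra/lecture_01.pdf"
)
print(features["tld"], features["tokens"])
```

Tokens such as "linear" and "algebra" can then be fed to a model alongside, or instead of, the document text.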
To generate labels for the PDFs, I followed a teacher-student approach inspired by the FineWeb-Edu project. Here’s a breakdown of the process:
Teacher Model (LLM):
Student Model:
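The teacher-student idea can be sketched in a few lines: an expensive teacher labels a small sample once, and a cheap student learns from those labels and then handles everything else. In the sketch below, `teacher_label` is a keyword stub standing in for a real LLM call, and `TokenVoteStudent` is a deliberately tiny stand-in for the real student model. Both names, the sample texts, and the two labels are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def teacher_label(text):
    """Stand-in for an LLM call; a real teacher would prompt a model."""
    if "matrix" in text or "algebra" in text:
        return "Math"
    return "Medicine"


class TokenVoteStudent:
    """Tiny stand-in student: each token votes for labels it was seen with."""

    def __init__(self):
        self.votes = defaultdict(Counter)  # token -> label counts
        self.prior = Counter()             # label counts, used as fallback

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.prior[label] += 1
            for token in set(text.lower().split()):
                self.votes[token][label] += 1
        return self

    def predict(self, text):
        tally = Counter()
        for token in set(text.lower().split()):
            tally.update(self.votes.get(token, {}))
        if not tally:
            # No known tokens: fall back to the most common training label.
            return self.prior.most_common(1)[0][0]
        return tally.most_common(1)[0][0]


# Step 1: the teacher labels a small sample (the expensive step, done once).
sample = [
    "linear algebra lecture notes on matrix rank",
    "matrix factorization and eigenvalues",
    "anatomy of the human heart",
    "clinical anatomy textbook chapter",
]
labels = [teacher_label(text) for text in sample]

# Step 2: the student trains on teacher labels, then labels unseen text.
student = TokenVoteStudent().fit(sample, labels)
print(student.predict("eigenvalues of a matrix"))
print(student.predict("heart anatomy atlas"))
```

The point of the split is cost: the teacher is only ever called on the sample, while the student is cheap enough to run over millions of documents.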
To handle the text content efficiently, I used embedding models to convert text into numerical vectors. These embeddings capture semantic meaning and are crucial for training machine learning models. Here’s a simplified architecture:

Text Preprocessing:
Embedding Generation:
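To make the text-to-vector step concrete, here is a minimal sketch that preprocesses text and maps it to a fixed-length, unit-norm vector. It uses the hashing trick over word tokens, which is a swapped-in stand-in and not a learned embedding model: it captures only token overlap, not the semantic meaning a real embedding model provides. The `preprocess`, `embed`, and `cosine` names and the vector size `DIM` are assumptions for illustration.

```python
import hashlib
import math
import re

DIM = 64  # assumed size; real embedding models use far more dimensions


def preprocess(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text, dim=DIM):
    """Map text to a fixed-length, unit-norm vector via the hashing trick."""
    vec = [0.0] * dim
    for token in preprocess(text):
        # A stable hash (unlike built-in hash()) keeps vectors reproducible.
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        vec[index] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    # Empty input has zero norm; return the zero vector unchanged.
    return [v / norm for v in vec] if norm else vec


def cosine(a, b):
    """Cosine similarity; equals the dot product for unit-norm vectors."""
    return sum(x * y for x, y in zip(a, b))


doc = embed("Linear algebra lecture notes")
print(len(doc), round(cosine(doc, doc), 6))
```

Whatever model produces the vectors, the downstream classifier only sees fixed-length numeric rows like these, which is what makes tree-based and linear models applicable.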
I experimented with several machine learning models to find the best classifier:
XGBoost:
Linear Regression:
The XGBoost model outperformed the Linear Regression model significantly. Here are some key metrics:
Accuracy:
Training Time:
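For readers who want to reproduce this kind of comparison, the two metrics can be measured with a pair of small helpers. This is a minimal sketch using only the standard library. The `accuracy`, `timed_fit`, and `majority_baseline` names are hypothetical, and the labels below are made up for illustration; they are not results from this project.

```python
import time


def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the reference labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def timed_fit(fit_fn, *args, **kwargs):
    """Run a training function and return (model, elapsed_seconds)."""
    start = time.perf_counter()
    model = fit_fn(*args, **kwargs)
    return model, time.perf_counter() - start


def majority_baseline(labels):
    """Trivial 'model': always predict the most frequent training label."""
    return max(set(labels), key=labels.count)


# Made-up labels for illustration only.
y_true = ["Math", "Medicine", "Math", "Law"]
y_pred = ["Math", "Medicine", "Law", "Law"]

print(f"accuracy: {accuracy(y_true, y_pred):.2f}")
baseline, seconds = timed_fit(majority_baseline, y_true)
print(f"baseline label: {baseline}, fit time: {seconds:.6f}s")
```

Passing each candidate's training function through the same `timed_fit` wrapper and scoring on the same held-out labels keeps the comparison between models like for like.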
To make the results more digestible, I created some visualizations using libraries like Matplotlib and Seaborn. These graphs helped in understanding the distribution of categories and the performance of different models.
Classifying 8.4 million PDFs is a challenging but rewarding task. By leveraging LLMs, embeddings, and machine learning techniques, we can extract valuable insights from vast datasets. The SafeDocs dataset provides an excellent opportunity for researchers and practitioners to explore and classify large volumes of unstructured data.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
20 August 2024