Classifying 8.4 Million PDFs Using LLMs, Embeddings, and XGBoost

The Engineer

20 Aug 2024

This project classifies more than 8.4 million PDFs from a Common Crawl snapshot using a combination of LLM-generated labels, embeddings, and XGBoost.

Classifying every PDF in a Common Crawl snapshot is no small feat, but the task combines interesting technical problems with practical applications. In this article, I’ll walk through how I approached classifying the SafeDocs dataset using a mix of Large Language Models (LLMs), embeddings, and machine learning techniques like XGBoost.

The Dataset: SafeDocs

The Common Crawl (CC) is a web archive that has been running since 2007, storing petabytes of data from the internet. However, when it comes to PDFs, CC truncates files to the first megabyte, which isn’t ideal for detailed analysis. Enter SafeDocs, also known as CC-MAIN-2021-31-PDF-UNTRUNCATED, a dataset created by the DARPA SafeDocuments program. The corpus was built by refetching all the PDFs from a Common Crawl snapshot to provide untruncated versions, resulting in 8.4 million PDFs totaling 8 TB when uncompressed.

Objective

The goal was to classify these PDFs into meaningful categories like "Math" for linear algebra documents or "Medicine" for anatomy textbooks. This classification can help researchers and practitioners quickly find relevant content without sifting through millions of files manually.

Dataset Generation

Given the sheer size of the corpus (8 TB), I needed a more manageable starting point. The metadata for SafeDocs is available as an 8 GB text file, which includes a crucial column: url. URLs provide valuable context about the origin and nature of the PDFs. For example, a URL from an educational institution is likely to point to academic content.
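
To make the idea of "context from a URL" concrete, here is a minimal sketch of how a URL could be split into coarse signals. The function name url_features, the returned keys, and the example URL are all hypothetical and not taken from the SafeDocs metadata or from the original pipeline.

```python
from urllib.parse import urlparse


def url_features(url: str) -> dict:
    """Split a URL into coarse signals that hint at the document's topic."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    # Last label of the hostname, e.g. "edu" for "math.example.edu".
    tld = host.rsplit(".", 1)[-1] if "." in host else ""
    # File name at the end of the path, e.g. "linear_algebra_notes.pdf".
    filename = parsed.path.rsplit("/", 1)[-1]
    return {"host": host, "tld": tld, "filename": filename}


# Hypothetical example URL, used only for illustration.
features = url_features("https://math.example.edu/courses/linear_algebra_notes.pdf")
print(features)
```

Signals like the top-level domain and the file name are cheap to compute for millions of rows, which is why they make a reasonable first input before touching the PDFs themselves.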

Training Pipeline

To generate labels for the PDFs, I followed a teacher-student approach inspired by the FineWeb-Edu project. Here’s a breakdown of the process:

Teacher Model (LLM):
- Used an LLM to generate initial labels from unstructured text.
- The LLM was fine-tuned on a smaller subset of labeled data to improve accuracy.
Student Model:
- Trained a smaller, more efficient model (the "student") using the labels generated by the LLM.
- This student model can then classify new PDFs based on the learned categories.
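
The teacher step can be sketched roughly as follows. This is an illustration under stated assumptions, not the prompt or label set actually used: CATEGORIES, build_prompt, parse_label, and label_with_teacher are hypothetical names, and fake_llm is a stand-in so the example runs without a model.

```python
CATEGORIES = ["Math", "Medicine", "Other"]  # hypothetical label set


def build_prompt(text: str, categories=CATEGORIES) -> str:
    """Build a single-label classification prompt for the teacher LLM."""
    options = ", ".join(categories)
    return (
        "Classify the PDF described below into exactly one category.\n"
        f"Categories: {options}\n"
        f"Input: {text}\n"
        "Answer with the category name only."
    )


def parse_label(response: str, categories=CATEGORIES, fallback="Other") -> str:
    """Map a free-form LLM reply onto a known category, else the fallback."""
    cleaned = response.strip().strip(".\"'").lower()
    for category in categories:
        if cleaned == category.lower():
            return category
    return fallback


def label_with_teacher(texts, ask_llm):
    """ask_llm is any callable (prompt: str) -> str wrapping the teacher model."""
    return [(text, parse_label(ask_llm(build_prompt(text)))) for text in texts]


# Stand-in teacher so the sketch runs offline; replace with a real LLM call.
def fake_llm(prompt: str) -> str:
    return "Math." if "algebra" in prompt else "unsure"


labeled = label_with_teacher(
    ["https://example.edu/linear_algebra.pdf", "https://example.com/brochure.pdf"],
    fake_llm,
)
print(labeled)
```

Constraining the reply to a fixed label set and falling back to a default keeps malformed LLM answers from leaking into the student's training data.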

Embedding Models

To handle the text content efficiently, I used embedding models to convert text into numerical vectors. These embeddings capture semantic meaning and are crucial for training machine learning models. Here’s a simplified architecture:

Text Preprocessing:
- Extracted relevant text from PDFs using libraries like PyPDF2.
- Cleaned and tokenized the text.
Embedding Generation:
- Used pre-trained embedding models (e.g., BERT) to generate embeddings for each document.
- These embeddings were then used as input features for the machine learning models.
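
The two steps above can be sketched as a cleaning function followed by an embedding call. The names clean_text and embed_texts, the 2000-character cutoff, and the choice of the all-MiniLM-L6-v2 model through the sentence-transformers package are assumptions made for illustration. The model is a small BERT-style sentence encoder swapped in for brevity, not necessarily the model used in this project.

```python
import re


def clean_text(raw: str, max_chars: int = 2000) -> str:
    """Drop control characters, collapse whitespace, and truncate."""
    no_control = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", " ", raw)
    collapsed = re.sub(r"\s+", " ", no_control).strip()
    return collapsed[:max_chars]


def embed_texts(texts, model_name="sentence-transformers/all-MiniLM-L6-v2"):
    """Return one vector per text. Requires the sentence-transformers package."""
    # Imported lazily so clean_text stays usable without the dependency.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    cleaned = [clean_text(t) for t in texts]
    # normalize_embeddings=True returns unit-length vectors.
    return model.encode(cleaned, batch_size=64, normalize_embeddings=True)


sample = clean_text("Linear\n\nAlgebra \t Lecture\x00Notes  ")
print(sample)
```

Truncating before embedding keeps the cost per document bounded, which matters when the same call has to run millions of times.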

Machine Learning Models

I experimented with several machine learning models to find the best classifier:

XGBoost:
- Chose XGBoost due to its efficiency and performance in handling large datasets.
- Trained an XGBoost model using the embeddings as features and the labels generated by the LLM.
Linear Regression:
- For fun, I also trained a Linear Regression model to see how it performed against more sophisticated models like XGBoost.
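
A minimal sketch of the XGBoost student, assuming the embeddings arrive as a list or array of vectors and the teacher labels as category names. The helper names and the hyperparameter values are illustrative assumptions, not the settings used for the results reported here.

```python
def encode_labels(labels):
    """Map category names to the integer ids 0..n-1 that XGBoost expects."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    return [index[name] for name in labels], classes


def train_student(embeddings, labels):
    """Fit a gradient-boosted classifier on (embedding, teacher label) pairs."""
    import numpy as np
    from xgboost import XGBClassifier  # requires the xgboost package

    y, classes = encode_labels(labels)
    model = XGBClassifier(
        n_estimators=200,    # illustrative values, not tuned
        max_depth=6,
        learning_rate=0.1,
        tree_method="hist",  # histogram-based tree construction
    )
    model.fit(np.asarray(embeddings), np.asarray(y))
    return model, classes


def predict_categories(model, classes, embeddings):
    """Translate predicted class ids back into category names."""
    import numpy as np

    ids = model.predict(np.asarray(embeddings))
    return [classes[int(i)] for i in ids]


ids, classes = encode_labels(["Math", "Medicine", "Math", "Other"])
print(ids, classes)
```

Keeping the class list next to the model makes it possible to turn predictions back into readable category names later.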

Results

The XGBoost model outperformed the Linear Regression model by roughly 15 percentage points of accuracy, at the cost of a much longer training time. Here are the key metrics:

Accuracy:
- XGBoost: ~90%
- Linear Regression: ~75%
Training Time:
- XGBoost: ~2 hours on a mid-range GPU
- Linear Regression: ~10 minutes on a CPU
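
For reference, accuracy figures like these can be computed as sketched below, assuming accuracy means agreement between the student's predictions and the teacher's labels on a held-out split. The function names and the toy label lists are made up for illustration and are not the project's data or results.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty split")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_recall(y_true, y_pred):
    """For each reference category, the share of its items predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Toy data only: five held-out items with teacher and student labels.
teacher = ["Math", "Math", "Medicine", "Other", "Other"]
student = ["Math", "Other", "Medicine", "Other", "Other"]
overall = accuracy(teacher, student)
by_class = per_class_recall(teacher, student)
print(overall, by_class)
```

Looking at per-class numbers alongside overall accuracy helps reveal categories that the student handles poorly even when the headline figure looks good.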

Visualizations

To make the results more digestible, I created some visualizations using libraries like Matplotlib and Seaborn. These graphs helped in understanding the distribution of categories and the performance of different models.

Conclusion

Classifying 8.4 million PDFs is a challenging but rewarding task. Combining LLM-generated labels, embeddings, and classical machine learning models such as XGBoost makes it practical to organize a corpus of this size by topic. The SafeDocs dataset provides an excellent opportunity for researchers and practitioners to explore and classify large volumes of unstructured data.