Embedding English Wikipedia in 15 Minutes with Modal and Hugging Face

The Engineer

25 Jan 2024 · 3 min read

This guide walks you through quickly setting up text embeddings for the English Wikipedia using Modal and Hugging Face, bypassing common hurdles like API rate limits and infrastructure headaches.

Text embeddings are a cornerstone of modern applications leveraging large language models (LLMs). They transform text into numerical vectors that capture semantic meaning, enabling tasks like search, recommendation, and Retrieval-Augmented Generation (RAG). While services like OpenAI’s text-embedding-ada-002 provide a convenient starting point, fine-tuning open-source models with your own data can yield higher-quality results at lower costs. However, scaling embedding jobs for large datasets is challenging due to rate limits, infrastructure complexity, and the difficulty of accessing multiple GPUs.
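
To make "vectors that capture semantic meaning" concrete, here is a minimal sketch of how embeddings are typically compared for search. The three-dimensional vectors are made-up placeholders, not output from any real model (real embeddings have hundreds of dimensions), and the helper names `cosine_similarity` and `rank_by_similarity` are ours.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vector, corpus):
    """Return corpus keys ordered from most to least similar to the query."""
    return sorted(
        corpus,
        key=lambda name: cosine_similarity(query_vector, corpus[name]),
        reverse=True,
    )


# Toy "embeddings": hypothetical values chosen for illustration only.
toy_corpus = {
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.3],
}
cat_vector = [0.9, 0.1, 0.0]

ranking = rank_by_similarity(cat_vector, toy_corpus)
print(ranking)
```

Semantic search over Wikipedia works the same way in principle: embed the query, then rank article vectors by similarity, usually with a vector index rather than a full sort.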

Enter Modal, a serverless platform that simplifies these challenges by enabling rapid scaling across many GPUs. In this article, we’ll walk through how to embed the entire English Wikipedia in just 15 minutes using Hugging Face’s Text Embeddings Inference (TEI) service on Modal. The total cost? Just over $15.

Why Open-Source Models?

Closed-source models are excellent for initial development, but open-source models offer two advantages in production:

Cost and Quality: Fine-tuning open-source models with your own data can lead to more accurate embeddings and lower costs.
Flexibility: You have full control over the model, allowing you to customize it according to your specific needs.

Introduction to Modal

Modal is a serverless platform designed for scaling compute-intensive workloads. It abstracts away infrastructure management, enabling you to focus on your code. Key features include:

Rapid Scaling: Automatically scales across multiple GPUs.
Cost Efficiency: Pay only for the resources you use.
Ease of Use: Simple API and robust documentation.

Step-by-Step Guide

1. Set Up Your Environment

First, ensure you have a Modal account and the necessary dependencies installed:

pip install modal huggingface_hub requests beautifulsoup4 lxml

2. Define the Embedding Function

Next, define a function to generate embeddings using Hugging Face’s Text Embeddings Inference service:

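import modal
from huggingface_hub import InferenceClient

# The app groups the functions that Modal deploys together
app = modal.App("wikipedia-embeddings")

# numpy is needed because feature_extraction returns a NumPy array
image = modal.Image.debian_slim().pip_install("huggingface_hub", "numpy")

@app.function(image=image, gpu="A10G")
def embed_text(texts):
    # Build the client inside the function so it is created in the container
    client = InferenceClient()
    model = "sentence-transformers/all-MiniLM-L6-v2"
    # Embed one text at a time and convert each array to a plain list
    return [client.feature_extraction(text, model=model).tolist() for text in texts]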


3. Prepare the Wikipedia Dataset

Download and preprocess the English Wikipedia dataset:

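```python
import bz2

import requests
from bs4 import BeautifulSoup

def fetch_wikipedia():
    url = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"
    # Stream the download to disk; the dump is many gigabytes in size
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        with open("enwiki-latest-pages-articles.xml.bz2", "wb") as file:
            for block in response.iter_content(chunk_size=1 << 20):
                file.write(block)

# Preprocess the dataset (simplified for brevity: this reads the whole
# decompressed dump into memory, so try it on a small sample first)
def preprocess_wikipedia(file_path):
    with bz2.open(file_path, "rb") as f:
        xml_content = f.read()
    # The "xml" parser requires lxml to be installed
    soup = BeautifulSoup(xml_content, "xml")
    # Each <text> element holds the wikitext of one page
    articles = [article.text for article in soup.find_all("text")]
    return articles
```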
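
Since the preprocessing step is simplified, a more memory-friendly variant may be useful for the full dump. The sketch below is one possible approach, not the method used for the timings in this article: it streams the compressed XML with the standard library's `iterparse` and yields one article at a time. The helper name `iter_article_texts` and the tiny sample document (including its namespace URI) are illustrative assumptions.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def iter_article_texts(stream):
    """Yield the contents of each <text> element from a MediaWiki-style XML stream."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        # Elements in the dump are namespaced, so compare only the local tag name.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "text":
            yield elem.text or ""
        elif local_name == "page":
            # Drop the finished page's children so memory use stays low.
            elem.clear()


# A tiny stand-in for the real dump (hypothetical sample data).
SAMPLE_XML = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>A</title><revision><text>Alpha article</text></revision></page>
  <page><title>B</title><revision><text>Beta article</text></revision></page>
</mediawiki>"""

compressed = bz2.compress(SAMPLE_XML)
with bz2.open(io.BytesIO(compressed), "rb") as sample_stream:
    sample_articles = list(iter_article_texts(sample_stream))

print(sample_articles)
```

For the real file, the same generator could be fed from `bz2.open("enwiki-latest-pages-articles.xml.bz2", "rb")`, so only one page is held in memory at a time.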

4. Run the Embedding Job

Finally, use Modal to run the embedding job:

import pickle
from itertools import chain

if __name__ == "__main__":
    fetch_wikipedia()
    articles = preprocess_wikipedia("enwiki-latest-pages-articles.xml.bz2")

    # Split the dataset into chunks for parallel processing
    chunk_size = 1000
    article_chunks = [articles[i:i + chunk_size] for i in range(0, len(articles), chunk_size)]

    # `app` is the modal.App that embed_text is registered on;
    # remote calls only work while the app is running
    with app.run():
        # Run embeddings in parallel, one remote call per chunk
        results = embed_text.map(article_chunks)

        # Combine the per-chunk results into one flat list
        all_embeddings = list(chain.from_iterable(results))

    # Save or further process the embeddings
    with open("wikipedia_embeddings.pkl", "wb") as file:
        pickle.dump(all_embeddings, file)
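
The chunking step needs every article in a list before it can slice. As a possible alternative, the sketch below builds chunks lazily from any iterable, which pairs naturally with a streaming reader. The helper name `batched_chunks` is hypothetical and not part of Modal or Hugging Face.

```python
from itertools import islice


def batched_chunks(items, chunk_size):
    """Yield lists of up to chunk_size items from any iterable, lazily."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return  # the source iterable is exhausted
        yield chunk


# Small demonstration with integers standing in for articles.
example_chunks = list(batched_chunks(range(5), 2))
print(example_chunks)
```

Because the chunks are produced on demand, the last chunk may be shorter than `chunk_size`, and an empty input simply yields nothing.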

Performance and Cost

Time: The entire process takes around 15 minutes.
Cost: The job costs just over $15, because you pay only for the GPU time the job actually uses.
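
These two figures imply an aggregate spend rate, which gives a rough sense of how many GPUs run at once. The sketch below works through that arithmetic. The per-GPU hourly price is a placeholder assumption, not a quoted Modal rate, so substitute the current price for your GPU type; the function names are ours.

```python
def implied_hourly_spend(total_cost_usd, minutes):
    """Aggregate spend rate in USD per hour for a job of the given cost and duration."""
    return total_cost_usd / (minutes / 60)


def implied_gpu_count(total_cost_usd, minutes, price_per_gpu_hour_usd):
    """Approximate number of GPUs running concurrently, ignoring CPU and memory charges."""
    return implied_hourly_spend(total_cost_usd, minutes) / price_per_gpu_hour_usd


# Placeholder price per GPU-hour: an assumption for illustration only.
ASSUMED_PRICE_PER_GPU_HOUR_USD = 1.00

hourly_spend = implied_hourly_spend(15, 15)
gpu_estimate = implied_gpu_count(15, 15, ASSUMED_PRICE_PER_GPU_HOUR_USD)
print(f"Spend rate: ${hourly_spend:.2f}/hour, about {gpu_estimate:.0f} GPUs at the assumed price")
```

A $15 job that finishes in 15 minutes spends at a rate of $60 per hour, so the estimated GPU count scales inversely with whatever hourly price you plug in.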

Conclusion

By leveraging Modal and Hugging Face’s Text Embedding Inference