Building a Transformer From Scratch: A Step-by-Step Guide with Jupyter Notebook

The Engineer

8 Jan 2024 · 3 min read

This guide demystifies the complex architecture of transformers, offering readers hands-on experience through interactive Jupyter notebook exercises that enhance understanding and retention.

In this article, we’ll walk through the process of creating and training a transformer model from scratch. We’ll break down each foundational element step by step, explaining what’s happening along the way. This guide is written in a Jupyter notebook, which you can download and use to run the code yourself as you follow along. Running the code and experimenting with it will help you grasp the concepts better than just reading.

Why This Matters

Transformers are a cornerstone of modern natural language processing (NLP). They have revolutionized tasks like text generation, translation, and sentiment analysis. By building one from scratch, you'll gain a deeper understanding of how these models work under the hood, which can be invaluable for optimizing them in real-world applications.

Getting Started

To get started, we need to download the Tiny Shakespeare dataset: a single text file of roughly 1.1 million characters taken from Shakespeare's works, which will serve as our training data. The leading ! tells Jupyter to run the line as a shell command.

!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

Let’s open the file and take a peek at its contents:

with open('input.txt') as f:
    text = f.read()
print('Length of input.txt (characters):', len(text))
print('First 500 characters:', text[:500])

This will output (truncated here):

Length of input.txt (characters): 1115394
First 500 characters: First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First C

Key Components of a Transformer

A transformer model consists of several key components:

Tokenization: Converting text into tokens (e.g., words or characters).
Embedding Layer: Mapping tokens to dense vectors.
Positional Encoding: Adding information about the position of each token in the sequence.
Self-Attention Mechanism: Letting each token weigh every other token in the sequence when computing its own representation.
Feed-Forward Neural Network: Transforming the self-attention output at each position independently.
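To see how these pieces connect, here is a rough sketch of the data flow through one simplified layer. It is not the notebook's implementation: it uses PyTorch built-ins as stand-ins (nn.MultiheadAttention for self-attention, and a learned position embedding in place of a sinusoidal encoding), it omits residual connections, layer normalization, and causal masking, and every size below is an illustrative assumption.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions, not values from the notebook).
vocab_size, embed_dim, seq_len, batch = 65, 32, 8, 2

# 1. Tokenization output: integer IDs, here random stand-ins.
token_ids = torch.randint(0, vocab_size, (batch, seq_len))

# 2. Embedding layer: ID -> dense vector.
tok_emb = nn.Embedding(vocab_size, embed_dim)

# 3. Positional information: a learned table used as a simple stand-in.
pos_emb = nn.Embedding(seq_len, embed_dim)

# 4. Self-attention: PyTorch's built-in module, batch-first layout.
attn = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)

# 5. Position-wise feed-forward network.
ffn = nn.Sequential(
    nn.Linear(embed_dim, 4 * embed_dim),
    nn.ReLU(),
    nn.Linear(4 * embed_dim, embed_dim),
)

x = tok_emb(token_ids) + pos_emb(torch.arange(seq_len))  # (batch, seq_len, embed_dim)
attn_out, _ = attn(x, x, x)                              # queries, keys, values all come from x
out = ffn(attn_out)

print(out.shape)  # torch.Size([2, 8, 32])
```

The shape stays (batch, seq_len, embed_dim) from the embedding onward, which is what lets layers like this be stacked.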

Tokenization

Tokenization is the first step. We need to convert our text into a format that can be processed by the transformer. For simplicity, we'll use character-level tokenization:

chars = sorted(list(set(text)))
vocab_size = len(chars)
print('Vocabulary size:', vocab_size)

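This collects the unique characters in the dataset into a sorted list and prints how many there are. That count is the vocabulary size, which determines how many rows the embedding table needs.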
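The snippet above only collects the vocabulary. To feed text to a model, each character also has to be mapped to an integer ID and back. A minimal sketch of that mapping follows; the names stoi, itos, encode, and decode are illustrative choices rather than names from the notebook, and the short sample_text stands in for the full text variable so the block runs on its own.

```python
# Stand-in for the `text` variable loaded from input.txt (assumption for a self-contained demo).
sample_text = "First Citizen:\nBefore we proceed any further, hear me speak."

chars = sorted(set(sample_text))
vocab_size = len(chars)

# Hypothetical helper names: character -> ID and ID -> character lookup tables.
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    """Convert a string into a list of integer token IDs."""
    return [stoi[ch] for ch in s]

def decode(ids):
    """Convert a list of integer token IDs back into a string."""
    return "".join(itos[i] for i in ids)

ids = encode("speak")
print(ids)          # five integers, one per character
print(decode(ids))  # speak
```

Because the IDs come from a sorted character list, the mapping is deterministic for a given text, and decode(encode(s)) returns s for any string made of known characters.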

Embedding Layer

Next, we create an embedding layer to map each character to a dense vector. The dimension of these vectors is a hyperparameter you can tune:

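import torch
import torch.nn as nn

class CharEmbedding(nn.Module):
    """Maps integer character IDs to learned dense vectors."""

    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        # Lookup table with one embed_dim-sized row per character in the vocabulary.
        self.embedding = nn.Embedding(vocab_size, embed_dim)

    def forward(self, x):
        # x: integer tensor of token IDs; the output gains a trailing embed_dim axis.
        return self.embedding(x)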
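As a quick check of what the embedding layer does to tensor shapes, the sketch below calls nn.Embedding directly on a small batch of token IDs. The vocabulary size, embedding dimension, and ID values are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # reproducible random initial weights

vocab_size = 65   # assumed value for illustration
embed_dim = 16    # hyperparameter: size of each character vector

embedding = nn.Embedding(vocab_size, embed_dim)

# A batch of 2 sequences, each 4 token IDs long (arbitrary IDs below vocab_size).
token_ids = torch.tensor([[0, 5, 12, 64],
                          [3, 3, 7, 1]])

vectors = embedding(token_ids)
print(vectors.shape)  # torch.Size([2, 4, 16])

# The same ID always maps to the same vector, wherever it appears in the sequence:
# positions 0 and 1 of the second row both hold ID 3.
print(torch.equal(vectors[1, 0], vectors[1, 1]))  # True
```

The second check shows that an embedding lookup by itself carries no information about where a token sits in the sequence.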

Positional Encoding

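Positional encoding is crucial for transformers because self-attention by itself is order-agnostic: without extra information, the model cannot tell which token came first. We add sinusoidal positional encodings to the embeddings so that each position receives a distinct, fixed pattern: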

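import math

class PositionalEncoding(nn.Module):
    """Adds fixed sinusoidal position encodings; expects (seq_len, batch, embed_dim) input."""

    def __init__(self, embed_dim, max_len=5000):
        super().__init__()
        # Assumes embed_dim is even so the sin and cos columns pair up.
        pe = torch.zeros(max_len, embed_dim)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        # Frequencies 1 / 10000^(2i / embed_dim), one per sin/cos column pair.
        div_term = torch.exp(torch.arange(0, embed_dim, 2).float() * (-math.log(10000.0) / embed_dim))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        # Shape (max_len, 1, embed_dim) so it broadcasts over the batch dimension.
        pe = pe.unsqueeze(0).transpose(0, 1)
        # A buffer moves with the module across devices but is not a trained parameter.
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: (seq_len, batch, embed_dim); slice the table to the input's sequence length.
        return x + self.pe[:x.size(0), :]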
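To make the formula easier to inspect, the sketch below builds the same sinusoidal table as a standalone function and checks two things: position 0 is all sin(0) = 0 and cos(0) = 1, and the table broadcasts across a sequence-first batch. The function name sinusoidal_table and the sizes used are illustrative assumptions, and the (seq_len, batch, embed_dim) layout is inferred from how the class above slices its buffer.

```python
import math
import torch

def sinusoidal_table(max_len, embed_dim):
    """Return a (max_len, embed_dim) table of sinusoidal position encodings.

    Assumes embed_dim is even so sin and cos columns pair up.
    """
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)   # (max_len, 1)
    div_term = torch.exp(
        torch.arange(0, embed_dim, 2).float() * (-math.log(10000.0) / embed_dim)
    )                                                                  # (embed_dim // 2,)
    pe = torch.zeros(max_len, embed_dim)
    pe[:, 0::2] = torch.sin(position * div_term)  # even columns
    pe[:, 1::2] = torch.cos(position * div_term)  # odd columns
    return pe

pe = sinusoidal_table(max_len=10, embed_dim=8)
print(pe.shape)  # torch.Size([10, 8])
print(pe[0])     # position 0: zeros in even columns, ones in odd columns

# Adding to a sequence-first batch of shape (seq_len, batch, embed_dim).
x = torch.zeros(5, 2, 8)
out = x + pe[:5].unsqueeze(1)  # (5, 1, 8) broadcasts across the batch dimension
print(out.shape)  # torch.Size([5, 2, 8])
```

If the embeddings are batch-first, that is (batch, seq_len, embed_dim), the input would need to be transposed first or the table sliced along the second axis; otherwise the encoding is indexed by batch position rather than token position.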

Self-Attention Mechanism

The self-attention mechanism is the