
This guide demystifies the complex architecture of transformers, offering readers hands-on experience through interactive Jupyter notebook exercises that enhance understanding and retention.
In this article, we’ll walk through the process of creating and training a transformer model from scratch. We’ll break down each foundational element step by step, explaining what’s happening along the way. This guide is written in a Jupyter notebook, which you can download and use to run the code yourself as you follow along. Running the code and experimenting with it will help you grasp the concepts better than just reading.
Transformers are a cornerstone of modern natural language processing (NLP). They have revolutionized tasks like text generation, translation, and sentiment analysis. By building one from scratch, you'll gain a deeper understanding of how these models work under the hood, which can be invaluable for optimizing them in real-world applications.
To get started, we need to download the Tiny Shakespeare dataset. It is a small subset of Shakespeare's works, about 1.1 million characters long, and will serve as our training data.
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Let’s open the file and take a peek at its contents:
with open('input.txt') as f:
    text = f.read()
print('Length of input.txt (characters):', len(text))
print('First 500 characters:', text[:500])
This will output:
Length of input.txt (characters): 1115394
First 500 characters: First Citizen:
Before we proceed any further, hear me speak.
All:
Speak, speak.
First C
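A transformer model consists of several key components. The sections below cover them in order, starting with tokenization, embeddings, and positional encoding before moving on to self-attention.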

Tokenization is the first step. We need to convert our text into a format that can be processed by the transformer. For simplicity, we'll use character-level tokenization:
chars = sorted(list(set(text)))
vocab_size = len(chars)
print('Vocabulary size:', vocab_size)
This will give us the unique characters in the dataset and their count.
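With the vocabulary in hand, each character can be mapped to an integer ID and back. The following is a minimal sketch, not necessarily the mapping used later in the notebook: the names `stoi`, `itos`, `encode`, and `decode` are assumptions introduced here, and a short sample string stands in for the full dataset so the block runs on its own.

```python
# Stand-in for the dataset text so this sketch is self-contained (assumption).
sample_text = "First Citizen:\nBefore we proceed any further, hear me speak."

chars = sorted(set(sample_text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer ID
itos = {i: ch for ch, i in stoi.items()}      # integer ID -> character

def encode(s):
    """Convert a string into a list of integer token IDs."""
    return [stoi[ch] for ch in s]

def decode(ids):
    """Convert a list of integer token IDs back into a string."""
    return "".join(itos[i] for i in ids)

ids = encode("hear me")
print(ids)
print(decode(ids))
```

Encoding followed by decoding should return the original string, which makes a quick sanity check before training. Characters that never appear in the text used to build the vocabulary would raise a `KeyError` in this sketch.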
Next, we create an embedding layer to map each character to a dense vector. The dimension of these vectors is a hyperparameter you can tune:
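import torch
import torch.nn as nn

class CharEmbedding(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        # Lookup table: one trainable embed_dim-sized vector per character ID.
        self.embedding = nn.Embedding(vocab_size, embed_dim)

    def forward(self, x):
        # x holds integer character IDs; the output adds a trailing embed_dim axis.
        return self.embedding(x)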
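Conceptually, an embedding layer is a lookup table: a weight matrix with one row per vocabulary entry, from which the rows for the input IDs are selected. The sketch below illustrates that idea in plain Python rather than PyTorch. `make_embedding_table` and `embed` are hypothetical helper names, the sizes are arbitrary, and the random initialization is only a placeholder for weights that would be learned during training.

```python
import random

def make_embedding_table(vocab_size, embed_dim, seed=0):
    """Build a vocab_size x embed_dim table of random floats (placeholder weights)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(embed_dim)]
            for _ in range(vocab_size)]

def embed(table, token_ids):
    """Look up one row per token ID, as an embedding layer does."""
    return [table[i] for i in token_ids]

# Arbitrary illustrative sizes (assumption): 10 tokens, 8-dimensional vectors.
table = make_embedding_table(vocab_size=10, embed_dim=8)
vectors = embed(table, [3, 7, 3])
print(len(vectors), len(vectors[0]))  # 3 vectors, each of length 8
```

The same ID always maps to the same vector, so the first and third results here are identical. `nn.Embedding` performs this lookup on tensors and makes the table trainable.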
Positional encoding is crucial for transformers since they lack a built-in mechanism to understand the order of tokens. We can add positional encodings to the embeddings:
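import math

class PositionalEncoding(nn.Module):
    def __init__(self, embed_dim, max_len=5000):
        super().__init__()
        # Precompute sinusoidal encodings for every position up to max_len.
        pe = torch.zeros(max_len, embed_dim)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, embed_dim, 2).float() * (-math.log(10000.0) / embed_dim))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions (assumes embed_dim is even)
        # Reshape to (max_len, 1, embed_dim) so it broadcasts over the batch dimension.
        pe = pe.unsqueeze(0).transpose(0, 1)
        # A buffer is saved with the model but is not a trainable parameter.
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x is expected to have shape (seq_len, batch_size, embed_dim).
        return x + self.pe[:x.size(0), :]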
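The encoding above follows the sinusoidal scheme: for position `pos` and even dimension index `k`, the value is `sin(pos / 10000^(k / embed_dim))`, and the odd dimension `k + 1` uses `cos` of the same angle. As a hedged illustration, the plain-Python sketch below computes the same values for a tiny configuration. The function name `sinusoidal_encoding` is an assumption and is not part of the notebook.

```python
import math

def sinusoidal_encoding(max_len, embed_dim):
    """Return a max_len x embed_dim list of sinusoidal positional encodings.

    Assumes embed_dim is even.
    """
    pe = [[0.0] * embed_dim for _ in range(max_len)]
    for pos in range(max_len):
        for k in range(0, embed_dim, 2):
            angle = pos / (10000.0 ** (k / embed_dim))
            pe[pos][k] = math.sin(angle)      # even dimension
            pe[pos][k + 1] = math.cos(angle)  # odd dimension
    return pe

pe = sinusoidal_encoding(max_len=4, embed_dim=4)
for row in pe:
    print([round(v, 4) for v in row])
```

Position 0 always encodes to alternating 0 and 1, since sin(0) = 0 and cos(0) = 1. Later dimensions use longer wavelengths, so each position receives a distinct pattern across the vector.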
The self-attention mechanism is the
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
8 January 2024